Streaming read-ready sequential scan code

Started by Melanie Plageman, almost 2 years ago, 52 messages
#1 Melanie Plageman
melanieplageman@gmail.com
4 attachment(s)

Hi,

Last year, David and I worked on a round of refactoring for
heapgettup() and heapgettup_pagemode() [1]. Now that the streaming
read API has been proposed [2], there is a bit more refactoring that
can be done on master to prepare sequential scan to support streaming
reads.

Patches 0001 and 0002 in the attached patchset do this new round of
refactoring. 0003 is the remainder of the streaming read API that is
not yet in master. 0004 is the sequential scan streaming read user.
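
To give a concrete picture of what a streaming read user looks like, here
is a minimal sketch of the consumer pattern, modeled on the pg_prewarm
conversion included in 0003 below. The names my_stream_state, my_next_block,
and my_scan_range are mine, for illustration only -- they are not part of
the patchset.

/*
 * Sketch of a streaming read consumer (assumes storage/streaming_read.h
 * and storage/bufmgr.h; illustrative only).
 */
struct my_stream_state
{
	BlockNumber next_block;
	BlockNumber last_block;
};

/* Callback: tell the streaming read machinery which block to read next. */
static BlockNumber
my_next_block(PgStreamingRead *pgsr, void *pgsr_private, void *per_buffer_data)
{
	struct my_stream_state *state = pgsr_private;

	if (state->next_block <= state->last_block)
		return state->next_block++;

	return InvalidBlockNumber;	/* no more blocks; the stream drains */
}

static void
my_scan_range(Relation rel, BlockNumber first, BlockNumber last)
{
	struct my_stream_state state;
	PgStreamingRead *pgsr;
	Buffer		buf;

	state.next_block = first;
	state.last_block = last;

	pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT, &state,
										  0,	/* no per-buffer data */
										  NULL, /* default strategy */
										  BMR_REL(rel), MAIN_FORKNUM,
										  my_next_block);

	/* Buffers come back pinned, having been read ahead where possible. */
	while ((buf = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
	{
		/* ... use the page in buf ... */
		ReleaseBuffer(buf);
	}

	pg_streaming_read_free(pgsr);
}

The relevant property for sequential scans is the consumer loop: the caller
keeps asking for buffers until it gets InvalidBuffer, which is what motivates
the control flow change described next.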

The primary change needed to drop in streaming read support is that
heapgettup() and heapgettup_pagemode() must treat running out of valid
buffers, rather than running out of valid BlockNumbers, as the signal
that the relation has been entirely processed. Naturally, streaming
reads prefetch ahead of the block currently being processed by the
scan, so all blocks will have been requested long before all blocks
have been processed.
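
Concretely, paraphrasing the 0002 hunks attached below, the loops in
heapgettup() and heapgettup_pagemode() go from iterating over block numbers
to iterating until no valid buffer comes back:

	/* before (with 0001 applied): driven by block numbers */
	while (block != InvalidBlockNumber)
	{
		heapfetchbuf((TableScanDesc) scan, block);
		/* ... return qualifying tuples from the page ... */
		block = heapgettup_advance_block(scan, block, dir);
	}

	/* after (with 0002): driven by buffers; heapfetchbuf() picks the block */
	while (true)
	{
		heapfetchbuf((TableScanDesc) scan, dir);
		if (!BufferIsValid(scan->rs_cbuf))
			break;				/* relation entirely processed */
		/* ... return qualifying tuples from the page ... */
	}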

To change this, I split up heapgetpage() into two functions -- one
responsible for getting blocks into buffers and the other for
processing a page (pruning, checking tuple visibility, etc). As a
consequence, I had to change the other caller of heapgetpage() (sample
scans). Since I was doing this anyway, I made a few changes there. It
is arguable that those changes could be split up differently between
0001 and 0004. However, I wanted 0004 to be *only* the sequential scan
streaming read user code.
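
As a rough sketch of the resulting division of labor (illustrative
declarations only; the real definitions are in the heapam.c hunks below):

	/*
	 * Get the scan's next block into a pinned buffer in scan->rs_cbuf;
	 * after 0002 it also chooses that block, leaving rs_cbuf invalid when
	 * the relation is exhausted.
	 */
	static inline void heapfetchbuf(TableScanDesc sscan, ScanDirection dir);

	/*
	 * Process the page in scan->rs_cbuf: prune it and, for pagemode scans,
	 * record the visible tuple offsets (rs_ntuples et al.).
	 */
	static void heapbuildvis(TableScanDesc sscan);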

There is an outstanding question about where to allocate the
PgStreamingRead object for sequential scans (see TODO in 0004).
However, I thought I would keep this thread focused on 0001 and 0002.

Though logically the performance with 0001 and 0002 should be the same
as master (no new non-inline function calls, no additional looping),
I've done a bit of profiling anyway. I created a large multi-GB table,
read it all into shared buffers (disabling the large sequential scan
bulkread optimization), and did a sequential SELECT count(*) from the
table. From the profiles below, you'll notice that master and the
patch are basically the same. Actual percentages vary from run to run.
Execution time is the same.

patch
15.49% postgres postgres [.] ExecInterpExpr
11.03% postgres postgres [.] heapgettup_pagemode
10.85% postgres postgres [.] ExecStoreBufferHeapTuple
9.14% postgres postgres [.] heap_getnextslot
8.39% postgres postgres [.] heapbuildvis
6.47% postgres postgres [.] SeqNext

master
14.16% postgres postgres [.] ExecInterpExpr
11.54% postgres postgres [.] heapgettup_pagemode
10.63% postgres postgres [.] ExecStoreBufferHeapTuple
10.22% postgres postgres [.] heap_getnextslot
8.53% postgres postgres [.] heapgetpage
5.35% postgres postgres [.] SeqNext

- Melanie

[1]: /messages/by-id/CAAKRu_YSOnhKsDyFcqJsKtBSrd32DP-jjXmv7hL0BPD-z0TGXQ@mail.gmail.com
[2]: /messages/by-id/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com

Attachments:

v1-0002-Replace-blocks-with-buffers-in-heapgettup-control.patch (text/x-patch)
From a3176b7a8fce30097d8452bbb6d89a5d87a0b764 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 29 Jan 2024 11:50:01 -0500
Subject: [PATCH v1 2/4] Replace blocks with buffers in heapgettup control flow

Future commits will introduce the streaming read API and the sequential
scan streaming read API user. Streaming read API users implement a
callback which returns the next block to read. Sequential scans
previously looped through the blocks in the relation, synchronously
reading in a block and then processing it. An InvalidBlockNumber
returned by heapgettup_advance_block() meant that the relation was
exhausted and all blocks had been processed.

The streaming read API may exhaust the blocks in a relation (having read
all of them into buffers) before they have all been processed by the
sequential scan. As such, the sequential scan should continue processing
blocks until heapfetchbuf() returns InvalidBuffer.

Note that this commit does not implement the streaming read API user. It
simply restructures heapgettup() and heapgettup_pagemode() to use
buffers instead of blocks for control flow.

Not all sequential scans will support streaming reads. As such, this
code will remain for compatibility even after sequential scans support
streaming reads.
---
 src/backend/access/heap/heapam.c | 75 ++++++++++++++------------------
 1 file changed, 33 insertions(+), 42 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 608d43c4efd..9e3e6d8b52b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -87,6 +87,9 @@ static Bitmapset *HeapDetermineColumnsInfo(Relation relation,
 static bool heap_acquire_tuplock(Relation relation, ItemPointer tid,
 								 LockTupleMode mode, LockWaitPolicy wait_policy,
 								 bool *have_tuple_lock);
+static inline BlockNumber heapgettup_advance_block(HeapScanDesc scan,
+												   BlockNumber block, ScanDirection dir);
+static inline BlockNumber heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir);
 static void compute_new_xmax_infomask(TransactionId xmax, uint16 old_infomask,
 									  uint16 old_infomask2, TransactionId add_to_xmax,
 									  LockTupleMode mode, bool is_update,
@@ -467,12 +470,10 @@ heapbuildvis(TableScanDesc sscan)
  * returns with that pinned buffer saved in the scan descriptor.
  */
 static inline void
-heapfetchbuf(TableScanDesc sscan, BlockNumber block)
+heapfetchbuf(TableScanDesc sscan, ScanDirection dir)
 {
 	HeapScanDesc scan = (HeapScanDesc) sscan;
 
-	Assert(block < scan->rs_nblocks);
-
 	/* release previous scan buffer, if any */
 	if (BufferIsValid(scan->rs_cbuf))
 	{
@@ -487,10 +488,19 @@ heapfetchbuf(TableScanDesc sscan, BlockNumber block)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	/* read page using selected strategy */
-	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
-									   RBM_NORMAL, scan->rs_strategy);
-	scan->rs_cblock = block;
+	if (!scan->rs_inited)
+	{
+		scan->rs_cblock = heapgettup_initial_block(scan, dir);
+		Assert(scan->rs_cblock != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock, dir);
+
+	/* read block if valid */
+	if (BlockNumberIsValid(scan->rs_cblock))
+		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
+										   scan->rs_cblock, RBM_NORMAL, scan->rs_strategy);
 }
 
 /*
@@ -500,7 +510,7 @@ heapfetchbuf(TableScanDesc sscan, BlockNumber block)
  * occur with empty tables and in parallel scans when parallel workers get all
  * of the pages before we can get a chance to get our first page.
  */
-static BlockNumber
+BlockNumber
 heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
 {
 	Assert(!scan->rs_inited);
@@ -640,7 +650,7 @@ heapgettup_continue_page(HeapScanDesc scan, ScanDirection dir, int *linesleft,
  * This also adjusts rs_numblocks when a limit has been imposed by
  * heap_setscanlimits().
  */
-static inline BlockNumber
+BlockNumber
 heapgettup_advance_block(HeapScanDesc scan, BlockNumber block, ScanDirection dir)
 {
 	if (ScanDirectionIsForward(dir))
@@ -738,23 +748,13 @@ heapgettup(HeapScanDesc scan,
 		   ScanKey key)
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
-	BlockNumber block;
 	Page		page;
 	OffsetNumber lineoff;
 	int			linesleft;
 
-	if (unlikely(!scan->rs_inited))
-	{
-		block = heapgettup_initial_block(scan, dir);
-		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
-		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
-		scan->rs_inited = true;
-	}
-	else
+	if (scan->rs_inited)
 	{
 		/* continue from previously returned page/tuple */
-		block = scan->rs_cblock;
-
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_continue_page(scan, dir, &linesleft, &lineoff);
 		goto continue_page;
@@ -764,9 +764,12 @@ heapgettup(HeapScanDesc scan,
 	 * advance the scan until we find a qualifying tuple or run out of stuff
 	 * to scan
 	 */
-	while (block != InvalidBlockNumber)
+	while (true)
 	{
-		heapfetchbuf((TableScanDesc) scan, block);
+		heapfetchbuf((TableScanDesc) scan, dir);
+		if (!BufferIsValid(scan->rs_cbuf))
+			break;
+		Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_start_page(scan, dir, &linesleft, &lineoff);
 continue_page:
@@ -788,7 +791,7 @@ continue_page:
 
 			tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
 			tuple->t_len = ItemIdGetLength(lpp);
-			ItemPointerSet(&(tuple->t_self), block, lineoff);
+			ItemPointerSet(&(tuple->t_self), scan->rs_cblock, lineoff);
 
 			visible = HeapTupleSatisfiesVisibility(tuple,
 												   scan->rs_base.rs_snapshot,
@@ -818,9 +821,6 @@ continue_page:
 		 * it's time to move to the next.
 		 */
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
 	}
 
 	/* end of scan */
@@ -853,22 +853,13 @@ heapgettup_pagemode(HeapScanDesc scan,
 					ScanKey key)
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
-	BlockNumber block;
 	Page		page;
 	int			lineindex;
 	int			linesleft;
 
-	if (unlikely(!scan->rs_inited))
-	{
-		block = heapgettup_initial_block(scan, dir);
-		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
-		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
-		scan->rs_inited = true;
-	}
-	else
+	if (scan->rs_inited)
 	{
 		/* continue from previously returned page/tuple */
-		block = scan->rs_cblock;	/* current page */
 		page = BufferGetPage(scan->rs_cbuf);
 
 		lineindex = scan->rs_cindex + dir;
@@ -885,9 +876,12 @@ heapgettup_pagemode(HeapScanDesc scan,
 	 * advance the scan until we find a qualifying tuple or run out of stuff
 	 * to scan
 	 */
-	while (block != InvalidBlockNumber)
+	while (true)
 	{
-		heapfetchbuf((TableScanDesc) scan, block);
+		heapfetchbuf((TableScanDesc) scan, dir);
+		if (!BufferIsValid(scan->rs_cbuf))
+			break;
+		Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
 		heapbuildvis((TableScanDesc) scan);
 		page = BufferGetPage(scan->rs_cbuf);
 		linesleft = scan->rs_ntuples;
@@ -907,7 +901,7 @@ continue_page:
 
 			tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
 			tuple->t_len = ItemIdGetLength(lpp);
-			ItemPointerSet(&(tuple->t_self), block, lineoff);
+			ItemPointerSet(&(tuple->t_self), scan->rs_cblock, lineoff);
 
 			/* skip any tuples that don't match the scan key */
 			if (key != NULL &&
@@ -918,9 +912,6 @@ continue_page:
 			scan->rs_cindex = lineindex;
 			return;
 		}
-
-		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
 	}
 
 	/* end of scan */
-- 
2.37.2

v1-0003-Streaming-Read-API.patch (text/x-patch)
From ee81c2f74d09b8883da322adf13fc8a2186a7094 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 22 Jul 2023 17:31:54 +1200
Subject: [PATCH v1 3/4] Streaming Read API

---
 contrib/pg_prewarm/pg_prewarm.c          |  40 +-
 src/backend/access/transam/xlogutils.c   |   2 +-
 src/backend/postmaster/bgwriter.c        |   8 +-
 src/backend/postmaster/checkpointer.c    |  15 +-
 src/backend/storage/Makefile             |   2 +-
 src/backend/storage/aio/Makefile         |  14 +
 src/backend/storage/aio/meson.build      |   5 +
 src/backend/storage/aio/streaming_read.c | 435 ++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c      | 560 +++++++++++++++--------
 src/backend/storage/buffer/localbuf.c    |  14 +-
 src/backend/storage/meson.build          |   1 +
 src/backend/storage/smgr/smgr.c          |  49 +-
 src/include/storage/bufmgr.h             |  22 +
 src/include/storage/smgr.h               |   4 +-
 src/include/storage/streaming_read.h     |  45 ++
 src/include/utils/rel.h                  |   6 -
 src/tools/pgindent/typedefs.list         |   2 +
 17 files changed, 986 insertions(+), 238 deletions(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c
 create mode 100644 src/include/storage/streaming_read.h

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e46..9617bf130bd 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
+#include "storage/streaming_read.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -38,6 +39,25 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_streaming_read_private
+{
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static BlockNumber
+pg_prewarm_streaming_read_next(PgStreamingRead *pgsr,
+							   void *pgsr_private,
+							   void *per_buffer_data)
+{
+	struct pg_prewarm_streaming_read_private *p = pgsr_private;
+
+	if (p->blocknum <= p->last_block)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_streaming_read_private p;
+		PgStreamingRead *pgsr;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+											  &p,
+											  0,
+											  NULL,
+											  BMR_REL(rel),
+											  forkNumber,
+											  pg_prewarm_streaming_read_next);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(pg_streaming_read_buffer_get_next(pgsr, NULL) == InvalidBuffer);
+		pg_streaming_read_free(pgsr);
 	}
 
 	/* Close relation, release lock. */
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index aa8667abd10..8775b5789be 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -657,7 +657,7 @@ XLogDropDatabase(Oid dbid)
 	 * This is unnecessarily heavy-handed, as it will close SMgrRelation
 	 * objects for other databases as well. DROP DATABASE occurs seldom enough
 	 * that it's not worth introducing a variant of smgrclose for just this
-	 * purpose. XXX: Or should we rather leave the smgr entries dangling?
+	 * purpose.
 	 */
 	smgrcloseall();
 
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index d7d6cc0cd7b..13e5376619e 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -246,10 +246,12 @@ BackgroundWriterMain(void)
 		if (FirstCallSinceLastCheckpoint())
 		{
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the bgwriter does
+			 * not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 		}
 
 		/*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5e949fc885b..5d843b61426 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -469,10 +469,12 @@ CheckpointerMain(void)
 				ckpt_performed = CreateRestartPoint(flags);
 
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the checkpointer
+			 * does not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 
 			/*
 			 * Indicate checkpoint completion to any waiting backends.
@@ -958,11 +960,8 @@ RequestCheckpoint(int flags)
 		 */
 		CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
 
-		/*
-		 * After any checkpoint, close all smgr files.  This is so we won't
-		 * hang onto smgr references to deleted files indefinitely.
-		 */
-		smgrcloseall();
+		/* Free all smgr objects, as CheckpointerMain() normally would. */
+		smgrdestroyall();
 
 		return;
 	}
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 00000000000..bcab44c802f
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 00000000000..39aef2a84a2
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 00000000000..19605090fea
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,435 @@
+#include "postgres.h"
+
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+
+/*
+ * Element type for PgStreamingRead's circular array of block ranges.
+ *
+ * For hits, need_complete is false and there is just one block per
+ * range, already pinned and ready for use.
+ *
+ * For misses, need_complete is true and buffers[] holds a range of
+ * blocks that are contiguous in storage (though the buffers may not be
+ * contiguous in memory), so we can complete them with a single call to
+ * CompleteReadBuffers().
+ */
+typedef struct PgStreamingReadRange
+{
+	bool		advice_issued;
+	bool		need_complete;
+	BlockNumber blocknum;
+	int			nblocks;
+	int			per_buffer_data_index[MAX_BUFFERS_PER_TRANSFER];
+	Buffer		buffers[MAX_BUFFERS_PER_TRANSFER];
+} PgStreamingReadRange;
+
+struct PgStreamingRead
+{
+	int			max_ios;
+	int			ios_in_progress;
+	int			ios_in_progress_trigger;
+	int			max_pinned_buffers;
+	int			pinned_buffers;
+	int			pinned_buffers_trigger;
+	int			next_tail_buffer;
+	bool		finished;
+	void	   *pgsr_private;
+	PgStreamingReadBufferCB callback;
+	BufferAccessStrategy strategy;
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+
+	bool		advice_enabled;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* Space for optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+	int			per_buffer_data_next;
+
+	/* Circular buffer of ranges. */
+	int			size;
+	int			head;
+	int			tail;
+	PgStreamingReadRange ranges[FLEXIBLE_ARRAY_MEMBER];
+};
+
+static PgStreamingRead *
+pg_streaming_read_buffer_alloc_internal(int flags,
+										void *pgsr_private,
+										size_t per_buffer_data_size,
+										BufferAccessStrategy strategy)
+{
+	PgStreamingRead *pgsr;
+	int			size;
+	int			max_ios;
+	uint32		max_pinned_buffers;
+
+
+	/*
+	 * Decide how many assumed I/Os we will allow to run concurrently.  That
+	 * is, advice to the kernel to tell it that we will soon read.  This
+	 * number also affects how far we look ahead for opportunities to start
+	 * more I/Os.
+	 */
+	if (flags & PGSR_FLAG_MAINTENANCE)
+		max_ios = maintenance_io_concurrency;
+	else
+		max_ios = effective_io_concurrency;
+
+	/*
+	 * The desired level of I/O concurrency controls how far we are willing
+	 * to look ahead.  We also clamp it to at least MAX_BUFFERS_PER_TRANSFER
+	 * so that we have a chance to build up a full-sized read, even when
+	 * max_ios is zero.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, MAX_BUFFERS_PER_TRANSFER);
+
+	/*
+	 * The *_io_concurrency GUCs might be 0.  We want to allow at least
+	 * one, to keep our gating logic simple.
+	 */
+	max_ios = Max(max_ios, 1);
+
+	/*
+	 * Don't allow this backend to pin too many buffers.  For now we'll apply
+	 * the limit for the shared buffer pool and the local buffer pool, without
+	 * worrying which it is.
+	 */
+	LimitAdditionalPins(&max_pinned_buffers);
+	LimitAdditionalLocalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * pgsr->ranges is a circular buffer.  When it is empty, head == tail.
+	 * When it is full, there is an empty element between head and tail.  Head
+	 * can also be empty (nblocks == 0), therefore we need two extra elements
+	 * for non-occupied ranges, on top of max_pinned_buffers to allow for the
+	 * maximum possible number of occupied ranges of the smallest possible
+	 * size of one.
+	 */
+	size = max_pinned_buffers + 2;
+
+	pgsr = (PgStreamingRead *)
+		palloc0(offsetof(PgStreamingRead, ranges) +
+				sizeof(pgsr->ranges[0]) * size);
+
+	pgsr->max_ios = max_ios;
+	pgsr->per_buffer_data_size = per_buffer_data_size;
+	pgsr->max_pinned_buffers = max_pinned_buffers;
+	pgsr->pgsr_private = pgsr_private;
+	pgsr->strategy = strategy;
+	pgsr->size = size;
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  As long as direct I/O isn't
+	 * enabled, and the caller hasn't promised sequential access, we can use
+	 * it.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & PGSR_FLAG_SEQUENTIAL) == 0)
+		pgsr->advice_enabled = true;
+#endif
+
+	/*
+	 * We want to avoid creating ranges that are smaller than they could be
+	 * just because we hit max_pinned_buffers.  We only look ahead when the
+	 * number of pinned buffers falls below this trigger number, or put
+	 * another way, we stop looking ahead when we wouldn't be able to build a
+	 * "full sized" range.
+	 */
+	pgsr->pinned_buffers_trigger =
+		Max(1, (int) max_pinned_buffers - MAX_BUFFERS_PER_TRANSFER);
+
+	/* Space for the callback to store extra data along with each block. */
+	if (per_buffer_data_size)
+		pgsr->per_buffer_data = palloc(per_buffer_data_size * max_pinned_buffers);
+
+	return pgsr;
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.
+ */
+PgStreamingRead *
+pg_streaming_read_buffer_alloc(int flags,
+							   void *pgsr_private,
+							   size_t per_buffer_data_size,
+							   BufferAccessStrategy strategy,
+							   BufferManagerRelation bmr,
+							   ForkNumber forknum,
+							   PgStreamingReadBufferCB next_block_cb)
+{
+	PgStreamingRead *result;
+
+	result = pg_streaming_read_buffer_alloc_internal(flags,
+													 pgsr_private,
+													 per_buffer_data_size,
+													 strategy);
+	result->callback = next_block_cb;
+	result->bmr = bmr;
+	result->forknum = forknum;
+
+	return result;
+}
+
+/*
+ * Start building a new range.  This is called after the previous one
+ * reached maximum size, or the callback's next block can't be merged with it.
+ *
+ * Since the previous head range has now reached its full potential size, this
+ * is also a good time to issue 'prefetch' advice, because we know we'll
+ * soon be reading.  In future, we could start an actual I/O here.
+ */
+static PgStreamingReadRange *
+pg_streaming_read_new_range(PgStreamingRead *pgsr)
+{
+	PgStreamingReadRange *head_range;
+
+	head_range = &pgsr->ranges[pgsr->head];
+	Assert(head_range->nblocks > 0);
+
+	/*
+	 * If a call to CompleteReadBuffers() will be needed, we can issue
+	 * advice to the kernel to get the read started.  We suppress it if the
+	 * access pattern appears to be completely sequential, though, because on
+	 * some systems that interferes with the kernel's own sequential read
+	 * ahead heuristics and hurts performance.
+	 */
+	if (pgsr->advice_enabled)
+	{
+		BlockNumber blocknum = head_range->blocknum;
+		int			nblocks = head_range->nblocks;
+
+		if (head_range->need_complete && blocknum != pgsr->seq_blocknum)
+		{
+			SMgrRelation smgr =
+				pgsr->bmr.smgr ? pgsr->bmr.smgr :
+				RelationGetSmgr(pgsr->bmr.rel);
+
+			Assert(!head_range->advice_issued);
+
+			smgrprefetch(smgr, pgsr->forknum, blocknum, nblocks);
+
+			/*
+			 * Count this as an I/O that is concurrently in progress, though
+			 * we don't really know if the kernel generates a physical I/O.
+			 */
+			head_range->advice_issued = true;
+			pgsr->ios_in_progress++;
+		}
+
+		/* Remember the block after this range, for sequence detection. */
+		pgsr->seq_blocknum = blocknum + nblocks;
+	}
+
+	/* Create a new head range.  There must be space. */
+	Assert(pgsr->size > pgsr->max_pinned_buffers);
+	Assert((pgsr->head + 1) % pgsr->size != pgsr->tail);
+	if (++pgsr->head == pgsr->size)
+		pgsr->head = 0;
+	head_range = &pgsr->ranges[pgsr->head];
+	head_range->nblocks = 0;
+
+	return head_range;
+}
+
+static void
+pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
+{
+	/*
+	 * If we're finished or can't start more I/O, then don't look ahead.
+	 */
+	if (pgsr->finished || pgsr->ios_in_progress == pgsr->max_ios)
+		return;
+
+	/*
+	 * We'll also wait until the number of pinned buffers falls below our
+	 * trigger level, so that we have the chance to create a full range.
+	 */
+	if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+		return;
+
+	do
+	{
+		BufferManagerRelation bmr;
+		ForkNumber	forknum;
+		BlockNumber blocknum;
+		Buffer		buffer;
+		bool		found;
+		bool		need_complete;
+		PgStreamingReadRange *head_range;
+		void	   *per_buffer_data;
+
+		/* Do we have a full-sized range? */
+		head_range = &pgsr->ranges[pgsr->head];
+		if (head_range->nblocks == lengthof(head_range->buffers))
+		{
+			Assert(head_range->need_complete);
+			head_range = pg_streaming_read_new_range(pgsr);
+
+			/*
+			 * Give up now if I/O is saturated, or we wouldn't be able to form
+			 * another full range after this due to the pin limit.
+			 */
+			if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger ||
+				pgsr->ios_in_progress == pgsr->max_ios)
+				break;
+		}
+
+		per_buffer_data = (char *) pgsr->per_buffer_data +
+			pgsr->per_buffer_data_size * pgsr->per_buffer_data_next;
+
+		/* Find out which block the callback wants to read next. */
+		blocknum = pgsr->callback(pgsr, pgsr->pgsr_private, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			pgsr->finished = true;
+			break;
+		}
+		bmr = pgsr->bmr;
+		forknum = pgsr->forknum;
+
+		Assert(pgsr->pinned_buffers < pgsr->max_pinned_buffers);
+
+		buffer = PrepareReadBuffer(bmr,
+								   forknum,
+								   blocknum,
+								   pgsr->strategy,
+								   &found);
+		pgsr->pinned_buffers++;
+
+		need_complete = !found;
+
+		/* Is there a head range that we can't extend? */
+		head_range = &pgsr->ranges[pgsr->head];
+		if (head_range->nblocks > 0 &&
+			(!need_complete ||
+			 !head_range->need_complete ||
+			 head_range->blocknum + head_range->nblocks != blocknum))
+		{
+			/* Yes, time to start building a new one. */
+			head_range = pg_streaming_read_new_range(pgsr);
+			Assert(head_range->nblocks == 0);
+		}
+
+		if (head_range->nblocks == 0)
+		{
+			/* Initialize a new range beginning at this block. */
+			head_range->blocknum = blocknum;
+			head_range->need_complete = need_complete;
+			head_range->advice_issued = false;
+		}
+		else
+		{
+			/* We can extend an existing range by one block. */
+			Assert(head_range->blocknum + head_range->nblocks == blocknum);
+			Assert(head_range->need_complete);
+		}
+
+		head_range->per_buffer_data_index[head_range->nblocks] = pgsr->per_buffer_data_next++;
+		head_range->buffers[head_range->nblocks] = buffer;
+		head_range->nblocks++;
+
+		if (pgsr->per_buffer_data_next == pgsr->max_pinned_buffers)
+			pgsr->per_buffer_data_next = 0;
+
+	} while (pgsr->pinned_buffers < pgsr->max_pinned_buffers &&
+			 pgsr->ios_in_progress < pgsr->max_ios);
+
+	if (pgsr->ranges[pgsr->head].nblocks > 0)
+		pg_streaming_read_new_range(pgsr);
+}
+
+Buffer
+pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
+{
+	pg_streaming_read_look_ahead(pgsr);
+
+	/* See if we have one buffer to return. */
+	while (pgsr->tail != pgsr->head)
+	{
+		PgStreamingReadRange *tail_range;
+
+		tail_range = &pgsr->ranges[pgsr->tail];
+
+		/*
+		 * Do we need to perform an I/O before returning the buffers from this
+		 * range?
+		 */
+		if (tail_range->need_complete)
+		{
+			CompleteReadBuffers(pgsr->bmr,
+								tail_range->buffers,
+								pgsr->forknum,
+								tail_range->blocknum,
+								tail_range->nblocks,
+								false,
+								pgsr->strategy);
+			tail_range->need_complete = false;
+
+			/*
+			 * We don't really know if the kernel generated a physical I/O
+			 * when we issued advice, let alone when it finished, but it has
+			 * certainly finished after a read call returns.
+			 */
+			if (tail_range->advice_issued)
+				pgsr->ios_in_progress--;
+		}
+
+		/* Are there more buffers available in this range? */
+		if (pgsr->next_tail_buffer < tail_range->nblocks)
+		{
+			int			buffer_index;
+			Buffer		buffer;
+
+			buffer_index = pgsr->next_tail_buffer++;
+			buffer = tail_range->buffers[buffer_index];
+
+			Assert(BufferIsValid(buffer));
+
+			/* We are giving away ownership of this pinned buffer. */
+			Assert(pgsr->pinned_buffers > 0);
+			pgsr->pinned_buffers--;
+
+			if (per_buffer_data)
+				*per_buffer_data = (char *) pgsr->per_buffer_data +
+					tail_range->per_buffer_data_index[buffer_index] *
+					pgsr->per_buffer_data_size;
+
+			return buffer;
+		}
+
+		/* Advance tail to next range, if there is one. */
+		if (++pgsr->tail == pgsr->size)
+			pgsr->tail = 0;
+		pgsr->next_tail_buffer = 0;
+	}
+
+	Assert(pgsr->pinned_buffers == 0);
+
+	return InvalidBuffer;
+}
+
+void
+pg_streaming_read_free(PgStreamingRead *pgsr)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead, and unpin anything that wasn't consumed. */
+	pgsr->finished = true;
+	while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	if (pgsr->per_buffer_data)
+		pfree(pgsr->per_buffer_data);
+	pfree(pgsr);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7d601bef6dd..2157a97b973 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -472,7 +472,7 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
 								ReadBufferMode mode, BufferAccessStrategy strategy,
 								bool *hit);
@@ -501,7 +501,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -795,15 +795,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
+	buf = ReadBuffer_common(BMR_REL(reln),
 							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+
 	return buf;
 }
 
@@ -827,8 +821,9 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 
 	SMgrRelation smgr = smgropen(rlocator, InvalidBackendId);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -1002,7 +997,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 		bool		hit;
 
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
+		buffer = ReadBuffer_common(bmr,
 								   fork, extend_to - 1, mode, strategy,
 								   &hit);
 	}
@@ -1016,18 +1011,11 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
 static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
 				  BufferAccessStrategy strategy, bool *hit)
 {
-	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
-	IOContext	io_context;
-	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
-
-	*hit = false;
+	Buffer		buffer;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1046,175 +1034,339 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		*hit = false;
+
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
+	buffer = PrepareReadBuffer(bmr,
+							   forkNum,
+							   blockNum,
+							   strategy,
+							   hit);
+
+	/* At this point we do NOT hold any locks. */
 
+	if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+	{
+		/* if we just want zeroes and a lock, we're done */
+		ZeroBuffer(buffer, mode);
+	}
+	else if (!*hit)
+	{
+		/* we might need to perform I/O */
+		CompleteReadBuffers(bmr,
+							&buffer,
+							forkNum,
+							blockNum,
+							1,
+							mode == RBM_ZERO_ON_ERROR,
+							strategy);
+	}
+
+	return buffer;
+}
+
+/*
+ * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
+ * the returned buffer can be used immediately.  Otherwise, a physical read
+ * should be completed with CompleteReadBuffers(), or the buffer should be
+ * zeroed with ZeroBuffer().  PrepareReadBuffer() followed by
+ * CompleteReadBuffers() or ZeroBuffer() is equivalent to ReadBuffer(), but
+ * the caller has the opportunity to combine reads of multiple neighboring
+ * blocks into one CompleteReadBuffers() call.
+ *
+ * *foundPtr is set to true for a hit, and false for a miss.
+ */
+Buffer
+PrepareReadBuffer(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	Assert(blockNum != P_NEW);
+
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
 	}
 	else
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
 		io_context = IOContextForStrategy(strategy);
 		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
 	}
 
-	/* At this point we do NOT hold any locks. */
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
 
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * CompleteReadBuffers() (so, not for hits, and not for buffers that
+		 * are zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
 		VacuumPageHit++;
 		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageHit;
 
 		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
+	}
 
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
-		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
-		}
+	return BufferDescriptorGetBuffer(bufHdr);
+}
 
-		return BufferDescriptorGetBuffer(bufHdr);
+static inline bool
+CompleteReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
+	{
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
 	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
 
-	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
-	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+/*
+ * Complete a set of reads prepared with PrepareReadBuffer().  The buffers must
+ * cover a cluster of neighboring block numbers.
+ *
+ * Typically this performs one physical vector read covering the block range,
+ * but if some of the buffers have already been read in the meantime by any
+ * backend, zero or multiple reads may be performed.
+ */
+void
+CompleteReadBuffers(BufferManagerRelation bmr,
+					Buffer *buffers,
+					ForkNumber forknum,
+					BlockNumber blocknum,
+					int nblocks,
+					bool zero_on_error,
+					BufferAccessStrategy strategy)
+{
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
 
 	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out not to be true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
 	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
 	else
+		pgBufferUsage.shared_blks_read += nblocks;
+
+	for (int i = 0; i < nblocks; ++i)
 	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_BUFFERS_PER_TRANSFER];
+		void	   *io_pages[MAX_BUFFERS_PER_TRANSFER];
+		instr_time	io_start;
+		BlockNumber io_first_block;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+#ifdef USE_ASSERT_CHECKING
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		/*
+		 * We could get all the information from buffer headers, but it can be
+		 * expensive to access buffer header cache lines so we make the caller
+		 * provide all the information we need, and assert that it is
+		 * consistent.
+		 */
+		{
+			RelFileLocator xlocator;
+			ForkNumber	xforknum;
+			BlockNumber xblocknum;
+
+			BufferGetTag(buffers[i], &xlocator, &xforknum, &xblocknum);
+			Assert(RelFileLocatorEquals(bmr.smgr->smgr_rlocator.locator, xlocator));
+			Assert(xforknum == forknum);
+			Assert(xblocknum == blocknum + i);
+		}
+#endif
+
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!CompleteReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PrepareReadBuffer().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+		/*
+		 * How many neighboring-on-disk blocks can we scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   CompleteReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
+
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
+
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
 			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
 			}
 			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-		}
-	}
-
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
-	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
-	}
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
 
-	if (isLocalBuf)
-	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if (zero_on_error || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+			}
 
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
-	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
-	}
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
@@ -1228,11 +1380,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.  A read should be
+ * performed with CompleteReadBuffers().
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1291,19 +1440,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called PrepareReadBuffer() but not yet CompleteReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1368,19 +1508,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called PrepareReadBuffer() but not yet CompleteReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1412,15 +1543,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1774,7 +1899,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2043,7 +2168,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2066,7 +2191,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2381,7 +2506,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if PrepareReadBuffer() was called and
+		 * CompleteReadBuffers() hasn't been called yet.  We'll check by
+		 * loading the flags without locking.  This is racy, but it's OK to
+		 * return false spuriously: when CompleteReadBuffers() calls
+		 * StartBufferIO(), it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2390,7 +2520,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3458,7 +3588,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -4845,6 +4975,46 @@ ConditionalLockBuffer(Buffer buffer)
 									LW_EXCLUSIVE);
 }
 
+/*
+ * Zero a buffer, and lock it as RBM_ZERO_AND_LOCK or
+ * RBM_ZERO_AND_CLEANUP_LOCK would.  The buffer must be already pinned.  It
+ * does not have to be valid, but it is valid and locked on return.
+ */
+void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
 /*
  * Verify that this backend is pinning the buffer exactly once.
  *
@@ -5197,9 +5367,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5212,6 +5388,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 1be4f4f8daf..717b8f58daf 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -109,10 +109,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -288,7 +287,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -298,9 +297,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but NLocBuffer might
+	 * not be initialized yet, so read num_temp_buffers instead.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 563a0be5c74..0d7272e796e 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -147,7 +147,9 @@ smgrshutdown(int code, Datum arg)
 /*
  * smgropen() -- Return an SMgrRelation object, creating it if need be.
  *
- * This does not attempt to actually open the underlying file.
+ * This does not attempt to actually open the underlying files.  The returned
+ * object remains valid at least until AtEOXact_SMgr() is called, or until
+ * smgrdestroy() is called in non-transaction backends.
  */
 SMgrRelation
 smgropen(RelFileLocator rlocator, BackendId backend)
@@ -259,10 +261,10 @@ smgrexists(SMgrRelation reln, ForkNumber forknum)
 }
 
 /*
- * smgrclose() -- Close and delete an SMgrRelation object.
+ * smgrdestroy() -- Delete an SMgrRelation object.
  */
 void
-smgrclose(SMgrRelation reln)
+smgrdestroy(SMgrRelation reln)
 {
 	SMgrRelation *owner;
 	ForkNumber	forknum;
@@ -289,12 +291,14 @@ smgrclose(SMgrRelation reln)
 }
 
 /*
- * smgrrelease() -- Release all resources used by this object.
+ * smgrclose() -- Release all resources used by this object.
  *
- * The object remains valid.
+ * The object remains valid, but is moved to the unowned list where it will
+ * be destroyed by AtEOXact_SMgr().  It may be re-owned if it is accessed by a
+ * relation before then.
  */
 void
-smgrrelease(SMgrRelation reln)
+smgrclose(SMgrRelation reln)
 {
 	for (ForkNumber forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 	{
@@ -302,15 +306,20 @@ smgrrelease(SMgrRelation reln)
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
 	}
 	reln->smgr_targblock = InvalidBlockNumber;
+
+	if (reln->smgr_owner)
+	{
+		*reln->smgr_owner = NULL;
+		reln->smgr_owner = NULL;
+		dlist_push_tail(&unowned_relns, &reln->node);
+	}
 }
 
 /*
- * smgrreleaseall() -- Release resources used by all objects.
- *
- * This is called for PROCSIGNAL_BARRIER_SMGRRELEASE.
+ * smgrcloseall() -- Close all objects.
  */
 void
-smgrreleaseall(void)
+smgrcloseall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -322,14 +331,17 @@ smgrreleaseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrrelease(reln);
+		smgrclose(reln);
 }
 
 /*
- * smgrcloseall() -- Close all existing SMgrRelation objects.
+ * smgrdestroyall() -- Destroy all SMgrRelation objects.
+ *
+ * It must be known that there are no pointers to SMgrRelations, other than
+ * those registered with smgrsetowner().
  */
 void
-smgrcloseall(void)
+smgrdestroyall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -341,7 +353,7 @@ smgrcloseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrclose(reln);
+		smgrdestroy(reln);
 }
 
 /*
@@ -733,7 +745,8 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
  * AtEOXact_SMgr
  *
  * This routine is called during transaction commit or abort (it doesn't
- * particularly care which).  All transient SMgrRelation objects are closed.
+ * particularly care which).  All transient SMgrRelation objects are
+ * destroyed.
  *
  * We do this as a compromise between wanting transient SMgrRelations to
  * live awhile (to amortize the costs of blind writes of multiple blocks)
@@ -747,7 +760,7 @@ AtEOXact_SMgr(void)
 	dlist_mutable_iter iter;
 
 	/*
-	 * Zap all unowned SMgrRelations.  We rely on smgrclose() to remove each
+	 * Zap all unowned SMgrRelations.  We rely on smgrdestroy() to remove each
 	 * one from the list.
 	 */
 	dlist_foreach_modify(iter, &unowned_relns)
@@ -757,7 +770,7 @@ AtEOXact_SMgr(void)
 
 		Assert(rel->smgr_owner == NULL);
 
-		smgrclose(rel);
+		smgrdestroy(rel);
 	}
 }
 
@@ -768,6 +781,6 @@ AtEOXact_SMgr(void)
 bool
 ProcessBarrierSmgrRelease(void)
 {
-	smgrreleaseall();
+	smgrcloseall();
 	return true;
 }
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..a38f1acb37a 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
+/*
+ * Maximum number of buffers for multi-buffer I/O functions.  This is set to
+ * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a smaller maximum.
+ */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
 
 /*
  * prototypes for functions in bufmgr.c
@@ -177,6 +183,18 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+extern Buffer PrepareReadBuffer(BufferManagerRelation bmr,
+								ForkNumber forkNum,
+								BlockNumber blockNum,
+								BufferAccessStrategy strategy,
+								bool *foundPtr);
+extern void CompleteReadBuffers(BufferManagerRelation bmr,
+								Buffer *buffers,
+								ForkNumber forknum,
+								BlockNumber blocknum,
+								int nblocks,
+								bool zero_on_error,
+								BufferAccessStrategy strategy);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -247,9 +265,13 @@ extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);
 extern bool IsBufferCleanupOK(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
+extern void ZeroBuffer(Buffer buffer, ReadBufferMode mode);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 527cd2a0568..d8ffe397faf 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -85,8 +85,8 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
 extern void smgrclose(SMgrRelation reln);
 extern void smgrcloseall(void);
 extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
-extern void smgrrelease(SMgrRelation reln);
-extern void smgrreleaseall(void);
+extern void smgrdestroy(SMgrRelation reln);
+extern void smgrdestroyall(void);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 00000000000..40c3408c541
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,45 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define PGSR_FLAG_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users.
+ */
+#define PGSR_FLAG_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define PGSR_FLAG_SEQUENTIAL 0x02
+
+struct PgStreamingRead;
+typedef struct PgStreamingRead PgStreamingRead;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*PgStreamingReadBufferCB) (PgStreamingRead *pgsr,
+												void *pgsr_private,
+												void *per_buffer_private);
+
+extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags,
+													   void *pgsr_private,
+													   size_t per_buffer_private_size,
+													   BufferAccessStrategy strategy,
+													   BufferManagerRelation bmr,
+													   ForkNumber forknum,
+													   PgStreamingReadBufferCB next_block_cb);
+
+extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
+extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
+extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+
+#endif
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index a584b1ddff3..6636cc82c09 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -561,12 +561,6 @@ typedef struct ViewOptions
  *
  * Very little code is authorized to touch rel->rd_smgr directly.  Instead
  * use this function to fetch its value.
- *
- * Note: since a relcache flush can cause the file handle to be closed again,
- * it's unwise to hold onto the pointer returned by this function for any
- * long period.  Recommended practice is to just re-execute RelationGetSmgr
- * each time you need to access the SMgrRelation.  It's quite cheap in
- * comparison to whatever an smgr function is going to do.
  */
 static inline SMgrRelation
 RelationGetSmgr(Relation rel)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 29fd1cae641..018ebbcbaae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2089,6 +2089,8 @@ PgStat_TableCounts
 PgStat_TableStatus
 PgStat_TableXactStatus
 PgStat_WalStats
+PgStreamingRead
+PgStreamingReadRange
 PgXmlErrorContext
 PgXmlStrictness
 Pg_finfo_record
-- 
2.37.2

v1-0004-Sequential-scans-support-streaming-read.patchtext/x-patch; charset=US-ASCII; name=v1-0004-Sequential-scans-support-streaming-read.patchDownload
From 1a87519763a0fa67433a0049dcb3f9f021bd5e11 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 29 Jan 2024 12:22:38 -0500
Subject: [PATCH v1 4/4] Sequential scans support streaming read

Add streaming read support for sequential scans. Do so by implementing
the streaming read callback to get the next block and save this block in
the scan descriptor.

The PgStreamingRead object is allocated in initscan(). This means it
will be freed and reallocated on rescan. Implementing a streaming read
reset function is a TODO for that API.

Currently, only forwards scans are supported by the streaming read API,
so if a scan switches from forwards to backwards, the PgStreamingRead
object will need to be freed. This also means that if a scan switches
from backwards to forwards, it will not use streaming reads.

Distinguishing between a scan that has yet to be initialized, one that
doesn't support streaming reads and one that has switched scan
directions is one reason why it is difficult to wait until
heapfetchbuf() to allocate the PgStreamingRead object.
---
 src/backend/access/heap/heapam.c | 106 ++++++++++++++++++++++++++++---
 src/include/access/heapam.h      |   3 +
 2 files changed, 100 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9e3e6d8b52b..cc20e0f972c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -65,6 +65,7 @@
 #include "storage/smgr.h"
 #include "storage/spin.h"
 #include "storage/standby.h"
+#include "storage/streaming_read.h"
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/lsyscache.h"
@@ -228,6 +229,27 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * ----------------------------------------------------------------
  */
 
+static BlockNumber
+heap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
+					  void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) pgsr_private;
+
+	/* Only forward scans support streaming reads */
+	if (!scan->rs_inited)
+	{
+		scan->rs_prefetch_block = heapgettup_initial_block(scan,
+														   ForwardScanDirection);
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_prefetch_block = heapgettup_advance_block(scan,
+														   scan->rs_prefetch_block,
+														   ForwardScanDirection);
+
+	return scan->rs_prefetch_block;
+}
+
 /* ----------------
  *		initscan - scan code common to heap_beginscan and heap_rescan
  * ----------------
@@ -345,6 +367,36 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	 */
 	if (scan->rs_base.rs_flags & SO_TYPE_SEQSCAN)
 		pgstat_count_heap_scan(scan->rs_base.rs_rd);
+
+	scan->rs_prefetch_block = InvalidBlockNumber;
+
+	/* pgsr is freed and reallocated on rescan */
+	if (scan->pgsr)
+		pg_streaming_read_free(scan->pgsr);
+	scan->pgsr = NULL;
+	scan->rs_prefetch_block = InvalidBlockNumber;
+
+	/*
+	 * This streaming read cannot be allocated in the per tuple memory context
+	 * which is the current memory context during heapgettup[_pagemode](), as
+	 * the per tuple context is often reset before the end of the query. There
+	 * was discussion of allocating the pgsr when rs_inited is false. We could
+	 * switch into a memory context that doesn't get reset to allocate it
+	 * there, but 1) we probably want to reuse the pgsr across rescans and 2)
+	 * we have to free the pgsr if the scan changes from forwards to a
+	 * backwards scan anyway, so we better just allocate it here.
+	 */
+	if (!RelationUsesLocalBuffers(scan->rs_base.rs_rd) &&
+		(scan->rs_base.rs_flags & SO_TYPE_SEQSCAN))
+	{
+		scan->pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_SEQUENTIAL,
+													scan,
+													0,
+													scan->rs_strategy,
+													BMR_REL(scan->rs_base.rs_rd),
+													MAIN_FORKNUM,
+													heap_pgsr_next_single);
+	}
 }
 
 /*
@@ -488,19 +540,41 @@ heapfetchbuf(TableScanDesc sscan, ScanDirection dir)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	if (!scan->rs_inited)
+	/*
+	 * Backwards scans aren't supported with streaming read. At the time of
+	 * allocation, the scan direction is not determined. Note that this means
+	 * that if the scan switches from backwards to forwards, the forward scan
+	 * will not use streaming reads.
+	 */
+	if (!ScanDirectionIsForward(dir) && scan->pgsr)
 	{
-		scan->rs_cblock = heapgettup_initial_block(scan, dir);
-		Assert(scan->rs_cblock != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
-		scan->rs_inited = true;
+		pg_streaming_read_free(scan->pgsr);
+		scan->pgsr = NULL;
+		scan->rs_prefetch_block = InvalidBlockNumber;
+	}
+
+	if (scan->pgsr)
+	{
+		scan->rs_cbuf = pg_streaming_read_buffer_get_next(scan->pgsr, NULL);
+		if (BufferIsValid(scan->rs_cbuf))
+			scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
 	}
 	else
-		scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock, dir);
+	{
+		if (!scan->rs_inited)
+		{
+			scan->rs_cblock = heapgettup_initial_block(scan, dir);
+			Assert(scan->rs_cblock != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
+			scan->rs_inited = true;
+		}
+		else
+			scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock, dir);
 
-	/* read block if valid */
-	if (BlockNumberIsValid(scan->rs_cblock))
-		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
-										   scan->rs_cblock, RBM_NORMAL, scan->rs_strategy);
+		/* read block if valid */
+		if (BlockNumberIsValid(scan->rs_cblock))
+			scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
+											   scan->rs_cblock, RBM_NORMAL, scan->rs_strategy);
+	}
 }
 
 /*
@@ -1001,6 +1075,15 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	else
 		scan->rs_parallelworkerdata = NULL;
 
+	/*
+	 * TODO: implement pg_streaming_read_reset(), then allocate the streaming
+	 * reads here. Currently, they are allocated in initscan() which will free
+	 * and reallocate the pgsr on each rescan. Fixing this is especially
+	 * important for nested loop join. For now, set this to NULL to ensure the
+	 * streaming read is allocated in initscan().
+	 */
+	scan->pgsr = NULL;
+
 	/*
 	 * we do this here instead of in initscan() because heap_rescan also calls
 	 * initscan() and we don't want to allocate memory again
@@ -1065,6 +1148,11 @@ heap_endscan(TableScanDesc sscan)
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
 
+	if (scan->pgsr)
+		pg_streaming_read_free(scan->pgsr);
+	scan->pgsr = NULL;
+	scan->rs_prefetch_block = InvalidBlockNumber;
+
 	/*
 	 * decrement relation reference count and free scan descriptor storage
 	 */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4a3a017c33a..8e702454367 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -59,6 +59,7 @@ typedef struct HeapScanDescData
 	bool		rs_inited;		/* false = scan not init'd yet */
 	OffsetNumber rs_coffset;	/* current offset # in non-page-at-a-time mode */
 	BlockNumber rs_cblock;		/* current block # in scan, if any */
+	BlockNumber rs_prefetch_block;	/* block being prefetched */
 	Buffer		rs_cbuf;		/* current buffer in scan, if any */
 	/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
 
@@ -72,6 +73,8 @@ typedef struct HeapScanDescData
 	 */
 	ParallelBlockTableScanWorkerData *rs_parallelworkerdata;
 
+	struct PgStreamingRead *pgsr;
+
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
 	int			rs_cindex;		/* current tuple's index in vistuples */
 	int			rs_ntuples;		/* number of visible tuples on page */
-- 
2.37.2

v1-0001-Split-heapgetpage-into-two-parts.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Split-heapgetpage-into-two-parts.patchDownload
From 9b8547083774007caf60a5bf1a538e4c94634690 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sat, 27 Jan 2024 18:39:37 -0500
Subject: [PATCH v1 1/4] Split heapgetpage into two parts

heapgetpage(), a per-block utility function used in heap scans, read a
passed-in block into a buffer and then, if page-at-a-time processing was
enabled, pruned the page and built an array of its visible tuples. This
was used for sequential and sample scans.

Future commits will add support for streaming reads. The streaming read
API will read in the blocks specified by a callback, but any significant
per-page processing should be done synchronously on the buffer yielded
by the streaming read API. To support this, separate the logic in
heapgetpage() to read a block into a buffer from that which prunes the
page and builds an array of the visible tuples. The former is now
heapfetchbuf() and the latter is now heapbuildvis().

Future commits will push the logic for selecting the next block into
heapfetchbuf() in cases when streaming reads are not supported (such as
backwards sequential scans). Because this logic differs for sample scans
and sequential scans, inline the code to read the block into a buffer
for sample scans.

This has the added benefit of allowing for a bit of refactoring in
heapam_scan_sample_next_block(), including unpinning the previous buffer
before invoking the callback to select the next block.
---
 src/backend/access/heap/heapam.c         | 76 ++++++++++++++----------
 src/backend/access/heap/heapam_handler.c | 40 +++++++++----
 src/include/access/heapam.h              |  2 +-
 3 files changed, 74 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 707460a5364..608d43c4efd 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -367,17 +367,18 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
 }
 
 /*
- * heapgetpage - subroutine for heapgettup()
+ * heapbuildvis - Utility function for heap scans.
  *
- * This routine reads and pins the specified page of the relation.
- * In page-at-a-time mode it performs additional work, namely determining
- * which tuples on the page are visible.
+ * Given a page residing in a buffer saved in the scan descriptor, prune the
+ * page and determine which of its tuples are visible, saving their offsets
+ * in an array in the scan descriptor.
  */
 void
-heapgetpage(TableScanDesc sscan, BlockNumber block)
+heapbuildvis(TableScanDesc sscan)
 {
 	HeapScanDesc scan = (HeapScanDesc) sscan;
-	Buffer		buffer;
+	Buffer		buffer = scan->rs_cbuf;
+	BlockNumber block = scan->rs_cblock;
 	Snapshot	snapshot;
 	Page		page;
 	int			lines;
@@ -385,31 +386,8 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	OffsetNumber lineoff;
 	bool		all_visible;
 
-	Assert(block < scan->rs_nblocks);
-
-	/* release previous scan buffer, if any */
-	if (BufferIsValid(scan->rs_cbuf))
-	{
-		ReleaseBuffer(scan->rs_cbuf);
-		scan->rs_cbuf = InvalidBuffer;
-	}
-
-	/*
-	 * Be sure to check for interrupts at least once per page.  Checks at
-	 * higher code levels won't be able to stop a seqscan that encounters many
-	 * pages' worth of consecutive dead tuples.
-	 */
-	CHECK_FOR_INTERRUPTS();
-
-	/* read page using selected strategy */
-	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
-									   RBM_NORMAL, scan->rs_strategy);
-	scan->rs_cblock = block;
-
-	if (!(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE))
-		return;
+	Assert(BufferGetBlockNumber(buffer) == block);
 
-	buffer = scan->rs_cbuf;
 	snapshot = scan->rs_base.rs_snapshot;
 
 	/*
@@ -482,6 +460,39 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	scan->rs_ntuples = ntup;
 }
 
+/*
+ * heapfetchbuf - subroutine for heapgettup()
+ *
+ * This routine reads the specified block of the relation into a buffer and
+ * returns with that pinned buffer saved in the scan descriptor.
+ */
+static inline void
+heapfetchbuf(TableScanDesc sscan, BlockNumber block)
+{
+	HeapScanDesc scan = (HeapScanDesc) sscan;
+
+	Assert(block < scan->rs_nblocks);
+
+	/* release previous scan buffer, if any */
+	if (BufferIsValid(scan->rs_cbuf))
+	{
+		ReleaseBuffer(scan->rs_cbuf);
+		scan->rs_cbuf = InvalidBuffer;
+	}
+
+	/*
+	 * Be sure to check for interrupts at least once per page.  Checks at
+	 * higher code levels won't be able to stop a seqscan that encounters many
+	 * pages' worth of consecutive dead tuples.
+	 */
+	CHECK_FOR_INTERRUPTS();
+
+	/* read page using selected strategy */
+	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
+									   RBM_NORMAL, scan->rs_strategy);
+	scan->rs_cblock = block;
+}
+
 /*
  * heapgettup_initial_block - return the first BlockNumber to scan
  *
@@ -755,7 +766,7 @@ heapgettup(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapfetchbuf((TableScanDesc) scan, block);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_start_page(scan, dir, &linesleft, &lineoff);
 continue_page:
@@ -876,7 +887,8 @@ heapgettup_pagemode(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapfetchbuf((TableScanDesc) scan, block);
+		heapbuildvis((TableScanDesc) scan);
 		page = BufferGetPage(scan->rs_cbuf);
 		linesleft = scan->rs_ntuples;
 		lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d15a02b2be7..75bdd341e7a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2278,11 +2278,14 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 	if (hscan->rs_nblocks == 0)
 		return false;
 
-	if (tsm->NextSampleBlock)
+	if (BufferIsValid(hscan->rs_cbuf))
 	{
-		blockno = tsm->NextSampleBlock(scanstate, hscan->rs_nblocks);
-		hscan->rs_cblock = blockno;
+		ReleaseBuffer(hscan->rs_cbuf);
+		hscan->rs_cbuf = InvalidBuffer;
 	}
+
+	if (tsm->NextSampleBlock)
+		blockno = tsm->NextSampleBlock(scanstate, hscan->rs_nblocks);
 	else
 	{
 		/* scanning table sequentially */
@@ -2324,20 +2327,35 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 		}
 	}
 
-	if (!BlockNumberIsValid(blockno))
+	hscan->rs_cblock = blockno;
+
+	if (!BlockNumberIsValid(hscan->rs_cblock))
 	{
-		if (BufferIsValid(hscan->rs_cbuf))
-			ReleaseBuffer(hscan->rs_cbuf);
-		hscan->rs_cbuf = InvalidBuffer;
-		hscan->rs_cblock = InvalidBlockNumber;
 		hscan->rs_inited = false;
-
 		return false;
 	}
 
-	heapgetpage(scan, blockno);
-	hscan->rs_inited = true;
+	Assert(hscan->rs_cblock < hscan->rs_nblocks);
+
+	/*
+	 * We may scan multiple pages before finding tuples to yield or finishing
+	 * the scan. Since we want to check for interrupts at least once per page,
+	 * do so here.
+	 */
+	CHECK_FOR_INTERRUPTS();
+
+	/* Read page using selected strategy */
+	hscan->rs_cbuf = ReadBufferExtended(hscan->rs_base.rs_rd, MAIN_FORKNUM,
+										hscan->rs_cblock, RBM_NORMAL, hscan->rs_strategy);
 
+	/*
+	 * If pagemode is allowed, prune the page and build an array of visible
+	 * tuple offsets.
+	 */
+	if (hscan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
+		heapbuildvis(scan);
+
+	hscan->rs_inited = true;
 	return true;
 }
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 932ec0d6f2b..4a3a017c33a 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -246,7 +246,7 @@ extern TableScanDesc heap_beginscan(Relation relation, Snapshot snapshot,
 									uint32 flags);
 extern void heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk,
 							   BlockNumber numBlks);
-extern void heapgetpage(TableScanDesc sscan, BlockNumber block);
+extern void heapbuildvis(TableScanDesc sscan);
 extern void heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 						bool allow_strat, bool allow_sync, bool allow_pagemode);
 extern void heap_endscan(TableScanDesc sscan);
-- 
2.37.2

#2David Rowley
dgrowleyml@gmail.com
In reply to: Melanie Plageman (#1)
Re: Streaming read-ready sequential scan code

On Tue, 30 Jan 2024 at 10:17, Melanie Plageman
<melanieplageman@gmail.com> wrote:

Though logically the performance with 0001 and 0002 should be the same
as master (no new non-inline function calls, no additional looping),
I've done a bit of profiling anyway. I created a large multi-GB table,
read it all into shared buffers (disabling the large sequential scan
bulkread optimization), and did a sequential SELECT count(*) from the
table. From the profiles below, you'll notice that master and the
patch are basically the same. Actual percentages vary from run-to-run.
Execution time is the same.

Can you also run a test on a Seqscan with a filter that filters out
all tuples? There's less overhead in other parts of the executor with
such a query.

David

#3Melanie Plageman
melanieplageman@gmail.com
In reply to: David Rowley (#2)
Re: Streaming read-ready sequential scan code

On Mon, Jan 29, 2024 at 4:24 PM David Rowley <dgrowleyml@gmail.com> wrote:

On Tue, 30 Jan 2024 at 10:17, Melanie Plageman
<melanieplageman@gmail.com> wrote:

Though logically the performance with 0001 and 0002 should be the same
as master (no new non-inline function calls, no additional looping),
I've done a bit of profiling anyway. I created a large multi-GB table,
read it all into shared buffers (disabling the large sequential scan
bulkread optimization), and did a sequential SELECT count(*) from the
table. From the profiles below, you'll notice that master and the
patch are basically the same. Actual percentages vary from run-to-run.
Execution time is the same.

Can you also run a test on a Seqscan with a filter that filters out
all tuples? There's less overhead in other parts of the executor with
such a query.

Yes, of course. Thank you so much for taking a look!

While I was at it, I changed the table schema to be entirely composed
of INT type columns and regenerated the data. Note that, both in this
example and my previous example, I ensured that the table was vacuumed
beforehand (and autovacuum disabled for the table) so there wasn't any
on-access pruning happening (heapgetpage() does that in pagemode).

This is the schema:
CREATE TABLE foo(id INT, a INT, b INT, c INT, d INT, e INT, f INT, g INT)
with (autovacuum_enabled=false);

I added 46000000 rows to the table, making it 2.6 GB. Shared buffers
is double that. Before profiling, I did a SELECT * from the table with
the large sequential scan bulkread optimization disabled. Then I
vacuumed the table. Finally, I turned up parallel_setup_cost high
enough to disable query parallelism.

The query I profiled was:
SELECT * FROM foo WHERE id = 0;
With the data I generated, 0 rows match that condition.

Profiles below. Execution time essentially the same.

patch:
17.08% postgres postgres [.] ExecInterpExpr
11.17% postgres postgres [.] tts_buffer_heap_getsomeattrs
10.64% postgres postgres [.] ExecStoreBufferHeapTuple
9.82% postgres postgres [.] heap_getnextslot
9.13% postgres postgres [.] heapgettup_pagemode
8.98% postgres postgres [.] heapbuildvis
5.40% postgres postgres [.] HeapCheckForSerializableConflictOut
5.16% postgres postgres [.] SeqNext

master:
17.89% postgres postgres [.] ExecInterpExpr
12.28% postgres postgres [.] tts_buffer_heap_getsomeattrs
10.54% postgres postgres [.] ExecStoreBufferHeapTuple
10.11% postgres postgres [.] heapgettup_pagemode
8.52% postgres postgres [.] heapgetpage
8.28% postgres postgres [.] heap_getnextslot
5.00% postgres postgres [.] HeapCheckForSerializableConflictOut
4.71% postgres postgres [.] SeqNext

- Melanie

#4Melanie Plageman
melanieplageman@gmail.com
In reply to: Melanie Plageman (#1)
Re: Streaming read-ready sequential scan code

On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

There is an outstanding question about where to allocate the
PgStreamingRead object for sequential scans

I've written three alternative implementations of the actual streaming
read user for sequential scan which handle the question of where to
allocate the streaming read object and how to handle changing scan
direction in different ways.

Option A) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation
- Allocates the streaming read object in initscan(). Since we do not
know the scan direction at this time, if the scan ends up not being a
forwards scan, the streaming read object must later be freed -- so
this will sometimes allocate a streaming read object it never uses.
- Only supports ForwardScanDirection and once the scan direction
changes, streaming is never supported again -- even if we return to
ForwardScanDirection
- Must maintain a "fallback" codepath that does not use the streaming read API

Option B) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_heapgettup_alloc_forward_only
- Allocates the streaming read object in heapgettup[_pagemode]() when
it has not been previously allocated. To do this it has to record and
switch into a different memory context than the per-tuple context. It
only allocates the streaming read object if it is a forwards scan. It
frees the streaming read object if the scan direction is later
changed.
- Only supports ForwardScanDirection and once the scan direction
changes, streaming is never supported again -- even if we return to
ForwardScanDirection
- Must maintain a "fallback" codepath that does not use the streaming read API

Option C) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_all_dir_stream
- Allocates the streaming read object in heapgettup[_pagemode]() when
it has not been previously allocated. To do this it has to record and
switch into a different memory context than the per-tuple context.
- All scan directions support streaming. To do this, the scan
direction has to be tracked and we must check if the direction has
changed on every heapgettup[_pagemode]() invocation to avoid returning
wrong results.
- No "fallback" codepath as all heap sequential scans will use the
streaming read API

As you can see, each option has pros and cons. I'm interested in what
others think about which we should choose.
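
To make the memory context point in options B and C concrete, the
allocation has to happen somewhere that outlives the per-tuple context,
roughly like this (a sketch, not taken from either branch; scan_cxt is a
placeholder for whatever context lives as long as the scan descriptor,
and the field names here are illustrative):

    /*
     * Sketch only: allocate the streaming read object lazily, outside the
     * per-tuple memory context.  scan_cxt is hypothetical.
     */
    if (scan->rs_pgsr == NULL && ScanDirectionIsForward(dir))
    {
        MemoryContext oldcxt = MemoryContextSwitchTo(scan_cxt);

        scan->rs_pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_SEQUENTIAL,
                                                       scan, 0,
                                                       scan->rs_strategy,
                                                       BMR_REL(scan->rs_base.rs_rd),
                                                       MAIN_FORKNUM,
                                                       heap_pgsr_next_single);
        MemoryContextSwitchTo(oldcxt);
    }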

- Melanie

#5Robert Haas
robertmhaas@gmail.com
In reply to: Melanie Plageman (#4)
Re: Streaming read-ready sequential scan code

On Tue, Feb 20, 2024 at 4:35 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I've written three alternative implementations of the actual streaming
read user for sequential scan which handle the question of where to
allocate the streaming read object and how to handle changing scan
direction in different ways.

It's weird to me that the prospect of changing the scan direction
causes such complexity. I mean, why doesn't a streaming read object
have a forget_all_my_previous_requests() method or somesuch?

--
Robert Haas
EDB: http://www.enterprisedb.com

#6Melanie Plageman
melanieplageman@gmail.com
In reply to: Robert Haas (#5)
Re: Streaming read-ready sequential scan code

On Tue, Feb 20, 2024 at 6:13 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Feb 20, 2024 at 4:35 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I've written three alternative implementations of the actual streaming
read user for sequential scan which handle the question of where to
allocate the streaming read object and how to handle changing scan
direction in different ways.

It's weird to me that the prospect of changing the scan direction
causes such complexity. I mean, why doesn't a streaming read object
have a forget_all_my_previous_requests() method or somesuch?

Basically, that is what pg_streaming_read_free() does. It goes through
and releases the buffers it had pinned and frees any per-buffer data
allocated.

The complexity with the sequential scan streaming read user and scan
direction is just that it has to detect when the scan direction
changes and do the releasing/freeing and reallocation. The scan
direction is passed to heapgettup[_pagemode](), so this is something
that can change on a tuple-to-tuple basis.

It is less that doing this is complicated and more that it is annoying
and distracting to have to check for and handle a very unimportant and
uncommon case in the main path of the common case.
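
For concreteness, the detection itself is only a few lines at the top of
heapfetchbuf() in the patches upthread -- roughly (a sketch, using the
field names from v1-0004):

    /*
     * Streaming reads only support forward scans.  If the direction has
     * changed, drop the streaming read object (which unpins its look-ahead
     * buffers and frees any per-buffer data) and fall back to the
     * non-streaming path for the rest of the scan.
     */
    if (!ScanDirectionIsForward(dir) && scan->pgsr)
    {
        pg_streaming_read_free(scan->pgsr);
        scan->pgsr = NULL;
        scan->rs_prefetch_block = InvalidBlockNumber;
    }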

- Melanie

#7Melanie Plageman
melanieplageman@gmail.com
In reply to: Melanie Plageman (#4)
5 attachment(s)
Re: Streaming read-ready sequential scan code

On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

There is an outstanding question about where to allocate the
PgStreamingRead object for sequential scans

I've written three alternative implementations of the actual streaming
read user for sequential scan which handle the question of where to
allocate the streaming read object and how to handle changing scan
direction in different ways.

Option A) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation
- Allocates the streaming read object in initscan(). Since we do not
know the scan direction at this time, if the scan ends up not being a
forwards scan, the streaming read object must later be freed -- so
this will sometimes allocate a streaming read object it never uses.
- Only supports ForwardScanDirection and once the scan direction
changes, streaming is never supported again -- even if we return to
ForwardScanDirection
- Must maintain a "fallback" codepath that does not use the streaming read API

Attached is a version of this patch which implements a "reset"
function for the streaming read API, which should be cheaper than the
full pg_streaming_read_free() on rescan. This can easily be ported to
work on any of my proposed implementations (A/B/C). I implemented it
on A as an example.
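
To illustrate the intended call pattern (a sketch only, not lifted from
the attached patches): with a reset function available, initscan() can
keep the streaming read object across rescans instead of freeing and
reallocating it, along these lines:

    /* Sketch: allocate the streaming read object once, reset it on rescan. */
    if (scan->rs_pgsr != NULL)
        pg_streaming_read_reset(scan->rs_pgsr); /* drops unconsumed pinned buffers */
    else if (!RelationUsesLocalBuffers(scan->rs_base.rs_rd) &&
             (scan->rs_base.rs_flags & SO_TYPE_SEQSCAN))
        scan->rs_pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_SEQUENTIAL,
                                                       scan, 0,
                                                       scan->rs_strategy,
                                                       BMR_REL(scan->rs_base.rs_rd),
                                                       MAIN_FORKNUM,
                                                       heap_pgsr_next_single);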

- Melanie

Attachments:

v2-0002-Replace-blocks-with-buffers-in-heapgettup-control.patchtext/x-patch; charset=US-ASCII; name=v2-0002-Replace-blocks-with-buffers-in-heapgettup-control.patchDownload
From ee93bf41b41aefd6fe20c41be15d2fd5e5d75181 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 29 Jan 2024 11:50:01 -0500
Subject: [PATCH v2 2/5] Replace blocks with buffers in heapgettup control flow

Future commits will introduce the streaming read API and the sequential
scan streaming read API user. Streaming read API users implement a
callback which returns the next block to read. Sequential scans
previously looped through the blocks in the relation, synchronously
reading in a block and then processing it. An InvalidBlockNumber
returned by heapgettup_advance_block() meant that the relation was
exhausted and all blocks had been processed.

The streaming read API may exhaust the blocks in a relation (having read
all of them into buffers) before they have all been processed by the
sequential scan. As such, the sequential scan should continue processing
blocks until heapfetchbuf() returns InvalidBuffer.

Note that this commit does not implement the streaming read API user. It
simply restructures heapgettup() and heapgettup_pagemode() to use
buffers instead of blocks for control flow.

Not all sequential scans will support streaming reads. As such, this
code will remain for compatibility even after sequential scans support
streaming reads.
---
 src/backend/access/heap/heapam.c | 79 ++++++++++++++------------------
 1 file changed, 35 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 449221da6ac..e0fe3d9c326 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -87,6 +87,9 @@ static Bitmapset *HeapDetermineColumnsInfo(Relation relation,
 static bool heap_acquire_tuplock(Relation relation, ItemPointer tid,
 								 LockTupleMode mode, LockWaitPolicy wait_policy,
 								 bool *have_tuple_lock);
+static inline BlockNumber heapgettup_advance_block(HeapScanDesc scan,
+												   BlockNumber block, ScanDirection dir);
+static inline BlockNumber heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir);
 static void compute_new_xmax_infomask(TransactionId xmax, uint16 old_infomask,
 									  uint16 old_infomask2, TransactionId add_to_xmax,
 									  LockTupleMode mode, bool is_update,
@@ -463,14 +466,12 @@ heapbuildvis(TableScanDesc sscan)
 /*
  * heapfetchbuf - subroutine for heapgettup()
  *
- * This routine reads the specified block of the relation into a buffer and
- * returns with that pinned buffer saved in the scan descriptor.
+ * This routine reads the next block of the relation into a buffer and returns
+ * with that pinned buffer saved in the scan descriptor.
  */
 static inline void
-heapfetchbuf(HeapScanDesc scan, BlockNumber block)
+heapfetchbuf(HeapScanDesc scan, ScanDirection dir)
 {
-	Assert(block < scan->rs_nblocks);
-
 	/* release previous scan buffer, if any */
 	if (BufferIsValid(scan->rs_cbuf))
 	{
@@ -485,10 +486,19 @@ heapfetchbuf(HeapScanDesc scan, BlockNumber block)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	/* read page using selected strategy */
-	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
-									   RBM_NORMAL, scan->rs_strategy);
-	scan->rs_cblock = block;
+	if (!scan->rs_inited)
+	{
+		scan->rs_cblock = heapgettup_initial_block(scan, dir);
+		Assert(scan->rs_cblock != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock, dir);
+
+	/* read block if valid */
+	if (BlockNumberIsValid(scan->rs_cblock))
+		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
+										   scan->rs_cblock, RBM_NORMAL, scan->rs_strategy);
 }
 
 /*
@@ -498,7 +508,7 @@ heapfetchbuf(HeapScanDesc scan, BlockNumber block)
  * occur with empty tables and in parallel scans when parallel workers get all
  * of the pages before we can get a chance to get our first page.
  */
-static BlockNumber
+BlockNumber
 heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
 {
 	Assert(!scan->rs_inited);
@@ -638,7 +648,7 @@ heapgettup_continue_page(HeapScanDesc scan, ScanDirection dir, int *linesleft,
  * This also adjusts rs_numblocks when a limit has been imposed by
  * heap_setscanlimits().
  */
-static inline BlockNumber
+BlockNumber
 heapgettup_advance_block(HeapScanDesc scan, BlockNumber block, ScanDirection dir)
 {
 	if (ScanDirectionIsForward(dir))
@@ -736,23 +746,13 @@ heapgettup(HeapScanDesc scan,
 		   ScanKey key)
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
-	BlockNumber block;
 	Page		page;
 	OffsetNumber lineoff;
 	int			linesleft;
 
-	if (unlikely(!scan->rs_inited))
-	{
-		block = heapgettup_initial_block(scan, dir);
-		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
-		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
-		scan->rs_inited = true;
-	}
-	else
+	if (scan->rs_inited)
 	{
 		/* continue from previously returned page/tuple */
-		block = scan->rs_cblock;
-
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_continue_page(scan, dir, &linesleft, &lineoff);
 		goto continue_page;
@@ -762,9 +762,12 @@ heapgettup(HeapScanDesc scan,
 	 * advance the scan until we find a qualifying tuple or run out of stuff
 	 * to scan
 	 */
-	while (block != InvalidBlockNumber)
+	while (true)
 	{
-		heapfetchbuf(scan, block);
+		heapfetchbuf(scan, dir);
+		if (!BufferIsValid(scan->rs_cbuf))
+			break;
+		Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_start_page(scan, dir, &linesleft, &lineoff);
 continue_page:
@@ -786,7 +789,7 @@ continue_page:
 
 			tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
 			tuple->t_len = ItemIdGetLength(lpp);
-			ItemPointerSet(&(tuple->t_self), block, lineoff);
+			ItemPointerSet(&(tuple->t_self), scan->rs_cblock, lineoff);
 
 			visible = HeapTupleSatisfiesVisibility(tuple,
 												   scan->rs_base.rs_snapshot,
@@ -816,9 +819,6 @@ continue_page:
 		 * it's time to move to the next.
 		 */
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
 	}
 
 	/* end of scan */
@@ -851,22 +851,13 @@ heapgettup_pagemode(HeapScanDesc scan,
 					ScanKey key)
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
-	BlockNumber block;
 	Page		page;
 	int			lineindex;
 	int			linesleft;
 
-	if (unlikely(!scan->rs_inited))
-	{
-		block = heapgettup_initial_block(scan, dir);
-		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
-		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
-		scan->rs_inited = true;
-	}
-	else
+	if (scan->rs_inited)
 	{
 		/* continue from previously returned page/tuple */
-		block = scan->rs_cblock;	/* current page */
 		page = BufferGetPage(scan->rs_cbuf);
 
 		lineindex = scan->rs_cindex + dir;
@@ -883,9 +874,12 @@ heapgettup_pagemode(HeapScanDesc scan,
 	 * advance the scan until we find a qualifying tuple or run out of stuff
 	 * to scan
 	 */
-	while (block != InvalidBlockNumber)
+	while (true)
 	{
-		heapfetchbuf(scan, block);
+		heapfetchbuf(scan, dir);
+		if (!BufferIsValid(scan->rs_cbuf))
+			break;
+		Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
 		heapbuildvis((TableScanDesc) scan);
 		page = BufferGetPage(scan->rs_cbuf);
 		linesleft = scan->rs_ntuples;
@@ -905,7 +899,7 @@ continue_page:
 
 			tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
 			tuple->t_len = ItemIdGetLength(lpp);
-			ItemPointerSet(&(tuple->t_self), block, lineoff);
+			ItemPointerSet(&(tuple->t_self), scan->rs_cblock, lineoff);
 
 			/* skip any tuples that don't match the scan key */
 			if (key != NULL &&
@@ -916,9 +910,6 @@ continue_page:
 			scan->rs_cindex = lineindex;
 			return;
 		}
-
-		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
 	}
 
 	/* end of scan */
-- 
2.37.2

v2-0004-Add-pg_streaming_read_reset.patchtext/x-patch; charset=US-ASCII; name=v2-0004-Add-pg_streaming_read_reset.patchDownload
From 38cc779724b92514a1b63ba27057fe438b06df29 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 26 Feb 2024 15:33:39 -0500
Subject: [PATCH v2 4/5] Add pg_streaming_read_reset

For rescan, we want to reuse the streaming read object and simply
release the buffers that were pinned by the streaming read
infrastructure.
---
 src/backend/storage/aio/streaming_read.c | 18 ++++++++++++++++++
 src/include/storage/streaming_read.h     |  1 +
 2 files changed, 19 insertions(+)

diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
index 19605090fea..4479f101abe 100644
--- a/src/backend/storage/aio/streaming_read.c
+++ b/src/backend/storage/aio/streaming_read.c
@@ -433,3 +433,21 @@ pg_streaming_read_free(PgStreamingRead *pgsr)
 		pfree(pgsr->per_buffer_data);
 	pfree(pgsr);
 }
+
+
+/*
+ * Reset a streaming read object by releasing all of the buffers. Note that
+ * max_ios is not recalculated, so any changes to maintenance_io_concurrency and
+ * effective_io_concurrency will have no effect.
+ */
+void
+pg_streaming_read_reset(PgStreamingRead *pgsr)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead, and unpin anything that wasn't consumed. */
+	pgsr->finished = true;
+	while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+	pgsr->finished = false;
+}
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
index 40c3408c541..1acfab20ff5 100644
--- a/src/include/storage/streaming_read.h
+++ b/src/include/storage/streaming_read.h
@@ -41,5 +41,6 @@ extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags,
 extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
 extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
 extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+extern void pg_streaming_read_reset(PgStreamingRead *pgsr);
 
 #endif
-- 
2.37.2

v2-0001-Split-heapgetpage-into-two-parts.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Split-heapgetpage-into-two-parts.patchDownload
From 77de7d5a2560ec4c9ddad4fd7bd1e0783cf81729 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sat, 27 Jan 2024 18:39:37 -0500
Subject: [PATCH v2 1/5] Split heapgetpage into two parts

heapgetpage(), a per-block utility function used in heap scans, read a
passed-in block into a buffer and then, if page-at-a-time processing was
enabled, pruned the page and built an array of its visible tuples. This
was used for sequential and sample scans.

Future commits will add support for streaming reads. The streaming read
API will read in the blocks specified by a callback, but any significant
per-page processing should be done synchronously on the buffer yielded
by the streaming read API. To support this, separate the logic in
heapgetpage() to read a block into a buffer from that which prunes the
page and builds an array of the visible tuples. The former is now
heapfetchbuf() and the latter is now heapbuildvis().

Future commits will push the logic for selecting the next block into
heapfetchbuf() in cases when streaming reads are not supported (such as
backwards sequential scans). Because this logic differs for sample scans
and sequential scans, inline the code to read the block into a buffer
for sample scans.

This has the added benefit of allowing for a bit of refactoring in
heapam_scan_sample_next_block(), including unpinning the previous buffer
before invoking the callback to select the next block.
---
 src/backend/access/heap/heapam.c         | 74 ++++++++++++++----------
 src/backend/access/heap/heapam_handler.c | 40 +++++++++----
 src/include/access/heapam.h              |  2 +-
 3 files changed, 72 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 707460a5364..449221da6ac 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -367,17 +367,18 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
 }
 
 /*
- * heapgetpage - subroutine for heapgettup()
+ * heapbuildvis - Utility function for heap scans.
  *
- * This routine reads and pins the specified page of the relation.
- * In page-at-a-time mode it performs additional work, namely determining
- * which tuples on the page are visible.
+ * Given a page residing in a buffer saved in the scan descriptor, prune the
+ * page and determine which of its tuples are visible, saving their offsets
+ * in an array in the scan descriptor.
  */
 void
-heapgetpage(TableScanDesc sscan, BlockNumber block)
+heapbuildvis(TableScanDesc sscan)
 {
 	HeapScanDesc scan = (HeapScanDesc) sscan;
-	Buffer		buffer;
+	Buffer		buffer = scan->rs_cbuf;
+	BlockNumber block = scan->rs_cblock;
 	Snapshot	snapshot;
 	Page		page;
 	int			lines;
@@ -385,31 +386,8 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	OffsetNumber lineoff;
 	bool		all_visible;
 
-	Assert(block < scan->rs_nblocks);
+	Assert(BufferGetBlockNumber(buffer) == block);
 
-	/* release previous scan buffer, if any */
-	if (BufferIsValid(scan->rs_cbuf))
-	{
-		ReleaseBuffer(scan->rs_cbuf);
-		scan->rs_cbuf = InvalidBuffer;
-	}
-
-	/*
-	 * Be sure to check for interrupts at least once per page.  Checks at
-	 * higher code levels won't be able to stop a seqscan that encounters many
-	 * pages' worth of consecutive dead tuples.
-	 */
-	CHECK_FOR_INTERRUPTS();
-
-	/* read page using selected strategy */
-	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
-									   RBM_NORMAL, scan->rs_strategy);
-	scan->rs_cblock = block;
-
-	if (!(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE))
-		return;
-
-	buffer = scan->rs_cbuf;
 	snapshot = scan->rs_base.rs_snapshot;
 
 	/*
@@ -482,6 +460,37 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	scan->rs_ntuples = ntup;
 }
 
+/*
+ * heapfetchbuf - subroutine for heapgettup()
+ *
+ * This routine reads the specified block of the relation into a buffer and
+ * returns with that pinned buffer saved in the scan descriptor.
+ */
+static inline void
+heapfetchbuf(HeapScanDesc scan, BlockNumber block)
+{
+	Assert(block < scan->rs_nblocks);
+
+	/* release previous scan buffer, if any */
+	if (BufferIsValid(scan->rs_cbuf))
+	{
+		ReleaseBuffer(scan->rs_cbuf);
+		scan->rs_cbuf = InvalidBuffer;
+	}
+
+	/*
+	 * Be sure to check for interrupts at least once per page.  Checks at
+	 * higher code levels won't be able to stop a seqscan that encounters many
+	 * pages' worth of consecutive dead tuples.
+	 */
+	CHECK_FOR_INTERRUPTS();
+
+	/* read page using selected strategy */
+	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
+									   RBM_NORMAL, scan->rs_strategy);
+	scan->rs_cblock = block;
+}
+
 /*
  * heapgettup_initial_block - return the first BlockNumber to scan
  *
@@ -755,7 +764,7 @@ heapgettup(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapfetchbuf(scan, block);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_start_page(scan, dir, &linesleft, &lineoff);
 continue_page:
@@ -876,7 +885,8 @@ heapgettup_pagemode(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapfetchbuf(scan, block);
+		heapbuildvis((TableScanDesc) scan);
 		page = BufferGetPage(scan->rs_cbuf);
 		linesleft = scan->rs_ntuples;
 		lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d15a02b2be7..75bdd341e7a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2278,11 +2278,14 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 	if (hscan->rs_nblocks == 0)
 		return false;
 
-	if (tsm->NextSampleBlock)
+	if (BufferIsValid(hscan->rs_cbuf))
 	{
-		blockno = tsm->NextSampleBlock(scanstate, hscan->rs_nblocks);
-		hscan->rs_cblock = blockno;
+		ReleaseBuffer(hscan->rs_cbuf);
+		hscan->rs_cbuf = InvalidBuffer;
 	}
+
+	if (tsm->NextSampleBlock)
+		blockno = tsm->NextSampleBlock(scanstate, hscan->rs_nblocks);
 	else
 	{
 		/* scanning table sequentially */
@@ -2324,20 +2327,35 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 		}
 	}
 
-	if (!BlockNumberIsValid(blockno))
+	hscan->rs_cblock = blockno;
+
+	if (!BlockNumberIsValid(hscan->rs_cblock))
 	{
-		if (BufferIsValid(hscan->rs_cbuf))
-			ReleaseBuffer(hscan->rs_cbuf);
-		hscan->rs_cbuf = InvalidBuffer;
-		hscan->rs_cblock = InvalidBlockNumber;
 		hscan->rs_inited = false;
-
 		return false;
 	}
 
-	heapgetpage(scan, blockno);
-	hscan->rs_inited = true;
+	Assert(hscan->rs_cblock < hscan->rs_nblocks);
+
+	/*
+	 * We may scan multiple pages before finding tuples to yield or finishing
+	 * the scan. Since we want to check for interrupts at least once per page,
+	 * do so here.
+	 */
+	CHECK_FOR_INTERRUPTS();
+
+	/* Read page using selected strategy */
+	hscan->rs_cbuf = ReadBufferExtended(hscan->rs_base.rs_rd, MAIN_FORKNUM,
+										hscan->rs_cblock, RBM_NORMAL, hscan->rs_strategy);
 
+	/*
+	 * If pagemode is allowed, prune the page and build an array of visible
+	 * tuple offsets.
+	 */
+	if (hscan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
+		heapbuildvis(scan);
+
+	hscan->rs_inited = true;
 	return true;
 }
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 932ec0d6f2b..4a3a017c33a 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -246,7 +246,7 @@ extern TableScanDesc heap_beginscan(Relation relation, Snapshot snapshot,
 									uint32 flags);
 extern void heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk,
 							   BlockNumber numBlks);
-extern void heapgetpage(TableScanDesc sscan, BlockNumber block);
+extern void heapbuildvis(TableScanDesc sscan);
 extern void heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 						bool allow_strat, bool allow_sync, bool allow_pagemode);
 extern void heap_endscan(TableScanDesc sscan);
-- 
2.37.2

v2-0005-Sequential-scans-support-streaming-read.patchtext/x-patch; charset=US-ASCII; name=v2-0005-Sequential-scans-support-streaming-read.patchDownload
From 56c2033318efd0062d4d0690f12a59654112cb37 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 26 Feb 2024 15:41:32 -0500
Subject: [PATCH v2 5/5] Sequential scans support streaming read

Implementing streaming read support for heap sequential scans includes
three parts:

Allocate the streaming read object in initscan(). On rescan, free the
existing streaming read object.

Implement a callback returning the next block to prefetch to the
streaming read API.

Invoke the streaming read API when a new page is needed and streaming
reads are enabled. If the scan direction is not forwards, the streaming
read object must be freed so that the fallback method can be used.
Streaming will not be re-enabled for a scan even if it changes back to
forwards.
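
As a rough sketch of how these three parts fit together (not part of the
patch): seqscan_streaming_sketch() below is invented for illustration.
It uses the API from the attached Streaming Read API patch and the real
heap_pgsr_next_single() callback added to heapam.c in this patch, but it
releases each buffer immediately, whereas the real heapfetchbuf() keeps
the pin in rs_cbuf.

	/* Sketch only; conceptually lives in heapam.c, next to the callback. */
	static void
	seqscan_streaming_sketch(HeapScanDesc scan)
	{
		PgStreamingRead *pgsr;
		Buffer		buf;

		/* Part 1: allocate one streaming read object for the scan. */
		pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_SEQUENTIAL,
											  scan,	/* passed to callback */
											  0,	/* no per-buffer data */
											  scan->rs_strategy,
											  BMR_REL(scan->rs_base.rs_rd),
											  MAIN_FORKNUM,
											  heap_pgsr_next_single);

		/*
		 * Part 3: each call returns the next pinned buffer in scan order,
		 * while look-ahead and prefetching happen inside the streaming read
		 * object, driven by the part-2 callback.
		 */
		while ((buf = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
		{
			/* ... process the visible tuples on BufferGetPage(buf) ... */
			ReleaseBuffer(buf);
		}

		pg_streaming_read_free(pgsr);
	}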

ci-os-only:
---
 src/backend/access/heap/heapam.c | 96 +++++++++++++++++++++++++++++---
 src/include/access/heapam.h      | 10 ++++
 2 files changed, 97 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e0fe3d9c326..c88b4221864 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -65,6 +65,7 @@
 #include "storage/smgr.h"
 #include "storage/spin.h"
 #include "storage/standby.h"
+#include "storage/streaming_read.h"
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/lsyscache.h"
@@ -228,6 +229,29 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * ----------------------------------------------------------------
  */
 
+static BlockNumber
+heap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
+					  void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) pgsr_private;
+
+	/*
+	 * Hard-code ScanDirection to ForwardScanDirection since only forward
+	 * scans support streaming reads.
+	 */
+	if (!scan->rs_inited)
+	{
+		scan->rs_prefetch_block = heapgettup_initial_block(scan, ForwardScanDirection);
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_prefetch_block = heapgettup_advance_block(scan,
+														   scan->rs_prefetch_block,
+														   ForwardScanDirection);
+
+	return scan->rs_prefetch_block;
+}
+
 /* ----------------
  *		initscan - scan code common to heap_beginscan and heap_rescan
  * ----------------
@@ -329,6 +353,7 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	ItemPointerSetInvalid(&scan->rs_ctup.t_self);
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
+	scan->rs_prefetch_block = InvalidBlockNumber;
 
 	/* page-at-a-time fields are always invalid when not rs_inited */
 
@@ -486,19 +511,44 @@ heapfetchbuf(HeapScanDesc scan, ScanDirection dir)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	if (!scan->rs_inited)
+	/*
+	 * Only forward scans use streaming reads. The scan direction is not yet
+	 * determined when the streaming read object is allocated, so all
+	 * sequential scans start out with a streaming read object. If a scan
+	 * then requests blocks in any direction other than forward, the
+	 * streaming read object is freed and the scan will not use streaming
+	 * reads again, even if the scan direction later changes back to
+	 * forward.
+	 */
+	if (!ScanDirectionIsForward(dir) && scan->rs_pgsr)
 	{
-		scan->rs_cblock = heapgettup_initial_block(scan, dir);
-		Assert(scan->rs_cblock != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
-		scan->rs_inited = true;
+		pg_streaming_read_free(scan->rs_pgsr);
+		scan->rs_pgsr = NULL;
+		scan->rs_prefetch_block = InvalidBlockNumber;
+	}
+
+	if (scan->rs_pgsr)
+	{
+		scan->rs_cbuf = pg_streaming_read_buffer_get_next(scan->rs_pgsr, NULL);
+		if (BufferIsValid(scan->rs_cbuf))
+			scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
 	}
 	else
-		scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock, dir);
+	{
+		if (!scan->rs_inited)
+		{
+			scan->rs_cblock = heapgettup_initial_block(scan, dir);
+			Assert(scan->rs_cblock != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
+			scan->rs_inited = true;
+		}
+		else
+			scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock, dir);
 
-	/* read block if valid */
-	if (BlockNumberIsValid(scan->rs_cblock))
-		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
-										   scan->rs_cblock, RBM_NORMAL, scan->rs_strategy);
+		/* read block if valid */
+		if (BlockNumberIsValid(scan->rs_cblock))
+			scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
+											   scan->rs_cblock, RBM_NORMAL, scan->rs_strategy);
+	}
 }
 
 /*
@@ -999,6 +1049,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	else
 		scan->rs_parallelworkerdata = NULL;
 
+	scan->rs_pgsr = NULL;
+
 	/*
 	 * we do this here instead of in initscan() because heap_rescan also calls
 	 * initscan() and we don't want to allocate memory again
@@ -1010,6 +1062,21 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 
 	initscan(scan, key, false);
 
+	/*
+	 * We do not know the scan direction yet. If the scan does not end up
+	 * being a forward scan, the streaming read object will be freed.
+	 */
+	if (scan->rs_base.rs_flags & SO_TYPE_SEQSCAN)
+	{
+		scan->rs_pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_SEQUENTIAL,
+													   scan,
+													   0,
+													   scan->rs_strategy,
+													   BMR_REL(scan->rs_base.rs_rd),
+													   MAIN_FORKNUM,
+													   heap_pgsr_next_single);
+	}
+
 	return (TableScanDesc) scan;
 }
 
@@ -1048,6 +1115,14 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 	 * reinitialize scan descriptor
 	 */
 	initscan(scan, key, true);
+
+	/*
+	 * The streaming read object is reset on rescan. This must be done after
+	 * initscan(), as some state referred to by pg_streaming_read_reset() is
+	 * reset in initscan().
+	 */
+	if (scan->rs_pgsr)
+		pg_streaming_read_reset(scan->rs_pgsr);
 }
 
 void
@@ -1063,6 +1138,9 @@ heap_endscan(TableScanDesc sscan)
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
 
+	if (scan->rs_pgsr)
+		pg_streaming_read_free(scan->rs_pgsr);
+
 	/*
 	 * decrement relation reference count and free scan descriptor storage
 	 */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4a3a017c33a..fe4ab40e18d 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -59,6 +59,7 @@ typedef struct HeapScanDescData
 	bool		rs_inited;		/* false = scan not init'd yet */
 	OffsetNumber rs_coffset;	/* current offset # in non-page-at-a-time mode */
 	BlockNumber rs_cblock;		/* current block # in scan, if any */
+	BlockNumber rs_prefetch_block;	/* block being prefetched */
 	Buffer		rs_cbuf;		/* current buffer in scan, if any */
 	/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
 
@@ -72,6 +73,15 @@ typedef struct HeapScanDescData
 	 */
 	ParallelBlockTableScanWorkerData *rs_parallelworkerdata;
 
+	/*
+	 * Streaming read object allocated for each sequential scan and
+	 * reallocated on rescan. If a sequential scan ends up requesting tuples
+	 * in any direction except ForwardScanDirection, the streaming read object
+	 * is freed and the scan falls back to a standard, synchronous block reading
+	 * method.
+	 */
+	struct PgStreamingRead *rs_pgsr;
+
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
 	int			rs_cindex;		/* current tuple's index in vistuples */
 	int			rs_ntuples;		/* number of visible tuples on page */
-- 
2.37.2

Attachment: v2-0003-Streaming-Read-API.patch (text/x-patch)
From cf57c54e52c1a9c879ffbe9e7532907e7135266d Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 22 Jul 2023 17:31:54 +1200
Subject: [PATCH v2 3/5] Streaming Read API

---
 contrib/pg_prewarm/pg_prewarm.c          |  40 +-
 src/backend/access/transam/xlogutils.c   |   2 +-
 src/backend/postmaster/bgwriter.c        |   8 +-
 src/backend/postmaster/checkpointer.c    |  15 +-
 src/backend/storage/Makefile             |   2 +-
 src/backend/storage/aio/Makefile         |  14 +
 src/backend/storage/aio/meson.build      |   5 +
 src/backend/storage/aio/streaming_read.c | 435 ++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c      | 560 +++++++++++++++--------
 src/backend/storage/buffer/localbuf.c    |  14 +-
 src/backend/storage/meson.build          |   1 +
 src/backend/storage/smgr/smgr.c          |  49 +-
 src/include/storage/bufmgr.h             |  22 +
 src/include/storage/smgr.h               |   4 +-
 src/include/storage/streaming_read.h     |  45 ++
 src/include/utils/rel.h                  |   6 -
 src/tools/pgindent/typedefs.list         |   2 +
 17 files changed, 986 insertions(+), 238 deletions(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c
 create mode 100644 src/include/storage/streaming_read.h

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e46..9617bf130bd 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
+#include "storage/streaming_read.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -38,6 +39,25 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_streaming_read_private
+{
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static BlockNumber
+pg_prewarm_streaming_read_next(PgStreamingRead *pgsr,
+							   void *pgsr_private,
+							   void *per_buffer_data)
+{
+	struct pg_prewarm_streaming_read_private *p = pgsr_private;
+
+	if (p->blocknum <= p->last_block)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_streaming_read_private p;
+		PgStreamingRead *pgsr;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+											  &p,
+											  0,
+											  NULL,
+											  BMR_REL(rel),
+											  forkNumber,
+											  pg_prewarm_streaming_read_next);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(pg_streaming_read_buffer_get_next(pgsr, NULL) == InvalidBuffer);
+		pg_streaming_read_free(pgsr);
 	}
 
 	/* Close relation, release lock. */
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index aa8667abd10..8775b5789be 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -657,7 +657,7 @@ XLogDropDatabase(Oid dbid)
 	 * This is unnecessarily heavy-handed, as it will close SMgrRelation
 	 * objects for other databases as well. DROP DATABASE occurs seldom enough
 	 * that it's not worth introducing a variant of smgrclose for just this
-	 * purpose. XXX: Or should we rather leave the smgr entries dangling?
+	 * purpose.
 	 */
 	smgrcloseall();
 
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index d7d6cc0cd7b..13e5376619e 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -246,10 +246,12 @@ BackgroundWriterMain(void)
 		if (FirstCallSinceLastCheckpoint())
 		{
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the bgwriter does
+			 * not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 		}
 
 		/*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5e949fc885b..5d843b61426 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -469,10 +469,12 @@ CheckpointerMain(void)
 				ckpt_performed = CreateRestartPoint(flags);
 
 			/*
-			 * After any checkpoint, close all smgr files.  This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
+			 * After any checkpoint, free all smgr objects.  Otherwise we
+			 * would never do so for dropped relations, as the checkpointer
+			 * does not process shared invalidation messages or call
+			 * AtEOXact_SMgr().
 			 */
-			smgrcloseall();
+			smgrdestroyall();
 
 			/*
 			 * Indicate checkpoint completion to any waiting backends.
@@ -958,11 +960,8 @@ RequestCheckpoint(int flags)
 		 */
 		CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
 
-		/*
-		 * After any checkpoint, close all smgr files.  This is so we won't
-		 * hang onto smgr references to deleted files indefinitely.
-		 */
-		smgrcloseall();
+		/* Free all smgr objects, as CheckpointerMain() normally would. */
+		smgrdestroyall();
 
 		return;
 	}
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 00000000000..bcab44c802f
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 00000000000..39aef2a84a2
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 00000000000..19605090fea
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,435 @@
+#include "postgres.h"
+
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+
+/*
+ * Element type for PgStreamingRead's circular array of block ranges.
+ *
+ * For hits, need_complete is false and there is just one block per
+ * range, already pinned and ready for use.
+ *
+ * For misses, need_complete is true and buffers[] holds a range of
+ * blocks that are contiguous in storage (though the buffers may not be
+ * contiguous in memory), so we can complete them with a single call to
+ * CompleteReadBuffers().
+ */
+typedef struct PgStreamingReadRange
+{
+	bool		advice_issued;
+	bool		need_complete;
+	BlockNumber blocknum;
+	int			nblocks;
+	int			per_buffer_data_index[MAX_BUFFERS_PER_TRANSFER];
+	Buffer		buffers[MAX_BUFFERS_PER_TRANSFER];
+} PgStreamingReadRange;
+
+struct PgStreamingRead
+{
+	int			max_ios;
+	int			ios_in_progress;
+	int			ios_in_progress_trigger;
+	int			max_pinned_buffers;
+	int			pinned_buffers;
+	int			pinned_buffers_trigger;
+	int			next_tail_buffer;
+	bool		finished;
+	void	   *pgsr_private;
+	PgStreamingReadBufferCB callback;
+	BufferAccessStrategy strategy;
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+
+	bool		advice_enabled;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* Space for optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+	int			per_buffer_data_next;
+
+	/* Circular buffer of ranges. */
+	int			size;
+	int			head;
+	int			tail;
+	PgStreamingReadRange ranges[FLEXIBLE_ARRAY_MEMBER];
+};
+
+static PgStreamingRead *
+pg_streaming_read_buffer_alloc_internal(int flags,
+										void *pgsr_private,
+										size_t per_buffer_data_size,
+										BufferAccessStrategy strategy)
+{
+	PgStreamingRead *pgsr;
+	int			size;
+	int			max_ios;
+	uint32		max_pinned_buffers;
+
+
+	/*
+	 * Decide how many assumed I/Os we will allow to run concurrently.  That
+	 * is, how much advice to issue to the kernel that we will soon read.  This
+	 * number also affects how far we look ahead for opportunities to start
+	 * more I/Os.
+	 */
+	if (flags & PGSR_FLAG_MAINTENANCE)
+		max_ios = maintenance_io_concurrency;
+	else
+		max_ios = effective_io_concurrency;
+
+	/*
+	 * The desired level of I/O concurrency controls how far ahead we are
+	 * willing to look.  We also clamp it to at least
+	 * MAX_BUFFERS_PER_TRANSFER so that we have a chance to build up a
+	 * full-sized read, even when max_ios is zero.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, MAX_BUFFERS_PER_TRANSFER);
+
+	/*
+	 * With the *_io_concurrency GUCs we might have 0.  We want to allow at
+	 * least one, to keep our gating logic simple.
+	 */
+	max_ios = Max(max_ios, 1);
+
+	/*
+	 * Don't allow this backend to pin too many buffers.  For now we'll apply
+	 * the limit for the shared buffer pool and the local buffer pool, without
+	 * worrying which it is.
+	 */
+	LimitAdditionalPins(&max_pinned_buffers);
+	LimitAdditionalLocalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * pgsr->ranges is a circular buffer.  When it is empty, head == tail.
+	 * When it is full, there is an empty element between head and tail.  Head
+	 * can also be empty (nblocks == 0), so we need two extra elements for
+	 * non-occupied ranges, on top of max_pinned_buffers, to allow for the
+	 * maximum possible number of occupied ranges of the smallest possible
+	 * size of one.
+	 */
+	size = max_pinned_buffers + 2;
+
+	pgsr = (PgStreamingRead *)
+		palloc0(offsetof(PgStreamingRead, ranges) +
+				sizeof(pgsr->ranges[0]) * size);
+
+	pgsr->max_ios = max_ios;
+	pgsr->per_buffer_data_size = per_buffer_data_size;
+	pgsr->max_pinned_buffers = max_pinned_buffers;
+	pgsr->pgsr_private = pgsr_private;
+	pgsr->strategy = strategy;
+	pgsr->size = size;
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  As long as direct I/O isn't
+	 * enabled, and the caller hasn't promised sequential access, we can use
+	 * it.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & PGSR_FLAG_SEQUENTIAL) == 0)
+		pgsr->advice_enabled = true;
+#endif
+
+	/*
+	 * We want to avoid creating ranges that are smaller than they could be
+	 * just because we hit max_pinned_buffers.  We only look ahead when the
+	 * number of pinned buffers falls below this trigger number, or put
+	 * another way, we stop looking ahead when we wouldn't be able to build a
+	 * "full sized" range.
+	 */
+	pgsr->pinned_buffers_trigger =
+		Max(1, (int) max_pinned_buffers - MAX_BUFFERS_PER_TRANSFER);
+
+	/* Space for the callback to store extra data along with each block. */
+	if (per_buffer_data_size)
+		pgsr->per_buffer_data = palloc(per_buffer_data_size * max_pinned_buffers);
+
+	return pgsr;
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.
+ */
+PgStreamingRead *
+pg_streaming_read_buffer_alloc(int flags,
+							   void *pgsr_private,
+							   size_t per_buffer_data_size,
+							   BufferAccessStrategy strategy,
+							   BufferManagerRelation bmr,
+							   ForkNumber forknum,
+							   PgStreamingReadBufferCB next_block_cb)
+{
+	PgStreamingRead *result;
+
+	result = pg_streaming_read_buffer_alloc_internal(flags,
+													 pgsr_private,
+													 per_buffer_data_size,
+													 strategy);
+	result->callback = next_block_cb;
+	result->bmr = bmr;
+	result->forknum = forknum;
+
+	return result;
+}
+
+/*
+ * Start building a new range.  This is called after the previous one
+ * reached maximum size, or the callback's next block can't be merged with it.
+ *
+ * Since the previous head range has now reached its full potential size, this
+ * is also a good time to issue 'prefetch' advice, because we know we'll
+ * soon be reading.  In future, we could start an actual I/O here.
+ */
+static PgStreamingReadRange *
+pg_streaming_read_new_range(PgStreamingRead *pgsr)
+{
+	PgStreamingReadRange *head_range;
+
+	head_range = &pgsr->ranges[pgsr->head];
+	Assert(head_range->nblocks > 0);
+
+	/*
+	 * If a call to CompleteReadBuffers() will be needed, we can issue advice
+	 * to the kernel now to get the read started.  We suppress it if the
+	 * access pattern appears to be completely sequential, though, because on
+	 * some systems that interferes with the kernel's own sequential read-ahead
+	 * heuristics and hurts performance.
+	 */
+	if (pgsr->advice_enabled)
+	{
+		BlockNumber blocknum = head_range->blocknum;
+		int			nblocks = head_range->nblocks;
+
+		if (head_range->need_complete && blocknum != pgsr->seq_blocknum)
+		{
+			SMgrRelation smgr =
+				pgsr->bmr.smgr ? pgsr->bmr.smgr :
+				RelationGetSmgr(pgsr->bmr.rel);
+
+			Assert(!head_range->advice_issued);
+
+			smgrprefetch(smgr, pgsr->forknum, blocknum, nblocks);
+
+			/*
+			 * Count this as an I/O that is concurrently in progress, though
+			 * we don't really know if the kernel generates a physical I/O.
+			 */
+			head_range->advice_issued = true;
+			pgsr->ios_in_progress++;
+		}
+
+		/* Remember the block after this range, for sequence detection. */
+		pgsr->seq_blocknum = blocknum + nblocks;
+	}
+
+	/* Create a new head range.  There must be space. */
+	Assert(pgsr->size > pgsr->max_pinned_buffers);
+	Assert((pgsr->head + 1) % pgsr->size != pgsr->tail);
+	if (++pgsr->head == pgsr->size)
+		pgsr->head = 0;
+	head_range = &pgsr->ranges[pgsr->head];
+	head_range->nblocks = 0;
+
+	return head_range;
+}
+
+static void
+pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
+{
+	/*
+	 * If we're finished or can't start more I/O, then don't look ahead.
+	 */
+	if (pgsr->finished || pgsr->ios_in_progress == pgsr->max_ios)
+		return;
+
+	/*
+	 * We'll also wait until the number of pinned buffers falls below our
+	 * trigger level, so that we have the chance to create a full range.
+	 */
+	if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+		return;
+
+	do
+	{
+		BufferManagerRelation bmr;
+		ForkNumber	forknum;
+		BlockNumber blocknum;
+		Buffer		buffer;
+		bool		found;
+		bool		need_complete;
+		PgStreamingReadRange *head_range;
+		void	   *per_buffer_data;
+
+		/* Do we have a full-sized range? */
+		head_range = &pgsr->ranges[pgsr->head];
+		if (head_range->nblocks == lengthof(head_range->buffers))
+		{
+			Assert(head_range->need_complete);
+			head_range = pg_streaming_read_new_range(pgsr);
+
+			/*
+			 * Give up now if I/O is saturated, or we wouldn't be able to form
+			 * another full range after this due to the pin limit.
+			 */
+			if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger ||
+				pgsr->ios_in_progress == pgsr->max_ios)
+				break;
+		}
+
+		per_buffer_data = (char *) pgsr->per_buffer_data +
+			pgsr->per_buffer_data_size * pgsr->per_buffer_data_next;
+
+		/* Find out which block the callback wants to read next. */
+		blocknum = pgsr->callback(pgsr, pgsr->pgsr_private, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			pgsr->finished = true;
+			break;
+		}
+		bmr = pgsr->bmr;
+		forknum = pgsr->forknum;
+
+		Assert(pgsr->pinned_buffers < pgsr->max_pinned_buffers);
+
+		buffer = PrepareReadBuffer(bmr,
+								   forknum,
+								   blocknum,
+								   pgsr->strategy,
+								   &found);
+		pgsr->pinned_buffers++;
+
+		need_complete = !found;
+
+		/* Is there a head range that we can't extend? */
+		head_range = &pgsr->ranges[pgsr->head];
+		if (head_range->nblocks > 0 &&
+			(!need_complete ||
+			 !head_range->need_complete ||
+			 head_range->blocknum + head_range->nblocks != blocknum))
+		{
+			/* Yes, time to start building a new one. */
+			head_range = pg_streaming_read_new_range(pgsr);
+			Assert(head_range->nblocks == 0);
+		}
+
+		if (head_range->nblocks == 0)
+		{
+			/* Initialize a new range beginning at this block. */
+			head_range->blocknum = blocknum;
+			head_range->need_complete = need_complete;
+			head_range->advice_issued = false;
+		}
+		else
+		{
+			/* We can extend an existing range by one block. */
+			Assert(head_range->blocknum + head_range->nblocks == blocknum);
+			Assert(head_range->need_complete);
+		}
+
+		head_range->per_buffer_data_index[head_range->nblocks] = pgsr->per_buffer_data_next++;
+		head_range->buffers[head_range->nblocks] = buffer;
+		head_range->nblocks++;
+
+		if (pgsr->per_buffer_data_next == pgsr->max_pinned_buffers)
+			pgsr->per_buffer_data_next = 0;
+
+	} while (pgsr->pinned_buffers < pgsr->max_pinned_buffers &&
+			 pgsr->ios_in_progress < pgsr->max_ios);
+
+	if (pgsr->ranges[pgsr->head].nblocks > 0)
+		pg_streaming_read_new_range(pgsr);
+}
+
+Buffer
+pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
+{
+	pg_streaming_read_look_ahead(pgsr);
+
+	/* See if we have one buffer to return. */
+	while (pgsr->tail != pgsr->head)
+	{
+		PgStreamingReadRange *tail_range;
+
+		tail_range = &pgsr->ranges[pgsr->tail];
+
+		/*
+		 * Do we need to perform an I/O before returning the buffers from this
+		 * range?
+		 */
+		if (tail_range->need_complete)
+		{
+			CompleteReadBuffers(pgsr->bmr,
+								tail_range->buffers,
+								pgsr->forknum,
+								tail_range->blocknum,
+								tail_range->nblocks,
+								false,
+								pgsr->strategy);
+			tail_range->need_complete = false;
+
+			/*
+			 * We don't really know if the kernel generated a physical I/O
+			 * when we issued advice, let alone when it finished, but it has
+			 * certainly finished after a read call returns.
+			 */
+			if (tail_range->advice_issued)
+				pgsr->ios_in_progress--;
+		}
+
+		/* Are there more buffers available in this range? */
+		if (pgsr->next_tail_buffer < tail_range->nblocks)
+		{
+			int			buffer_index;
+			Buffer		buffer;
+
+			buffer_index = pgsr->next_tail_buffer++;
+			buffer = tail_range->buffers[buffer_index];
+
+			Assert(BufferIsValid(buffer));
+
+			/* We are giving away ownership of this pinned buffer. */
+			Assert(pgsr->pinned_buffers > 0);
+			pgsr->pinned_buffers--;
+
+			if (per_buffer_data)
+				*per_buffer_data = (char *) pgsr->per_buffer_data +
+					tail_range->per_buffer_data_index[buffer_index] *
+					pgsr->per_buffer_data_size;
+
+			return buffer;
+		}
+
+		/* Advance tail to next range, if there is one. */
+		if (++pgsr->tail == pgsr->size)
+			pgsr->tail = 0;
+		pgsr->next_tail_buffer = 0;
+	}
+
+	Assert(pgsr->pinned_buffers == 0);
+
+	return InvalidBuffer;
+}
+
+void
+pg_streaming_read_free(PgStreamingRead *pgsr)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead, and unpin anything that wasn't consumed. */
+	pgsr->finished = true;
+	while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	if (pgsr->per_buffer_data)
+		pfree(pgsr->per_buffer_data);
+	pfree(pgsr);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7d601bef6dd..2157a97b973 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -472,7 +472,7 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
 								ReadBufferMode mode, BufferAccessStrategy strategy,
 								bool *hit);
@@ -501,7 +501,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -795,15 +795,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
+	buf = ReadBuffer_common(BMR_REL(reln),
 							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+
 	return buf;
 }
 
@@ -827,8 +821,9 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 
 	SMgrRelation smgr = smgropen(rlocator, InvalidBackendId);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -1002,7 +997,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 		bool		hit;
 
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
+		buffer = ReadBuffer_common(bmr,
 								   fork, extend_to - 1, mode, strategy,
 								   &hit);
 	}
@@ -1016,18 +1011,11 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
 static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
 				  BufferAccessStrategy strategy, bool *hit)
 {
-	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
-	IOContext	io_context;
-	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
-
-	*hit = false;
+	Buffer		buffer;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1046,175 +1034,339 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		*hit = false;
+
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
+	buffer = PrepareReadBuffer(bmr,
+							   forkNum,
+							   blockNum,
+							   strategy,
+							   hit);
+
+	/* At this point we do NOT hold any locks. */
 
+	if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+	{
+		/* if we just want zeroes and a lock, we're done */
+		ZeroBuffer(buffer, mode);
+	}
+	else if (!*hit)
+	{
+		/* we might need to perform I/O */
+		CompleteReadBuffers(bmr,
+							&buffer,
+							forkNum,
+							blockNum,
+							1,
+							mode == RBM_ZERO_ON_ERROR,
+							strategy);
+	}
+
+	return buffer;
+}
+
+/*
+ * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
+ * the returned buffer can be used immediately.  Otherwise, a physical read
+ * should be completed with CompleteReadBuffers(), or the buffer should be
+ * zeroed with ZeroBuffer().  PrepareReadBuffer() followed by
+ * CompleteReadBuffers() or ZeroBuffer() is equivalent to ReadBuffer(), but
+ * the caller has the opportunity to combine reads of multiple neighboring
+ * blocks into one CompleteReadBuffers() call.
+ *
+ * *foundPtr is set to true for a hit, and false for a miss.
+ */
+Buffer
+PrepareReadBuffer(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	Assert(blockNum != P_NEW);
+
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
 	}
 	else
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
 		io_context = IOContextForStrategy(strategy);
 		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
 	}
 
-	/* At this point we do NOT hold any locks. */
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
 
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * CompleteReadBuffers() (so, not for hits, and not for buffers that
+		 * are zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
 		VacuumPageHit++;
 		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageHit;
 
 		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
+	}
 
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
-		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
-		}
+	return BufferDescriptorGetBuffer(bufHdr);
+}
 
-		return BufferDescriptorGetBuffer(bufHdr);
+static inline bool
+CompleteReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
+	{
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
 	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
 
-	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
-	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+/*
+ * Complete a set of reads prepared with PrepareReadBuffer().  The buffers must
+ * cover a cluster of neighboring block numbers.
+ *
+ * Typically this performs one physical vector read covering the block range,
+ * but if some of the buffers have already been read in the meantime by any
+ * backend, zero or multiple reads may be performed.
+ */
+void
+CompleteReadBuffers(BufferManagerRelation bmr,
+					Buffer *buffers,
+					ForkNumber forknum,
+					BlockNumber blocknum,
+					int nblocks,
+					bool zero_on_error,
+					BufferAccessStrategy strategy)
+{
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
 
 	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out not to be true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
 	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
 	else
+		pgBufferUsage.shared_blks_read += nblocks;
+
+	for (int i = 0; i < nblocks; ++i)
 	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_BUFFERS_PER_TRANSFER];
+		void	   *io_pages[MAX_BUFFERS_PER_TRANSFER];
+		instr_time	io_start;
+		BlockNumber io_first_block;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+#ifdef USE_ASSERT_CHECKING
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		/*
+		 * We could get all the information from buffer headers, but it can be
+		 * expensive to access buffer header cache lines so we make the caller
+		 * provide all the information we need, and assert that it is
+		 * consistent.
+		 */
+		{
+			RelFileLocator xlocator;
+			ForkNumber	xforknum;
+			BlockNumber xblocknum;
+
+			BufferGetTag(buffers[i], &xlocator, &xforknum, &xblocknum);
+			Assert(RelFileLocatorEquals(bmr.smgr->smgr_rlocator.locator, xlocator));
+			Assert(xforknum == forknum);
+			Assert(xblocknum == blocknum + i);
+		}
+#endif
+
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!CompleteReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PrepareReadBuffer().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+		/*
+		 * How many neighboring-on-disk blocks can we scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   CompleteReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
+
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
+
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
 			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
 			}
 			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-		}
-	}
-
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
-	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
-	}
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
 
-	if (isLocalBuf)
-	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if (zero_on_error || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+			}
 
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
-	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
-	}
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
@@ -1228,11 +1380,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.  A read should be
+ * performed with CompleteReadBuffers().
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1291,19 +1440,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called PrepareReadBuffer() but not yet CompleteReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1368,19 +1508,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called PrepareReadBuffer() but not yet CompleteReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1412,15 +1543,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1774,7 +1899,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2043,7 +2168,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2066,7 +2191,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2381,7 +2506,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if PrepareReadBuffer() was called and
+		 * CompleteReadBuffers() hasn't been called yet.  We'll check by
+		 * loading the flags without locking.  This is racy, but it's OK to
+		 * return false spuriously: when CompleteReadBuffers() calls
+		 * StartBufferIO(), it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2390,7 +2520,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3458,7 +3588,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -4845,6 +4975,46 @@ ConditionalLockBuffer(Buffer buffer)
 									LW_EXCLUSIVE);
 }
 
+/*
+ * Zero a buffer, and lock it as RBM_ZERO_AND_LOCK or
+ * RBM_ZERO_AND_CLEANUP_LOCK would.  The buffer must already be pinned.  It
+ * does not have to be valid, but it is valid and locked on return.
+ */
+void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
 /*
  * Verify that this backend is pinning the buffer exactly once.
  *
@@ -5197,9 +5367,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5212,6 +5388,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 1be4f4f8daf..717b8f58daf 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -109,10 +109,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -288,7 +287,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -298,9 +297,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but NLocBuffer might
+	 * not be initialized yet, so read num_temp_buffers instead.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 563a0be5c74..0d7272e796e 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -147,7 +147,9 @@ smgrshutdown(int code, Datum arg)
 /*
  * smgropen() -- Return an SMgrRelation object, creating it if need be.
  *
- * This does not attempt to actually open the underlying file.
+ * This does not attempt to actually open the underlying files.  The returned
+ * object remains valid at least until AtEOXact_SMgr() is called, or until
+ * smgrdestroy() is called in non-transaction backends.
  */
 SMgrRelation
 smgropen(RelFileLocator rlocator, BackendId backend)
@@ -259,10 +261,10 @@ smgrexists(SMgrRelation reln, ForkNumber forknum)
 }
 
 /*
- * smgrclose() -- Close and delete an SMgrRelation object.
+ * smgrdestroy() -- Delete an SMgrRelation object.
  */
 void
-smgrclose(SMgrRelation reln)
+smgrdestroy(SMgrRelation reln)
 {
 	SMgrRelation *owner;
 	ForkNumber	forknum;
@@ -289,12 +291,14 @@ smgrclose(SMgrRelation reln)
 }
 
 /*
- * smgrrelease() -- Release all resources used by this object.
+ * smgrclose() -- Release all resources used by this object.
  *
- * The object remains valid.
+ * The object remains valid, but is moved to the unowned list where it will
+ * be destroyed by AtEOXact_SMgr().  It may be re-owned if it is accessed by a
+ * relation before then.
  */
 void
-smgrrelease(SMgrRelation reln)
+smgrclose(SMgrRelation reln)
 {
 	for (ForkNumber forknum = 0; forknum <= MAX_FORKNUM; forknum++)
 	{
@@ -302,15 +306,20 @@ smgrrelease(SMgrRelation reln)
 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
 	}
 	reln->smgr_targblock = InvalidBlockNumber;
+
+	if (reln->smgr_owner)
+	{
+		*reln->smgr_owner = NULL;
+		reln->smgr_owner = NULL;
+		dlist_push_tail(&unowned_relns, &reln->node);
+	}
 }
 
 /*
- * smgrreleaseall() -- Release resources used by all objects.
- *
- * This is called for PROCSIGNAL_BARRIER_SMGRRELEASE.
+ * smgrcloseall() -- Close all objects.
  */
 void
-smgrreleaseall(void)
+smgrcloseall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -322,14 +331,17 @@ smgrreleaseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrrelease(reln);
+		smgrclose(reln);
 }
 
 /*
- * smgrcloseall() -- Close all existing SMgrRelation objects.
+ * smgrdestroyall() -- Destroy all SMgrRelation objects.
+ *
+ * It must be known that there are no pointers to SMgrRelations, other than
+ * those registered with smgrsetowner().
  */
 void
-smgrcloseall(void)
+smgrdestroyall(void)
 {
 	HASH_SEQ_STATUS status;
 	SMgrRelation reln;
@@ -341,7 +353,7 @@ smgrcloseall(void)
 	hash_seq_init(&status, SMgrRelationHash);
 
 	while ((reln = (SMgrRelation) hash_seq_search(&status)) != NULL)
-		smgrclose(reln);
+		smgrdestroy(reln);
 }
 
 /*
@@ -733,7 +745,8 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
  * AtEOXact_SMgr
  *
  * This routine is called during transaction commit or abort (it doesn't
- * particularly care which).  All transient SMgrRelation objects are closed.
+ * particularly care which).  All transient SMgrRelation objects are
+ * destroyed.
  *
  * We do this as a compromise between wanting transient SMgrRelations to
  * live awhile (to amortize the costs of blind writes of multiple blocks)
@@ -747,7 +760,7 @@ AtEOXact_SMgr(void)
 	dlist_mutable_iter iter;
 
 	/*
-	 * Zap all unowned SMgrRelations.  We rely on smgrclose() to remove each
+	 * Zap all unowned SMgrRelations.  We rely on smgrdestroy() to remove each
 	 * one from the list.
 	 */
 	dlist_foreach_modify(iter, &unowned_relns)
@@ -757,7 +770,7 @@ AtEOXact_SMgr(void)
 
 		Assert(rel->smgr_owner == NULL);
 
-		smgrclose(rel);
+		smgrdestroy(rel);
 	}
 }
 
@@ -768,6 +781,6 @@ AtEOXact_SMgr(void)
 bool
 ProcessBarrierSmgrRelease(void)
 {
-	smgrreleaseall();
+	smgrcloseall();
 	return true;
 }
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..a38f1acb37a 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
+/*
+ * Maximum number of buffers for multi-buffer I/O functions.  This is set to
+ * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a smaller maximum.
+ */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
 
 /*
  * prototypes for functions in bufmgr.c
@@ -177,6 +183,18 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+extern Buffer PrepareReadBuffer(BufferManagerRelation bmr,
+								ForkNumber forkNum,
+								BlockNumber blockNum,
+								BufferAccessStrategy strategy,
+								bool *foundPtr);
+extern void CompleteReadBuffers(BufferManagerRelation bmr,
+								Buffer *buffers,
+								ForkNumber forknum,
+								BlockNumber blocknum,
+								int nblocks,
+								bool zero_on_error,
+								BufferAccessStrategy strategy);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -247,9 +265,13 @@ extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);
 extern bool IsBufferCleanupOK(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
+extern void ZeroBuffer(Buffer buffer, ReadBufferMode mode);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 527cd2a0568..d8ffe397faf 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -85,8 +85,8 @@ extern void smgrclearowner(SMgrRelation *owner, SMgrRelation reln);
 extern void smgrclose(SMgrRelation reln);
 extern void smgrcloseall(void);
 extern void smgrcloserellocator(RelFileLocatorBackend rlocator);
-extern void smgrrelease(SMgrRelation reln);
-extern void smgrreleaseall(void);
+extern void smgrdestroy(SMgrRelation reln);
+extern void smgrdestroyall(void);
 extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern void smgrdosyncall(SMgrRelation *rels, int nrels);
 extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 00000000000..40c3408c541
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,45 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define PGSR_FLAG_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users.
+ */
+#define PGSR_FLAG_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define PGSR_FLAG_SEQUENTIAL 0x02
+
+struct PgStreamingRead;
+typedef struct PgStreamingRead PgStreamingRead;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*PgStreamingReadBufferCB) (PgStreamingRead *pgsr,
+												void *pgsr_private,
+												void *per_buffer_private);
+
+extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags,
+													   void *pgsr_private,
+													   size_t per_buffer_private_size,
+													   BufferAccessStrategy strategy,
+													   BufferManagerRelation bmr,
+													   ForkNumber forknum,
+													   PgStreamingReadBufferCB next_block_cb);
+
+extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
+extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
+extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+
+#endif
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index a584b1ddff3..6636cc82c09 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -561,12 +561,6 @@ typedef struct ViewOptions
  *
  * Very little code is authorized to touch rel->rd_smgr directly.  Instead
  * use this function to fetch its value.
- *
- * Note: since a relcache flush can cause the file handle to be closed again,
- * it's unwise to hold onto the pointer returned by this function for any
- * long period.  Recommended practice is to just re-execute RelationGetSmgr
- * each time you need to access the SMgrRelation.  It's quite cheap in
- * comparison to whatever an smgr function is going to do.
  */
 static inline SMgrRelation
 RelationGetSmgr(Relation rel)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 29fd1cae641..018ebbcbaae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2089,6 +2089,8 @@ PgStat_TableCounts
 PgStat_TableStatus
 PgStat_TableXactStatus
 PgStat_WalStats
+PgStreamingRead
+PgStreamingReadRange
 PgXmlErrorContext
 PgXmlStrictness
 Pg_finfo_record
-- 
2.37.2

#8Melanie Plageman
melanieplageman@gmail.com
In reply to: Melanie Plageman (#7)
5 attachment(s)
Re: Streaming read-ready sequential scan code

On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote:

On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

There is an outstanding question about where to allocate the
PgStreamingRead object for sequential scans

I've written three alternative implementations of the actual streaming
read user for sequential scan which handle the question of where to
allocate the streaming read object and how to handle changing scan
direction in different ways.

Option A) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation
- Allocates the streaming read object in initscan(). Since we do not
know the scan direction at this time, if the scan ends up not being a
forwards scan, the streaming read object must later be freed -- so
this will sometimes allocate a streaming read object it never uses.
- Only supports ForwardScanDirection and once the scan direction
changes, streaming is never supported again -- even if we return to
ForwardScanDirection
- Must maintain a "fallback" codepath that does not use the streaming read API

Attached is a version of this patch which implements a "reset"
function for the streaming read API; on rescan, this should be cheaper
than a full pg_streaming_read_free() followed by re-allocation. This
can easily be ported to
on A as an example.

Attached is the latest version of this patchset -- rebased in light of
Thomas' updates to the streaming read API [1]. I have chosen the
approach I think we should go with. It is a hybrid of my previously
proposed approaches.

The streaming read is allocated in heap_beginscan() and then reset on
rescan and when the scan direction changes. I only check whether the
scan direction has changed when a new page is needed. With this
implementation no fallback method is needed, so we can remove the
non-streaming read code for heap sequential scans.
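
A rough sketch of the allocation and callback, using the signatures from
the proposed streaming_read.h: the callback name heap_scan_stream_read_next,
the scan fields rs_pgsr and rs_prefetch_block, and the flag choice are
assumptions for illustration; the sketch also ignores parallel and
synchronized scans, which the real code would presumably route through
heapgettup_advance_block().

/* Callback: tell the streaming read machinery which block to read next. */
static BlockNumber
heap_scan_stream_read_next(PgStreamingRead *pgsr, void *pgsr_private,
                           void *per_buffer_data)
{
    HeapScanDesc scan = (HeapScanDesc) pgsr_private;

    /*
     * rs_prefetch_block (assumed name) tracks how far ahead the stream has
     * advanced, independently of rs_cblock, the block the scan is currently
     * processing.  A plain forward scan starting at block 0 is assumed.
     */
    if (scan->rs_prefetch_block >= scan->rs_nblocks)
        return InvalidBlockNumber;  /* end of stream */

    return scan->rs_prefetch_block++;
}

/* In heap_beginscan(), roughly (rs_pgsr is an assumed field): */
scan->rs_pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_SEQUENTIAL,
                                               scan,
                                               0,   /* no per-buffer data */
                                               scan->rs_strategy,
                                               BMR_REL(scan->rs_base.rs_rd),
                                               MAIN_FORKNUM,
                                               heap_scan_stream_read_next);

heapfetchbuf() would then consume the stream with
pg_streaming_read_buffer_get_next(scan->rs_pgsr, NULL), releasing the
previous rs_cbuf first, and the scan ends when that returns InvalidBuffer.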

Because heapgettup() and heapgettup_pagemode() are also used for TID
range scans, this patch also happens to implement streaming reads for
TID range scans.

- Melanie

[1]: /messages/by-id/CA+hUKGJtLyxcAEvLhVUhgD4fMQkOu3PDaj8Qb9SR_UsmzgsBpQ@mail.gmail.com

Attachments:

v3-0001-Split-heapgetpage-into-two-parts.patch (text/x-diff; charset=us-ascii)
From 9227b1b621c473ab189ac0cefba1b9aaed02aa09 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sat, 27 Jan 2024 18:39:37 -0500
Subject: [PATCH v3 1/5] Split heapgetpage into two parts

heapgetpage(), a per-block utility function used in heap scans, read a
passed-in block into a buffer and then, if page-at-a-time processing was
enabled, pruned the page and built an array of its visible tuples. This
was used for sequential and sample scans.

Future commits will add support for streaming reads. The streaming read
API will read in the blocks specified by a callback, but any significant
per-page processing should be done synchronously on the buffer yielded
by the streaming read API. To support this, separate the logic in
heapgetpage() to read a block into a buffer from that which prunes the
page and builds an array of the visible tuples. The former is now
heapfetchbuf() and the latter is now heapbuildvis().

Future commits will push the logic for selecting the next block into
heapfetchbuf() in cases when streaming reads are not supported (such as
backwards sequential scans). Because this logic differs for sample scans
and sequential scans, inline the code to read the block into a buffer
for sample scans.

This has the added benefit of allowing for a bit of refactoring in
heapam_scan_sample_next_block(), including unpinning the previous buffer
before invoking the callback to select the next block.
---
 src/backend/access/heap/heapam.c         | 74 ++++++++++++++----------
 src/backend/access/heap/heapam_handler.c | 40 +++++++++----
 src/include/access/heapam.h              |  2 +-
 3 files changed, 72 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 707460a5364..449221da6ac 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -367,17 +367,18 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
 }
 
 /*
- * heapgetpage - subroutine for heapgettup()
+ * heapbuildvis - Utility function for heap scans.
  *
- * This routine reads and pins the specified page of the relation.
- * In page-at-a-time mode it performs additional work, namely determining
- * which tuples on the page are visible.
+ * Given a page residing in a buffer saved in the scan descriptor, prune the
+ * page and determine which of its tuples are visible, saving their offsets
+ * in an array in the scan descriptor.
  */
 void
-heapgetpage(TableScanDesc sscan, BlockNumber block)
+heapbuildvis(TableScanDesc sscan)
 {
 	HeapScanDesc scan = (HeapScanDesc) sscan;
-	Buffer		buffer;
+	Buffer		buffer = scan->rs_cbuf;
+	BlockNumber block = scan->rs_cblock;
 	Snapshot	snapshot;
 	Page		page;
 	int			lines;
@@ -385,31 +386,8 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	OffsetNumber lineoff;
 	bool		all_visible;
 
-	Assert(block < scan->rs_nblocks);
+	Assert(BufferGetBlockNumber(buffer) == block);
 
-	/* release previous scan buffer, if any */
-	if (BufferIsValid(scan->rs_cbuf))
-	{
-		ReleaseBuffer(scan->rs_cbuf);
-		scan->rs_cbuf = InvalidBuffer;
-	}
-
-	/*
-	 * Be sure to check for interrupts at least once per page.  Checks at
-	 * higher code levels won't be able to stop a seqscan that encounters many
-	 * pages' worth of consecutive dead tuples.
-	 */
-	CHECK_FOR_INTERRUPTS();
-
-	/* read page using selected strategy */
-	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
-									   RBM_NORMAL, scan->rs_strategy);
-	scan->rs_cblock = block;
-
-	if (!(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE))
-		return;
-
-	buffer = scan->rs_cbuf;
 	snapshot = scan->rs_base.rs_snapshot;
 
 	/*
@@ -482,6 +460,37 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	scan->rs_ntuples = ntup;
 }
 
+/*
+ * heapfetchbuf - subroutine for heapgettup()
+ *
+ * This routine reads the specified block of the relation into a buffer and
+ * returns with that pinned buffer saved in the scan descriptor.
+ */
+static inline void
+heapfetchbuf(HeapScanDesc scan, BlockNumber block)
+{
+	Assert(block < scan->rs_nblocks);
+
+	/* release previous scan buffer, if any */
+	if (BufferIsValid(scan->rs_cbuf))
+	{
+		ReleaseBuffer(scan->rs_cbuf);
+		scan->rs_cbuf = InvalidBuffer;
+	}
+
+	/*
+	 * Be sure to check for interrupts at least once per page.  Checks at
+	 * higher code levels won't be able to stop a seqscan that encounters many
+	 * pages' worth of consecutive dead tuples.
+	 */
+	CHECK_FOR_INTERRUPTS();
+
+	/* read page using selected strategy */
+	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
+									   RBM_NORMAL, scan->rs_strategy);
+	scan->rs_cblock = block;
+}
+
 /*
  * heapgettup_initial_block - return the first BlockNumber to scan
  *
@@ -755,7 +764,7 @@ heapgettup(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapfetchbuf(scan, block);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_start_page(scan, dir, &linesleft, &lineoff);
 continue_page:
@@ -876,7 +885,8 @@ heapgettup_pagemode(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapfetchbuf(scan, block);
+		heapbuildvis((TableScanDesc) scan);
 		page = BufferGetPage(scan->rs_cbuf);
 		linesleft = scan->rs_ntuples;
 		lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 680a50bf8b1..8dc137995f0 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2276,11 +2276,14 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 	if (hscan->rs_nblocks == 0)
 		return false;
 
-	if (tsm->NextSampleBlock)
+	if (BufferIsValid(hscan->rs_cbuf))
 	{
-		blockno = tsm->NextSampleBlock(scanstate, hscan->rs_nblocks);
-		hscan->rs_cblock = blockno;
+		ReleaseBuffer(hscan->rs_cbuf);
+		hscan->rs_cbuf = InvalidBuffer;
 	}
+
+	if (tsm->NextSampleBlock)
+		blockno = tsm->NextSampleBlock(scanstate, hscan->rs_nblocks);
 	else
 	{
 		/* scanning table sequentially */
@@ -2322,20 +2325,35 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 		}
 	}
 
-	if (!BlockNumberIsValid(blockno))
+	hscan->rs_cblock = blockno;
+
+	if (!BlockNumberIsValid(hscan->rs_cblock))
 	{
-		if (BufferIsValid(hscan->rs_cbuf))
-			ReleaseBuffer(hscan->rs_cbuf);
-		hscan->rs_cbuf = InvalidBuffer;
-		hscan->rs_cblock = InvalidBlockNumber;
 		hscan->rs_inited = false;
-
 		return false;
 	}
 
-	heapgetpage(scan, blockno);
-	hscan->rs_inited = true;
+	Assert(hscan->rs_cblock < hscan->rs_nblocks);
+
+	/*
+	 * We may scan multiple pages before finding tuples to yield or finishing
+	 * the scan. Since we want to check for interrupts at least once per page,
+	 * do so here.
+	 */
+	CHECK_FOR_INTERRUPTS();
+
+	/* Read page using selected strategy */
+	hscan->rs_cbuf = ReadBufferExtended(hscan->rs_base.rs_rd, MAIN_FORKNUM,
+										hscan->rs_cblock, RBM_NORMAL, hscan->rs_strategy);
 
+	/*
+	 * If pagemode is allowed, prune the page and build an array of visible
+	 * tuple offsets.
+	 */
+	if (hscan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
+		heapbuildvis(scan);
+
+	hscan->rs_inited = true;
 	return true;
 }
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4b133f68593..e2b1b2a3ad9 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -246,7 +246,7 @@ extern TableScanDesc heap_beginscan(Relation relation, Snapshot snapshot,
 									uint32 flags);
 extern void heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk,
 							   BlockNumber numBlks);
-extern void heapgetpage(TableScanDesc sscan, BlockNumber block);
+extern void heapbuildvis(TableScanDesc sscan);
 extern void heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 						bool allow_strat, bool allow_sync, bool allow_pagemode);
 extern void heap_endscan(TableScanDesc sscan);
-- 
2.37.2
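
In short: in page-at-a-time mode, what a single heapgetpage() call used to
do is now spelled as the following pair (a restatement of the new call
sites in the hunk above, not additional code):

    heapfetchbuf(scan, block);          /* pin the block into scan->rs_cbuf */
    heapbuildvis((TableScanDesc) scan); /* prune and record visible tuple offsets */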

v3-0002-Replace-blocks-with-buffers-in-heapgettup-control.patch (text/x-diff; charset=us-ascii)
From 4b6d9059aa32625e91d38f1d414ec514d4073197 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 29 Jan 2024 11:50:01 -0500
Subject: [PATCH v3 2/5] Replace blocks with buffers in heapgettup control flow

Future commits will introduce the streaming read API and the sequential
scan streaming read API user. Streaming read API users implement a
callback which returns the next block to read. Sequential scans
previously looped through the blocks in the relation, synchronously
reading in a block and then processing it. An InvalidBlockNumber
returned by heapgettup_advance_block() meant that the relation was
exhausted and all blocks had been processed.

The streaming read API may exhaust the blocks in a relation (having read
all of them into buffers) before they have all been processed by the
sequential scan. As such, the sequential scan should continue processing
blocks until heapfetchbuf() returns InvalidBuffer.

Note that this commit does not implement the streaming read API user. It
simply restructures heapgettup() and heapgettup_pagemode() to use
buffers instead of blocks for control flow.

Not all sequential scans will support streaming reads. As such, this
code will remain for compatibility even after sequential scans support
streaming reads.
---
 src/backend/access/heap/heapam.c | 79 ++++++++++++++------------------
 1 file changed, 35 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 449221da6ac..e0fe3d9c326 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -87,6 +87,9 @@ static Bitmapset *HeapDetermineColumnsInfo(Relation relation,
 static bool heap_acquire_tuplock(Relation relation, ItemPointer tid,
 								 LockTupleMode mode, LockWaitPolicy wait_policy,
 								 bool *have_tuple_lock);
+static inline BlockNumber heapgettup_advance_block(HeapScanDesc scan,
+												   BlockNumber block, ScanDirection dir);
+static inline BlockNumber heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir);
 static void compute_new_xmax_infomask(TransactionId xmax, uint16 old_infomask,
 									  uint16 old_infomask2, TransactionId add_to_xmax,
 									  LockTupleMode mode, bool is_update,
@@ -463,14 +466,12 @@ heapbuildvis(TableScanDesc sscan)
 /*
  * heapfetchbuf - subroutine for heapgettup()
  *
- * This routine reads the specified block of the relation into a buffer and
- * returns with that pinned buffer saved in the scan descriptor.
+ * This routine reads the next block of the relation into a buffer and returns
+ * with that pinned buffer saved in the scan descriptor.
  */
 static inline void
-heapfetchbuf(HeapScanDesc scan, BlockNumber block)
+heapfetchbuf(HeapScanDesc scan, ScanDirection dir)
 {
-	Assert(block < scan->rs_nblocks);
-
 	/* release previous scan buffer, if any */
 	if (BufferIsValid(scan->rs_cbuf))
 	{
@@ -485,10 +486,19 @@ heapfetchbuf(HeapScanDesc scan, BlockNumber block)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	/* read page using selected strategy */
-	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
-									   RBM_NORMAL, scan->rs_strategy);
-	scan->rs_cblock = block;
+	if (!scan->rs_inited)
+	{
+		scan->rs_cblock = heapgettup_initial_block(scan, dir);
+		Assert(scan->rs_cblock != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock, dir);
+
+	/* read block if valid */
+	if (BlockNumberIsValid(scan->rs_cblock))
+		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
+										   scan->rs_cblock, RBM_NORMAL, scan->rs_strategy);
 }
 
 /*
@@ -498,7 +508,7 @@ heapfetchbuf(HeapScanDesc scan, BlockNumber block)
  * occur with empty tables and in parallel scans when parallel workers get all
  * of the pages before we can get a chance to get our first page.
  */
-static BlockNumber
+BlockNumber
 heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
 {
 	Assert(!scan->rs_inited);
@@ -638,7 +648,7 @@ heapgettup_continue_page(HeapScanDesc scan, ScanDirection dir, int *linesleft,
  * This also adjusts rs_numblocks when a limit has been imposed by
  * heap_setscanlimits().
  */
-static inline BlockNumber
+BlockNumber
 heapgettup_advance_block(HeapScanDesc scan, BlockNumber block, ScanDirection dir)
 {
 	if (ScanDirectionIsForward(dir))
@@ -736,23 +746,13 @@ heapgettup(HeapScanDesc scan,
 		   ScanKey key)
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
-	BlockNumber block;
 	Page		page;
 	OffsetNumber lineoff;
 	int			linesleft;
 
-	if (unlikely(!scan->rs_inited))
-	{
-		block = heapgettup_initial_block(scan, dir);
-		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
-		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
-		scan->rs_inited = true;
-	}
-	else
+	if (scan->rs_inited)
 	{
 		/* continue from previously returned page/tuple */
-		block = scan->rs_cblock;
-
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_continue_page(scan, dir, &linesleft, &lineoff);
 		goto continue_page;
@@ -762,9 +762,12 @@ heapgettup(HeapScanDesc scan,
 	 * advance the scan until we find a qualifying tuple or run out of stuff
 	 * to scan
 	 */
-	while (block != InvalidBlockNumber)
+	while (true)
 	{
-		heapfetchbuf(scan, block);
+		heapfetchbuf(scan, dir);
+		if (!BufferIsValid(scan->rs_cbuf))
+			break;
+		Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_start_page(scan, dir, &linesleft, &lineoff);
 continue_page:
@@ -786,7 +789,7 @@ continue_page:
 
 			tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
 			tuple->t_len = ItemIdGetLength(lpp);
-			ItemPointerSet(&(tuple->t_self), block, lineoff);
+			ItemPointerSet(&(tuple->t_self), scan->rs_cblock, lineoff);
 
 			visible = HeapTupleSatisfiesVisibility(tuple,
 												   scan->rs_base.rs_snapshot,
@@ -816,9 +819,6 @@ continue_page:
 		 * it's time to move to the next.
 		 */
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
 	}
 
 	/* end of scan */
@@ -851,22 +851,13 @@ heapgettup_pagemode(HeapScanDesc scan,
 					ScanKey key)
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
-	BlockNumber block;
 	Page		page;
 	int			lineindex;
 	int			linesleft;
 
-	if (unlikely(!scan->rs_inited))
-	{
-		block = heapgettup_initial_block(scan, dir);
-		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
-		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
-		scan->rs_inited = true;
-	}
-	else
+	if (scan->rs_inited)
 	{
 		/* continue from previously returned page/tuple */
-		block = scan->rs_cblock;	/* current page */
 		page = BufferGetPage(scan->rs_cbuf);
 
 		lineindex = scan->rs_cindex + dir;
@@ -883,9 +874,12 @@ heapgettup_pagemode(HeapScanDesc scan,
 	 * advance the scan until we find a qualifying tuple or run out of stuff
 	 * to scan
 	 */
-	while (block != InvalidBlockNumber)
+	while (true)
 	{
-		heapfetchbuf(scan, block);
+		heapfetchbuf(scan, dir);
+		if (!BufferIsValid(scan->rs_cbuf))
+			break;
+		Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
 		heapbuildvis((TableScanDesc) scan);
 		page = BufferGetPage(scan->rs_cbuf);
 		linesleft = scan->rs_ntuples;
@@ -905,7 +899,7 @@ continue_page:
 
 			tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
 			tuple->t_len = ItemIdGetLength(lpp);
-			ItemPointerSet(&(tuple->t_self), block, lineoff);
+			ItemPointerSet(&(tuple->t_self), scan->rs_cblock, lineoff);
 
 			/* skip any tuples that don't match the scan key */
 			if (key != NULL &&
@@ -916,9 +910,6 @@ continue_page:
 			scan->rs_cindex = lineindex;
 			return;
 		}
-
-		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
 	}
 
 	/* end of scan */
-- 
2.37.2

v3-0003-Streaming-Read-API.patch (text/x-diff; charset=us-ascii)
From ca3b34481f5d1404ff0fdb60c65ec0b7befd86fb Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v3 3/5] Streaming Read API

---
 contrib/pg_prewarm/pg_prewarm.c          |  40 +-
 src/backend/storage/Makefile             |   2 +-
 src/backend/storage/aio/Makefile         |  14 +
 src/backend/storage/aio/meson.build      |   5 +
 src/backend/storage/aio/streaming_read.c | 612 ++++++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c      | 641 ++++++++++++++++-------
 src/backend/storage/buffer/localbuf.c    |  14 +-
 src/backend/storage/meson.build          |   1 +
 src/include/storage/bufmgr.h             |  45 ++
 src/include/storage/streaming_read.h     |  52 ++
 src/tools/pgindent/typedefs.list         |   3 +
 11 files changed, 1218 insertions(+), 211 deletions(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c
 create mode 100644 src/include/storage/streaming_read.h

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e46..1cc84bcb0c2 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
+#include "storage/streaming_read.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -38,6 +39,25 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_streaming_read_private
+{
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static BlockNumber
+pg_prewarm_streaming_read_next(PgStreamingRead *pgsr,
+							   void *pgsr_private,
+							   void *per_buffer_data)
+{
+	struct pg_prewarm_streaming_read_private *p = pgsr_private;
+
+	if (p->blocknum <= p->last_block)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_streaming_read_private p;
+		PgStreamingRead *pgsr;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_FULL,
+											  &p,
+											  0,
+											  NULL,
+											  BMR_REL(rel),
+											  forkNumber,
+											  pg_prewarm_streaming_read_next);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(pg_streaming_read_buffer_get_next(pgsr, NULL) == InvalidBuffer);
+		pg_streaming_read_free(pgsr);
 	}
 
 	/* Close relation, release lock. */
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 00000000000..bcab44c802f
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 00000000000..39aef2a84a2
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 00000000000..71f2c4a70b6
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,612 @@
+#include "postgres.h"
+
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+
+/*
+ * Element type for PgStreamingRead's circular array of block ranges.
+ */
+typedef struct PgStreamingReadRange
+{
+	bool		need_wait;
+	bool		advice_issued;
+	BlockNumber blocknum;
+	int			nblocks;
+	int			per_buffer_data_index;
+	Buffer		buffers[MAX_BUFFERS_PER_TRANSFER];
+	ReadBuffersOperation operation;
+} PgStreamingReadRange;
+
+/*
+ * Streaming read object.
+ */
+struct PgStreamingRead
+{
+	int			max_ios;
+	int			ios_in_progress;
+	int			max_pinned_buffers;
+	int			pinned_buffers;
+	int			pinned_buffers_trigger;
+	int			next_tail_buffer;
+	int			ramp_up_pin_limit;
+	int			ramp_up_pin_stall;
+	bool		finished;
+	bool		advice_enabled;
+	void	   *pgsr_private;
+	PgStreamingReadBufferCB callback;
+
+	BufferAccessStrategy strategy;
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+
+	/* Sometimes we need to buffer one block for flow control. */
+	BlockNumber unget_blocknum;
+	void	   *unget_per_buffer_data;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* Space for optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+
+	/* Circular buffer of ranges. */
+	int			size;
+	int			head;
+	int			tail;
+	PgStreamingReadRange ranges[FLEXIBLE_ARRAY_MEMBER];
+};
+
+static PgStreamingRead *
+pg_streaming_read_buffer_alloc_internal(int flags,
+										void *pgsr_private,
+										size_t per_buffer_data_size,
+										BufferAccessStrategy strategy)
+{
+	PgStreamingRead *pgsr;
+	int			size;
+	int			max_ios;
+	uint32		max_pinned_buffers;
+
+
+	/*
+	 * Decide how many assumed I/Os we will allow to run concurrently.  That
+	 * is, advice to the kernel to tell it that we will soon read.  This
+	 * number also affects how far we look ahead for opportunities to start
+	 * more I/Os.
+	 */
+	if (flags & PGSR_FLAG_MAINTENANCE)
+		max_ios = maintenance_io_concurrency;
+	else
+		max_ios = effective_io_concurrency;
+
+	/*
+	 * The desired level of I/O concurrency controls how far ahead we are
+	 * willing to look ahead.  We also clamp it to at least
+	 * MAX_BUFFERS_PER_TRANSFER so that we can have a chance to build up a full
+	 * sized read, even when max_ios is zero.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, MAX_BUFFERS_PER_TRANSFER);
+
+	/*
+	 * The *_io_concurrency GUCs might be set to 0, but we want to allow at
+	 * least one, to keep our gating logic simple.
+	 */
+	max_ios = Max(max_ios, 1);
+
+	/*
+	 * Don't allow this backend to pin too many buffers.  For now we'll apply
+	 * the limit for the shared buffer pool and the local buffer pool, without
+	 * worrying which it is.
+	 */
+	LimitAdditionalPins(&max_pinned_buffers);
+	LimitAdditionalLocalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * pgsr->ranges is a circular buffer.  When it is empty, head == tail.
+	 * When it is full, there is an empty element between head and tail.  Head
+	 * can also be empty (nblocks == 0), therefore we need two extra elements
+	 * for non-occupied ranges, on top of max_pinned_buffers to allow for the
+	 * maximum possible number of occupied ranges of the smallest possible
+	 * size of one.
+	 */
+	size = max_pinned_buffers + 2;
+
+	pgsr = (PgStreamingRead *)
+		palloc0(offsetof(PgStreamingRead, ranges) +
+				sizeof(pgsr->ranges[0]) * size);
+
+	pgsr->max_ios = max_ios;
+	pgsr->per_buffer_data_size = per_buffer_data_size;
+	pgsr->max_pinned_buffers = max_pinned_buffers;
+	pgsr->pgsr_private = pgsr_private;
+	pgsr->strategy = strategy;
+	pgsr->size = size;
+
+	pgsr->unget_blocknum = InvalidBlockNumber;
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  As long as direct I/O isn't
+	 * enabled, and the caller hasn't promised sequential access, we can use
+	 * it.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & PGSR_FLAG_SEQUENTIAL) == 0)
+		pgsr->advice_enabled = true;
+#endif
+
+	/*
+	 * We start off building small ranges, but double that quickly, for the
+	 * benefit of users that don't know how far ahead they'll read.  This can
+	 * be disabled by users that already know they'll read all the way.
+	 */
+	if (flags & PGSR_FLAG_FULL)
+		pgsr->ramp_up_pin_limit = INT_MAX;
+	else
+		pgsr->ramp_up_pin_limit = 1;
+
+	/*
+	 * We want to avoid creating ranges that are smaller than they could be
+	 * just because we hit max_pinned_buffers.  We only look ahead when the
+	 * number of pinned buffers falls below this trigger number, or put
+	 * another way, we stop looking ahead when we wouldn't be able to build a
+	 * "full sized" range.
+	 */
+	pgsr->pinned_buffers_trigger =
+		Max(1, (int) max_pinned_buffers - MAX_BUFFERS_PER_TRANSFER);
+
+	/* Space for the callback to store extra data along with each block. */
+	if (per_buffer_data_size)
+		pgsr->per_buffer_data = palloc(per_buffer_data_size * max_pinned_buffers);
+
+	return pgsr;
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.
+ */
+PgStreamingRead *
+pg_streaming_read_buffer_alloc(int flags,
+							   void *pgsr_private,
+							   size_t per_buffer_data_size,
+							   BufferAccessStrategy strategy,
+							   BufferManagerRelation bmr,
+							   ForkNumber forknum,
+							   PgStreamingReadBufferCB next_block_cb)
+{
+	PgStreamingRead *result;
+
+	result = pg_streaming_read_buffer_alloc_internal(flags,
+													 pgsr_private,
+													 per_buffer_data_size,
+													 strategy);
+	result->callback = next_block_cb;
+	result->bmr = bmr;
+	result->forknum = forknum;
+
+	return result;
+}
+
+/*
+ * Find the per-buffer data index for the Nth block of a range.
+ */
+static int
+get_per_buffer_data_index(PgStreamingRead *pgsr, PgStreamingReadRange *range, int n)
+{
+	int			result;
+
+	/*
+	 * Find slot in the circular buffer of per-buffer data, without using the
+	 * expensive % operator.
+	 */
+	result = range->per_buffer_data_index + n;
+	if (result >= pgsr->max_pinned_buffers)
+		result -= pgsr->max_pinned_buffers;
+	Assert(result == (range->per_buffer_data_index + n) % pgsr->max_pinned_buffers);
+
+	return result;
+}
+
+/*
+ * Return a pointer to the per-buffer data by index.
+ */
+static void *
+get_per_buffer_data_by_index(PgStreamingRead *pgsr, int per_buffer_data_index)
+{
+	return (char *) pgsr->per_buffer_data +
+		pgsr->per_buffer_data_size * per_buffer_data_index;
+}
+
+/*
+ * Return a pointer to the per-buffer data for the Nth block of a range.
+ */
+static void *
+get_per_buffer_data(PgStreamingRead *pgsr, PgStreamingReadRange *range, int n)
+{
+	return get_per_buffer_data_by_index(pgsr,
+										get_per_buffer_data_index(pgsr,
+																  range,
+																  n));
+}
+
+/*
+ * Start reading the head range, and create a new head range.  The new head
+ * range is returned.  It may not be empty, if StartReadBuffers() couldn't
+ * start the entire range; in that case the returned range contains the
+ * remaining portion of the range.
+ */
+static PgStreamingReadRange *
+pg_streaming_read_start_head_range(PgStreamingRead *pgsr)
+{
+	PgStreamingReadRange *head_range;
+	PgStreamingReadRange *new_head_range;
+	int			nblocks_pinned;
+	int			flags;
+
+	/* Caller should make sure we never exceed max_ios. */
+	Assert(pgsr->ios_in_progress < pgsr->max_ios);
+
+	/* Should only call if the head range has some blocks to read. */
+	head_range = &pgsr->ranges[pgsr->head];
+	Assert(head_range->nblocks > 0);
+
+	/*
+	 * If advice hasn't been suppressed, this system supports it, and this
+	 * isn't a strictly sequential pattern, then we'll issue advice.
+	 */
+	if (pgsr->advice_enabled && head_range->blocknum != pgsr->seq_blocknum)
+		flags = READ_BUFFERS_ISSUE_ADVICE;
+	else
+		flags = 0;
+
+
+	/* Start reading as many blocks as we can from the head range. */
+	nblocks_pinned = head_range->nblocks;
+	head_range->need_wait =
+		StartReadBuffers(pgsr->bmr,
+						 head_range->buffers,
+						 pgsr->forknum,
+						 head_range->blocknum,
+						 &nblocks_pinned,
+						 pgsr->strategy,
+						 flags,
+						 &head_range->operation);
+
+	/* Did that start an I/O? */
+	if (head_range->need_wait && (flags & READ_BUFFERS_ISSUE_ADVICE))
+	{
+		head_range->advice_issued = true;
+		pgsr->ios_in_progress++;
+		Assert(pgsr->ios_in_progress <= pgsr->max_ios);
+	}
+
+	/*
+	 * StartReadBuffers() might have pinned fewer blocks than we asked it to,
+	 * but always at least one.
+	 */
+	Assert(nblocks_pinned <= head_range->nblocks);
+	Assert(nblocks_pinned >= 1);
+	pgsr->pinned_buffers += nblocks_pinned;
+
+	/*
+	 * Remember where the next block would be after that, so we can detect
+	 * sequential access next time.
+	 */
+	pgsr->seq_blocknum = head_range->blocknum + nblocks_pinned;
+
+	/*
+	 * Create a new head range.  There must be space, because we have enough
+	 * elements for every range to hold just one block, up to the pin limit.
+	 */
+	Assert(pgsr->size > pgsr->max_pinned_buffers);
+	Assert((pgsr->head + 1) % pgsr->size != pgsr->tail);
+	if (++pgsr->head == pgsr->size)
+		pgsr->head = 0;
+	new_head_range = &pgsr->ranges[pgsr->head];
+	new_head_range->nblocks = 0;
+	new_head_range->advice_issued = false;
+
+	/*
+	 * If we didn't manage to start the whole read above, we split the range,
+	 * moving the remainder into the new head range.
+	 */
+	if (nblocks_pinned < head_range->nblocks)
+	{
+		int			nblocks_remaining = head_range->nblocks - nblocks_pinned;
+
+		head_range->nblocks = nblocks_pinned;
+
+		new_head_range->blocknum = head_range->blocknum + nblocks_pinned;
+		new_head_range->nblocks = nblocks_remaining;
+	}
+
+	/* The new range has per-buffer data starting after the previous range. */
+	new_head_range->per_buffer_data_index =
+		get_per_buffer_data_index(pgsr, head_range, nblocks_pinned);
+
+	return new_head_range;
+}
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow pg_streaming_unget_block() to work.
+ */
+static BlockNumber
+pg_streaming_get_block(PgStreamingRead *pgsr, void *per_buffer_data)
+{
+	BlockNumber result;
+
+	if (unlikely(pgsr->unget_blocknum != InvalidBlockNumber))
+	{
+		/*
+		 * If we had to unget a block, now it is time to return that one
+		 * again.
+		 */
+		result = pgsr->unget_blocknum;
+		pgsr->unget_blocknum = InvalidBlockNumber;
+
+		/*
+		 * The same per_buffer_data element must have been used, and still
+		 * contains whatever data the callback wrote into it.  So we just
+		 * sanity-check that we were called with the value that
+		 * pg_streaming_unget_block() pushed back.
+		 */
+		Assert(per_buffer_data == pgsr->unget_per_buffer_data);
+	}
+	else
+	{
+		/* Use the installed callback directly. */
+		result = pgsr->callback(pgsr, pgsr->pgsr_private, per_buffer_data);
+	}
+
+	return result;
+}
+
+/*
+ * In order to deal with short reads in StartReadBuffers(), we sometimes need
+ * to defer handling of a block until later.  This *must* be called with the
+ * last value returned by pg_streaming_get_block().
+ */
+static void
+pg_streaming_unget_block(PgStreamingRead *pgsr, BlockNumber blocknum, void *per_buffer_data)
+{
+	Assert(pgsr->unget_blocknum == InvalidBlockNumber);
+	pgsr->unget_blocknum = blocknum;
+	pgsr->unget_per_buffer_data = per_buffer_data;
+}
+
+static void
+pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
+{
+	PgStreamingReadRange *range;
+
+	/*
+	 * If we're still ramping up, we may have to stall to wait for buffers to
+	 * be consumed first before we do any more prefetching.
+	 */
+	if (pgsr->ramp_up_pin_stall > 0)
+	{
+		Assert(pgsr->pinned_buffers > 0);
+		return;
+	}
+
+	/*
+	 * If we're finished or can't start more I/O, then don't look ahead.
+	 */
+	if (pgsr->finished || pgsr->ios_in_progress == pgsr->max_ios)
+		return;
+
+	/*
+	 * We'll also wait until the number of pinned buffers falls below our
+	 * trigger level, so that we have the chance to create a full range.
+	 */
+	if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+		return;
+
+	do
+	{
+		BlockNumber blocknum;
+		void	   *per_buffer_data;
+
+		/* Do we have a full-sized range? */
+		range = &pgsr->ranges[pgsr->head];
+		if (range->nblocks == lengthof(range->buffers))
+		{
+			/* Start as much of it as we can. */
+			range = pg_streaming_read_start_head_range(pgsr);
+
+			/* If we're now at the I/O limit, stop here. */
+			if (pgsr->ios_in_progress == pgsr->max_ios)
+				return;
+
+			/*
+			 * If we couldn't form a full range, then stop here to avoid
+			 * creating small I/O.
+			 */
+			if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+				return;
+
+			/*
+			 * That might have only been partially started, but always
+			 * processes at least one so that'll do for now.
+			 */
+			Assert(range->nblocks < lengthof(range->buffers));
+		}
+
+		/* Find per-buffer data slot for the next block. */
+		per_buffer_data = get_per_buffer_data(pgsr, range, range->nblocks);
+
+		/* Find out which block the callback wants to read next. */
+		blocknum = pg_streaming_get_block(pgsr, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			/* End of stream. */
+			pgsr->finished = true;
+			break;
+		}
+
+		/*
+		 * Is there a head range that we cannot extend, because the requested
+		 * block is not consecutive?
+		 */
+		if (range->nblocks > 0 &&
+			range->blocknum + range->nblocks != blocknum)
+		{
+			/* Yes.  Start it, so we can begin building a new one. */
+			range = pg_streaming_read_start_head_range(pgsr);
+
+			/*
+			 * It's possible that it was only partially started, and we have a
+			 * new range with the remainder.  Keep starting I/Os until we get
+			 * it all out of the way, or we hit the I/O limit.
+			 */
+			while (range->nblocks > 0 && pgsr->ios_in_progress < pgsr->max_ios)
+				range = pg_streaming_read_start_head_range(pgsr);
+
+			/*
+			 * We have to 'unget' the block returned by the callback if we
+			 * don't have enough I/O capacity left to start something.
+			 */
+			if (pgsr->ios_in_progress == pgsr->max_ios)
+			{
+				pg_streaming_unget_block(pgsr, blocknum, per_buffer_data);
+				return;
+			}
+		}
+
+		/* If we have a new, empty range, initialize the start block. */
+		if (range->nblocks == 0)
+		{
+			range->blocknum = blocknum;
+		}
+
+		/* This block extends the range by one. */
+		Assert(range->blocknum + range->nblocks == blocknum);
+		range->nblocks++;
+
+	} while (pgsr->pinned_buffers + range->nblocks < pgsr->max_pinned_buffers &&
+			 pgsr->pinned_buffers + range->nblocks < pgsr->ramp_up_pin_limit);
+
+	/* If we've hit the ramp-up limit, insert a stall. */
+	if (pgsr->pinned_buffers + range->nblocks >= pgsr->ramp_up_pin_limit)
+	{
+		/* Can't get here if an earlier stall hasn't finished. */
+		Assert(pgsr->ramp_up_pin_stall == 0);
+		/* Don't do any more prefetching until these buffers are consumed. */
+		pgsr->ramp_up_pin_stall = pgsr->ramp_up_pin_limit;
+		/* Double it.  It will soon be out of the way. */
+		pgsr->ramp_up_pin_limit *= 2;
+	}
+
+	/* Start as much as we can. */
+	while (range->nblocks > 0)
+	{
+		range = pg_streaming_read_start_head_range(pgsr);
+		if (pgsr->ios_in_progress == pgsr->max_ios)
+			break;
+	}
+}
+
+Buffer
+pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
+{
+	pg_streaming_read_look_ahead(pgsr);
+
+	/* See if we have one buffer to return. */
+	while (pgsr->tail != pgsr->head)
+	{
+		PgStreamingReadRange *tail_range;
+
+		tail_range = &pgsr->ranges[pgsr->tail];
+
+		/*
+		 * Do we need to perform an I/O before returning the buffers from this
+		 * range?
+		 */
+		if (tail_range->need_wait)
+		{
+			WaitReadBuffers(&tail_range->operation);
+			tail_range->need_wait = false;
+
+			/*
+			 * We don't really know if the kernel generated a physical I/O
+			 * when we issued advice, let alone when it finished, but it has
+			 * certainly finished now because we've performed the read.
+			 */
+			if (tail_range->advice_issued)
+			{
+				Assert(pgsr->ios_in_progress > 0);
+				pgsr->ios_in_progress--;
+			}
+		}
+
+		/* Are there more buffers available in this range? */
+		if (pgsr->next_tail_buffer < tail_range->nblocks)
+		{
+			int			buffer_index;
+			Buffer		buffer;
+
+			buffer_index = pgsr->next_tail_buffer++;
+			buffer = tail_range->buffers[buffer_index];
+
+			Assert(BufferIsValid(buffer));
+
+			/* We are giving away ownership of this pinned buffer. */
+			Assert(pgsr->pinned_buffers > 0);
+			pgsr->pinned_buffers--;
+
+			if (pgsr->ramp_up_pin_stall > 0)
+				pgsr->ramp_up_pin_stall--;
+
+			if (per_buffer_data)
+				*per_buffer_data = get_per_buffer_data(pgsr, tail_range, buffer_index);
+
+			return buffer;
+		}
+
+		/* Advance tail to next range, if there is one. */
+		if (++pgsr->tail == pgsr->size)
+			pgsr->tail = 0;
+		pgsr->next_tail_buffer = 0;
+
+		/*
+		 * If tail crashed into head, and head is not empty, then it is time
+		 * to start that range.
+		 */
+		if (pgsr->tail == pgsr->head &&
+			pgsr->ranges[pgsr->head].nblocks > 0)
+			pg_streaming_read_start_head_range(pgsr);
+	}
+
+	Assert(pgsr->pinned_buffers == 0);
+
+	return InvalidBuffer;
+}
+
+void
+pg_streaming_read_free(PgStreamingRead *pgsr)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead. */
+	pgsr->finished = true;
+
+	/* Unpin anything that wasn't consumed. */
+	while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	Assert(pgsr->pinned_buffers == 0);
+	Assert(pgsr->ios_in_progress == 0);
+
+	/* Release memory. */
+	if (pgsr->per_buffer_data)
+		pfree(pgsr->per_buffer_data);
+
+	pfree(pgsr);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index bdf89bbc4dc..3b1b0ad99df 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,11 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffers() -- as above, but for multiple contiguous blocks in
+ *		two steps.
+ *
+ * WaitReadBuffers() -- second step of StartReadBuffers().
+ *
  * ReleaseBuffer() -- unpin a buffer
  *
  * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -472,10 +477,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
-								ReadBufferMode mode, BufferAccessStrategy strategy,
-								bool *hit);
+								ReadBufferMode mode, BufferAccessStrategy strategy);
 static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
 										   ForkNumber fork,
 										   BufferAccessStrategy strategy,
@@ -501,7 +505,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -782,7 +786,6 @@ Buffer
 ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				   ReadBufferMode mode, BufferAccessStrategy strategy)
 {
-	bool		hit;
 	Buffer		buf;
 
 	/*
@@ -795,15 +798,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
-							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+	buf = ReadBuffer_common(BMR_REL(reln),
+							forkNum, blockNum, mode, strategy);
+
 	return buf;
 }
 
@@ -823,13 +820,12 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
 						  BufferAccessStrategy strategy, bool permanent)
 {
-	bool		hit;
-
 	SMgrRelation smgr = smgropen(rlocator, InvalidBackendId);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
-							 mode, strategy, &hit);
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
+							 mode, strategy);
 }
 
 /*
@@ -995,35 +991,68 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	 */
 	if (buffer == InvalidBuffer)
 	{
-		bool		hit;
-
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
-								   fork, extend_to - 1, mode, strategy,
-								   &hit);
+		buffer = ReadBuffer_common(bmr, fork, extend_to - 1, mode, strategy);
 	}
 
 	return buffer;
 }
 
+/*
+ * Zero a buffer and lock it, as part of the implementation of
+ * RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK.  The buffer must be already
+ * pinned.  It does not have to be valid, but it is valid and locked on
+ * return.
+ */
+static void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
 /*
  * ReadBuffer_common -- common logic for all ReadBuffer variants
  *
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
 static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
-				  BufferAccessStrategy strategy, bool *hit)
+				  BufferAccessStrategy strategy)
 {
-	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
-	IOContext	io_context;
-	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
-
-	*hit = false;
+	ReadBuffersOperation operation;
+	Buffer		buffer;
+	int			nblocks;
+	int			flags;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1042,181 +1071,404 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
+	nblocks = 1;
+	if (mode == RBM_ZERO_ON_ERROR)
+		flags = READ_BUFFERS_ZERO_ON_ERROR;
+	else
+		flags = 0;
+	if (StartReadBuffers(bmr,
+						 &buffer,
+						 forkNum,
+						 blockNum,
+						 &nblocks,
+						 strategy,
+						 flags,
+						 &operation))
+		WaitReadBuffers(&operation);
+	Assert(nblocks == 1);		/* single block can't be short */
+
+	if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+		ZeroBuffer(buffer, mode);
+
+	return buffer;
+}
+
+static Buffer
+PrepareReadBuffer(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	Assert(blockNum != P_NEW);
 
+	Assert(bmr.smgr);
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
 	}
 	else
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
 		io_context = IOContextForStrategy(strategy);
 		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
 	}
 
-	/* At this point we do NOT hold any locks. */
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
 
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * WaitReadBuffers() (so, not for hits, and not for buffers that are
+		 * zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
 		VacuumPageHit++;
 		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageHit;
 
 		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
+	}
 
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
-		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
-		}
+	return BufferDescriptorGetBuffer(bufHdr);
+}
 
-		return BufferDescriptorGetBuffer(bufHdr);
+/*
+ * Begin reading a range of blocks beginning at blockNum and extending for
+ * *nblocks.  On return, up to *nblocks pinned buffers holding those blocks
+ * are written into the buffers array, and *nblocks is updated to contain the
+ * actual number, which may be fewer than requested.
+ *
+ * If false is returned, no I/O is necessary and WaitReadBuffers() need not
+ * be called.  If true is returned, one I/O has been started, and
+ * WaitReadBuffers() must be called with the same operation object before the
+ * buffers are accessed.  Along with the operation object, the caller-supplied
+ * array of buffers must remain valid until WaitReadBuffers() is called.
+ *
+ * Currently the I/O is only started with optional operating system advice,
+ * and the real I/O happens in WaitReadBuffers().  In future work, true I/O
+ * could be initiated here.
+ */
+bool
+StartReadBuffers(BufferManagerRelation bmr,
+				 Buffer *buffers,
+				 ForkNumber forkNum,
+				 BlockNumber blockNum,
+				 int *nblocks,
+				 BufferAccessStrategy strategy,
+				 int flags,
+				 ReadBuffersOperation *operation)
+{
+	int			actual_nblocks = *nblocks;
+
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
 	}
 
-	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
-	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+	operation->bmr = bmr;
+	operation->forknum = forkNum;
+	operation->blocknum = blockNum;
+	operation->buffers = buffers;
+	operation->nblocks = actual_nblocks;
+	operation->strategy = strategy;
+	operation->flags = flags;
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	operation->io_buffers_len = 0;
 
-	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
-	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
-	else
+	for (int i = 0; i < actual_nblocks; ++i)
 	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		bool		found;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+		buffers[i] = PrepareReadBuffer(bmr,
+									   forkNum,
+									   blockNum + i,
+									   strategy,
+									   &found);
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		if (found)
+		{
+			/*
+			 * Terminate the read as soon as we get a hit.  It could be a
+			 * single buffer hit, or it could be a hit that follows a readable
+			 * range.  We don't want to create more than one readable range,
+			 * so we stop here.
+			 */
+			actual_nblocks = operation->nblocks = *nblocks = i + 1;
+		}
+		else
+		{
+			/* Extend the readable range to cover this block. */
+			operation->io_buffers_len++;
+		}
+	}
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+	if (operation->io_buffers_len > 0)
+	{
+		if (flags & READ_BUFFERS_ISSUE_ADVICE)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
-			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
-			}
-			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
+			/*
+			 * In theory we should only do this if PrepareReadBuffer() had to
+			 * allocate new buffers above.  That way, if two calls to
+			 * StartReadBuffers() were made for the same blocks before
+			 * WaitReadBuffers(), only the first would issue the advice.
+			 * That'd be a better simulation of true asynchronous I/O, which
+			 * would only start the I/O once, but isn't done here for
+			 * simplicity.  Note also that the following call might actually
+			 * issue two advice calls if we cross a segment boundary; in a
+			 * true asynchronous version we might choose to process only one
+			 * real I/O at a time in that case.
+			 */
+			smgrprefetch(bmr.smgr, forkNum, blockNum, operation->io_buffers_len);
 		}
+
+		/* Indicate that WaitReadBuffers() should be called. */
+		return true;
 	}
+	else
+	{
+		return false;
+	}
+}
 
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
+static inline bool
+WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
 	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
 	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
+
+void
+WaitReadBuffers(ReadBuffersOperation *operation)
+{
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	int			nblocks;
+	BlockNumber blocknum;
+	ForkNumber	forknum;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	/*
+	 * Currently operations are only allowed to include a read of some range,
+	 * with an optional extra buffer that is already pinned at the end.  So
+	 * nblocks can be at most one more than io_buffers_len.
+	 */
+	Assert((operation->nblocks == operation->io_buffers_len) ||
+		   (operation->nblocks == operation->io_buffers_len + 1));
 
+	/* Find the range of the physical read we need to perform. */
+	nblocks = operation->io_buffers_len;
+	if (nblocks == 0)
+		return;					/* nothing to do */
+
+	buffers = &operation->buffers[0];
+	blocknum = operation->blocknum;
+	forknum = operation->forknum;
+	bmr = operation->bmr;
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
-
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
 	}
 	else
 	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
+		io_context = IOContextForStrategy(operation->strategy);
+		io_object = IOOBJECT_RELATION;
 	}
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+	/*
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
+	 */
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
+	else
+		pgBufferUsage.shared_blks_read += nblocks;
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+	for (int i = 0; i < nblocks; ++i)
+	{
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_BUFFERS_PER_TRANSFER];
+		void	   *io_pages[MAX_BUFFERS_PER_TRANSFER];
+		instr_time	io_start;
+		BlockNumber io_first_block;
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!WaitReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PrepareReadBuffer().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
+
+		/*
+		 * How many neighboring-on-disk blocks can we scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   WaitReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
+
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
+
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
+		{
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
+			{
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
+			}
+			else
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
+
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+			}
+
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
+
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
+
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
- * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
- *		buffer.  If no buffer exists already, selects a replacement
- *		victim and evicts the old page, but does NOT read in new page.
+ * BufferAlloc -- subroutine for StartReadBuffers.  Handles lookup of a shared
+ *		buffer.  If no buffer exists already, selects a replacement victim and
+ *		evicts the old page, but does NOT read in new page.
  *
  * "strategy" can be a buffer replacement strategy object, or NULL for
  * the default strategy.  The selected buffer's usage_count is advanced when
@@ -1224,11 +1476,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1287,19 +1535,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1364,19 +1603,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1408,15 +1638,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1770,7 +1994,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2035,7 +2259,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2058,7 +2282,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2373,7 +2597,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if StartReadBuffers() was called and
+		 * WaitReadBuffers() hasn't been called yet.  We'll check by loading
+		 * the flags without locking.  This is racy, but it's OK to return
+		 * false spuriously: when WaitReadBuffers() calls StartBufferIO(),
+		 * it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2382,7 +2611,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3450,7 +3679,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -5185,9 +5414,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5200,6 +5435,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 1f02fed250e..6956d4e5b49 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -109,10 +109,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -288,7 +287,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -298,9 +297,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but NLocBuffer might
+	 * not be initialized yet, so read num_temp_buffers instead.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..b57f71f97e3 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
+/*
+ * Maximum number of buffers for multi-buffer I/O functions.  This is set to
+ * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a smaller maximum.
+ */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
 
 /*
  * prototypes for functions in bufmgr.c
@@ -177,6 +183,42 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+/*
+ * Private state used by StartReadBuffers() and WaitReadBuffers().  Declared
+ * in public header only to allow inclusion in other structs, but contents
+ * should not be accessed.
+ */
+struct ReadBuffersOperation
+{
+	/* Parameters passed in to StartReadBuffers(). */
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	ForkNumber	forknum;
+	BlockNumber blocknum;
+	int			nblocks;
+	BufferAccessStrategy strategy;
+	int			flags;
+
+	/* Range of buffers, if we need to perform a read. */
+	int			io_buffers_len;
+};
+
+typedef struct ReadBuffersOperation ReadBuffersOperation;
+
+extern bool StartReadBuffers(BufferManagerRelation bmr,
+							 Buffer *buffers,
+							 ForkNumber forknum,
+							 BlockNumber blocknum,
+							 int *nblocks,
+							 BufferAccessStrategy strategy,
+							 int flags,
+							 ReadBuffersOperation *operation);
+extern void WaitReadBuffers(ReadBuffersOperation *operation);
+
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -250,6 +292,9 @@ extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 00000000000..c4d3892bb26
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,52 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define PGSR_FLAG_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users.
+ */
+#define PGSR_FLAG_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define PGSR_FLAG_SEQUENTIAL 0x02
+
+/*
+ * We usually ramp up from smaller reads to larger ones, to support users who
+ * don't know if it's worth reading lots of buffers yet.  This flag disables
+ * that, declaring ahead of time that we'll be reading all available buffers.
+ */
+#define PGSR_FLAG_FULL 0x04
+
+struct PgStreamingRead;
+typedef struct PgStreamingRead PgStreamingRead;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*PgStreamingReadBufferCB) (PgStreamingRead *pgsr,
+												void *pgsr_private,
+												void *per_buffer_private);
+
+extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags,
+													   void *pgsr_private,
+													   size_t per_buffer_private_size,
+													   BufferAccessStrategy strategy,
+													   BufferManagerRelation bmr,
+													   ForkNumber forknum,
+													   PgStreamingReadBufferCB next_block_cb);
+
+extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
+extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
+extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+
+#endif
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fc8b15d0cf2..cfb58cf4836 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2097,6 +2097,8 @@ PgStat_TableCounts
 PgStat_TableStatus
 PgStat_TableXactStatus
 PgStat_WalStats
+PgStreamingRead
+PgStreamingReadRange
 PgXmlErrorContext
 PgXmlStrictness
 Pg_finfo_record
@@ -2267,6 +2269,7 @@ ReInitializeDSMForeignScan_function
 ReScanForeignScan_function
 ReadBufPtrType
 ReadBufferMode
+ReadBuffersOperation
 ReadBytePtrType
 ReadExtraTocPtrType
 ReadFunc
-- 
2.37.2
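
For readers skimming the bufmgr changes, the calling convention of the new
StartReadBuffers()/WaitReadBuffers() pair boils down to roughly the following
sketch of a hypothetical caller ("rel" and "blkno" are placeholders, not names
from the patch, and the caller is assumed not to request blocks past the end
of the relation):

	Buffer		buffers[MAX_BUFFERS_PER_TRANSFER];
	ReadBuffersOperation op;
	int			nblocks = MAX_BUFFERS_PER_TRANSFER;

	/* Pin up to nblocks consecutive blocks of rel starting at blkno. */
	if (StartReadBuffers(BMR_REL(rel), buffers, MAIN_FORKNUM, blkno,
						 &nblocks, NULL, 0, &op))
		WaitReadBuffers(&op);	/* performs the actual read for the misses */

	/* StartReadBuffers() shrank nblocks to the number of buffers pinned. */
	for (int i = 0; i < nblocks; i++)
		ReleaseBuffer(buffers[i]);

A hit anywhere in the requested range just shortens *nblocks (the hit ends the
readable range), so a caller that wants the remaining blocks calls
StartReadBuffers() again for them.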

v3-0004-Add-pg_streaming_read_reset.patch (text/x-diff)
From 614195777b3ee675d74d98953b086e0f8a4a494d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 26 Feb 2024 15:33:39 -0500
Subject: [PATCH v3 4/5] Add pg_streaming_read_reset

For rescan, we want to reuse the streaming read object and simply
release the buffers that were pinned by the streaming read
infrastructure.
---
 src/backend/storage/aio/streaming_read.c | 18 ++++++++++++++++++
 src/include/storage/streaming_read.h     |  1 +
 2 files changed, 19 insertions(+)

diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
index 71f2c4a70b6..70f3ef051f8 100644
--- a/src/backend/storage/aio/streaming_read.c
+++ b/src/backend/storage/aio/streaming_read.c
@@ -610,3 +610,21 @@ pg_streaming_read_free(PgStreamingRead *pgsr)
 
 	pfree(pgsr);
 }
+
+
+/*
+ * Reset a streaming read object by releasing all of the buffers. Note that
+ * max_ios is not recalculated, so any changes to maintenance_io_concurrency and
+ * effective_io_concurrency will have no effect.
+ */
+void
+pg_streaming_read_reset(PgStreamingRead *pgsr)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead, and unpin anything that wasn't consumed. */
+	pgsr->finished = true;
+	while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+	pgsr->finished = false;
+}
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
index c4d3892bb26..63cef719e42 100644
--- a/src/include/storage/streaming_read.h
+++ b/src/include/storage/streaming_read.h
@@ -48,5 +48,6 @@ extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags,
 extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
 extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
 extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+extern void pg_streaming_read_reset(PgStreamingRead *pgsr);
 
 #endif
-- 
2.37.2
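
A usage sketch (this mirrors what the next patch does in heap_rescan() rather
than introducing new API): with the reset function, a rescan can keep its
PgStreamingRead around and simply do

	if (scan->rs_pgsr)
		pg_streaming_read_reset(scan->rs_pgsr);

instead of calling pg_streaming_read_free() and then building a new object
with pg_streaming_read_buffer_alloc().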

v3-0005-Sequential-scans-and-TID-range-scans-stream-reads.patch (text/x-diff)
From 2e3c2890ff7aa13c2b33f66ed1cfb8c3b58ea21c Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 26 Feb 2024 15:41:32 -0500
Subject: [PATCH v3 5/5] Sequential scans and TID range scans stream reads

Implementing streaming read support for heap sequential scans and TID
range scans includes three parts:

Allocate the streaming read object in heap_beginscan(). On rescan, reset
the streaming read by releasing all pinned buffers and resetting the
prefetch block.

Implement a callback returning the next block to prefetch to the
streaming read API.

Invoke the streaming read API when a new page is needed and streaming
reads are enabled. When the scan direction changes, reset the streaming
read.
---
 src/backend/access/heap/heapam.c | 88 ++++++++++++++++++++++++++++----
 src/include/access/heapam.h      | 12 +++++
 2 files changed, 90 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e0fe3d9c326..7143a3efb9a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -65,6 +65,7 @@
 #include "storage/smgr.h"
 #include "storage/spin.h"
 #include "storage/standby.h"
+#include "storage/streaming_read.h"
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/lsyscache.h"
@@ -228,6 +229,29 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * ----------------------------------------------------------------
  */
 
+static BlockNumber
+heap_pgsr_next_single(PgStreamingRead *pgsr, void *pgsr_private,
+					  void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) pgsr_private;
+
+	/*
+	 * Hard-code ScanDirection to ForwardScanDirection since only forward
+	 * scans support streaming reads.
+	 */
+	if (!scan->rs_inited)
+	{
+		scan->rs_prefetch_block = heapgettup_initial_block(scan, scan->rs_dir);
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_prefetch_block = heapgettup_advance_block(scan,
+														   scan->rs_prefetch_block,
+														   scan->rs_dir);
+
+	return scan->rs_prefetch_block;
+}
+
 /* ----------------
  *		initscan - scan code common to heap_beginscan and heap_rescan
  * ----------------
@@ -330,6 +354,13 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
 
+	/*
+	 * Initialize to ForwardScanDirection because it is most common and heap
+	 * scans usually must go forwards before going backward.
+	 */
+	scan->rs_dir = ForwardScanDirection;
+	scan->rs_prefetch_block = InvalidBlockNumber;
+
 	/* page-at-a-time fields are always invalid when not rs_inited */
 
 	/*
@@ -472,6 +503,8 @@ heapbuildvis(TableScanDesc sscan)
 static inline void
 heapfetchbuf(HeapScanDesc scan, ScanDirection dir)
 {
+	Assert(scan->rs_pgsr);
+
 	/* release previous scan buffer, if any */
 	if (BufferIsValid(scan->rs_cbuf))
 	{
@@ -486,19 +519,23 @@ heapfetchbuf(HeapScanDesc scan, ScanDirection dir)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	if (!scan->rs_inited)
+	/*
+	 * If the scan direction is changing, reset the prefetch block to the
+	 * current block. Otherwise, we will incorrectly prefetch the blocks
+	 * between the prefetch block and the current block again before
+	 * prefetching blocks in the new, correct scan direction.
+	 */
+	if (scan->rs_dir != dir && scan->rs_pgsr)
 	{
-		scan->rs_cblock = heapgettup_initial_block(scan, dir);
-		Assert(scan->rs_cblock != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
-		scan->rs_inited = true;
+		scan->rs_prefetch_block = scan->rs_cblock;
+		pg_streaming_read_reset(scan->rs_pgsr);
 	}
-	else
-		scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock, dir);
 
-	/* read block if valid */
-	if (BlockNumberIsValid(scan->rs_cblock))
-		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
-										   scan->rs_cblock, RBM_NORMAL, scan->rs_strategy);
+	scan->rs_dir = dir;
+
+	scan->rs_cbuf = pg_streaming_read_buffer_get_next(scan->rs_pgsr, NULL);
+	if (BufferIsValid(scan->rs_cbuf))
+		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
 }
 
 /*
@@ -827,6 +864,7 @@ continue_page:
 
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
+	scan->rs_prefetch_block = InvalidBlockNumber;
 	tuple->t_data = NULL;
 	scan->rs_inited = false;
 }
@@ -917,6 +955,7 @@ continue_page:
 		ReleaseBuffer(scan->rs_cbuf);
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
+	scan->rs_prefetch_block = InvalidBlockNumber;
 	tuple->t_data = NULL;
 	scan->rs_inited = false;
 }
@@ -999,6 +1038,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	else
 		scan->rs_parallelworkerdata = NULL;
 
+	scan->rs_pgsr = NULL;
+
 	/*
 	 * we do this here instead of in initscan() because heap_rescan also calls
 	 * initscan() and we don't want to allocate memory again
@@ -1010,6 +1051,22 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 
 	initscan(scan, key, false);
 
+	/*
+	 * We do not know the scan direction yet. If the scan does not end up
+	 * being a forward scan, the streaming read object will be freed.
+	 */
+	if (scan->rs_base.rs_flags & SO_TYPE_SEQSCAN ||
+		scan->rs_base.rs_flags & SO_TYPE_TIDRANGESCAN)
+	{
+		scan->rs_pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_SEQUENTIAL,
+													   scan,
+													   0,
+													   scan->rs_strategy,
+													   BMR_REL(scan->rs_base.rs_rd),
+													   MAIN_FORKNUM,
+													   heap_pgsr_next_single);
+	}
+
 	return (TableScanDesc) scan;
 }
 
@@ -1048,6 +1105,14 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 	 * reinitialize scan descriptor
 	 */
 	initscan(scan, key, true);
+
+	/*
+	 * The streaming read object is reset on rescan. This must be done after
+	 * initscan(), as some state referred to by pg_streaming_read_reset() is
+	 * reset in initscan().
+	 */
+	if (scan->rs_pgsr)
+		pg_streaming_read_reset(scan->rs_pgsr);
 }
 
 void
@@ -1063,6 +1128,9 @@ heap_endscan(TableScanDesc sscan)
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
 
+	if (scan->rs_pgsr)
+		pg_streaming_read_free(scan->rs_pgsr);
+
 	/*
 	 * decrement relation reference count and free scan descriptor storage
 	 */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e2b1b2a3ad9..0f8f51ce326 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -72,6 +72,18 @@ typedef struct HeapScanDescData
 	 */
 	ParallelBlockTableScanWorkerData *rs_parallelworkerdata;
 
+	/*
+	 * Fields used for streaming reads by sequential scans and TID range
+	 * scans. The streaming read object is allocated at the beginning of the
+	 * scan and reset on rescan or when the scan direction changes. The scan
+	 * direction is saved each time a new page is requested. If the scan
+	 * direction changes from one page to the next, the streaming read object
+	 * releases all previously pinned buffers and resets the prefetch block.
+	 */
+	ScanDirection rs_dir;
+	struct PgStreamingRead *rs_pgsr;
+	BlockNumber rs_prefetch_block;
+
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
 	int			rs_cindex;		/* current tuple's index in vistuples */
 	int			rs_ntuples;		/* number of visible tuples on page */
-- 
2.37.2
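
For anyone wiring up another streaming read user, the callback contract used
above is simply "return the next BlockNumber to read, or InvalidBlockNumber
when there is nothing left".  A minimal hypothetical callback (the names here
are made up for illustration, not taken from the patches) could look like:

	typedef struct MyScanState
	{
		BlockNumber next_block;
		BlockNumber nblocks;
	} MyScanState;

	static BlockNumber
	my_next_block_cb(PgStreamingRead *pgsr, void *pgsr_private,
					 void *per_buffer_data)
	{
		MyScanState *state = (MyScanState *) pgsr_private;

		if (state->next_block < state->nblocks)
			return state->next_block++;
		return InvalidBlockNumber;	/* end of stream */
	}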

#9Melanie Plageman
melanieplageman@gmail.com
In reply to: Melanie Plageman (#8)
5 attachment(s)
Re: Streaming read-ready sequential scan code

On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote:

On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

There is an outstanding question about where to allocate the
PgStreamingRead object for sequential scans

I've written three alternative implementations of the actual streaming
read user for sequential scan which handle the question of where to
allocate the streaming read object and how to handle changing scan
direction in different ways.

Option A) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation
- Allocates the streaming read object in initscan(). Since we do not
know the scan direction at this time, if the scan ends up not being a
forwards scan, the streaming read object must later be freed -- so
this will sometimes allocate a streaming read object it never uses.
- Only supports ForwardScanDirection and once the scan direction
changes, streaming is never supported again -- even if we return to
ForwardScanDirection
- Must maintain a "fallback" codepath that does not use the streaming read API

Attached is a version of this patch which implements a "reset"
function for the streaming read API which should be cheaper than the
full pg_streaming_read_free() on rescan. This can easily be ported to
work on any of my proposed implementations (A/B/C). I implemented it
on A as an example.

Attached is the latest version of this patchset -- rebased in light of
Thomas' updates to the streaming read API [1]. I have chosen the
approach I think we should go with. It is a hybrid of my previously
proposed approaches.

While investigating some performance concerns, Andres pointed out that
the members I add to HeapScanDescData in this patch push rs_cindex and
rs_ntuples to another cacheline and introduce a 4-byte hole. Attached
v4's HeapScanDescData is as well-packed as on master and its members
are reordered so that rs_cindex and rs_ntuples are back on the second
cacheline of the struct's data.
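
To illustrate the layout issue (this is the v3 member order shown in the
earlier patch, not necessarily v4's final arrangement): on a typical 64-bit
build,

	ScanDirection rs_dir;				/* 4 bytes */
	/* 4-byte hole so the next pointer can be 8-byte aligned */
	struct PgStreamingRead *rs_pgsr;	/* 8 bytes */
	BlockNumber rs_prefetch_block;		/* 4 bytes */

wastes 4 bytes and pushes everything after it further down the struct, while
an order like

	struct PgStreamingRead *rs_pgsr;
	BlockNumber rs_prefetch_block;
	ScanDirection rs_dir;

packs the two 4-byte members together and leaves no hole.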

- Melanie

Attachments:

v4-0004-Add-pg_streaming_read_reset.patch (text/x-patch)
From 614195777b3ee675d74d98953b086e0f8a4a494d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 26 Feb 2024 15:33:39 -0500
Subject: [PATCH v4 4/5] Add pg_streaming_read_reset

For rescan, we want to reuse the streaming read object and simply
release the buffers that were pinned by the streaming read
infrastructure.
---
 src/backend/storage/aio/streaming_read.c | 18 ++++++++++++++++++
 src/include/storage/streaming_read.h     |  1 +
 2 files changed, 19 insertions(+)

diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
index 71f2c4a70b6..70f3ef051f8 100644
--- a/src/backend/storage/aio/streaming_read.c
+++ b/src/backend/storage/aio/streaming_read.c
@@ -610,3 +610,21 @@ pg_streaming_read_free(PgStreamingRead *pgsr)
 
 	pfree(pgsr);
 }
+
+
+/*
+ * Reset a streaming read object by releasing all of the buffers. Note that
+ * max_ios is not recalculated, so any changes to maintenance_io_concurrency and
+ * effective_io_concurrency will have no effect.
+ */
+void
+pg_streaming_read_reset(PgStreamingRead *pgsr)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead, and unpin anything that wasn't consumed. */
+	pgsr->finished = true;
+	while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+	pgsr->finished = false;
+}
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
index c4d3892bb26..63cef719e42 100644
--- a/src/include/storage/streaming_read.h
+++ b/src/include/storage/streaming_read.h
@@ -48,5 +48,6 @@ extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags,
 extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
 extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
 extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+extern void pg_streaming_read_reset(PgStreamingRead *pgsr);
 
 #endif
-- 
2.40.1

v4-0002-Replace-blocks-with-buffers-in-heapgettup-control.patch (text/x-patch)
From 4b6d9059aa32625e91d38f1d414ec514d4073197 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 29 Jan 2024 11:50:01 -0500
Subject: [PATCH v4 2/5] Replace blocks with buffers in heapgettup control flow

Future commits will introduce the streaming read API and the sequential
scan streaming read API user. Streaming read API users implement a
callback which returns the next block to read. Sequential scans
previously looped through the blocks in the relation, synchronously
reading in a block and then processing it. An InvalidBlockNumber
returned by heapgettup_advance_block() meant that the relation was
exhausted and all blocks had been processed.

The streaming read API may exhaust the blocks in a relation (having read
all of them into buffers) before they have all been processed by the
sequential scan. As such, the sequential scan should continue processing
blocks until heapfetchbuf() returns InvalidBuffer.

Note that this commit does not implement the streaming read API user. It
simply restructures heapgettup() and heapgettup_pagemode() to use
buffers instead of blocks for control flow.

Not all sequential scans will support streaming reads. As such, this
code will remain for compatibility even after sequential scans support
streaming reads.
---
 src/backend/access/heap/heapam.c | 79 ++++++++++++++------------------
 1 file changed, 35 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 449221da6ac..e0fe3d9c326 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -87,6 +87,9 @@ static Bitmapset *HeapDetermineColumnsInfo(Relation relation,
 static bool heap_acquire_tuplock(Relation relation, ItemPointer tid,
 								 LockTupleMode mode, LockWaitPolicy wait_policy,
 								 bool *have_tuple_lock);
+static inline BlockNumber heapgettup_advance_block(HeapScanDesc scan,
+												   BlockNumber block, ScanDirection dir);
+static inline BlockNumber heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir);
 static void compute_new_xmax_infomask(TransactionId xmax, uint16 old_infomask,
 									  uint16 old_infomask2, TransactionId add_to_xmax,
 									  LockTupleMode mode, bool is_update,
@@ -463,14 +466,12 @@ heapbuildvis(TableScanDesc sscan)
 /*
  * heapfetchbuf - subroutine for heapgettup()
  *
- * This routine reads the specified block of the relation into a buffer and
- * returns with that pinned buffer saved in the scan descriptor.
+ * This routine reads the next block of the relation into a buffer and returns
+ * with that pinned buffer saved in the scan descriptor.
  */
 static inline void
-heapfetchbuf(HeapScanDesc scan, BlockNumber block)
+heapfetchbuf(HeapScanDesc scan, ScanDirection dir)
 {
-	Assert(block < scan->rs_nblocks);
-
 	/* release previous scan buffer, if any */
 	if (BufferIsValid(scan->rs_cbuf))
 	{
@@ -485,10 +486,19 @@ heapfetchbuf(HeapScanDesc scan, BlockNumber block)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	/* read page using selected strategy */
-	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
-									   RBM_NORMAL, scan->rs_strategy);
-	scan->rs_cblock = block;
+	if (!scan->rs_inited)
+	{
+		scan->rs_cblock = heapgettup_initial_block(scan, dir);
+		Assert(scan->rs_cblock != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock, dir);
+
+	/* read block if valid */
+	if (BlockNumberIsValid(scan->rs_cblock))
+		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
+										   scan->rs_cblock, RBM_NORMAL, scan->rs_strategy);
 }
 
 /*
@@ -498,7 +508,7 @@ heapfetchbuf(HeapScanDesc scan, BlockNumber block)
  * occur with empty tables and in parallel scans when parallel workers get all
  * of the pages before we can get a chance to get our first page.
  */
-static BlockNumber
+BlockNumber
 heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
 {
 	Assert(!scan->rs_inited);
@@ -638,7 +648,7 @@ heapgettup_continue_page(HeapScanDesc scan, ScanDirection dir, int *linesleft,
  * This also adjusts rs_numblocks when a limit has been imposed by
  * heap_setscanlimits().
  */
-static inline BlockNumber
+BlockNumber
 heapgettup_advance_block(HeapScanDesc scan, BlockNumber block, ScanDirection dir)
 {
 	if (ScanDirectionIsForward(dir))
@@ -736,23 +746,13 @@ heapgettup(HeapScanDesc scan,
 		   ScanKey key)
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
-	BlockNumber block;
 	Page		page;
 	OffsetNumber lineoff;
 	int			linesleft;
 
-	if (unlikely(!scan->rs_inited))
-	{
-		block = heapgettup_initial_block(scan, dir);
-		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
-		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
-		scan->rs_inited = true;
-	}
-	else
+	if (scan->rs_inited)
 	{
 		/* continue from previously returned page/tuple */
-		block = scan->rs_cblock;
-
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_continue_page(scan, dir, &linesleft, &lineoff);
 		goto continue_page;
@@ -762,9 +762,12 @@ heapgettup(HeapScanDesc scan,
 	 * advance the scan until we find a qualifying tuple or run out of stuff
 	 * to scan
 	 */
-	while (block != InvalidBlockNumber)
+	while (true)
 	{
-		heapfetchbuf(scan, block);
+		heapfetchbuf(scan, dir);
+		if (!BufferIsValid(scan->rs_cbuf))
+			break;
+		Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_start_page(scan, dir, &linesleft, &lineoff);
 continue_page:
@@ -786,7 +789,7 @@ continue_page:
 
 			tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
 			tuple->t_len = ItemIdGetLength(lpp);
-			ItemPointerSet(&(tuple->t_self), block, lineoff);
+			ItemPointerSet(&(tuple->t_self), scan->rs_cblock, lineoff);
 
 			visible = HeapTupleSatisfiesVisibility(tuple,
 												   scan->rs_base.rs_snapshot,
@@ -816,9 +819,6 @@ continue_page:
 		 * it's time to move to the next.
 		 */
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
 	}
 
 	/* end of scan */
@@ -851,22 +851,13 @@ heapgettup_pagemode(HeapScanDesc scan,
 					ScanKey key)
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
-	BlockNumber block;
 	Page		page;
 	int			lineindex;
 	int			linesleft;
 
-	if (unlikely(!scan->rs_inited))
-	{
-		block = heapgettup_initial_block(scan, dir);
-		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
-		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
-		scan->rs_inited = true;
-	}
-	else
+	if (scan->rs_inited)
 	{
 		/* continue from previously returned page/tuple */
-		block = scan->rs_cblock;	/* current page */
 		page = BufferGetPage(scan->rs_cbuf);
 
 		lineindex = scan->rs_cindex + dir;
@@ -883,9 +874,12 @@ heapgettup_pagemode(HeapScanDesc scan,
 	 * advance the scan until we find a qualifying tuple or run out of stuff
 	 * to scan
 	 */
-	while (block != InvalidBlockNumber)
+	while (true)
 	{
-		heapfetchbuf(scan, block);
+		heapfetchbuf(scan, dir);
+		if (!BufferIsValid(scan->rs_cbuf))
+			break;
+		Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
 		heapbuildvis((TableScanDesc) scan);
 		page = BufferGetPage(scan->rs_cbuf);
 		linesleft = scan->rs_ntuples;
@@ -905,7 +899,7 @@ continue_page:
 
 			tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
 			tuple->t_len = ItemIdGetLength(lpp);
-			ItemPointerSet(&(tuple->t_self), block, lineoff);
+			ItemPointerSet(&(tuple->t_self), scan->rs_cblock, lineoff);
 
 			/* skip any tuples that don't match the scan key */
 			if (key != NULL &&
@@ -916,9 +910,6 @@ continue_page:
 			scan->rs_cindex = lineindex;
 			return;
 		}
-
-		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
 	}
 
 	/* end of scan */
-- 
2.40.1
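
Put another way, the loop shape in heapgettup()/heapgettup_pagemode() changes
from roughly

	block = heapgettup_initial_block(scan, dir);
	while (block != InvalidBlockNumber)
	{
		/* read block into scan->rs_cbuf and process the page */
		block = heapgettup_advance_block(scan, block, dir);
	}

to

	while (true)
	{
		heapfetchbuf(scan, dir);	/* chooses and reads the next block */
		if (!BufferIsValid(scan->rs_cbuf))
			break;
		/* process the page in scan->rs_cbuf */
	}

so that "no more valid buffers", rather than "no more block numbers", is what
ends the scan once streaming reads are plugged in.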

v4-0001-Split-heapgetpage-into-two-parts.patch (text/x-patch)
From 9227b1b621c473ab189ac0cefba1b9aaed02aa09 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sat, 27 Jan 2024 18:39:37 -0500
Subject: [PATCH v4 1/5] Split heapgetpage into two parts

heapgetpage(), a per-block utility function used in heap scans, read a
passed-in block into a buffer and then, if page-at-a-time processing was
enabled, pruned the page and built an array of its visible tuples. This
was used for sequential and sample scans.

Future commits will add support for streaming reads. The streaming read
API will read in the blocks specified by a callback, but any significant
per-page processing should be done synchronously on the buffer yielded
by the streaming read API. To support this, separate the logic in
heapgetpage() to read a block into a buffer from that which prunes the
page and builds an array of the visible tuples. The former is now
heapfetchbuf() and the latter is now heapbuildvis().

Future commits will push the logic for selecting the next block into
heapfetchbuf() in cases when streaming reads are not supported (such as
backwards sequential scans). Because this logic differs for sample scans
and sequential scans, inline the code to read the block into a buffer
for sample scans.

This has the added benefit of allowing for a bit of refactoring in
heapam_scan_sample_next_block(), including unpinning the previous buffer
before invoking the callback to select the next block.
---
 src/backend/access/heap/heapam.c         | 74 ++++++++++++++----------
 src/backend/access/heap/heapam_handler.c | 40 +++++++++----
 src/include/access/heapam.h              |  2 +-
 3 files changed, 72 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 707460a5364..449221da6ac 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -367,17 +367,18 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
 }
 
 /*
- * heapgetpage - subroutine for heapgettup()
+ * heapbuildvis - Utility function for heap scans.
  *
- * This routine reads and pins the specified page of the relation.
- * In page-at-a-time mode it performs additional work, namely determining
- * which tuples on the page are visible.
+ * Given a page residing in a buffer saved in the scan descriptor, prune the
+ * page and determine which of its tuples are visible, saving their offsets
+ * in an array in the scan descriptor.
  */
 void
-heapgetpage(TableScanDesc sscan, BlockNumber block)
+heapbuildvis(TableScanDesc sscan)
 {
 	HeapScanDesc scan = (HeapScanDesc) sscan;
-	Buffer		buffer;
+	Buffer		buffer = scan->rs_cbuf;
+	BlockNumber block = scan->rs_cblock;
 	Snapshot	snapshot;
 	Page		page;
 	int			lines;
@@ -385,31 +386,8 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	OffsetNumber lineoff;
 	bool		all_visible;
 
-	Assert(block < scan->rs_nblocks);
+	Assert(BufferGetBlockNumber(buffer) == block);
 
-	/* release previous scan buffer, if any */
-	if (BufferIsValid(scan->rs_cbuf))
-	{
-		ReleaseBuffer(scan->rs_cbuf);
-		scan->rs_cbuf = InvalidBuffer;
-	}
-
-	/*
-	 * Be sure to check for interrupts at least once per page.  Checks at
-	 * higher code levels won't be able to stop a seqscan that encounters many
-	 * pages' worth of consecutive dead tuples.
-	 */
-	CHECK_FOR_INTERRUPTS();
-
-	/* read page using selected strategy */
-	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
-									   RBM_NORMAL, scan->rs_strategy);
-	scan->rs_cblock = block;
-
-	if (!(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE))
-		return;
-
-	buffer = scan->rs_cbuf;
 	snapshot = scan->rs_base.rs_snapshot;
 
 	/*
@@ -482,6 +460,37 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	scan->rs_ntuples = ntup;
 }
 
+/*
+ * heapfetchbuf - subroutine for heapgettup()
+ *
+ * This routine reads the specified block of the relation into a buffer and
+ * returns with that pinned buffer saved in the scan descriptor.
+ */
+static inline void
+heapfetchbuf(HeapScanDesc scan, BlockNumber block)
+{
+	Assert(block < scan->rs_nblocks);
+
+	/* release previous scan buffer, if any */
+	if (BufferIsValid(scan->rs_cbuf))
+	{
+		ReleaseBuffer(scan->rs_cbuf);
+		scan->rs_cbuf = InvalidBuffer;
+	}
+
+	/*
+	 * Be sure to check for interrupts at least once per page.  Checks at
+	 * higher code levels won't be able to stop a seqscan that encounters many
+	 * pages' worth of consecutive dead tuples.
+	 */
+	CHECK_FOR_INTERRUPTS();
+
+	/* read page using selected strategy */
+	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
+									   RBM_NORMAL, scan->rs_strategy);
+	scan->rs_cblock = block;
+}
+
 /*
  * heapgettup_initial_block - return the first BlockNumber to scan
  *
@@ -755,7 +764,7 @@ heapgettup(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapfetchbuf(scan, block);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_start_page(scan, dir, &linesleft, &lineoff);
 continue_page:
@@ -876,7 +885,8 @@ heapgettup_pagemode(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapfetchbuf(scan, block);
+		heapbuildvis((TableScanDesc) scan);
 		page = BufferGetPage(scan->rs_cbuf);
 		linesleft = scan->rs_ntuples;
 		lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 680a50bf8b1..8dc137995f0 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2276,11 +2276,14 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 	if (hscan->rs_nblocks == 0)
 		return false;
 
-	if (tsm->NextSampleBlock)
+	if (BufferIsValid(hscan->rs_cbuf))
 	{
-		blockno = tsm->NextSampleBlock(scanstate, hscan->rs_nblocks);
-		hscan->rs_cblock = blockno;
+		ReleaseBuffer(hscan->rs_cbuf);
+		hscan->rs_cbuf = InvalidBuffer;
 	}
+
+	if (tsm->NextSampleBlock)
+		blockno = tsm->NextSampleBlock(scanstate, hscan->rs_nblocks);
 	else
 	{
 		/* scanning table sequentially */
@@ -2322,20 +2325,35 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 		}
 	}
 
-	if (!BlockNumberIsValid(blockno))
+	hscan->rs_cblock = blockno;
+
+	if (!BlockNumberIsValid(hscan->rs_cblock))
 	{
-		if (BufferIsValid(hscan->rs_cbuf))
-			ReleaseBuffer(hscan->rs_cbuf);
-		hscan->rs_cbuf = InvalidBuffer;
-		hscan->rs_cblock = InvalidBlockNumber;
 		hscan->rs_inited = false;
-
 		return false;
 	}
 
-	heapgetpage(scan, blockno);
-	hscan->rs_inited = true;
+	Assert(hscan->rs_cblock < hscan->rs_nblocks);
+
+	/*
+	 * We may scan multiple pages before finding tuples to yield or finishing
+	 * the scan. Since we want to check for interrupts at least once per page,
+	 * do so here.
+	 */
+	CHECK_FOR_INTERRUPTS();
+
+	/* Read page using selected strategy */
+	hscan->rs_cbuf = ReadBufferExtended(hscan->rs_base.rs_rd, MAIN_FORKNUM,
+										hscan->rs_cblock, RBM_NORMAL, hscan->rs_strategy);
 
+	/*
+	 * If pagemode is allowed, prune the page and build an array of visible
+	 * tuple offsets.
+	 */
+	if (hscan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
+		heapbuildvis(scan);
+
+	hscan->rs_inited = true;
 	return true;
 }
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4b133f68593..e2b1b2a3ad9 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -246,7 +246,7 @@ extern TableScanDesc heap_beginscan(Relation relation, Snapshot snapshot,
 									uint32 flags);
 extern void heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk,
 							   BlockNumber numBlks);
-extern void heapgetpage(TableScanDesc sscan, BlockNumber block);
+extern void heapbuildvis(TableScanDesc sscan);
 extern void heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 						bool allow_strat, bool allow_sync, bool allow_pagemode);
 extern void heap_endscan(TableScanDesc sscan);
-- 
2.40.1

v4-0003-Streaming-Read-API.patch (text/x-patch; charset=US-ASCII)
From ca3b34481f5d1404ff0fdb60c65ec0b7befd86fb Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v4 3/5] Streaming Read API

---
 contrib/pg_prewarm/pg_prewarm.c          |  40 +-
 src/backend/storage/Makefile             |   2 +-
 src/backend/storage/aio/Makefile         |  14 +
 src/backend/storage/aio/meson.build      |   5 +
 src/backend/storage/aio/streaming_read.c | 612 ++++++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c      | 641 ++++++++++++++++-------
 src/backend/storage/buffer/localbuf.c    |  14 +-
 src/backend/storage/meson.build          |   1 +
 src/include/storage/bufmgr.h             |  45 ++
 src/include/storage/streaming_read.h     |  52 ++
 src/tools/pgindent/typedefs.list         |   3 +
 11 files changed, 1218 insertions(+), 211 deletions(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c
 create mode 100644 src/include/storage/streaming_read.h

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e46..1cc84bcb0c2 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
+#include "storage/streaming_read.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -38,6 +39,25 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_streaming_read_private
+{
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static BlockNumber
+pg_prewarm_streaming_read_next(PgStreamingRead *pgsr,
+							   void *pgsr_private,
+							   void *per_buffer_data)
+{
+	struct pg_prewarm_streaming_read_private *p = pgsr_private;
+
+	if (p->blocknum <= p->last_block)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_streaming_read_private p;
+		PgStreamingRead *pgsr;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_FULL,
+											  &p,
+											  0,
+											  NULL,
+											  BMR_REL(rel),
+											  forkNumber,
+											  pg_prewarm_streaming_read_next);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(pg_streaming_read_buffer_get_next(pgsr, NULL) == InvalidBuffer);
+		pg_streaming_read_free(pgsr);
 	}
 
 	/* Close relation, release lock. */
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 00000000000..bcab44c802f
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 00000000000..39aef2a84a2
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 00000000000..71f2c4a70b6
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,612 @@
+#include "postgres.h"
+
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+
+/*
+ * Element type for PgStreamingRead's circular array of block ranges.
+ */
+typedef struct PgStreamingReadRange
+{
+	bool		need_wait;
+	bool		advice_issued;
+	BlockNumber blocknum;
+	int			nblocks;
+	int			per_buffer_data_index;
+	Buffer		buffers[MAX_BUFFERS_PER_TRANSFER];
+	ReadBuffersOperation operation;
+} PgStreamingReadRange;
+
+/*
+ * Streaming read object.
+ */
+struct PgStreamingRead
+{
+	int			max_ios;
+	int			ios_in_progress;
+	int			max_pinned_buffers;
+	int			pinned_buffers;
+	int			pinned_buffers_trigger;
+	int			next_tail_buffer;
+	int			ramp_up_pin_limit;
+	int			ramp_up_pin_stall;
+	bool		finished;
+	bool		advice_enabled;
+	void	   *pgsr_private;
+	PgStreamingReadBufferCB callback;
+
+	BufferAccessStrategy strategy;
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+
+	/* Sometimes we need to buffer one block for flow control. */
+	BlockNumber unget_blocknum;
+	void	   *unget_per_buffer_data;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* Space for optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+
+	/* Circular buffer of ranges. */
+	int			size;
+	int			head;
+	int			tail;
+	PgStreamingReadRange ranges[FLEXIBLE_ARRAY_MEMBER];
+};
+
+static PgStreamingRead *
+pg_streaming_read_buffer_alloc_internal(int flags,
+										void *pgsr_private,
+										size_t per_buffer_data_size,
+										BufferAccessStrategy strategy)
+{
+	PgStreamingRead *pgsr;
+	int			size;
+	int			max_ios;
+	uint32		max_pinned_buffers;
+
+
+	/*
+	 * Decide how many assumed I/Os we will allow to run concurrently.  That
+	 * is, advice to the kernel to tell it that we will soon read.  This
+	 * number also affects how far we look ahead for opportunities to start
+	 * more I/Os.
+	 */
+	if (flags & PGSR_FLAG_MAINTENANCE)
+		max_ios = maintenance_io_concurrency;
+	else
+		max_ios = effective_io_concurrency;
+
+	/*
+	 * The desired level of I/O concurrency controls how far ahead we are
+	 * willing to look.  We also clamp it to at least
+	 * MAX_BUFFERS_PER_TRANSFER so that we have a chance to build up a
+	 * full-sized read, even when max_ios is zero.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, MAX_BUFFERS_PER_TRANSFER);
+
+	/*
+	 * The *_io_concurrency GUCs might be set to 0, but we want to allow at
+	 * least one, to keep our gating logic simple.
+	 */
+	max_ios = Max(max_ios, 1);
+
+	/*
+	 * Don't allow this backend to pin too many buffers.  For now we'll apply
+	 * the limit for the shared buffer pool and the local buffer pool, without
+	 * worrying which it is.
+	 */
+	LimitAdditionalPins(&max_pinned_buffers);
+	LimitAdditionalLocalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * pgsr->ranges is a circular buffer.  When it is empty, head == tail.
+	 * When it is full, there is an empty element between head and tail.  Head
+	 * can also be empty (nblocks == 0), therefore we need two extra elements
+	 * for non-occupied ranges, on top of max_pinned_buffers to allow for the
+	 * maximum possible number of occupied ranges of the smallest possible
+	 * size of one.
+	 */
+	size = max_pinned_buffers + 2;
+
+	pgsr = (PgStreamingRead *)
+		palloc0(offsetof(PgStreamingRead, ranges) +
+				sizeof(pgsr->ranges[0]) * size);
+
+	pgsr->max_ios = max_ios;
+	pgsr->per_buffer_data_size = per_buffer_data_size;
+	pgsr->max_pinned_buffers = max_pinned_buffers;
+	pgsr->pgsr_private = pgsr_private;
+	pgsr->strategy = strategy;
+	pgsr->size = size;
+
+	pgsr->unget_blocknum = InvalidBlockNumber;
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  As long as direct I/O isn't
+	 * enabled, and the caller hasn't promised sequential access, we can use
+	 * it.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & PGSR_FLAG_SEQUENTIAL) == 0)
+		pgsr->advice_enabled = true;
+#endif
+
+	/*
+	 * We start off building small ranges, but double that quickly, for the
+	 * benefit of users that don't know how far ahead they'll read.  This can
+	 * be disabled by users that already know they'll read all the way.
+	 */
+	if (flags & PGSR_FLAG_FULL)
+		pgsr->ramp_up_pin_limit = INT_MAX;
+	else
+		pgsr->ramp_up_pin_limit = 1;
+
+	/*
+	 * We want to avoid creating ranges that are smaller than they could be
+	 * just because we hit max_pinned_buffers.  We only look ahead when the
+	 * number of pinned buffers falls below this trigger number, or put
+	 * another way, we stop looking ahead when we wouldn't be able to build a
+	 * "full sized" range.
+	 */
+	pgsr->pinned_buffers_trigger =
+		Max(1, (int) max_pinned_buffers - MAX_BUFFERS_PER_TRANSFER);
+
+	/* Space for the callback to store extra data along with each block. */
+	if (per_buffer_data_size)
+		pgsr->per_buffer_data = palloc(per_buffer_data_size * max_pinned_buffers);
+
+	return pgsr;
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.
+ */
+PgStreamingRead *
+pg_streaming_read_buffer_alloc(int flags,
+							   void *pgsr_private,
+							   size_t per_buffer_data_size,
+							   BufferAccessStrategy strategy,
+							   BufferManagerRelation bmr,
+							   ForkNumber forknum,
+							   PgStreamingReadBufferCB next_block_cb)
+{
+	PgStreamingRead *result;
+
+	result = pg_streaming_read_buffer_alloc_internal(flags,
+													 pgsr_private,
+													 per_buffer_data_size,
+													 strategy);
+	result->callback = next_block_cb;
+	result->bmr = bmr;
+	result->forknum = forknum;
+
+	return result;
+}
+
+/*
+ * Find the per-buffer data index for the Nth block of a range.
+ */
+static int
+get_per_buffer_data_index(PgStreamingRead *pgsr, PgStreamingReadRange *range, int n)
+{
+	int			result;
+
+	/*
+	 * Find slot in the circular buffer of per-buffer data, without using the
+	 * expensive % operator.
+	 */
+	result = range->per_buffer_data_index + n;
+	if (result >= pgsr->max_pinned_buffers)
+		result -= pgsr->max_pinned_buffers;
+	Assert(result == (range->per_buffer_data_index + n) % pgsr->max_pinned_buffers);
+
+	return result;
+}
+
+/*
+ * Return a pointer to the per-buffer data by index.
+ */
+static void *
+get_per_buffer_data_by_index(PgStreamingRead *pgsr, int per_buffer_data_index)
+{
+	return (char *) pgsr->per_buffer_data +
+		pgsr->per_buffer_data_size * per_buffer_data_index;
+}
+
+/*
+ * Return a pointer to the per-buffer data for the Nth block of a range.
+ */
+static void *
+get_per_buffer_data(PgStreamingRead *pgsr, PgStreamingReadRange *range, int n)
+{
+	return get_per_buffer_data_by_index(pgsr,
+										get_per_buffer_data_index(pgsr,
+																  range,
+																  n));
+}
+
+/*
+ * Start reading the head range, and create a new head range.  The new head
+ * range is returned.  It may not be empty, if StartReadBuffers() couldn't
+ * start the entire range; in that case the returned range contains the
+ * remaining portion of the range.
+ */
+static PgStreamingReadRange *
+pg_streaming_read_start_head_range(PgStreamingRead *pgsr)
+{
+	PgStreamingReadRange *head_range;
+	PgStreamingReadRange *new_head_range;
+	int			nblocks_pinned;
+	int			flags;
+
+	/* Caller should make sure we never exceed max_ios. */
+	Assert(pgsr->ios_in_progress < pgsr->max_ios);
+
+	/* Should only call if the head range has some blocks to read. */
+	head_range = &pgsr->ranges[pgsr->head];
+	Assert(head_range->nblocks > 0);
+
+	/*
+	 * If advice hasn't been suppressed, this system supports it, and this
+	 * isn't a strictly sequential pattern, then we'll issue advice.
+	 */
+	if (pgsr->advice_enabled && head_range->blocknum != pgsr->seq_blocknum)
+		flags = READ_BUFFERS_ISSUE_ADVICE;
+	else
+		flags = 0;
+
+
+	/* Start reading as many blocks as we can from the head range. */
+	nblocks_pinned = head_range->nblocks;
+	head_range->need_wait =
+		StartReadBuffers(pgsr->bmr,
+						 head_range->buffers,
+						 pgsr->forknum,
+						 head_range->blocknum,
+						 &nblocks_pinned,
+						 pgsr->strategy,
+						 flags,
+						 &head_range->operation);
+
+	/* Did that start an I/O? */
+	if (head_range->need_wait && (flags & READ_BUFFERS_ISSUE_ADVICE))
+	{
+		head_range->advice_issued = true;
+		pgsr->ios_in_progress++;
+		Assert(pgsr->ios_in_progress <= pgsr->max_ios);
+	}
+
+	/*
+	 * StartReadBuffers() might have pinned fewer blocks than we asked it to,
+	 * but always at least one.
+	 */
+	Assert(nblocks_pinned <= head_range->nblocks);
+	Assert(nblocks_pinned >= 1);
+	pgsr->pinned_buffers += nblocks_pinned;
+
+	/*
+	 * Remember where the next block would be after that, so we can detect
+	 * sequential access next time.
+	 */
+	pgsr->seq_blocknum = head_range->blocknum + nblocks_pinned;
+
+	/*
+	 * Create a new head range.  There must be space, because we have enough
+	 * elements for every range to hold just one block, up to the pin limit.
+	 */
+	Assert(pgsr->size > pgsr->max_pinned_buffers);
+	Assert((pgsr->head + 1) % pgsr->size != pgsr->tail);
+	if (++pgsr->head == pgsr->size)
+		pgsr->head = 0;
+	new_head_range = &pgsr->ranges[pgsr->head];
+	new_head_range->nblocks = 0;
+	new_head_range->advice_issued = false;
+
+	/*
+	 * If we didn't manage to start the whole read above, we split the range,
+	 * moving the remainder into the new head range.
+	 */
+	if (nblocks_pinned < head_range->nblocks)
+	{
+		int			nblocks_remaining = head_range->nblocks - nblocks_pinned;
+
+		head_range->nblocks = nblocks_pinned;
+
+		new_head_range->blocknum = head_range->blocknum + nblocks_pinned;
+		new_head_range->nblocks = nblocks_remaining;
+	}
+
+	/* The new range has per-buffer data starting after the previous range. */
+	new_head_range->per_buffer_data_index =
+		get_per_buffer_data_index(pgsr, head_range, nblocks_pinned);
+
+	return new_head_range;
+}
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow pg_streaming_unget_block() to work.
+ */
+static BlockNumber
+pg_streaming_get_block(PgStreamingRead *pgsr, void *per_buffer_data)
+{
+	BlockNumber result;
+
+	if (unlikely(pgsr->unget_blocknum != InvalidBlockNumber))
+	{
+		/*
+		 * If we had to unget a block, now it is time to return that one
+		 * again.
+		 */
+		result = pgsr->unget_blocknum;
+		pgsr->unget_blocknum = InvalidBlockNumber;
+
+		/*
+		 * The same per_buffer_data element must have been used, and still
+		 * contains whatever data the callback wrote into it.  So we just
+		 * sanity-check that we were called with the value that
+		 * pg_streaming_unget_block() pushed back.
+		 */
+		Assert(per_buffer_data == pgsr->unget_per_buffer_data);
+	}
+	else
+	{
+		/* Use the installed callback directly. */
+		result = pgsr->callback(pgsr, pgsr->pgsr_private, per_buffer_data);
+	}
+
+	return result;
+}
+
+/*
+ * In order to deal with short reads in StartReadBuffers(), we sometimes need
+ * to defer handling of a block until later.  This *must* be called with the
+ * last value returned by pg_streaming_get_block().
+ */
+static void
+pg_streaming_unget_block(PgStreamingRead *pgsr, BlockNumber blocknum, void *per_buffer_data)
+{
+	Assert(pgsr->unget_blocknum == InvalidBlockNumber);
+	pgsr->unget_blocknum = blocknum;
+	pgsr->unget_per_buffer_data = per_buffer_data;
+}
+
+static void
+pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
+{
+	PgStreamingReadRange *range;
+
+	/*
+	 * If we're still ramping up, we may have to stall to wait for buffers to
+	 * be consumed first before we do any more prefetching.
+	 */
+	if (pgsr->ramp_up_pin_stall > 0)
+	{
+		Assert(pgsr->pinned_buffers > 0);
+		return;
+	}
+
+	/*
+	 * If we're finished or can't start more I/O, then don't look ahead.
+	 */
+	if (pgsr->finished || pgsr->ios_in_progress == pgsr->max_ios)
+		return;
+
+	/*
+	 * We'll also wait until the number of pinned buffers falls below our
+	 * trigger level, so that we have the chance to create a full range.
+	 */
+	if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+		return;
+
+	do
+	{
+		BlockNumber blocknum;
+		void	   *per_buffer_data;
+
+		/* Do we have a full-sized range? */
+		range = &pgsr->ranges[pgsr->head];
+		if (range->nblocks == lengthof(range->buffers))
+		{
+			/* Start as much of it as we can. */
+			range = pg_streaming_read_start_head_range(pgsr);
+
+			/* If we're now at the I/O limit, stop here. */
+			if (pgsr->ios_in_progress == pgsr->max_ios)
+				return;
+
+			/*
+			 * If we couldn't form a full range, then stop here to avoid
+			 * creating small I/O.
+			 */
+			if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+				return;
+
+			/*
+			 * That might have been only partially started, but it always
+			 * processes at least one block, so that'll do for now.
+			 */
+			Assert(range->nblocks < lengthof(range->buffers));
+		}
+
+		/* Find per-buffer data slot for the next block. */
+		per_buffer_data = get_per_buffer_data(pgsr, range, range->nblocks);
+
+		/* Find out which block the callback wants to read next. */
+		blocknum = pg_streaming_get_block(pgsr, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			/* End of stream. */
+			pgsr->finished = true;
+			break;
+		}
+
+		/*
+		 * Is there a head range that we cannot extend, because the requested
+		 * block is not consecutive?
+		 */
+		if (range->nblocks > 0 &&
+			range->blocknum + range->nblocks != blocknum)
+		{
+			/* Yes.  Start it, so we can begin building a new one. */
+			range = pg_streaming_read_start_head_range(pgsr);
+
+			/*
+			 * It's possible that it was only partially started, and we have a
+			 * new range with the remainder.  Keep starting I/Os until we get
+			 * it all out of the way, or we hit the I/O limit.
+			 */
+			while (range->nblocks > 0 && pgsr->ios_in_progress < pgsr->max_ios)
+				range = pg_streaming_read_start_head_range(pgsr);
+
+			/*
+			 * We have to 'unget' the block returned by the callback if we
+			 * don't have enough I/O capacity left to start something.
+			 */
+			if (pgsr->ios_in_progress == pgsr->max_ios)
+			{
+				pg_streaming_unget_block(pgsr, blocknum, per_buffer_data);
+				return;
+			}
+		}
+
+		/* If we have a new, empty range, initialize the start block. */
+		if (range->nblocks == 0)
+		{
+			range->blocknum = blocknum;
+		}
+
+		/* This block extends the range by one. */
+		Assert(range->blocknum + range->nblocks == blocknum);
+		range->nblocks++;
+
+	} while (pgsr->pinned_buffers + range->nblocks < pgsr->max_pinned_buffers &&
+			 pgsr->pinned_buffers + range->nblocks < pgsr->ramp_up_pin_limit);
+
+	/* If we've hit the ramp-up limit, insert a stall. */
+	if (pgsr->pinned_buffers + range->nblocks >= pgsr->ramp_up_pin_limit)
+	{
+		/* Can't get here if an earlier stall hasn't finished. */
+		Assert(pgsr->ramp_up_pin_stall == 0);
+		/* Don't do any more prefetching until these buffers are consumed. */
+		pgsr->ramp_up_pin_stall = pgsr->ramp_up_pin_limit;
+		/* Double it.  It will soon be out of the way. */
+		pgsr->ramp_up_pin_limit *= 2;
+	}
+
+	/* Start as much as we can. */
+	while (range->nblocks > 0)
+	{
+		range = pg_streaming_read_start_head_range(pgsr);
+		if (pgsr->ios_in_progress == pgsr->max_ios)
+			break;
+	}
+}
+
+Buffer
+pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
+{
+	pg_streaming_read_look_ahead(pgsr);
+
+	/* See if we have one buffer to return. */
+	while (pgsr->tail != pgsr->head)
+	{
+		PgStreamingReadRange *tail_range;
+
+		tail_range = &pgsr->ranges[pgsr->tail];
+
+		/*
+		 * Do we need to perform an I/O before returning the buffers from this
+		 * range?
+		 */
+		if (tail_range->need_wait)
+		{
+			WaitReadBuffers(&tail_range->operation);
+			tail_range->need_wait = false;
+
+			/*
+			 * We don't really know if the kernel generated a physical I/O
+			 * when we issued advice, let alone when it finished, but it has
+			 * certainly finished now because we've performed the read.
+			 */
+			if (tail_range->advice_issued)
+			{
+				Assert(pgsr->ios_in_progress > 0);
+				pgsr->ios_in_progress--;
+			}
+		}
+
+		/* Are there more buffers available in this range? */
+		if (pgsr->next_tail_buffer < tail_range->nblocks)
+		{
+			int			buffer_index;
+			Buffer		buffer;
+
+			buffer_index = pgsr->next_tail_buffer++;
+			buffer = tail_range->buffers[buffer_index];
+
+			Assert(BufferIsValid(buffer));
+
+			/* We are giving away ownership of this pinned buffer. */
+			Assert(pgsr->pinned_buffers > 0);
+			pgsr->pinned_buffers--;
+
+			if (pgsr->ramp_up_pin_stall > 0)
+				pgsr->ramp_up_pin_stall--;
+
+			if (per_buffer_data)
+				*per_buffer_data = get_per_buffer_data(pgsr, tail_range, buffer_index);
+
+			return buffer;
+		}
+
+		/* Advance tail to next range, if there is one. */
+		if (++pgsr->tail == pgsr->size)
+			pgsr->tail = 0;
+		pgsr->next_tail_buffer = 0;
+
+		/*
+		 * If tail crashed into head, and head is not empty, then it is time
+		 * to start that range.
+		 */
+		if (pgsr->tail == pgsr->head &&
+			pgsr->ranges[pgsr->head].nblocks > 0)
+			pg_streaming_read_start_head_range(pgsr);
+	}
+
+	Assert(pgsr->pinned_buffers == 0);
+
+	return InvalidBuffer;
+}
+
+void
+pg_streaming_read_free(PgStreamingRead *pgsr)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead. */
+	pgsr->finished = true;
+
+	/* Unpin anything that wasn't consumed. */
+	while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	Assert(pgsr->pinned_buffers == 0);
+	Assert(pgsr->ios_in_progress == 0);
+
+	/* Release memory. */
+	if (pgsr->per_buffer_data)
+		pfree(pgsr->per_buffer_data);
+
+	pfree(pgsr);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index bdf89bbc4dc..3b1b0ad99df 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,11 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffers() -- as above, but for multiple contiguous blocks in
+ *		two steps.
+ *
+ * WaitReadBuffers() -- second step of StartReadBuffers().
+ *
  * ReleaseBuffer() -- unpin a buffer
  *
  * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -472,10 +477,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
-								ReadBufferMode mode, BufferAccessStrategy strategy,
-								bool *hit);
+								ReadBufferMode mode, BufferAccessStrategy strategy);
 static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
 										   ForkNumber fork,
 										   BufferAccessStrategy strategy,
@@ -501,7 +505,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -782,7 +786,6 @@ Buffer
 ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				   ReadBufferMode mode, BufferAccessStrategy strategy)
 {
-	bool		hit;
 	Buffer		buf;
 
 	/*
@@ -795,15 +798,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
-							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+	buf = ReadBuffer_common(BMR_REL(reln),
+							forkNum, blockNum, mode, strategy);
+
 	return buf;
 }
 
@@ -823,13 +820,12 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
 						  BufferAccessStrategy strategy, bool permanent)
 {
-	bool		hit;
-
 	SMgrRelation smgr = smgropen(rlocator, InvalidBackendId);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
-							 mode, strategy, &hit);
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
+							 mode, strategy);
 }
 
 /*
@@ -995,35 +991,68 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	 */
 	if (buffer == InvalidBuffer)
 	{
-		bool		hit;
-
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
-								   fork, extend_to - 1, mode, strategy,
-								   &hit);
+		buffer = ReadBuffer_common(bmr, fork, extend_to - 1, mode, strategy);
 	}
 
 	return buffer;
 }
 
+/*
+ * Zero a buffer and lock it, as part of the implementation of
+ * RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK.  The buffer must be already
+ * pinned.  It does not have to be valid, but it is valid and locked on
+ * return.
+ */
+static void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
 /*
  * ReadBuffer_common -- common logic for all ReadBuffer variants
  *
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
 static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
-				  BufferAccessStrategy strategy, bool *hit)
+				  BufferAccessStrategy strategy)
 {
-	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
-	IOContext	io_context;
-	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
-
-	*hit = false;
+	ReadBuffersOperation operation;
+	Buffer		buffer;
+	int			nblocks;
+	int			flags;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1042,181 +1071,404 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
+	nblocks = 1;
+	if (mode == RBM_ZERO_ON_ERROR)
+		flags = READ_BUFFERS_ZERO_ON_ERROR;
+	else
+		flags = 0;
+	if (StartReadBuffers(bmr,
+						 &buffer,
+						 forkNum,
+						 blockNum,
+						 &nblocks,
+						 strategy,
+						 flags,
+						 &operation))
+		WaitReadBuffers(&operation);
+	Assert(nblocks == 1);		/* single block can't be short */
+
+	if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+		ZeroBuffer(buffer, mode);
+
+	return buffer;
+}
+
+static Buffer
+PrepareReadBuffer(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	Assert(blockNum != P_NEW);
 
+	Assert(bmr.smgr);
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
 	}
 	else
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
 		io_context = IOContextForStrategy(strategy);
 		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
 	}
 
-	/* At this point we do NOT hold any locks. */
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
 
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * WaitReadBuffers() (so, not for hits, and not for buffers that are
+		 * zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
 		VacuumPageHit++;
 		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageHit;
 
 		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
+	}
 
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
-		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
-		}
+	return BufferDescriptorGetBuffer(bufHdr);
+}
 
-		return BufferDescriptorGetBuffer(bufHdr);
+/*
+ * Begin reading a range of *nblocks blocks starting at blockNum.  On return,
+ * up to *nblocks pinned buffers holding those blocks
+ * are written into the buffers array, and *nblocks is updated to contain the
+ * actual number, which may be fewer than requested.
+ *
+ * If false is returned, no I/O is necessary and WaitReadBuffers() need not
+ * be called.  If true is returned, one I/O has been started, and
+ * WaitReadBuffers() must be called with the same operation object before the
+ * buffers are accessed.  Along with the operation object, the caller-supplied
+ * array of buffers must remain valid until WaitReadBuffers() is called.
+ *
+ * Currently the I/O is only started with optional operating system advice,
+ * and the real I/O happens in WaitReadBuffers().  In future work, true I/O
+ * could be initiated here.
+ */
+bool
+StartReadBuffers(BufferManagerRelation bmr,
+				 Buffer *buffers,
+				 ForkNumber forkNum,
+				 BlockNumber blockNum,
+				 int *nblocks,
+				 BufferAccessStrategy strategy,
+				 int flags,
+				 ReadBuffersOperation *operation)
+{
+	int			actual_nblocks = *nblocks;
+
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
 	}
 
-	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
-	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+	operation->bmr = bmr;
+	operation->forknum = forkNum;
+	operation->blocknum = blockNum;
+	operation->buffers = buffers;
+	operation->nblocks = actual_nblocks;
+	operation->strategy = strategy;
+	operation->flags = flags;
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	operation->io_buffers_len = 0;
 
-	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
-	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
-	else
+	for (int i = 0; i < actual_nblocks; ++i)
 	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		bool		found;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+		buffers[i] = PrepareReadBuffer(bmr,
+									   forkNum,
+									   blockNum + i,
+									   strategy,
+									   &found);
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		if (found)
+		{
+			/*
+			 * Terminate the read as soon as we get a hit.  It could be a
+			 * single buffer hit, or it could be a hit that follows a readable
+			 * range.  We don't want to create more than one readable range,
+			 * so we stop here.
+			 */
+			actual_nblocks = operation->nblocks = *nblocks = i + 1;
+		}
+		else
+		{
+			/* Extend the readable range to cover this block. */
+			operation->io_buffers_len++;
+		}
+	}
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+	if (operation->io_buffers_len > 0)
+	{
+		if (flags & READ_BUFFERS_ISSUE_ADVICE)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
-			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
-			}
-			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
+			/*
+			 * In theory we should only do this if PrepareReadBuffer() had to
+			 * allocate new buffers above.  That way, if two calls to
+			 * StartReadBuffers() were made for the same blocks before
+			 * WaitReadBuffers(), only the first would issue the advice.
+			 * That'd be a better simulation of true asynchronous I/O, which
+			 * would only start the I/O once, but isn't done here for
+			 * simplicity.  Note also that the following call might actually
+			 * issue two advice calls if we cross a segment boundary; in a
+			 * true asynchronous version we might choose to process only one
+			 * real I/O at a time in that case.
+			 */
+			smgrprefetch(bmr.smgr, forkNum, blockNum, operation->io_buffers_len);
 		}
+
+		/* Indicate that WaitReadBuffers() should be called. */
+		return true;
 	}
+	else
+	{
+		return false;
+	}
+}
 
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
+static inline bool
+WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
 	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
 	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
+
+void
+WaitReadBuffers(ReadBuffersOperation *operation)
+{
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	int			nblocks;
+	BlockNumber blocknum;
+	ForkNumber	forknum;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	/*
+	 * Currently operations are only allowed to include a read of some range,
+	 * with an optional extra buffer that is already pinned at the end.  So
+	 * nblocks can be at most one more than io_buffers_len.
+	 */
+	Assert((operation->nblocks == operation->io_buffers_len) ||
+		   (operation->nblocks == operation->io_buffers_len + 1));
 
+	/* Find the range of the physical read we need to perform. */
+	nblocks = operation->io_buffers_len;
+	if (nblocks == 0)
+		return;					/* nothing to do */
+
+	buffers = &operation->buffers[0];
+	blocknum = operation->blocknum;
+	forknum = operation->forknum;
+	bmr = operation->bmr;
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
-
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
 	}
 	else
 	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
+		io_context = IOContextForStrategy(operation->strategy);
+		io_object = IOOBJECT_RELATION;
 	}
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+	/*
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
+	 */
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
+	else
+		pgBufferUsage.shared_blks_read += nblocks;
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+	for (int i = 0; i < nblocks; ++i)
+	{
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_BUFFERS_PER_TRANSFER];
+		void	   *io_pages[MAX_BUFFERS_PER_TRANSFER];
+		instr_time	io_start;
+		BlockNumber io_first_block;
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!WaitReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PrepareReadBuffer().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
+
+		/*
+		 * How many neighboring-on-disk blocks can we scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   WaitReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
+
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
+
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
+		{
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
+			{
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
+			}
+			else
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
+
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+			}
+
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
+
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
+
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
- * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
- *		buffer.  If no buffer exists already, selects a replacement
- *		victim and evicts the old page, but does NOT read in new page.
+ * BufferAlloc -- subroutine for StartReadBuffers.  Handles lookup of a shared
+ *		buffer.  If no buffer exists already, selects a replacement victim and
+ *		evicts the old page, but does NOT read in new page.
  *
  * "strategy" can be a buffer replacement strategy object, or NULL for
  * the default strategy.  The selected buffer's usage_count is advanced when
@@ -1224,11 +1476,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1287,19 +1535,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1364,19 +1603,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1408,15 +1638,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1770,7 +1994,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2035,7 +2259,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2058,7 +2282,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2373,7 +2597,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if StartReadBuffers() was called and
+		 * WaitReadBuffers() hasn't been called yet.  We'll check by loading
+		 * the flags without locking.  This is racy, but it's OK to return
+		 * false spuriously: when WaitReadBuffers() calls StartBufferIO(),
+		 * it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2382,7 +2611,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3450,7 +3679,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -5185,9 +5414,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates that the I/O was either already
+ * finished or still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5200,6 +5435,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 1f02fed250e..6956d4e5b49 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -109,10 +109,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -288,7 +287,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -298,9 +297,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but NLocBuffer might
+	 * not be initialized yet, so read num_temp_buffers instead.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..b57f71f97e3 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
+/*
+ * Maximum number of buffers for multi-buffer I/O functions.  This is set to
+ * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a smaller maximum.
+ */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
 
 /*
  * prototypes for functions in bufmgr.c
@@ -177,6 +183,42 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+/*
+ * Private state used by StartReadBuffers() and WaitReadBuffers().  Declared
+ * in public header only to allow inclusion in other structs, but contents
+ * should not be accessed.
+ */
+struct ReadBuffersOperation
+{
+	/* Parameters passed in to StartReadBuffers(). */
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	ForkNumber	forknum;
+	BlockNumber blocknum;
+	int			nblocks;
+	BufferAccessStrategy strategy;
+	int			flags;
+
+	/* Range of buffers, if we need to perform a read. */
+	int			io_buffers_len;
+};
+
+typedef struct ReadBuffersOperation ReadBuffersOperation;
+
+extern bool StartReadBuffers(BufferManagerRelation bmr,
+							 Buffer *buffers,
+							 ForkNumber forknum,
+							 BlockNumber blocknum,
+							 int *nblocks,
+							 BufferAccessStrategy strategy,
+							 int flags,
+							 ReadBuffersOperation *operation);
+extern void WaitReadBuffers(ReadBuffersOperation *operation);
+
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -250,6 +292,9 @@ extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 00000000000..c4d3892bb26
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,52 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define PGSR_FLAG_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users.
+ */
+#define PGSR_FLAG_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define PGSR_FLAG_SEQUENTIAL 0x02
+
+/*
+ * We usually ramp up from smaller reads to larger ones, to support users who
+ * don't know if it's worth reading lots of buffers yet.  This flag disables
+ * that, declaring ahead of time that we'll be reading all available buffers.
+ */
+#define PGSR_FLAG_FULL 0x04
+
+struct PgStreamingRead;
+typedef struct PgStreamingRead PgStreamingRead;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*PgStreamingReadBufferCB) (PgStreamingRead *pgsr,
+												void *pgsr_private,
+												void *per_buffer_private);
+
+extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags,
+													   void *pgsr_private,
+													   size_t per_buffer_private_size,
+													   BufferAccessStrategy strategy,
+													   BufferManagerRelation bmr,
+													   ForkNumber forknum,
+													   PgStreamingReadBufferCB next_block_cb);
+
+extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
+extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
+extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+
+#endif
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fc8b15d0cf2..cfb58cf4836 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2097,6 +2097,8 @@ PgStat_TableCounts
 PgStat_TableStatus
 PgStat_TableXactStatus
 PgStat_WalStats
+PgStreamingRead
+PgStreamingReadRange
 PgXmlErrorContext
 PgXmlStrictness
 Pg_finfo_record
@@ -2267,6 +2269,7 @@ ReInitializeDSMForeignScan_function
 ReScanForeignScan_function
 ReadBufPtrType
 ReadBufferMode
+ReadBuffersOperation
 ReadBytePtrType
 ReadExtraTocPtrType
 ReadFunc
-- 
2.40.1

v4-0005-Sequential-scans-and-TID-range-scans-stream-reads.patchtext/x-patch; charset=US-ASCII; name=v4-0005-Sequential-scans-and-TID-range-scans-stream-reads.patchDownload
From 6091d8d7307883d76bb7cf830b69fd06ad1462b0 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 26 Feb 2024 15:41:32 -0500
Subject: [PATCH v4 5/5] Sequential scans and TID range scans stream reads

Implementing streaming read support for heap sequential scans and TID
range scans includes three parts:

Allocate the streaming read object in heap_beginscan(). On rescan, reset
the streaming read by releasing all pinned buffers and resetting the
prefetch block.

Implement a callback returning the next block to prefetch to the
streaming read API.

Invoke the streaming read API when a new page is needed and streaming
reads are enabled. When the scan direction changes, reset the streaming
read.

ci-os-only:
---
 src/backend/access/heap/heapam.c | 88 ++++++++++++++++++++++++++++----
 src/include/access/heapam.h      | 16 +++++-
 2 files changed, 93 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e0fe3d9c326..c4c184ac2cc 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -65,6 +65,7 @@
 #include "storage/smgr.h"
 #include "storage/spin.h"
 #include "storage/standby.h"
+#include "storage/streaming_read.h"
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/lsyscache.h"
@@ -228,6 +229,29 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * ----------------------------------------------------------------
  */
 
+static BlockNumber
+heap_scan_pgsr_next(PgStreamingRead *pgsr, void *pgsr_private,
+					  void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) pgsr_private;
+
+	/*
+	 * Hard-code ScanDirection to ForwardScanDirection since only forward
+	 * scans support streaming reads.
+	 */
+	if (!scan->rs_inited)
+	{
+		scan->rs_prefetch_block = heapgettup_initial_block(scan, scan->rs_dir);
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_prefetch_block = heapgettup_advance_block(scan,
+														   scan->rs_prefetch_block,
+														   scan->rs_dir);
+
+	return scan->rs_prefetch_block;
+}
+
 /* ----------------
  *		initscan - scan code common to heap_beginscan and heap_rescan
  * ----------------
@@ -330,6 +354,13 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
 
+	/*
+	 * Initialize to ForwardScanDirection because it is most common and heap
+	 * scans usually must go forwards before going backward.
+	 */
+	scan->rs_dir = ForwardScanDirection;
+	scan->rs_prefetch_block = InvalidBlockNumber;
+
 	/* page-at-a-time fields are always invalid when not rs_inited */
 
 	/*
@@ -472,6 +503,8 @@ heapbuildvis(TableScanDesc sscan)
 static inline void
 heapfetchbuf(HeapScanDesc scan, ScanDirection dir)
 {
+	Assert(scan->rs_pgsr);
+
 	/* release previous scan buffer, if any */
 	if (BufferIsValid(scan->rs_cbuf))
 	{
@@ -486,19 +519,23 @@ heapfetchbuf(HeapScanDesc scan, ScanDirection dir)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	if (!scan->rs_inited)
+	/*
+	 * If the scan direction is changing, reset the prefetch block to the
+	 * current block. Otherwise, we will incorrectly prefetch the blocks
+	 * between the prefetch block and the current block again before
+	 * prefetching blocks in the new, correct scan direction.
+	 */
+	if (scan->rs_dir != dir && scan->rs_pgsr)
 	{
-		scan->rs_cblock = heapgettup_initial_block(scan, dir);
-		Assert(scan->rs_cblock != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
-		scan->rs_inited = true;
+		scan->rs_prefetch_block = scan->rs_cblock;
+		pg_streaming_read_reset(scan->rs_pgsr);
 	}
-	else
-		scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock, dir);
 
-	/* read block if valid */
-	if (BlockNumberIsValid(scan->rs_cblock))
-		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
-										   scan->rs_cblock, RBM_NORMAL, scan->rs_strategy);
+	scan->rs_dir = dir;
+
+	scan->rs_cbuf = pg_streaming_read_buffer_get_next(scan->rs_pgsr, NULL);
+	if (BufferIsValid(scan->rs_cbuf))
+		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
 }
 
 /*
@@ -827,6 +864,7 @@ continue_page:
 
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
+	scan->rs_prefetch_block = InvalidBlockNumber;
 	tuple->t_data = NULL;
 	scan->rs_inited = false;
 }
@@ -917,6 +955,7 @@ continue_page:
 		ReleaseBuffer(scan->rs_cbuf);
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
+	scan->rs_prefetch_block = InvalidBlockNumber;
 	tuple->t_data = NULL;
 	scan->rs_inited = false;
 }
@@ -999,6 +1038,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	else
 		scan->rs_parallelworkerdata = NULL;
 
+	scan->rs_pgsr = NULL;
+
 	/*
 	 * we do this here instead of in initscan() because heap_rescan also calls
 	 * initscan() and we don't want to allocate memory again
@@ -1010,6 +1051,22 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 
 	initscan(scan, key, false);
 
+	/*
+	 * We do not know the scan direction yet. If the scan does not end up
+	 * being a forward scan, the streaming read object will be freed.
+	 */
+	if (scan->rs_base.rs_flags & SO_TYPE_SEQSCAN ||
+		scan->rs_base.rs_flags & SO_TYPE_TIDRANGESCAN)
+	{
+		scan->rs_pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_SEQUENTIAL,
+													   scan,
+													   0,
+													   scan->rs_strategy,
+													   BMR_REL(scan->rs_base.rs_rd),
+													   MAIN_FORKNUM,
+													   heap_scan_pgsr_next);
+	}
+
 	return (TableScanDesc) scan;
 }
 
@@ -1048,6 +1105,14 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 	 * reinitialize scan descriptor
 	 */
 	initscan(scan, key, true);
+
+	/*
+	 * The streaming read object is reset on rescan. This must be done after
+	 * initscan(), as some state referred to by pg_streaming_read_reset() is
+	 * reset in initscan().
+	 */
+	if (scan->rs_pgsr)
+		pg_streaming_read_reset(scan->rs_pgsr);
 }
 
 void
@@ -1063,6 +1128,9 @@ heap_endscan(TableScanDesc sscan)
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
 
+	if (scan->rs_pgsr)
+		pg_streaming_read_free(scan->rs_pgsr);
+
 	/*
 	 * decrement relation reference count and free scan descriptor storage
 	 */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e2b1b2a3ad9..a476ae62a8f 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -72,9 +72,23 @@ typedef struct HeapScanDescData
 	 */
 	ParallelBlockTableScanWorkerData *rs_parallelworkerdata;
 
-	/* these fields only used in page-at-a-time mode and for bitmap scans */
+	/* only used in page-at-a-time mode and for bitmap scans */
 	int			rs_cindex;		/* current tuple's index in vistuples */
 	int			rs_ntuples;		/* number of visible tuples on page */
+
+	/*
+	 * Fields used for streaming reads by sequential scans and TID range
+	 * scans. The streaming read object is allocated at the beginning of the
+	 * scan and reset on rescan or when the scan direction changes. The scan
+	 * direction is saved each time a new page is requested. If the scan
+	 * direction changes from one page to the next, the streaming read object
+	 * releases all previously pinned buffers and resets the prefetch block.
+	 */
+	BlockNumber rs_prefetch_block;
+	ScanDirection rs_dir;
+	struct PgStreamingRead *rs_pgsr;
+
+	/* only used in page-at-a-time mode and for bitmap scans */
 	OffsetNumber rs_vistuples[MaxHeapTuplesPerPage];	/* their offsets */
 }			HeapScanDescData;
 typedef struct HeapScanDescData *HeapScanDesc;
-- 
2.40.1

#10Melanie Plageman
melanieplageman@gmail.com
In reply to: Melanie Plageman (#9)
5 attachment(s)
Re: Streaming read-ready sequential scan code

On Sat, Mar 02, 2024 at 06:07:48PM -0500, Melanie Plageman wrote:

On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote:

On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

There is an outstanding question about where to allocate the
PgStreamingRead object for sequential scans

I've written three alternative implementations of the actual streaming
read user for sequential scan which handle the question of where to
allocate the streaming read object and how to handle changing scan
direction in different ways.

Option A) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation
- Allocates the streaming read object in initscan(). Since we do not
know the scan direction at this time, if the scan ends up not being a
forwards scan, the streaming read object must later be freed -- so
this will sometimes allocate a streaming read object it never uses.
- Only supports ForwardScanDirection and once the scan direction
changes, streaming is never supported again -- even if we return to
ForwardScanDirection
- Must maintain a "fallback" codepath that does not use the streaming read API

Attached is a version of this patch which implements a "reset"
function for the streaming read API which should be cheaper than the
full pg_streaming_read_free() on rescan. This can easily be ported to
work on any of my proposed implementations (A/B/C). I implemented it
on A as an example.

Attached is the latest version of this patchset -- rebased in light of
Thomas' updates to the streaming read API [1]. I have chosen the
approach I think we should go with. It is a hybrid of my previously
proposed approaches.
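
To make the lifecycle of the chosen approach easier to follow, here is a
rough sketch of how the scan code drives the streaming read object. It is
distilled from the attached patches rather than quoted verbatim, and it
omits error handling and the non-streaming fallback path:

    /* heap_beginscan(): allocate the stream once per scan */
    scan->rs_pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_SEQUENTIAL,
                                                   scan,    /* pgsr_private */
                                                   0,       /* no per-buffer data */
                                                   scan->rs_strategy,
                                                   BMR_REL(scan->rs_base.rs_rd),
                                                   MAIN_FORKNUM,
                                                   heap_scan_pgsr_next);

    /* heapfetchbuf(): consume one pinned buffer per call */
    scan->rs_cbuf = pg_streaming_read_buffer_get_next(scan->rs_pgsr, NULL);
    if (BufferIsValid(scan->rs_cbuf))
        scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);

    /*
     * heap_rescan() or a scan direction change: keep the object, but drop
     * its pins and prefetch position.
     */
    pg_streaming_read_reset(scan->rs_pgsr);

    /* heap_endscan(): release any remaining pins and free the object */
    pg_streaming_read_free(scan->rs_pgsr);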

While investigating some performance concerns, Andres pointed out that
the members I add to HeapScanDescData in this patch push rs_cindex and
rs_ntuples to another cacheline and introduce a 4-byte hole. Attached
v4's HeapScanDescData is as well-packed as on master and its members
are reordered so that rs_cindex and rs_ntuples are back on the second
cacheline of the struct's data.
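
As a sanity check on that layout (my own illustration, not part of the
patch), a throwaway compile-time assertion along these lines could be
added to a scratch build to confirm that rs_cindex and rs_ntuples still
end within the first two 64-byte cache lines:

    #include "postgres.h"
    #include "access/heapam.h"

    /*
     * Assumes 64-byte cache lines; rs_ntuples follows rs_cindex, so it is
     * enough to check that rs_ntuples ends within the first 128 bytes.
     */
    StaticAssertDecl(offsetof(HeapScanDescData, rs_ntuples) + sizeof(int) <= 128,
                     "rs_ntuples pushed past the second cache line");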

I did some additional profiling and realized that dropping the
unlikely() hints from the frequently executed rs_inited checks was
negatively impacting performance. v5 adds them back and also makes a
few other very minor cleanups.
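
For reference, this is the hint in question, as it appears (lightly
abridged) in heapfetchbuf() in the attached 0002; unlikely() tells the
compiler that the one-time initialization path is cold:

    if (unlikely(!scan->rs_inited))
    {
        /* cold path: choose the starting block exactly once per scan */
        scan->rs_cblock = heapgettup_initial_block(scan, dir);
        scan->rs_inited = true;
    }
    else
        scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock, dir);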

Note that this patch set includes a not-yet-released version of Thomas
Munro's Streaming Read API with new ramp-up logic that seems to fix a
performance issue I saw with my test case when all of the sequential
scan's blocks are in shared buffers. Once he sends the official new
version, I will rebase this and point to his explanation in that thread.
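
The core of that ramp-up, roughly as it appears in the attached 0003 (the
variable need_io below is just a stand-in for "StartReadBuffers() had to
start a read"), is a per-stream look-ahead distance that grows quickly
when real I/O happens and decays slowly while everything is already in
shared buffers:

    if (need_io)
    {
        /* grow the look-ahead window rapidly, up to the pin limit */
        pgsr->distance = Min(pgsr->distance * 2, pgsr->max_pinned_buffers);
    }
    else if (pgsr->distance > MAX_BUFFERS_PER_TRANSFER)
    {
        /* all cached: shrink slowly, but stay big enough for a full-sized read */
        pgsr->distance--;
    }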

- Melanie

Attachments:

v5-0001-Split-heapgetpage-into-two-parts.patchtext/x-diff; charset=us-asciiDownload
From 29827c23a11061846a7c145f430aa4712c1c30f3 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sat, 27 Jan 2024 18:39:37 -0500
Subject: [PATCH v5 1/5] Split heapgetpage into two parts

heapgetpage(), a per-block utility function used in heap scans, read a
passed-in block into a buffer and then, if page-at-a-time processing was
enabled, pruned the page and built an array of its visible tuples. This
was used for sequential and sample scans.

Future commits will add support for streaming reads. The streaming read
API will read in the blocks specified by a callback, but any significant
per-page processing should be done synchronously on the buffer yielded
by the streaming read API. To support this, separate the logic in
heapgetpage() to read a block into a buffer from that which prunes the
page and builds an array of the visible tuples. The former is now
heapfetchbuf() and the latter is now heapbuildvis().

Future commits will push the logic for selecting the next block into
heapfetchbuf() in cases when streaming reads are not supported (such as
backwards sequential scans). Because this logic differs for sample scans
and sequential scans, inline the code to read the block into a buffer
for sample scans.

This has the added benefit of allowing for a bit of refactoring in
heapam_scan_sample_next_block(), including unpinning the previous buffer
before invoking the callback to select the next block.
---
 src/backend/access/heap/heapam.c         | 74 ++++++++++++++----------
 src/backend/access/heap/heapam_handler.c | 40 +++++++++----
 src/include/access/heapam.h              |  2 +-
 3 files changed, 72 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 34bc60f625f..aef90d28473 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -363,17 +363,18 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
 }
 
 /*
- * heapgetpage - subroutine for heapgettup()
+ * heapbuildvis - Utility function for heap scans.
  *
- * This routine reads and pins the specified page of the relation.
- * In page-at-a-time mode it performs additional work, namely determining
- * which tuples on the page are visible.
+ * Given a page residing in a buffer saved in the scan descriptor, prune the
+ * page and determine which of its tuples are all visible, saving their offsets
+ * in an array in the scan descriptor.
  */
 void
-heapgetpage(TableScanDesc sscan, BlockNumber block)
+heapbuildvis(TableScanDesc sscan)
 {
 	HeapScanDesc scan = (HeapScanDesc) sscan;
-	Buffer		buffer;
+	Buffer		buffer = scan->rs_cbuf;
+	BlockNumber block = scan->rs_cblock;
 	Snapshot	snapshot;
 	Page		page;
 	int			lines;
@@ -381,31 +382,8 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	OffsetNumber lineoff;
 	bool		all_visible;
 
-	Assert(block < scan->rs_nblocks);
+	Assert(BufferGetBlockNumber(buffer) == block);
 
-	/* release previous scan buffer, if any */
-	if (BufferIsValid(scan->rs_cbuf))
-	{
-		ReleaseBuffer(scan->rs_cbuf);
-		scan->rs_cbuf = InvalidBuffer;
-	}
-
-	/*
-	 * Be sure to check for interrupts at least once per page.  Checks at
-	 * higher code levels won't be able to stop a seqscan that encounters many
-	 * pages' worth of consecutive dead tuples.
-	 */
-	CHECK_FOR_INTERRUPTS();
-
-	/* read page using selected strategy */
-	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
-									   RBM_NORMAL, scan->rs_strategy);
-	scan->rs_cblock = block;
-
-	if (!(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE))
-		return;
-
-	buffer = scan->rs_cbuf;
 	snapshot = scan->rs_base.rs_snapshot;
 
 	/*
@@ -478,6 +456,37 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	scan->rs_ntuples = ntup;
 }
 
+/*
+ * heapfetchbuf - subroutine for heapgettup()
+ *
+ * This routine reads the specified block of the relation into a buffer and
+ * returns with that pinned buffer saved in the scan descriptor.
+ */
+static inline void
+heapfetchbuf(HeapScanDesc scan, BlockNumber block)
+{
+	Assert(block < scan->rs_nblocks);
+
+	/* release previous scan buffer, if any */
+	if (BufferIsValid(scan->rs_cbuf))
+	{
+		ReleaseBuffer(scan->rs_cbuf);
+		scan->rs_cbuf = InvalidBuffer;
+	}
+
+	/*
+	 * Be sure to check for interrupts at least once per page.  Checks at
+	 * higher code levels won't be able to stop a seqscan that encounters many
+	 * pages' worth of consecutive dead tuples.
+	 */
+	CHECK_FOR_INTERRUPTS();
+
+	/* read page using selected strategy */
+	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
+									   RBM_NORMAL, scan->rs_strategy);
+	scan->rs_cblock = block;
+}
+
 /*
  * heapgettup_initial_block - return the first BlockNumber to scan
  *
@@ -751,7 +760,7 @@ heapgettup(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapfetchbuf(scan, block);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_start_page(scan, dir, &linesleft, &lineoff);
 continue_page:
@@ -872,7 +881,8 @@ heapgettup_pagemode(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapfetchbuf(scan, block);
+		heapbuildvis((TableScanDesc) scan);
 		page = BufferGetPage(scan->rs_cbuf);
 		linesleft = scan->rs_ntuples;
 		lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 680a50bf8b1..8dc137995f0 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2276,11 +2276,14 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 	if (hscan->rs_nblocks == 0)
 		return false;
 
-	if (tsm->NextSampleBlock)
+	if (BufferIsValid(hscan->rs_cbuf))
 	{
-		blockno = tsm->NextSampleBlock(scanstate, hscan->rs_nblocks);
-		hscan->rs_cblock = blockno;
+		ReleaseBuffer(hscan->rs_cbuf);
+		hscan->rs_cbuf = InvalidBuffer;
 	}
+
+	if (tsm->NextSampleBlock)
+		blockno = tsm->NextSampleBlock(scanstate, hscan->rs_nblocks);
 	else
 	{
 		/* scanning table sequentially */
@@ -2322,20 +2325,35 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 		}
 	}
 
-	if (!BlockNumberIsValid(blockno))
+	hscan->rs_cblock = blockno;
+
+	if (!BlockNumberIsValid(hscan->rs_cblock))
 	{
-		if (BufferIsValid(hscan->rs_cbuf))
-			ReleaseBuffer(hscan->rs_cbuf);
-		hscan->rs_cbuf = InvalidBuffer;
-		hscan->rs_cblock = InvalidBlockNumber;
 		hscan->rs_inited = false;
-
 		return false;
 	}
 
-	heapgetpage(scan, blockno);
-	hscan->rs_inited = true;
+	Assert(hscan->rs_cblock < hscan->rs_nblocks);
+
+	/*
+	 * We may scan multiple pages before finding tuples to yield or finishing
+	 * the scan. Since we want to check for interrupts at least once per page,
+	 * do so here.
+	 */
+	CHECK_FOR_INTERRUPTS();
+
+	/* Read page using selected strategy */
+	hscan->rs_cbuf = ReadBufferExtended(hscan->rs_base.rs_rd, MAIN_FORKNUM,
+										hscan->rs_cblock, RBM_NORMAL, hscan->rs_strategy);
 
+	/*
+	 * If pagemode is allowed, prune the page and build an array of visible
+	 * tuple offsets.
+	 */
+	if (hscan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
+		heapbuildvis(scan);
+
+	hscan->rs_inited = true;
 	return true;
 }
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4b133f68593..e2b1b2a3ad9 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -246,7 +246,7 @@ extern TableScanDesc heap_beginscan(Relation relation, Snapshot snapshot,
 									uint32 flags);
 extern void heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk,
 							   BlockNumber numBlks);
-extern void heapgetpage(TableScanDesc sscan, BlockNumber block);
+extern void heapbuildvis(TableScanDesc sscan);
 extern void heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 						bool allow_strat, bool allow_sync, bool allow_pagemode);
 extern void heap_endscan(TableScanDesc sscan);
-- 
2.40.1

v5-0002-Replace-blocks-with-buffers-in-heapgettup-control.patchtext/x-diff; charset=us-asciiDownload
From a2ead3ce07f88023a573a8be86f21b6e7ed13047 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 29 Jan 2024 11:50:01 -0500
Subject: [PATCH v5 2/5] Replace blocks with buffers in heapgettup control flow

Future commits will introduce the streaming read API and the sequential
scan streaming read API user. Streaming read API users implement a
callback which returns the next block to read. Sequential scans
previously looped through the blocks in the relation, synchronously
reading in a block and then processing it. An InvalidBlockNumber
returned by heapgettup_advance_block() meant that the relation was
exhausted and all blocks had been processed.

The streaming read API may exhaust the blocks in a relation (having read
all of them into buffers) before they have all been processed by the
sequential scan. As such, the sequential scan should continue processing
blocks until heapfetchbuf() returns InvalidBuffer.

Note that this commit does not implement the streaming read API user. It
simply restructures heapgettup() and heapgettup_pagemode() to use
buffers instead of blocks for control flow.

Not all sequential scans will support streaming reads. As such, this
code will remain for compatibility even after sequential scans support
streaming reads.
---
 src/backend/access/heap/heapam.c | 79 ++++++++++++++------------------
 1 file changed, 35 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index aef90d28473..adde61fca60 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -83,6 +83,9 @@ static Bitmapset *HeapDetermineColumnsInfo(Relation relation,
 static bool heap_acquire_tuplock(Relation relation, ItemPointer tid,
 								 LockTupleMode mode, LockWaitPolicy wait_policy,
 								 bool *have_tuple_lock);
+static inline BlockNumber heapgettup_advance_block(HeapScanDesc scan,
+												   BlockNumber block, ScanDirection dir);
+static inline BlockNumber heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir);
 static void compute_new_xmax_infomask(TransactionId xmax, uint16 old_infomask,
 									  uint16 old_infomask2, TransactionId add_to_xmax,
 									  LockTupleMode mode, bool is_update,
@@ -459,14 +462,12 @@ heapbuildvis(TableScanDesc sscan)
 /*
  * heapfetchbuf - subroutine for heapgettup()
  *
- * This routine reads the specified block of the relation into a buffer and
- * returns with that pinned buffer saved in the scan descriptor.
+ * This routine reads the next block of the relation into a buffer and returns
+ * with that pinned buffer saved in the scan descriptor.
  */
 static inline void
-heapfetchbuf(HeapScanDesc scan, BlockNumber block)
+heapfetchbuf(HeapScanDesc scan, ScanDirection dir)
 {
-	Assert(block < scan->rs_nblocks);
-
 	/* release previous scan buffer, if any */
 	if (BufferIsValid(scan->rs_cbuf))
 	{
@@ -481,10 +482,19 @@ heapfetchbuf(HeapScanDesc scan, BlockNumber block)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	/* read page using selected strategy */
-	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
-									   RBM_NORMAL, scan->rs_strategy);
-	scan->rs_cblock = block;
+	if (unlikely(!scan->rs_inited))
+	{
+		scan->rs_cblock = heapgettup_initial_block(scan, dir);
+		Assert(scan->rs_cblock != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock, dir);
+
+	/* read block if valid */
+	if (BlockNumberIsValid(scan->rs_cblock))
+		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
+										   scan->rs_cblock, RBM_NORMAL, scan->rs_strategy);
 }
 
 /*
@@ -494,7 +504,7 @@ heapfetchbuf(HeapScanDesc scan, BlockNumber block)
  * occur with empty tables and in parallel scans when parallel workers get all
  * of the pages before we can get a chance to get our first page.
  */
-static BlockNumber
+BlockNumber
 heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
 {
 	Assert(!scan->rs_inited);
@@ -634,7 +644,7 @@ heapgettup_continue_page(HeapScanDesc scan, ScanDirection dir, int *linesleft,
  * This also adjusts rs_numblocks when a limit has been imposed by
  * heap_setscanlimits().
  */
-static inline BlockNumber
+BlockNumber
 heapgettup_advance_block(HeapScanDesc scan, BlockNumber block, ScanDirection dir)
 {
 	if (ScanDirectionIsForward(dir))
@@ -732,23 +742,13 @@ heapgettup(HeapScanDesc scan,
 		   ScanKey key)
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
-	BlockNumber block;
 	Page		page;
 	OffsetNumber lineoff;
 	int			linesleft;
 
-	if (unlikely(!scan->rs_inited))
-	{
-		block = heapgettup_initial_block(scan, dir);
-		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
-		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
-		scan->rs_inited = true;
-	}
-	else
+	if (likely(scan->rs_inited))
 	{
 		/* continue from previously returned page/tuple */
-		block = scan->rs_cblock;
-
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_continue_page(scan, dir, &linesleft, &lineoff);
 		goto continue_page;
@@ -758,9 +758,12 @@ heapgettup(HeapScanDesc scan,
 	 * advance the scan until we find a qualifying tuple or run out of stuff
 	 * to scan
 	 */
-	while (block != InvalidBlockNumber)
+	while (true)
 	{
-		heapfetchbuf(scan, block);
+		heapfetchbuf(scan, dir);
+		if (!BufferIsValid(scan->rs_cbuf))
+			break;
+		Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_start_page(scan, dir, &linesleft, &lineoff);
 continue_page:
@@ -782,7 +785,7 @@ continue_page:
 
 			tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
 			tuple->t_len = ItemIdGetLength(lpp);
-			ItemPointerSet(&(tuple->t_self), block, lineoff);
+			ItemPointerSet(&(tuple->t_self), scan->rs_cblock, lineoff);
 
 			visible = HeapTupleSatisfiesVisibility(tuple,
 												   scan->rs_base.rs_snapshot,
@@ -812,9 +815,6 @@ continue_page:
 		 * it's time to move to the next.
 		 */
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
 	}
 
 	/* end of scan */
@@ -847,22 +847,13 @@ heapgettup_pagemode(HeapScanDesc scan,
 					ScanKey key)
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
-	BlockNumber block;
 	Page		page;
 	int			lineindex;
 	int			linesleft;
 
-	if (unlikely(!scan->rs_inited))
-	{
-		block = heapgettup_initial_block(scan, dir);
-		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
-		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
-		scan->rs_inited = true;
-	}
-	else
+	if (likely(scan->rs_inited))
 	{
 		/* continue from previously returned page/tuple */
-		block = scan->rs_cblock;	/* current page */
 		page = BufferGetPage(scan->rs_cbuf);
 
 		lineindex = scan->rs_cindex + dir;
@@ -879,9 +870,12 @@ heapgettup_pagemode(HeapScanDesc scan,
 	 * advance the scan until we find a qualifying tuple or run out of stuff
 	 * to scan
 	 */
-	while (block != InvalidBlockNumber)
+	while (true)
 	{
-		heapfetchbuf(scan, block);
+		heapfetchbuf(scan, dir);
+		if (!BufferIsValid(scan->rs_cbuf))
+			break;
+		Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
 		heapbuildvis((TableScanDesc) scan);
 		page = BufferGetPage(scan->rs_cbuf);
 		linesleft = scan->rs_ntuples;
@@ -901,7 +895,7 @@ continue_page:
 
 			tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
 			tuple->t_len = ItemIdGetLength(lpp);
-			ItemPointerSet(&(tuple->t_self), block, lineoff);
+			ItemPointerSet(&(tuple->t_self), scan->rs_cblock, lineoff);
 
 			/* skip any tuples that don't match the scan key */
 			if (key != NULL &&
@@ -912,9 +906,6 @@ continue_page:
 			scan->rs_cindex = lineindex;
 			return;
 		}
-
-		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
 	}
 
 	/* end of scan */
-- 
2.40.1

v5-0003-fixup-Streaming-Read-API.patchtext/x-diff; charset=us-asciiDownload
From 3bf107c6bac885ce65253850abb357eedbe94ad8 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v5 3/5] fixup: Streaming Read API

---
 contrib/pg_prewarm/pg_prewarm.c          |  40 +-
 src/backend/storage/Makefile             |   2 +-
 src/backend/storage/aio/Makefile         |  14 +
 src/backend/storage/aio/meson.build      |   5 +
 src/backend/storage/aio/streaming_read.c | 611 +++++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c      | 641 ++++++++++++++++-------
 src/backend/storage/buffer/localbuf.c    |  14 +-
 src/backend/storage/meson.build          |   1 +
 src/include/storage/bufmgr.h             |  45 ++
 src/include/storage/streaming_read.h     |  52 ++
 src/tools/pgindent/typedefs.list         |   3 +
 11 files changed, 1217 insertions(+), 211 deletions(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c
 create mode 100644 src/include/storage/streaming_read.h

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 8541e4d6e46..1cc84bcb0c2 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
+#include "storage/streaming_read.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -38,6 +39,25 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
+struct pg_prewarm_streaming_read_private
+{
+	BlockNumber blocknum;
+	int64		last_block;
+};
+
+static BlockNumber
+pg_prewarm_streaming_read_next(PgStreamingRead *pgsr,
+							   void *pgsr_private,
+							   void *per_buffer_data)
+{
+	struct pg_prewarm_streaming_read_private *p = pgsr_private;
+
+	if (p->blocknum <= p->last_block)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -183,18 +203,36 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
+		struct pg_prewarm_streaming_read_private p;
+		PgStreamingRead *pgsr;
+
 		/*
 		 * In buffer mode, we actually pull the data into shared_buffers.
 		 */
+
+		/* Set up the private state for our streaming buffer read callback. */
+		p.blocknum = first_block;
+		p.last_block = last_block;
+
+		pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_FULL,
+											  &p,
+											  0,
+											  NULL,
+											  BMR_REL(rel),
+											  forkNumber,
+											  pg_prewarm_streaming_read_next);
+
 		for (block = first_block; block <= last_block; ++block)
 		{
 			Buffer		buf;
 
 			CHECK_FOR_INTERRUPTS();
-			buf = ReadBufferExtended(rel, forkNumber, block, RBM_NORMAL, NULL);
+			buf = pg_streaming_read_buffer_get_next(pgsr, NULL);
 			ReleaseBuffer(buf);
 			++blocks_done;
 		}
+		Assert(pg_streaming_read_buffer_get_next(pgsr, NULL) == InvalidBuffer);
+		pg_streaming_read_free(pgsr);
 	}
 
 	/* Close relation, release lock. */
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 00000000000..bcab44c802f
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 00000000000..39aef2a84a2
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 00000000000..e437abebe4c
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,611 @@
+#include "postgres.h"
+
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+
+/*
+ * Element type for PgStreamingRead's circular array of block ranges.
+ */
+typedef struct PgStreamingReadRange
+{
+	bool		need_wait;
+	bool		advice_issued;
+	BlockNumber blocknum;
+	int			nblocks;
+	int			per_buffer_data_index;
+	Buffer		buffers[MAX_BUFFERS_PER_TRANSFER];
+	ReadBuffersOperation operation;
+} PgStreamingReadRange;
+
+/*
+ * Streaming read object.
+ */
+struct PgStreamingRead
+{
+	int			max_ios;
+	int			ios_in_progress;
+	int			max_pinned_buffers;
+	int			pinned_buffers;
+	int			pinned_buffers_trigger;
+	int			next_tail_buffer;
+	int			distance;
+	bool		finished;
+	bool		advice_enabled;
+	void	   *pgsr_private;
+	PgStreamingReadBufferCB callback;
+
+	BufferAccessStrategy strategy;
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+
+	/* Sometimes we need to buffer one block for flow control. */
+	BlockNumber unget_blocknum;
+	void	   *unget_per_buffer_data;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* Space for optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+
+	/* Circular buffer of ranges. */
+	int			size;
+	int			head;
+	int			tail;
+	PgStreamingReadRange ranges[FLEXIBLE_ARRAY_MEMBER];
+};
+
+static PgStreamingRead *
+pg_streaming_read_buffer_alloc_internal(int flags,
+										void *pgsr_private,
+										size_t per_buffer_data_size,
+										BufferAccessStrategy strategy)
+{
+	PgStreamingRead *pgsr;
+	int			size;
+	int			max_ios;
+	uint32		max_pinned_buffers;
+
+
+	/*
+	 * Decide how many assumed I/Os we will allow to run concurrently.  That
+	 * is, advice to the kernel to tell it that we will soon read.  This
+	 * number also affects how far we look ahead for opportunities to start
+	 * more I/Os.
+	 */
+	if (flags & PGSR_FLAG_MAINTENANCE)
+		max_ios = maintenance_io_concurrency;
+	else
+		max_ios = effective_io_concurrency;
+
+	/*
+	 * The desired level of I/O concurrency controls how far ahead we are
+	 * willing to look ahead.  We also clamp it to at least
+	 * MAX_BUFFERS_PER_TRANSFER so that we can have a chance to build up a full
+	 * sized read, even when max_ios is zero.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, MAX_BUFFERS_PER_TRANSFER);
+
+	/*
+	 * The *_io_concurrency GUCs might be set to 0, but we want to allow at
+	 * least one, to keep our gating logic simple.
+	 */
+	max_ios = Max(max_ios, 1);
+
+	/*
+	 * Don't allow this backend to pin too many buffers.  For now we'll apply
+	 * the limit for the shared buffer pool and the local buffer pool, without
+	 * worrying which it is.
+	 */
+	LimitAdditionalPins(&max_pinned_buffers);
+	LimitAdditionalLocalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * pgsr->ranges is a circular buffer.  When it is empty, head == tail.
+	 * When it is full, there is an empty element between head and tail.  Head
+	 * can also be empty (nblocks == 0), therefore we need two extra elements
+	 * for non-occupied ranges, on top of max_pinned_buffers to allow for the
+	 * maximum possible number of occupied ranges of the smallest possible
+	 * size of one.
+	 */
+	size = max_pinned_buffers + 2;
+
+	pgsr = (PgStreamingRead *)
+		palloc0(offsetof(PgStreamingRead, ranges) +
+				sizeof(pgsr->ranges[0]) * size);
+
+	pgsr->max_ios = max_ios;
+	pgsr->per_buffer_data_size = per_buffer_data_size;
+	pgsr->max_pinned_buffers = max_pinned_buffers;
+	pgsr->pgsr_private = pgsr_private;
+	pgsr->strategy = strategy;
+	pgsr->size = size;
+
+	pgsr->unget_blocknum = InvalidBlockNumber;
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  As long as direct I/O isn't
+	 * enabled, and the caller hasn't promised sequential access, we can use
+	 * it.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & PGSR_FLAG_SEQUENTIAL) == 0)
+		pgsr->advice_enabled = true;
+#endif
+
+	/*
+	 * Skip the initial ramp-up phase if the caller says we're going to be
+	 * reading the whole relation.  This way we start out doing full-sized
+	 * reads.
+	 */
+	if (flags & PGSR_FLAG_FULL)
+		pgsr->distance = Min(MAX_BUFFERS_PER_TRANSFER, pgsr->max_pinned_buffers);
+	else
+		pgsr->distance = 1;
+
+	/*
+	 * We want to avoid creating ranges that are smaller than they could be
+	 * just because we hit max_pinned_buffers.  We only look ahead when the
+	 * number of pinned buffers falls below this trigger number, or put
+	 * another way, we stop looking ahead when we wouldn't be able to build a
+	 * "full sized" range.
+	 */
+	pgsr->pinned_buffers_trigger =
+		Max(1, (int) max_pinned_buffers - MAX_BUFFERS_PER_TRANSFER);
+
+	/* Space for the callback to store extra data along with each block. */
+	if (per_buffer_data_size)
+		pgsr->per_buffer_data = palloc(per_buffer_data_size * max_pinned_buffers);
+
+	return pgsr;
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.
+ */
+PgStreamingRead *
+pg_streaming_read_buffer_alloc(int flags,
+							   void *pgsr_private,
+							   size_t per_buffer_data_size,
+							   BufferAccessStrategy strategy,
+							   BufferManagerRelation bmr,
+							   ForkNumber forknum,
+							   PgStreamingReadBufferCB next_block_cb)
+{
+	PgStreamingRead *result;
+
+	result = pg_streaming_read_buffer_alloc_internal(flags,
+													 pgsr_private,
+													 per_buffer_data_size,
+													 strategy);
+	result->callback = next_block_cb;
+	result->bmr = bmr;
+	result->forknum = forknum;
+
+	return result;
+}
+
+/*
+ * Find the per-buffer data index for the Nth block of a range.
+ */
+static int
+get_per_buffer_data_index(PgStreamingRead *pgsr, PgStreamingReadRange *range, int n)
+{
+	int			result;
+
+	/*
+	 * Find slot in the circular buffer of per-buffer data, without using the
+	 * expensive % operator.
+	 */
+	result = range->per_buffer_data_index + n;
+	if (result >= pgsr->max_pinned_buffers)
+		result -= pgsr->max_pinned_buffers;
+	Assert(result == (range->per_buffer_data_index + n) % pgsr->max_pinned_buffers);
+
+	return result;
+}
+
+/*
+ * Return a pointer to the per-buffer data by index.
+ */
+static void *
+get_per_buffer_data_by_index(PgStreamingRead *pgsr, int per_buffer_data_index)
+{
+	return (char *) pgsr->per_buffer_data +
+		pgsr->per_buffer_data_size * per_buffer_data_index;
+}
+
+/*
+ * Return a pointer to the per-buffer data for the Nth block of a range.
+ */
+static void *
+get_per_buffer_data(PgStreamingRead *pgsr, PgStreamingReadRange *range, int n)
+{
+	return get_per_buffer_data_by_index(pgsr,
+										get_per_buffer_data_index(pgsr,
+																  range,
+																  n));
+}
+
+/*
+ * Start reading the head range, and create a new head range.  The new head
+ * range is returned.  It may not be empty, if StartReadBuffers() couldn't
+ * start the entire range; in that case the returned range contains the
+ * remaining portion of the range.
+ */
+static PgStreamingReadRange *
+pg_streaming_read_start_head_range(PgStreamingRead *pgsr)
+{
+	PgStreamingReadRange *head_range;
+	PgStreamingReadRange *new_head_range;
+	int			nblocks_pinned;
+	int			flags;
+
+	/* Caller should make sure we never exceed max_ios. */
+	Assert(pgsr->ios_in_progress < pgsr->max_ios);
+
+	/* Should only call if the head range has some blocks to read. */
+	head_range = &pgsr->ranges[pgsr->head];
+	Assert(head_range->nblocks > 0);
+
+	/*
+	 * If advice hasn't been suppressed, this system supports it, and this
+	 * isn't a strictly sequential pattern, then we'll issue advice.
+	 */
+	if (pgsr->advice_enabled && head_range->blocknum != pgsr->seq_blocknum)
+		flags = READ_BUFFERS_ISSUE_ADVICE;
+	else
+		flags = 0;
+
+
+	/* Start reading as many blocks as we can from the head range. */
+	nblocks_pinned = head_range->nblocks;
+	head_range->need_wait =
+		StartReadBuffers(pgsr->bmr,
+						 head_range->buffers,
+						 pgsr->forknum,
+						 head_range->blocknum,
+						 &nblocks_pinned,
+						 pgsr->strategy,
+						 flags,
+						 &head_range->operation);
+
+	if (head_range->need_wait)
+	{
+		/*
+		 * I/O necessary.  Look-ahead distance increases rapidly until it hits
+		 * the pin limit.
+		 */
+		if (pgsr->distance < pgsr->max_pinned_buffers)
+		{
+			int		distance;
+
+			distance = pgsr->distance * 2;
+			distance = Min(distance, pgsr->max_pinned_buffers);
+			pgsr->distance = distance;
+		}
+
+		/* Count an I/O in progress until we've "waited". */
+		if (flags & READ_BUFFERS_ISSUE_ADVICE)
+		{
+			head_range->advice_issued = true;
+			pgsr->ios_in_progress++;
+			Assert(pgsr->ios_in_progress <= pgsr->max_ios);
+		}
+	}
+	else
+	{
+		/*
+		 * No I/O necessary. Look-ahead distance decays slowly, but stays high
+		 * enough to form a full sized I/O.
+		 */
+		if (pgsr->distance > MAX_BUFFERS_PER_TRANSFER)
+			pgsr->distance--;
+	}
+
+	/*
+	 * StartReadBuffers() might have pinned fewer blocks than we asked it to,
+	 * but always at least one.
+	 */
+	Assert(nblocks_pinned <= head_range->nblocks);
+	Assert(nblocks_pinned >= 1);
+	pgsr->pinned_buffers += nblocks_pinned;
+
+	/*
+	 * Remember where the next block would be after that, so we can detect
+	 * sequential access next time.
+	 */
+	pgsr->seq_blocknum = head_range->blocknum + nblocks_pinned;
+
+	/*
+	 * Create a new head range.  There must be space, because we have enough
+	 * elements for every range to hold just one block, up to the pin limit.
+	 */
+	Assert(pgsr->size > pgsr->max_pinned_buffers);
+	Assert((pgsr->head + 1) % pgsr->size != pgsr->tail);
+	if (++pgsr->head == pgsr->size)
+		pgsr->head = 0;
+	new_head_range = &pgsr->ranges[pgsr->head];
+	new_head_range->nblocks = 0;
+	new_head_range->advice_issued = false;
+
+	/*
+	 * If we didn't manage to start the whole read above, we split the range,
+	 * moving the remainder into the new head range.
+	 */
+	if (nblocks_pinned < head_range->nblocks)
+	{
+		int			nblocks_remaining = head_range->nblocks - nblocks_pinned;
+
+		head_range->nblocks = nblocks_pinned;
+
+		new_head_range->blocknum = head_range->blocknum + nblocks_pinned;
+		new_head_range->nblocks = nblocks_remaining;
+	}
+
+	/* The new range has per-buffer data starting after the previous range. */
+	new_head_range->per_buffer_data_index =
+		get_per_buffer_data_index(pgsr, head_range, nblocks_pinned);
+
+	return new_head_range;
+}
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow pg_streaming_unget_block() to work.
+ */
+static BlockNumber
+pg_streaming_get_block(PgStreamingRead *pgsr, void *per_buffer_data)
+{
+	BlockNumber result;
+
+	if (unlikely(pgsr->unget_blocknum != InvalidBlockNumber))
+	{
+		/*
+		 * If we had to unget a block, now it is time to return that one
+		 * again.
+		 */
+		result = pgsr->unget_blocknum;
+		pgsr->unget_blocknum = InvalidBlockNumber;
+
+		/*
+		 * The same per_buffer_data element must have been used, and still
+		 * contains whatever data the callback wrote into it.  So we just
+		 * sanity-check that we were called with the value that
+		 * pg_streaming_unget_block() pushed back.
+		 */
+		Assert(per_buffer_data == pgsr->unget_per_buffer_data);
+	}
+	else
+	{
+		/* Use the installed callback directly. */
+		result = pgsr->callback(pgsr, pgsr->pgsr_private, per_buffer_data);
+	}
+
+	return result;
+}
+
+/*
+ * In order to deal with short reads in StartReadBuffers(), we sometimes need
+ * to defer handling of a block until later.  This *must* be called with the
+ * last value returned by pg_streaming_get_block().
+ */
+static void
+pg_streaming_unget_block(PgStreamingRead *pgsr, BlockNumber blocknum, void *per_buffer_data)
+{
+	Assert(pgsr->unget_blocknum == InvalidBlockNumber);
+	pgsr->unget_blocknum = blocknum;
+	pgsr->unget_per_buffer_data = per_buffer_data;
+}
+
+static void
+pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
+{
+	PgStreamingReadRange *range;
+
+	/*
+	 * If we're finished or can't start more I/O, then don't look ahead.
+	 */
+	if (pgsr->finished || pgsr->ios_in_progress == pgsr->max_ios)
+		return;
+
+	/*
+	 * We'll also wait until the number of pinned buffers falls below our
+	 * trigger level, so that we have the chance to create a full range.
+	 */
+	if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+		return;
+
+	do
+	{
+		BlockNumber blocknum;
+		void	   *per_buffer_data;
+
+		/* Do we have a full-sized range? */
+		range = &pgsr->ranges[pgsr->head];
+		if (range->nblocks == lengthof(range->buffers))
+		{
+			/* Start as much of it as we can. */
+			range = pg_streaming_read_start_head_range(pgsr);
+
+			/* If we're now at the I/O limit, stop here. */
+			if (pgsr->ios_in_progress == pgsr->max_ios)
+				return;
+
+			/*
+			 * If we couldn't form a full range, then stop here to avoid
+			 * creating small I/O.
+			 */
+			if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+				return;
+
+			/*
+			 * That might have only been partially started, but always
+			 * processes at least one so that'll do for now.
+			 */
+			Assert(range->nblocks < lengthof(range->buffers));
+		}
+
+		/* Find per-buffer data slot for the next block. */
+		per_buffer_data = get_per_buffer_data(pgsr, range, range->nblocks);
+
+		/* Find out which block the callback wants to read next. */
+		blocknum = pg_streaming_get_block(pgsr, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			/* End of stream. */
+			pgsr->finished = true;
+			break;
+		}
+
+		/*
+		 * Is there a head range that we cannot extend, because the requested
+		 * block is not consecutive?
+		 */
+		if (range->nblocks > 0 &&
+			range->blocknum + range->nblocks != blocknum)
+		{
+			/* Yes.  Start it, so we can begin building a new one. */
+			range = pg_streaming_read_start_head_range(pgsr);
+
+			/*
+			 * It's possible that it was only partially started, and we have a
+			 * new range with the remainder.  Keep starting I/Os until we get
+			 * it all out of the way, or we hit the I/O limit.
+			 */
+			while (range->nblocks > 0 && pgsr->ios_in_progress < pgsr->max_ios)
+				range = pg_streaming_read_start_head_range(pgsr);
+
+			/*
+			 * We have to 'unget' the block returned by the callback if we
+			 * don't have enough I/O capacity left to start something.
+			 */
+			if (pgsr->ios_in_progress == pgsr->max_ios)
+			{
+				pg_streaming_unget_block(pgsr, blocknum, per_buffer_data);
+				return;
+			}
+		}
+
+		/* If we have a new, empty range, initialize the start block. */
+		if (range->nblocks == 0)
+		{
+			range->blocknum = blocknum;
+		}
+
+		/* This block extends the range by one. */
+		Assert(range->blocknum + range->nblocks == blocknum);
+		range->nblocks++;
+
+	} while (pgsr->pinned_buffers + range->nblocks < pgsr->distance);
+
+	/* Start as much as we can. */
+	while (range->nblocks > 0)
+	{
+		range = pg_streaming_read_start_head_range(pgsr);
+		if (pgsr->ios_in_progress == pgsr->max_ios)
+			break;
+	}
+}
+
+Buffer
+pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
+{
+	pg_streaming_read_look_ahead(pgsr);
+
+	/* See if we have one buffer to return. */
+	while (pgsr->tail != pgsr->head)
+	{
+		PgStreamingReadRange *tail_range;
+
+		tail_range = &pgsr->ranges[pgsr->tail];
+
+		/*
+		 * Do we need to perform an I/O before returning the buffers from this
+		 * range?
+		 */
+		if (tail_range->need_wait)
+		{
+			WaitReadBuffers(&tail_range->operation);
+			tail_range->need_wait = false;
+
+			/*
+			 * We don't really know if the kernel generated a physical I/O
+			 * when we issued advice, let alone when it finished, but it has
+			 * certainly finished now because we've performed the read.
+			 */
+			if (tail_range->advice_issued)
+			{
+				Assert(pgsr->ios_in_progress > 0);
+				pgsr->ios_in_progress--;
+			}
+		}
+
+		/* Are there more buffers available in this range? */
+		if (pgsr->next_tail_buffer < tail_range->nblocks)
+		{
+			int			buffer_index;
+			Buffer		buffer;
+
+			buffer_index = pgsr->next_tail_buffer++;
+			buffer = tail_range->buffers[buffer_index];
+
+			Assert(BufferIsValid(buffer));
+
+			/* We are giving away ownership of this pinned buffer. */
+			Assert(pgsr->pinned_buffers > 0);
+			pgsr->pinned_buffers--;
+
+			if (per_buffer_data)
+				*per_buffer_data = get_per_buffer_data(pgsr, tail_range, buffer_index);
+
+			return buffer;
+		}
+
+		/* Advance tail to next range, if there is one. */
+		if (++pgsr->tail == pgsr->size)
+			pgsr->tail = 0;
+		pgsr->next_tail_buffer = 0;
+
+		/*
+		 * If tail crashed into head, and head is not empty, then it is time
+		 * to start that range.
+		 */
+		if (pgsr->tail == pgsr->head &&
+			pgsr->ranges[pgsr->head].nblocks > 0)
+			pg_streaming_read_start_head_range(pgsr);
+	}
+
+	Assert(pgsr->pinned_buffers == 0);
+
+	return InvalidBuffer;
+}
+
+void
+pg_streaming_read_free(PgStreamingRead *pgsr)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead. */
+	pgsr->finished = true;
+
+	/* Unpin anything that wasn't consumed. */
+	while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	Assert(pgsr->pinned_buffers == 0);
+	Assert(pgsr->ios_in_progress == 0);
+
+	/* Release memory. */
+	if (pgsr->per_buffer_data)
+		pfree(pgsr->per_buffer_data);
+
+	pfree(pgsr);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c5..729d1f91721 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,11 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffers() -- as above, but for multiple contiguous blocks in
+ *		two steps.
+ *
+ * WaitReadBuffers() -- second step of StartReadBuffers().
+ *
  * ReleaseBuffer() -- unpin a buffer
  *
  * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -471,10 +476,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
-								ReadBufferMode mode, BufferAccessStrategy strategy,
-								bool *hit);
+								ReadBufferMode mode, BufferAccessStrategy strategy);
 static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
 										   ForkNumber fork,
 										   BufferAccessStrategy strategy,
@@ -500,7 +504,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -781,7 +785,6 @@ Buffer
 ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				   ReadBufferMode mode, BufferAccessStrategy strategy)
 {
-	bool		hit;
 	Buffer		buf;
 
 	/*
@@ -794,15 +797,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
-							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+	buf = ReadBuffer_common(BMR_REL(reln),
+							forkNum, blockNum, mode, strategy);
+
 	return buf;
 }
 
@@ -822,13 +819,12 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
 						  BufferAccessStrategy strategy, bool permanent)
 {
-	bool		hit;
-
 	SMgrRelation smgr = smgropen(rlocator, INVALID_PROC_NUMBER);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
-							 mode, strategy, &hit);
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
+							 mode, strategy);
 }
 
 /*
@@ -994,35 +990,68 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	 */
 	if (buffer == InvalidBuffer)
 	{
-		bool		hit;
-
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
-								   fork, extend_to - 1, mode, strategy,
-								   &hit);
+		buffer = ReadBuffer_common(bmr, fork, extend_to - 1, mode, strategy);
 	}
 
 	return buffer;
 }
 
+/*
+ * Zero a buffer and lock it, as part of the implementation of
+ * RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK.  The buffer must already be
+ * pinned.  It does not have to be valid, but it is valid and locked on
+ * return.
+ */
+static void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
 /*
  * ReadBuffer_common -- common logic for all ReadBuffer variants
  *
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
 static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
-				  BufferAccessStrategy strategy, bool *hit)
+				  BufferAccessStrategy strategy)
 {
-	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
-	IOContext	io_context;
-	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
-
-	*hit = false;
+	ReadBuffersOperation operation;
+	Buffer		buffer;
+	int			nblocks;
+	int			flags;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1041,181 +1070,404 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
+	nblocks = 1;
+	if (mode == RBM_ZERO_ON_ERROR)
+		flags = READ_BUFFERS_ZERO_ON_ERROR;
+	else
+		flags = 0;
+	if (StartReadBuffers(bmr,
+						 &buffer,
+						 forkNum,
+						 blockNum,
+						 &nblocks,
+						 strategy,
+						 flags,
+						 &operation))
+		WaitReadBuffers(&operation);
+	Assert(nblocks == 1);		/* single block can't be short */
+
+	if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+		ZeroBuffer(buffer, mode);
+
+	return buffer;
+}
+
+static Buffer
+PrepareReadBuffer(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	Assert(blockNum != P_NEW);
 
+	Assert(bmr.smgr);
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
 	}
 	else
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
 		io_context = IOContextForStrategy(strategy);
 		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
 	}
 
-	/* At this point we do NOT hold any locks. */
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
 
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * WaitReadBuffers() (so, not for hits, and not for buffers that are
+		 * zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
 		VacuumPageHit++;
 		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageHit;
 
 		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
+	}
 
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
-		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
-		}
+	return BufferDescriptorGetBuffer(bufHdr);
+}
 
-		return BufferDescriptorGetBuffer(bufHdr);
+/*
+ * Begin reading a range of blocks beginning at blockNum and extending for
+ * *nblocks.  On return, up to *nblocks pinned buffers holding those blocks
+ * are written into the buffers array, and *nblocks is updated to contain the
+ * actual number, which may be fewer than requested.
+ *
+ * If false is returned, no I/O is necessary and WaitReadBuffers() is not
+ * necessary.  If true is returned, one I/O has been started, and
+ * WaitReadBuffers() must be called with the same operation object before the
+ * buffers are accessed.  Along with the operation object, the caller-supplied
+ * array of buffers must remain valid until WaitReadBuffers() is called.
+ *
+ * Currently the I/O is only started with optional operating system advice,
+ * and the real I/O happens in WaitReadBuffers().  In future work, true I/O
+ * could be initiated here.
+ */
+bool
+StartReadBuffers(BufferManagerRelation bmr,
+				 Buffer *buffers,
+				 ForkNumber forkNum,
+				 BlockNumber blockNum,
+				 int *nblocks,
+				 BufferAccessStrategy strategy,
+				 int flags,
+				 ReadBuffersOperation *operation)
+{
+	int			actual_nblocks = *nblocks;
+
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
 	}
 
-	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
-	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+	operation->bmr = bmr;
+	operation->forknum = forkNum;
+	operation->blocknum = blockNum;
+	operation->buffers = buffers;
+	operation->nblocks = actual_nblocks;
+	operation->strategy = strategy;
+	operation->flags = flags;
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	operation->io_buffers_len = 0;
 
-	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
-	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
-	else
+	for (int i = 0; i < actual_nblocks; ++i)
 	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		bool		found;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+		buffers[i] = PrepareReadBuffer(bmr,
+									   forkNum,
+									   blockNum + i,
+									   strategy,
+									   &found);
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		if (found)
+		{
+			/*
+			 * Terminate the read as soon as we get a hit.  It could be a
+			 * single buffer hit, or it could be a hit that follows a readable
+			 * range.  We don't want to create more than one readable range,
+			 * so we stop here.
+			 */
+			actual_nblocks = operation->nblocks = *nblocks = i + 1;
+		}
+		else
+		{
+			/* Extend the readable range to cover this block. */
+			operation->io_buffers_len++;
+		}
+	}
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+	if (operation->io_buffers_len > 0)
+	{
+		if (flags & READ_BUFFERS_ISSUE_ADVICE)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
-			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
-			}
-			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
+			/*
+			 * In theory we should only do this if PrepareReadBuffers() had to
+			 * allocate new buffers above.  That way, if two calls to
+			 * StartReadBuffers() were made for the same blocks before
+			 * WaitReadBuffers(), only the first would issue the advice.
+			 * That'd be a better simulation of true asynchronous I/O, which
+			 * would only start the I/O once, but isn't done here for
+			 * simplicity.  Note also that the following call might actually
+			 * issue two advice calls if we cross a segment boundary; in a
+			 * true asynchronous version we might choose to process only one
+			 * real I/O at a time in that case.
+			 */
+			smgrprefetch(bmr.smgr, forkNum, blockNum, operation->io_buffers_len);
 		}
+
+		/* Indicate that WaitReadBuffers() should be called. */
+		return true;
 	}
+	else
+	{
+		return false;
+	}
+}
 
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
+static inline bool
+WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
 	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
 	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
+
+void
+WaitReadBuffers(ReadBuffersOperation *operation)
+{
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	int			nblocks;
+	BlockNumber blocknum;
+	ForkNumber	forknum;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	/*
+	 * Currently operations are only allowed to include a read of some range,
+	 * with an optional extra buffer that is already pinned at the end.  So
+	 * nblocks can be at most one more than io_buffers_len.
+	 */
+	Assert((operation->nblocks == operation->io_buffers_len) ||
+		   (operation->nblocks == operation->io_buffers_len + 1));
 
+	/* Find the range of the physical read we need to perform. */
+	nblocks = operation->io_buffers_len;
+	if (nblocks == 0)
+		return;					/* nothing to do */
+
+	buffers = &operation->buffers[0];
+	blocknum = operation->blocknum;
+	forknum = operation->forknum;
+	bmr = operation->bmr;
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
-
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
 	}
 	else
 	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
+		io_context = IOContextForStrategy(operation->strategy);
+		io_object = IOOBJECT_RELATION;
 	}
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+	/*
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
+	 */
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
+	else
+		pgBufferUsage.shared_blks_read += nblocks;
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+	for (int i = 0; i < nblocks; ++i)
+	{
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_BUFFERS_PER_TRANSFER];
+		void	   *io_pages[MAX_BUFFERS_PER_TRANSFER];
+		instr_time	io_start;
+		BlockNumber io_first_block;
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!WaitReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PrepareReadBuffer().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
+
+		/*
+		 * How many neighboring-on-disk blocks can we scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   WaitReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
+
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
+
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
+		{
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
+			{
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
+			}
+			else
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
+
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+			}
+
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
+
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
+
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
- * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
- *		buffer.  If no buffer exists already, selects a replacement
- *		victim and evicts the old page, but does NOT read in new page.
+ * BufferAlloc -- subroutine for StartReadBuffers.  Handles lookup of a shared
+ *		buffer.  If no buffer exists already, selects a replacement victim and
+ *		evicts the old page, but does NOT read in new page.
  *
  * "strategy" can be a buffer replacement strategy object, or NULL for
  * the default strategy.  The selected buffer's usage_count is advanced when
@@ -1223,11 +1475,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1286,19 +1534,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1363,19 +1602,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1407,15 +1637,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1769,7 +1993,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2034,7 +2258,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2057,7 +2281,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2372,7 +2596,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if StartReadBuffers() was called and
+		 * WaitReadBuffers() hasn't been called yet.  We'll check by loading
+		 * the flags without locking.  This is racy, but it's OK to return
+		 * false spuriously: when WaitReadBuffers() calls StartBufferIO(),
+		 * it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2381,7 +2610,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3449,7 +3678,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -5184,9 +5413,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5199,6 +5434,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index fcfac335a57..985a2c7049c 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -108,10 +108,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -287,7 +286,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -297,9 +296,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but it might not be
+	 * initialized yet so read num_temp_buffers.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..b57f71f97e3 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
+/*
+ * Maximum number of buffers for multi-buffer I/O functions.  This is set to
+ * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a smaller maximum.
+ */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
 
 /*
  * prototypes for functions in bufmgr.c
@@ -177,6 +183,42 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+/*
+ * Private state used by StartReadBuffers() and WaitReadBuffers().  Declared
+ * in public header only to allow inclusion in other structs, but contents
+ * should not be accessed.
+ */
+struct ReadBuffersOperation
+{
+	/* Parameters passed in to StartReadBuffers(). */
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	ForkNumber	forknum;
+	BlockNumber blocknum;
+	int			nblocks;
+	BufferAccessStrategy strategy;
+	int			flags;
+
+	/* Range of buffers, if we need to perform a read. */
+	int			io_buffers_len;
+};
+
+typedef struct ReadBuffersOperation ReadBuffersOperation;
+
+extern bool StartReadBuffers(BufferManagerRelation bmr,
+							 Buffer *buffers,
+							 ForkNumber forknum,
+							 BlockNumber blocknum,
+							 int *nblocks,
+							 BufferAccessStrategy strategy,
+							 int flags,
+							 ReadBuffersOperation *operation);
+extern void WaitReadBuffers(ReadBuffersOperation *operation);
+
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -250,6 +292,9 @@ extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 00000000000..c4d3892bb26
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,52 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define PGSR_FLAG_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users.
+ */
+#define PGSR_FLAG_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define PGSR_FLAG_SEQUENTIAL 0x02
+
+/*
+ * We usually ramp up from smaller reads to larger ones, to support users who
+ * don't know if it's worth reading lots of buffers yet.  This flag disables
+ * that, declaring ahead of time that we'll be reading all available buffers.
+ */
+#define PGSR_FLAG_FULL 0x04
+
+struct PgStreamingRead;
+typedef struct PgStreamingRead PgStreamingRead;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*PgStreamingReadBufferCB) (PgStreamingRead *pgsr,
+												void *pgsr_private,
+												void *per_buffer_private);
+
+extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags,
+													   void *pgsr_private,
+													   size_t per_buffer_private_size,
+													   BufferAccessStrategy strategy,
+													   BufferManagerRelation bmr,
+													   ForkNumber forknum,
+													   PgStreamingReadBufferCB next_block_cb);
+
+extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
+extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
+extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+
+#endif
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d3a7f75b080..299c77ea69f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2097,6 +2097,8 @@ PgStat_TableCounts
 PgStat_TableStatus
 PgStat_TableXactStatus
 PgStat_WalStats
+PgStreamingRead
+PgStreamingReadRange
 PgXmlErrorContext
 PgXmlStrictness
 Pg_finfo_record
@@ -2267,6 +2269,7 @@ ReInitializeDSMForeignScan_function
 ReScanForeignScan_function
 ReadBufPtrType
 ReadBufferMode
+ReadBuffersOperation
 ReadBytePtrType
 ReadExtraTocPtrType
 ReadFunc
-- 
2.40.1
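
For concreteness, here is a minimal caller sketch based only on the
declarations added by the patch above. It is not part of the patch:
read_block_range_sketch() is an invented name, the requested blocks are
assumed to exist, and buffer content locking and error handling are
omitted.

static void
read_block_range_sketch(Relation rel, BlockNumber first)
{
	Buffer		buffers[MAX_BUFFERS_PER_TRANSFER];
	int			nblocks = MAX_BUFFERS_PER_TRANSFER;
	ReadBuffersOperation operation;

	/*
	 * Pin up to nblocks buffers starting at "first".  On return, nblocks
	 * holds the actual count, which may be smaller if a cache hit
	 * terminated the readable range early.
	 */
	if (StartReadBuffers(BMR_REL(rel), buffers, MAIN_FORKNUM, first,
						 &nblocks, NULL, 0, &operation))
		WaitReadBuffers(&operation);	/* complete the pending read */

	for (int i = 0; i < nblocks; i++)
	{
		Page		page = BufferGetPage(buffers[i]);

		/* ... inspect "page" here (content lock omitted) ... */
		(void) page;
		ReleaseBuffer(buffers[i]);
	}
}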

v5-0004-Add-pg_streaming_read_reset.patch (text/x-diff; charset=us-ascii)
From a9bb20c34dd31379fd45222efd701ffacc9732c4 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 26 Feb 2024 15:33:39 -0500
Subject: [PATCH v5 4/5] Add pg_streaming_read_reset

For rescan, we want to reuse the streaming read object and simply
release the buffers that were pinned by the streaming read
infrastructure.
---
 src/backend/storage/aio/streaming_read.c | 18 ++++++++++++++++++
 src/include/storage/streaming_read.h     |  1 +
 2 files changed, 19 insertions(+)

diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
index e437abebe4c..9da863fd682 100644
--- a/src/backend/storage/aio/streaming_read.c
+++ b/src/backend/storage/aio/streaming_read.c
@@ -609,3 +609,21 @@ pg_streaming_read_free(PgStreamingRead *pgsr)
 
 	pfree(pgsr);
 }
+
+
+/*
+ * Reset a streaming read object by releasing all of the buffers. Note that
+ * max_ios is not recalculated, so any changes to maintenance_io_concurrency and
+ * effective_io_concurrency will have no effect.
+ */
+void
+pg_streaming_read_reset(PgStreamingRead *pgsr)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead, and unpin anything that wasn't consumed. */
+	pgsr->finished = true;
+	while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+	pgsr->finished = false;
+}
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
index c4d3892bb26..63cef719e42 100644
--- a/src/include/storage/streaming_read.h
+++ b/src/include/storage/streaming_read.h
@@ -48,5 +48,6 @@ extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags,
 extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
 extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
 extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+extern void pg_streaming_read_reset(PgStreamingRead *pgsr);
 
 #endif
-- 
2.40.1

v5-0005-Sequential-scans-and-TID-range-scans-stream-reads.patch (text/x-diff; charset=us-ascii)
From 6615e10588c800a8ab0d72587996610ebf26faef Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 26 Feb 2024 15:41:32 -0500
Subject: [PATCH v5 5/5] Sequential scans and TID range scans stream reads

Implementing streaming read support for heap sequential scans and TID
range scans includes three parts:

Allocate the streaming read object in heap_beginscan(). On rescan, reset
the streaming read by releasing all pinned buffers and resetting the
prefetch block.

Implement a callback that returns the next block for the streaming read
API to prefetch.

Invoke the streaming read API when a new page is needed and streaming
reads are enabled. When the scan direction changes, reset the streaming
read.

ci-os-only:
---
 src/backend/access/heap/heapam.c | 88 ++++++++++++++++++++++++++++----
 src/include/access/heapam.h      | 16 +++++-
 2 files changed, 93 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index adde61fca60..9606d71f457 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -62,6 +62,7 @@
 #include "storage/predicate.h"
 #include "storage/procarray.h"
 #include "storage/standby.h"
+#include "storage/streaming_read.h"
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/relcache.h"
@@ -224,6 +225,29 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * ----------------------------------------------------------------
  */
 
+static BlockNumber
+heap_scan_pgsr_next(PgStreamingRead *pgsr, void *pgsr_private,
+					  void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) pgsr_private;
+
+	/*
+	 * Hard-code ScanDirection to ForwardScanDirection since only forward
+	 * scans support streaming reads.
+	 */
+	if (unlikely(!scan->rs_inited))
+	{
+		scan->rs_prefetch_block = heapgettup_initial_block(scan, scan->rs_dir);
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_prefetch_block = heapgettup_advance_block(scan,
+														   scan->rs_prefetch_block,
+														   scan->rs_dir);
+
+	return scan->rs_prefetch_block;
+}
+
 /* ----------------
  *		initscan - scan code common to heap_beginscan and heap_rescan
  * ----------------
@@ -326,6 +350,13 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
 
+	/*
+	 * Initialize to ForwardScanDirection because it is most common and heap
+	 * scans usually must go forwards before going backward.
+	 */
+	scan->rs_dir = ForwardScanDirection;
+	scan->rs_prefetch_block = InvalidBlockNumber;
+
 	/* page-at-a-time fields are always invalid when not rs_inited */
 
 	/*
@@ -468,6 +499,8 @@ heapbuildvis(TableScanDesc sscan)
 static inline void
 heapfetchbuf(HeapScanDesc scan, ScanDirection dir)
 {
+	Assert(scan->rs_pgsr);
+
 	/* release previous scan buffer, if any */
 	if (BufferIsValid(scan->rs_cbuf))
 	{
@@ -482,19 +515,23 @@ heapfetchbuf(HeapScanDesc scan, ScanDirection dir)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	if (unlikely(!scan->rs_inited))
+	/*
+	 * If the scan direction is changing, reset the prefetch block to the
+	 * current block. Otherwise, we will incorrectly prefetch the blocks
+	 * between the prefetch block and the current block again before
+	 * prefetching blocks in the new, correct scan direction.
+	 */
+	if (unlikely(scan->rs_dir != dir))
 	{
-		scan->rs_cblock = heapgettup_initial_block(scan, dir);
-		Assert(scan->rs_cblock != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
-		scan->rs_inited = true;
+		scan->rs_prefetch_block = scan->rs_cblock;
+		pg_streaming_read_reset(scan->rs_pgsr);
 	}
-	else
-		scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock, dir);
 
-	/* read block if valid */
-	if (BlockNumberIsValid(scan->rs_cblock))
-		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
-										   scan->rs_cblock, RBM_NORMAL, scan->rs_strategy);
+	scan->rs_dir = dir;
+
+	scan->rs_cbuf = pg_streaming_read_buffer_get_next(scan->rs_pgsr, NULL);
+	if (BufferIsValid(scan->rs_cbuf))
+		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
 }
 
 /*
@@ -823,6 +860,7 @@ continue_page:
 
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
+	scan->rs_prefetch_block = InvalidBlockNumber;
 	tuple->t_data = NULL;
 	scan->rs_inited = false;
 }
@@ -913,6 +951,7 @@ continue_page:
 		ReleaseBuffer(scan->rs_cbuf);
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
+	scan->rs_prefetch_block = InvalidBlockNumber;
 	tuple->t_data = NULL;
 	scan->rs_inited = false;
 }
@@ -995,6 +1034,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	else
 		scan->rs_parallelworkerdata = NULL;
 
+	scan->rs_pgsr = NULL;
+
 	/*
 	 * we do this here instead of in initscan() because heap_rescan also calls
 	 * initscan() and we don't want to allocate memory again
@@ -1006,6 +1047,22 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 
 	initscan(scan, key, false);
 
+	/*
+	 * We do not know the scan direction yet. If the scan does not end up
+	 * being a forward scan, the streaming read object will be freed.
+	 */
+	if (scan->rs_base.rs_flags & SO_TYPE_SEQSCAN ||
+		scan->rs_base.rs_flags & SO_TYPE_TIDRANGESCAN)
+	{
+		scan->rs_pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_SEQUENTIAL,
+													   scan,
+													   0,
+													   scan->rs_strategy,
+													   BMR_REL(scan->rs_base.rs_rd),
+													   MAIN_FORKNUM,
+													   heap_scan_pgsr_next);
+	}
+
 	return (TableScanDesc) scan;
 }
 
@@ -1044,6 +1101,14 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 	 * reinitialize scan descriptor
 	 */
 	initscan(scan, key, true);
+
+	/*
+	 * The streaming read object is reset on rescan. This must be done after
+	 * initscan(), as some state referred to by pg_streaming_read_reset() is
+	 * reset in initscan().
+	 */
+	if (scan->rs_pgsr)
+		pg_streaming_read_reset(scan->rs_pgsr);
 }
 
 void
@@ -1059,6 +1124,9 @@ heap_endscan(TableScanDesc sscan)
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
 
+	if (scan->rs_pgsr)
+		pg_streaming_read_free(scan->rs_pgsr);
+
 	/*
 	 * decrement relation reference count and free scan descriptor storage
 	 */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e2b1b2a3ad9..a476ae62a8f 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -72,9 +72,23 @@ typedef struct HeapScanDescData
 	 */
 	ParallelBlockTableScanWorkerData *rs_parallelworkerdata;
 
-	/* these fields only used in page-at-a-time mode and for bitmap scans */
+	/* only used in page-at-a-time mode and for bitmap scans */
 	int			rs_cindex;		/* current tuple's index in vistuples */
 	int			rs_ntuples;		/* number of visible tuples on page */
+
+	/*
+	 * Fields used for streaming reads by sequential scans and TID range
+	 * scans. The streaming read object is allocated at the beginning of the
+	 * scan and reset on rescan or when the scan direction changes. The scan
+	 * direction is saved each time a new page is requested. If the scan
+	 * direction changes from one page to the next, the streaming read object
+	 * releases all previously pinned buffers and resets the prefetch block.
+	 */
+	BlockNumber rs_prefetch_block;
+	ScanDirection rs_dir;
+	struct PgStreamingRead *rs_pgsr;
+
+	/* only used in page-at-a-time mode and for bitmap scans */
 	OffsetNumber rs_vistuples[MaxHeapTuplesPerPage];	/* their offsets */
 }			HeapScanDescData;
 typedef struct HeapScanDescData *HeapScanDesc;
-- 
2.40.1
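
To show how the pieces above fit together outside of heapam.c, here is a
minimal streaming read user sketch following the API declared in
streaming_read.h earlier in this patchset. It is illustrative only:
MyScanState, my_next_block(), and scan_all_blocks_sketch() are invented
names, and per-buffer processing and locking are omitted.

typedef struct MyScanState
{
	BlockNumber next_block;		/* next block to hand to the stream */
	BlockNumber nblocks;		/* total number of blocks to scan */
} MyScanState;

/* Callback: return the next block to read, or InvalidBlockNumber when done. */
static BlockNumber
my_next_block(PgStreamingRead *pgsr, void *pgsr_private, void *per_buffer_data)
{
	MyScanState *state = (MyScanState *) pgsr_private;

	if (state->next_block >= state->nblocks)
		return InvalidBlockNumber;
	return state->next_block++;
}

static void
scan_all_blocks_sketch(Relation rel)
{
	MyScanState state = {0, RelationGetNumberOfBlocks(rel)};
	PgStreamingRead *pgsr;
	Buffer		buf;

	pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_SEQUENTIAL, &state, 0,
										  NULL, BMR_REL(rel), MAIN_FORKNUM,
										  my_next_block);

	/* Consume pinned buffers until the callback is exhausted. */
	while ((buf = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
	{
		/* ... process the pinned (but unlocked) buffer here ... */
		ReleaseBuffer(buf);
	}

	/*
	 * A rescan could instead call pg_streaming_read_reset() and reset
	 * next_block, much as heap_rescan() does with rs_prefetch_block above.
	 */
	pg_streaming_read_free(pgsr);
}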

#11Melanie Plageman
melanieplageman@gmail.com
In reply to: Melanie Plageman (#10)
6 attachment(s)
Re: Streaming read-ready sequential scan code

On Fri, Mar 8, 2024 at 4:56 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Sat, Mar 02, 2024 at 06:07:48PM -0500, Melanie Plageman wrote:

On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote:

On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

There is an outstanding question about where to allocate the
PgStreamingRead object for sequential scans

I've written three alternative implementations of the actual streaming
read user for sequential scan which handle the question of where to
allocate the streaming read object and how to handle changing scan
direction in different ways.

Option A) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation
- Allocates the streaming read object in initscan(). Since we do not
know the scan direction at this time, if the scan ends up not being a
forwards scan, the streaming read object must later be freed -- so
this will sometimes allocate a streaming read object it never uses.
- Only supports ForwardScanDirection and once the scan direction
changes, streaming is never supported again -- even if we return to
ForwardScanDirection
- Must maintain a "fallback" codepath that does not use the streaming read API

Attached is a version of this patch which implements a "reset"
function for the streaming read API which should be cheaper than the
full pg_streaming_read_free() on rescan. This can easily be ported to
work on any of my proposed implementations (A/B/C). I implemented it
on A as an example.

Attached is the latest version of this patchset -- rebased in light of
Thomas' updates to the streaming read API [1]. I have chosen the
approach I think we should go with. It is a hybrid of my previously
proposed approaches.

While investigating some performance concerns, Andres pointed out that
the members I add to HeapScanDescData in this patch push rs_cindex and
rs_ntuples to another cacheline and introduce a 4-byte hole. Attached
v4's HeapScanDescData is as well-packed as on master and its members
are reordered so that rs_cindex and rs_ntuples are back on the second
cacheline of the struct's data.
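
For readers not steeped in struct layout details, the issue is purely one
of member ordering. A hypothetical illustration (not the real
HeapScanDescData), assuming a typical 64-bit platform where pointers need
8-byte alignment: in a struct as large as HeapScanDescData, a wasted
8 bytes like this is enough to push hot members such as rs_cindex and
rs_ntuples across a 64-byte cacheline boundary.

struct bad_order
{
	uint32		a;				/* offset 0, 4 bytes */
	/* compiler inserts a 4-byte hole here to align the pointer */
	void	   *p;				/* offset 8, 8 bytes */
	uint32		b;				/* offset 16; sizeof() is padded to 24 */
};

struct good_order
{
	void	   *p;				/* offset 0, 8 bytes */
	uint32		a;				/* offset 8, 4 bytes */
	uint32		b;				/* offset 12; sizeof() is 16, no holes */
};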

I did some additional profiling and realized that dropping the
unlikely() hints from the places where rs_inited is checked frequently
was negatively impacting performance. v5 adds them back and also makes a
few other very minor cleanups.
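
For context, unlikely() is PostgreSQL's branch-prediction hint
(paraphrased below from src/include/c.h); on GCC-compatible compilers it
marks the rs_inited initialization branch as cold, which matters because
that test sits in the per-tuple path of heapgettup() and
heapgettup_pagemode().

#ifdef __GNUC__
#define unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define unlikely(x) ((x) != 0)
#endif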

Note that this patch set includes a not-yet-released version of Thomas
Munro's Streaming Read API with new ramp-up logic, which seems to fix a
performance issue I saw with my test case when all of the sequential
scan's blocks are in shared buffers. Once he sends the official new
version, I will rebase this and point to his explanation in that thread.

Attached v6 has the version of the streaming read API mentioned here
[1]: /messages/by-id/CA+hUKGJTwrS7F=uJPx3SeigMiQiW+LJaOkjGyZdCntwyMR=uAw@mail.gmail.com
investigated in that thread by Andres, Bilal, and Thomas.

The one outstanding item for the sequential scan streaming read user
is deciding how the BAS_BULKREAD buffer access strategy should
interact with the streaming read infrastructure. We discussed this a bit
off-list, and it seems clear that the ring must be at least as large
as io_combine_limit. This should be no problem for BAS_BULKREAD
because its ring is 16 MB. The question is whether or not we need to
do anything right now to ensure there aren't adverse interactions
between io_combine_limit, max_ios, and the buffer access strategy ring
buffer size.
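
As a rough sanity check, using the constant from the version of the
bufmgr patch quoted earlier in this thread: MAX_BUFFERS_PER_TRANSFER is
Min(PG_IOV_MAX, 128kB / BLCKSZ), so with the default 8kB block size a
single combined read pins at most 128kB / 8kB = 16 buffers, or 128kB.
Under those assumptions, any ring that can hold at least io_combine_limit
buffers fits one combined read; the open question is how that interacts
with up to max_ios such reads being in flight against the same ring.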

- Melanie

[1]: /messages/by-id/CA+hUKGJTwrS7F=uJPx3SeigMiQiW+LJaOkjGyZdCntwyMR=uAw@mail.gmail.com

Attachments:

v6-0001-Split-heapgetpage-into-two-parts.patch (text/x-patch; charset=US-ASCII)
From bed26d391190f4411eccc4533d188e5dba6e8f72 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sat, 27 Jan 2024 18:39:37 -0500
Subject: [PATCH v6 1/6] Split heapgetpage into two parts

heapgetpage(), a per-block utility function used in heap scans, read a
passed-in block into a buffer and then, if page-at-a-time processing was
enabled, pruned the page and built an array of its visible tuples. This
was used for sequential and sample scans.

Future commits will add support for streaming reads. The streaming read
API will read in the blocks specified by a callback, but any significant
per-page processing should be done synchronously on the buffer yielded
by the streaming read API. To support this, separate the logic in
heapgetpage() to read a block into a buffer from that which prunes the
page and builds an array of the visible tuples. The former is now
heapfetchbuf() and the latter is now heapbuildvis().

Future commits will push the logic for selecting the next block into
heapfetchbuf() in cases when streaming reads are not supported (such as
backwards sequential scans). Because this logic differs for sample scans
and sequential scans, inline the code to read the block into a buffer
for sample scans.

This has the added benefit of allowing for a bit of refactoring in
heapam_scan_sample_next_block(), including unpinning the previous buffer
before invoking the callback to select the next block.
---
 src/backend/access/heap/heapam.c         | 74 ++++++++++++++----------
 src/backend/access/heap/heapam_handler.c | 40 +++++++++----
 src/include/access/heapam.h              |  2 +-
 3 files changed, 72 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2f6527df0dc..4291f161cf9 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -360,17 +360,18 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
 }
 
 /*
- * heapgetpage - subroutine for heapgettup()
+ * heapbuildvis - Utility function for heap scans.
  *
- * This routine reads and pins the specified page of the relation.
- * In page-at-a-time mode it performs additional work, namely determining
- * which tuples on the page are visible.
+ * Given a page residing in a buffer saved in the scan descriptor, prune the
+ * page and determine which of its tuples are all visible, saving their offsets
+ * in an array in the scan descriptor.
  */
 void
-heapgetpage(TableScanDesc sscan, BlockNumber block)
+heapbuildvis(TableScanDesc sscan)
 {
 	HeapScanDesc scan = (HeapScanDesc) sscan;
-	Buffer		buffer;
+	Buffer		buffer = scan->rs_cbuf;
+	BlockNumber block = scan->rs_cblock;
 	Snapshot	snapshot;
 	Page		page;
 	int			lines;
@@ -378,31 +379,8 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	OffsetNumber lineoff;
 	bool		all_visible;
 
-	Assert(block < scan->rs_nblocks);
+	Assert(BufferGetBlockNumber(buffer) == block);
 
-	/* release previous scan buffer, if any */
-	if (BufferIsValid(scan->rs_cbuf))
-	{
-		ReleaseBuffer(scan->rs_cbuf);
-		scan->rs_cbuf = InvalidBuffer;
-	}
-
-	/*
-	 * Be sure to check for interrupts at least once per page.  Checks at
-	 * higher code levels won't be able to stop a seqscan that encounters many
-	 * pages' worth of consecutive dead tuples.
-	 */
-	CHECK_FOR_INTERRUPTS();
-
-	/* read page using selected strategy */
-	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
-									   RBM_NORMAL, scan->rs_strategy);
-	scan->rs_cblock = block;
-
-	if (!(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE))
-		return;
-
-	buffer = scan->rs_cbuf;
 	snapshot = scan->rs_base.rs_snapshot;
 
 	/*
@@ -475,6 +453,37 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	scan->rs_ntuples = ntup;
 }
 
+/*
+ * heapfetchbuf - subroutine for heapgettup()
+ *
+ * This routine reads the specified block of the relation into a buffer and
+ * returns with that pinned buffer saved in the scan descriptor.
+ */
+static inline void
+heapfetchbuf(HeapScanDesc scan, BlockNumber block)
+{
+	Assert(block < scan->rs_nblocks);
+
+	/* release previous scan buffer, if any */
+	if (BufferIsValid(scan->rs_cbuf))
+	{
+		ReleaseBuffer(scan->rs_cbuf);
+		scan->rs_cbuf = InvalidBuffer;
+	}
+
+	/*
+	 * Be sure to check for interrupts at least once per page.  Checks at
+	 * higher code levels won't be able to stop a seqscan that encounters many
+	 * pages' worth of consecutive dead tuples.
+	 */
+	CHECK_FOR_INTERRUPTS();
+
+	/* read page using selected strategy */
+	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
+									   RBM_NORMAL, scan->rs_strategy);
+	scan->rs_cblock = block;
+}
+
 /*
  * heapgettup_initial_block - return the first BlockNumber to scan
  *
@@ -748,7 +757,7 @@ heapgettup(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapfetchbuf(scan, block);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_start_page(scan, dir, &linesleft, &lineoff);
 continue_page:
@@ -869,7 +878,8 @@ heapgettup_pagemode(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapfetchbuf(scan, block);
+		heapbuildvis((TableScanDesc) scan);
 		page = BufferGetPage(scan->rs_cbuf);
 		linesleft = scan->rs_ntuples;
 		lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 6abfe36dec7..8784473b462 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2336,11 +2336,14 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 	if (hscan->rs_nblocks == 0)
 		return false;
 
-	if (tsm->NextSampleBlock)
+	if (BufferIsValid(hscan->rs_cbuf))
 	{
-		blockno = tsm->NextSampleBlock(scanstate, hscan->rs_nblocks);
-		hscan->rs_cblock = blockno;
+		ReleaseBuffer(hscan->rs_cbuf);
+		hscan->rs_cbuf = InvalidBuffer;
 	}
+
+	if (tsm->NextSampleBlock)
+		blockno = tsm->NextSampleBlock(scanstate, hscan->rs_nblocks);
 	else
 	{
 		/* scanning table sequentially */
@@ -2382,20 +2385,35 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 		}
 	}
 
-	if (!BlockNumberIsValid(blockno))
+	hscan->rs_cblock = blockno;
+
+	if (!BlockNumberIsValid(hscan->rs_cblock))
 	{
-		if (BufferIsValid(hscan->rs_cbuf))
-			ReleaseBuffer(hscan->rs_cbuf);
-		hscan->rs_cbuf = InvalidBuffer;
-		hscan->rs_cblock = InvalidBlockNumber;
 		hscan->rs_inited = false;
-
 		return false;
 	}
 
-	heapgetpage(scan, blockno);
-	hscan->rs_inited = true;
+	Assert(hscan->rs_cblock < hscan->rs_nblocks);
+
+	/*
+	 * We may scan multiple pages before finding tuples to yield or finishing
+	 * the scan. Since we want to check for interrupts at least once per page,
+	 * do so here.
+	 */
+	CHECK_FOR_INTERRUPTS();
+
+	/* Read page using selected strategy */
+	hscan->rs_cbuf = ReadBufferExtended(hscan->rs_base.rs_rd, MAIN_FORKNUM,
+										hscan->rs_cblock, RBM_NORMAL, hscan->rs_strategy);
 
+	/*
+	 * If pagemode is allowed, prune the page and build an array of visible
+	 * tuple offsets.
+	 */
+	if (hscan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
+		heapbuildvis(scan);
+
+	hscan->rs_inited = true;
 	return true;
 }
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f1122453738..bacef18c0d3 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -254,7 +254,7 @@ extern TableScanDesc heap_beginscan(Relation relation, Snapshot snapshot,
 									uint32 flags);
 extern void heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk,
 							   BlockNumber numBlks);
-extern void heapgetpage(TableScanDesc sscan, BlockNumber block);
+extern void heapbuildvis(TableScanDesc sscan);
 extern void heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 						bool allow_strat, bool allow_sync, bool allow_pagemode);
 extern void heap_endscan(TableScanDesc sscan);
-- 
2.40.1

v6-0002-Replace-blocks-with-buffers-in-heapgettup-control.patch (text/x-patch; charset=US-ASCII)
From 7e021d9f21292a939c3c3485ac44e17e0995127f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 29 Jan 2024 11:50:01 -0500
Subject: [PATCH v6 2/6] Replace blocks with buffers in heapgettup control flow

Future commits will introduce the streaming read API and the sequential
scan streaming read API user. Streaming read API users implement a
callback which returns the next block to read. Sequential scans
previously looped through the blocks in the relation, synchronously
reading in a block and then processing it. An InvalidBlockNumber
returned by heapgettup_advance_block() meant that the relation was
exhausted and all blocks had been processed.

The streaming read API may exhaust the blocks in a relation (having read
all of them into buffers) before they have all been processed by the
sequential scan. As such, the sequential scan should continue processing
blocks until heapfetchbuf() returns InvalidBuffer.

Note that this commit does not implement the streaming read API user. It
simply restructures heapgettup() and heapgettup_pagemode() to use
buffers instead of blocks for control flow.

Not all sequential scans will support streaming reads. As such, this
code will remain for compatibility even after sequential scans support
streaming reads.
---
 src/backend/access/heap/heapam.c | 79 ++++++++++++++------------------
 1 file changed, 35 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4291f161cf9..c814aea9407 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -83,6 +83,9 @@ static Bitmapset *HeapDetermineColumnsInfo(Relation relation,
 static bool heap_acquire_tuplock(Relation relation, ItemPointer tid,
 								 LockTupleMode mode, LockWaitPolicy wait_policy,
 								 bool *have_tuple_lock);
+static inline BlockNumber heapgettup_advance_block(HeapScanDesc scan,
+												   BlockNumber block, ScanDirection dir);
+static inline BlockNumber heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir);
 static void compute_new_xmax_infomask(TransactionId xmax, uint16 old_infomask,
 									  uint16 old_infomask2, TransactionId add_to_xmax,
 									  LockTupleMode mode, bool is_update,
@@ -456,14 +459,12 @@ heapbuildvis(TableScanDesc sscan)
 /*
  * heapfetchbuf - subroutine for heapgettup()
  *
- * This routine reads the specified block of the relation into a buffer and
- * returns with that pinned buffer saved in the scan descriptor.
+ * This routine reads the next block of the relation into a buffer and returns
+ * with that pinned buffer saved in the scan descriptor.
  */
 static inline void
-heapfetchbuf(HeapScanDesc scan, BlockNumber block)
+heapfetchbuf(HeapScanDesc scan, ScanDirection dir)
 {
-	Assert(block < scan->rs_nblocks);
-
 	/* release previous scan buffer, if any */
 	if (BufferIsValid(scan->rs_cbuf))
 	{
@@ -478,10 +479,19 @@ heapfetchbuf(HeapScanDesc scan, BlockNumber block)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	/* read page using selected strategy */
-	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
-									   RBM_NORMAL, scan->rs_strategy);
-	scan->rs_cblock = block;
+	if (unlikely(!scan->rs_inited))
+	{
+		scan->rs_cblock = heapgettup_initial_block(scan, dir);
+		Assert(scan->rs_cblock != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock, dir);
+
+	/* read block if valid */
+	if (BlockNumberIsValid(scan->rs_cblock))
+		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
+										   scan->rs_cblock, RBM_NORMAL, scan->rs_strategy);
 }
 
 /*
@@ -491,7 +501,7 @@ heapfetchbuf(HeapScanDesc scan, BlockNumber block)
  * occur with empty tables and in parallel scans when parallel workers get all
  * of the pages before we can get a chance to get our first page.
  */
-static BlockNumber
+BlockNumber
 heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
 {
 	Assert(!scan->rs_inited);
@@ -631,7 +641,7 @@ heapgettup_continue_page(HeapScanDesc scan, ScanDirection dir, int *linesleft,
  * This also adjusts rs_numblocks when a limit has been imposed by
  * heap_setscanlimits().
  */
-static inline BlockNumber
+BlockNumber
 heapgettup_advance_block(HeapScanDesc scan, BlockNumber block, ScanDirection dir)
 {
 	if (ScanDirectionIsForward(dir))
@@ -729,23 +739,13 @@ heapgettup(HeapScanDesc scan,
 		   ScanKey key)
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
-	BlockNumber block;
 	Page		page;
 	OffsetNumber lineoff;
 	int			linesleft;
 
-	if (unlikely(!scan->rs_inited))
-	{
-		block = heapgettup_initial_block(scan, dir);
-		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
-		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
-		scan->rs_inited = true;
-	}
-	else
+	if (likely(scan->rs_inited))
 	{
 		/* continue from previously returned page/tuple */
-		block = scan->rs_cblock;
-
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_continue_page(scan, dir, &linesleft, &lineoff);
 		goto continue_page;
@@ -755,9 +755,12 @@ heapgettup(HeapScanDesc scan,
 	 * advance the scan until we find a qualifying tuple or run out of stuff
 	 * to scan
 	 */
-	while (block != InvalidBlockNumber)
+	while (true)
 	{
-		heapfetchbuf(scan, block);
+		heapfetchbuf(scan, dir);
+		if (!BufferIsValid(scan->rs_cbuf))
+			break;
+		Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_start_page(scan, dir, &linesleft, &lineoff);
 continue_page:
@@ -779,7 +782,7 @@ continue_page:
 
 			tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
 			tuple->t_len = ItemIdGetLength(lpp);
-			ItemPointerSet(&(tuple->t_self), block, lineoff);
+			ItemPointerSet(&(tuple->t_self), scan->rs_cblock, lineoff);
 
 			visible = HeapTupleSatisfiesVisibility(tuple,
 												   scan->rs_base.rs_snapshot,
@@ -809,9 +812,6 @@ continue_page:
 		 * it's time to move to the next.
 		 */
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
 	}
 
 	/* end of scan */
@@ -844,22 +844,13 @@ heapgettup_pagemode(HeapScanDesc scan,
 					ScanKey key)
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
-	BlockNumber block;
 	Page		page;
 	int			lineindex;
 	int			linesleft;
 
-	if (unlikely(!scan->rs_inited))
-	{
-		block = heapgettup_initial_block(scan, dir);
-		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
-		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
-		scan->rs_inited = true;
-	}
-	else
+	if (likely(scan->rs_inited))
 	{
 		/* continue from previously returned page/tuple */
-		block = scan->rs_cblock;	/* current page */
 		page = BufferGetPage(scan->rs_cbuf);
 
 		lineindex = scan->rs_cindex + dir;
@@ -876,9 +867,12 @@ heapgettup_pagemode(HeapScanDesc scan,
 	 * advance the scan until we find a qualifying tuple or run out of stuff
 	 * to scan
 	 */
-	while (block != InvalidBlockNumber)
+	while (true)
 	{
-		heapfetchbuf(scan, block);
+		heapfetchbuf(scan, dir);
+		if (!BufferIsValid(scan->rs_cbuf))
+			break;
+		Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
 		heapbuildvis((TableScanDesc) scan);
 		page = BufferGetPage(scan->rs_cbuf);
 		linesleft = scan->rs_ntuples;
@@ -898,7 +892,7 @@ continue_page:
 
 			tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
 			tuple->t_len = ItemIdGetLength(lpp);
-			ItemPointerSet(&(tuple->t_self), block, lineoff);
+			ItemPointerSet(&(tuple->t_self), scan->rs_cblock, lineoff);
 
 			/* skip any tuples that don't match the scan key */
 			if (key != NULL &&
@@ -909,9 +903,6 @@ continue_page:
 			scan->rs_cindex = lineindex;
 			return;
 		}
-
-		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
 	}
 
 	/* end of scan */
-- 
2.40.1

v6-0004-Provide-API-for-streaming-relation-data.patchtext/x-patch; charset=US-ASCII; name=v6-0004-Provide-API-for-streaming-relation-data.patchDownload
From 78a9008b50d5398cce23cac695da884c41d0d720 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 27 Feb 2024 00:01:42 +1300
Subject: [PATCH v6 4/6] Provide API for streaming relation data.

Introduce an abstraction where relation data can be accessed as a
stream of buffers, with an implementation that is more efficient than
the equivalent sequence of ReadBuffer() calls.

Client code supplies a callback that can say which block number is
wanted next, and then consumes individual buffers one at a time from the
stream.  This division allows read_stream.c to build up calls to
StartReadBuffers() of up to io_combine_limit blocks, and to issue fadvise()
advice ahead of time in a systematic way when random access is detected.

This API is based on an idea from Andres Freund to pave the way for
asynchronous I/O in future work as required to support direct I/O.  The
goal is to have an abstraction that insulates client code from future
changes to the I/O subsystem.

An extended API may be necessary in future for more complicated cases
(for example recovery, whose LsnReadQueue device in xlogprefetcher.c is
a distant cousin of this code that should eventually be replaced by it),
but this basic API is sufficient for many common usage patterns
involving predictable access to a single relation fork.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Heikki Linnakangas <hlinnaka@iki.fi> (contributions)
Suggested-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
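
An illustrative consumer of this API (not part of the patch), using the
hypothetical names ExampleScanState and example_next_block, might look
roughly like this:

    /* Hypothetical callback: return the next block number to read, or
     * InvalidBlockNumber to signal end-of-stream. */
    static BlockNumber
    example_next_block(ReadStream *stream, void *callback_private_data,
                       void *per_buffer_data)
    {
        ExampleScanState *state = callback_private_data;

        if (state->next_block >= state->nblocks)
            return InvalidBlockNumber;
        return state->next_block++;
    }

    ...

    /* Start a stream over the main fork, with no per-buffer data. */
    stream = read_stream_begin_relation(READ_STREAM_FULL,
                                        NULL,       /* no strategy */
                                        BMR_REL(rel),
                                        MAIN_FORKNUM,
                                        example_next_block,
                                        state,      /* callback_private_data */
                                        0);         /* per_buffer_data_size */

    /* Consume pinned buffers until the callback ends the stream. */
    while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
    {
        /* ... examine the page in buf ... */
        ReleaseBuffer(buf);
    }
    read_stream_end(stream);
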
---
 src/backend/storage/Makefile          |   2 +-
 src/backend/storage/aio/Makefile      |  14 +
 src/backend/storage/aio/meson.build   |   5 +
 src/backend/storage/aio/read_stream.c | 733 ++++++++++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c   |  21 +-
 src/backend/storage/meson.build       |   1 +
 src/include/storage/read_stream.h     |  62 +++
 src/tools/pgindent/typedefs.list      |   2 +
 8 files changed, 829 insertions(+), 11 deletions(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/read_stream.c
 create mode 100644 src/include/storage/read_stream.h

diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 00000000000..2f29a9ec4d1
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	read_stream.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 00000000000..10e1aa3b20b
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'read_stream.c',
+)
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
new file mode 100644
index 00000000000..4e293e0df65
--- /dev/null
+++ b/src/backend/storage/aio/read_stream.c
@@ -0,0 +1,733 @@
+/*-------------------------------------------------------------------------
+ *
+ * read_stream.c
+ *	  Mechanism for accessing buffered relation data with look-ahead
+ *
+ * Code that needs to access relation data typically pins blocks one at a
+ * time, often in a predictable order that might be sequential or data-driven.
+ * Calling the simple ReadBuffer() function for each block is inefficient,
+ * because blocks that are not yet in the buffer pool require I/O operations
+ * that are small and might stall waiting for storage.  This mechanism looks
+ * into the future and calls StartReadBuffers() and WaitReadBuffers() to read
+ * neighboring blocks together and ahead of time, with an adaptive look-ahead
+ * distance.
+ *
+ * A user-provided callback generates a stream of block numbers that is used
+ * to form reads of up to io_combine_limit, by attempting to merge them with a
+ * pending read.  When that isn't possible, the existing pending read is sent
+ * to StartReadBuffers() so that a new one can begin to form.
+ *
+ * The algorithm for controlling the look-ahead distance tries to classify the
+ * stream into three ideal behaviors:
+ *
+ * A) No I/O is necessary, because the requested blocks are fully cached
+ * already.  There is no benefit to looking ahead more than one block, so
+ * distance is 1.  This is the default initial assumption.
+ *
+ * B) I/O is necessary, but fadvise is undesirable because the access is
+ * sequential, or impossible because direct I/O is enabled or the system
+ * doesn't support advice.  There is no benefit in looking ahead more than
+ * io_combine_limit, because in this case the only goal is larger read system
+ * calls.  Looking further ahead would pin many buffers and perform
+ * speculative work looking ahead for no benefit.
+ *
+ * C) I/O is necessary, it appears random, and this system supports fadvise.
+ * We'll look further ahead in order to reach the configured level of I/O
+ * concurrency.
+ *
+ * The distance increases rapidly and decays slowly, so that it moves towards
+ * those levels as different I/O patterns are discovered.  For example, a
+ * sequential scan of fully cached data doesn't bother looking ahead, but a
+ * sequential scan that hits a region of uncached blocks will start issuing
+ * increasingly wide read calls until it plateaus at io_combine_limit.
+ *
+ * The main data structure is a circular queue of buffers of size
+ * max_pinned_buffers plus some extra space for technical reasons, ready to be
+ * returned by read_stream_next_buffer().  Each buffer also has an optional
+ * variable sized object that is passed from the callback to the consumer of
+ * buffers.
+ *
+ * Parallel to the queue of buffers, there is a circular queue of in-progress
+ * I/Os that have been started with StartReadBuffers(), and for which
+ * WaitReadBuffers() must be called before returning the buffer.
+ *
+ * For example, if the callback returns block numbers 10, 42, 43, 60 in
+ * successive calls, then these data structures might appear as follows:
+ *
+ *                          buffers buf/data       ios
+ *
+ *                          +----+  +-----+       +--------+
+ *                          |    |  |     |  +----+ 42..44 | <- oldest_io_index
+ *                          +----+  +-----+  |    +--------+
+ *   oldest_buffer_index -> | 10 |  |  ?  |  | +--+ 60..60 |
+ *                          +----+  +-----+  | |  +--------+
+ *                          | 42 |  |  ?  |<-+ |  |        | <- next_io_index
+ *                          +----+  +-----+    |  +--------+
+ *                          | 43 |  |  ?  |    |  |        |
+ *                          +----+  +-----+    |  +--------+
+ *                          | 44 |  |  ?  |    |  |        |
+ *                          +----+  +-----+    |  +--------+
+ *                          | 60 |  |  ?  |<---+
+ *                          +----+  +-----+
+ *     next_buffer_index -> |    |  |     |
+ *                          +----+  +-----+
+ *
+ * In the example, 5 buffers are pinned, and the next buffer to be streamed to
+ * the client is block 10.  Block 10 was a hit and has no associated I/O, but
+ * the range 42..44 requires an I/O wait before its buffers are returned, as
+ * does block 60.
+ *
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/aio/read_stream.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "catalog/pg_tablespace.h"
+#include "miscadmin.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+#include "storage/read_stream.h"
+#include "utils/memdebug.h"
+#include "utils/rel.h"
+#include "utils/spccache.h"
+
+typedef struct InProgressIO
+{
+	int16		buffer_index;
+	ReadBuffersOperation op;
+} InProgressIO;
+
+/*
+ * State for managing a stream of reads.
+ */
+struct ReadStream
+{
+	int16		max_ios;
+	int16		ios_in_progress;
+	int16		queue_size;
+	int16		max_pinned_buffers;
+	int16		pinned_buffers;
+	int16		distance;
+	bool		advice_enabled;
+
+	/*
+	 * Sometimes we need to be able to 'unget' a block number to resolve a
+	 * flow control problem when I/Os are split.
+	 */
+	BlockNumber unget_blocknum;
+	bool		have_unget_blocknum;
+
+	/*
+	 * The callback that will tell us which block numbers to read, and an
+	 * opaque pointer that will be passed to it for its own purposes.
+	 */
+	ReadStreamBlockNumberCB callback;
+	void	   *callback_private_data;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* The read operation we are currently preparing. */
+	BlockNumber pending_read_blocknum;
+	int16		pending_read_nblocks;
+
+	/* Space for buffers and optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+
+	/* Read operations that have been started but not waited for yet. */
+	InProgressIO *ios;
+	int16		oldest_io_index;
+	int16		next_io_index;
+
+	/* Circular queue of buffers. */
+	int16		oldest_buffer_index;	/* Next pinned buffer to return */
+	int16		next_buffer_index;	/* Index of next buffer to pin */
+	Buffer		buffers[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/*
+ * Return a pointer to the per-buffer data by index.
+ */
+static inline void *
+get_per_buffer_data(ReadStream *stream, int16 buffer_index)
+{
+	return (char *) stream->per_buffer_data +
+		stream->per_buffer_data_size * buffer_index;
+}
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow read_stream_unget_block() to work.
+ */
+static inline BlockNumber
+read_stream_get_block(ReadStream *stream, void *per_buffer_data)
+{
+	if (!stream->have_unget_blocknum)
+		return stream->callback(stream,
+								stream->callback_private_data,
+								per_buffer_data);
+
+	/*
+	 * You can only unget one block, and next_buffer_index can't change across
+	 * a get, unget, get sequence, so the callback's per_buffer_data, if any,
+	 * is still present in the correct slot.  We just have to return the
+	 * previous block number.
+	 */
+	stream->have_unget_blocknum = false;
+	return stream->unget_blocknum;
+}
+
+/*
+ * In order to deal with short reads in StartReadBuffers(), we sometimes need
+ * to defer handling of a block until later.
+ */
+static inline void
+read_stream_unget_block(ReadStream *stream, BlockNumber blocknum)
+{
+	Assert(!stream->have_unget_blocknum);
+	stream->have_unget_blocknum = true;
+	stream->unget_blocknum = blocknum;
+}
+
+static void
+read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
+{
+	bool		need_wait;
+	int			nblocks;
+	int			flags;
+	int16		io_index;
+	int16		overflow;
+	int16		buffer_index;
+
+	/* This should only be called with a pending read. */
+	Assert(stream->pending_read_nblocks > 0);
+	Assert(stream->pending_read_nblocks <= io_combine_limit);
+
+	/* We had better not exceed the pin limit by starting this read. */
+	Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
+		   stream->max_pinned_buffers);
+
+	/* We had better not be overwriting an existing pinned buffer. */
+	if (stream->pinned_buffers > 0)
+		Assert(stream->next_buffer_index != stream->oldest_buffer_index);
+	else
+		Assert(stream->next_buffer_index == stream->oldest_buffer_index);
+
+	/*
+	 * If advice hasn't been suppressed, this system supports it, and this
+	 * isn't a strictly sequential pattern, then we'll issue advice.
+	 */
+	if (!suppress_advice &&
+		stream->advice_enabled &&
+		stream->pending_read_blocknum != stream->seq_blocknum)
+		flags = READ_BUFFERS_ISSUE_ADVICE;
+	else
+		flags = 0;
+
+	/* We say how many blocks we want to read, but it may be fewer on return. */
+	buffer_index = stream->next_buffer_index;
+	io_index = stream->next_io_index;
+	nblocks = stream->pending_read_nblocks;
+	need_wait = StartReadBuffers(&stream->ios[io_index].op,
+								 &stream->buffers[buffer_index],
+								 stream->pending_read_blocknum,
+								 &nblocks,
+								 flags);
+	stream->pinned_buffers += nblocks;
+
+	/* Remember whether we need to wait before returning this buffer. */
+	if (!need_wait)
+	{
+		/* Look-ahead distance decays, no I/O necessary (behavior A). */
+		if (stream->distance > 1)
+			stream->distance--;
+	}
+	else
+	{
+		/*
+		 * Remember to call WaitReadBuffers() before returning head buffer.
+		 * Look-ahead distance will be adjusted after waiting.
+		 */
+		stream->ios[io_index].buffer_index = buffer_index;
+		if (++stream->next_io_index == stream->max_ios)
+			stream->next_io_index = 0;
+		Assert(stream->ios_in_progress < stream->max_ios);
+		stream->ios_in_progress++;
+		stream->seq_blocknum = stream->pending_read_blocknum + nblocks;
+	}
+
+	/*
+	 * We gave a contiguous range of buffer space to StartReadBuffers(), but
+	 * we want it to wrap around at queue_size.  Slide overflowing buffers to
+	 * the front of the array.
+	 */
+	overflow = (buffer_index + nblocks) - stream->queue_size;
+	if (overflow > 0)
+		memmove(&stream->buffers[0],
+				&stream->buffers[stream->queue_size],
+				sizeof(stream->buffers[0]) * overflow);
+
+	/* Compute location of start of next read, without using % operator. */
+	buffer_index += nblocks;
+	if (buffer_index >= stream->queue_size)
+		buffer_index -= stream->queue_size;
+	Assert(buffer_index >= 0 && buffer_index < stream->queue_size);
+	stream->next_buffer_index = buffer_index;
+
+	/* Adjust the pending read to cover the remaining portion, if any. */
+	stream->pending_read_blocknum += nblocks;
+	stream->pending_read_nblocks -= nblocks;
+}
+
+static void
+read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
+{
+	while (stream->ios_in_progress < stream->max_ios &&
+		   stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
+	{
+		BlockNumber blocknum;
+		int16		buffer_index;
+		void	   *per_buffer_data;
+
+		if (stream->pending_read_nblocks == io_combine_limit)
+		{
+			read_stream_start_pending_read(stream, suppress_advice);
+			suppress_advice = false;
+			continue;
+		}
+
+		/*
+		 * See which block the callback wants next in the stream.  We need to
+		 * compute the index of the Nth block of the pending read including
+		 * wrap-around, but we don't want to use the expensive % operator.
+		 */
+		buffer_index = stream->next_buffer_index + stream->pending_read_nblocks;
+		if (buffer_index >= stream->queue_size)
+			buffer_index -= stream->queue_size;
+		Assert(buffer_index >= 0 && buffer_index < stream->queue_size);
+		per_buffer_data = get_per_buffer_data(stream, buffer_index);
+		blocknum = read_stream_get_block(stream, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			stream->distance = 0;
+			break;
+		}
+
+		/* Can we merge it with the pending read? */
+		if (stream->pending_read_nblocks > 0 &&
+			stream->pending_read_blocknum + stream->pending_read_nblocks == blocknum)
+		{
+			stream->pending_read_nblocks++;
+			continue;
+		}
+
+		/* We have to start the pending read before we can build another. */
+		if (stream->pending_read_nblocks > 0)
+		{
+			read_stream_start_pending_read(stream, suppress_advice);
+			suppress_advice = false;
+			if (stream->ios_in_progress == stream->max_ios)
+			{
+				/* And we've hit the limit.  Rewind, and stop here. */
+				read_stream_unget_block(stream, blocknum);
+				return;
+			}
+		}
+
+		/* This is the start of a new pending read. */
+		stream->pending_read_blocknum = blocknum;
+		stream->pending_read_nblocks = 1;
+	}
+
+	/*
+	 * Normally we don't start the pending read just because we've hit a
+	 * limit, preferring to give it another chance to grow to a larger size
+	 * once more buffers have been consumed.  However, in cases where that
+	 * can't possibly happen, we might as well start the read immediately.
+	 */
+	if (stream->pending_read_nblocks > 0 &&
+		(stream->distance == stream->pending_read_nblocks ||
+		 stream->distance == 0) &&
+		stream->ios_in_progress < stream->max_ios)
+		read_stream_start_pending_read(stream, suppress_advice);
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.  The callback should return block numbers or InvalidBlockNumber to
+ * signal end-of-stream, and if per_buffer_data_size is non-zero, it may also
+ * write extra data for each block into the space provided to it.  It will
+ * also receive callback_private_data for its own purposes.
+ */
+ReadStream *
+read_stream_begin_relation(int flags,
+						   BufferAccessStrategy strategy,
+						   BufferManagerRelation bmr,
+						   ForkNumber forknum,
+						   ReadStreamBlockNumberCB callback,
+						   void *callback_private_data,
+						   size_t per_buffer_data_size)
+{
+	ReadStream *stream;
+	size_t		size;
+	int16		queue_size;
+	int16		max_ios;
+	uint32		max_pinned_buffers;
+	Oid			tablespace_id;
+
+	/* Make sure our bmr's smgr and relpersistence are populated. */
+	if (bmr.smgr == NULL)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	/*
+	 * Decide how many I/Os we will allow to run at the same time.  That
+	 * currently means issuing advice to the kernel for blocks we will soon read.
+	 * This number also affects how far we look ahead for opportunities to
+	 * start more I/Os.
+	 */
+	tablespace_id = bmr.smgr->smgr_rlocator.locator.spcOid;
+	if (!OidIsValid(MyDatabaseId) ||
+		(bmr.rel && IsCatalogRelation(bmr.rel)) ||
+		IsCatalogRelationOid(bmr.smgr->smgr_rlocator.locator.relNumber))
+	{
+		/*
+		 * Avoid circularity while trying to look up tablespace settings or
+		 * before spccache.c is ready.
+		 */
+		max_ios = effective_io_concurrency;
+	}
+	else if (flags & READ_STREAM_MAINTENANCE)
+		max_ios = get_tablespace_maintenance_io_concurrency(tablespace_id);
+	else
+		max_ios = get_tablespace_io_concurrency(tablespace_id);
+	max_ios = Min(max_ios, PG_INT16_MAX);
+
+	/*
+	 * Choose the maximum number of buffers we're prepared to pin.  We try to
+	 * pin fewer if we can, though.  We clamp it to at least io_combine_limit
+	 * so that we can have a chance to build up a full io_combine_limit sized
+	 * read, even when max_ios is zero.  Be careful not to allow int16 to
+	 * overflow (even though that's not possible with the current GUC range
+	 * limits), allowing also for the spare entry and the overflow space.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, io_combine_limit);
+	max_pinned_buffers = Min(max_pinned_buffers,
+							 PG_INT16_MAX - io_combine_limit - 1);
+
+	/* Don't allow this backend to pin more than its share of buffers. */
+	if (SmgrIsTemp(bmr.smgr))
+		LimitAdditionalLocalPins(&max_pinned_buffers);
+	else
+		LimitAdditionalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * We need one extra entry for buffers and per-buffer data, because users
+	 * of per-buffer data have access to the object until the next call to
+	 * read_stream_next_buffer(), so we need a gap between the head and tail
+	 * of the queue so that we don't clobber it.
+	 */
+	queue_size = max_pinned_buffers + 1;
+
+	/*
+	 * Allocate the object, the buffers, the ios and per_buffer_data space in
+	 * one big chunk.  Though we have queue_size buffers, we want to be able
+	 * to assume that all the buffers for a single read are contiguous (i.e.
+	 * don't wrap around halfway through), so we allow temporary overflows of
+	 * up to the maximum possible read size by allocating an extra
+	 * io_combine_limit - 1 elements.
+	 */
+	size = offsetof(ReadStream, buffers);
+	size += sizeof(Buffer) * (queue_size + io_combine_limit - 1);
+	size += sizeof(InProgressIO) * Max(1, max_ios);
+	size += per_buffer_data_size * queue_size;
+	size += MAXIMUM_ALIGNOF * 2;
+	stream = (ReadStream *) palloc(size);
+	memset(stream, 0, offsetof(ReadStream, buffers));
+	stream->ios = (InProgressIO *)
+		MAXALIGN(&stream->buffers[queue_size + io_combine_limit - 1]);
+	if (per_buffer_data_size > 0)
+		stream->per_buffer_data = (void *)
+			MAXALIGN(&stream->ios[Max(1, max_ios)]);
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  We can use it as long as
+	 * direct I/O isn't enabled, the caller hasn't promised sequential access
+	 * (overriding our detection heuristics), and max_ios hasn't been set to
+	 * zero.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & READ_STREAM_SEQUENTIAL) == 0 &&
+		max_ios > 0)
+		stream->advice_enabled = true;
+#endif
+
+	/*
+	 * For now, max_ios = 0 is interpreted as max_ios = 1 with advice disabled
+	 * above.  If we had real asynchronous I/O we might need a slightly
+	 * different definition.
+	 */
+	if (max_ios == 0)
+		max_ios = 1;
+
+	stream->max_ios = max_ios;
+	stream->per_buffer_data_size = per_buffer_data_size;
+	stream->max_pinned_buffers = max_pinned_buffers;
+	stream->queue_size = queue_size;
+
+	if (!bmr.smgr)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+	stream->callback = callback;
+	stream->callback_private_data = callback_private_data;
+
+	/*
+	 * Skip the initial ramp-up phase if the caller says we're going to be
+	 * reading the whole relation.  This way we start out assuming we'll be
+	 * doing full io_combine_limit sized reads (behavior B).
+	 */
+	if (flags & READ_STREAM_FULL)
+		stream->distance = Min(max_pinned_buffers, io_combine_limit);
+	else
+		stream->distance = 1;
+
+	/*
+	 * Since we currently always access the same relation, we can
+	 * initialize parts of the ReadBuffersOperation objects and leave them
+	 * that way, to avoid wasting CPU cycles writing to them for each read.
+	 */
+	for (int i = 0; i < max_ios; ++i)
+	{
+		stream->ios[i].op.bmr = bmr;
+		stream->ios[i].op.forknum = forknum;
+		stream->ios[i].op.strategy = strategy;
+	}
+
+	return stream;
+}
+
+/*
+ * Pull one pinned buffer out of a stream created with
+ * read_stream_begin_relation().  Each call returns successive blocks in the
+ * order specified by the callback.  If per_buffer_data_size was set to a
+ * non-zero size, *per_buffer_data receives a pointer to the extra per-buffer
+ * data that the callback had a chance to populate, which remains valid until
+ * the next call to read_stream_next_buffer().  When the stream runs out of
+ * data, InvalidBuffer is returned.  The caller may decide to end the stream
+ * early at any time by calling read_stream_end().
+ */
+Buffer
+read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
+{
+	Buffer		buffer;
+	int16		oldest_buffer_index;
+
+	/*
+	 * A fast path for all-cached scans (behavior A).  This is the same as the
+	 * usual algorithm, but it is specialized for no I/O and no per-buffer
+	 * data, so we can skip the queue management code, stay in the same buffer
+	 * slot and use singular StartReadBuffer().
+	 */
+	if (likely(per_buffer_data == NULL &&
+			   stream->ios_in_progress == 0 &&
+			   stream->pinned_buffers == 1 &&
+			   stream->distance == 1))
+	{
+		BlockNumber next_blocknum;
+
+		/*
+		 * We have a pinned buffer that we need to serve up, but we also want
+		 * to probe the next one before we return, just in case we need to
+		 * start an I/O.  We can re-use the same buffer slot, and an arbitrary
+		 * I/O slot since they're all free.
+		 */
+		oldest_buffer_index = stream->oldest_buffer_index;
+		Assert((oldest_buffer_index + 1) % stream->queue_size ==
+			   stream->next_buffer_index);
+		buffer = stream->buffers[oldest_buffer_index];
+		Assert(buffer != InvalidBuffer);
+		Assert(stream->pending_read_nblocks <= 1);
+		if (unlikely(stream->pending_read_nblocks == 1))
+		{
+			next_blocknum = stream->pending_read_blocknum;
+			stream->pending_read_nblocks = 0;
+		}
+		else
+			next_blocknum = read_stream_get_block(stream, NULL);
+		if (unlikely(next_blocknum == InvalidBlockNumber))
+		{
+			/* End of stream. */
+			stream->distance = 0;
+			stream->next_buffer_index = oldest_buffer_index;
+			/* Pin transferred to caller. */
+			stream->pinned_buffers = 0;
+			return buffer;
+		}
+		/* Call the special single block version, which is marginally faster. */
+		if (unlikely(StartReadBuffer(&stream->ios[0].op,
+									 &stream->buffers[oldest_buffer_index],
+									 next_blocknum,
+									 stream->advice_enabled ?
+									 READ_BUFFERS_ISSUE_ADVICE : 0)))
+		{
+			/* I/O needed.  We'll take the general path next time. */
+			stream->oldest_io_index = 0;
+			stream->next_io_index = stream->max_ios > 1 ? 1 : 0;
+			stream->ios_in_progress = 1;
+			stream->ios[0].buffer_index = oldest_buffer_index;
+			stream->seq_blocknum = next_blocknum + 1;
+			/* Increase look ahead distance (move towards behavior B/C). */
+			stream->distance = Min(2, stream->max_pinned_buffers);
+		}
+		/* Pin transferred to caller, got another one, no net change. */
+		Assert(stream->pinned_buffers == 1);
+		return buffer;
+	}
+
+	if (stream->pinned_buffers == 0)
+	{
+		Assert(stream->oldest_buffer_index == stream->next_buffer_index);
+
+		/* End of stream reached?  */
+		if (stream->distance == 0)
+			return InvalidBuffer;
+
+		/*
+		 * The usual order of operations is that we look ahead at the bottom
+		 * of this function after potentially finishing an I/O and making
+		 * space for more, but if we're just starting up we'll need to crank
+		 * the handle to get started.
+		 */
+		read_stream_look_ahead(stream, true);
+
+		/* End of stream reached? */
+		if (stream->pinned_buffers == 0)
+		{
+			Assert(stream->distance == 0);
+			return InvalidBuffer;
+		}
+	}
+
+	/* Grab the oldest pinned buffer and associated per-buffer data. */
+	Assert(stream->pinned_buffers > 0);
+	oldest_buffer_index = stream->oldest_buffer_index;
+	Assert(oldest_buffer_index >= 0 &&
+		   oldest_buffer_index < stream->queue_size);
+	buffer = stream->buffers[oldest_buffer_index];
+	if (per_buffer_data)
+		*per_buffer_data = get_per_buffer_data(stream, oldest_buffer_index);
+
+	Assert(BufferIsValid(buffer));
+
+	/* Do we have to wait for an associated I/O first? */
+	if (stream->ios_in_progress > 0 &&
+		stream->ios[stream->oldest_io_index].buffer_index == oldest_buffer_index)
+	{
+		int16		io_index = stream->oldest_io_index;
+		int16		distance;
+
+		/* Sanity check that we still agree on the buffers. */
+		Assert(stream->ios[io_index].op.buffers ==
+			   &stream->buffers[oldest_buffer_index]);
+
+		WaitReadBuffers(&stream->ios[io_index].op);
+
+		Assert(stream->ios_in_progress > 0);
+		stream->ios_in_progress--;
+		if (++stream->oldest_io_index == stream->max_ios)
+			stream->oldest_io_index = 0;
+
+		if (stream->ios[io_index].op.flags & READ_BUFFERS_ISSUE_ADVICE)
+		{
+			/* Distance ramps up fast (behavior C). */
+			distance = stream->distance * 2;
+			distance = Min(distance, stream->max_pinned_buffers);
+			stream->distance = distance;
+		}
+		else
+		{
+			/* No advice; move towards io_combine_limit (behavior B). */
+			if (stream->distance > io_combine_limit)
+			{
+				stream->distance--;
+			}
+			else
+			{
+				distance = stream->distance * 2;
+				distance = Min(distance, io_combine_limit);
+				distance = Min(distance, stream->max_pinned_buffers);
+				stream->distance = distance;
+			}
+		}
+	}
+
+#ifdef CLOBBER_FREED_MEMORY
+	/* Clobber old buffer and per-buffer data for debugging purposes. */
+	stream->buffers[oldest_buffer_index] = InvalidBuffer;
+
+	/*
+	 * The caller will get access to the per-buffer data, until the next call.
+	 * We wipe the one before, which is never occupied because queue_size
+	 * allowed one extra element.  This will hopefully trip up client code
+	 * that is holding a dangling pointer to it.
+	 */
+	if (stream->per_buffer_data)
+		wipe_mem(get_per_buffer_data(stream,
+									 oldest_buffer_index == 0 ?
+									 stream->queue_size - 1 :
+									 oldest_buffer_index - 1),
+				 stream->per_buffer_data_size);
+#endif
+
+	/* Pin transferred to caller. */
+	Assert(stream->pinned_buffers > 0);
+	stream->pinned_buffers--;
+
+	/* Advance oldest buffer, with wrap-around. */
+	stream->oldest_buffer_index++;
+	if (stream->oldest_buffer_index == stream->queue_size)
+		stream->oldest_buffer_index = 0;
+
+	/* Prepare for the next call. */
+	read_stream_look_ahead(stream, false);
+
+	return buffer;
+}
+
+/*
+ * Release stream resources.
+ */
+void
+read_stream_end(ReadStream *stream)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead. */
+	stream->distance = 0;
+
+	/* Unpin anything that wasn't consumed. */
+	while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	Assert(stream->pinned_buffers == 0);
+	Assert(stream->ios_in_progress == 0);
+
+	/* Release memory. */
+	pfree(stream);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8df5f3b43da..577bcf6e5dd 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1210,13 +1210,14 @@ StartReadBuffer(ReadBuffersOperation *operation,
  * Begin reading a range of blocks beginning at blockNum and extending for
  * *nblocks.  On return, up to *nblocks pinned buffers holding those blocks
  * are written into the buffers array, and *nblocks is updated to contain the
- * actual number, which may be fewer than requested.
+ * actual number, which may be fewer than requested.  Caller sets some of the
+ * members of operation; see struct definition.
  *
- * If false is returned, no I/O is necessary and WaitReadBuffers() is not
- * necessary.  If true is returned, one I/O has been started, and
- * WaitReadBuffers() must be called with the same operation object before the
- * buffers are accessed.  Along with the operation object, the caller-supplied
- * array of buffers must remain valid until WaitReadBuffers() is called.
+ * If false is returned, no I/O is necessary.  If true is returned, one I/O
+ * has been started, and WaitReadBuffers() must be called with the same
+ * operation object before the buffers are accessed.  Along with the operation
+ * object, the caller-supplied array of buffers must remain valid until
+ * WaitReadBuffers() is called.
  *
  * Currently the I/O is only started with optional operating system advice,
  * and the real I/O happens in WaitReadBuffers().  In future work, true I/O
@@ -2452,7 +2453,7 @@ MarkBufferDirty(Buffer buffer)
 	uint32		old_buf_state;
 
 	if (!BufferIsValid(buffer))
-		elog(ERROR, "bad buffer ID: %d", buffer);
+		elog(PANIC, "bad buffer ID: %d", buffer);
 
 	if (BufferIsLocal(buffer))
 	{
@@ -4824,7 +4825,7 @@ void
 ReleaseBuffer(Buffer buffer)
 {
 	if (!BufferIsValid(buffer))
-		elog(ERROR, "bad buffer ID: %d", buffer);
+		elog(PANIC, "bad buffer ID: %d", buffer);
 
 	if (BufferIsLocal(buffer))
 		UnpinLocalBuffer(buffer);
@@ -4891,7 +4892,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 	Page		page = BufferGetPage(buffer);
 
 	if (!BufferIsValid(buffer))
-		elog(ERROR, "bad buffer ID: %d", buffer);
+		elog(PANIC, "bad buffer ID: %d", buffer);
 
 	if (BufferIsLocal(buffer))
 	{
@@ -5963,7 +5964,7 @@ ResOwnerReleaseBufferPin(Datum res)
 
 	/* Like ReleaseBuffer, but don't call ResourceOwnerForgetBuffer */
 	if (!BufferIsValid(buffer))
-		elog(ERROR, "bad buffer ID: %d", buffer);
+		elog(PANIC, "bad buffer ID: %d", buffer);
 
 	if (BufferIsLocal(buffer))
 		UnpinLocalBufferNoOwner(buffer);
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
new file mode 100644
index 00000000000..9e5fa2acf15
--- /dev/null
+++ b/src/include/storage/read_stream.h
@@ -0,0 +1,62 @@
+/*-------------------------------------------------------------------------
+ *
+ * read_stream.h
+ *	  Mechanism for accessing buffered relation data with look-ahead
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/read_stream.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef READ_STREAM_H
+#define READ_STREAM_H
+
+#include "storage/bufmgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define READ_STREAM_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users, and thus should be governed by maintenance_io_concurrency
+ * instead of effective_io_concurrency.  For example, VACUUM or CREATE INDEX.
+ */
+#define READ_STREAM_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define READ_STREAM_SEQUENTIAL 0x02
+
+/*
+ * We usually ramp up from smaller reads to larger ones, to support users who
+ * don't know if it's worth reading lots of buffers yet.  This flag disables
+ * that, declaring ahead of time that we'll be reading all available buffers.
+ */
+#define READ_STREAM_FULL 0x04
+
+struct ReadStream;
+typedef struct ReadStream ReadStream;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*ReadStreamBlockNumberCB) (ReadStream *stream,
+												void *callback_private_data,
+												void *per_buffer_data);
+
+extern ReadStream *read_stream_begin_relation(int flags,
+											  BufferAccessStrategy strategy,
+											  BufferManagerRelation bmr,
+											  ForkNumber forknum,
+											  ReadStreamBlockNumberCB callback,
+											  void *callback_private_data,
+											  size_t per_buffer_data_size);
+extern Buffer read_stream_next_buffer(ReadStream *stream, void **per_buffer_private);
+extern void read_stream_end(ReadStream *stream);
+
+#endif							/* READ_STREAM_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 97edd1388e9..82fa6c12970 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1214,6 +1214,7 @@ InjectionPointCacheEntry
 InjectionPointEntry
 InjectionPointSharedState
 InlineCodeBlock
+InProgressIO
 InsertStmt
 Instrumentation
 Int128AggState
@@ -2292,6 +2293,7 @@ ReadExtraTocPtrType
 ReadFunc
 ReadLocalXLogPageNoWaitPrivate
 ReadReplicationSlotCmd
+ReadStream
 ReassignOwnedStmt
 RecheckForeignScan_function
 RecordCacheArrayEntry
-- 
2.40.1

v6-0003-Provide-vectored-variant-of-ReadBuffer.patchtext/x-patch; charset=US-ASCII; name=v6-0003-Provide-vectored-variant-of-ReadBuffer.patchDownload
From 8b877bf126ebd9f1c5a896946b01120d7ece7ca5 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v6 3/6] Provide vectored variant of ReadBuffer().

Break ReadBuffer() up into two steps: StartReadBuffers() and
WaitReadBuffers().  This has two advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the kernel.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.  For now we still only read a
block at a time so there is no change to generated system calls yet, but
later commits will provide infrastructure to help build up larger calls.

Callers should respect the new GUC io_combine_limit and the limit on
per-backend pins, which is now exposed as a public interface.

With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of advice.

Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de> (optimization tweaks)
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
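
For illustration only (not part of the patch), a caller of the two-step
interface might look roughly like the sketch below, where rel, blkno and
nblocks_wanted are hypothetical caller-supplied values:

    ReadBuffersOperation op;
    Buffer      buffers[MAX_IO_COMBINE_LIMIT];
    int         nblocks = Min(nblocks_wanted, io_combine_limit);

    /* The caller fills in the relation, fork and strategy members. */
    op.bmr = BMR_REL(rel);
    op.forknum = MAIN_FORKNUM;
    op.strategy = NULL;

    /* Pin up to nblocks consecutive buffers starting at blkno; on return
     * nblocks holds the number actually pinned. */
    if (StartReadBuffers(&op, buffers, blkno, &nblocks, 0))
        WaitReadBuffers(&op);       /* complete the physical read */

    /* buffers[0 .. nblocks - 1] are now valid and pinned. */
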
---
 doc/src/sgml/config.sgml                      |  14 +
 src/backend/storage/buffer/bufmgr.c           | 700 ++++++++++++------
 src/backend/storage/buffer/localbuf.c         |  14 +-
 src/backend/utils/misc/guc_tables.c           |  14 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/storage/bufmgr.h                  |  41 +-
 src/tools/pgindent/typedefs.list              |   1 +
 7 files changed, 563 insertions(+), 222 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5468637e2ef..f3736000ad2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2719,6 +2719,20 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
+      <varlistentry id="guc-io-combine-limit" xreflabel="io_combine_limit">
+       <term><varname>io_combine_limit</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_combine_limit</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Controls the largest I/O size in operations that combine I/O.
+         The default is 128kB.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-max-worker-processes" xreflabel="max_worker_processes">
        <term><varname>max_worker_processes</varname> (<type>integer</type>)
        <indexterm>
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c5..8df5f3b43da 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,11 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffers() -- as above, but for multiple contiguous blocks in
+ *		two steps.
+ *
+ * WaitReadBuffers() -- second step of StartReadBuffers().
+ *
  * ReleaseBuffer() -- unpin a buffer
  *
  * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -160,6 +165,9 @@ int			checkpoint_flush_after = DEFAULT_CHECKPOINT_FLUSH_AFTER;
 int			bgwriter_flush_after = DEFAULT_BGWRITER_FLUSH_AFTER;
 int			backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER;
 
+/* Limit on how many blocks should be handled in single I/O operations. */
+int			io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
+
 /* local state for LockBufferForCleanup */
 static BufferDesc *PinCountWaitBuf = NULL;
 
@@ -471,10 +479,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
-								ReadBufferMode mode, BufferAccessStrategy strategy,
-								bool *hit);
+								ReadBufferMode mode, BufferAccessStrategy strategy);
 static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
 										   ForkNumber fork,
 										   BufferAccessStrategy strategy,
@@ -500,7 +507,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -781,7 +788,6 @@ Buffer
 ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				   ReadBufferMode mode, BufferAccessStrategy strategy)
 {
-	bool		hit;
 	Buffer		buf;
 
 	/*
@@ -794,15 +800,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
-							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+	buf = ReadBuffer_common(BMR_REL(reln),
+							forkNum, blockNum, mode, strategy);
+
 	return buf;
 }
 
@@ -822,13 +822,12 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
 						  BufferAccessStrategy strategy, bool permanent)
 {
-	bool		hit;
-
 	SMgrRelation smgr = smgropen(rlocator, INVALID_PROC_NUMBER);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
-							 mode, strategy, &hit);
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
+							 mode, strategy);
 }
 
 /*
@@ -994,35 +993,146 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	 */
 	if (buffer == InvalidBuffer)
 	{
-		bool		hit;
-
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
-								   fork, extend_to - 1, mode, strategy,
-								   &hit);
+		buffer = ReadBuffer_common(bmr, fork, extend_to - 1, mode, strategy);
 	}
 
 	return buffer;
 }
 
 /*
- * ReadBuffer_common -- common logic for all ReadBuffer variants
- *
- * *hit is set to true if the request was satisfied from shared buffer cache.
+ * Zero a buffer and lock it, as part of the implementation of
+ * RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK.  The buffer must be already
+ * pinned.  It does not have to be valid, but it is valid and locked on
+ * return.
  */
-static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
-				  BlockNumber blockNum, ReadBufferMode mode,
-				  BufferAccessStrategy strategy, bool *hit)
+static void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
 {
 	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
+/*
+ * Pin a buffer for a given block.  *foundPtr is set to true if the block was
+ * already present, or false if more work is required to either read it in or
+ * zero it.
+ */
+static inline Buffer
+PinBufferForBlock(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
 	IOContext	io_context;
 	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
 
-	*hit = false;
+	Assert(blockNum != P_NEW);
+
+	Assert(bmr.smgr);
+
+	isLocalBuf = bmr.relpersistence == RELPERSISTENCE_TEMP;
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
+
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
+
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * WaitReadBuffers() (so, not for hits, and not for buffers that are
+		 * zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
+	{
+		VacuumPageHit++;
+		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageHit;
+
+		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
+	}
+
+	return BufferDescriptorGetBuffer(bufHdr);
+}
+
+/*
+ * ReadBuffer_common -- common logic for all ReadBuffer variants
+ */
+static inline Buffer
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
+				  BlockNumber blockNum, ReadBufferMode mode,
+				  BufferAccessStrategy strategy)
+{
+	ReadBuffersOperation operation;
+	Buffer		buffer;
+	int			flags;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1041,181 +1151,358 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
-
-	if (isLocalBuf)
+	if (unlikely(mode == RBM_ZERO_AND_CLEANUP_LOCK ||
+				 mode == RBM_ZERO_AND_LOCK))
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
-		io_context = IOCONTEXT_NORMAL;
-		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
+		bool		found;
+
+		if (bmr.smgr == NULL)
+		{
+			bmr.smgr = RelationGetSmgr(bmr.rel);
+			bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+		}
+
+		buffer = PinBufferForBlock(bmr, forkNum, blockNum, strategy, &found);
+		ZeroBuffer(buffer, mode);
+		return buffer;
 	}
+
+	if (mode == RBM_ZERO_ON_ERROR)
+		flags = READ_BUFFERS_ZERO_ON_ERROR;
 	else
+		flags = 0;
+	operation.bmr = bmr;
+	operation.forknum = forkNum;
+	operation.strategy = strategy;
+	if (StartReadBuffer(&operation,
+						&buffer,
+						blockNum,
+						flags))
+		WaitReadBuffers(&operation);
+
+	return buffer;
+}
+
+/*
+ * Single block version of StartReadBuffers().  This might save a few
+ * instructions when called from another translation unit, if the compiler
+ * inlines the code and specializes for nblocks == 1.
+ */
+bool
+StartReadBuffer(ReadBuffersOperation *operation,
+				Buffer *buffer,
+				BlockNumber blocknum,
+				int flags)
+{
+	int			nblocks = 1;
+	bool		result;
+
+	result = StartReadBuffers(operation, buffer, blocknum, &nblocks, flags);
+	Assert(nblocks == 1);		/* single block can't be short */
+
+	return result;
+}
+
+/*
+ * Begin reading a range of blocks beginning at blockNum and extending for
+ * *nblocks.  On return, up to *nblocks pinned buffers holding those blocks
+ * are written into the buffers array, and *nblocks is updated to contain the
+ * actual number, which may be fewer than requested.
+ *
+ * If false is returned, no I/O is necessary and WaitReadBuffers() is not
+ * necessary.  If true is returned, one I/O has been started, and
+ * WaitReadBuffers() must be called with the same operation object before the
+ * buffers are accessed.  Along with the operation object, the caller-supplied
+ * array of buffers must remain valid until WaitReadBuffers() is called.
+ *
+ * Currently the I/O is only started with optional operating system advice,
+ * and the real I/O happens in WaitReadBuffers().  In future work, true I/O
+ * could be initiated here.
+ */
+inline bool
+StartReadBuffers(ReadBuffersOperation *operation,
+				 Buffer *buffers,
+				 BlockNumber blockNum,
+				 int *nblocks,
+				 int flags)
+{
+	int			actual_nblocks = *nblocks;
+	int			io_buffers_len = 0;
+
+	Assert(*nblocks > 0);
+	Assert(*nblocks <= MAX_IO_COMBINE_LIMIT);
+
+	if (!operation->bmr.smgr)
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
-		io_context = IOContextForStrategy(strategy);
-		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
+		operation->bmr.smgr = RelationGetSmgr(operation->bmr.rel);
+		operation->bmr.relpersistence = operation->bmr.rel->rd_rel->relpersistence;
 	}
 
-	/* At this point we do NOT hold any locks. */
-
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	for (int i = 0; i < actual_nblocks; ++i)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
-		VacuumPageHit++;
-		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
+		bool		found;
 
-		if (VacuumCostActive)
-			VacuumCostBalance += VacuumCostPageHit;
+		buffers[i] = PinBufferForBlock(operation->bmr,
+									   operation->forknum,
+									   blockNum + i,
+									   operation->strategy,
+									   &found);
 
-		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+		if (found)
+		{
+			/*
+			 * Terminate the read as soon as we get a hit.  It could be a
+			 * single buffer hit, or it could be a hit that follows a readable
+			 * range.  We don't want to create more than one readable range,
+			 * so we stop here.
+			 */
+			actual_nblocks = i + 1;
+			break;
+		}
+		else
+		{
+			/* Extend the readable range to cover this block. */
+			io_buffers_len++;
+		}
+	}
+	*nblocks = actual_nblocks;
 
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
+	if (io_buffers_len > 0)
+	{
+		/* Populate information needed for I/O. */
+		operation->buffers = buffers;
+		operation->blocknum = blockNum;
+		operation->flags = flags;
+		operation->nblocks = actual_nblocks;
+		operation->io_buffers_len = io_buffers_len;
+
+		if (flags & READ_BUFFERS_ISSUE_ADVICE)
 		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
+			/*
+			 * In theory we should only do this if PinBufferForBlock() had to
+			 * allocate new buffers above.  That way, if two calls to
+			 * StartReadBuffers() were made for the same blocks before
+			 * WaitReadBuffers(), only the first would issue the advice.
+			 * That'd be a better simulation of true asynchronous I/O, which
+			 * would only start the I/O once, but isn't done here for
+			 * simplicity.  Note also that the following call might actually
+			 * issue two advice calls if we cross a segment boundary; in a
+			 * true asynchronous version we might choose to process only one
+			 * real I/O at a time in that case.
+			 */
+			smgrprefetch(operation->bmr.smgr,
+						 operation->forknum,
+						 blockNum,
+						 operation->io_buffers_len);
 		}
 
-		return BufferDescriptorGetBuffer(bufHdr);
+		/* Indicate that WaitReadBuffers() should be called. */
+		return true;
+	}
+	else
+	{
+		return false;
 	}
+}
+
+static inline bool
+WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
+	{
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
+	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
+
+void
+WaitReadBuffers(ReadBuffersOperation *operation)
+{
+	Buffer	   *buffers;
+	int			nblocks;
+	BlockNumber blocknum;
+	ForkNumber	forknum;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
 
 	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
+	 * Currently operations are only allowed to include a read of some range,
+	 * with an optional extra buffer that is already pinned at the end.  So
+	 * nblocks can be at most one more than io_buffers_len.
 	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+	Assert((operation->nblocks == operation->io_buffers_len) ||
+		   (operation->nblocks == operation->io_buffers_len + 1));
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	/* Find the range of the physical read we need to perform. */
+	nblocks = operation->io_buffers_len;
+	if (nblocks == 0)
+		return;					/* nothing to do */
+
+	buffers = &operation->buffers[0];
+	blocknum = operation->blocknum;
+	forknum = operation->forknum;
+
+	isLocalBuf = operation->bmr.relpersistence == RELPERSISTENCE_TEMP;
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(operation->strategy);
+		io_object = IOOBJECT_RELATION;
+	}
 
 	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
 	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
 	else
+		pgBufferUsage.shared_blks_read += nblocks;
+
+	for (int i = 0; i < nblocks; ++i)
 	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_IO_COMBINE_LIMIT];
+		void	   *io_pages[MAX_IO_COMBINE_LIMIT];
+		instr_time	io_start;
+		BlockNumber io_first_block;
+
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!WaitReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PinBufferForBlock().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  operation->bmr.smgr->smgr_rlocator.locator.spcOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.dbOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.relNumber,
+											  operation->bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+		/*
+		 * How many neighboring-on-disk blocks can we scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   WaitReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(operation->bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
 			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
 			}
 			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-		}
-	}
-
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
-	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
-	}
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
 
-	if (isLocalBuf)
-	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(operation->bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(operation->bmr.smgr->smgr_rlocator, forknum))));
+			}
 
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
-	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
-	}
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  operation->bmr.smgr->smgr_rlocator.locator.spcOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.dbOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.relNumber,
+											  operation->bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
- * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
- *		buffer.  If no buffer exists already, selects a replacement
- *		victim and evicts the old page, but does NOT read in new page.
+ * BufferAlloc -- subroutine for PinBufferForBlock.  Handles lookup of a shared
+ *		buffer.  If no buffer exists already, selects a replacement victim and
+ *		evicts the old page, but does NOT read in new page.
  *
  * "strategy" can be a buffer replacement strategy object, or NULL for
  * the default strategy.  The selected buffer's usage_count is advanced when
@@ -1223,11 +1510,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1286,19 +1569,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1363,19 +1637,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1407,15 +1672,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1769,7 +2028,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2034,7 +2293,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2057,7 +2316,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2372,7 +2631,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if StartReadBuffers() was called and
+		 * WaitReadBuffers() hasn't been called yet.  We'll check by loading
+		 * the flags without locking.  This is racy, but it's OK to return
+		 * false spuriously: when WaitReadBuffers() calls StartBufferIO(),
+		 * it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2381,7 +2645,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3449,7 +3713,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -5184,9 +5448,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5199,6 +5469,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index fcfac335a57..985a2c7049c 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -108,10 +108,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -287,7 +286,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -297,9 +296,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but it might not be
+	 * initialized yet, so read num_temp_buffers.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index abd9029451f..313e393262f 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3112,6 +3112,20 @@ struct config_int ConfigureNamesInt[] =
 		NULL
 	},
 
+	{
+		{"io_combine_limit",
+			PGC_USERSET,
+			RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Limit on the size of data reads and writes."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&io_combine_limit,
+		DEFAULT_IO_COMBINE_LIMIT,
+		1, MAX_IO_COMBINE_LIMIT,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
 			gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 2244ee52f79..7fa6d5a64c8 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -203,6 +203,7 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
+#io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
 #max_worker_processes = 8		# (change requires restart)
 #max_parallel_workers_per_gather = 2	# limited by max_parallel_workers
 #max_parallel_maintenance_workers = 2	# limited by max_parallel_workers
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..241f68c45e1 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -133,6 +134,10 @@ extern PGDLLIMPORT bool track_io_timing;
 extern PGDLLIMPORT int effective_io_concurrency;
 extern PGDLLIMPORT int maintenance_io_concurrency;
 
+#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
+#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
+extern PGDLLIMPORT int io_combine_limit;
+
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
 extern PGDLLIMPORT int bgwriter_flush_after;
@@ -158,7 +163,6 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
-
 /*
  * prototypes for functions in bufmgr.c
  */
@@ -177,6 +181,38 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+struct ReadBuffersOperation
+{
+	/* The following members should be set by the caller. */
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+	BufferAccessStrategy strategy;
+
+	/* The following private members should not be accessed directly. */
+	Buffer	   *buffers;
+	BlockNumber blocknum;
+	int			flags;
+	int16		nblocks;
+	int16		io_buffers_len;
+};
+
+typedef struct ReadBuffersOperation ReadBuffersOperation;
+
+extern bool StartReadBuffer(ReadBuffersOperation *operation,
+							Buffer *buffer,
+							BlockNumber blocknum,
+							int flags);
+extern bool StartReadBuffers(ReadBuffersOperation *operation,
+							 Buffer *buffers,
+							 BlockNumber blocknum,
+							 int *nblocks,
+							 int flags);
+extern void WaitReadBuffers(ReadBuffersOperation *operation);
+
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -250,6 +286,9 @@ extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cfa9d5aaeac..97edd1388e9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2286,6 +2286,7 @@ ReInitializeDSMForeignScan_function
 ReScanForeignScan_function
 ReadBufPtrType
 ReadBufferMode
+ReadBuffersOperation
 ReadBytePtrType
 ReadExtraTocPtrType
 ReadFunc
-- 
2.40.1

v6-0005-Add-read_stream_reset.patch (text/x-patch; charset=US-ASCII)
From 78b7bb5e30f7998b2701f5e967ec5d9c1849f7da Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 27 Mar 2024 16:46:42 -0400
Subject: [PATCH v6 5/6] Add read_stream_reset

For rescan, we want to reuse the existing ReadStream and simply release
the pinned buffers.
---
 src/backend/storage/aio/read_stream.c | 19 +++++++++++++++++++
 src/include/storage/read_stream.h     |  1 +
 2 files changed, 20 insertions(+)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 4e293e0df65..0d855475959 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -731,3 +731,22 @@ read_stream_end(ReadStream *stream)
 	/* Release memory. */
 	pfree(stream);
 }
+
+
+/*
+ * Reset a read stream by releasing all of the buffers.
+ */
+void
+read_stream_reset(ReadStream *stream)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead. */
+	stream->distance = 0;
+
+	/* Unpin anything that wasn't consumed. */
+	while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	stream->distance = 1;
+}
diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
index 9e5fa2acf15..dc1733a1594 100644
--- a/src/include/storage/read_stream.h
+++ b/src/include/storage/read_stream.h
@@ -58,5 +58,6 @@ extern ReadStream *read_stream_begin_relation(int flags,
 											  size_t per_buffer_data_size);
 extern Buffer read_stream_next_buffer(ReadStream *stream, void **per_buffer_private);
 extern void read_stream_end(ReadStream *stream);
+extern void read_stream_reset(ReadStream *stream);
 
 #endif							/* READ_STREAM_H */
-- 
2.40.1
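
As a usage note (not part of the patch itself): the intended pattern is that a caller drains the stream with read_stream_next_buffer() and, on rescan, calls read_stream_reset() instead of tearing the stream down and rebuilding it. A minimal sketch, with made-up consume_stream()/rescan() helper names:

static void
consume_stream(ReadStream *stream)
{
    Buffer      buf;

    /* Pull pinned buffers out of the stream until it is exhausted. */
    while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
    {
        /* ... examine the page in buf ... */
        ReleaseBuffer(buf);
    }
}

static void
rescan(ReadStream *stream)
{
    /*
     * Cheaper than read_stream_end() followed by another
     * read_stream_begin_relation(): this just unpins the look-ahead
     * buffers and resets the look-ahead distance to 1.
     */
    read_stream_reset(stream);
}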

v6-0006-Sequential-scans-and-TID-range-scans-stream-reads.patch (text/x-patch; charset=US-ASCII)
From f0c79f89574725460342fe55d56365bc2031510e Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 27 Mar 2024 20:25:06 -0400
Subject: [PATCH v6 6/6] Sequential scans and TID range scans stream reads

Implementing streaming read support for heap sequential scans and TID
range scans includes three parts:

Allocate the read stream object in heap_beginscan(). On rescan, reset
the stream by releasing all pinned buffers and resetting the prefetch
block.

Implement a callback returning the next block to prefetch to the
read stream infrastructure.

Invoke the read stream API when a new page is needed. When the scan
direction changes, reset the stream.

ci-os-only:
---
 src/backend/access/heap/heapam.c | 88 ++++++++++++++++++++++++++++----
 src/include/access/heapam.h      | 15 ++++++
 2 files changed, 93 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c814aea9407..eee62b326e2 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -221,6 +221,25 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * ----------------------------------------------------------------
  */
 
+static BlockNumber
+heap_scan_stream_read_next(ReadStream *pgsr, void *private_data,
+						   void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) private_data;
+
+	if (unlikely(!scan->rs_inited))
+	{
+		scan->rs_prefetch_block = heapgettup_initial_block(scan, scan->rs_dir);
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_prefetch_block = heapgettup_advance_block(scan,
+														   scan->rs_prefetch_block,
+														   scan->rs_dir);
+
+	return scan->rs_prefetch_block;
+}
+
 /* ----------------
  *		initscan - scan code common to heap_beginscan and heap_rescan
  * ----------------
@@ -323,6 +342,13 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
 
+	/*
+	 * Initialize to ForwardScanDirection because it is most common and heap
+	 * scans usually must go forwards before going backward.
+	 */
+	scan->rs_dir = ForwardScanDirection;
+	scan->rs_prefetch_block = InvalidBlockNumber;
+
 	/* page-at-a-time fields are always invalid when not rs_inited */
 
 	/*
@@ -465,6 +491,8 @@ heapbuildvis(TableScanDesc sscan)
 static inline void
 heapfetchbuf(HeapScanDesc scan, ScanDirection dir)
 {
+	Assert(scan->rs_read_stream);
+
 	/* release previous scan buffer, if any */
 	if (BufferIsValid(scan->rs_cbuf))
 	{
@@ -479,19 +507,23 @@ heapfetchbuf(HeapScanDesc scan, ScanDirection dir)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	if (unlikely(!scan->rs_inited))
+	/*
+	 * If the scan direction is changing, reset the prefetch block to the
+	 * current block. Otherwise, we will incorrectly prefetch the blocks
+	 * between the prefetch block and the current block again before
+	 * prefetching blocks in the new, correct scan direction.
+	 */
+	if (unlikely(scan->rs_dir != dir))
 	{
-		scan->rs_cblock = heapgettup_initial_block(scan, dir);
-		Assert(scan->rs_cblock != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
-		scan->rs_inited = true;
+		scan->rs_prefetch_block = scan->rs_cblock;
+		read_stream_reset(scan->rs_read_stream);
 	}
-	else
-		scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock, dir);
 
-	/* read block if valid */
-	if (BlockNumberIsValid(scan->rs_cblock))
-		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
-										   scan->rs_cblock, RBM_NORMAL, scan->rs_strategy);
+	scan->rs_dir = dir;
+
+	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
+	if (BufferIsValid(scan->rs_cbuf))
+		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
 }
 
 /*
@@ -820,6 +852,7 @@ continue_page:
 
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
+	scan->rs_prefetch_block = InvalidBlockNumber;
 	tuple->t_data = NULL;
 	scan->rs_inited = false;
 }
@@ -910,6 +943,7 @@ continue_page:
 		ReleaseBuffer(scan->rs_cbuf);
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
+	scan->rs_prefetch_block = InvalidBlockNumber;
 	tuple->t_data = NULL;
 	scan->rs_inited = false;
 }
@@ -983,6 +1017,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	/* we only need to set this up once */
 	scan->rs_ctup.t_tableOid = RelationGetRelid(relation);
 
+	scan->rs_read_stream = NULL;
+
 	/*
 	 * Allocate memory to keep track of page allocation for parallel workers
 	 * when doing a parallel scan.
@@ -1003,6 +1039,24 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 
 	initscan(scan, key, false);
 
+	/*
+	 * We do not know the scan direction yet. If the scan does not end up
+	 * being a forward scan, the read stream will be freed. This is best done
+	 * after initscan().
+	 */
+	if (scan->rs_base.rs_flags & SO_TYPE_SEQSCAN ||
+		scan->rs_base.rs_flags & SO_TYPE_TIDRANGESCAN)
+	{
+		scan->rs_read_stream = read_stream_begin_relation(READ_STREAM_SEQUENTIAL,
+														  scan->rs_strategy,
+														  BMR_REL(scan->rs_base.rs_rd),
+														  MAIN_FORKNUM,
+														  heap_scan_stream_read_next,
+														  scan,
+														  0);
+	}
+
+
 	return (TableScanDesc) scan;
 }
 
@@ -1041,6 +1095,14 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 	 * reinitialize scan descriptor
 	 */
 	initscan(scan, key, true);
+
+	/*
+	 * The read stream is reset on rescan. This must be done after initscan(),
+	 * as some state referred to by read_stream_reset() is reset in
+	 * initscan().
+	 */
+	if (scan->rs_read_stream)
+		read_stream_reset(scan->rs_read_stream);
 }
 
 void
@@ -1056,6 +1118,12 @@ heap_endscan(TableScanDesc sscan)
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
 
+	/*
+	 * Must free the read stream before freeing the BufferAccessStrategy
+	 */
+	if (scan->rs_read_stream)
+		read_stream_end(scan->rs_read_stream);
+
 	/*
 	 * decrement relation reference count and free scan descriptor storage
 	 */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index bacef18c0d3..ae4577f1cb8 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,6 +25,7 @@
 #include "storage/bufpage.h"
 #include "storage/dsm.h"
 #include "storage/lockdefs.h"
+#include "storage/read_stream.h"
 #include "storage/shm_toc.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -66,6 +67,20 @@ typedef struct HeapScanDescData
 
 	HeapTupleData rs_ctup;		/* current tuple in scan, if any */
 
+	/* For scans that stream reads */
+	ReadStream *rs_read_stream;
+
+	/*
+	 * For sequential scans and TID range scans to stream reads. The read
+	 * stream is allocated at the beginning of the scan and reset on rescan or
+	 * when the scan direction changes. The scan direction is saved each time
+	 * a new page is requested. If the scan direction changes from one page to
+	 * the next, the read stream releases all previously pinned buffers and
+	 * resets the prefetch block.
+	 */
+	ScanDirection rs_dir;
+	BlockNumber rs_prefetch_block;
+
 	/*
 	 * For parallel scans to store page allocation data.  NULL when not
 	 * performing a parallel scan.
-- 
2.40.1

#12Melanie Plageman
melanieplageman@gmail.com
In reply to: Melanie Plageman (#11)
4 attachment(s)
Re: Streaming read-ready sequential scan code

On Wed, Mar 27, 2024 at 08:47:03PM -0400, Melanie Plageman wrote:

On Fri, Mar 8, 2024 at 4:56 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Sat, Mar 02, 2024 at 06:07:48PM -0500, Melanie Plageman wrote:

On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote:

On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

There is an outstanding question about where to allocate the
PgStreamingRead object for sequential scans

I've written three alternative implementations of the actual streaming
read user for sequential scan which handle the question of where to
allocate the streaming read object and how to handle changing scan
direction in different ways.

Option A) https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation
- Allocates the streaming read object in initscan(). Since we do not
know the scan direction at this time, if the scan ends up not being a
forwards scan, the streaming read object must later be freed -- so
this will sometimes allocate a streaming read object it never uses.
- Only supports ForwardScanDirection and once the scan direction
changes, streaming is never supported again -- even if we return to
ForwardScanDirection
- Must maintain a "fallback" codepath that does not use the streaming read API

Attached is a version of this patch which implements a "reset"
function for the streaming read API which should be cheaper than the
full pg_streaming_read_free() on rescan. This can easily be ported to
work on any of my proposed implementations (A/B/C). I implemented it
on A as an example.

Attached is the latest version of this patchset -- rebased in light of
Thomas' updates to the streaming read API [1]. I have chosen the
approach I think we should go with. It is a hybrid of my previously
proposed approaches.

While investigating some performance concerns, Andres pointed out that
the members I add to HeapScanDescData in this patch push rs_cindex and
rs_ntuples to another cacheline and introduce a 4-byte hole. Attached
v4's HeapScanDescData is as well-packed as on master and its members
are reordered so that rs_cindex and rs_ntuples are back on the second
cacheline of the struct's data.
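
To make the layout point concrete, here is a schematic (not the actual HeapScanDescData definition) of why the ordering matters: two adjacent 4-byte members fill one 8-byte slot, whereas a lone 4-byte member in front of an 8-byte-aligned pointer would force a 4-byte padding hole.

typedef struct PackingSketch
{
    ReadStream *rs_read_stream;     /* 8 bytes, 8-byte aligned */
    BlockNumber rs_prefetch_block;  /* 4 bytes */
    ScanDirection rs_dir;           /* 4 bytes: pairs with the member above */
    /* ... rs_cindex and rs_ntuples can then stay on the earlier cacheline ... */
} PackingSketch;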

I did some additional profiling and realized that dropping the
unlikely() from the places we check rs_inited frequently was negatively
impacting performance. v5 adds those back and also makes a few other
very minor cleanups.
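
For reference, the hinted branch looks like this in 0002 (condensed from the attached patch). rs_inited is false only on the first call per scan, so the initialization path is marked unlikely to keep the common path cheap:

    if (unlikely(!scan->rs_inited))
    {
        scan->rs_cblock = heapgettup_initial_block(scan, dir);
        scan->rs_inited = true;
    }
    else
        scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock, dir);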

Note that this patch set has a not-yet-released version of Thomas
Munro's Streaming Read API with new ramp-up logic that seems to fix a
performance issue I saw with my test case when all of the sequential
scan's blocks are in shared buffers. Once he sends the official new
version, I will rebase this and point to his explanation in that thread.

Attached v6 has the version of the streaming read API mentioned here
[1]. This resolved the fully-in-shared-buffers regressions
investigated in that thread by Andres, Bilal, and Thomas.

Attached v7 has version 14 of the streaming read API as well as a few
small tweaks to comments and code.

I noticed that 0001 in the set posed a small regression from master for
a sequential scan of a relation already in shared buffers. While
investigating this, I saw that heapfetchbuf() was still not being
inlined (compiled at -O2) and when I promoted heapfetchbuf() from static
inline to static pg_attribute_always_inline, most of the very small
regression I saw went away. I don't know if I squashed the issue
entirely, though.
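
Concretely, the change is only the attribute on the definition; a condensed view with the body elided ("inline" alone is just a suggestion that the compiler ignored here at -O2, while pg_attribute_always_inline forces inlining on compilers that support it):

static pg_attribute_always_inline void
heapfetchbuf(HeapScanDesc scan, ScanDirection dir)
{
    /* release the previous buffer, pick the next block, read it in ... */
}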

- Melanie

Attachments:

v7-0001-Split-heapgetpage-into-two-parts.patch (text/x-diff; charset=us-ascii)
From db6219d9ea689fdd0150aa0fbbeaa2ee11364aa8 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sat, 27 Jan 2024 18:39:37 -0500
Subject: [PATCH v7 1/4] Split heapgetpage into two parts

heapgetpage(), a per-block utility function used in heap scans, read a
passed-in block into a buffer and then, if page-at-a-time processing was
enabled, pruned the page and built an array of its visible tuples. This
was used for sequential and sample scans.

Future commits will add support for streaming reads. The streaming read
API will read in the blocks specified by a callback, but any significant
per-page processing should be done synchronously on the buffer yielded
by the streaming read API. To support this, separate the logic in
heapgetpage() to read a block into a buffer from that which prunes the
page and builds an array of the visible tuples. The former is now
heapfetchbuf() and the latter is now heapbuildvis().

Future commits will push the logic for selecting the next block into
heapfetchbuf() in cases when streaming reads are not supported (such as
backwards sequential scans). Because this logic differs for sample scans
and sequential scans, inline the code to read the block into a buffer
for sample scans.

This has the added benefit of allowing for a bit of refactoring in
heapam_scan_sample_next_block(), including unpinning the previous buffer
before invoking the callback to select the next block.
---
 src/backend/access/heap/heapam.c         | 74 ++++++++++++++----------
 src/backend/access/heap/heapam_handler.c | 40 +++++++++----
 src/include/access/heapam.h              |  2 +-
 3 files changed, 72 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b661d9811eb..22ffa541ef7 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -360,17 +360,18 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
 }
 
 /*
- * heapgetpage - subroutine for heapgettup()
+ * heapbuildvis - Utility function for heap scans.
  *
- * This routine reads and pins the specified page of the relation.
- * In page-at-a-time mode it performs additional work, namely determining
- * which tuples on the page are visible.
+ * Given a page residing in a buffer saved in the scan descriptor, prune the
+ * page and determine which of its tuples are visible, saving their offsets
+ * in an array in the scan descriptor.
  */
 void
-heapgetpage(TableScanDesc sscan, BlockNumber block)
+heapbuildvis(TableScanDesc sscan)
 {
 	HeapScanDesc scan = (HeapScanDesc) sscan;
-	Buffer		buffer;
+	Buffer		buffer = scan->rs_cbuf;
+	BlockNumber block = scan->rs_cblock;
 	Snapshot	snapshot;
 	Page		page;
 	int			lines;
@@ -378,31 +379,8 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	OffsetNumber lineoff;
 	bool		all_visible;
 
-	Assert(block < scan->rs_nblocks);
+	Assert(BufferGetBlockNumber(buffer) == block);
 
-	/* release previous scan buffer, if any */
-	if (BufferIsValid(scan->rs_cbuf))
-	{
-		ReleaseBuffer(scan->rs_cbuf);
-		scan->rs_cbuf = InvalidBuffer;
-	}
-
-	/*
-	 * Be sure to check for interrupts at least once per page.  Checks at
-	 * higher code levels won't be able to stop a seqscan that encounters many
-	 * pages' worth of consecutive dead tuples.
-	 */
-	CHECK_FOR_INTERRUPTS();
-
-	/* read page using selected strategy */
-	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
-									   RBM_NORMAL, scan->rs_strategy);
-	scan->rs_cblock = block;
-
-	if (!(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE))
-		return;
-
-	buffer = scan->rs_cbuf;
 	snapshot = scan->rs_base.rs_snapshot;
 
 	/*
@@ -475,6 +453,37 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	scan->rs_ntuples = ntup;
 }
 
+/*
+ * heapfetchbuf - subroutine for heapgettup()
+ *
+ * This routine reads the specified block of the relation into a buffer and
+ * returns with that pinned buffer saved in the scan descriptor.
+ */
+static pg_attribute_always_inline void
+heapfetchbuf(HeapScanDesc scan, BlockNumber block)
+{
+	Assert(block < scan->rs_nblocks);
+
+	/* release previous scan buffer, if any */
+	if (BufferIsValid(scan->rs_cbuf))
+	{
+		ReleaseBuffer(scan->rs_cbuf);
+		scan->rs_cbuf = InvalidBuffer;
+	}
+
+	/*
+	 * Be sure to check for interrupts at least once per page.  Checks at
+	 * higher code levels won't be able to stop a seqscan that encounters many
+	 * pages' worth of consecutive dead tuples.
+	 */
+	CHECK_FOR_INTERRUPTS();
+
+	/* read page using selected strategy */
+	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
+									   RBM_NORMAL, scan->rs_strategy);
+	scan->rs_cblock = block;
+}
+
 /*
  * heapgettup_initial_block - return the first BlockNumber to scan
  *
@@ -748,7 +757,7 @@ heapgettup(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapfetchbuf(scan, block);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_start_page(scan, dir, &linesleft, &lineoff);
 continue_page:
@@ -869,7 +878,8 @@ heapgettup_pagemode(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapfetchbuf(scan, block);
+		heapbuildvis((TableScanDesc) scan);
 		page = BufferGetPage(scan->rs_cbuf);
 		linesleft = scan->rs_ntuples;
 		lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 41a4bb0981d..20210784c30 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2363,11 +2363,14 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 	if (hscan->rs_nblocks == 0)
 		return false;
 
-	if (tsm->NextSampleBlock)
+	if (BufferIsValid(hscan->rs_cbuf))
 	{
-		blockno = tsm->NextSampleBlock(scanstate, hscan->rs_nblocks);
-		hscan->rs_cblock = blockno;
+		ReleaseBuffer(hscan->rs_cbuf);
+		hscan->rs_cbuf = InvalidBuffer;
 	}
+
+	if (tsm->NextSampleBlock)
+		blockno = tsm->NextSampleBlock(scanstate, hscan->rs_nblocks);
 	else
 	{
 		/* scanning table sequentially */
@@ -2409,20 +2412,35 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 		}
 	}
 
-	if (!BlockNumberIsValid(blockno))
+	hscan->rs_cblock = blockno;
+
+	if (!BlockNumberIsValid(hscan->rs_cblock))
 	{
-		if (BufferIsValid(hscan->rs_cbuf))
-			ReleaseBuffer(hscan->rs_cbuf);
-		hscan->rs_cbuf = InvalidBuffer;
-		hscan->rs_cblock = InvalidBlockNumber;
 		hscan->rs_inited = false;
-
 		return false;
 	}
 
-	heapgetpage(scan, blockno);
-	hscan->rs_inited = true;
+	Assert(hscan->rs_cblock < hscan->rs_nblocks);
+
+	/*
+	 * We may scan multiple pages before finding tuples to yield or finishing
+	 * the scan. Since we want to check for interrupts at least once per page,
+	 * do so here.
+	 */
+	CHECK_FOR_INTERRUPTS();
+
+	/* Read page using selected strategy */
+	hscan->rs_cbuf = ReadBufferExtended(hscan->rs_base.rs_rd, MAIN_FORKNUM,
+										hscan->rs_cblock, RBM_NORMAL, hscan->rs_strategy);
 
+	/*
+	 * If pagemode is allowed, prune the page and build an array of visible
+	 * tuple offsets.
+	 */
+	if (hscan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
+		heapbuildvis(scan);
+
+	hscan->rs_inited = true;
 	return true;
 }
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 32a3fbce961..5a6d22a4732 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -254,7 +254,7 @@ extern TableScanDesc heap_beginscan(Relation relation, Snapshot snapshot,
 									uint32 flags);
 extern void heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk,
 							   BlockNumber numBlks);
-extern void heapgetpage(TableScanDesc sscan, BlockNumber block);
+extern void heapbuildvis(TableScanDesc sscan);
 extern void heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 						bool allow_strat, bool allow_sync, bool allow_pagemode);
 extern void heap_endscan(TableScanDesc sscan);
-- 
2.40.1
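
In short, the division 0001 introduces looks like this at the call sites (condensed from the diff above; error handling and locking elided):

    /* Read the chosen block into scan->rs_cbuf and remember its number. */
    heapfetchbuf(scan, block);

    /* In page-at-a-time mode, prune and collect visible tuple offsets. */
    if (scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
        heapbuildvis((TableScanDesc) scan);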

v7-0002-Replace-blocks-with-buffers-in-heapgettup-control.patch (text/x-diff; charset=us-ascii)
From ca7f60419aecdc07c7145d62e9d5033087598fbb Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 29 Jan 2024 11:50:01 -0500
Subject: [PATCH v7 2/4] Replace blocks with buffers in heapgettup control flow

Future commits will introduce the streaming read API and the sequential
scan streaming read API user. Streaming read API users implement a
callback which returns the next block to read. Sequential scans
previously looped through the blocks in the relation, synchronously
reading in a block and then processing it. An InvalidBlockNumber
returned by heapgettup_advance_block() meant that the relation was
exhausted and all blocks had been processed.

The streaming read API may exhaust the blocks in a relation (having read
all of them into buffers) before they have all been processed by the
sequential scan. As such, the sequential scan should continue processing
blocks until heapfetchbuf() returns InvalidBuffer.

Note that this commit does not implement the streaming read API user. It
simply restructures heapgettup() and heapgettup_pagemode() to use
buffers instead of blocks for control flow.

Not all sequential scans will support streaming reads. As such, this
code will remain for compatibility even after sequential scans support
streaming reads.
---
 src/backend/access/heap/heapam.c | 79 ++++++++++++++------------------
 1 file changed, 35 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 22ffa541ef7..46645dc971f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -83,6 +83,9 @@ static Bitmapset *HeapDetermineColumnsInfo(Relation relation,
 static bool heap_acquire_tuplock(Relation relation, ItemPointer tid,
 								 LockTupleMode mode, LockWaitPolicy wait_policy,
 								 bool *have_tuple_lock);
+static inline BlockNumber heapgettup_advance_block(HeapScanDesc scan,
+												   BlockNumber block, ScanDirection dir);
+static inline BlockNumber heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir);
 static void compute_new_xmax_infomask(TransactionId xmax, uint16 old_infomask,
 									  uint16 old_infomask2, TransactionId add_to_xmax,
 									  LockTupleMode mode, bool is_update,
@@ -456,14 +459,12 @@ heapbuildvis(TableScanDesc sscan)
 /*
  * heapfetchbuf - subroutine for heapgettup()
  *
- * This routine reads the specified block of the relation into a buffer and
- * returns with that pinned buffer saved in the scan descriptor.
+ * This routine reads the next block of the relation into a buffer and returns
+ * with that pinned buffer saved in the scan descriptor.
  */
 static pg_attribute_always_inline void
-heapfetchbuf(HeapScanDesc scan, BlockNumber block)
+heapfetchbuf(HeapScanDesc scan, ScanDirection dir)
 {
-	Assert(block < scan->rs_nblocks);
-
 	/* release previous scan buffer, if any */
 	if (BufferIsValid(scan->rs_cbuf))
 	{
@@ -478,10 +479,19 @@ heapfetchbuf(HeapScanDesc scan, BlockNumber block)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	/* read page using selected strategy */
-	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
-									   RBM_NORMAL, scan->rs_strategy);
-	scan->rs_cblock = block;
+	if (unlikely(!scan->rs_inited))
+	{
+		scan->rs_cblock = heapgettup_initial_block(scan, dir);
+		Assert(scan->rs_cblock != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock, dir);
+
+	/* read block if valid */
+	if (BlockNumberIsValid(scan->rs_cblock))
+		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
+										   scan->rs_cblock, RBM_NORMAL, scan->rs_strategy);
 }
 
 /*
@@ -491,7 +501,7 @@ heapfetchbuf(HeapScanDesc scan, BlockNumber block)
  * occur with empty tables and in parallel scans when parallel workers get all
  * of the pages before we can get a chance to get our first page.
  */
-static BlockNumber
+BlockNumber
 heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
 {
 	Assert(!scan->rs_inited);
@@ -631,7 +641,7 @@ heapgettup_continue_page(HeapScanDesc scan, ScanDirection dir, int *linesleft,
  * This also adjusts rs_numblocks when a limit has been imposed by
  * heap_setscanlimits().
  */
-static inline BlockNumber
+BlockNumber
 heapgettup_advance_block(HeapScanDesc scan, BlockNumber block, ScanDirection dir)
 {
 	if (ScanDirectionIsForward(dir))
@@ -729,23 +739,13 @@ heapgettup(HeapScanDesc scan,
 		   ScanKey key)
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
-	BlockNumber block;
 	Page		page;
 	OffsetNumber lineoff;
 	int			linesleft;
 
-	if (unlikely(!scan->rs_inited))
-	{
-		block = heapgettup_initial_block(scan, dir);
-		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
-		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
-		scan->rs_inited = true;
-	}
-	else
+	if (likely(scan->rs_inited))
 	{
 		/* continue from previously returned page/tuple */
-		block = scan->rs_cblock;
-
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_continue_page(scan, dir, &linesleft, &lineoff);
 		goto continue_page;
@@ -755,9 +755,12 @@ heapgettup(HeapScanDesc scan,
 	 * advance the scan until we find a qualifying tuple or run out of stuff
 	 * to scan
 	 */
-	while (block != InvalidBlockNumber)
+	while (true)
 	{
-		heapfetchbuf(scan, block);
+		heapfetchbuf(scan, dir);
+		if (!BufferIsValid(scan->rs_cbuf))
+			break;
+		Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_start_page(scan, dir, &linesleft, &lineoff);
 continue_page:
@@ -779,7 +782,7 @@ continue_page:
 
 			tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
 			tuple->t_len = ItemIdGetLength(lpp);
-			ItemPointerSet(&(tuple->t_self), block, lineoff);
+			ItemPointerSet(&(tuple->t_self), scan->rs_cblock, lineoff);
 
 			visible = HeapTupleSatisfiesVisibility(tuple,
 												   scan->rs_base.rs_snapshot,
@@ -809,9 +812,6 @@ continue_page:
 		 * it's time to move to the next.
 		 */
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
 	}
 
 	/* end of scan */
@@ -844,22 +844,13 @@ heapgettup_pagemode(HeapScanDesc scan,
 					ScanKey key)
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
-	BlockNumber block;
 	Page		page;
 	int			lineindex;
 	int			linesleft;
 
-	if (unlikely(!scan->rs_inited))
-	{
-		block = heapgettup_initial_block(scan, dir);
-		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
-		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
-		scan->rs_inited = true;
-	}
-	else
+	if (likely(scan->rs_inited))
 	{
 		/* continue from previously returned page/tuple */
-		block = scan->rs_cblock;	/* current page */
 		page = BufferGetPage(scan->rs_cbuf);
 
 		lineindex = scan->rs_cindex + dir;
@@ -876,9 +867,12 @@ heapgettup_pagemode(HeapScanDesc scan,
 	 * advance the scan until we find a qualifying tuple or run out of stuff
 	 * to scan
 	 */
-	while (block != InvalidBlockNumber)
+	while (true)
 	{
-		heapfetchbuf(scan, block);
+		heapfetchbuf(scan, dir);
+		if (!BufferIsValid(scan->rs_cbuf))
+			break;
+		Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
 		heapbuildvis((TableScanDesc) scan);
 		page = BufferGetPage(scan->rs_cbuf);
 		linesleft = scan->rs_ntuples;
@@ -898,7 +892,7 @@ continue_page:
 
 			tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
 			tuple->t_len = ItemIdGetLength(lpp);
-			ItemPointerSet(&(tuple->t_self), block, lineoff);
+			ItemPointerSet(&(tuple->t_self), scan->rs_cblock, lineoff);
 
 			/* skip any tuples that don't match the scan key */
 			if (key != NULL &&
@@ -909,9 +903,6 @@ continue_page:
 			scan->rs_cindex = lineindex;
 			return;
 		}
-
-		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
 	}
 
 	/* end of scan */
-- 
2.40.1
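
The control-flow change in 0002 boils down to the following shape (condensed from the diff above): the loop no longer advances a block number itself, it asks heapfetchbuf() for the next pinned buffer and stops when none is returned:

    while (true)
    {
        heapfetchbuf(scan, dir);
        if (!BufferIsValid(scan->rs_cbuf))
            break;              /* no more buffers: the scan is done */

        /* ... lock the page, return qualifying tuples, unlock ... */
    }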

v7-0003-v14-Streaming-Read-API.patch (text/x-diff; charset=us-ascii)
From 4ad1c9212b00d7fcff1f2e73e27538e32e61440a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v7 3/4] v14 Streaming Read API

Part 1:

Provide vectored variant of ReadBuffer().

Break ReadBuffer() up into two steps: StartReadBuffers() and
WaitReadBuffers().  This has two advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the kernel.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.  For now we still only read a
block at a time so there is no change to generated system calls yet, but
later commits will provide infrastructure to help build up larger calls.

Callers should respect the new GUC io_combine_limit, and the limit on
per-backend pins which is now exposed as a public interface.

With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of advice.

Part 2:

Provide API for streaming relation data.

Introduce an abstraction where relation data can be accessed as a
stream of buffers, with an implementation that is more efficient than
the equivalent sequence of ReadBuffer() calls.

Client code supplies a callback that can say which block number is
wanted next, and then consumes individual buffers one at a time from the
stream.  This division allows read_stream.c to build up large calls to
StartReadBuffers() up to io_combine_limit, and issue posix_fadvise()
advice ahead of time in a systematic way when random access is detected.

This API is based on an idea from Andres Freund to pave the way for
asynchronous I/O in future work as required to support direct I/O.  The
goal is to have an abstraction that insulates client code from future
changes to the I/O subsystem that would benefit from information about
future needs.

An extended API may be necessary in future for more complicated cases
(for example recovery, whose LsnReadQueue device in xlogprefetcher.c is
a distant cousin of this code and should eventually be replaced by
this), but this basic API is sufficient for many common usage patterns
involving predictable access to a single relation fork.
---
 doc/src/sgml/config.sgml                      |  14 +
 src/backend/storage/Makefile                  |   2 +-
 src/backend/storage/aio/Makefile              |  14 +
 src/backend/storage/aio/meson.build           |   5 +
 src/backend/storage/aio/read_stream.c         | 819 ++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c           | 726 +++++++++++-----
 src/backend/storage/buffer/localbuf.c         |  14 +-
 src/backend/storage/meson.build               |   1 +
 src/backend/utils/misc/guc_tables.c           |  14 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/storage/bufmgr.h                  |  41 +-
 src/include/storage/read_stream.h             |  63 ++
 src/tools/pgindent/typedefs.list              |   3 +
 13 files changed, 1484 insertions(+), 233 deletions(-)
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/read_stream.c
 create mode 100644 src/include/storage/read_stream.h

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f65c17e5ae4..241a6079688 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2720,6 +2720,20 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
+      <varlistentry id="guc-io-combine-limit" xreflabel="io_combine_limit">
+       <term><varname>io_combine_limit</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_combine_limit</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Controls the largest I/O size in operations that combine I/O.
+         The default is 128kB.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-max-worker-processes" xreflabel="max_worker_processes">
        <term><varname>max_worker_processes</varname> (<type>integer</type>)
        <indexterm>
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 00000000000..2f29a9ec4d1
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	read_stream.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 00000000000..10e1aa3b20b
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'read_stream.c',
+)
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
new file mode 100644
index 00000000000..f7e9dc1138b
--- /dev/null
+++ b/src/backend/storage/aio/read_stream.c
@@ -0,0 +1,819 @@
+/*-------------------------------------------------------------------------
+ *
+ * read_stream.c
+ *	  Mechanism for accessing buffered relation data with look-ahead
+ *
+ * Code that needs to access relation data typically pins blocks one at a
+ * time, often in a predictable order that might be sequential or data-driven.
+ * Calling the simple ReadBuffer() function for each block is inefficient,
+ * because blocks that are not yet in the buffer pool require I/O operations
+ * that are small and might stall waiting for storage.  This mechanism looks
+ * into the future and calls StartReadBuffers() and WaitReadBuffers() to read
+ * neighboring blocks together and ahead of time, with an adaptive look-ahead
+ * distance.
+ *
+ * A user-provided callback generates a stream of block numbers that is used
+ * to form reads of up to io_combine_limit, by attempting to merge them with a
+ * pending read.  When that isn't possible, the existing pending read is sent
+ * to StartReadBuffers() so that a new one can begin to form.
+ *
+ * The algorithm for controlling the look-ahead distance tries to classify the
+ * stream into three ideal behaviors:
+ *
+ * A) No I/O is necessary, because the requested blocks are fully cached
+ * already.  There is no benefit to looking ahead more than one block, so
+ * distance is 1.  This is the default initial assumption.
+ *
+ * B) I/O is necessary, but fadvise is undesirable because the access is
+ * sequential, or impossible because direct I/O is enabled or the system
+ * doesn't support advice.  There is no benefit in looking ahead more than
+ * io_combine_limit, because in this case the only goal is larger read system
+ * calls.  Looking further ahead would pin many buffers and perform
+ * speculative work looking ahead for no benefit.
+ *
+ * C) I/O is necessary, it appears random, and this system supports fadvise.
+ * We'll look further ahead in order to reach the configured level of I/O
+ * concurrency.
+ *
+ * The distance increases rapidly and decays slowly, so that it moves towards
+ * those levels as different I/O patterns are discovered.  For example, a
+ * sequential scan of fully cached data doesn't bother looking ahead, but a
+ * sequential scan that hits a region of uncached blocks will start issuing
+ * increasingly wide read calls until it plateaus at io_combine_limit.
+ *
+ * The main data structure is a circular queue of buffers of size
+ * max_pinned_buffers plus some extra space for technical reasons, ready to be
+ * returned by read_stream_next_buffer().  Each buffer also has an optional
+ * variable sized object that is passed from the callback to the consumer of
+ * buffers.
+ *
+ * Parallel to the queue of buffers, there is a circular queue of in-progress
+ * I/Os that have been started with StartReadBuffers(), and for which
+ * WaitReadBuffers() must be called before returning the buffer.
+ *
+ * For example, if the callback returns block numbers 10, 42, 43, 44, 60 in
+ * successive calls, then these data structures might appear as follows:
+ *
+ *                          buffers buf/data       ios
+ *
+ *                          +----+  +-----+       +--------+
+ *                          |    |  |     |  +----+ 42..44 | <- oldest_io_index
+ *                          +----+  +-----+  |    +--------+
+ *   oldest_buffer_index -> | 10 |  |  ?  |  | +--+ 60..60 |
+ *                          +----+  +-----+  | |  +--------+
+ *                          | 42 |  |  ?  |<-+ |  |        | <- next_io_index
+ *                          +----+  +-----+    |  +--------+
+ *                          | 43 |  |  ?  |    |  |        |
+ *                          +----+  +-----+    |  +--------+
+ *                          | 44 |  |  ?  |    |  |        |
+ *                          +----+  +-----+    |  +--------+
+ *                          | 60 |  |  ?  |<---+
+ *                          +----+  +-----+
+ *     next_buffer_index -> |    |  |     |
+ *                          +----+  +-----+
+ *
+ * In the example, 5 buffers are pinned, and the next buffer to be streamed to
+ * the client is block 10.  Block 10 was a hit and has no associated I/O, but
+ * the range 42..44 requires an I/O wait before its buffers are returned, as
+ * does block 60.
+ *
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/aio/read_stream.c
+ *
+ *-------------------------------------------------------------------------
+ */
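To put illustrative numbers on the above (assuming effective_io_concurrency =
10, the default io_combine_limit of 16 blocks, and that LimitAdditionalPins()
doesn't clamp the result): max_pinned_buffers = Max(10 * 4, 16) = 40.  A
stream that keeps finding uncached blocks with advice enabled (behavior C)
ramps its distance 1 -> 2 -> 4 -> 8 -> 16 -> 32 -> 40, while one that cannot
use advice (behavior B) plateaus at 1 -> 2 -> 4 -> 8 -> 16.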
+#include "postgres.h"
+
+#include "catalog/pg_tablespace.h"
+#include "miscadmin.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+#include "storage/read_stream.h"
+#include "utils/memdebug.h"
+#include "utils/rel.h"
+#include "utils/spccache.h"
+
+typedef struct InProgressIO
+{
+	int16		buffer_index;
+	ReadBuffersOperation op;
+} InProgressIO;
+
+/*
+ * State for managing a stream of reads.
+ */
+struct ReadStream
+{
+	int16		max_ios;
+	int16		ios_in_progress;
+	int16		queue_size;
+	int16		max_pinned_buffers;
+	int16		pinned_buffers;
+	int16		distance;
+	bool		advice_enabled;
+
+	/*
+	 * Small buffer of block numbers, useful for 'ungetting' to resolve flow
+	 * control problems when I/Os are split.  Also useful for batch-loading
+	 * block numbers in the fast path.
+	 */
+	BlockNumber blocknums[16];
+	int16		blocknums_count;
+	int16		blocknums_next;
+
+	/*
+	 * The callback that will tell us which block numbers to read, and an
+	 * opaque pointer that will be passed to it for its own purposes.
+	 */
+	ReadStreamBlockNumberCB callback;
+	void	   *callback_private_data;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* The read operation we are currently preparing. */
+	BlockNumber pending_read_blocknum;
+	int16		pending_read_nblocks;
+
+	/* Space for buffers and optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+
+	/* Read operations that have been started but not waited for yet. */
+	InProgressIO *ios;
+	int16		oldest_io_index;
+	int16		next_io_index;
+
+	bool		fast_path;
+
+	/* Circular queue of buffers. */
+	int16		oldest_buffer_index;	/* Next pinned buffer to return */
+	int16		next_buffer_index;	/* Index of next buffer to pin */
+	Buffer		buffers[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/*
+ * Return a pointer to the per-buffer data by index.
+ */
+static inline void *
+get_per_buffer_data(ReadStream *stream, int16 buffer_index)
+{
+	return (char *) stream->per_buffer_data +
+		stream->per_buffer_data_size * buffer_index;
+}
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow read_stream_unget_block() to work and to allow the
+ * fast path to work in batches.
+ */
+static inline BlockNumber
+read_stream_get_block(ReadStream *stream, void *per_buffer_data)
+{
+	if (stream->blocknums_next < stream->blocknums_count)
+		return stream->blocknums[stream->blocknums_next++];
+
+	/*
+	 * We only bother to fetch one at a time here (but see the fast path which
+	 * uses more).
+	 */
+	return stream->callback(stream,
+							stream->callback_private_data,
+							per_buffer_data);
+}
+
+/*
+ * In order to deal with short reads in StartReadBuffers(), we sometimes need
+ * to defer handling of a block until later.
+ */
+static inline void
+read_stream_unget_block(ReadStream *stream, BlockNumber blocknum)
+{
+	if (stream->blocknums_next == stream->blocknums_count)
+	{
+		/* Never initialized or entirely consumed.  Re-initialize. */
+		stream->blocknums[0] = blocknum;
+		stream->blocknums_count = 1;
+		stream->blocknums_next = 0;
+	}
+	else
+	{
+		/* Must be the last value returned from the blocknums array. */
+		Assert(stream->blocknums_next > 0);
+		stream->blocknums_next--;
+		Assert(stream->blocknums[stream->blocknums_next] == blocknum);
+	}
+}
+
+#ifndef READ_STREAM_DISABLE_FAST_PATH
+static void
+read_stream_fill_blocknums(ReadStream *stream)
+{
+	BlockNumber blocknum;
+	int			i = 0;
+
+	do
+	{
+		blocknum = stream->callback(stream,
+									stream->callback_private_data,
+									NULL);
+		stream->blocknums[i++] = blocknum;
+	} while (i < lengthof(stream->blocknums) &&
+			 blocknum != InvalidBlockNumber);
+	stream->blocknums_count = i;
+	stream->blocknums_next = 0;
+}
+#endif
+
+static void
+read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
+{
+	bool		need_wait;
+	int			nblocks;
+	int			flags;
+	int16		io_index;
+	int16		overflow;
+	int16		buffer_index;
+
+	/* This should only be called with a pending read. */
+	Assert(stream->pending_read_nblocks > 0);
+	Assert(stream->pending_read_nblocks <= io_combine_limit);
+
+	/* We had better not exceed the pin limit by starting this read. */
+	Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
+		   stream->max_pinned_buffers);
+
+	/* We had better not be overwriting an existing pinned buffer. */
+	if (stream->pinned_buffers > 0)
+		Assert(stream->next_buffer_index != stream->oldest_buffer_index);
+	else
+		Assert(stream->next_buffer_index == stream->oldest_buffer_index);
+
+	/*
+	 * If advice hasn't been suppressed, this system supports it, and this
+	 * isn't a strictly sequential pattern, then we'll issue advice.
+	 */
+	if (!suppress_advice &&
+		stream->advice_enabled &&
+		stream->pending_read_blocknum != stream->seq_blocknum)
+		flags = READ_BUFFERS_ISSUE_ADVICE;
+	else
+		flags = 0;
+
+	/* We say how many blocks we want to read, but it may be smaller on return. */
+	buffer_index = stream->next_buffer_index;
+	io_index = stream->next_io_index;
+	nblocks = stream->pending_read_nblocks;
+	need_wait = StartReadBuffers(&stream->ios[io_index].op,
+								 &stream->buffers[buffer_index],
+								 stream->pending_read_blocknum,
+								 &nblocks,
+								 flags);
+	stream->pinned_buffers += nblocks;
+
+	/* Remember whether we need to wait before returning this buffer. */
+	if (!need_wait)
+	{
+		/* Look-ahead distance decays, no I/O necessary (behavior A). */
+		if (stream->distance > 1)
+			stream->distance--;
+	}
+	else
+	{
+		/*
+		 * Remember to call WaitReadBuffers() before returning head buffer.
+		 * Look-ahead distance will be adjusted after waiting.
+		 */
+		stream->ios[io_index].buffer_index = buffer_index;
+		if (++stream->next_io_index == stream->max_ios)
+			stream->next_io_index = 0;
+		Assert(stream->ios_in_progress < stream->max_ios);
+		stream->ios_in_progress++;
+		stream->seq_blocknum = stream->pending_read_blocknum + nblocks;
+	}
+
+	/*
+	 * We gave a contiguous range of buffer space to StartReadBuffers(), but
+	 * we want it to wrap around at queue_size.  Slide overflowing buffers to
+	 * the front of the array.
+	 */
+	overflow = (buffer_index + nblocks) - stream->queue_size;
+	if (overflow > 0)
+		memmove(&stream->buffers[0],
+				&stream->buffers[stream->queue_size],
+				sizeof(stream->buffers[0]) * overflow);
+
+	/* Compute location of start of next read, without using % operator. */
+	buffer_index += nblocks;
+	if (buffer_index >= stream->queue_size)
+		buffer_index -= stream->queue_size;
+	Assert(buffer_index >= 0 && buffer_index < stream->queue_size);
+	stream->next_buffer_index = buffer_index;
+
+	/* Adjust the pending read to cover the remaining portion, if any. */
+	stream->pending_read_blocknum += nblocks;
+	stream->pending_read_nblocks -= nblocks;
+}
+
+static void
+read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
+{
+	while (stream->ios_in_progress < stream->max_ios &&
+		   stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
+	{
+		BlockNumber blocknum;
+		int16		buffer_index;
+		void	   *per_buffer_data;
+
+		if (stream->pending_read_nblocks == io_combine_limit)
+		{
+			read_stream_start_pending_read(stream, suppress_advice);
+			suppress_advice = false;
+			continue;
+		}
+
+		/*
+		 * See which block the callback wants next in the stream.  We need to
+		 * compute the index of the Nth block of the pending read including
+		 * wrap-around, but we don't want to use the expensive % operator.
+		 */
+		buffer_index = stream->next_buffer_index + stream->pending_read_nblocks;
+		if (buffer_index >= stream->queue_size)
+			buffer_index -= stream->queue_size;
+		Assert(buffer_index >= 0 && buffer_index < stream->queue_size);
+		per_buffer_data = get_per_buffer_data(stream, buffer_index);
+		blocknum = read_stream_get_block(stream, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			/* End of stream. */
+			stream->distance = 0;
+			break;
+		}
+
+		/* Can we merge it with the pending read? */
+		if (stream->pending_read_nblocks > 0 &&
+			stream->pending_read_blocknum + stream->pending_read_nblocks == blocknum)
+		{
+			stream->pending_read_nblocks++;
+			continue;
+		}
+
+		/* We have to start the pending read before we can build another. */
+		if (stream->pending_read_nblocks > 0)
+		{
+			read_stream_start_pending_read(stream, suppress_advice);
+			suppress_advice = false;
+			if (stream->ios_in_progress == stream->max_ios)
+			{
+				/* And we've hit the limit.  Rewind, and stop here. */
+				read_stream_unget_block(stream, blocknum);
+				return;
+			}
+		}
+
+		/* This is the start of a new pending read. */
+		stream->pending_read_blocknum = blocknum;
+		stream->pending_read_nblocks = 1;
+	}
+
+	/*
+	 * We don't start the pending read just because we've hit the distance
+	 * limit, preferring to give it another chance to grow to full
+	 * io_combine_limit size once more buffers have been consumed.  However,
+	 * if we've already reached io_combine_limit, or we've reached the
+	 * distance limit and there isn't anything pinned yet, or the callback has
+	 * signaled end-of-stream, we start the read immediately.
+	 */
+	if (stream->pending_read_nblocks > 0 &&
+		(stream->pending_read_nblocks == io_combine_limit ||
+		 (stream->pending_read_nblocks == stream->distance &&
+		  stream->pinned_buffers == 0) ||
+		 stream->distance == 0) &&
+		stream->ios_in_progress < stream->max_ios)
+		read_stream_start_pending_read(stream, suppress_advice);
+}
+
+/*
+ * Create a new read stream object that can be used to perform the equivalent
+ * of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.  The callback should return block numbers or InvalidBlockNumber to
+ * signal end-of-stream, and if per_buffer_data_size is non-zero, it may also
+ * write extra data for each block into the space provided to it.  It will
+ * also receive callback_private_data for its own purposes.
+ */
+ReadStream *
+read_stream_begin_relation(int flags,
+						   BufferAccessStrategy strategy,
+						   BufferManagerRelation bmr,
+						   ForkNumber forknum,
+						   ReadStreamBlockNumberCB callback,
+						   void *callback_private_data,
+						   size_t per_buffer_data_size)
+{
+	ReadStream *stream;
+	size_t		size;
+	int16		queue_size;
+	int16		max_ios;
+	uint32		max_pinned_buffers;
+	Oid			tablespace_id;
+
+	/* Make sure our bmr's smgr and persistent are populated. */
+	if (bmr.smgr == NULL)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	/*
+	 * Decide how many I/Os we will allow to run at the same time.  That
+	 * currently means advice to the kernel to tell it that we will soon read.
+	 * This number also affects how far we look ahead for opportunities to
+	 * start more I/Os.
+	 */
+	tablespace_id = bmr.smgr->smgr_rlocator.locator.spcOid;
+	if (!OidIsValid(MyDatabaseId) ||
+		(bmr.rel && IsCatalogRelation(bmr.rel)) ||
+		IsCatalogRelationOid(bmr.smgr->smgr_rlocator.locator.relNumber))
+	{
+		/*
+		 * Avoid circularity while trying to look up tablespace settings or
+		 * before spccache.c is ready.
+		 */
+		max_ios = effective_io_concurrency;
+	}
+	else if (flags & READ_STREAM_MAINTENANCE)
+		max_ios = get_tablespace_maintenance_io_concurrency(tablespace_id);
+	else
+		max_ios = get_tablespace_io_concurrency(tablespace_id);
+	max_ios = Min(max_ios, PG_INT16_MAX);
+
+	/*
+	 * Choose the maximum number of buffers we're prepared to pin.  We try to
+	 * pin fewer if we can, though.  We clamp it to at least io_combine_limit
+	 * so that we can have a chance to build up a full io_combine_limit sized
+	 * read, even when max_ios is zero.  Be careful not to allow int16 to
+	 * overflow (even though that's not possible with the current GUC range
+	 * limits), allowing also for the spare entry and the overflow space.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, io_combine_limit);
+	max_pinned_buffers = Min(max_pinned_buffers,
+							 PG_INT16_MAX - io_combine_limit - 1);
+
+	/* Don't allow this backend to pin more than its share of buffers. */
+	if (SmgrIsTemp(bmr.smgr))
+		LimitAdditionalLocalPins(&max_pinned_buffers);
+	else
+		LimitAdditionalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * We need one extra entry for buffers and per-buffer data, because users
+	 * of per-buffer data have access to the object until the next call to
+	 * read_stream_next_buffer(), so we need a gap between the head and tail
+	 * of the queue so that we don't clobber it.
+	 */
+	queue_size = max_pinned_buffers + 1;
+
+	/*
+	 * Allocate the object, the buffers, the ios and per_buffer_data space in
+	 * one big chunk.  Though we have queue_size buffers, we want to be able
+	 * to assume that all the buffers for a single read are contiguous (i.e.
+	 * don't wrap around halfway through), so we allow temporary overflows of
+	 * up to the maximum possible read size by allocating an extra
+	 * io_combine_limit - 1 elements.
+	 */
+	size = offsetof(ReadStream, buffers);
+	size += sizeof(Buffer) * (queue_size + io_combine_limit - 1);
+	size += sizeof(InProgressIO) * Max(1, max_ios);
+	size += per_buffer_data_size * queue_size;
+	size += MAXIMUM_ALIGNOF * 2;
+	stream = (ReadStream *) palloc(size);
+	memset(stream, 0, offsetof(ReadStream, buffers));
+	stream->ios = (InProgressIO *)
+		MAXALIGN(&stream->buffers[queue_size + io_combine_limit - 1]);
+	if (per_buffer_data_size > 0)
+		stream->per_buffer_data = (void *)
+			MAXALIGN(&stream->ios[Max(1, max_ios)]);
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  We can use it as long as
+	 * direct I/O isn't enabled, the caller hasn't promised sequential access
+	 * (overriding our detection heuristics), and max_ios hasn't been set to
+	 * zero.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & READ_STREAM_SEQUENTIAL) == 0 &&
+		max_ios > 0)
+		stream->advice_enabled = true;
+#endif
+
+	/*
+	 * For now, max_ios = 0 is interpreted as max_ios = 1 with advice disabled
+	 * above.  If we had real asynchronous I/O we might need a slightly
+	 * different definition.
+	 */
+	if (max_ios == 0)
+		max_ios = 1;
+
+	stream->max_ios = max_ios;
+	stream->per_buffer_data_size = per_buffer_data_size;
+	stream->max_pinned_buffers = max_pinned_buffers;
+	stream->queue_size = queue_size;
+	stream->callback = callback;
+	stream->callback_private_data = callback_private_data;
+
+	/*
+	 * Skip the initial ramp-up phase if the caller says we're going to be
+	 * reading the whole relation.  This way we start out assuming we'll be
+	 * doing full io_combine_limit sized reads (behavior B).
+	 */
+	if (flags & READ_STREAM_FULL)
+		stream->distance = Min(max_pinned_buffers, io_combine_limit);
+	else
+		stream->distance = 1;
+
+	/*
+	 * Since we currently always access the same relation, we can
+	 * initialize parts of the ReadBuffersOperation objects and leave them
+	 * that way, to avoid wasting CPU cycles writing to them for each read.
+	 */
+	if (!bmr.smgr)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+	for (int i = 0; i < max_ios; ++i)
+	{
+		stream->ios[i].op.bmr = bmr;
+		stream->ios[i].op.forknum = forknum;
+		stream->ios[i].op.strategy = strategy;
+	}
+
+	return stream;
+}
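
To illustrate the per-buffer data facility described above (reusing the
hypothetical MyScanState from the sketch in the commit message, and with
MyBlockInfo likewise made up for this example): when per_buffer_data_size is
non-zero, the callback can record a small payload per block that the consumer
later receives together with the buffer.

    typedef struct MyBlockInfo
    {
        bool    all_visible;    /* example payload chosen by the callback */
    } MyBlockInfo;

    static BlockNumber
    my_block_cb(ReadStream *stream, void *callback_private_data,
                void *per_buffer_data)
    {
        MyScanState *scan = (MyScanState *) callback_private_data;
        MyBlockInfo *info = (MyBlockInfo *) per_buffer_data;

        if (scan->next_block >= scan->nblocks)
            return InvalidBlockNumber;
        info->all_visible = false;  /* stand-in for something computed here */
        return scan->next_block++;
    }

    stream = read_stream_begin_relation(0, NULL, BMR_REL(rel), MAIN_FORKNUM,
                                        my_block_cb, &scan_state,
                                        sizeof(MyBlockInfo));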
+
+/*
+ * Pull one pinned buffer out of a stream.  Each call returns successive
+ * blocks in the order specified by the callback.  If per_buffer_data_size was
+ * set to a non-zero size, *per_buffer_data receives a pointer to the extra
+ * per-buffer data that the callback had a chance to populate, which remains
+ * valid until the next call to read_stream_next_buffer().  When the stream
+ * runs out of data, InvalidBuffer is returned.  The caller may decide to end
+ * the stream early at any time by calling read_stream_end().
+ */
+Buffer
+read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
+{
+	Buffer		buffer;
+	int16		oldest_buffer_index;
+
+#ifndef READ_STREAM_DISABLE_FAST_PATH
+
+	/*
+	 * A fast path for all-cached scans (behavior A).  This is the same as the
+	 * usual algorithm, but it is specialized for no I/O and no per-buffer
+	 * data, so we can skip the queue management code, stay in the same buffer
+	 * slot and use singular StartReadBuffer().
+	 */
+	if (likely(stream->fast_path))
+	{
+		BlockNumber next_blocknum;
+		bool		need_wait;
+
+		/* Fast path assumptions. */
+		Assert(stream->ios_in_progress == 0);
+		Assert(stream->pinned_buffers == 1);
+		Assert(stream->distance == 1);
+		Assert(stream->pending_read_nblocks == 1);
+		Assert(stream->per_buffer_data_size == 0);
+
+		/* We're going to return the buffer we pinned last time. */
+		oldest_buffer_index = stream->oldest_buffer_index;
+		Assert((oldest_buffer_index + 1) % stream->queue_size ==
+			   stream->next_buffer_index);
+		buffer = stream->buffers[oldest_buffer_index];
+		Assert(buffer != InvalidBuffer);
+
+		/*
+		 * Pin a buffer for the next call.  Same buffer entry, and arbitrary
+		 * I/O entry (they're all free).
+		 */
+		need_wait = StartReadBuffer(&stream->ios[0].op,
+									&stream->buffers[oldest_buffer_index],
+									stream->pending_read_blocknum,
+									stream->advice_enabled ?
+									READ_BUFFERS_ISSUE_ADVICE : 0);
+
+		/* Choose the block the next call will pin. */
+		if (unlikely(stream->blocknums_next == stream->blocknums_count))
+			read_stream_fill_blocknums(stream);
+		next_blocknum = stream->blocknums[stream->blocknums_next++];
+
+		/*
+		 * Fast return if the next call doesn't require I/O for the buffer we
+		 * just pinned, and we have a block number to give it as a pending
+		 * read.
+		 */
+		if (likely(!need_wait && next_blocknum != InvalidBlockNumber))
+		{
+			stream->pending_read_blocknum = next_blocknum;
+			return buffer;
+		}
+
+		/*
+		 * For anything more complex, set up some more state and take the slow
+		 * path next time.
+		 */
+		stream->fast_path = false;
+
+		if (need_wait)
+		{
+			/* Next call must wait for I/O for the newly pinned buffer. */
+			stream->oldest_io_index = 0;
+			stream->next_io_index = stream->max_ios > 1 ? 1 : 0;
+			stream->ios_in_progress = 1;
+			stream->ios[0].buffer_index = oldest_buffer_index;
+			stream->seq_blocknum = next_blocknum + 1;
+		}
+		if (next_blocknum == InvalidBlockNumber)
+		{
+			/* Next call hits end of stream and can't pin anything more. */
+			stream->distance = 0;
+			stream->pending_read_nblocks = 0;
+		}
+		else
+		{
+			/* Set up the pending read. */
+			stream->pending_read_blocknum = next_blocknum;
+		}
+		return buffer;
+	}
+#endif
+
+	if (unlikely(stream->pinned_buffers == 0))
+	{
+		Assert(stream->oldest_buffer_index == stream->next_buffer_index);
+
+		/* End of stream reached?  */
+		if (stream->distance == 0)
+			return InvalidBuffer;
+
+		/*
+		 * The usual order of operations is that we look ahead at the bottom
+		 * of this function after potentially finishing an I/O and making
+		 * space for more, but if we're just starting up we'll need to crank
+		 * the handle to get started.
+		 */
+		read_stream_look_ahead(stream, true);
+
+		/* End of stream reached? */
+		if (stream->pinned_buffers == 0)
+		{
+			Assert(stream->distance == 0);
+			return InvalidBuffer;
+		}
+	}
+
+	/* Grab the oldest pinned buffer and associated per-buffer data. */
+	Assert(stream->pinned_buffers > 0);
+	oldest_buffer_index = stream->oldest_buffer_index;
+	Assert(oldest_buffer_index >= 0 &&
+		   oldest_buffer_index < stream->queue_size);
+	buffer = stream->buffers[oldest_buffer_index];
+	if (per_buffer_data)
+		*per_buffer_data = get_per_buffer_data(stream, oldest_buffer_index);
+
+	Assert(BufferIsValid(buffer));
+
+	/* Do we have to wait for an associated I/O first? */
+	if (stream->ios_in_progress > 0 &&
+		stream->ios[stream->oldest_io_index].buffer_index == oldest_buffer_index)
+	{
+		int16		io_index = stream->oldest_io_index;
+		int16		distance;
+
+		/* Sanity check that we still agree on the buffers. */
+		Assert(stream->ios[io_index].op.buffers ==
+			   &stream->buffers[oldest_buffer_index]);
+
+		WaitReadBuffers(&stream->ios[io_index].op);
+
+		Assert(stream->ios_in_progress > 0);
+		stream->ios_in_progress--;
+		if (++stream->oldest_io_index == stream->max_ios)
+			stream->oldest_io_index = 0;
+
+		if (stream->ios[io_index].op.flags & READ_BUFFERS_ISSUE_ADVICE)
+		{
+			/* Distance ramps up fast (behavior C). */
+			distance = stream->distance * 2;
+			distance = Min(distance, stream->max_pinned_buffers);
+			stream->distance = distance;
+		}
+		else
+		{
+			/* No advice; move towards io_combine_limit (behavior B). */
+			if (stream->distance > io_combine_limit)
+			{
+				stream->distance--;
+			}
+			else
+			{
+				distance = stream->distance * 2;
+				distance = Min(distance, io_combine_limit);
+				distance = Min(distance, stream->max_pinned_buffers);
+				stream->distance = distance;
+			}
+		}
+	}
+
+#ifdef CLOBBER_FREED_MEMORY
+	/* Clobber old buffer and per-buffer data for debugging purposes. */
+	stream->buffers[oldest_buffer_index] = InvalidBuffer;
+
+	/*
+	 * The caller will get access to the per-buffer data, until the next call.
+	 * We wipe the one before, which is never occupied because queue_size
+	 * allowed one extra element.  This will hopefully trip up client code
+	 * that is holding a dangling pointer to it.
+	 */
+	if (stream->per_buffer_data)
+		wipe_mem(get_per_buffer_data(stream,
+									 oldest_buffer_index == 0 ?
+									 stream->queue_size - 1 :
+									 oldest_buffer_index - 1),
+				 stream->per_buffer_data_size);
+#endif
+
+	/* Pin transferred to caller. */
+	Assert(stream->pinned_buffers > 0);
+	stream->pinned_buffers--;
+
+	/* Advance oldest buffer, with wrap-around. */
+	stream->oldest_buffer_index++;
+	if (stream->oldest_buffer_index == stream->queue_size)
+		stream->oldest_buffer_index = 0;
+
+	/* Prepare for the next call. */
+	read_stream_look_ahead(stream, false);
+
+#ifndef READ_STREAM_DISABLE_FAST_PATH
+	/* See if we can take the fast path for all-cached scans next time. */
+	if (stream->ios_in_progress == 0 &&
+		stream->pinned_buffers == 1 &&
+		stream->distance == 1 &&
+		stream->pending_read_nblocks == 1 &&
+		stream->per_buffer_data_size == 0)
+	{
+		stream->fast_path = true;
+	}
+	else
+	{
+		stream->fast_path = false;
+	}
+#endif
+
+	return buffer;
+}
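
Continuing the hypothetical per-buffer data sketch from
read_stream_begin_relation(): the consumer passes the address of a pointer
and gets back the payload the callback wrote for this block, valid only until
the next call, as described above.

    MyBlockInfo *info;
    Buffer       buf;

    while ((buf = read_stream_next_buffer(stream, (void **) &info)) !=
           InvalidBuffer)
    {
        if (info->all_visible)
        {
            /* ... use the payload the callback stored for this block ... */
        }
        ReleaseBuffer(buf);
    }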
+
+/*
+ * Reset a read stream by releasing any queued up buffers, allowing the stream
+ * to be used again for different blocks.  This can be used to clear an
+ * end-of-stream condition and start again, or to throw away blocks that were
+ * speculatively read and read some different blocks instead.
+ */
+void
+read_stream_reset(ReadStream *stream)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead. */
+	stream->distance = 0;
+
+	/* Unpin anything that wasn't consumed. */
+	while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	Assert(stream->pinned_buffers == 0);
+	Assert(stream->ios_in_progress == 0);
+
+	/* Start off assuming data is cached. */
+	stream->distance = 1;
+}
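
For example, a hypothetical rescan could throw away whatever the stream has
queued up and then reposition its own callback state; subsequent
read_stream_next_buffer() calls will consult the callback again.

    read_stream_reset(stream);      /* unpin queued buffers, clear end-of-stream */
    scan_state.next_block = 0;      /* reposition the callback's private state */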
+
+/*
+ * Release and free a read stream.
+ */
+void
+read_stream_end(ReadStream *stream)
+{
+	read_stream_reset(stream);
+	pfree(stream);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c5..70b19238b78 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,11 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffers() -- as above, but for multiple contiguous blocks in
+ *		two steps.
+ *
+ * WaitReadBuffers() -- second step of StartReadBuffers().
+ *
  * ReleaseBuffer() -- unpin a buffer
  *
  * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -152,6 +157,13 @@ int			effective_io_concurrency = DEFAULT_EFFECTIVE_IO_CONCURRENCY;
  */
 int			maintenance_io_concurrency = DEFAULT_MAINTENANCE_IO_CONCURRENCY;
 
+/*
+ * Limit on how many blocks should be handled in single I/O operations.
+ * StartReadBuffers() callers should respect it, as should other operations
+ * that call smgr APIs directly.
+ */
+int			io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
+
 /*
  * GUC variables about triggering kernel writeback for buffers written; OS
  * dependent defaults are set via the GUC mechanism.
@@ -471,10 +483,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation *bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
-								ReadBufferMode mode, BufferAccessStrategy strategy,
-								bool *hit);
+								ReadBufferMode mode, BufferAccessStrategy strategy);
 static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
 										   ForkNumber fork,
 										   BufferAccessStrategy strategy,
@@ -500,18 +511,18 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
 static void shared_buffer_write_error_callback(void *arg);
 static void local_buffer_write_error_callback(void *arg);
-static BufferDesc *BufferAlloc(SMgrRelation smgr,
-							   char relpersistence,
-							   ForkNumber forkNum,
-							   BlockNumber blockNum,
-							   BufferAccessStrategy strategy,
-							   bool *foundPtr, IOContext io_context);
+static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
+									  char relpersistence,
+									  ForkNumber forkNum,
+									  BlockNumber blockNum,
+									  BufferAccessStrategy strategy,
+									  bool *foundPtr, IOContext io_context);
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
@@ -781,8 +792,8 @@ Buffer
 ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				   ReadBufferMode mode, BufferAccessStrategy strategy)
 {
-	bool		hit;
 	Buffer		buf;
+	BufferManagerRelation bmr;
 
 	/*
 	 * Reject attempts to read non-local temporary relations; we would be
@@ -794,15 +805,12 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
-							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+	bmr.rel = reln;
+	bmr.smgr = RelationGetSmgr(reln);
+	bmr.relpersistence = reln->rd_rel->relpersistence;
+	buf = ReadBuffer_common(&bmr,
+							forkNum, blockNum, mode, strategy);
+
 	return buf;
 }
 
@@ -822,13 +830,12 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
 						  BufferAccessStrategy strategy, bool permanent)
 {
-	bool		hit;
-
 	SMgrRelation smgr = smgropen(rlocator, INVALID_PROC_NUMBER);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
-							 mode, strategy, &hit);
+	return ReadBuffer_common(&BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									   RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
+							 mode, strategy);
 }
 
 /*
@@ -875,7 +882,7 @@ ExtendBufferedRelBy(BufferManagerRelation bmr,
 					Buffer *buffers,
 					uint32 *extended_by)
 {
-	Assert((bmr.rel != NULL) != (bmr.smgr != NULL));
+	/* Assert((bmr.rel != NULL) != (bmr.smgr != NULL)); */
 	Assert(bmr.smgr == NULL || bmr.relpersistence != 0);
 	Assert(extend_by > 0);
 
@@ -911,7 +918,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	Buffer		buffer = InvalidBuffer;
 	Buffer		buffers[64];
 
-	Assert((bmr.rel != NULL) != (bmr.smgr != NULL));
+	/* Assert((bmr.rel != NULL) != (bmr.smgr != NULL)); */
 	Assert(bmr.smgr == NULL || bmr.relpersistence != 0);
 	Assert(extend_to != InvalidBlockNumber && extend_to > 0);
 
@@ -994,35 +1001,149 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	 */
 	if (buffer == InvalidBuffer)
 	{
-		bool		hit;
-
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
-								   fork, extend_to - 1, mode, strategy,
-								   &hit);
+		buffer = ReadBuffer_common(&bmr, fork, extend_to - 1, mode, strategy);
 	}
 
 	return buffer;
 }
 
 /*
- * ReadBuffer_common -- common logic for all ReadBuffer variants
- *
- * *hit is set to true if the request was satisfied from shared buffer cache.
+ * Zero a buffer and lock it, as part of the implementation of
+ * RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK.  The buffer must be already
+ * pinned.  It does not have to be valid, but it is valid and locked on
+ * return.
  */
-static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
-				  BlockNumber blockNum, ReadBufferMode mode,
-				  BufferAccessStrategy strategy, bool *hit)
+static void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
 {
 	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
+/*
+ * Pin a buffer for a given block.  *foundPtr is set to true if the block was
+ * already present, or false if more work is required to either read it in or
+ * zero it.
+ */
+static inline Buffer
+PinBufferForBlock(BufferManagerRelation *bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
 	IOContext	io_context;
 	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
 
-	*hit = false;
+	Assert(blockNum != P_NEW);
+
+	Assert(bmr->smgr);
+
+	isLocalBuf = bmr->relpersistence == RELPERSISTENCE_TEMP;
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
+
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr->smgr->smgr_rlocator.locator.spcOid,
+									   bmr->smgr->smgr_rlocator.locator.dbOid,
+									   bmr->smgr->smgr_rlocator.locator.relNumber,
+									   bmr->smgr->smgr_rlocator.backend);
+
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr->smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr->smgr, bmr->relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr->rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * WaitReadBuffers() (so, not for hits, and not for buffers that are
+		 * zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr->rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr->rel);
+	}
+	if (*foundPtr)
+	{
+		VacuumPageHit++;
+		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageHit;
+
+		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
+										  bmr->smgr->smgr_rlocator.locator.spcOid,
+										  bmr->smgr->smgr_rlocator.locator.dbOid,
+										  bmr->smgr->smgr_rlocator.locator.relNumber,
+										  bmr->smgr->smgr_rlocator.backend,
+										  true);
+	}
+
+	return BufferDescriptorGetBuffer(bufHdr);
+}
+
+/*
+ * ReadBuffer_common -- common logic for all ReadBuffer variants
+ */
+static pg_attribute_always_inline Buffer
+ReadBuffer_common(BufferManagerRelation *bmr, ForkNumber forkNum,
+				  BlockNumber blockNum, ReadBufferMode mode,
+				  BufferAccessStrategy strategy)
+{
+	ReadBuffersOperation operation;
+	Buffer		buffer;
+	int			flags;
+
+	/* Caller must make sure smgr is initialized. */
+	Assert(bmr->smgr != NULL);
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1041,181 +1162,353 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		return ExtendBufferedRel(*bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
-
-	if (isLocalBuf)
+	if (unlikely(mode == RBM_ZERO_AND_CLEANUP_LOCK ||
+				 mode == RBM_ZERO_AND_LOCK))
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
-		io_context = IOCONTEXT_NORMAL;
-		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
+		bool		found;
+
+#if 0
+		if (bmr->smgr == NULL)
+		{
+			bmr->smgr = RelationGetSmgr(bmr->rel);
+			bmr->relpersistence = bmr->rel->rd_rel->relpersistence;
+		}
+#endif
+
+		buffer = PinBufferForBlock(bmr, forkNum, blockNum, strategy, &found);
+		ZeroBuffer(buffer, mode);
+		return buffer;
 	}
+
+	if (mode == RBM_ZERO_ON_ERROR)
+		flags = READ_BUFFERS_ZERO_ON_ERROR;
 	else
-	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
-		io_context = IOContextForStrategy(strategy);
-		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
-	}
+		flags = 0;
+	operation.bmr = *bmr;
+	operation.forknum = forkNum;
+	operation.strategy = strategy;
+	if (StartReadBuffer(&operation,
+						&buffer,
+						blockNum,
+						flags))
+		WaitReadBuffers(&operation);
+
+	return buffer;
+}
+
+/*
+ * Single block version of StartReadBuffers().  This might save a few
+ * instructions when called from another translation unit, if the compiler
+ * inlines the code and specializes for nblocks == 1.
+ */
+bool
+StartReadBuffer(ReadBuffersOperation *operation,
+				Buffer *buffer,
+				BlockNumber blocknum,
+				int flags)
+{
+	int			nblocks = 1;
+	bool		result;
+
+	result = StartReadBuffers(operation, buffer, blocknum, &nblocks, flags);
+	Assert(nblocks == 1);		/* single block can't be short */
+
+	return result;
+}
+
+/*
+ * Begin reading a range of blocks beginning at blockNum and extending for
+ * *nblocks.  On return, up to *nblocks pinned buffers holding those blocks
+ * are written into the buffers array, and *nblocks is updated to contain the
+ * actual number, which may be fewer than requested.  Caller sets some of the
+ * members of operation; see struct definition.
+ *
+ * If false is returned, no I/O is necessary.  If true is returned, one I/O
+ * has been started, and WaitReadBuffers() must be called with the same
+ * operation object before the buffers are accessed.  Along with the operation
+ * object, the caller-supplied array of buffers must remain valid until
+ * WaitReadBuffers() is called.
+ *
+ * Currently the I/O is only started with optional operating system advice if
+ * requested by the caller with READ_BUFFERS_ISSUE_ADVICE, and the real I/O
+ * happens synchronously in WaitReadBuffers().  In future work, true I/O could
+ * be initiated here.
+ *
+ * Make sure that StartReadBuffer() inlines this function with an attribute.
+ */
+pg_attribute_always_inline bool
+StartReadBuffers(ReadBuffersOperation *operation,
+				 Buffer *buffers,
+				 BlockNumber blockNum,
+				 int *nblocks,
+				 int flags)
+{
+	int			actual_nblocks = *nblocks;
+	int			io_buffers_len = 0;
 
-	/* At this point we do NOT hold any locks. */
+	Assert(*nblocks > 0);
+	Assert(*nblocks <= MAX_IO_COMBINE_LIMIT);
 
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	for (int i = 0; i < actual_nblocks; ++i)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
-		VacuumPageHit++;
-		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
+		bool		found;
 
-		if (VacuumCostActive)
-			VacuumCostBalance += VacuumCostPageHit;
+		buffers[i] = PinBufferForBlock(&operation->bmr,
+									   operation->forknum,
+									   blockNum + i,
+									   operation->strategy,
+									   &found);
 
-		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
+		if (found)
+		{
+			/*
+			 * Terminate the read as soon as we get a hit.  It could be a
+			 * single buffer hit, or it could be a hit that follows a readable
+			 * range.  We don't want to create more than one readable range,
+			 * so we stop here.
+			 */
+			actual_nblocks = i + 1;
+			break;
+		}
+		else
+		{
+			/* Extend the readable range to cover this block. */
+			io_buffers_len++;
+		}
+	}
+	*nblocks = actual_nblocks;
+
+	if (likely(io_buffers_len == 0))
+		return false;
+
+	/* Populate information needed for I/O. */
+	operation->buffers = buffers;
+	operation->blocknum = blockNum;
+	operation->flags = flags;
+	operation->nblocks = actual_nblocks;
+	operation->io_buffers_len = io_buffers_len;
 
+	if (flags & READ_BUFFERS_ISSUE_ADVICE)
+	{
 		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
+		 * In theory we should only do this if PinBufferForBlock() had to
+		 * allocate new buffers above.  That way, if two calls to
+		 * StartReadBuffers() were made for the same blocks before
+		 * WaitReadBuffers(), only the first would issue the advice. That'd be
+		 * a better simulation of true asynchronous I/O, which would only
+		 * start the I/O once, but isn't done here for simplicity.  Note also
+		 * that the following call might actually issue two advice calls if we
+		 * cross a segment boundary; in a true asynchronous version we might
+		 * choose to process only one real I/O at a time in that case.
 		 */
-		if (!isLocalBuf)
-		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
-		}
+		smgrprefetch(operation->bmr.smgr,
+					 operation->forknum,
+					 blockNum,
+					 operation->io_buffers_len);
+	}
 
-		return BufferDescriptorGetBuffer(bufHdr);
+	/* Indicate that WaitReadBuffers() should be called. */
+	return true;
+}
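
A minimal sketch of the two-step protocol for a caller using this interface
directly rather than through read_stream.c (rel, forknum and blocknum are
assumed to be in scope; the assignments follow the "caller sets some of the
members" contract above):

    ReadBuffersOperation operation;
    Buffer      buffers[4];
    int         nblocks = 4;

    operation.bmr = BMR_REL(rel);
    operation.forknum = forknum;
    operation.strategy = NULL;

    if (StartReadBuffers(&operation, buffers, blocknum, &nblocks, 0))
        WaitReadBuffers(&operation);
    /* nblocks now says how many consecutive blocks were actually pinned. */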
+
+static inline bool
+WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
+	{
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
 	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
+
+void
+WaitReadBuffers(ReadBuffersOperation *operation)
+{
+	Buffer	   *buffers;
+	int			nblocks;
+	BlockNumber blocknum;
+	ForkNumber	forknum;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
 
 	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
+	 * Currently operations are only allowed to include a read of some range,
+	 * with an optional extra buffer that is already pinned at the end.  So
+	 * nblocks can be at most one more than io_buffers_len.
 	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
+	Assert((operation->nblocks == operation->io_buffers_len) ||
+		   (operation->nblocks == operation->io_buffers_len + 1));
 
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
+	/* Find the range of the physical read we need to perform. */
+	nblocks = operation->io_buffers_len;
+	if (nblocks == 0)
+		return;					/* nothing to do */
+
+	buffers = &operation->buffers[0];
+	blocknum = operation->blocknum;
+	forknum = operation->forknum;
+
+	isLocalBuf = operation->bmr.relpersistence == RELPERSISTENCE_TEMP;
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(operation->strategy);
+		io_object = IOOBJECT_RELATION;
+	}
 
 	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
 	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
 	else
+		pgBufferUsage.shared_blks_read += nblocks;
+
+	for (int i = 0; i < nblocks; ++i)
 	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_IO_COMBINE_LIMIT];
+		void	   *io_pages[MAX_IO_COMBINE_LIMIT];
+		instr_time	io_start;
+		BlockNumber io_first_block;
+
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!WaitReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PinBufferForBlock().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  operation->bmr.smgr->smgr_rlocator.locator.spcOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.dbOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.relNumber,
+											  operation->bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
+		/*
+		 * How many neighboring-on-disk blocks can we scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   WaitReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
 
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(operation->bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
 
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
 		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
 			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
 			}
 			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-		}
-	}
-
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
-	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
-	}
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
 
-	if (isLocalBuf)
-	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(operation->bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(operation->bmr.smgr->smgr_rlocator, forknum))));
+			}
 
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
-	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
-	}
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
 
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
 
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  operation->bmr.smgr->smgr_rlocator.locator.spcOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.dbOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.relNumber,
+											  operation->bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
 
-	return BufferDescriptorGetBuffer(bufHdr);
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
 }
 
 /*
- * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
- *		buffer.  If no buffer exists already, selects a replacement
- *		victim and evicts the old page, but does NOT read in new page.
+ * BufferAlloc -- subroutine for PinBufferForBlock.  Handles lookup of a shared
+ *		buffer.  If no buffer exists already, selects a replacement victim and
+ *		evicts the old page, but does NOT read in new page.
  *
  * "strategy" can be a buffer replacement strategy object, or NULL for
  * the default strategy.  The selected buffer's usage_count is advanced when
@@ -1223,11 +1516,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1235,7 +1524,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * No locks are held either at entry or exit.
  */
-static BufferDesc *
+static inline BufferDesc *
 BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			BlockNumber blockNum,
 			BufferAccessStrategy strategy,
@@ -1286,19 +1575,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1363,19 +1643,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1407,15 +1678,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1769,7 +2034,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2034,7 +2299,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2057,7 +2322,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2372,7 +2637,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if StartReadBuffers() was called and
+		 * WaitReadBuffers() hasn't been called yet.  We'll check by loading
+		 * the flags without locking.  This is racy, but it's OK to return
+		 * false spuriously: when WaitReadBuffers() calls StartBufferIO(),
+		 * it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2381,7 +2651,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3449,7 +3719,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -5184,9 +5454,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5199,6 +5475,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index fcfac335a57..985a2c7049c 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -108,10 +108,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -287,7 +286,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -297,9 +296,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but it might not be
+	 * initialized yet so read num_temp_buffers.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 92fcd5fa4d5..c12784cbec8 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3129,6 +3129,20 @@ struct config_int ConfigureNamesInt[] =
 		NULL
 	},
 
+	{
+		{"io_combine_limit",
+			PGC_USERSET,
+			RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Limit on the size of data reads and writes."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&io_combine_limit,
+		DEFAULT_IO_COMBINE_LIMIT,
+		1, MAX_IO_COMBINE_LIMIT,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
 			gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index adcc0257f91..baecde28410 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -203,6 +203,7 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
+#io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
 #max_worker_processes = 8		# (change requires restart)
 #max_parallel_workers_per_gather = 2	# limited by max_parallel_workers
 #max_parallel_maintenance_workers = 2	# limited by max_parallel_workers
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..241f68c45e1 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -133,6 +134,10 @@ extern PGDLLIMPORT bool track_io_timing;
 extern PGDLLIMPORT int effective_io_concurrency;
 extern PGDLLIMPORT int maintenance_io_concurrency;
 
+#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
+#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
+extern PGDLLIMPORT int io_combine_limit;
+
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
 extern PGDLLIMPORT int bgwriter_flush_after;
@@ -158,7 +163,6 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
-
 /*
  * prototypes for functions in bufmgr.c
  */
@@ -177,6 +181,38 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+struct ReadBuffersOperation
+{
+	/* The following members should be set by the caller. */
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+	BufferAccessStrategy strategy;
+
+	/* The following private members should not be accessed directly. */
+	Buffer	   *buffers;
+	BlockNumber blocknum;
+	int			flags;
+	int16		nblocks;
+	int16		io_buffers_len;
+};
+
+typedef struct ReadBuffersOperation ReadBuffersOperation;
+
+extern bool StartReadBuffer(ReadBuffersOperation *operation,
+							Buffer *buffer,
+							BlockNumber blocknum,
+							int flags);
+extern bool StartReadBuffers(ReadBuffersOperation *operation,
+							 Buffer *buffers,
+							 BlockNumber blocknum,
+							 int *nblocks,
+							 int flags);
+extern void WaitReadBuffers(ReadBuffersOperation *operation);
+
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -250,6 +286,9 @@ extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
new file mode 100644
index 00000000000..f5dbc087b0b
--- /dev/null
+++ b/src/include/storage/read_stream.h
@@ -0,0 +1,63 @@
+/*-------------------------------------------------------------------------
+ *
+ * read_stream.h
+ *	  Mechanism for accessing buffered relation data with look-ahead
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/read_stream.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef READ_STREAM_H
+#define READ_STREAM_H
+
+#include "storage/bufmgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define READ_STREAM_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users, and thus should be governed by maintenance_io_concurrency
+ * instead of effective_io_concurrency.  For example, VACUUM or CREATE INDEX.
+ */
+#define READ_STREAM_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define READ_STREAM_SEQUENTIAL 0x02
+
+/*
+ * We usually ramp up from smaller reads to larger ones, to support users who
+ * don't know if it's worth reading lots of buffers yet.  This flag disables
+ * that, declaring ahead of time that we'll be reading all available buffers.
+ */
+#define READ_STREAM_FULL 0x04
+
+struct ReadStream;
+typedef struct ReadStream ReadStream;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*ReadStreamBlockNumberCB) (ReadStream *stream,
+												void *callback_private_data,
+												void *per_buffer_data);
+
+extern ReadStream *read_stream_begin_relation(int flags,
+											  BufferAccessStrategy strategy,
+											  BufferManagerRelation bmr,
+											  ForkNumber forknum,
+											  ReadStreamBlockNumberCB callback,
+											  void *callback_private_data,
+											  size_t per_buffer_data_size);
+extern Buffer read_stream_next_buffer(ReadStream *stream, void **per_buffer_private);
+extern void read_stream_reset(ReadStream *stream);
+extern void read_stream_end(ReadStream *stream);
+
+#endif							/* READ_STREAM_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9add48f9924..1590dc6fddc 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1215,6 +1215,7 @@ InjectionPointCacheEntry
 InjectionPointEntry
 InjectionPointSharedState
 InlineCodeBlock
+InProgressIO
 InsertStmt
 Instrumentation
 Int128AggState
@@ -2288,11 +2289,13 @@ ReInitializeDSMForeignScan_function
 ReScanForeignScan_function
 ReadBufPtrType
 ReadBufferMode
+ReadBuffersOperation
 ReadBytePtrType
 ReadExtraTocPtrType
 ReadFunc
 ReadLocalXLogPageNoWaitPrivate
 ReadReplicationSlotCmd
+ReadStream
 ReassignOwnedStmt
 RecheckForeignScan_function
 RecordCacheArrayEntry
-- 
2.40.1

v7-0004-Sequential-scans-and-TID-range-scans-stream-reads.patchtext/x-diff; charset=us-asciiDownload
From dda58b9b556001a336b2ddaa0d516abce4b10f3f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 27 Mar 2024 20:25:06 -0400
Subject: [PATCH v7 4/4] Sequential scans and TID range scans stream reads

Implementing streaming read support for heap sequential scans and TID
range scans includes three parts:

Allocate the read stream object in heap_beginscan(). On rescan, reset
the stream by releasing all pinned buffers and resetting the prefetch
block.

Implement a callback returning the next block to prefetch to the
read stream infrastructure.

Invoke the read stream API when a new page is needed. When the scan
direction changes, reset the stream.
---
 src/backend/access/heap/heapam.c | 90 ++++++++++++++++++++++++++++----
 src/include/access/heapam.h      | 15 ++++++
 2 files changed, 95 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 46645dc971f..6b550308e58 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -221,6 +221,25 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * ----------------------------------------------------------------
  */
 
+static BlockNumber
+heap_scan_stream_read_next(ReadStream *pgsr, void *private_data,
+						   void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) private_data;
+
+	if (unlikely(!scan->rs_inited))
+	{
+		scan->rs_prefetch_block = heapgettup_initial_block(scan, scan->rs_dir);
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_prefetch_block = heapgettup_advance_block(scan,
+														   scan->rs_prefetch_block,
+														   scan->rs_dir);
+
+	return scan->rs_prefetch_block;
+}
+
 /* ----------------
  *		initscan - scan code common to heap_beginscan and heap_rescan
  * ----------------
@@ -323,6 +342,13 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
 
+	/*
+	 * Initialize to ForwardScanDirection because it is most common and heap
+	 * scans usually must go forwards before going backward.
+	 */
+	scan->rs_dir = ForwardScanDirection;
+	scan->rs_prefetch_block = InvalidBlockNumber;
+
 	/* page-at-a-time fields are always invalid when not rs_inited */
 
 	/*
@@ -465,6 +491,8 @@ heapbuildvis(TableScanDesc sscan)
 static pg_attribute_always_inline void
 heapfetchbuf(HeapScanDesc scan, ScanDirection dir)
 {
+	Assert(scan->rs_read_stream);
+
 	/* release previous scan buffer, if any */
 	if (BufferIsValid(scan->rs_cbuf))
 	{
@@ -479,19 +507,23 @@ heapfetchbuf(HeapScanDesc scan, ScanDirection dir)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	if (unlikely(!scan->rs_inited))
+	/*
+	 * If the scan direction is changing, reset the prefetch block to the
+	 * current block. Otherwise, we will incorrectly prefetch the blocks
+	 * between the prefetch block and the current block again before
+	 * prefetching blocks in the new, correct scan direction.
+	 */
+	if (unlikely(scan->rs_dir != dir))
 	{
-		scan->rs_cblock = heapgettup_initial_block(scan, dir);
-		Assert(scan->rs_cblock != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
-		scan->rs_inited = true;
+		scan->rs_prefetch_block = scan->rs_cblock;
+		read_stream_reset(scan->rs_read_stream);
 	}
-	else
-		scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock, dir);
 
-	/* read block if valid */
-	if (BlockNumberIsValid(scan->rs_cblock))
-		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
-										   scan->rs_cblock, RBM_NORMAL, scan->rs_strategy);
+	scan->rs_dir = dir;
+
+	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
+	if (BufferIsValid(scan->rs_cbuf))
+		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
 }
 
 /*
@@ -820,6 +852,7 @@ continue_page:
 
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
+	scan->rs_prefetch_block = InvalidBlockNumber;
 	tuple->t_data = NULL;
 	scan->rs_inited = false;
 }
@@ -910,6 +943,7 @@ continue_page:
 		ReleaseBuffer(scan->rs_cbuf);
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
+	scan->rs_prefetch_block = InvalidBlockNumber;
 	tuple->t_data = NULL;
 	scan->rs_inited = false;
 }
@@ -1003,6 +1037,28 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 
 	initscan(scan, key, false);
 
+	scan->rs_read_stream = NULL;
+
+	/*
+	 * For sequential scans and TID range scans, we will set up a read stream.
+	 * We do not know the scan direction yet. If the scan does not end up
+	 * being a forward scan, the read stream will be freed. This should be
+	 * done after initscan() because initscan() allocates the
+	 * BufferAccessStrategy object.
+	 */
+	if (scan->rs_base.rs_flags & SO_TYPE_SEQSCAN ||
+		scan->rs_base.rs_flags & SO_TYPE_TIDRANGESCAN)
+	{
+		scan->rs_read_stream = read_stream_begin_relation(READ_STREAM_SEQUENTIAL,
+														  scan->rs_strategy,
+														  BMR_REL(scan->rs_base.rs_rd),
+														  MAIN_FORKNUM,
+														  heap_scan_stream_read_next,
+														  scan,
+														  0);
+	}
+
+
 	return (TableScanDesc) scan;
 }
 
@@ -1037,6 +1093,14 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
 
+	/*
+	 * The read stream is reset on rescan. This must be done before
+	 * initscan(), as some state referred to by read_stream_reset() is reset
+	 * in initscan().
+	 */
+	if (scan->rs_read_stream)
+		read_stream_reset(scan->rs_read_stream);
+
 	/*
 	 * reinitialize scan descriptor
 	 */
@@ -1056,6 +1120,12 @@ heap_endscan(TableScanDesc sscan)
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
 
+	/*
+	 * Must free the read stream before freeing the BufferAccessStrategy.
+	 */
+	if (scan->rs_read_stream)
+		read_stream_end(scan->rs_read_stream);
+
 	/*
 	 * decrement relation reference count and free scan descriptor storage
 	 */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 5a6d22a4732..3ec78672587 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,6 +25,7 @@
 #include "storage/bufpage.h"
 #include "storage/dsm.h"
 #include "storage/lockdefs.h"
+#include "storage/read_stream.h"
 #include "storage/shm_toc.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -66,6 +67,20 @@ typedef struct HeapScanDescData
 
 	HeapTupleData rs_ctup;		/* current tuple in scan, if any */
 
+	/* For scans that stream reads */
+	ReadStream *rs_read_stream;
+
+	/*
+	 * For sequential scans and TID range scans to stream reads. The read
+	 * stream is allocated at the beginning of the scan and reset on rescan or
+	 * when the scan direction changes. The scan direction is saved each time
+	 * a new page is requested. If the scan direction changes from one page to
+	 * the next, the read stream releases all previously pinned buffers and
+	 * resets the prefetch block.
+	 */
+	ScanDirection rs_dir;
+	BlockNumber rs_prefetch_block;
+
 	/*
 	 * For parallel scans to store page allocation data.  NULL when not
 	 * performing a parallel scan.
-- 
2.40.1

#13Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Melanie Plageman (#12)
Re: Streaming read-ready sequential scan code

On 01/04/2024 22:58, Melanie Plageman wrote:

Attached v7 has version 14 of the streaming read API as well as a few
small tweaks to comments and code.

I saw benchmarks in this thread to show that there's no regression when
the data is in cache, but I didn't see any benchmarks demonstrating the
benefit of this. So I ran this quick test:

-- create a ~1 GB table with only 1 row per page.
CREATE TABLE giga (i int, filler text) with (fillfactor=10);
insert into giga select g, repeat('x', 900) from generate_series(1,
140000) g;
vacuum freeze giga;

\timing on
select count(*) from giga;

The SELECT takes about 390 ms on 'master', and 230 ms with the patch.

This is pretty much the best case for this patch; real-world gains will
be much smaller. Nevertheless, nice speedup!

--
Heikki Linnakangas
Neon (https://neon.tech)

#14Melanie Plageman
melanieplageman@gmail.com
In reply to: Heikki Linnakangas (#13)
3 attachment(s)
Re: Streaming read-ready sequential scan code

On Tue, Apr 2, 2024 at 1:10 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 01/04/2024 22:58, Melanie Plageman wrote:

Attached v7 has version 14 of the streaming read API as well as a few
small tweaks to comments and code.

I saw benchmarks in this thread to show that there's no regression when
the data is in cache, but I didn't see any benchmarks demonstrating the
benefit of this. So I ran this quick test:

Good point! It would be good to show why we would actually want this
patch. Attached v8 is rebased over current master (which now has the
streaming read API).

On the topic of BAS_BULKREAD buffer access strategy, I think the least
we could do is add an assert like this to read_stream_begin_relation()
after calculating max_pinned_buffers.

Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers);

Perhaps we should do more? I think with a ring size of 16 MB, large
SELECTs are safe for now. But I know future developers will make
changes and it would be good not to assume they will understand that
pinning more buffers than the size of the ring effectively invalidates
the ring.
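
Just to make that concrete, a minimal sketch of how the check could look
inside read_stream_begin_relation(), guarded for the no-strategy case
(the placement and the NULL guard are my assumption, not tested code):

    if (strategy != NULL)
        Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers);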

- Melanie

Attachments:

v8-0003-Sequential-scans-and-TID-range-scans-stream-reads.patchtext/x-patch; charset=US-ASCII; name=v8-0003-Sequential-scans-and-TID-range-scans-stream-reads.patchDownload
From cfccafec650a77c53b1d78180b52db31742181ff Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 27 Mar 2024 20:25:06 -0400
Subject: [PATCH v8 3/3] Sequential scans and TID range scans stream reads

Implementing streaming read support for heap sequential scans and TID
range scans includes three parts:

Allocate the read stream object in heap_beginscan(). On rescan, reset
the stream by releasing all pinned buffers and resetting the prefetch
block.

Implement a callback returning the next block to prefetch to the
read stream infrastructure.

Invoke the read stream API when a new page is needed. When the scan
direction changes, reset the stream.
---
 src/backend/access/heap/heapam.c | 94 ++++++++++++++++++++++++++++----
 src/include/access/heapam.h      | 15 +++++
 2 files changed, 97 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6b26f5bf8af..3546f637c13 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -221,6 +221,25 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * ----------------------------------------------------------------
  */
 
+static BlockNumber
+heap_scan_stream_read_next(ReadStream *pgsr, void *private_data,
+						   void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) private_data;
+
+	if (unlikely(!scan->rs_inited))
+	{
+		scan->rs_prefetch_block = heapgettup_initial_block(scan, scan->rs_dir);
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_prefetch_block = heapgettup_advance_block(scan,
+														   scan->rs_prefetch_block,
+														   scan->rs_dir);
+
+	return scan->rs_prefetch_block;
+}
+
 /* ----------------
  *		initscan - scan code common to heap_beginscan and heap_rescan
  * ----------------
@@ -323,6 +342,13 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
 
+	/*
+	 * Initialize to ForwardScanDirection because it is most common and heap
+	 * scans usually must go forwards before going backward.
+	 */
+	scan->rs_dir = ForwardScanDirection;
+	scan->rs_prefetch_block = InvalidBlockNumber;
+
 	/* page-at-a-time fields are always invalid when not rs_inited */
 
 	/*
@@ -459,12 +485,14 @@ heapbuildvis(TableScanDesc sscan)
 /*
  * heapfetchbuf - subroutine for heapgettup()
  *
- * This routine reads the next block of the relation into a buffer and returns
- * with that pinned buffer saved in the scan descriptor.
+ * This routine gets the next block of the relation from the read stream
+ * and saves that pinned buffer in the scan descriptor.
  */
 static inline void
 heapfetchbuf(HeapScanDesc scan, ScanDirection dir)
 {
+	Assert(scan->rs_read_stream);
+
 	/* release previous scan buffer, if any */
 	if (BufferIsValid(scan->rs_cbuf))
 	{
@@ -479,19 +507,23 @@ heapfetchbuf(HeapScanDesc scan, ScanDirection dir)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	if (unlikely(!scan->rs_inited))
+	/*
+	 * If the scan direction is changing, reset the prefetch block to the
+	 * current block. Otherwise, we will incorrectly prefetch the blocks
+	 * between the prefetch block and the current block again before
+	 * prefetching blocks in the new, correct scan direction.
+	 */
+	if (unlikely(scan->rs_dir != dir))
 	{
-		scan->rs_cblock = heapgettup_initial_block(scan, dir);
-		Assert(scan->rs_cblock != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
-		scan->rs_inited = true;
+		scan->rs_prefetch_block = scan->rs_cblock;
+		read_stream_reset(scan->rs_read_stream);
 	}
-	else
-		scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock, dir);
 
-	/* read block if valid */
-	if (BlockNumberIsValid(scan->rs_cblock))
-		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
-										   scan->rs_cblock, RBM_NORMAL, scan->rs_strategy);
+	scan->rs_dir = dir;
+
+	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
+	if (BufferIsValid(scan->rs_cbuf))
+		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
 }
 
 /*
@@ -820,6 +852,7 @@ continue_page:
 
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
+	scan->rs_prefetch_block = InvalidBlockNumber;
 	tuple->t_data = NULL;
 	scan->rs_inited = false;
 }
@@ -910,6 +943,7 @@ continue_page:
 		ReleaseBuffer(scan->rs_cbuf);
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
+	scan->rs_prefetch_block = InvalidBlockNumber;
 	tuple->t_data = NULL;
 	scan->rs_inited = false;
 }
@@ -1003,6 +1037,28 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 
 	initscan(scan, key, false);
 
+	scan->rs_read_stream = NULL;
+
+	/*
+	 * For sequential scans and TID range scans, we will set up a read stream.
+	 * We do not know the scan direction yet. If the scan does not end up
+	 * being a forward scan, the read stream will be freed. This should be
+	 * done after initscan() because initscan() allocates the
+	 * BufferAccessStrategy object.
+	 */
+	if (scan->rs_base.rs_flags & SO_TYPE_SEQSCAN ||
+		scan->rs_base.rs_flags & SO_TYPE_TIDRANGESCAN)
+	{
+		scan->rs_read_stream = read_stream_begin_relation(READ_STREAM_SEQUENTIAL,
+														  scan->rs_strategy,
+														  scan->rs_base.rs_rd,
+														  MAIN_FORKNUM,
+														  heap_scan_stream_read_next,
+														  scan,
+														  0);
+	}
+
+
 	return (TableScanDesc) scan;
 }
 
@@ -1037,6 +1093,14 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
 
+	/*
+	 * The read stream is reset on rescan. This must be done before
+	 * initscan(), as some state referred to by read_stream_reset() is reset
+	 * in initscan().
+	 */
+	if (scan->rs_read_stream)
+		read_stream_reset(scan->rs_read_stream);
+
 	/*
 	 * reinitialize scan descriptor
 	 */
@@ -1056,6 +1120,12 @@ heap_endscan(TableScanDesc sscan)
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
 
+	/*
+	 * Must free the read stream before freeing the BufferAccessStrategy.
+	 */
+	if (scan->rs_read_stream)
+		read_stream_end(scan->rs_read_stream);
+
 	/*
 	 * decrement relation reference count and free scan descriptor storage
 	 */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4d324c78e5b..41d32d5d95d 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,6 +25,7 @@
 #include "storage/bufpage.h"
 #include "storage/dsm.h"
 #include "storage/lockdefs.h"
+#include "storage/read_stream.h"
 #include "storage/shm_toc.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -70,6 +71,20 @@ typedef struct HeapScanDescData
 
 	HeapTupleData rs_ctup;		/* current tuple in scan, if any */
 
+	/* For scans that stream reads */
+	ReadStream *rs_read_stream;
+
+	/*
+	 * For sequential scans and TID range scans to stream reads. The read
+	 * stream is allocated at the beginning of the scan and reset on rescan or
+	 * when the scan direction changes. The scan direction is saved each time
+	 * a new page is requested. If the scan direction changes from one page to
+	 * the next, the read stream releases all previously pinned buffers and
+	 * resets the prefetch block.
+	 */
+	ScanDirection rs_dir;
+	BlockNumber rs_prefetch_block;
+
 	/*
 	 * For parallel scans to store page allocation data.  NULL when not
 	 * performing a parallel scan.
-- 
2.40.1

v8-0002-Replace-blocks-with-buffers-in-heapgettup-control.patchtext/x-patch; charset=US-ASCII; name=v8-0002-Replace-blocks-with-buffers-in-heapgettup-control.patchDownload
From 7b474f3b4fad7fcf092bae20e67cb020846be30f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 29 Jan 2024 11:50:01 -0500
Subject: [PATCH v8 2/3] Replace blocks with buffers in heapgettup control flow

Future commits will introduce the sequential scan streaming read user
which will implement a callback returning the next block to read.
Sequential scans previously looped through the blocks in the relation,
synchronously reading in a block and then processing it. An
InvalidBlockNumber returned by heapgettup_advance_block() meant that the
relation was exhausted and all blocks had been processed.

The streaming read API may exhaust the blocks in a relation (having read
all of them into buffers) before they have all been processed by the
sequential scan. As such, the sequential scan should continue processing
blocks until heapfetchbuf() returns InvalidBuffer.

Note that this commit does not implement the streaming read API user. It
simply restructures heapgettup() and heapgettup_pagemode() to use
buffers instead of blocks for control flow.
---
 src/backend/access/heap/heapam.c | 79 ++++++++++++++------------------
 1 file changed, 35 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index bf2b9b19e72..6b26f5bf8af 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -83,6 +83,9 @@ static Bitmapset *HeapDetermineColumnsInfo(Relation relation,
 static bool heap_acquire_tuplock(Relation relation, ItemPointer tid,
 								 LockTupleMode mode, LockWaitPolicy wait_policy,
 								 bool *have_tuple_lock);
+static inline BlockNumber heapgettup_advance_block(HeapScanDesc scan,
+												   BlockNumber block, ScanDirection dir);
+static inline BlockNumber heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir);
 static void compute_new_xmax_infomask(TransactionId xmax, uint16 old_infomask,
 									  uint16 old_infomask2, TransactionId add_to_xmax,
 									  LockTupleMode mode, bool is_update,
@@ -456,14 +459,12 @@ heapbuildvis(TableScanDesc sscan)
 /*
  * heapfetchbuf - subroutine for heapgettup()
  *
- * This routine reads the specified block of the relation into a buffer and
- * returns with that pinned buffer saved in the scan descriptor.
+ * This routine reads the next block of the relation into a buffer and returns
+ * with that pinned buffer saved in the scan descriptor.
  */
 static inline void
-heapfetchbuf(HeapScanDesc scan, BlockNumber block)
+heapfetchbuf(HeapScanDesc scan, ScanDirection dir)
 {
-	Assert(block < scan->rs_nblocks);
-
 	/* release previous scan buffer, if any */
 	if (BufferIsValid(scan->rs_cbuf))
 	{
@@ -478,10 +479,19 @@ heapfetchbuf(HeapScanDesc scan, BlockNumber block)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	/* read page using selected strategy */
-	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
-									   RBM_NORMAL, scan->rs_strategy);
-	scan->rs_cblock = block;
+	if (unlikely(!scan->rs_inited))
+	{
+		scan->rs_cblock = heapgettup_initial_block(scan, dir);
+		Assert(scan->rs_cblock != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock, dir);
+
+	/* read block if valid */
+	if (BlockNumberIsValid(scan->rs_cblock))
+		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
+										   scan->rs_cblock, RBM_NORMAL, scan->rs_strategy);
 }
 
 /*
@@ -491,7 +501,7 @@ heapfetchbuf(HeapScanDesc scan, BlockNumber block)
  * occur with empty tables and in parallel scans when parallel workers get all
  * of the pages before we can get a chance to get our first page.
  */
-static BlockNumber
+BlockNumber
 heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
 {
 	Assert(!scan->rs_inited);
@@ -631,7 +641,7 @@ heapgettup_continue_page(HeapScanDesc scan, ScanDirection dir, int *linesleft,
  * This also adjusts rs_numblocks when a limit has been imposed by
  * heap_setscanlimits().
  */
-static inline BlockNumber
+BlockNumber
 heapgettup_advance_block(HeapScanDesc scan, BlockNumber block, ScanDirection dir)
 {
 	if (ScanDirectionIsForward(dir))
@@ -729,23 +739,13 @@ heapgettup(HeapScanDesc scan,
 		   ScanKey key)
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
-	BlockNumber block;
 	Page		page;
 	OffsetNumber lineoff;
 	int			linesleft;
 
-	if (unlikely(!scan->rs_inited))
-	{
-		block = heapgettup_initial_block(scan, dir);
-		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
-		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
-		scan->rs_inited = true;
-	}
-	else
+	if (likely(scan->rs_inited))
 	{
 		/* continue from previously returned page/tuple */
-		block = scan->rs_cblock;
-
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_continue_page(scan, dir, &linesleft, &lineoff);
 		goto continue_page;
@@ -755,9 +755,12 @@ heapgettup(HeapScanDesc scan,
 	 * advance the scan until we find a qualifying tuple or run out of stuff
 	 * to scan
 	 */
-	while (block != InvalidBlockNumber)
+	while (true)
 	{
-		heapfetchbuf(scan, block);
+		heapfetchbuf(scan, dir);
+		if (!BufferIsValid(scan->rs_cbuf))
+			break;
+		Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_start_page(scan, dir, &linesleft, &lineoff);
 continue_page:
@@ -779,7 +782,7 @@ continue_page:
 
 			tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
 			tuple->t_len = ItemIdGetLength(lpp);
-			ItemPointerSet(&(tuple->t_self), block, lineoff);
+			ItemPointerSet(&(tuple->t_self), scan->rs_cblock, lineoff);
 
 			visible = HeapTupleSatisfiesVisibility(tuple,
 												   scan->rs_base.rs_snapshot,
@@ -809,9 +812,6 @@ continue_page:
 		 * it's time to move to the next.
 		 */
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
 	}
 
 	/* end of scan */
@@ -844,22 +844,13 @@ heapgettup_pagemode(HeapScanDesc scan,
 					ScanKey key)
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
-	BlockNumber block;
 	Page		page;
 	int			lineindex;
 	int			linesleft;
 
-	if (unlikely(!scan->rs_inited))
-	{
-		block = heapgettup_initial_block(scan, dir);
-		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
-		Assert(block != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
-		scan->rs_inited = true;
-	}
-	else
+	if (likely(scan->rs_inited))
 	{
 		/* continue from previously returned page/tuple */
-		block = scan->rs_cblock;	/* current page */
 		page = BufferGetPage(scan->rs_cbuf);
 
 		lineindex = scan->rs_cindex + dir;
@@ -876,9 +867,12 @@ heapgettup_pagemode(HeapScanDesc scan,
 	 * advance the scan until we find a qualifying tuple or run out of stuff
 	 * to scan
 	 */
-	while (block != InvalidBlockNumber)
+	while (true)
 	{
-		heapfetchbuf(scan, block);
+		heapfetchbuf(scan, dir);
+		if (!BufferIsValid(scan->rs_cbuf))
+			break;
+		Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
 		heapbuildvis((TableScanDesc) scan);
 		page = BufferGetPage(scan->rs_cbuf);
 		linesleft = scan->rs_ntuples;
@@ -898,7 +892,7 @@ continue_page:
 
 			tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
 			tuple->t_len = ItemIdGetLength(lpp);
-			ItemPointerSet(&(tuple->t_self), block, lineoff);
+			ItemPointerSet(&(tuple->t_self), scan->rs_cblock, lineoff);
 
 			/* skip any tuples that don't match the scan key */
 			if (key != NULL &&
@@ -909,9 +903,6 @@ continue_page:
 			scan->rs_cindex = lineindex;
 			return;
 		}
-
-		/* get the BlockNumber to scan next */
-		block = heapgettup_advance_block(scan, block, dir);
 	}
 
 	/* end of scan */
-- 
2.40.1

v8-0001-Split-heapgetpage-into-two-parts.patchtext/x-patch; charset=US-ASCII; name=v8-0001-Split-heapgetpage-into-two-parts.patchDownload
From 2fb76f871d25a9a6967d700a5dccbe28473c0fc2 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sat, 27 Jan 2024 18:39:37 -0500
Subject: [PATCH v8 1/3] Split heapgetpage into two parts

heapgetpage(), a per-block utility function used in heap scans, read a
passed-in block into a buffer and then, if page-at-a-time processing was
enabled, pruned the page and built an array of its visible tuples. This
was used for sequential and sample scans.

Future commits will add support for read streams. The streaming read API
will read in the blocks specified by a callback, but any significant
per-page processing should be done synchronously on the buffer yielded
by the streaming read API. To support this, separate the logic in
heapgetpage() to read a block into a buffer from that which prunes the
page and builds an array of the visible tuples. The former is now
heapfetchbuf() and the latter is now heapbuildvis().

Future commits will push the logic for selecting the next block into
heapfetchbuf() in cases when streaming reads are not supported (such as
backwards sequential scans). Because this logic differs for sample scans
and sequential scans, inline the code to read the block into a buffer
for sample scans.

This has the added benefit of allowing for a bit of refactoring in
heapam_scan_sample_next_block(), including unpinning the previous buffer
before invoking the callback to select the next block.
---
 src/backend/access/heap/heapam.c         | 74 ++++++++++++++----------
 src/backend/access/heap/heapam_handler.c | 40 +++++++++----
 src/include/access/heapam.h              |  2 +-
 3 files changed, 72 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index a9d5b109a5e..bf2b9b19e72 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -360,17 +360,18 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
 }
 
 /*
- * heapgetpage - subroutine for heapgettup()
+ * heapbuildvis - Utility function for heap scans.
  *
- * This routine reads and pins the specified page of the relation.
- * In page-at-a-time mode it performs additional work, namely determining
- * which tuples on the page are visible.
+ * Given a page residing in a buffer saved in the scan descriptor, prune the
+ * page and determine which of its tuples are all visible, saving their offsets
+ * in an array in the scan descriptor.
  */
 void
-heapgetpage(TableScanDesc sscan, BlockNumber block)
+heapbuildvis(TableScanDesc sscan)
 {
 	HeapScanDesc scan = (HeapScanDesc) sscan;
-	Buffer		buffer;
+	Buffer		buffer = scan->rs_cbuf;
+	BlockNumber block = scan->rs_cblock;
 	Snapshot	snapshot;
 	Page		page;
 	int			lines;
@@ -378,31 +379,8 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	OffsetNumber lineoff;
 	bool		all_visible;
 
-	Assert(block < scan->rs_nblocks);
+	Assert(BufferGetBlockNumber(buffer) == block);
 
-	/* release previous scan buffer, if any */
-	if (BufferIsValid(scan->rs_cbuf))
-	{
-		ReleaseBuffer(scan->rs_cbuf);
-		scan->rs_cbuf = InvalidBuffer;
-	}
-
-	/*
-	 * Be sure to check for interrupts at least once per page.  Checks at
-	 * higher code levels won't be able to stop a seqscan that encounters many
-	 * pages' worth of consecutive dead tuples.
-	 */
-	CHECK_FOR_INTERRUPTS();
-
-	/* read page using selected strategy */
-	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
-									   RBM_NORMAL, scan->rs_strategy);
-	scan->rs_cblock = block;
-
-	if (!(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE))
-		return;
-
-	buffer = scan->rs_cbuf;
 	snapshot = scan->rs_base.rs_snapshot;
 
 	/*
@@ -475,6 +453,37 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	scan->rs_ntuples = ntup;
 }
 
+/*
+ * heapfetchbuf - subroutine for heapgettup()
+ *
+ * This routine reads the specified block of the relation into a buffer and
+ * returns with that pinned buffer saved in the scan descriptor.
+ */
+static inline void
+heapfetchbuf(HeapScanDesc scan, BlockNumber block)
+{
+	Assert(block < scan->rs_nblocks);
+
+	/* release previous scan buffer, if any */
+	if (BufferIsValid(scan->rs_cbuf))
+	{
+		ReleaseBuffer(scan->rs_cbuf);
+		scan->rs_cbuf = InvalidBuffer;
+	}
+
+	/*
+	 * Be sure to check for interrupts at least once per page.  Checks at
+	 * higher code levels won't be able to stop a seqscan that encounters many
+	 * pages' worth of consecutive dead tuples.
+	 */
+	CHECK_FOR_INTERRUPTS();
+
+	/* read page using selected strategy */
+	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
+									   RBM_NORMAL, scan->rs_strategy);
+	scan->rs_cblock = block;
+}
+
 /*
  * heapgettup_initial_block - return the first BlockNumber to scan
  *
@@ -748,7 +757,7 @@ heapgettup(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapfetchbuf(scan, block);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_start_page(scan, dir, &linesleft, &lineoff);
 continue_page:
@@ -869,7 +878,8 @@ heapgettup_pagemode(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapfetchbuf(scan, block);
+		heapbuildvis((TableScanDesc) scan);
 		page = BufferGetPage(scan->rs_cbuf);
 		linesleft = scan->rs_ntuples;
 		lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 0952d4a98eb..f4f670e9b24 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2352,11 +2352,14 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 	if (hscan->rs_nblocks == 0)
 		return false;
 
-	if (tsm->NextSampleBlock)
+	if (BufferIsValid(hscan->rs_cbuf))
 	{
-		blockno = tsm->NextSampleBlock(scanstate, hscan->rs_nblocks);
-		hscan->rs_cblock = blockno;
+		ReleaseBuffer(hscan->rs_cbuf);
+		hscan->rs_cbuf = InvalidBuffer;
 	}
+
+	if (tsm->NextSampleBlock)
+		blockno = tsm->NextSampleBlock(scanstate, hscan->rs_nblocks);
 	else
 	{
 		/* scanning table sequentially */
@@ -2398,20 +2401,35 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 		}
 	}
 
-	if (!BlockNumberIsValid(blockno))
+	hscan->rs_cblock = blockno;
+
+	if (!BlockNumberIsValid(hscan->rs_cblock))
 	{
-		if (BufferIsValid(hscan->rs_cbuf))
-			ReleaseBuffer(hscan->rs_cbuf);
-		hscan->rs_cbuf = InvalidBuffer;
-		hscan->rs_cblock = InvalidBlockNumber;
 		hscan->rs_inited = false;
-
 		return false;
 	}
 
-	heapgetpage(scan, blockno);
-	hscan->rs_inited = true;
+	Assert(hscan->rs_cblock < hscan->rs_nblocks);
+
+	/*
+	 * We may scan multiple pages before finding tuples to yield or finishing
+	 * the scan. Since we want to check for interrupts at least once per page,
+	 * do so here.
+	 */
+	CHECK_FOR_INTERRUPTS();
+
+	/* Read page using selected strategy */
+	hscan->rs_cbuf = ReadBufferExtended(hscan->rs_base.rs_rd, MAIN_FORKNUM,
+										hscan->rs_cblock, RBM_NORMAL, hscan->rs_strategy);
 
+	/*
+	 * If pagemode is allowed, prune the page and build an array of visible
+	 * tuple offsets.
+	 */
+	if (hscan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
+		heapbuildvis(scan);
+
+	hscan->rs_inited = true;
 	return true;
 }
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index a307fb5f245..4d324c78e5b 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -267,7 +267,7 @@ extern TableScanDesc heap_beginscan(Relation relation, Snapshot snapshot,
 									uint32 flags);
 extern void heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk,
 							   BlockNumber numBlks);
-extern void heapgetpage(TableScanDesc sscan, BlockNumber block);
+extern void heapbuildvis(TableScanDesc sscan);
 extern void heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 						bool allow_strat, bool allow_sync, bool allow_pagemode);
 extern void heap_endscan(TableScanDesc sscan);
-- 
2.40.1

#15Thomas Munro
thomas.munro@gmail.com
In reply to: Melanie Plageman (#14)
Re: Streaming read-ready sequential scan code

On Thu, Apr 4, 2024 at 6:03 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Tue, Apr 2, 2024 at 1:10 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 01/04/2024 22:58, Melanie Plageman wrote:

Attached v7 has version 14 of the streaming read API as well as a few
small tweaks to comments and code.

I saw benchmarks in this thread to show that there's no regression when
the data is in cache, but I didn't see any benchmarks demonstrating the
benefit of this. So I ran this quick test:

Good point! It would be good to show why we would actually want this
patch. Attached v8 is rebased over current master (which now has the
streaming read API).

Anecdotally by all reports I've seen so far, all-in-cache seems to be
consistently a touch faster than master if anything, for streaming seq
scan and streaming ANALYZE. That's great, but it doesn't seem to be
due to intentional changes. No efficiency is coming from batching
anything. Perhaps it has to do with CPU pipelining effects: though
it's doing the same work as ReadBuffer()-when-it-gets-a-hit, the work
itself is cut up into stages in a kind of pipeline:
read_stream_next_buffer() chooses page n + 2, pins page n + 1 and
returns page n each time you call it, so maybe we get more CPU
parallelism due to spreading the data dependencies out? (Makes me
wonder what happens if you insert a memory prefetch for the page
header into that production line, and if there are more opportunities
to spread dependencies out eg hashing the buffer tag ahead of time.)
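
To make that staggering concrete, each call looks very roughly like this
(pseudo-code with made-up names, not the actual read_stream.c internals):

    /* Each call advances three overlapping stages by one step. */
    Buffer
    stream_next(Stream *stream)
    {
        choose_block(stream, stream->next + 2);   /* pick page n + 2 */
        pin_block(stream, stream->next + 1);      /* pin page n + 1 */
        return stream->buffers[stream->next++];   /* return page n */
    }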

On the topic of BAS_BULKREAD buffer access strategy, I think the least
we could do is add an assert like this to read_stream_begin_relation()
after calculating max_pinned_buffers.

Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers);

Perhaps we should do more? I think with a ring size of 16 MB, large
SELECTs are safe for now. But I know future developers will make
changes and it would be good not to assume they will understand that
pinning more buffers than the size of the ring effectively invalidates
the ring.

Yeah I think we should clamp max_pinned_buffers if we see a strategy.
What do you think about:

if (strategy)
{
    int     strategy_buffers = GetAccessStrategyBufferCount(strategy);

    max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers);
}

I just don't know where to get that '2'. The idea would be to
hopefully never actually be constrained by it in practice, except in
tiny/toy setups, so we can't go too wrong with our number '2' there.

Then we should increase the default ring sizes for BAS_BULKREAD and
BAS_VACUUM to make the previous statement true. The size of main
memory and L2 cache have increased dramatically since those strategies
were invented. I think we should at least double them, and more
likely quadruple them. I realise you already made them configurable
per command in commit 1cbbee033, but I mean the hard coded default 256
in freelist.c. It's not only to get more space for our prefetching
plans, it's also to give the system more chance of flushing WAL
asynchronously/in some other backend before you crash into dirty data;
as you discovered, prefetching accidentally makes that effect worse
in a BAS_VACUUM strategy, by taking away space that is effectively
deferring WAL flushes, so I'd at least like to double the size if
we use "/ 2" above.

#16Melanie Plageman
melanieplageman@gmail.com
In reply to: Thomas Munro (#15)
Re: Streaming read-ready sequential scan code

On Wed, Apr 3, 2024 at 4:28 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Apr 4, 2024 at 6:03 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On the topic of BAS_BULKREAD buffer access strategy, I think the least
we could do is add an assert like this to read_stream_begin_relation()
after calculating max_pinned_buffers.

Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers);

Perhaps we should do more? I think with a ring size of 16 MB, large
SELECTs are safe for now. But I know future developers will make
changes and it would be good not to assume they will understand that
pinning more buffers than the size of the ring effectively invalidates
the ring.

Yeah I think we should clamp max_pinned_buffers if we see a strategy.
What do you think about:

if (strategy)
{
    int     strategy_buffers = GetAccessStrategyBufferCount(strategy);

    max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers);
}

I just don't know where to get that '2'. The idea would be to
hopefully never actually be constrained by it in practice, except in
tiny/toy setups, so we can't go too wrong with our number '2' there.

Yea, I don't actually understand why not just use strategy_buffers - 1
or something. 1/2 seems like a big limiting factor for those
strategies with small rings.

I don't really think it will come up for SELECT queries since they
rely on readahead and not prefetching.
It does seem like it could easily come up for analyze.

But I am on board with the idea of clamping for sequential scan/TID
range scan. For vacuum, we might have to think harder. If the user
specifies buffer_usage_limit and io_combine_limit and they are never
reaching IOs of io_combine_limit because of their buffer_usage_limit
value, then we should probably inform them.
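
Something along these lines, purely as a sketch (the exact condition and
the message wording are placeholders, not a concrete proposal):

    /* Hypothetical: the strategy ring can never hold one full combined read. */
    if (strategy != NULL &&
        GetAccessStrategyBufferCount(strategy) < io_combine_limit)
        ereport(WARNING,
                (errmsg("buffer_usage_limit is too small to ever reach io_combine_limit")));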

- Melanie

#17Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#15)
Re: Streaming read-ready sequential scan code

Hi,

On 2024-04-04 09:27:35 +1300, Thomas Munro wrote:

Anecdotally by all reports I've seen so far, all-in-cache seems to be
consistently a touch faster than master if anything, for streaming seq
scan and streaming ANALYZE. That's great, but it doesn't seem to be
due to intentional changes. No efficiency is coming from batching
anything. Perhaps it has to do with CPU pipelining effects: though
it's doing the same work as ReadBuffer()-when-it-gets-a-hit, the work
itself is cut up into stages in a kind of pipeline:
read_stream_next_buffer() chooses page n + 2, pins page n + 1 and
returns page n each time you call it, so maybe we get more CPU
parallelism due to spreading the data dependencies out?

Another theory is that it's due to the plain ReadBuffer() path needing to do
RelationGetSmgr(reln) on every call, whereas the streaming read path doesn't
need to.

On the topic of BAS_BULKREAD buffer access strategy, I think the least
we could do is add an assert like this to read_stream_begin_relation()
after calculating max_pinned_buffers.

Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers);

Perhaps we should do more? I think with a ring size of 16 MB, large
SELECTs are safe for now. But I know future developers will make
changes and it would be good not to assume they will understand that
pinning more buffers than the size of the ring effectively invalidates
the ring.

Yeah I think we should clamp max_pinned_buffers if we see a strategy.
What do you think about:

if (strategy)
{
    int     strategy_buffers = GetAccessStrategyBufferCount(strategy);

    max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers);
}

I just don't know where to get that '2'. The idea would be to
hopefully never actually be constrained by it in practice, except in
tiny/toy setups, so we can't go too wrong with our number '2' there.

The / 2 is to avoid causing unnecessarily frequent WAL flushes, right? If so,
should we apply that only if the strategy doesn't use the
StrategyRejectBuffer() logic?

I think it's fine to add a handwavy justification for the 2, saying that we
want to balance readahead distance and reducing WAL write frequency, and that
at some point more sophisticated logic will be needed.
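
For example, something along these lines over the clamp would do for now
(sketch only; the comment wording is obviously up for debate):

    if (strategy)
    {
        int     strategy_buffers = GetAccessStrategyBufferCount(strategy);

        /*
         * Crude balance between look-ahead distance and WAL flush
         * frequency: cap pins at half of the strategy ring so that the
         * other half keeps deferring WAL flushes.  More sophisticated
         * logic can replace this later.
         */
        max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers);
    }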

Then we should increase the default ring sizes for BAS_BULKREAD and
BAS_VACUUM to make the previous statement true.

I'm not sure it's right tying them together. The concerns for both are fairly
different.

The size of main memory and L2 cache have increased dramatically since those
strategies were invented. I think we should at least double them, and more
likely quadruple them. I realise you already made them configurable per
command in commit 1cbbee033, but I mean the hard coded default 256 in
freelist.c. It's not only to get more space for our prefetching plans, it's
also to give the system more chance of flushing WAL asynchronously/in some
other backend before you crash into dirty data; as you discovered,
prefetching accidentally makes that effect worse in a BAS_VACUUM strategy,
by taking away space that is effectively deferring WAL flushes, so I'd at
least like to double the size if we use "/ 2" above.

I think for VACUUM we should probably go a bit further. There's no comparable
L1/L2 issue, because the per-buffer processing + WAL insertion is a lot more
expensive, compared to a seqscan. I'd go for at least 4x-8x.

Greetings,

Andres Freund

#18David Rowley
dgrowleyml@gmail.com
In reply to: Melanie Plageman (#14)
2 attachment(s)
Re: Streaming read-ready sequential scan code

On Thu, 4 Apr 2024 at 06:03, Melanie Plageman <melanieplageman@gmail.com> wrote:

Attached v8 is rebased over current master (which now has the
streaming read API).

I've looked at the v8-0001 patch.

I wasn't too keen on heapbuildvis() as a function name for an external
function. Since it also does pruning work, it seemed weird to make it
sound like it only did visibility work. Per our offline discussion
about names, I've changed it to what you suggested, which is
heap_page_prep().

Aside from that, there was an outdated comment: "In pageatatime mode,
heapgetpage() already did visibility checks," which was no longer true
as that's done in heapbuildvis() (now heap_page_prep()).

I also did a round of comment adjustments as there were a few things I
didn't like, e.g:

+ * heapfetchbuf - subroutine for heapgettup()

This is also used in heapgettup_pagemode(), so I thought it was a bad
idea for a function to list the places it thinks it's being called from. I
also thought it redundant to write "This routine" in the function head
comment. I think "this routine" is implied by the context. I ended up
with:

/*
* heapfetchbuf - read and pin the given MAIN_FORKNUM block number.
*
* Read the specified block of the scan relation into a buffer and pin that
* buffer before saving it in the scan descriptor.
*/

I'm happy with your changes to heapam_scan_sample_next_block(). I did
adjust the comment above CHECK_FOR_INTERRUPTS() so it was effectively
the same as the seqscan version, just with "seqscan" swapped for
"sample scan".

The only other thing I adjusted there was to use "blockno" in some
places where you were using hscan->rs_cblock. These all come after
the "hscan->rs_cblock = blockno;" line. My thinking is that this makes
it more likely we avoid re-reading the value from the struct when the
compiler can't prove the two values are still equivalent. Even if the
compiler is happy today, it would only take a code change that passes
hscan to some external function for the compiler to become uncertain
whether that function has adjusted rs_cblock; it would then take the
safe option of pulling the value out of the struct again, which is a
little more expensive as it requires some maths to figure out the
offset.
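
To illustrate with a tiny standalone sketch (made-up types, nothing from
heapam): the local variable can stay in a register across an opaque call,
while the struct member has to be reloaded:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

typedef struct ScanState
{
    BlockNumber rs_cblock;
    BlockNumber rs_nblocks;
} ScanState;

/* defined elsewhere; the compiler cannot see whether it writes to *scan */
extern void opaque_call(ScanState *scan);

bool
advance(ScanState *scan, BlockNumber blockno)
{
    scan->rs_cblock = blockno;

    opaque_call(scan);

    /*
     * "blockno" can stay in a register here; reading scan->rs_cblock
     * instead would force a reload from memory, because the compiler
     * must assume opaque_call() may have changed it.
     */
    return blockno < scan->rs_nblocks;
}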

I've attached v9-0001 and a delta of just my changes from v8.

David

Attachments:

v9-0001-Split-heapgetpage-into-two-parts.patchtext/plain; charset=US-ASCII; name=v9-0001-Split-heapgetpage-into-two-parts.patchDownload
From 90bfc63097c556d0921d8f9165731fb07ec04167 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sat, 27 Jan 2024 18:39:37 -0500
Subject: [PATCH v9] Split heapgetpage into two parts

heapgetpage(), a per-block utility function used in heap scans, read a
passed-in block into a buffer and then, if page-at-a-time processing was
enabled, pruned the page and built an array of its visible tuples. This
was used for sequential and sample scans.

Future commits will add support for read streams. The streaming read API
will read in the blocks specified by a callback, but any significant
per-page processing should be done synchronously on the buffer yielded
by the streaming read API. To support this, separate the logic in
heapgetpage() to read a block into a buffer from that which prunes the
page and builds an array of the visible tuples. The former is now
heapfetchbuf() and the latter is now heapbuildvis().

Future commits will push the logic for selecting the next block into
heapfetchbuf() in cases when streaming reads are not supported (such as
backwards sequential scans). Because this logic differs for sample scans
and sequential scans, inline the code to read the block into a buffer
for sample scans.

This has the added benefit of allowing for a bit of refactoring in
heapam_scan_sample_next_block(), including unpinning the previous buffer
before invoking the callback to select the next block.
---
 src/backend/access/heap/heapam.c         | 78 ++++++++++++++----------
 src/backend/access/heap/heapam_handler.c | 38 ++++++++----
 src/include/access/heapam.h              |  2 +-
 3 files changed, 74 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index a9d5b109a5..6524fc44a5 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -360,17 +360,17 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
 }
 
 /*
- * heapgetpage - subroutine for heapgettup()
+ * heap_page_prep - Prepare the current scan page to be scanned in pagemode
  *
- * This routine reads and pins the specified page of the relation.
- * In page-at-a-time mode it performs additional work, namely determining
- * which tuples on the page are visible.
+ * Preparation currently consists of 1. prune the scan's rs_cbuf page, and 2.
+ * fill the rs_vistuples array with the OffsetNumbers of visible tuples.
  */
 void
-heapgetpage(TableScanDesc sscan, BlockNumber block)
+heap_page_prep(TableScanDesc sscan)
 {
 	HeapScanDesc scan = (HeapScanDesc) sscan;
-	Buffer		buffer;
+	Buffer		buffer = scan->rs_cbuf;
+	BlockNumber block = scan->rs_cblock;
 	Snapshot	snapshot;
 	Page		page;
 	int			lines;
@@ -378,31 +378,10 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	OffsetNumber lineoff;
 	bool		all_visible;
 
-	Assert(block < scan->rs_nblocks);
-
-	/* release previous scan buffer, if any */
-	if (BufferIsValid(scan->rs_cbuf))
-	{
-		ReleaseBuffer(scan->rs_cbuf);
-		scan->rs_cbuf = InvalidBuffer;
-	}
-
-	/*
-	 * Be sure to check for interrupts at least once per page.  Checks at
-	 * higher code levels won't be able to stop a seqscan that encounters many
-	 * pages' worth of consecutive dead tuples.
-	 */
-	CHECK_FOR_INTERRUPTS();
-
-	/* read page using selected strategy */
-	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
-									   RBM_NORMAL, scan->rs_strategy);
-	scan->rs_cblock = block;
-
-	if (!(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE))
-		return;
+	Assert(BufferGetBlockNumber(buffer) == block);
 
-	buffer = scan->rs_cbuf;
+	/* ensure we're not accidentally being used when not in pagemode */
+	Assert(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE);
 	snapshot = scan->rs_base.rs_snapshot;
 
 	/*
@@ -475,6 +454,37 @@ heapgetpage(TableScanDesc sscan, BlockNumber block)
 	scan->rs_ntuples = ntup;
 }
 
+/*
+ * heapfetchbuf - read and pin the given MAIN_FORKNUM block number.
+ *
+ * Read the specified block of the scan relation into a buffer and pin that
+ * buffer before saving it in the scan descriptor.
+ */
+static inline void
+heapfetchbuf(HeapScanDesc scan, BlockNumber block)
+{
+	Assert(block < scan->rs_nblocks);
+
+	/* release previous scan buffer, if any */
+	if (BufferIsValid(scan->rs_cbuf))
+	{
+		ReleaseBuffer(scan->rs_cbuf);
+		scan->rs_cbuf = InvalidBuffer;
+	}
+
+	/*
+	 * Be sure to check for interrupts at least once per page.  Checks at
+	 * higher code levels won't be able to stop a seqscan that encounters many
+	 * pages' worth of consecutive dead tuples.
+	 */
+	CHECK_FOR_INTERRUPTS();
+
+	/* read page using selected strategy */
+	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
+									   RBM_NORMAL, scan->rs_strategy);
+	scan->rs_cblock = block;
+}
+
 /*
  * heapgettup_initial_block - return the first BlockNumber to scan
  *
@@ -748,7 +758,7 @@ heapgettup(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		heapfetchbuf(scan, block);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 		page = heapgettup_start_page(scan, dir, &linesleft, &lineoff);
 continue_page:
@@ -869,7 +879,11 @@ heapgettup_pagemode(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
-		heapgetpage((TableScanDesc) scan, block);
+		/* read the page */
+		heapfetchbuf(scan, block);
+
+		/* prune the page and determine visible tuple offsets */
+		heap_page_prep((TableScanDesc) scan);
 		page = BufferGetPage(scan->rs_cbuf);
 		linesleft = scan->rs_ntuples;
 		lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 0952d4a98e..a4451c24b4 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2352,11 +2352,15 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 	if (hscan->rs_nblocks == 0)
 		return false;
 
-	if (tsm->NextSampleBlock)
+	/* release previous scan buffer, if any */
+	if (BufferIsValid(hscan->rs_cbuf))
 	{
-		blockno = tsm->NextSampleBlock(scanstate, hscan->rs_nblocks);
-		hscan->rs_cblock = blockno;
+		ReleaseBuffer(hscan->rs_cbuf);
+		hscan->rs_cbuf = InvalidBuffer;
 	}
+
+	if (tsm->NextSampleBlock)
+		blockno = tsm->NextSampleBlock(scanstate, hscan->rs_nblocks);
 	else
 	{
 		/* scanning table sequentially */
@@ -2398,20 +2402,32 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 		}
 	}
 
+	hscan->rs_cblock = blockno;
+
 	if (!BlockNumberIsValid(blockno))
 	{
-		if (BufferIsValid(hscan->rs_cbuf))
-			ReleaseBuffer(hscan->rs_cbuf);
-		hscan->rs_cbuf = InvalidBuffer;
-		hscan->rs_cblock = InvalidBlockNumber;
 		hscan->rs_inited = false;
-
 		return false;
 	}
 
-	heapgetpage(scan, blockno);
-	hscan->rs_inited = true;
+	Assert(hscan->rs_cblock < hscan->rs_nblocks);
+
+	/*
+	 * Be sure to check for interrupts at least once per page.  Checks at
+	 * higher code levels won't be able to stop a sample scan that encounters
+	 * many pages' worth of consecutive dead tuples.
+	 */
+	CHECK_FOR_INTERRUPTS();
+
+	/* Read page using selected strategy */
+	hscan->rs_cbuf = ReadBufferExtended(hscan->rs_base.rs_rd, MAIN_FORKNUM,
+										blockno, RBM_NORMAL, hscan->rs_strategy);
 
+	/* in pagemode, prune the page and determine visible tuple offsets */
+	if (hscan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
+		heap_page_prep(scan);
+
+	hscan->rs_inited = true;
 	return true;
 }
 
@@ -2572,7 +2588,7 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
 	if (scan->rs_flags & SO_ALLOW_PAGEMODE)
 	{
 		/*
-		 * In pageatatime mode, heapgetpage() already did visibility checks,
+		 * In pageatatime mode, heap_page_prep() already did visibility checks,
 		 * so just look at the info it left in rs_vistuples[].
 		 *
 		 * We use a binary search over the known-sorted array.  Note: we could
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index a307fb5f24..e8a211cf1b 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -267,7 +267,7 @@ extern TableScanDesc heap_beginscan(Relation relation, Snapshot snapshot,
 									uint32 flags);
 extern void heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk,
 							   BlockNumber numBlks);
-extern void heapgetpage(TableScanDesc sscan, BlockNumber block);
+extern void heap_page_prep(TableScanDesc sscan);
 extern void heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 						bool allow_strat, bool allow_sync, bool allow_pagemode);
 extern void heap_endscan(TableScanDesc sscan);
-- 
2.40.1.windows.1

v8_to_v9-0001_delta.patch.txttext/plain; charset=US-ASCII; name=v8_to_v9-0001_delta.patch.txtDownload
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index bf2b9b19e7..6524fc44a5 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -360,14 +360,13 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
 }
 
 /*
- * heapbuildvis - Utility function for heap scans.
+ * heap_page_prep - Prepare the current scan page to be scanned in pagemode
  *
- * Given a page residing in a buffer saved in the scan descriptor, prune the
- * page and determine which of its tuples are all visible, saving their offsets
- * in an array in the scan descriptor.
+ * Preparation currently consists of 1. prune the scan's rs_cbuf page, and 2.
+ * fill the rs_vistuples array with the OffsetNumbers of visible tuples.
  */
 void
-heapbuildvis(TableScanDesc sscan)
+heap_page_prep(TableScanDesc sscan)
 {
 	HeapScanDesc scan = (HeapScanDesc) sscan;
 	Buffer		buffer = scan->rs_cbuf;
@@ -381,6 +380,8 @@ heapbuildvis(TableScanDesc sscan)
 
 	Assert(BufferGetBlockNumber(buffer) == block);
 
+	/* ensure we're not accidentally being used when not in pagemode */
+	Assert(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE);
 	snapshot = scan->rs_base.rs_snapshot;
 
 	/*
@@ -454,10 +455,10 @@ heapbuildvis(TableScanDesc sscan)
 }
 
 /*
- * heapfetchbuf - subroutine for heapgettup()
+ * heapfetchbuf - read and pin the given MAIN_FORKNUM block number.
  *
- * This routine reads the specified block of the relation into a buffer and
- * returns with that pinned buffer saved in the scan descriptor.
+ * Read the specified block of the scan relation into a buffer and pin that
+ * buffer before saving it in the scan descriptor.
  */
 static inline void
 heapfetchbuf(HeapScanDesc scan, BlockNumber block)
@@ -878,8 +879,11 @@ heapgettup_pagemode(HeapScanDesc scan,
 	 */
 	while (block != InvalidBlockNumber)
 	{
+		/* read the page */
 		heapfetchbuf(scan, block);
-		heapbuildvis((TableScanDesc) scan);
+
+		/* prune the page and determine visible tuple offsets */
+		heap_page_prep((TableScanDesc) scan);
 		page = BufferGetPage(scan->rs_cbuf);
 		linesleft = scan->rs_ntuples;
 		lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index f4f670e9b2..a4451c24b4 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2352,6 +2352,7 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 	if (hscan->rs_nblocks == 0)
 		return false;
 
+	/* release previous scan buffer, if any */
 	if (BufferIsValid(hscan->rs_cbuf))
 	{
 		ReleaseBuffer(hscan->rs_cbuf);
@@ -2403,7 +2404,7 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 
 	hscan->rs_cblock = blockno;
 
-	if (!BlockNumberIsValid(hscan->rs_cblock))
+	if (!BlockNumberIsValid(blockno))
 	{
 		hscan->rs_inited = false;
 		return false;
@@ -2412,22 +2413,19 @@ heapam_scan_sample_next_block(TableScanDesc scan, SampleScanState *scanstate)
 	Assert(hscan->rs_cblock < hscan->rs_nblocks);
 
 	/*
-	 * We may scan multiple pages before finding tuples to yield or finishing
-	 * the scan. Since we want to check for interrupts at least once per page,
-	 * do so here.
+	 * Be sure to check for interrupts at least once per page.  Checks at
+	 * higher code levels won't be able to stop a sample scan that encounters
+	 * many pages' worth of consecutive dead tuples.
 	 */
 	CHECK_FOR_INTERRUPTS();
 
 	/* Read page using selected strategy */
 	hscan->rs_cbuf = ReadBufferExtended(hscan->rs_base.rs_rd, MAIN_FORKNUM,
-										hscan->rs_cblock, RBM_NORMAL, hscan->rs_strategy);
+										blockno, RBM_NORMAL, hscan->rs_strategy);
 
-	/*
-	 * If pagemode is allowed, prune the page and build an array of visible
-	 * tuple offsets.
-	 */
+	/* in pagemode, prune the page and determine visible tuple offsets */
 	if (hscan->rs_base.rs_flags & SO_ALLOW_PAGEMODE)
-		heapbuildvis(scan);
+		heap_page_prep(scan);
 
 	hscan->rs_inited = true;
 	return true;
@@ -2590,7 +2588,7 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
 	if (scan->rs_flags & SO_ALLOW_PAGEMODE)
 	{
 		/*
-		 * In pageatatime mode, heapgetpage() already did visibility checks,
+		 * In pageatatime mode, heap_page_prep() already did visibility checks,
 		 * so just look at the info it left in rs_vistuples[].
 		 *
 		 * We use a binary search over the known-sorted array.  Note: we could
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 4d324c78e5..e8a211cf1b 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -267,7 +267,7 @@ extern TableScanDesc heap_beginscan(Relation relation, Snapshot snapshot,
 									uint32 flags);
 extern void heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk,
 							   BlockNumber numBlks);
-extern void heapbuildvis(TableScanDesc sscan);
+extern void heap_page_prep(TableScanDesc sscan);
 extern void heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 						bool allow_strat, bool allow_sync, bool allow_pagemode);
 extern void heap_endscan(TableScanDesc sscan);
#19Melanie Plageman
melanieplageman@gmail.com
In reply to: David Rowley (#18)
Re: Streaming read-ready sequential scan code

On Wed, Apr 3, 2024 at 9:08 PM David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 4 Apr 2024 at 06:03, Melanie Plageman <melanieplageman@gmail.com> wrote:

Attached v8 is rebased over current master (which now has the
streaming read API).

I've looked at the v8-0001 patch.

Thanks for taking a look!

I wasn't too keen on heapbuildvis() as a function name for an external
function. Since it also does pruning work, it seemed weird to make it
sound like it only did visibility work. Per our offline discussion
about names, I've changed it to what you suggested which is
heap_page_prep().

Looking at it in the code, I am wondering if we should call
heap_page_prep() heap_scan_page_prep(). Just wondering if it is clear
that it is prepping a page to be scanned. You choose whatever you
think is best.

Aside from that, there was an outdated comment: "In pageatatime mode,
heapgetpage() already did visibility checks," which was no longer true
as that's done in heapbuildvis() (now heap_page_prep()).

I also did a round of comment adjustments as there were a few things I
didn't like, e.g:

+ * heapfetchbuf - subroutine for heapgettup()

This is also used in heapgettup_pagemode(), so I thought it was a bad
idea for a function to list the places it thinks it's being called from. I
also thought it redundant to write "This routine" in the function head
comment. I think "this routine" is implied by the context. I ended up
with:

/*
* heapfetchbuf - read and pin the given MAIN_FORKNUM block number.
*
* Read the specified block of the scan relation into a buffer and pin that
* buffer before saving it in the scan descriptor.
*/

I'm happy with your changes to heapam_scan_sample_next_block(). I did
adjust the comment above CHECK_FOR_INTERRUPTS() so it was effectively
the same as the seqscan version, just with "seqscan" swapped for
"sample scan".

That all is fine with me.

The only other thing I adjusted there was to use "blockno" in some
places where you were using hscan->rs_cblock. These all come after
the "hscan->rs_cblock = blockno;" line. My thinking is that this makes
it more likely we avoid re-reading the value from the struct when the
compiler can't prove the two values are still equivalent. Even if the
compiler is happy today, it would only take a code change that passes
hscan to some external function for the compiler to become uncertain
whether that function has adjusted rs_cblock; it would then take the
safe option of pulling the value out of the struct again, which is a
little more expensive as it requires some maths to figure out the
offset.

I've attached v9-0001 and a delta of just my changes from v8.

All sounds good and LGTM

- Melanie

#20David Rowley
dgrowleyml@gmail.com
In reply to: Melanie Plageman (#19)
Re: Streaming read-ready sequential scan code

On Thu, 4 Apr 2024 at 14:38, Melanie Plageman <melanieplageman@gmail.com> wrote:

Looking at it in the code, I am wondering if we should call
heap_page_prep() heap_scan_page_prep(). Just wondering if it is clear
that it is prepping a page to be scanned. You choose whatever you
think is best.

I ended up calling it heap_prepare_pagescan() as I started to think
prep/prepare should come first. I don't think it's perfect as the
intended meaning is heap_prepare_page_for_scanning_in_pagemode(), but
that's obviously too long.

I've pushed the v9-0001 with that rename done.

David

#21David Rowley
dgrowleyml@gmail.com
In reply to: David Rowley (#20)
1 attachment(s)
Re: Streaming read-ready sequential scan code

On Thu, 4 Apr 2024 at 16:45, David Rowley <dgrowleyml@gmail.com> wrote:

I've pushed the v9-0001 with that rename done.

I've now just pushed the 0002 patch with some revisions:

1. The function declarations you added for heapgettup_advance_block()
and heapgettup_initial_block() didn't match the properties of their
definitions. You'd declared both of these static inline but neither
of these were.
2. I felt inclined to rename heapfetchbuf() to heapfetchnextbuf() as
that's effectively what it does with v8-0002, however, that's just too
many words all shoved together, so I renamed it to
heap_fetch_next_buffer().
3. I changed heapgettup_initial_block() to pg_noinline as it both
makes more sense to have this out of line and it also fixed a small
performance regression.

Looks like I also failed to grep for all the remaining instances of
"heapgetpage" in 44086b097. Those are now fixed by 3a4a3537a.

I also rebased the 0003 patch which I've attached as a raw patch.

FWIW, using Heikki's test in [1]/messages/by-id/3b0f3701-addd-4629-9257-cf28e1a6e6a1@iki.fi with a pg_prewarm each time after
restarting the instance. No parallel aggregate.

drowley@amd3990x:~$ cat bench.sql
select count(*) from giga;

drowley@amd3990x:~$ pgbench -n -f bench.sql -T 120 postgres | grep latency

44086b097~1
latency average = 34.323 ms
latency average = 34.332 ms

44086b097
latency average = 34.811 ms
latency average = 34.862 ms

3a4a3537a
latency average = 34.497 ms
latency average = 34.538 ms

3a4a3537a + read_stream_for_seqscans.patch
latency average = 40.923 ms
latency average = 41.415 ms

i.e. no meaningful change from the refactor, but a regression with the
read stream patch on a cached workload that changes pages often without
doing much work in between.

I'm happy to run further benchmarks, but for the remainder of the
committer responsibility for the next patches, I'm going to leave that
to Thomas.

David

[1]: /messages/by-id/3b0f3701-addd-4629-9257-cf28e1a6e6a1@iki.fi

Attachments:

read_stream_for_seqscans.patchtext/plain; charset=US-ASCII; name=read_stream_for_seqscans.patchDownload
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index dada2ecd1e..f7946a39fd 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -223,6 +223,25 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * ----------------------------------------------------------------
  */
 
+static BlockNumber
+heap_scan_stream_read_next(ReadStream *pgsr, void *private_data,
+						   void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) private_data;
+
+	if (unlikely(!scan->rs_inited))
+	{
+		scan->rs_prefetch_block = heapgettup_initial_block(scan, scan->rs_dir);
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_prefetch_block = heapgettup_advance_block(scan,
+														   scan->rs_prefetch_block,
+														   scan->rs_dir);
+
+	return scan->rs_prefetch_block;
+}
+
 /* ----------------
  *		initscan - scan code common to heap_beginscan and heap_rescan
  * ----------------
@@ -325,6 +344,13 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
 
+	/*
+	 * Initialize to ForwardScanDirection because it is most common and heap
+	 * scans usually must go forwards before going backward.
+	 */
+	scan->rs_dir = ForwardScanDirection;
+	scan->rs_prefetch_block = InvalidBlockNumber;
+
 	/* page-at-a-time fields are always invalid when not rs_inited */
 
 	/*
@@ -462,12 +488,14 @@ heap_prepare_pagescan(TableScanDesc sscan)
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
- * Read the next block of the scan relation into a buffer and pin that buffer
- * before saving it in the scan descriptor.
+ * Read the next block of the scan relation from the read stream and pin that
+ * buffer before saving it in the scan descriptor.
  */
 static inline void
 heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 {
+	Assert(scan->rs_read_stream);
+
 	/* release previous scan buffer, if any */
 	if (BufferIsValid(scan->rs_cbuf))
 	{
@@ -482,25 +510,23 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	if (unlikely(!scan->rs_inited))
+	/*
+	 * If the scan direction is changing, reset the prefetch block to the
+	 * current block. Otherwise, we will incorrectly prefetch the blocks
+	 * between the prefetch block and the current block again before
+	 * prefetching blocks in the new, correct scan direction.
+	 */
+	if (unlikely(scan->rs_dir != dir))
 	{
-		scan->rs_cblock = heapgettup_initial_block(scan, dir);
+		scan->rs_prefetch_block = scan->rs_cblock;
+		read_stream_reset(scan->rs_read_stream);
+	}
 
-		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
-		Assert(scan->rs_cblock != InvalidBlockNumber ||
-			   !BufferIsValid(scan->rs_cbuf));
+	scan->rs_dir = dir;
 
-		scan->rs_inited = true;
-	}
-	else
-		scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock,
-												   dir);
-
-	/* read block if valid */
-	if (BlockNumberIsValid(scan->rs_cblock))
-		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
-										   scan->rs_cblock, RBM_NORMAL,
-										   scan->rs_strategy);
+	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
+	if (BufferIsValid(scan->rs_cbuf))
+		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
 }
 
 /*
@@ -833,6 +859,7 @@ continue_page:
 
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
+	scan->rs_prefetch_block = InvalidBlockNumber;
 	tuple->t_data = NULL;
 	scan->rs_inited = false;
 }
@@ -928,6 +955,7 @@ continue_page:
 		ReleaseBuffer(scan->rs_cbuf);
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
+	scan->rs_prefetch_block = InvalidBlockNumber;
 	tuple->t_data = NULL;
 	scan->rs_inited = false;
 }
@@ -1021,6 +1049,28 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 
 	initscan(scan, key, false);
 
+	scan->rs_read_stream = NULL;
+
+	/*
+	 * For sequential scans and TID range scans, we will set up a read stream.
+	 * We do not know the scan direction yet. If the scan does not end up
+	 * being a forward scan, the read stream will be freed. This should be
+	 * done after initscan() because initscan() allocates the
+	 * BufferAccessStrategy object.
+	 */
+	if (scan->rs_base.rs_flags & SO_TYPE_SEQSCAN ||
+		scan->rs_base.rs_flags & SO_TYPE_TIDRANGESCAN)
+	{
+		scan->rs_read_stream = read_stream_begin_relation(READ_STREAM_SEQUENTIAL,
+														  scan->rs_strategy,
+														  scan->rs_base.rs_rd,
+														  MAIN_FORKNUM,
+														  heap_scan_stream_read_next,
+														  scan,
+														  0);
+	}
+
+
 	return (TableScanDesc) scan;
 }
 
@@ -1055,6 +1105,14 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
 
+	/*
+	 * The read stream is reset on rescan. This must be done before
+	 * initscan(), as some state referred to by read_stream_reset() is reset
+	 * in initscan().
+	 */
+	if (scan->rs_read_stream)
+		read_stream_reset(scan->rs_read_stream);
+
 	/*
 	 * reinitialize scan descriptor
 	 */
@@ -1074,6 +1132,12 @@ heap_endscan(TableScanDesc sscan)
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
 
+	/*
+	 * Must free the read stream before freeing the BufferAccessStrategy.
+	 */
+	if (scan->rs_read_stream)
+		read_stream_end(scan->rs_read_stream);
+
 	/*
 	 * decrement relation reference count and free scan descriptor storage
 	 */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 2765efc4e5..332a7faa8d 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,6 +25,7 @@
 #include "storage/bufpage.h"
 #include "storage/dsm.h"
 #include "storage/lockdefs.h"
+#include "storage/read_stream.h"
 #include "storage/shm_toc.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -70,6 +71,20 @@ typedef struct HeapScanDescData
 
 	HeapTupleData rs_ctup;		/* current tuple in scan, if any */
 
+	/* For scans that stream reads */
+	ReadStream *rs_read_stream;
+
+	/*
+	 * For sequential scans and TID range scans to stream reads. The read
+	 * stream is allocated at the beginning of the scan and reset on rescan or
+	 * when the scan direction changes. The scan direction is saved each time
+	 * a new page is requested. If the scan direction changes from one page to
+	 * the next, the read stream releases all previously pinned buffers and
+	 * resets the prefetch block.
+	 */
+	ScanDirection rs_dir;
+	BlockNumber rs_prefetch_block;
+
 	/*
 	 * For parallel scans to store page allocation data.  NULL when not
 	 * performing a parallel scan.
#22Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#17)
2 attachment(s)
Re: Streaming read-ready sequential scan code

On Thu, Apr 4, 2024 at 11:13 AM Andres Freund <andres@anarazel.de> wrote:

The / 2 is to avoid causing unnecessarily frequent WAL flushes, right? If so,
should we apply that only if the strategy doesn't use the
StrategyRejectBuffer() logic?

Hmm, I don't really know, but that sounds plausible. What do you
think about the attached?

I think for VACUUM we should probably go a bit further. There's no comparable
L1/L2 issue, because the per-buffer processing + WAL insertion is a lot more
expensive, compared to a seqscan. I'd go or at lest 4x-8x.

Alright what about this?

Attachments:

0001-Allow-BufferAccessStrategy-to-limit-pin-count.txttext/plain; charset=US-ASCII; name=0001-Allow-BufferAccessStrategy-to-limit-pin-count.txtDownload
From 6dea2983abf8d608c34e02351d70694de99f25f2 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 4 Apr 2024 20:31:26 +1300
Subject: [PATCH 1/2] Allow BufferAccessStrategy to limit pin count.

When pinning extra buffers to look ahead, users of a strategy are in
danger of pinning a lot of the buffers in the ring, or even more than
the ring size.  For some strategies, that means "escaping" from the
ring, and in others it means forcing dirty data to disk very frequently
with associated WAL flushing.  Since external code has no insight into
any of that, allow individual strategy types to expose a clamp that
should be applied when deciding how many buffers to pin at once.

Discussion: https://postgr.es/m/CAAKRu_aJXnqsyZt6HwFLnxYEBgE17oypkxbKbT1t1geE_wvH2Q%40mail.gmail.com
---
 src/backend/storage/aio/read_stream.c |  5 ++++
 src/backend/storage/buffer/freelist.c | 35 +++++++++++++++++++++++++++
 src/include/storage/bufmgr.h          |  1 +
 3 files changed, 41 insertions(+)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 4f21262ff5..eab87f6f73 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -419,6 +419,7 @@ read_stream_begin_relation(int flags,
 	size_t		size;
 	int16		queue_size;
 	int16		max_ios;
+	int			strategy_pin_limit;
 	uint32		max_pinned_buffers;
 	Oid			tablespace_id;
 	SMgrRelation smgr;
@@ -460,6 +461,10 @@ read_stream_begin_relation(int flags,
 	max_pinned_buffers = Min(max_pinned_buffers,
 							 PG_INT16_MAX - io_combine_limit - 1);
 
+	/* Give the strategy a chance to limit the number of buffers we pin. */
+	strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
+	max_pinned_buffers = Min(strategy_pin_limit, max_pinned_buffers);
+
 	/* Don't allow this backend to pin more than its share of buffers. */
 	if (SmgrIsTemp(smgr))
 		LimitAdditionalLocalPins(&max_pinned_buffers);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 3611357fa3..c69590d6d8 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -629,6 +629,41 @@ GetAccessStrategyBufferCount(BufferAccessStrategy strategy)
 	return strategy->nbuffers;
 }
 
+/*
+ * GetAccessStrategyPinLimit -- get cap of number of buffers that can be pinned
+ *
+ * Strategies can specify the maximum number of buffers that a user should pin
+ * at once when performing look-ahead.  Callers should combine this number with
+ * other relevant limits and take the minimum.
+ */
+int
+GetAccessStrategyPinLimit(BufferAccessStrategy strategy)
+{
+	if (strategy == NULL)
+		return NBuffers;
+
+	switch (strategy->btype)
+	{
+		case BAS_BULKREAD:
+
+			/*
+			 * Since BAS_BULKREAD uses StrategyRejectBuffer(), dirty buffers
+			 * shouldn't be a problem and the caller is free to pin up to the
+			 * entire ring at once.
+			 */
+			return strategy->nbuffers;
+
+		default:
+
+			/*
+			 * Tell caller not to pin more than half the buffers in the ring.
+			 * This is a trade-off between look ahead distance and deferring
+			 * writeback and associated WAL traffic.
+			 */
+			return strategy->nbuffers / 2;
+	}
+}
+
 /*
  * FreeAccessStrategy -- release a BufferAccessStrategy object
  *
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index f380f9d9a6..07ba1a6050 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -318,6 +318,7 @@ extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype);
 extern BufferAccessStrategy GetAccessStrategyWithSize(BufferAccessStrategyType btype,
 													  int ring_size_kb);
 extern int	GetAccessStrategyBufferCount(BufferAccessStrategy strategy);
+extern int	GetAccessStrategyPinLimit(BufferAccessStrategy strategy);
 
 extern void FreeAccessStrategy(BufferAccessStrategy strategy);
 
-- 
2.39.3 (Apple Git-146)

0002-Increase-default-vacuum_buffer_usage_limit-to-2MB.txttext/plain; charset=US-ASCII; name=0002-Increase-default-vacuum_buffer_usage_limit-to-2MB.txtDownload
From 1488b8f56b9aa342ab3164135afffe9805e562c6 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 4 Apr 2024 21:11:06 +1300
Subject: [PATCH 2/2] Increase default vacuum_buffer_usage_limit to 2MB.

The BAS_VACUUM ring size has been 256kB since commit d526575f.  Commit
1cbbee03 made it configurable but retained the traditional default.
The correct default size has been debated for years, but 256kB is
certainly very small.  VACUUM soon needs to write back data it dirtied
only 32 blocks ago, which usually requires flushing the WAL.  New
experiments in prefetching pages for VACUUM exacerbated the problem by
crashing into dirty data even sooner.  Let's make the default 2MB.
That's 1.5% of the default toy buffer pool size, and 0.2% of 1GB, which
would be considered a small shared_buffers setting for a real system
these days.  Users are still free to set the GUC to a different value.

Discussion: https://postgr.es/m/20240403221257.md4gfki3z75cdyf6%40awork3.anarazel.de
Discussion: https://postgre.es/m/CA%2BhUKGLY4Q4ZY4f1rvnFtv6%2BPkjNf8MejdPkcju3Qii9DYqqcQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      | 2 +-
 src/backend/storage/buffer/freelist.c         | 2 +-
 src/backend/utils/misc/guc_tables.c           | 2 +-
 src/backend/utils/misc/postgresql.conf.sample | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 624518e0b0..d8e1282e12 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1961,7 +1961,7 @@ include_dir 'conf.d'
         valid sizes range from <literal>128 kB</literal> to
         <literal>16 GB</literal>.  If the specified size would exceed 1/8 the
         size of <varname>shared_buffers</varname>, the size is silently capped
-        to that value.  The default value is <literal>256 kB</literal>.  If
+        to that value.  The default value is <literal>2MB</literal>.  If
         this value is specified without units, it is taken as kilobytes.  This
         parameter can be set at any time.  It can be overridden for
         <xref linkend="sql-vacuum"/> and <xref linkend="sql-analyze"/>
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index c69590d6d8..65a6b3b357 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -561,7 +561,7 @@ GetAccessStrategy(BufferAccessStrategyType btype)
 			ring_size_kb = 16 * 1024;
 			break;
 		case BAS_VACUUM:
-			ring_size_kb = 256;
+			ring_size_kb = 2048;
 			break;
 
 		default:
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index c12784cbec..7d4e4387cf 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2275,7 +2275,7 @@ struct config_int ConfigureNamesInt[] =
 			GUC_UNIT_KB
 		},
 		&VacuumBufferUsageLimit,
-		256, 0, MAX_BAS_VAC_RING_SIZE_KB,
+		2048, 0, MAX_BAS_VAC_RING_SIZE_KB,
 		check_vacuum_buffer_usage_limit, NULL, NULL
 	},
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index baecde2841..2166ea4a87 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -166,7 +166,7 @@
 					#   mmap
 					# (change requires restart)
 #min_dynamic_shared_memory = 0MB	# (change requires restart)
-#vacuum_buffer_usage_limit = 256kB	# size of vacuum and analyze buffer access strategy ring;
+#vacuum_buffer_usage_limit = 2MB	# size of vacuum and analyze buffer access strategy ring;
 					# 0 to disable vacuum buffer access strategy;
 					# range 128kB to 16GB
 
-- 
2.39.3 (Apple Git-146)

#23Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#22)
2 attachment(s)
Re: Streaming read-ready sequential scan code

On Thu, Apr 4, 2024 at 10:31 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Alright what about this?

Forgot to git add a change, new version.

Attachments:

v2-0001-Allow-BufferAccessStrategy-to-limit-pin-count.txttext/plain; charset=US-ASCII; name=v2-0001-Allow-BufferAccessStrategy-to-limit-pin-count.txtDownload
From 6dea2983abf8d608c34e02351d70694de99f25f2 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 4 Apr 2024 20:31:26 +1300
Subject: [PATCH v2 1/2] Allow BufferAccessStrategy to limit pin count.

When pinning extra buffers to look ahead, users of a strategy are in
danger of pinning a lot of the buffers in the ring, or even more than
the ring size.  For some strategies, that means "escaping" from the
ring, and in others it means forcing dirty data to disk very frequently
with associated WAL flushing.  Since external code has no insight into
any of that, allow individual strategy types to expose a clamp that
should be applied when deciding how many buffers to pin at once.

Discussion: https://postgr.es/m/CAAKRu_aJXnqsyZt6HwFLnxYEBgE17oypkxbKbT1t1geE_wvH2Q%40mail.gmail.com
---
 src/backend/storage/aio/read_stream.c |  5 ++++
 src/backend/storage/buffer/freelist.c | 35 +++++++++++++++++++++++++++
 src/include/storage/bufmgr.h          |  1 +
 3 files changed, 41 insertions(+)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 4f21262ff5..eab87f6f73 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -419,6 +419,7 @@ read_stream_begin_relation(int flags,
 	size_t		size;
 	int16		queue_size;
 	int16		max_ios;
+	int			strategy_pin_limit;
 	uint32		max_pinned_buffers;
 	Oid			tablespace_id;
 	SMgrRelation smgr;
@@ -460,6 +461,10 @@ read_stream_begin_relation(int flags,
 	max_pinned_buffers = Min(max_pinned_buffers,
 							 PG_INT16_MAX - io_combine_limit - 1);
 
+	/* Give the strategy a chance to limit the number of buffers we pin. */
+	strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
+	max_pinned_buffers = Min(strategy_pin_limit, max_pinned_buffers);
+
 	/* Don't allow this backend to pin more than its share of buffers. */
 	if (SmgrIsTemp(smgr))
 		LimitAdditionalLocalPins(&max_pinned_buffers);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 3611357fa3..c69590d6d8 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -629,6 +629,41 @@ GetAccessStrategyBufferCount(BufferAccessStrategy strategy)
 	return strategy->nbuffers;
 }
 
+/*
+ * GetAccessStrategyPinLimit -- get cap of number of buffers that can be pinned
+ *
+ * Strategies can specify the maximum number of buffers that a user should pin
+ * at once when performing look-ahead.  Callers should combine this number with
+ * other relevant limits and take the minimum.
+ */
+int
+GetAccessStrategyPinLimit(BufferAccessStrategy strategy)
+{
+	if (strategy == NULL)
+		return NBuffers;
+
+	switch (strategy->btype)
+	{
+		case BAS_BULKREAD:
+
+			/*
+			 * Since BAS_BULKREAD uses StrategyRejectBuffer(), dirty buffers
+			 * shouldn't be a problem and the caller is free to pin up to the
+			 * entire ring at once.
+			 */
+			return strategy->nbuffers;
+
+		default:
+
+			/*
+			 * Tell caller not to pin more than half the buffers in the ring.
+			 * This is a trade-off between look ahead distance and deferring
+			 * writeback and associated WAL traffic.
+			 */
+			return strategy->nbuffers / 2;
+	}
+}
+
 /*
  * FreeAccessStrategy -- release a BufferAccessStrategy object
  *
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index f380f9d9a6..07ba1a6050 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -318,6 +318,7 @@ extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype);
 extern BufferAccessStrategy GetAccessStrategyWithSize(BufferAccessStrategyType btype,
 													  int ring_size_kb);
 extern int	GetAccessStrategyBufferCount(BufferAccessStrategy strategy);
+extern int	GetAccessStrategyPinLimit(BufferAccessStrategy strategy);
 
 extern void FreeAccessStrategy(BufferAccessStrategy strategy);
 
-- 
2.39.3 (Apple Git-146)

v2-0002-Increase-default-vacuum_buffer_usage_limit-to-2MB.txttext/plain; charset=US-ASCII; name=v2-0002-Increase-default-vacuum_buffer_usage_limit-to-2MB.txtDownload
From e610bc78a2e3ecee50bd897e35084469d00bbac5 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 4 Apr 2024 21:11:06 +1300
Subject: [PATCH v2 2/2] Increase default vacuum_buffer_usage_limit to 2MB.

The BAS_VACUUM ring size has been 256kB since commit d526575f.  Commit
1cbbee03 made it configurable but retained the traditional default.
The correct default size has been debated for years, but 256kB is
certainly very small.  VACUUM soon needs to write back data it dirtied
only 32 blocks ago, which usually requires flushing the WAL.  New
experiments in prefetching pages for VACUUM exacerbated the problem by
crashing into dirty data even sooner.  Let's make the default 2MB.
That's 1.5% of the default toy buffer pool size, and 0.2% of 1GB, which
would be considered a small shared_buffers setting for a real system
these days.  Users are still free to set the GUC to a different value.

Discussion: https://postgr.es/m/20240403221257.md4gfki3z75cdyf6%40awork3.anarazel.de
Discussion: https://postgre.es/m/CA%2BhUKGLY4Q4ZY4f1rvnFtv6%2BPkjNf8MejdPkcju3Qii9DYqqcQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      | 2 +-
 src/backend/storage/buffer/freelist.c         | 2 +-
 src/backend/utils/init/globals.c              | 2 +-
 src/backend/utils/misc/guc_tables.c           | 2 +-
 src/backend/utils/misc/postgresql.conf.sample | 2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 624518e0b0..d8e1282e12 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1961,7 +1961,7 @@ include_dir 'conf.d'
         valid sizes range from <literal>128 kB</literal> to
         <literal>16 GB</literal>.  If the specified size would exceed 1/8 the
         size of <varname>shared_buffers</varname>, the size is silently capped
-        to that value.  The default value is <literal>256 kB</literal>.  If
+        to that value.  The default value is <literal>2MB</literal>.  If
         this value is specified without units, it is taken as kilobytes.  This
         parameter can be set at any time.  It can be overridden for
         <xref linkend="sql-vacuum"/> and <xref linkend="sql-analyze"/>
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index c69590d6d8..65a6b3b357 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -561,7 +561,7 @@ GetAccessStrategy(BufferAccessStrategyType btype)
 			ring_size_kb = 16 * 1024;
 			break;
 		case BAS_VACUUM:
-			ring_size_kb = 256;
+			ring_size_kb = 2048;
 			break;
 
 		default:
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 3e38bb1311..cc61937eef 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -143,7 +143,7 @@ int			max_parallel_workers = 8;
 int			MaxBackends = 0;
 
 /* GUC parameters for vacuum */
-int			VacuumBufferUsageLimit = 256;
+int			VacuumBufferUsageLimit = 2048;
 
 int			VacuumCostPageHit = 1;
 int			VacuumCostPageMiss = 2;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index c12784cbec..7d4e4387cf 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2275,7 +2275,7 @@ struct config_int ConfigureNamesInt[] =
 			GUC_UNIT_KB
 		},
 		&VacuumBufferUsageLimit,
-		256, 0, MAX_BAS_VAC_RING_SIZE_KB,
+		2048, 0, MAX_BAS_VAC_RING_SIZE_KB,
 		check_vacuum_buffer_usage_limit, NULL, NULL
 	},
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index baecde2841..2166ea4a87 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -166,7 +166,7 @@
 					#   mmap
 					# (change requires restart)
 #min_dynamic_shared_memory = 0MB	# (change requires restart)
-#vacuum_buffer_usage_limit = 256kB	# size of vacuum and analyze buffer access strategy ring;
+#vacuum_buffer_usage_limit = 2MB	# size of vacuum and analyze buffer access strategy ring;
 					# 0 to disable vacuum buffer access strategy;
 					# range 128kB to 16GB
 
-- 
2.39.3 (Apple Git-146)

#24Thomas Munro
thomas.munro@gmail.com
In reply to: David Rowley (#21)
Re: Streaming read-ready sequential scan code

On Thu, Apr 4, 2024 at 8:02 PM David Rowley <dgrowleyml@gmail.com> wrote:

3a4a3537a
latency average = 34.497 ms
latency average = 34.538 ms

3a4a3537a + read_stream_for_seqscans.patch
latency average = 40.923 ms
latency average = 41.415 ms

i.e. no meaningful change from the refactor, but a regression with the
read stream patch on a cached workload that changes pages often without
doing much work in between.

I ran Heikki's test except I ran the "insert" 4 times to get a table
of 4376MB according to \d+. On my random cloud ARM server (SB=8GB,
huge pages, parallelism disabled), I see a speedup of 1290ms -> 1046ms
when the data is in the Linux cache and PG is not prewarmed, roughly as he
reported. Good.

If I pg_prewarm first, I see that slowdown 260ms -> 294ms. Trying
things out to see what works, I got that down to 243ms (ie beat
master) by inserting a memory prefetch:

--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -757,6 +757,8 @@ read_stream_next_buffer(ReadStream *stream, void
**per_buffer_data)
        /* Prepare for the next call. */
        read_stream_look_ahead(stream, false);

+ __builtin_prefetch(BufferGetPage(stream->buffers[stream->oldest_buffer_index]));

Maybe that's a solution to a different problem that just happens to
more than make up the difference in this case, and it may be
questionable whether that cache line will survive long enough to help
you, but this one-tuple-per-page test likes it... Hmm, to get a more
realistic table than the one-tuple-per-page one, I tried doubling a
tenk1 table until it reached 2759MB and then I got a post-prewarm
regression 702ms -> 721ms, and again I can beat master by memory
prefetching: 689ms.

Annoyingly, with the same table I see no difference in the actual
pg_prewarm('giga') time: around 155ms for both. pg_prewarm is able to
use the 'fast path' I made pretty much just to be able to minimise
regression in that (probably stupid) all-cached test that doesn't even
look at the page contents. Unfortunately seq scan can't use it because
it has per-buffer data, which is one of the things the fast path can't
handle (because of a space management issue). Maybe I should try to
find a way to fix that.

I'm happy to run further benchmarks, but for the remainder of the
committer responsibility for the next patches, I'm going to leave that
to Thomas.

Thanks!

#25Melanie Plageman
melanieplageman@gmail.com
In reply to: David Rowley (#21)
Re: Streaming read-ready sequential scan code

On Thu, Apr 4, 2024 at 3:02 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 4 Apr 2024 at 16:45, David Rowley <dgrowleyml@gmail.com> wrote:

I've pushed the v9-0001 with that rename done.

I've now just pushed the 0002 patch with some revisions:

Thanks!

1. The function declarations you added for heapgettup_advance_block()
and heapgettup_initial_block() didn't match the properties of their
definitions. You'd declared both of these static inline but neither
of these were.

Ah, they needed to be defined static but I intentionally left off the
inline in the definition and only put it in the forward declaration
because I thought that was correct. Anyway, I'm fine with how you did
it in the end.

2. I felt inclined to rename heapfetchbuf() to heapfetchnextbuf() as
that's effectively what it does with v8-0002, however, that's just too
many words all shoved together, so I renamed it to
heap_fetch_next_buffer().

Sounds good.

3. I changed heapgettup_initial_block() to pg_noinline as it both
makes more sense to have this out of line and it also fixed a small
performance regression.

Ah, I guess it is in the unlikely path. I often forget that noinline
exists. It's interesting that that made a noticeable difference since
it is a pretty short function. Thanks for taking such a close look!

Looks like I also failed to grep for all the remaining instances of
"heapgetpage" in 44086b097. Those are now fixed by 3a4a3537a.

I also rebased the 0003 patch which I've attached as a raw patch.

Thanks!

FWIW, using Heikki's test in [1] with a pg_prewarm each time after
restarting the instance. No parallel aggregate.

drowley@amd3990x:~$ cat bench.sql
select count(*) from giga;

drowley@amd3990x:~$ pgbench -n -f bench.sql -T 120 postgres | grep latency

44086b097~1
latency average = 34.323 ms
latency average = 34.332 ms

44086b097
latency average = 34.811 ms
latency average = 34.862 ms

3a4a3537a
latency average = 34.497 ms
latency average = 34.538 ms

3a4a3537a + read_stream_for_seqscans.patch
latency average = 40.923 ms
latency average = 41.415 ms

i.e. no meaningful change from the refactor, but a regression with the
read stream patch on a cached workload that changes pages often without
doing much work in between.

Cool. Thanks for testing this out. Sounds like Thomas did some
analysis of how to resolve this for the streaming read user, and I
will do some too.

- Melanie

#26Melanie Plageman
melanieplageman@gmail.com
In reply to: Thomas Munro (#24)
Re: Streaming read-ready sequential scan code

On Thu, Apr 4, 2024 at 7:45 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Apr 4, 2024 at 8:02 PM David Rowley <dgrowleyml@gmail.com> wrote:

3a4a3537a
latency average = 34.497 ms
latency average = 34.538 ms

3a4a3537a + read_stream_for_seqscans.patch
latency average = 40.923 ms
latency average = 41.415 ms

i.e. no meaningful change from the refactor, but a regression with the
read stream patch on a cached workload that changes pages often without
doing much work in between.

I ran Heikki's test except I ran the "insert" 4 times to get a table
of 4376MB according to \d+. On my random cloud ARM server (SB=8GB,
huge pages, parallelism disabled), I see a speedup of 1290ms -> 1046ms
when the data is in the Linux cache and PG is not prewarmed, roughly as he
reported. Good.

If I pg_prewarm first, I see that slowdown 260ms -> 294ms. Trying
things out to see what works, I got that down to 243ms (ie beat
master) by inserting a memory prefetch:

--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -757,6 +757,8 @@ read_stream_next_buffer(ReadStream *stream, void
**per_buffer_data)
/* Prepare for the next call. */
read_stream_look_ahead(stream, false);

+ __builtin_prefetch(BufferGetPage(stream->buffers[stream->oldest_buffer_index]));

Maybe that's a solution to a different problem that just happens to
more than make up the difference in this case, and it may be
questionable whether that cache line will survive long enough to help
you, but this one-tuple-per-page test likes it... Hmm, to get a more
realistic table than the one-tuple-per-page one, I tried doubling a
tenk1 table until it reached 2759MB and then I got a post-prewarm
regression 702ms -> 721ms, and again I can beat master by memory
prefetching: 689ms.

Annoyingly, with the same table I see no difference in the actual
pg_prewarm('giga') time: around 155ms for both. pg_prewarm is able to
use the 'fast path' I made pretty much just to be able to minimise
regression in that (probably stupid) all-cached test that doesn't even
look at the page contents. Unfortunately seq scan can't use it because
it has per-buffer data, which is one of the things the fast path can't
handle (because of a space management issue). Maybe I should try to
find a way to fix that.

So, sequential scan does not have per-buffer data. I did some logging
and the reason most fully-in-SB sequential scans don't use the fast
path is because read_stream->pending_read_nblocks is always 0.

When read_stream->distance stays 1 (expected for all-in-SB, as it
is initialized to 1 and we don't want distance to ramp up),
read_stream_look_ahead() never increments
read_stream->pending_read_nblocks because it sets it to 1 the first
time it is called and then the conditions of the while loop are not
met again:

while (stream->ios_in_progress < stream->max_ios &&
       stream->pinned_buffers + stream->pending_read_nblocks <
       stream->distance)

distance is 1, pending_read_nblocks is 1, thus we only loop once and
don't increment pending_read_nblocks.
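
To make that concrete, here is a toy standalone model (plain C, not the
real read_stream.c code) of just that loop: with distance stuck at 1, the
body runs exactly once, so the pending read can never grow beyond a single
block.

#include <stdio.h>

int
main(void)
{
    int         distance = 1;   /* stays 1 for an all-in-SB scan */
    int         max_ios = 16;
    int         ios_in_progress = 0;
    int         pinned_buffers = 0;
    int         pending_read_nblocks = 0;
    int         iterations = 0;

    while (ios_in_progress < max_ios &&
           pinned_buffers + pending_read_nblocks < distance)
    {
        /* first pass sets the pending read to a single block ... */
        pending_read_nblocks = 1;
        iterations++;
    }

    /* ... and then 0 + 1 < 1 is false, so we never accumulate more */
    printf("iterations = %d, pending_read_nblocks = %d\n",
           iterations, pending_read_nblocks);
    return 0;
}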

pg_prewarm is only able to use the fast path because it passes
READ_STREAM_FULL and thus initializes read_stream->distance to a
higher initial value.

I added some logging to see if any of the sequential scans in the
regression suite used the fast path. The one example I see of the fast
path being used is a temp table IO stats test in
src/test/regress/sql/stats.sql. I didn't check exactly what conditions
led it to do this. But we probably want seq scans which are all in SB
to use the fast path.

- Melanie

#27Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#23)
Re: Streaming read-ready sequential scan code

Hi,

On 2024-04-04 22:37:39 +1300, Thomas Munro wrote:

On Thu, Apr 4, 2024 at 10:31 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Alright what about this?

I think it's probably worth adding a bit more of the commit message to the
function comment. Yes, there's a bit in one of the return branches, but that's
not what you're going to look at when just checking what the function does.

From e610bc78a2e3ecee50bd897e35084469d00bbac5 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 4 Apr 2024 21:11:06 +1300
Subject: [PATCH v2 2/2] Increase default vacuum_buffer_usage_limit to 2MB.

The BAS_VACUUM ring size has been 256kB since commit d526575f. Commit
1cbbee03 made it configurable but retained the traditional default.
The correct default size has been debated for years, but 256kB is
certainly very small. VACUUM soon needs to write back data it dirtied
only 32 blocks ago, which usually requires flushing the WAL. New
experiments in prefetching pages for VACUUM exacerbated the problem by
crashing into dirty data even sooner. Let's make the default 2MB.
That's 1.5% of the default toy buffer pool size, and 0.2% of 1GB, which
would be a considered a small shared_buffers setting for a real system
these days. Users are still free to set the GUC to a different value.

+1. Independent of any other changes, this improves the default vacuum
performance substantially. We might want to dynamically size the default at
some point - but we probably should overhaul the infrastructure first...

Greetings,

Andres Freund

#28Thomas Munro
thomas.munro@gmail.com
In reply to: Melanie Plageman (#26)
Re: Streaming read-ready sequential scan code

On Fri, Apr 5, 2024 at 4:20 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

So, sequential scan does not have per-buffer data. I did some logging
and the reason most fully-in-SB sequential scans don't use the fast
path is because read_stream->pending_read_nblocks is always 0.

Hnghghghgh... right, sorry I guessed the wrong reason, it turns out
that I made a fast path just a little too specialised for pg_prewarm.
Working on it...

#29Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#28)
4 attachment(s)
Re: Streaming read-ready sequential scan code

Yeah, I plead benchmarking myopia, sorry. The fastpath as committed
is only reached when distance goes 2->1, as pg_prewarm does. Oops.
With the attached minor rearrangement, it works fine. I also poked
some more at that memory prefetcher. Here are the numbers I got on a
desktop system (Intel i9-9900 @ 3.1GHz, Linux 6.1, turbo disabled,
cpufreq governor=performance, 2MB huge pages, SB=8GB, consumer NVMe,
GCC -O3).

create table t (i int, filler text) with (fillfactor=10);
insert into t
select g, repeat('x', 900) from generate_series(1, 560000) g;
vacuum freeze t;
set max_parallel_workers_per_gather = 0;

select count(*) from t;

cold = must be read from actual disk (Linux drop_caches)
warm = read from linux page cache
hot = already in pg cache via pg_prewarm

                                  cold    warm   hot
master                            2479ms  886ms  200ms
seqscan                           2498ms  716ms  211ms   <-- regression
seqscan + fastpath                2493ms  711ms  200ms   <-- fixed, I think?
seqscan + memprefetch             2499ms  716ms  182ms
seqscan + fastpath + memprefetch  2505ms  710ms  170ms   <-- \O/

Cold has no difference. That's just my disk demonstrating Linux RA at
128kB (default); random I/O is obviously a more interesting story.
It's consistently a smidgen faster with Linux RA set to 2MB (as in
blockdev --setra 4096 /dev/nvmeXXX), and I believe this effect
probably also increases on fancier faster storage than what I have on
hand:

                                  cold
master                            1775ms
seqscan + fastpath + memprefetch  1700ms

Warm is faster as expected (fewer system calls schlepping data
kernel->userspace).

The interesting column is hot. The 200ms->211ms regression is due to
the extra bookkeeping in the slow path. The rejiggered fastpath code
fixes it for me, or maybe sometimes shows an extra 1ms. Phew. Can
you reproduce that?

The memory prefetching trick, on top of that, seems to be a good
optimisation so far. Note that it's not an entirely independent
trick; it's something we can only do now that we can see into the
future. It's the next level up of prefetching, worth doing around
60ns before you need the data, I guess. Who knows how thrashed the
cache might be before the caller gets around to accessing that page,
but there doesn't seem to be much of a cost or downside to this bet.
We know there are many more opportunities like that[1], but I don't
want to second-guess the AM here; I'm just betting that the caller
is going to look at the header.
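
Concretely, the bet is that the caller's first touch of a newly
returned buffer is its page header, e.g. something like this
hypothetical first access in a page-at-a-time scan:

    Page         page = BufferGetPage(scan->rs_cbuf);
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page); /* reads the header
                                                         * cache line that
                                                         * pg_prefetch_mem()
                                                         * just warmed */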

Unfortunately there seems to be a subtle bug hiding somewhere in here,
visible on macOS on CI. Looking into that, going to find my Mac...

[1]: /messages/by-id/CAApHDvpTRx7hqFZGiZJ=d9JN4h1tzJ2=xt7bM-9XRmpVj63psQ@mail.gmail.com

Attachments:

v10-0001-Use-streaming-I-O-in-heapam-sequential-scan.patchtext/x-patch; charset=US-ASCII; name=v10-0001-Use-streaming-I-O-in-heapam-sequential-scan.patchDownload
From 74b8cde45a8babcec7b52b06bdb8ea046a0a966f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 5 Apr 2024 13:32:14 +1300
Subject: [PATCH v10 1/4] Use streaming I/O in heapam sequential scan.

<commit message goes here>
---
 src/backend/access/heap/heapam.c | 100 +++++++++++++++++++++++++------
 src/include/access/heapam.h      |  15 +++++
 2 files changed, 97 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index dada2ecd1e3..f7946a39fd9 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -223,6 +223,25 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * ----------------------------------------------------------------
  */
 
+static BlockNumber
+heap_scan_stream_read_next(ReadStream *pgsr, void *private_data,
+						   void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) private_data;
+
+	if (unlikely(!scan->rs_inited))
+	{
+		scan->rs_prefetch_block = heapgettup_initial_block(scan, scan->rs_dir);
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_prefetch_block = heapgettup_advance_block(scan,
+														   scan->rs_prefetch_block,
+														   scan->rs_dir);
+
+	return scan->rs_prefetch_block;
+}
+
 /* ----------------
  *		initscan - scan code common to heap_beginscan and heap_rescan
  * ----------------
@@ -325,6 +344,13 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
 
+	/*
+	 * Initialize to ForwardScanDirection because it is most common and heap
+	 * scans usually must go forwards before going backward.
+	 */
+	scan->rs_dir = ForwardScanDirection;
+	scan->rs_prefetch_block = InvalidBlockNumber;
+
 	/* page-at-a-time fields are always invalid when not rs_inited */
 
 	/*
@@ -462,12 +488,14 @@ heap_prepare_pagescan(TableScanDesc sscan)
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
- * Read the next block of the scan relation into a buffer and pin that buffer
- * before saving it in the scan descriptor.
+ * Read the next block of the scan relation from the read stream and pin that
+ * buffer before saving it in the scan descriptor.
  */
 static inline void
 heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 {
+	Assert(scan->rs_read_stream);
+
 	/* release previous scan buffer, if any */
 	if (BufferIsValid(scan->rs_cbuf))
 	{
@@ -482,25 +510,23 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	if (unlikely(!scan->rs_inited))
+	/*
+	 * If the scan direction is changing, reset the prefetch block to the
+	 * current block. Otherwise, we will incorrectly prefetch the blocks
+	 * between the prefetch block and the current block again before
+	 * prefetching blocks in the new, correct scan direction.
+	 */
+	if (unlikely(scan->rs_dir != dir))
 	{
-		scan->rs_cblock = heapgettup_initial_block(scan, dir);
+		scan->rs_prefetch_block = scan->rs_cblock;
+		read_stream_reset(scan->rs_read_stream);
+	}
 
-		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
-		Assert(scan->rs_cblock != InvalidBlockNumber ||
-			   !BufferIsValid(scan->rs_cbuf));
+	scan->rs_dir = dir;
 
-		scan->rs_inited = true;
-	}
-	else
-		scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock,
-												   dir);
-
-	/* read block if valid */
-	if (BlockNumberIsValid(scan->rs_cblock))
-		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
-										   scan->rs_cblock, RBM_NORMAL,
-										   scan->rs_strategy);
+	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
+	if (BufferIsValid(scan->rs_cbuf))
+		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
 }
 
 /*
@@ -833,6 +859,7 @@ continue_page:
 
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
+	scan->rs_prefetch_block = InvalidBlockNumber;
 	tuple->t_data = NULL;
 	scan->rs_inited = false;
 }
@@ -928,6 +955,7 @@ continue_page:
 		ReleaseBuffer(scan->rs_cbuf);
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
+	scan->rs_prefetch_block = InvalidBlockNumber;
 	tuple->t_data = NULL;
 	scan->rs_inited = false;
 }
@@ -1021,6 +1049,28 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 
 	initscan(scan, key, false);
 
+	scan->rs_read_stream = NULL;
+
+	/*
+	 * For sequential scans and TID range scans, we will set up a read stream.
+	 * We do not know the scan direction yet. If the scan does not end up
+	 * being a forward scan, the read stream will be freed. This should be
+	 * done after initscan() because initscan() allocates the
+	 * BufferAccessStrategy object.
+	 */
+	if (scan->rs_base.rs_flags & SO_TYPE_SEQSCAN ||
+		scan->rs_base.rs_flags & SO_TYPE_TIDRANGESCAN)
+	{
+		scan->rs_read_stream = read_stream_begin_relation(READ_STREAM_SEQUENTIAL,
+														  scan->rs_strategy,
+														  scan->rs_base.rs_rd,
+														  MAIN_FORKNUM,
+														  heap_scan_stream_read_next,
+														  scan,
+														  0);
+	}
+
+
 	return (TableScanDesc) scan;
 }
 
@@ -1055,6 +1105,14 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
 
+	/*
+	 * The read stream is reset on rescan. This must be done before
+	 * initscan(), as some state referred to by read_stream_reset() is reset
+	 * in initscan().
+	 */
+	if (scan->rs_read_stream)
+		read_stream_reset(scan->rs_read_stream);
+
 	/*
 	 * reinitialize scan descriptor
 	 */
@@ -1074,6 +1132,12 @@ heap_endscan(TableScanDesc sscan)
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
 
+	/*
+	 * Must free the read stream before freeing the BufferAccessStrategy.
+	 */
+	if (scan->rs_read_stream)
+		read_stream_end(scan->rs_read_stream);
+
 	/*
 	 * decrement relation reference count and free scan descriptor storage
 	 */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 2765efc4e5e..332a7faa8d1 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,6 +25,7 @@
 #include "storage/bufpage.h"
 #include "storage/dsm.h"
 #include "storage/lockdefs.h"
+#include "storage/read_stream.h"
 #include "storage/shm_toc.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -70,6 +71,20 @@ typedef struct HeapScanDescData
 
 	HeapTupleData rs_ctup;		/* current tuple in scan, if any */
 
+	/* For scans that stream reads */
+	ReadStream *rs_read_stream;
+
+	/*
+	 * For sequential scans and TID range scans to stream reads. The read
+	 * stream is allocated at the beginning of the scan and reset on rescan or
+	 * when the scan direction changes. The scan direction is saved each time
+	 * a new page is requested. If the scan direction changes from one page to
+	 * the next, the read stream releases all previously pinned buffers and
+	 * resets the prefetch block.
+	 */
+	ScanDirection rs_dir;
+	BlockNumber rs_prefetch_block;
+
 	/*
 	 * For parallel scans to store page allocation data.  NULL when not
 	 * performing a parallel scan.
-- 
2.39.2

v10-0002-Improve-read_stream.c-s-fast-path.patchtext/x-patch; charset=US-ASCII; name=v10-0002-Improve-read_stream.c-s-fast-path.patchDownload
From 108a3880ca1ff949f53fffeeec4e722a2b13dd57 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 5 Apr 2024 12:08:24 +1300
Subject: [PATCH v10 2/4] Improve read_stream.c's fast path.

Unfortunately the "fast path" for cached scans that don't do any I/O was
coded in a way that could be used by pg_prewarm, but not the proposed
sequential scan patch which starts with a low initial distance due to
flags used to initialize the stream.  Refactor to fix that.
---
 src/backend/storage/aio/read_stream.c | 70 +++++++++++----------------
 1 file changed, 27 insertions(+), 43 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 4f21262ff5e..80b8be8cc7e 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -578,13 +578,12 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 	if (likely(stream->fast_path))
 	{
 		BlockNumber next_blocknum;
-		bool		need_wait;
 
 		/* Fast path assumptions. */
 		Assert(stream->ios_in_progress == 0);
 		Assert(stream->pinned_buffers == 1);
 		Assert(stream->distance == 1);
-		Assert(stream->pending_read_nblocks == 1);
+		Assert(stream->pending_read_nblocks == 0);
 		Assert(stream->per_buffer_data_size == 0);
 
 		/* We're going to return the buffer we pinned last time. */
@@ -594,40 +593,29 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		buffer = stream->buffers[oldest_buffer_index];
 		Assert(buffer != InvalidBuffer);
 
-		/*
-		 * Pin a buffer for the next call.  Same buffer entry, and arbitrary
-		 * I/O entry (they're all free).
-		 */
-		need_wait = StartReadBuffer(&stream->ios[0].op,
-									&stream->buffers[oldest_buffer_index],
-									stream->pending_read_blocknum,
-									stream->advice_enabled ?
-									READ_BUFFERS_ISSUE_ADVICE : 0);
-
-		/* Choose the block the next call will pin. */
+		/* Choose the next block to pin. */
 		if (unlikely(stream->blocknums_next == stream->blocknums_count))
 			read_stream_fill_blocknums(stream);
 		next_blocknum = stream->blocknums[stream->blocknums_next++];
 
-		/*
-		 * Fast return if the next call doesn't require I/O for the buffer we
-		 * just pinned, and we have a block number to give it as a pending
-		 * read.
-		 */
-		if (likely(!need_wait && next_blocknum != InvalidBlockNumber))
+		if (likely(next_blocknum != InvalidBlockNumber))
 		{
-			stream->pending_read_blocknum = next_blocknum;
-			return buffer;
-		}
-
-		/*
-		 * For anything more complex, set up some more state and take the slow
-		 * path next time.
-		 */
-		stream->fast_path = false;
+			/*
+			 * Pin a buffer for the next call.  Same buffer entry, and
+			 * arbitrary I/O entry (they're all free).  We don't have to
+			 * adjust pinned_buffers because we're transferring one to caller
+			 * but pinning one more.
+			 */
+			if (likely(!StartReadBuffer(&stream->ios[0].op,
+										&stream->buffers[oldest_buffer_index],
+										next_blocknum,
+										stream->advice_enabled ?
+										READ_BUFFERS_ISSUE_ADVICE : 0)))
+			{
+				/* Fast return. */
+				return buffer;
+			}
 
-		if (need_wait)
-		{
 			/* Next call must wait for I/O for the newly pinned buffer. */
 			stream->oldest_io_index = 0;
 			stream->next_io_index = stream->max_ios > 1 ? 1 : 0;
@@ -635,17 +623,15 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 			stream->ios[0].buffer_index = oldest_buffer_index;
 			stream->seq_blocknum = next_blocknum + 1;
 		}
-		if (next_blocknum == InvalidBlockNumber)
-		{
-			/* Next call hits end of stream and can't pin anything more. */
-			stream->distance = 0;
-			stream->pending_read_nblocks = 0;
-		}
 		else
 		{
-			/* Set up the pending read. */
-			stream->pending_read_blocknum = next_blocknum;
+			/* No more blocks, end of stream. */
+			stream->distance = 0;
+			stream->oldest_buffer_index = stream->next_buffer_index;
+			stream->pinned_buffers = 0;
 		}
+
+		stream->fast_path = false;
 		return buffer;
 	}
 #endif
@@ -762,15 +748,11 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 	if (stream->ios_in_progress == 0 &&
 		stream->pinned_buffers == 1 &&
 		stream->distance == 1 &&
-		stream->pending_read_nblocks == 1 &&
+		stream->pending_read_nblocks == 0 &&
 		stream->per_buffer_data_size == 0)
 	{
 		stream->fast_path = true;
 	}
-	else
-	{
-		stream->fast_path = false;
-	}
 #endif
 
 	return buffer;
@@ -789,6 +771,7 @@ read_stream_reset(ReadStream *stream)
 
 	/* Stop looking ahead. */
 	stream->distance = 0;
+	stream->fast_path = false;
 
 	/* Unpin anything that wasn't consumed. */
 	while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
@@ -799,6 +782,7 @@ read_stream_reset(ReadStream *stream)
 
 	/* Start off assuming data is cached. */
 	stream->distance = 1;
+	stream->fast_path = false;
 }
 
 /*
-- 
2.39.2

v10-0003-Add-pg_prefetch_mem-macro-to-load-cache-lines.patchtext/x-patch; charset=US-ASCII; name=v10-0003-Add-pg_prefetch_mem-macro-to-load-cache-lines.patchDownload
From 5022674f03046a5815e6a863d696ac7db2814a10 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Mon, 10 Jul 2023 11:22:34 +0200
Subject: [PATCH v10 3/4] Add pg_prefetch_mem() macro to load cache lines.

Initially mapping to GCC, Clang and MSVC builtins.

Discussion: https://postgr.es/m/CAEepm%3D2y9HM9QP%2BHhRZdQ3pU6FShSMyu%3DV1uHXhQ5gG-dketHg%40mail.gmail.com
---
 config/c-compiler.m4       | 17 ++++++++++++++++
 configure                  | 40 ++++++++++++++++++++++++++++++++++++++
 configure.ac               |  3 +++
 meson.build                |  1 +
 src/include/c.h            |  8 ++++++++
 src/include/pg_config.h.in |  3 +++
 6 files changed, 72 insertions(+)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb0..4cc02f97601 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -355,6 +355,23 @@ AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
                    [Define to 1 if your compiler understands $1.])
 fi])# PGAC_CHECK_BUILTIN_FUNC
 
+# PGAC_CHECK_BUILTIN_VOID_FUNC
+# -----------------------
+# Variant for void functions.
+AC_DEFUN([PGAC_CHECK_BUILTIN_VOID_FUNC],
+[AC_CACHE_CHECK(for $1, pgac_cv$1,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([
+void
+call$1($2)
+{
+    $1(x);
+}], [])],
+[pgac_cv$1=yes],
+[pgac_cv$1=no])])
+if test x"${pgac_cv$1}" = xyes ; then
+AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
+                   [Define to 1 if your compiler understands $1.])
+fi])# PGAC_CHECK_BUILTIN_VOID_FUNC
 
 
 # PGAC_CHECK_BUILTIN_FUNC_PTR
diff --git a/configure b/configure
index 36feeafbb23..79b78c33ddc 100755
--- a/configure
+++ b/configure
@@ -15543,6 +15543,46 @@ _ACEOF
 
 fi
 
+# Can we use a built-in to prefetch memory?
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+void
+call__builtin_prefetch(void *x)
+{
+    __builtin_prefetch(x);
+}
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__builtin_prefetch=yes
+else
+  pgac_cv__builtin_prefetch=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch" >&5
+$as_echo "$pgac_cv__builtin_prefetch" >&6; }
+if test x"${pgac_cv__builtin_prefetch}" = xyes ; then
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE__BUILTIN_PREFETCH 1
+_ACEOF
+
+fi
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _LARGEFILE_SOURCE value needed for large files" >&5
diff --git a/configure.ac b/configure.ac
index 57f734879e1..6cd5441ed32 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1781,6 +1781,9 @@ PGAC_CHECK_BUILTIN_FUNC([__builtin_popcount], [unsigned int x])
 # so it needs a different test function.
 PGAC_CHECK_BUILTIN_FUNC_PTR([__builtin_frame_address], [0])
 
+# Can we use a built-in to prefetch memory?
+PGAC_CHECK_BUILTIN_VOID_FUNC([__builtin_prefetch], [void *x])
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 AC_FUNC_FSEEKO
diff --git a/meson.build b/meson.build
index 87437960bc3..210fd4c1cd7 100644
--- a/meson.build
+++ b/meson.build
@@ -1707,6 +1707,7 @@ builtins = [
   'constant_p',
   'frame_address',
   'popcount',
+  'prefetch',
   'unreachable',
 ]
 
diff --git a/src/include/c.h b/src/include/c.h
index cf37e02fe1f..0b7aa3e2924 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -412,6 +412,14 @@ typedef void (*pg_funcptr_t) (void);
 #define HAVE_PRAGMA_GCC_SYSTEM_HEADER	1
 #endif
 
+/* Do we have support for prefetching memory? */
+#if defined(HAVE__BUILTIN_PREFETCH)
+#define pg_prefetch_mem(a) __builtin_prefetch(a)
+#elif defined(_MSC_VER)
+#define pg_prefetch_mem(a) _m_prefetch(a)
+#else
+#define pg_prefetch_mem(a)
+#endif
 
 /* ----------------------------------------------------------------
  *				Section 2:	bool, true, false
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 591e1ca3df6..085abf4c84e 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -543,6 +543,9 @@
 /* Define to 1 if your compiler understands __builtin_popcount. */
 #undef HAVE__BUILTIN_POPCOUNT
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P
 
-- 
2.39.2

v10-0004-Prefetch-page-header-memory-when-streaming-relat.patchtext/x-patch; charset=US-ASCII; name=v10-0004-Prefetch-page-header-memory-when-streaming-relat.patchDownload
From 43fcdbcdaee406b1d0eae71bd4a2b117d3acf416 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 5 Apr 2024 15:06:32 +1300
Subject: [PATCH v10 4/4] Prefetch page header memory when streaming relations.

read_stream.c can always see at least one page ahead of the one the
caller is accessing.  Take the opportunity to prefetch the cache line
that holds the next page's header.  For some scans, that can generate a
decent speedup, though real world results will depend on how much work
the CPU does before it gets around to accessing the next page.
---
 src/backend/storage/aio/read_stream.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 80b8be8cc7e..35ee1fc1043 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -612,7 +612,8 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 										stream->advice_enabled ?
 										READ_BUFFERS_ISSUE_ADVICE : 0)))
 			{
-				/* Fast return. */
+				/* Predict caller will soon access next page's header. */
+				pg_prefetch_mem(BufferGetPage(stream->buffers[oldest_buffer_index]));
 				return buffer;
 			}
 
@@ -743,6 +744,10 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 	/* Prepare for the next call. */
 	read_stream_look_ahead(stream, false);
 
+	/* Predict caller will soon access next page's header. */
+	if (stream->pinned_buffers > 0)
+		pg_prefetch_mem(BufferGetPage(stream->buffers[stream->oldest_buffer_index]));
+
 #ifndef READ_STREAM_DISABLE_FAST_PATH
 	/* See if we can take the fast path for all-cached scans next time. */
 	if (stream->ios_in_progress == 0 &&
-- 
2.39.2

#30Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#27)
Re: Streaming read-ready sequential scan code

On Thu, Apr 4, 2024 at 12:39 PM Andres Freund <andres@anarazel.de> wrote:

On 2024-04-04 22:37:39 +1300, Thomas Munro wrote:

On Thu, Apr 4, 2024 at 10:31 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Alright what about this?

I think it's probably worth adding a bit more of the commit message to the
function comment. Yes, there's a bit in one of the return branches, but that's
not what you're going to look at when just checking what the function does.

Agreed about the comment. I kept thinking that BAS_BULKREAD should
maybe return nbuffers - 1, but I couldn't convince myself why.
Otherwise v2-0001-Allow-BufferAccessStrategy-to-limit-pin-count LGTM.
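
For illustration, the kind of strategy-based limit being discussed
might look roughly like the following (names hypothetical; the
pin-count patch has the real shape):

    /* hypothetical sketch of a per-strategy pin limit */
    static int
    strategy_pin_limit(BufferAccessStrategy strategy)
    {
        if (strategy == NULL)
            return NBuffers;    /* no strategy, no extra cap */

        /* don't let look-ahead pins monopolize or escape the ring */
        return strategy->nbuffers;
    }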

- Melanie

#31Melanie Plageman
melanieplageman@gmail.com
In reply to: Thomas Munro (#29)
4 attachment(s)
Re: Streaming read-ready sequential scan code

On Fri, Apr 5, 2024 at 12:15 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Yeah, I plead benchmarking myopia, sorry. The fastpath as committed
is only reached when distance goes 2->1, as pg_prewarm does. Oops.
With the attached minor rearrangement, it works fine. I also poked
some more at that memory prefetcher. Here are the numbers I got on a
desktop system (Intel i9-9900 @ 3.1GHz, Linux 6.1, turbo disabled,
cpufreq governor=performance, 2MB huge pages, SB=8GB, consumer NVMe,
GCC -O3).

create table t (i int, filler text) with (fillfactor=10);
insert into t
select g, repeat('x', 900) from generate_series(1, 560000) g;
vacuum freeze t;
set max_parallel_workers_per_gather = 0;

select count(*) from t;

cold = must be read from actual disk (Linux drop_caches)
warm = read from linux page cache
hot = already in pg cache via pg_prewarm

                                  cold    warm   hot
master                            2479ms  886ms  200ms
seqscan                           2498ms  716ms  211ms   <-- regression
seqscan + fastpath                2493ms  711ms  200ms   <-- fixed, I think?
seqscan + memprefetch             2499ms  716ms  182ms
seqscan + fastpath + memprefetch  2505ms  710ms  170ms   <-- \O/

Cold has no difference. That's just my disk demonstrating Linux RA at
128kB (default); random I/O is obviously a more interesting story.
It's consistently a smidgen faster with Linux RA set to 2MB (as in
blockdev --setra 4096 /dev/nvmeXXX), and I believe this effect
probably also increases on fancier faster storage than what I have on
hand:

                                  cold
master                            1775ms
seqscan + fastpath + memprefetch  1700ms

Warm is faster as expected (fewer system calls schlepping data
kernel->userspace).

The interesting column is hot. The 200ms->211ms regression is due to
the extra bookkeeping in the slow path. The rejiggered fastpath code
fixes it for me, or maybe sometimes shows an extra 1ms. Phew. Can
you reproduce that?

I am able to reproduce the fast path solving the issue using Heikki's
example here [1] but in shared buffers (hot).

master: 25 ms
stream read: 29 ms
stream read + fast path: 25 ms

I haven't looked into or reviewed the memory prefetching part.

While reviewing 0002, I realized that I don't quite see how
read_stream_get_block() will be used in the fastpath -- which it
claims in its comments.
read_stream_next_buffer() is the only caller of
read_stream_look_ahead()->read_stream_get_block(), and if fast_path is
true, read_stream_next_buffer() always returns before calling
read_stream_look_ahead(). Maybe I am missing something. I see
fast_path uses read_stream_fill_blocknums() to invoke the callback.

Oh, and why does the READ_STREAM_DISABLE_FAST_PATH macro exist?

Otherwise 0002 looks good to me.

I haven't reviewed 0003 or 0004. I attached a new version (v11)
because I noticed an outdated comment in my seq scan streaming read
user patch (0001). The other patches in the set are untouched from
your versions, apart from adding author/reviewer info to the commit
message for 0002.

- Melanie

[1]: /messages/by-id/3b0f3701-addd-4629-9257-cf28e1a6e6a1@iki.fi

Attachments:

v11-0003-Add-pg_prefetch_mem-macro-to-load-cache-lines.patchtext/x-patch; charset=US-ASCII; name=v11-0003-Add-pg_prefetch_mem-macro-to-load-cache-lines.patchDownload
From acbd7172f10857d49d4b5d4afc4efab704b33486 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Mon, 10 Jul 2023 11:22:34 +0200
Subject: [PATCH v11 3/4] Add pg_prefetch_mem() macro to load cache lines.

Initially mapping to GCC, Clang and MSVC builtins.

Discussion: https://postgr.es/m/CAEepm%3D2y9HM9QP%2BHhRZdQ3pU6FShSMyu%3DV1uHXhQ5gG-dketHg%40mail.gmail.com
---
 config/c-compiler.m4       | 17 ++++++++++++++++
 configure                  | 40 ++++++++++++++++++++++++++++++++++++++
 configure.ac               |  3 +++
 meson.build                |  1 +
 src/include/c.h            |  8 ++++++++
 src/include/pg_config.h.in |  3 +++
 6 files changed, 72 insertions(+)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb0..4cc02f97601 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -355,6 +355,23 @@ AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
                    [Define to 1 if your compiler understands $1.])
 fi])# PGAC_CHECK_BUILTIN_FUNC
 
+# PGAC_CHECK_BUILTIN_VOID_FUNC
+# -----------------------
+# Variant for void functions.
+AC_DEFUN([PGAC_CHECK_BUILTIN_VOID_FUNC],
+[AC_CACHE_CHECK(for $1, pgac_cv$1,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([
+void
+call$1($2)
+{
+    $1(x);
+}], [])],
+[pgac_cv$1=yes],
+[pgac_cv$1=no])])
+if test x"${pgac_cv$1}" = xyes ; then
+AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
+                   [Define to 1 if your compiler understands $1.])
+fi])# PGAC_CHECK_BUILTIN_VOID_FUNC
 
 
 # PGAC_CHECK_BUILTIN_FUNC_PTR
diff --git a/configure b/configure
index 36feeafbb23..79b78c33ddc 100755
--- a/configure
+++ b/configure
@@ -15543,6 +15543,46 @@ _ACEOF
 
 fi
 
+# Can we use a built-in to prefetch memory?
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+void
+call__builtin_prefetch(void *x)
+{
+    __builtin_prefetch(x);
+}
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__builtin_prefetch=yes
+else
+  pgac_cv__builtin_prefetch=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch" >&5
+$as_echo "$pgac_cv__builtin_prefetch" >&6; }
+if test x"${pgac_cv__builtin_prefetch}" = xyes ; then
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE__BUILTIN_PREFETCH 1
+_ACEOF
+
+fi
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _LARGEFILE_SOURCE value needed for large files" >&5
diff --git a/configure.ac b/configure.ac
index 57f734879e1..6cd5441ed32 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1781,6 +1781,9 @@ PGAC_CHECK_BUILTIN_FUNC([__builtin_popcount], [unsigned int x])
 # so it needs a different test function.
 PGAC_CHECK_BUILTIN_FUNC_PTR([__builtin_frame_address], [0])
 
+# Can we use a built-in to prefetch memory?
+PGAC_CHECK_BUILTIN_VOID_FUNC([__builtin_prefetch], [void *x])
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 AC_FUNC_FSEEKO
diff --git a/meson.build b/meson.build
index 87437960bc3..210fd4c1cd7 100644
--- a/meson.build
+++ b/meson.build
@@ -1707,6 +1707,7 @@ builtins = [
   'constant_p',
   'frame_address',
   'popcount',
+  'prefetch',
   'unreachable',
 ]
 
diff --git a/src/include/c.h b/src/include/c.h
index cf37e02fe1f..0b7aa3e2924 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -412,6 +412,14 @@ typedef void (*pg_funcptr_t) (void);
 #define HAVE_PRAGMA_GCC_SYSTEM_HEADER	1
 #endif
 
+/* Do we have support for prefetching memory? */
+#if defined(HAVE__BUILTIN_PREFETCH)
+#define pg_prefetch_mem(a) __builtin_prefetch(a)
+#elif defined(_MSC_VER)
+#define pg_prefetch_mem(a) _m_prefetch(a)
+#else
+#define pg_prefetch_mem(a)
+#endif
 
 /* ----------------------------------------------------------------
  *				Section 2:	bool, true, false
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 591e1ca3df6..085abf4c84e 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -543,6 +543,9 @@
 /* Define to 1 if your compiler understands __builtin_popcount. */
 #undef HAVE__BUILTIN_POPCOUNT
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P
 
-- 
2.40.1

v11-0002-Improve-read_stream.c-s-fast-path.patchtext/x-patch; charset=US-ASCII; name=v11-0002-Improve-read_stream.c-s-fast-path.patchDownload
From 8c1cd34f7be9651b1fbe69e3d351c79cbf5b202a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 5 Apr 2024 12:08:24 +1300
Subject: [PATCH v11 2/4] Improve read_stream.c's fast path.

Unfortunately the "fast path" for cached scans that don't do any I/O was
coded in a way that could be used by pg_prewarm, but not the proposed
sequential scan patch which starts with a low initial distance due to
flags used to initialize the stream.  Refactor to fix that.

Author: Thomas Munro
Reviewed-by: Melanie Plageman
Discussion: https://postgr.es/m/flat/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg%3DgEQ%40mail.gmail.com
---
 src/backend/storage/aio/read_stream.c | 70 +++++++++++----------------
 1 file changed, 27 insertions(+), 43 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 4f21262ff5e..80b8be8cc7e 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -578,13 +578,12 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 	if (likely(stream->fast_path))
 	{
 		BlockNumber next_blocknum;
-		bool		need_wait;
 
 		/* Fast path assumptions. */
 		Assert(stream->ios_in_progress == 0);
 		Assert(stream->pinned_buffers == 1);
 		Assert(stream->distance == 1);
-		Assert(stream->pending_read_nblocks == 1);
+		Assert(stream->pending_read_nblocks == 0);
 		Assert(stream->per_buffer_data_size == 0);
 
 		/* We're going to return the buffer we pinned last time. */
@@ -594,40 +593,29 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		buffer = stream->buffers[oldest_buffer_index];
 		Assert(buffer != InvalidBuffer);
 
-		/*
-		 * Pin a buffer for the next call.  Same buffer entry, and arbitrary
-		 * I/O entry (they're all free).
-		 */
-		need_wait = StartReadBuffer(&stream->ios[0].op,
-									&stream->buffers[oldest_buffer_index],
-									stream->pending_read_blocknum,
-									stream->advice_enabled ?
-									READ_BUFFERS_ISSUE_ADVICE : 0);
-
-		/* Choose the block the next call will pin. */
+		/* Choose the next block to pin. */
 		if (unlikely(stream->blocknums_next == stream->blocknums_count))
 			read_stream_fill_blocknums(stream);
 		next_blocknum = stream->blocknums[stream->blocknums_next++];
 
-		/*
-		 * Fast return if the next call doesn't require I/O for the buffer we
-		 * just pinned, and we have a block number to give it as a pending
-		 * read.
-		 */
-		if (likely(!need_wait && next_blocknum != InvalidBlockNumber))
+		if (likely(next_blocknum != InvalidBlockNumber))
 		{
-			stream->pending_read_blocknum = next_blocknum;
-			return buffer;
-		}
-
-		/*
-		 * For anything more complex, set up some more state and take the slow
-		 * path next time.
-		 */
-		stream->fast_path = false;
+			/*
+			 * Pin a buffer for the next call.  Same buffer entry, and
+			 * arbitrary I/O entry (they're all free).  We don't have to
+			 * adjust pinned_buffers because we're transferring one to caller
+			 * but pinning one more.
+			 */
+			if (likely(!StartReadBuffer(&stream->ios[0].op,
+										&stream->buffers[oldest_buffer_index],
+										next_blocknum,
+										stream->advice_enabled ?
+										READ_BUFFERS_ISSUE_ADVICE : 0)))
+			{
+				/* Fast return. */
+				return buffer;
+			}
 
-		if (need_wait)
-		{
 			/* Next call must wait for I/O for the newly pinned buffer. */
 			stream->oldest_io_index = 0;
 			stream->next_io_index = stream->max_ios > 1 ? 1 : 0;
@@ -635,17 +623,15 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 			stream->ios[0].buffer_index = oldest_buffer_index;
 			stream->seq_blocknum = next_blocknum + 1;
 		}
-		if (next_blocknum == InvalidBlockNumber)
-		{
-			/* Next call hits end of stream and can't pin anything more. */
-			stream->distance = 0;
-			stream->pending_read_nblocks = 0;
-		}
 		else
 		{
-			/* Set up the pending read. */
-			stream->pending_read_blocknum = next_blocknum;
+			/* No more blocks, end of stream. */
+			stream->distance = 0;
+			stream->oldest_buffer_index = stream->next_buffer_index;
+			stream->pinned_buffers = 0;
 		}
+
+		stream->fast_path = false;
 		return buffer;
 	}
 #endif
@@ -762,15 +748,11 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 	if (stream->ios_in_progress == 0 &&
 		stream->pinned_buffers == 1 &&
 		stream->distance == 1 &&
-		stream->pending_read_nblocks == 1 &&
+		stream->pending_read_nblocks == 0 &&
 		stream->per_buffer_data_size == 0)
 	{
 		stream->fast_path = true;
 	}
-	else
-	{
-		stream->fast_path = false;
-	}
 #endif
 
 	return buffer;
@@ -789,6 +771,7 @@ read_stream_reset(ReadStream *stream)
 
 	/* Stop looking ahead. */
 	stream->distance = 0;
+	stream->fast_path = false;
 
 	/* Unpin anything that wasn't consumed. */
 	while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
@@ -799,6 +782,7 @@ read_stream_reset(ReadStream *stream)
 
 	/* Start off assuming data is cached. */
 	stream->distance = 1;
+	stream->fast_path = false;
 }
 
 /*
-- 
2.40.1

v11-0001-Use-streaming-IO-in-heapam-sequential-and-TID-ra.patchtext/x-patch; charset=US-ASCII; name=v11-0001-Use-streaming-IO-in-heapam-sequential-and-TID-ra.patchDownload
From 47eb8392b8cf75a4e21dc50414af4857a5eade56 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 5 Apr 2024 13:32:14 +1300
Subject: [PATCH v11 1/4] Use streaming IO in heapam sequential and TID range
 scans

Instead of calling ReadBuffer() for each block heap sequential scans and
TID range scans now use the streaming read API introduced in b5a9b18cd0.

Author: Melanie Plageman
Reviewed-by: Thomas Munro
Discussion: https://postgr.es/m/flat/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg%3DgEQ%40mail.gmail.com
---
 src/backend/access/heap/heapam.c | 98 ++++++++++++++++++++++++++------
 src/include/access/heapam.h      | 15 +++++
 2 files changed, 95 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index dada2ecd1e3..bafe023bce8 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -223,6 +223,25 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * ----------------------------------------------------------------
  */
 
+static BlockNumber
+heap_scan_stream_read_next(ReadStream *pgsr, void *private_data,
+						   void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) private_data;
+
+	if (unlikely(!scan->rs_inited))
+	{
+		scan->rs_prefetch_block = heapgettup_initial_block(scan, scan->rs_dir);
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_prefetch_block = heapgettup_advance_block(scan,
+														   scan->rs_prefetch_block,
+														   scan->rs_dir);
+
+	return scan->rs_prefetch_block;
+}
+
 /* ----------------
  *		initscan - scan code common to heap_beginscan and heap_rescan
  * ----------------
@@ -325,6 +344,13 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
 
+	/*
+	 * Initialize to ForwardScanDirection because it is most common and heap
+	 * scans usually must go forwards before going backward.
+	 */
+	scan->rs_dir = ForwardScanDirection;
+	scan->rs_prefetch_block = InvalidBlockNumber;
+
 	/* page-at-a-time fields are always invalid when not rs_inited */
 
 	/*
@@ -462,12 +488,14 @@ heap_prepare_pagescan(TableScanDesc sscan)
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
- * Read the next block of the scan relation into a buffer and pin that buffer
- * before saving it in the scan descriptor.
+ * Read the next block of the scan relation from the read stream and pin that
+ * buffer before saving it in the scan descriptor.
  */
 static inline void
 heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 {
+	Assert(scan->rs_read_stream);
+
 	/* release previous scan buffer, if any */
 	if (BufferIsValid(scan->rs_cbuf))
 	{
@@ -482,25 +510,23 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	if (unlikely(!scan->rs_inited))
+	/*
+	 * If the scan direction is changing, reset the prefetch block to the
+	 * current block. Otherwise, we will incorrectly prefetch the blocks
+	 * between the prefetch block and the current block again before
+	 * prefetching blocks in the new, correct scan direction.
+	 */
+	if (unlikely(scan->rs_dir != dir))
 	{
-		scan->rs_cblock = heapgettup_initial_block(scan, dir);
+		scan->rs_prefetch_block = scan->rs_cblock;
+		read_stream_reset(scan->rs_read_stream);
+	}
 
-		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
-		Assert(scan->rs_cblock != InvalidBlockNumber ||
-			   !BufferIsValid(scan->rs_cbuf));
+	scan->rs_dir = dir;
 
-		scan->rs_inited = true;
-	}
-	else
-		scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock,
-												   dir);
-
-	/* read block if valid */
-	if (BlockNumberIsValid(scan->rs_cblock))
-		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
-										   scan->rs_cblock, RBM_NORMAL,
-										   scan->rs_strategy);
+	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
+	if (BufferIsValid(scan->rs_cbuf))
+		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
 }
 
 /*
@@ -833,6 +859,7 @@ continue_page:
 
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
+	scan->rs_prefetch_block = InvalidBlockNumber;
 	tuple->t_data = NULL;
 	scan->rs_inited = false;
 }
@@ -928,6 +955,7 @@ continue_page:
 		ReleaseBuffer(scan->rs_cbuf);
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
+	scan->rs_prefetch_block = InvalidBlockNumber;
 	tuple->t_data = NULL;
 	scan->rs_inited = false;
 }
@@ -1021,6 +1049,26 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 
 	initscan(scan, key, false);
 
+	scan->rs_read_stream = NULL;
+
+	/*
+	 * Set up a read stream for sequential scans and TID range scans. This
+	 * should be done after initscan() because initscan() allocates the
+	 * BufferAccessStrategy object passed to the streaming read API.
+	 */
+	if (scan->rs_base.rs_flags & SO_TYPE_SEQSCAN ||
+		scan->rs_base.rs_flags & SO_TYPE_TIDRANGESCAN)
+	{
+		scan->rs_read_stream = read_stream_begin_relation(READ_STREAM_SEQUENTIAL,
+														  scan->rs_strategy,
+														  scan->rs_base.rs_rd,
+														  MAIN_FORKNUM,
+														  heap_scan_stream_read_next,
+														  scan,
+														  0);
+	}
+
+
 	return (TableScanDesc) scan;
 }
 
@@ -1055,6 +1103,14 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
 
+	/*
+	 * The read stream is reset on rescan. This must be done before
+	 * initscan(), as some state referred to by read_stream_reset() is reset
+	 * in initscan().
+	 */
+	if (scan->rs_read_stream)
+		read_stream_reset(scan->rs_read_stream);
+
 	/*
 	 * reinitialize scan descriptor
 	 */
@@ -1074,6 +1130,12 @@ heap_endscan(TableScanDesc sscan)
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
 
+	/*
+	 * Must free the read stream before freeing the BufferAccessStrategy.
+	 */
+	if (scan->rs_read_stream)
+		read_stream_end(scan->rs_read_stream);
+
 	/*
 	 * decrement relation reference count and free scan descriptor storage
 	 */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 2765efc4e5e..332a7faa8d1 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,6 +25,7 @@
 #include "storage/bufpage.h"
 #include "storage/dsm.h"
 #include "storage/lockdefs.h"
+#include "storage/read_stream.h"
 #include "storage/shm_toc.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -70,6 +71,20 @@ typedef struct HeapScanDescData
 
 	HeapTupleData rs_ctup;		/* current tuple in scan, if any */
 
+	/* For scans that stream reads */
+	ReadStream *rs_read_stream;
+
+	/*
+	 * For sequential scans and TID range scans to stream reads. The read
+	 * stream is allocated at the beginning of the scan and reset on rescan or
+	 * when the scan direction changes. The scan direction is saved each time
+	 * a new page is requested. If the scan direction changes from one page to
+	 * the next, the read stream releases all previously pinned buffers and
+	 * resets the prefetch block.
+	 */
+	ScanDirection rs_dir;
+	BlockNumber rs_prefetch_block;
+
 	/*
 	 * For parallel scans to store page allocation data.  NULL when not
 	 * performing a parallel scan.
-- 
2.40.1

v11-0004-Prefetch-page-header-memory-when-streaming-relat.patchtext/x-patch; charset=US-ASCII; name=v11-0004-Prefetch-page-header-memory-when-streaming-relat.patchDownload
From 3840f869b7720c1458cadb99cec0d54e72aef9a9 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 5 Apr 2024 15:06:32 +1300
Subject: [PATCH v11 4/4] Prefetch page header memory when streaming relations.

read_stream.c can always see at least one page ahead of the one the
caller is accessing.  Take the opportunity to prefetch the cache line
that holds the next page's header.  For some scans, that can generate a
decent speedup, though real world results will depend on how much work
the CPU does before it gets around to accessing the next page.
---
 src/backend/storage/aio/read_stream.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 80b8be8cc7e..35ee1fc1043 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -612,7 +612,8 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 										stream->advice_enabled ?
 										READ_BUFFERS_ISSUE_ADVICE : 0)))
 			{
-				/* Fast return. */
+				/* Predict caller will soon access next page's header. */
+				pg_prefetch_mem(BufferGetPage(stream->buffers[oldest_buffer_index]));
 				return buffer;
 			}
 
@@ -743,6 +744,10 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 	/* Prepare for the next call. */
 	read_stream_look_ahead(stream, false);
 
+	/* Predict caller will soon access next page's header. */
+	if (stream->pinned_buffers > 0)
+		pg_prefetch_mem(BufferGetPage(stream->buffers[stream->oldest_buffer_index]));
+
 #ifndef READ_STREAM_DISABLE_FAST_PATH
 	/* See if we can take the fast path for all-cached scans next time. */
 	if (stream->ios_in_progress == 0 &&
-- 
2.40.1

#32Thomas Munro
thomas.munro@gmail.com
In reply to: Melanie Plageman (#31)
6 attachment(s)
Re: Streaming read-ready sequential scan code

On Sat, Apr 6, 2024 at 6:55 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Fri, Apr 5, 2024 at 12:15 AM Thomas Munro <thomas.munro@gmail.com> wrote:

The interesting column is hot. The 200ms->211ms regression is due to
the extra bookkeeping in the slow path. The rejiggered fastpath code
fixes it for me, or maybe sometimes shows an extra 1ms. Phew. Can
you reproduce that?

I am able to reproduce the fast path solving the issue using Heikki's
example here [1] but in shared buffers (hot).

master: 25 ms
stream read: 29 ms
stream read + fast path: 25 ms

Great, thanks.

I haven't looked into or reviewed the memory prefetching part.

While reviewing 0002, I realized that I don't quite see how
read_stream_get_block() will be used in the fastpath -- which it
claims in its comments.

Comments improved.

Oh, and why does the READ_STREAM_DISABLE_FAST_PATH macro exist?

Just for testing purposes. Behaviour should be identical to external
observers either way; it's just a hand-rolled specialisation for
certain parameters, and it's useful to be able to verify that and
measure the speedup. I think it's OK to leave a bit of
developer/testing scaffolding in the tree if it's likely to be useful
again, and especially if, as in this case, it doesn't create any dead
code. (Perhaps in later work we might find the right way to get the
compiler to do the specialisation? It's so simple though.)
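
For anyone wanting to force the slow path everywhere while testing,
it is just a compile-time switch; defining the macro near the top of
read_stream.c (before the #ifndef checks) is enough, e.g.:

    /* build-time scaffolding: skip the specialised all-cached fast path */
    #define READ_STREAM_DISABLE_FAST_PATH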

The occasional CI problem I mentioned turned out to be
read_stream_reset() remembering a little too much state across
rescans. Fixed.

Thanks both for the feedback on the ring buffer tweaks. Comments
updated. Here is the full stack of patches I would like to commit
very soon, though I may leave the memory prefetching part out for a
bit longer to see if I can find any downside.

Attachments:

v12-0001-Improve-read_stream.c-s-fast-path.patchtext/x-patch; charset=US-ASCII; name=v12-0001-Improve-read_stream.c-s-fast-path.patchDownload
From 54efd755040507b55672092907d53b4db30f5a06 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 5 Apr 2024 12:08:24 +1300
Subject: [PATCH v12 1/6] Improve read_stream.c's fast path.

Unfortunately the "fast path" for cached scans that don't do any I/O was
accidentally coded in a way that could only be triggered by pg_prewarm's
usage patter.  We really need it to work for the streaming sequential
scan patch.  Refactor.

Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGKXZALJ%3D6aArUsXRJzBm%3Dqvc4AWp7%3DiJNXJQqpbRLnD_w%40mail.gmail.com
---
 src/backend/storage/aio/read_stream.c | 75 +++++++++++----------------
 1 file changed, 31 insertions(+), 44 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 4f21262ff5e..b9e11a28312 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -169,7 +169,7 @@ get_per_buffer_data(ReadStream *stream, int16 buffer_index)
 /*
  * Ask the callback which block it would like us to read next, with a small
  * buffer in front to allow read_stream_unget_block() to work and to allow the
- * fast path to work in batches.
+ * fast path to skip this function and work directly from the array.
  */
 static inline BlockNumber
 read_stream_get_block(ReadStream *stream, void *per_buffer_data)
@@ -578,13 +578,12 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 	if (likely(stream->fast_path))
 	{
 		BlockNumber next_blocknum;
-		bool		need_wait;
 
 		/* Fast path assumptions. */
 		Assert(stream->ios_in_progress == 0);
 		Assert(stream->pinned_buffers == 1);
 		Assert(stream->distance == 1);
-		Assert(stream->pending_read_nblocks == 1);
+		Assert(stream->pending_read_nblocks == 0);
 		Assert(stream->per_buffer_data_size == 0);
 
 		/* We're going to return the buffer we pinned last time. */
@@ -594,40 +593,29 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		buffer = stream->buffers[oldest_buffer_index];
 		Assert(buffer != InvalidBuffer);
 
-		/*
-		 * Pin a buffer for the next call.  Same buffer entry, and arbitrary
-		 * I/O entry (they're all free).
-		 */
-		need_wait = StartReadBuffer(&stream->ios[0].op,
-									&stream->buffers[oldest_buffer_index],
-									stream->pending_read_blocknum,
-									stream->advice_enabled ?
-									READ_BUFFERS_ISSUE_ADVICE : 0);
-
-		/* Choose the block the next call will pin. */
+		/* Choose the next block to pin. */
 		if (unlikely(stream->blocknums_next == stream->blocknums_count))
 			read_stream_fill_blocknums(stream);
 		next_blocknum = stream->blocknums[stream->blocknums_next++];
 
-		/*
-		 * Fast return if the next call doesn't require I/O for the buffer we
-		 * just pinned, and we have a block number to give it as a pending
-		 * read.
-		 */
-		if (likely(!need_wait && next_blocknum != InvalidBlockNumber))
+		if (likely(next_blocknum != InvalidBlockNumber))
 		{
-			stream->pending_read_blocknum = next_blocknum;
-			return buffer;
-		}
-
-		/*
-		 * For anything more complex, set up some more state and take the slow
-		 * path next time.
-		 */
-		stream->fast_path = false;
+			/*
+			 * Pin a buffer for the next call.  Same buffer entry, and
+			 * arbitrary I/O entry (they're all free).  We don't have to
+			 * adjust pinned_buffers because we're transferring one to caller
+			 * but pinning one more.
+			 */
+			if (likely(!StartReadBuffer(&stream->ios[0].op,
+										&stream->buffers[oldest_buffer_index],
+										next_blocknum,
+										stream->advice_enabled ?
+										READ_BUFFERS_ISSUE_ADVICE : 0)))
+			{
+				/* Fast return. */
+				return buffer;
+			}
 
-		if (need_wait)
-		{
 			/* Next call must wait for I/O for the newly pinned buffer. */
 			stream->oldest_io_index = 0;
 			stream->next_io_index = stream->max_ios > 1 ? 1 : 0;
@@ -635,17 +623,15 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 			stream->ios[0].buffer_index = oldest_buffer_index;
 			stream->seq_blocknum = next_blocknum + 1;
 		}
-		if (next_blocknum == InvalidBlockNumber)
-		{
-			/* Next call hits end of stream and can't pin anything more. */
-			stream->distance = 0;
-			stream->pending_read_nblocks = 0;
-		}
 		else
 		{
-			/* Set up the pending read. */
-			stream->pending_read_blocknum = next_blocknum;
+			/* No more blocks, end of stream. */
+			stream->distance = 0;
+			stream->oldest_buffer_index = stream->next_buffer_index;
+			stream->pinned_buffers = 0;
 		}
+
+		stream->fast_path = false;
 		return buffer;
 	}
 #endif
@@ -762,15 +748,11 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 	if (stream->ios_in_progress == 0 &&
 		stream->pinned_buffers == 1 &&
 		stream->distance == 1 &&
-		stream->pending_read_nblocks == 1 &&
+		stream->pending_read_nblocks == 0 &&
 		stream->per_buffer_data_size == 0)
 	{
 		stream->fast_path = true;
 	}
-	else
-	{
-		stream->fast_path = false;
-	}
 #endif
 
 	return buffer;
@@ -790,6 +772,11 @@ read_stream_reset(ReadStream *stream)
 	/* Stop looking ahead. */
 	stream->distance = 0;
 
+	/* Forget buffered block numbers and fast path state. */
+	stream->blocknums_next = 0;
+	stream->blocknums_count = 0;
+	stream->fast_path = false;
+
 	/* Unpin anything that wasn't consumed. */
 	while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
 		ReleaseBuffer(buffer);
-- 
2.44.0

v12-0002-Add-pg_prefetch_mem-macro-to-load-cache-lines.patchtext/x-patch; charset=US-ASCII; name=v12-0002-Add-pg_prefetch_mem-macro-to-load-cache-lines.patchDownload
From 2e5146cb4aa4d5307836129ed6ef74076ecb9b74 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Mon, 10 Jul 2023 11:22:34 +0200
Subject: [PATCH v12 2/6] Add pg_prefetch_mem() macro to load cache lines.

Initially mapping to GCC/Clang and MSVC builtins.

Discussion: https://postgr.es/m/CA%2BhUKGKXZALJ%3D6aArUsXRJzBm%3Dqvc4AWp7%3DiJNXJQqpbRLnD_w%40mail.gmail.com
Discussion: https://postgr.es/m/CAEepm%3D2y9HM9QP%2BHhRZdQ3pU6FShSMyu%3DV1uHXhQ5gG-dketHg%40mail.gmail.com
---
 config/c-compiler.m4       | 17 ++++++++++++++++
 configure                  | 40 ++++++++++++++++++++++++++++++++++++++
 configure.ac               |  3 +++
 meson.build                |  1 +
 src/include/c.h            |  8 ++++++++
 src/include/pg_config.h.in |  3 +++
 6 files changed, 72 insertions(+)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 3268a780bb0..4cc02f97601 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -355,6 +355,23 @@ AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
                    [Define to 1 if your compiler understands $1.])
 fi])# PGAC_CHECK_BUILTIN_FUNC
 
+# PGAC_CHECK_BUILTIN_VOID_FUNC
+# -----------------------
+# Variant for void functions.
+AC_DEFUN([PGAC_CHECK_BUILTIN_VOID_FUNC],
+[AC_CACHE_CHECK(for $1, pgac_cv$1,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([
+void
+call$1($2)
+{
+    $1(x);
+}], [])],
+[pgac_cv$1=yes],
+[pgac_cv$1=no])])
+if test x"${pgac_cv$1}" = xyes ; then
+AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
+                   [Define to 1 if your compiler understands $1.])
+fi])# PGAC_CHECK_BUILTIN_VOID_FUNC
 
 
 # PGAC_CHECK_BUILTIN_FUNC_PTR
diff --git a/configure b/configure
index 36feeafbb23..79b78c33ddc 100755
--- a/configure
+++ b/configure
@@ -15543,6 +15543,46 @@ _ACEOF
 
 fi
 
+# Can we use a built-in to prefetch memory?
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+void
+call__builtin_prefetch(void *x)
+{
+    __builtin_prefetch(x);
+}
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__builtin_prefetch=yes
+else
+  pgac_cv__builtin_prefetch=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch" >&5
+$as_echo "$pgac_cv__builtin_prefetch" >&6; }
+if test x"${pgac_cv__builtin_prefetch}" = xyes ; then
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE__BUILTIN_PREFETCH 1
+_ACEOF
+
+fi
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _LARGEFILE_SOURCE value needed for large files" >&5
diff --git a/configure.ac b/configure.ac
index 57f734879e1..6cd5441ed32 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1781,6 +1781,9 @@ PGAC_CHECK_BUILTIN_FUNC([__builtin_popcount], [unsigned int x])
 # so it needs a different test function.
 PGAC_CHECK_BUILTIN_FUNC_PTR([__builtin_frame_address], [0])
 
+# Can we use a built-in to prefetch memory?
+PGAC_CHECK_BUILTIN_VOID_FUNC([__builtin_prefetch], [void *x])
+
 # We require 64-bit fseeko() to be available, but run this check anyway
 # in case it finds that _LARGEFILE_SOURCE has to be #define'd for that.
 AC_FUNC_FSEEKO
diff --git a/meson.build b/meson.build
index 87437960bc3..210fd4c1cd7 100644
--- a/meson.build
+++ b/meson.build
@@ -1707,6 +1707,7 @@ builtins = [
   'constant_p',
   'frame_address',
   'popcount',
+  'prefetch',
   'unreachable',
 ]
 
diff --git a/src/include/c.h b/src/include/c.h
index cf37e02fe1f..0b7aa3e2924 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -412,6 +412,14 @@ typedef void (*pg_funcptr_t) (void);
 #define HAVE_PRAGMA_GCC_SYSTEM_HEADER	1
 #endif
 
+/* Do we have support for prefetching memory? */
+#if defined(HAVE__BUILTIN_PREFETCH)
+#define pg_prefetch_mem(a) __builtin_prefetch(a)
+#elif defined(_MSC_VER)
+#define pg_prefetch_mem(a) _m_prefetch(a)
+#else
+#define pg_prefetch_mem(a)
+#endif
 
 /* ----------------------------------------------------------------
  *				Section 2:	bool, true, false
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 591e1ca3df6..085abf4c84e 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -543,6 +543,9 @@
 /* Define to 1 if your compiler understands __builtin_popcount. */
 #undef HAVE__BUILTIN_POPCOUNT
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P
 
-- 
2.44.0

v12-0003-Prefetch-page-header-memory-when-streaming-relat.patchtext/x-patch; charset=US-ASCII; name=v12-0003-Prefetch-page-header-memory-when-streaming-relat.patchDownload
From 0e9d86010581a50bfe71644a8f8eac087aba4293 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 5 Apr 2024 15:06:32 +1300
Subject: [PATCH v12 3/6] Prefetch page header memory when streaming relations.

read_stream.c can always see at least one page ahead of the one the
caller is accessing.  Take the opportunity to prefetch the cache line
that holds the next page's header.  For some scans, that can generate a
decent speedup, though real world results will depend on how much work
the CPU actually does before it gets around to accessing the next page.

Discussion: https://postgr.es/m/CA%2BhUKGKXZALJ%3D6aArUsXRJzBm%3Dqvc4AWp7%3DiJNXJQqpbRLnD_w%40mail.gmail.com
---
 src/backend/storage/aio/read_stream.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index b9e11a28312..e5b64238f21 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -612,7 +612,8 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 										stream->advice_enabled ?
 										READ_BUFFERS_ISSUE_ADVICE : 0)))
 			{
-				/* Fast return. */
+				/* Predict caller will soon access next page's header. */
+				pg_prefetch_mem(BufferGetPage(stream->buffers[oldest_buffer_index]));
 				return buffer;
 			}
 
@@ -743,6 +744,10 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 	/* Prepare for the next call. */
 	read_stream_look_ahead(stream, false);
 
+	/* Predict caller will soon access next page's header. */
+	if (stream->pinned_buffers > 0)
+		pg_prefetch_mem(BufferGetPage(stream->buffers[stream->oldest_buffer_index]));
+
 #ifndef READ_STREAM_DISABLE_FAST_PATH
 	/* See if we can take the fast path for all-cached scans next time. */
 	if (stream->ios_in_progress == 0 &&
-- 
2.44.0

v12-0004-Allow-BufferAccessStrategy-to-limit-pin-count.patchtext/x-patch; charset=US-ASCII; name=v12-0004-Allow-BufferAccessStrategy-to-limit-pin-count.patchDownload
From 1d88127b78b11e7a1fcb9d4e5a21227ada32d262 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 4 Apr 2024 20:31:26 +1300
Subject: [PATCH v12 4/6] Allow BufferAccessStrategy to limit pin count.

When pinning extra buffers to look ahead, users of a strategy are in
danger of pinning a lot of the buffers in the ring, or even more than
the ring size.  For some strategies, that means "escaping" from the
ring, and in others it means forcing dirty data to disk very frequently
with associated WAL flushing.  Since external code has no insight into
any of that, allow individual strategy types to expose a clamp that
should be applied when deciding how many buffers to pin at once.

Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_aJXnqsyZt6HwFLnxYEBgE17oypkxbKbT1t1geE_wvH2Q%40mail.gmail.com
---
 src/backend/storage/aio/read_stream.c |  5 ++++
 src/backend/storage/buffer/freelist.c | 42 +++++++++++++++++++++++++++
 src/include/storage/bufmgr.h          |  1 +
 3 files changed, 48 insertions(+)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index e5b64238f21..164049f1474 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -419,6 +419,7 @@ read_stream_begin_relation(int flags,
 	size_t		size;
 	int16		queue_size;
 	int16		max_ios;
+	int			strategy_pin_limit;
 	uint32		max_pinned_buffers;
 	Oid			tablespace_id;
 	SMgrRelation smgr;
@@ -460,6 +461,10 @@ read_stream_begin_relation(int flags,
 	max_pinned_buffers = Min(max_pinned_buffers,
 							 PG_INT16_MAX - io_combine_limit - 1);
 
+	/* Give the strategy a chance to limit the number of buffers we pin. */
+	strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
+	max_pinned_buffers = Min(strategy_pin_limit, max_pinned_buffers);
+
 	/* Don't allow this backend to pin more than its share of buffers. */
 	if (SmgrIsTemp(smgr))
 		LimitAdditionalLocalPins(&max_pinned_buffers);
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 3611357fa30..d8096d2ee4f 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -629,6 +629,48 @@ GetAccessStrategyBufferCount(BufferAccessStrategy strategy)
 	return strategy->nbuffers;
 }
 
+/*
+ * GetAccessStrategyPinLimit -- get cap of number of buffers that should be pinned
+ *
+ * When pinning extra buffers to look ahead, users of a ring-based strategy are
+ * in danger of pinning too much of the ring at once while performing look-ahead.
+ * For some strategies, that means "escaping" from the ring, and in others it
+ * means forcing dirty data to disk very frequently with associated WAL
+ * flushing.  Since external code has no insight into any of that, allow
+ * individual strategy types to expose a clamp that should be applied when
+ * deciding on a maximum number of buffers to pin at once.
+ *
+ * Callers should combine this number with other relevant limits and take the
+ * minimum.
+ */
+int
+GetAccessStrategyPinLimit(BufferAccessStrategy strategy)
+{
+	if (strategy == NULL)
+		return NBuffers;
+
+	switch (strategy->btype)
+	{
+		case BAS_BULKREAD:
+
+			/*
+			 * Since BAS_BULKREAD uses StrategyRejectBuffer(), dirty buffers
+			 * shouldn't be a problem and the caller is free to pin up to the
+			 * entire ring at once.
+			 */
+			return strategy->nbuffers;
+
+		default:
+
+			/*
+			 * Tell call not to pin more than half the buffers in the ring.
+			 * This is a trade-off between look ahead distance and deferring
+			 * writeback and associated WAL traffic.
+			 */
+			return strategy->nbuffers / 2;
+	}
+}
+
 /*
  * FreeAccessStrategy -- release a BufferAccessStrategy object
  *
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index f380f9d9a6c..07ba1a60502 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -318,6 +318,7 @@ extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype);
 extern BufferAccessStrategy GetAccessStrategyWithSize(BufferAccessStrategyType btype,
 													  int ring_size_kb);
 extern int	GetAccessStrategyBufferCount(BufferAccessStrategy strategy);
+extern int	GetAccessStrategyPinLimit(BufferAccessStrategy strategy);
 
 extern void FreeAccessStrategy(BufferAccessStrategy strategy);
 
-- 
2.44.0

v12-0005-Increase-default-vacuum_buffer_usage_limit-to-2M.patchtext/x-patch; charset=US-ASCII; name=v12-0005-Increase-default-vacuum_buffer_usage_limit-to-2M.patchDownload
From d51ebdc7b6137f7d9ae9a48f2b806768b9d91dea Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Thu, 4 Apr 2024 21:11:06 +1300
Subject: [PATCH v12 5/6] Increase default vacuum_buffer_usage_limit to 2MB.

The BAS_VACUUM ring size has been 256kB since commit d526575f introduced
the mechanism 17 years ago.  Commit 1cbbee03 recently made it
configurable but retained the traditional default.  The correct default
size has been debated for years, but 256kB is certainly very small.
VACUUM soon needs to write back data it dirtied only 32 blocks ago,
which usually requires flushing the WAL.  New experiments in prefetching
pages for VACUUM exacerbated the problem by crashing into dirty data
even sooner.  Let's make the default 2MB.  That's 1.6% of the default
toy buffer pool size, and 0.2% of 1GB, which would be considered a
small shared_buffers setting for a real system these days.  Users are
still free to set the GUC to a different value.

Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20240403221257.md4gfki3z75cdyf6%40awork3.anarazel.de
Discussion: https://postgr.es/m/CA%2BhUKGLY4Q4ZY4f1rvnFtv6%2BPkjNf8MejdPkcju3Qii9DYqqcQ%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      | 2 +-
 src/backend/storage/buffer/freelist.c         | 2 +-
 src/backend/utils/init/globals.c              | 2 +-
 src/backend/utils/misc/guc_tables.c           | 2 +-
 src/backend/utils/misc/postgresql.conf.sample | 2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 624518e0b01..d8e1282e128 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1961,7 +1961,7 @@ include_dir 'conf.d'
         valid sizes range from <literal>128 kB</literal> to
         <literal>16 GB</literal>.  If the specified size would exceed 1/8 the
         size of <varname>shared_buffers</varname>, the size is silently capped
-        to that value.  The default value is <literal>256 kB</literal>.  If
+        to that value.  The default value is <literal>2MB</literal>.  If
         this value is specified without units, it is taken as kilobytes.  This
         parameter can be set at any time.  It can be overridden for
         <xref linkend="sql-vacuum"/> and <xref linkend="sql-analyze"/>
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index d8096d2ee4f..314b5f73ddc 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -561,7 +561,7 @@ GetAccessStrategy(BufferAccessStrategyType btype)
 			ring_size_kb = 16 * 1024;
 			break;
 		case BAS_VACUUM:
-			ring_size_kb = 256;
+			ring_size_kb = 2048;
 			break;
 
 		default:
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 3e38bb1311d..cc61937eef7 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -143,7 +143,7 @@ int			max_parallel_workers = 8;
 int			MaxBackends = 0;
 
 /* GUC parameters for vacuum */
-int			VacuumBufferUsageLimit = 256;
+int			VacuumBufferUsageLimit = 2048;
 
 int			VacuumCostPageHit = 1;
 int			VacuumCostPageMiss = 2;
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index c12784cbec8..7d4e4387cf5 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2275,7 +2275,7 @@ struct config_int ConfigureNamesInt[] =
 			GUC_UNIT_KB
 		},
 		&VacuumBufferUsageLimit,
-		256, 0, MAX_BAS_VAC_RING_SIZE_KB,
+		2048, 0, MAX_BAS_VAC_RING_SIZE_KB,
 		check_vacuum_buffer_usage_limit, NULL, NULL
 	},
 
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index baecde28410..2166ea4a87a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -166,7 +166,7 @@
 					#   mmap
 					# (change requires restart)
 #min_dynamic_shared_memory = 0MB	# (change requires restart)
-#vacuum_buffer_usage_limit = 256kB	# size of vacuum and analyze buffer access strategy ring;
+#vacuum_buffer_usage_limit = 2MB	# size of vacuum and analyze buffer access strategy ring;
 					# 0 to disable vacuum buffer access strategy;
 					# range 128kB to 16GB
 
-- 
2.44.0

v12-0006-Use-streaming-IO-in-heapam-sequential-and-TID-ra.patchtext/x-patch; charset=US-ASCII; name=v12-0006-Use-streaming-IO-in-heapam-sequential-and-TID-ra.patchDownload
From f41335bf8f35a966767c1e9c3b578478ca4159db Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 5 Apr 2024 13:32:14 +1300
Subject: [PATCH v12 6/6] Use streaming IO in heapam sequential and TID range
 scans

Instead of calling ReadBuffer() for each block heap sequential scans and
TID range scans now use the streaming read API introduced in b5a9b18cd0.

Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg%3DgEQ%40mail.gmail.com
---
 src/backend/access/heap/heapam.c | 98 ++++++++++++++++++++++++++------
 src/include/access/heapam.h      | 15 +++++
 2 files changed, 95 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index dada2ecd1e3..bafe023bce8 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -223,6 +223,25 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * ----------------------------------------------------------------
  */
 
+static BlockNumber
+heap_scan_stream_read_next(ReadStream *pgsr, void *private_data,
+						   void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) private_data;
+
+	if (unlikely(!scan->rs_inited))
+	{
+		scan->rs_prefetch_block = heapgettup_initial_block(scan, scan->rs_dir);
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_prefetch_block = heapgettup_advance_block(scan,
+														   scan->rs_prefetch_block,
+														   scan->rs_dir);
+
+	return scan->rs_prefetch_block;
+}
+
 /* ----------------
  *		initscan - scan code common to heap_beginscan and heap_rescan
  * ----------------
@@ -325,6 +344,13 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
 
+	/*
+	 * Initialize to ForwardScanDirection because it is most common and heap
+	 * scans usually must go forwards before going backward.
+	 */
+	scan->rs_dir = ForwardScanDirection;
+	scan->rs_prefetch_block = InvalidBlockNumber;
+
 	/* page-at-a-time fields are always invalid when not rs_inited */
 
 	/*
@@ -462,12 +488,14 @@ heap_prepare_pagescan(TableScanDesc sscan)
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
- * Read the next block of the scan relation into a buffer and pin that buffer
- * before saving it in the scan descriptor.
+ * Read the next block of the scan relation from the read stream and pin that
+ * buffer before saving it in the scan descriptor.
  */
 static inline void
 heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 {
+	Assert(scan->rs_read_stream);
+
 	/* release previous scan buffer, if any */
 	if (BufferIsValid(scan->rs_cbuf))
 	{
@@ -482,25 +510,23 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	if (unlikely(!scan->rs_inited))
+	/*
+	 * If the scan direction is changing, reset the prefetch block to the
+	 * current block. Otherwise, we will incorrectly prefetch the blocks
+	 * between the prefetch block and the current block again before
+	 * prefetching blocks in the new, correct scan direction.
+	 */
+	if (unlikely(scan->rs_dir != dir))
 	{
-		scan->rs_cblock = heapgettup_initial_block(scan, dir);
+		scan->rs_prefetch_block = scan->rs_cblock;
+		read_stream_reset(scan->rs_read_stream);
+	}
 
-		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
-		Assert(scan->rs_cblock != InvalidBlockNumber ||
-			   !BufferIsValid(scan->rs_cbuf));
+	scan->rs_dir = dir;
 
-		scan->rs_inited = true;
-	}
-	else
-		scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock,
-												   dir);
-
-	/* read block if valid */
-	if (BlockNumberIsValid(scan->rs_cblock))
-		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
-										   scan->rs_cblock, RBM_NORMAL,
-										   scan->rs_strategy);
+	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
+	if (BufferIsValid(scan->rs_cbuf))
+		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
 }
 
 /*
@@ -833,6 +859,7 @@ continue_page:
 
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
+	scan->rs_prefetch_block = InvalidBlockNumber;
 	tuple->t_data = NULL;
 	scan->rs_inited = false;
 }
@@ -928,6 +955,7 @@ continue_page:
 		ReleaseBuffer(scan->rs_cbuf);
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
+	scan->rs_prefetch_block = InvalidBlockNumber;
 	tuple->t_data = NULL;
 	scan->rs_inited = false;
 }
@@ -1021,6 +1049,26 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 
 	initscan(scan, key, false);
 
+	scan->rs_read_stream = NULL;
+
+	/*
+	 * Set up a read stream for sequential scans and TID range scans. This
+	 * should be done after initscan() because initscan() allocates the
+	 * BufferAccessStrategy object passed to the streaming read API.
+	 */
+	if (scan->rs_base.rs_flags & SO_TYPE_SEQSCAN ||
+		scan->rs_base.rs_flags & SO_TYPE_TIDRANGESCAN)
+	{
+		scan->rs_read_stream = read_stream_begin_relation(READ_STREAM_SEQUENTIAL,
+														  scan->rs_strategy,
+														  scan->rs_base.rs_rd,
+														  MAIN_FORKNUM,
+														  heap_scan_stream_read_next,
+														  scan,
+														  0);
+	}
+
+
 	return (TableScanDesc) scan;
 }
 
@@ -1055,6 +1103,14 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
 
+	/*
+	 * The read stream is reset on rescan. This must be done before
+	 * initscan(), as some state referred to by read_stream_reset() is reset
+	 * in initscan().
+	 */
+	if (scan->rs_read_stream)
+		read_stream_reset(scan->rs_read_stream);
+
 	/*
 	 * reinitialize scan descriptor
 	 */
@@ -1074,6 +1130,12 @@ heap_endscan(TableScanDesc sscan)
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
 
+	/*
+	 * Must free the read stream before freeing the BufferAccessStrategy.
+	 */
+	if (scan->rs_read_stream)
+		read_stream_end(scan->rs_read_stream);
+
 	/*
 	 * decrement relation reference count and free scan descriptor storage
 	 */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 2765efc4e5e..332a7faa8d1 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,6 +25,7 @@
 #include "storage/bufpage.h"
 #include "storage/dsm.h"
 #include "storage/lockdefs.h"
+#include "storage/read_stream.h"
 #include "storage/shm_toc.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -70,6 +71,20 @@ typedef struct HeapScanDescData
 
 	HeapTupleData rs_ctup;		/* current tuple in scan, if any */
 
+	/* For scans that stream reads */
+	ReadStream *rs_read_stream;
+
+	/*
+	 * For sequential scans and TID range scans to stream reads. The read
+	 * stream is allocated at the beginning of the scan and reset on rescan or
+	 * when the scan direction changes. The scan direction is saved each time
+	 * a new page is requested. If the scan direction changes from one page to
+	 * the next, the read stream releases all previously pinned buffers and
+	 * resets the prefetch block.
+	 */
+	ScanDirection rs_dir;
+	BlockNumber rs_prefetch_block;
+
 	/*
 	 * For parallel scans to store page allocation data.  NULL when not
 	 * performing a parallel scan.
-- 
2.44.0

#33Melanie Plageman
melanieplageman@gmail.com
In reply to: Thomas Munro (#32)
Re: Streaming read-ready sequential scan code

On Fri, Apr 5, 2024 at 7:28 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sat, Apr 6, 2024 at 6:55 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I haven't looked into or reviewed the memory prefetching part.

While reviewing 0002, I realized that I don't quite see how
read_stream_get_block() will be used in the fastpath -- which it
claims in its comments.

Comments improved.

Ah, makes sense now.

The occasional CI problem I mentioned turned out to be
read_stream_reset() remembering a little too much state across
rescans. Fixed.

Nice investigative work figuring this out.

Thanks both for the feedback on the ring buffer tweaks. Comments
updated. Here is the full stack of patches I would like to commit
very soon, though I may leave the memory prefetching part out for a
bit longer to see if I can find any downside.

0001 LGTM. I did not review 0002 or 0003.

0004 looks good except for one comment typo:

/*
* Tell call not to pin more than half the buffers in the ring.
* This is a trade-off between look ahead distance and deferring
* writeback and associated WAL traffic.
*/

call -> caller

0006, I noticed the commit message is missing an important comma:

Instead of calling ReadBuffer() for each block heap sequential scans and
TID range scans now use the streaming read API introduced in b5a9b18cd0.

should be

Instead of calling ReadBuffer() for each block, heap sequential scans and
TID range scans now use the streaming read API introduced in b5a9b18cd0.

- Melanie

#34Thomas Munro
thomas.munro@gmail.com
In reply to: Melanie Plageman (#33)
1 attachment(s)
Re: Streaming read-ready sequential scan code

I found a bug in read_stream.c that could be hit with Melanie's
streaming seq scan patch with parallelism enabled and certain buffer
pool conditions. Short version: there is an edge case where an "if"
needed to be a "while", or we could lose a few blocks. Here's the fix
for that, longer explanation in commit message.

Attachments:

0001-Fix-bug-in-read_stream.c.patchtext/x-patch; charset=US-ASCII; name=0001-Fix-bug-in-read_stream.c.patchDownload
From 43cef2d58141ba048e9349b0027afff148be5553 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 7 Apr 2024 12:36:44 +1200
Subject: [PATCH] Fix bug in read_stream.c.

When we determine that a wanted block can't be combined with the current
pending read, it's time to start that pending read to get it out of the
way.  An "if" in that code path should have been a "while", because it
might take more than one go to get that job done.  Otherwise the
remaining part of a partially started read could be clobbered and we
could lose some blocks.  This was only broken for smaller ranges, as the
more common case of io_combine_limit-sized ranges is handled earlier in
the code and knows how to loop.

Discovered while testing parallel sequential scans of partially cached
tables.  They have a ramp-down phase with ever smaller ranges of
contiguous blocks, to be fair to parallel workers as the work runs out.

Defect in commit b5a9b18c.
---
 src/backend/storage/aio/read_stream.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 9a70a81f7ae..f54dacdd914 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -363,7 +363,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 		}
 
 		/* We have to start the pending read before we can build another. */
-		if (stream->pending_read_nblocks > 0)
+		while (stream->pending_read_nblocks > 0)
 		{
 			read_stream_start_pending_read(stream, suppress_advice);
 			suppress_advice = false;
-- 
2.44.0

#35Melanie Plageman
melanieplageman@gmail.com
In reply to: Thomas Munro (#34)
2 attachment(s)
Re: Streaming read-ready sequential scan code

On Sat, Apr 6, 2024 at 9:25 PM Thomas Munro <thomas.munro@gmail.com> wrote:

I found a bug in read_stream.c that could be hit with Melanie's
streaming seq scan patch with parallelism enabled and certain buffer
pool conditions. Short version: there is an edge case where an "if"
needed to be a "while", or we could lose a few blocks. Here's the fix
for that, longer explanation in commit message.

Attached v13 0001 is your fix and 0002 is a new version of the
sequential scan streaming read user. Off-list Andres mentioned that I
really ought to separate the parallel and serial sequential scan users
into two different callbacks. I've done that in the attached. It
actually makes the code used by the callbacks nicer and more readable
anyway (putting aside performance). I was able to measure a small
performance difference as well.

I've also added a few comments and improved existing comments.

- Melanie

Attachments:

v13-0002-Use-streaming-IO-in-heapam-sequential-and-TID-ra.patchtext/x-patch; charset=US-ASCII; name=v13-0002-Use-streaming-IO-in-heapam-sequential-and-TID-ra.patchDownload
From eded321df22bf472f147bd8f94b596d465355c70 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 5 Apr 2024 13:32:14 +1300
Subject: [PATCH v13 2/2] Use streaming IO in heapam sequential and TID range
 scans

Instead of calling ReadBuffer() for each block, heap sequential scans
and TID range scans now use the streaming read API introduced in
b5a9b18cd0.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/flat/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg%3DgEQ%40mail.gmail.com
---
 src/backend/access/heap/heapam.c | 234 +++++++++++++++++++++----------
 src/include/access/heapam.h      |  15 ++
 2 files changed, 176 insertions(+), 73 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 01bb2f4cc16..9d10d42b69b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -223,6 +223,66 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] =
  * ----------------------------------------------------------------
  */
 
+/*
+ * Streaming read API callback for parallel sequential scans. Returns the next
+ * block the caller wants from the read stream or InvalidBlockNumber when done.
+ */
+static BlockNumber
+heap_scan_stream_read_next_parallel(ReadStream *pgsr, void *private_data,
+									void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) private_data;
+
+	Assert(ScanDirectionIsForward(scan->rs_dir));
+	Assert(scan->rs_base.rs_parallel);
+
+	if (unlikely(!scan->rs_inited))
+	{
+		/* parallel scan */
+		table_block_parallelscan_startblock_init(scan->rs_base.rs_rd,
+												 scan->rs_parallelworkerdata,
+												 (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel);
+
+		/* may return InvalidBlockNumber if there are no more blocks */
+		scan->rs_prefetch_block = table_block_parallelscan_nextpage(scan->rs_base.rs_rd,
+																	scan->rs_parallelworkerdata,
+																	(ParallelBlockTableScanDesc) scan->rs_base.rs_parallel);
+		scan->rs_inited = true;
+	}
+	else
+	{
+		scan->rs_prefetch_block = table_block_parallelscan_nextpage(scan->rs_base.rs_rd,
+																	scan->rs_parallelworkerdata, (ParallelBlockTableScanDesc)
+																	scan->rs_base.rs_parallel);
+	}
+
+	return scan->rs_prefetch_block;
+}
+
+/*
+ * Streaming read API callback for serial sequential and TID range scans.
+ * Returns the next block the caller wants from the read stream or
+ * InvalidBlockNumber when done.
+ */
+static BlockNumber
+heap_scan_stream_read_next_serial(ReadStream *pgsr, void *private_data,
+								  void *per_buffer_data)
+{
+	HeapScanDesc scan = (HeapScanDesc) private_data;
+
+	if (unlikely(!scan->rs_inited))
+	{
+		scan->rs_prefetch_block = heapgettup_initial_block(scan, scan->rs_dir);
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_prefetch_block = heapgettup_advance_block(scan,
+														   scan->rs_prefetch_block,
+														   scan->rs_dir);
+
+	return scan->rs_prefetch_block;
+}
+
 /* ----------------
  *		initscan - scan code common to heap_beginscan and heap_rescan
  * ----------------
@@ -325,6 +385,13 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock)
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
 
+	/*
+	 * Initialize to ForwardScanDirection because it is most common and
+	 * because heap scans go forward before going backward (e.g. CURSORs).
+	 */
+	scan->rs_dir = ForwardScanDirection;
+	scan->rs_prefetch_block = InvalidBlockNumber;
+
 	/* page-at-a-time fields are always invalid when not rs_inited */
 
 	/*
@@ -462,12 +529,14 @@ heap_prepare_pagescan(TableScanDesc sscan)
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
- * Read the next block of the scan relation into a buffer and pin that buffer
- * before saving it in the scan descriptor.
+ * Read the next block of the scan relation from the read stream and pin that
+ * buffer before saving it in the scan descriptor.
  */
 static inline void
 heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 {
+	Assert(scan->rs_read_stream);
+
 	/* release previous scan buffer, if any */
 	if (BufferIsValid(scan->rs_cbuf))
 	{
@@ -482,25 +551,23 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	if (unlikely(!scan->rs_inited))
+	/*
+	 * If the scan direction is changing, reset the prefetch block to the
+	 * current block. Otherwise, we will incorrectly prefetch the blocks
+	 * between the prefetch block and the current block again before
+	 * prefetching blocks in the new, correct scan direction.
+	 */
+	if (unlikely(scan->rs_dir != dir))
 	{
-		scan->rs_cblock = heapgettup_initial_block(scan, dir);
+		scan->rs_prefetch_block = scan->rs_cblock;
+		read_stream_reset(scan->rs_read_stream);
+	}
 
-		/* ensure rs_cbuf is invalid when we get InvalidBlockNumber */
-		Assert(scan->rs_cblock != InvalidBlockNumber ||
-			   !BufferIsValid(scan->rs_cbuf));
+	scan->rs_dir = dir;
 
-		scan->rs_inited = true;
-	}
-	else
-		scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock,
-												   dir);
-
-	/* read block if valid */
-	if (BlockNumberIsValid(scan->rs_cblock))
-		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
-										   scan->rs_cblock, RBM_NORMAL,
-										   scan->rs_strategy);
+	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
+	if (BufferIsValid(scan->rs_cbuf))
+		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
 }
 
 /*
@@ -514,6 +581,7 @@ static pg_noinline BlockNumber
 heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
 {
 	Assert(!scan->rs_inited);
+	Assert(scan->rs_base.rs_parallel == NULL);
 
 	/* When there are no pages to scan, return InvalidBlockNumber */
 	if (scan->rs_nblocks == 0 || scan->rs_numblocks == 0)
@@ -521,27 +589,10 @@ heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir)
 
 	if (ScanDirectionIsForward(dir))
 	{
-		/* serial scan */
-		if (scan->rs_base.rs_parallel == NULL)
-			return scan->rs_startblock;
-		else
-		{
-			/* parallel scan */
-			table_block_parallelscan_startblock_init(scan->rs_base.rs_rd,
-													 scan->rs_parallelworkerdata,
-													 (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel);
-
-			/* may return InvalidBlockNumber if there are no more blocks */
-			return table_block_parallelscan_nextpage(scan->rs_base.rs_rd,
-													 scan->rs_parallelworkerdata,
-													 (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel);
-		}
+		return scan->rs_startblock;
 	}
 	else
 	{
-		/* backward parallel scan not supported */
-		Assert(scan->rs_base.rs_parallel == NULL);
-
 		/*
 		 * Disable reporting to syncscan logic in a backwards scan; it's not
 		 * very likely anyone else is doing the same thing at the same time,
@@ -653,50 +704,43 @@ heapgettup_continue_page(HeapScanDesc scan, ScanDirection dir, int *linesleft,
 static inline BlockNumber
 heapgettup_advance_block(HeapScanDesc scan, BlockNumber block, ScanDirection dir)
 {
-	if (ScanDirectionIsForward(dir))
+	Assert(scan->rs_base.rs_parallel == NULL);
+
+	if (likely(ScanDirectionIsForward(dir)))
 	{
-		if (scan->rs_base.rs_parallel == NULL)
-		{
-			block++;
+		block++;
 
-			/* wrap back to the start of the heap */
-			if (block >= scan->rs_nblocks)
-				block = 0;
+		/* wrap back to the start of the heap */
+		if (block >= scan->rs_nblocks)
+			block = 0;
 
-			/*
-			 * Report our new scan position for synchronization purposes. We
-			 * don't do that when moving backwards, however. That would just
-			 * mess up any other forward-moving scanners.
-			 *
-			 * Note: we do this before checking for end of scan so that the
-			 * final state of the position hint is back at the start of the
-			 * rel.  That's not strictly necessary, but otherwise when you run
-			 * the same query multiple times the starting position would shift
-			 * a little bit backwards on every invocation, which is confusing.
-			 * We don't guarantee any specific ordering in general, though.
-			 */
-			if (scan->rs_base.rs_flags & SO_ALLOW_SYNC)
-				ss_report_location(scan->rs_base.rs_rd, block);
-
-			/* we're done if we're back at where we started */
-			if (block == scan->rs_startblock)
-				return InvalidBlockNumber;
+		/*
+		 * Report our new scan position for synchronization purposes. We don't
+		 * do that when moving backwards, however. That would just mess up any
+		 * other forward-moving scanners.
+		 *
+		 * Note: we do this before checking for end of scan so that the final
+		 * state of the position hint is back at the start of the rel.  That's
+		 * not strictly necessary, but otherwise when you run the same query
+		 * multiple times the starting position would shift a little bit
+		 * backwards on every invocation, which is confusing. We don't
+		 * guarantee any specific ordering in general, though.
+		 */
+		if (scan->rs_base.rs_flags & SO_ALLOW_SYNC)
+			ss_report_location(scan->rs_base.rs_rd, block);
 
-			/* check if the limit imposed by heap_setscanlimits() is met */
-			if (scan->rs_numblocks != InvalidBlockNumber)
-			{
-				if (--scan->rs_numblocks == 0)
-					return InvalidBlockNumber;
-			}
+		/* we're done if we're back at where we started */
+		if (block == scan->rs_startblock)
+			return InvalidBlockNumber;
 
-			return block;
-		}
-		else
+		/* check if the limit imposed by heap_setscanlimits() is met */
+		if (scan->rs_numblocks != InvalidBlockNumber)
 		{
-			return table_block_parallelscan_nextpage(scan->rs_base.rs_rd,
-													 scan->rs_parallelworkerdata, (ParallelBlockTableScanDesc)
-													 scan->rs_base.rs_parallel);
+			if (--scan->rs_numblocks == 0)
+				return InvalidBlockNumber;
 		}
+
+		return block;
 	}
 	else
 	{
@@ -833,6 +877,7 @@ continue_page:
 
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
+	scan->rs_prefetch_block = InvalidBlockNumber;
 	tuple->t_data = NULL;
 	scan->rs_inited = false;
 }
@@ -928,6 +973,7 @@ continue_page:
 		ReleaseBuffer(scan->rs_cbuf);
 	scan->rs_cbuf = InvalidBuffer;
 	scan->rs_cblock = InvalidBlockNumber;
+	scan->rs_prefetch_block = InvalidBlockNumber;
 	tuple->t_data = NULL;
 	scan->rs_inited = false;
 }
@@ -1023,6 +1069,34 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 
 	initscan(scan, key, false);
 
+	scan->rs_read_stream = NULL;
+
+	/*
+	 * Set up a read stream for sequential scans and TID range scans. This
+	 * should be done after initscan() because initscan() allocates the
+	 * BufferAccessStrategy object passed to the streaming read API.
+	 */
+	if (scan->rs_base.rs_flags & SO_TYPE_SEQSCAN ||
+		scan->rs_base.rs_flags & SO_TYPE_TIDRANGESCAN)
+	{
+		BlockNumber (*cb) (ReadStream *pgsr, void *private_data,
+						   void *per_buffer_data);
+
+		if (scan->rs_base.rs_parallel)
+			cb = heap_scan_stream_read_next_parallel;
+		else
+			cb = heap_scan_stream_read_next_serial;
+
+		scan->rs_read_stream = read_stream_begin_relation(READ_STREAM_SEQUENTIAL,
+														  scan->rs_strategy,
+														  scan->rs_base.rs_rd,
+														  MAIN_FORKNUM,
+														  cb,
+														  scan,
+														  0);
+	}
+
+
 	return (TableScanDesc) scan;
 }
 
@@ -1065,6 +1139,14 @@ heap_rescan(TableScanDesc sscan, ScanKey key, bool set_params,
 
 	Assert(scan->rs_empty_tuples_pending == 0);
 
+	/*
+	 * The read stream is reset on rescan. This must be done before
+	 * initscan(), as some state referred to by read_stream_reset() is reset
+	 * in initscan().
+	 */
+	if (scan->rs_read_stream)
+		read_stream_reset(scan->rs_read_stream);
+
 	/*
 	 * reinitialize scan descriptor
 	 */
@@ -1089,6 +1171,12 @@ heap_endscan(TableScanDesc sscan)
 
 	Assert(scan->rs_empty_tuples_pending == 0);
 
+	/*
+	 * Must free the read stream before freeing the BufferAccessStrategy.
+	 */
+	if (scan->rs_read_stream)
+		read_stream_end(scan->rs_read_stream);
+
 	/*
 	 * decrement relation reference count and free scan descriptor storage
 	 */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 750ea30852e..48936826bcc 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,6 +25,7 @@
 #include "storage/bufpage.h"
 #include "storage/dsm.h"
 #include "storage/lockdefs.h"
+#include "storage/read_stream.h"
 #include "storage/shm_toc.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -70,6 +71,20 @@ typedef struct HeapScanDescData
 
 	HeapTupleData rs_ctup;		/* current tuple in scan, if any */
 
+	/* For scans that stream reads */
+	ReadStream *rs_read_stream;
+
+	/*
+	 * For sequential scans and TID range scans to stream reads. The read
+	 * stream is allocated at the beginning of the scan and reset on rescan or
+	 * when the scan direction changes. The scan direction is saved each time
+	 * a new page is requested. If the scan direction changes from one page to
+	 * the next, the read stream releases all previously pinned buffers and
+	 * resets the prefetch block.
+	 */
+	ScanDirection rs_dir;
+	BlockNumber rs_prefetch_block;
+
 	/*
 	 * For parallel scans to store page allocation data.  NULL when not
 	 * performing a parallel scan.
-- 
2.40.1

v13-0001-Fix-bug-in-read_stream.c.patchtext/x-patch; charset=US-ASCII; name=v13-0001-Fix-bug-in-read_stream.c.patchDownload
From d3d7d4d09d804a8a8d00ecea080de5d63d21159c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 7 Apr 2024 12:36:44 +1200
Subject: [PATCH v13 1/2] Fix bug in read_stream.c.

When we determine that a wanted block can't be combined with the current
pending read, it's time to start that pending read to get it out of the
way.  An "if" in that code path should have been a "while", because it
might take more than one go to get that job done.  Otherwise the
remaining part of a partially started read could be clobbered and we
could lose some blocks.  This was only broken for smaller ranges, as the
more common case of io_combine_limit-sized ranges is handled earlier in
the code and knows how to loop.

Discovered while testing parallel sequential scans of partially cached
tables.  They have a ramp-down phase with ever smaller ranges of
contiguous blocks, to be fair to parallel workers as the work runs out.

Defect in commit b5a9b18c.
---
 src/backend/storage/aio/read_stream.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 9a70a81f7ae..f54dacdd914 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -363,7 +363,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
 		}
 
 		/* We have to start the pending read before we can build another. */
-		if (stream->pending_read_nblocks > 0)
+		while (stream->pending_read_nblocks > 0)
 		{
 			read_stream_start_pending_read(stream, suppress_advice);
 			suppress_advice = false;
-- 
2.40.1

#36Thomas Munro
thomas.munro@gmail.com
In reply to: Melanie Plageman (#35)
Re: Streaming read-ready sequential scan code

On Sun, Apr 7, 2024 at 1:34 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

Attached v13 0001 is your fix and 0002 is a new version of the
sequential scan streaming read user. Off-list Andres mentioned that I
really ought to separate the parallel and serial sequential scan users
into two different callbacks. I've done that in the attached. It
actually makes the code used by the callbacks nicer and more readable
anyway (putting aside performance). I was able to measure a small
performance difference as well.

Thanks. I changed a couple of very trivial things before pushing.

+        BlockNumber (*cb) (ReadStream *pgsr, void *private_data,
+                           void *per_buffer_data);

This type has a friendly name: ReadStreamBlockNumberCB.
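
For illustration, that tidy-up amounts to roughly the following (a sketch
only; the exact form that was pushed may differ slightly):

ReadStreamBlockNumberCB cb;

if (scan->rs_base.rs_parallel)
	cb = heap_scan_stream_read_next_parallel;
else
	cb = heap_scan_stream_read_next_serial;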

+ scan->rs_read_stream =
read_stream_begin_relation(READ_STREAM_SEQUENTIAL,

I've been on the fence about that flag for sequential scan... Some
days I want to consider changing to READ_STREAM_DEFAULT and relying on
our "anti-heuristics" to suppress advice, which would work out the
same in most cases but might occasionally win big. It might also
hurt, I dunno, so I suspect we'd have to make it better first, which
my patch in [1] was a first swing at, but I haven't researched that
enough. So, kept this way!

- * Read the next block of the scan relation into a buffer and pin that buffer
- * before saving it in the scan descriptor.
+ * Read the next block of the scan relation from the read stream and pin that
+ * buffer before saving it in the scan descriptor.

Changed to:

* Read the next block of the scan relation from the read stream and save it
* in the scan descriptor. It is already pinned.

+static BlockNumber
+heap_scan_stream_read_next_parallel(ReadStream *pgsr, void *private_data,
+                                    void *per_buffer_data)

Changed argument names to match the function pointer type definition,
"stream" and "callback_private_data".

BTW looking at the branching in read-stream user patches that have an
initialisation step like yours, I wonder if it might ever make sense
to be able to change the callback on the fly from inside the callback,
so that you finish up with a branchless one doing most of the work. I
have no idea if it's worth it...

[1]: /messages/by-id/CA+hUKGLLFvou5rx5FDhm-Pc9r4STQTFFmrx6SUV+vk8fwMbreA@mail.gmail.com

#37Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#36)
Re: Streaming read-ready sequential scan code

Hi,

On 2024-04-08 09:36:59 +1200, Thomas Munro wrote:

I've been on the fence about that flag for sequential scan... Some
days I want to consider changing to READ_STREAM_DEFAULT and relying on
our "anti-heuristics" to suppress advice, which would work out the
same in most cases but might occasionally win big.

Agreed, it's pretty easy to end up with a fairly "fragmented" set of a
relation's buffers in s_b. OTOH, there might not be any need for the
heuristic if we actually trigger reads asynchronously.

BTW looking at the branching in read-stream user patches that have an
initialisation step like yours, I wonder if it might every make sense
to be able to change the callback on the fly from inside the callback,
so that you finish up with a branchless one doing most of the work. I
have no idea if it's worth it...

I was wondering about that too, I dislike those branches. But instead of
changing the callback, it seems like a better design would be to have another
dedicated callback for that? There already is a dedicated branch for the
"just starting up" path in read_stream_next_buffer(), so it'd be pretty much
free to call another callback there.
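
To sketch the shape of that idea (hypothetical only: read_stream has no
such startup hook today, and the function names below are invented for
illustration), the v13 serial heap scan callback could then lose its
rs_inited branch:

/* Hypothetical: called once by the stream to get the first block. */
static BlockNumber
heap_scan_stream_first_block(ReadStream *stream, void *callback_private_data,
							 void *per_buffer_data)
{
	HeapScanDesc scan = (HeapScanDesc) callback_private_data;

	scan->rs_prefetch_block = heapgettup_initial_block(scan, scan->rs_dir);
	scan->rs_inited = true;
	return scan->rs_prefetch_block;
}

/* Steady-state callback: no rs_inited branch needed any more. */
static BlockNumber
heap_scan_stream_next_block(ReadStream *stream, void *callback_private_data,
							void *per_buffer_data)
{
	HeapScanDesc scan = (HeapScanDesc) callback_private_data;

	scan->rs_prefetch_block = heapgettup_advance_block(scan,
													   scan->rs_prefetch_block,
													   scan->rs_dir);
	return scan->rs_prefetch_block;
}

The stream would invoke the first function exactly once at startup and the
second one for every later block.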

Greetings,

Andres Freund

#38Alexander Lakhin
exclusion@gmail.com
In reply to: Andres Freund (#37)
1 attachment(s)
Re: Streaming read-ready sequential scan code

Hello,

I decided to compare v17 vs v16 performance (as I did the last year [1])
and discovered that v17 loses to v16 in the pg_tpcds (s64da_tpcds)
benchmark, query15 (and several others, but I focused on this one):
Best pg-src-master--.* worse than pg-src-16--.* by 52.2 percents (229.84 > 151.03): pg_tpcds.query15
Average pg-src-master--.* worse than pg-src-16--.* by 53.4 percents (234.20 > 152.64): pg_tpcds.query15
Please look at the full html report attached in case you're interested.

(I used my pg-mark tool to measure/analyze performance, but I believe the
same results can be seen without it.)

`git bisect` for this performance degradation pointed at b7b0f3f27...

[1]: /messages/by-id/b32bed1b-0746-9b20-1472-4bdc9ca66d52@gmail.com

Best regards,
Alexander

Attachments:

benchmark-results.htmltext/html; charset=UTF-8; name=benchmark-results.htmlDownload
#39Thomas Munro
thomas.munro@gmail.com
In reply to: Alexander Lakhin (#38)
Re: Streaming read-ready sequential scan code

On Sat, May 18, 2024 at 1:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:

I decided to compare v17 vs v16 performance (as I did the last year [1])
and discovered that v17 loses to v16 in the pg_tpcds (s64da_tpcds)
benchmark, query15 (and several others, but I focused on this one):
Best pg-src-master--.* worse than pg-src-16--.* by 52.2 percents (229.84 > 151.03): pg_tpcds.query15
Average pg-src-master--.* worse than pg-src-16--.* by 53.4 percents (234.20 > 152.64): pg_tpcds.query15
Please look at the full html report attached in case you're interested.

(I used my pg-mark tool to measure/analyze performance, but I believe the
same results can be seen without it.)

Will investigate, but if it's easy for you to rerun, does it help if
you increase Linux readahead, eg blockdev --setra setting?

#40Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#39)
Re: Streaming read-ready sequential scan code

On Sat, May 18, 2024 at 8:09 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sat, May 18, 2024 at 1:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:

I decided to compare v17 vs v16 performance (as I did the last year [1])
and discovered that v17 loses to v16 in the pg_tpcds (s64da_tpcds)
benchmark, query15 (and several others, but I focused on this one):
Best pg-src-master--.* worse than pg-src-16--.* by 52.2 percents (229.84 > 151.03): pg_tpcds.query15
Average pg-src-master--.* worse than pg-src-16--.* by 53.4 percents (234.20 > 152.64): pg_tpcds.query15
Please look at the full html report attached in case you're interested.

(I used my pg-mark tool to measure/analyze performance, but I believe the
same results can be seen without it.)

Will investigate, but if it's easy for you to rerun, does it help if
you increase Linux readahead, eg blockdev --setra setting?

Andres happened to have TPC-DS handy, and reproduced that regression
in q15. We tried some stuff and figured out that it requires
parallel_leader_participation=on, ie that this looks like some kind of
parallel fairness and/or timing problem. It seems to be a question of
which worker finishes up processing matching rows, and the leader gets
a ~10ms head start but may be a little more greedy with the new
streaming code. He tried reordering the table contents and then saw
17 beat 16. So for q15, initial indications are that this isn't a
fundamental regression, it's just a test that is sensitive to some
arbitrary conditions.

I'll try to figure out some more details about that, ie is it being
too greedy on small-ish tables, and generally I do wonder about the
interactions between the heuristics and batching working at different
levels (OS, seq scan, read stream, hence my earlier ra question which
is likely a red herring) and how there might be unintended
consequences/interference patterns, but this particular case seems
more data dependent.

#41Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#40)
Re: Streaming read-ready sequential scan code

On Sat, May 18, 2024 at 11:30 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Andres happened to have TPC-DS handy, and reproduced that regression
in q15. We tried some stuff and figured out that it requires
parallel_leader_participation=on, ie that this looks like some kind of
parallel fairness and/or timing problem. It seems to be a question of
which worker finishes up processing matching rows, and the leader gets
a ~10ms head start but may be a little more greedy with the new
streaming code. He tried reordering the table contents and then saw
17 beat 16. So for q15, initial indications are that this isn't a
fundamental regression, it's just a test that is sensitive to some
arbitrary conditions.

I'll try to figure out some more details about that, ie is it being
too greedy on small-ish tables,

After more debugging, we learned a lot more things...

1. That query produces spectacularly bad estimates, so we finish up
having to increase the number of buckets in a parallel hash join many
times. That is quite interesting, but unrelated to new code.
2. Parallel hash join is quite slow at negotiating an increase in the
number of hash buckets, if all of the input tuples are being filtered
out by quals, because of the choice of where workers check for
PHJ_GROWTH_NEED_MORE_BUCKETS. That could be improved quite easily I
think. I have put that on my todo list 'cause that's also my code,
but it's not a new issue it's just one that is now highlighted...
3. This bit of read_stream.c is exacerbating unfairness in the
underlying scan, so that 1 and 2 come together and produce a nasty
slowdown, which goes away if you change it like so:

-       BlockNumber blocknums[16];
+       BlockNumber blocknums[1];

I will follow up after some more study.

#42Alexander Lakhin
exclusion@gmail.com
In reply to: Thomas Munro (#41)
Re: Streaming read-ready sequential scan code

Hello Thomas,

18.05.2024 07:47, Thomas Munro wrote:

After more debugging, we learned a lot more things...

1. That query produces spectacularly bad estimates, so we finish up
having to increase the number of buckets in a parallel hash join many
times. That is quite interesting, but unrelated to new code.
2. Parallel hash join is quite slow at negotiating an increase in the
number of hash buckets, if all of the input tuples are being filtered
out by quals, because of the choice of where workers check for
PHJ_GROWTH_NEED_MORE_BUCKETS. That could be improved quite easily I
think. I have put that on my todo list 'cause that's also my code,
but it's not a new issue it's just one that is now highlighted...
3. This bit of read_stream.c is exacerbating unfairness in the
underlying scan, so that 1 and 2 come together and produce a nasty
slowdown, which goes away if you change it like so:

-       BlockNumber blocknums[16];
+       BlockNumber blocknums[1];

I will follow up after some more study.

Thank you for the information!

Unfortunately, I can't see significant differences in my environment with
parallel_leader_participation=off.

With blocknums[1], timing is changed, but the effect is not persistent.
10 query15 executions in a row, b7b0f3f27:
277.932 ms
281.805 ms
278.335 ms
281.565 ms
284.167 ms
283.171 ms
281.165 ms
281.615 ms
285.394 ms
277.301 ms

b7b0f3f27~1:
159.789 ms
165.407 ms
160.893 ms
159.343 ms
160.936 ms
161.577 ms
161.637 ms
163.421 ms
163.143 ms
167.109 ms

b7b0f3f27 + blocknums[1]:
164.133 ms
280.920 ms
160.748 ms
163.182 ms
161.709 ms
161.998 ms
161.239 ms
276.256 ms
161.601 ms
160.384 ms

I placed PGDATA on tmpfs to rule out any blockdev specifics (increasing
blockdev ra from 256 to 4096 didn't help me with PGDATA on NVME either.)

Best regards,
Alexander

#43Thomas Munro
thomas.munro@gmail.com
In reply to: Alexander Lakhin (#42)
Re: Streaming read-ready sequential scan code

On Sun, May 19, 2024 at 7:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:

With blocknums[1], timing is changed, but the effect is not persistent.
10 query15 executions in a row, b7b0f3f27:
277.932 ms
281.805 ms
278.335 ms
281.565 ms
284.167 ms
283.171 ms
281.165 ms
281.615 ms
285.394 ms
277.301 ms

The bad time 10/10.

b7b0f3f27~1:
159.789 ms
165.407 ms
160.893 ms
159.343 ms
160.936 ms
161.577 ms
161.637 ms
163.421 ms
163.143 ms
167.109 ms

The good time 10/10.

b7b0f3f27 + blocknums[1]:
164.133 ms
280.920 ms
160.748 ms
163.182 ms
161.709 ms
161.998 ms
161.239 ms
276.256 ms
161.601 ms
160.384 ms

The good time 8/10, the bad time 2/10.

Thanks for checking! I bet all branches can show that flip/flop
instability in these adverse conditions, depending on random
scheduling details. I will start a new thread with a patch for the
root cause of that, ie problem #2 (this will need back-patching), and
post a fix for #3 (v17 blocknums[N] tweak affecting
fairness/likelihood, which was probably basically a bit of ill-advised
premature optimisation) here in a few days.

#44Melanie Plageman
melanieplageman@gmail.com
In reply to: Thomas Munro (#41)
Re: Streaming read-ready sequential scan code

Thank you to all of you for looking into this.

On Sat, May 18, 2024 at 12:47 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sat, May 18, 2024 at 11:30 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Andres happened to have TPC-DS handy, and reproduced that regression
in q15. We tried some stuff and figured out that it requires
parallel_leader_participation=on, ie that this looks like some kind of
parallel fairness and/or timing problem. It seems to be a question of
which worker finishes up processing matching rows, and the leader gets
a ~10ms head start but may be a little more greedy with the new
streaming code. He tried reordering the table contents and then saw
17 beat 16. So for q15, initial indications are that this isn't a
fundamental regression, it's just a test that is sensitive to some
arbitrary conditions.

I'll try to figure out some more details about that, ie is it being
too greedy on small-ish tables,

After more debugging, we learned a lot more things...

1. That query produces spectacularly bad estimates, so we finish up
having to increase the number of buckets in a parallel hash join many
times. That is quite interesting, but unrelated to new code.
2. Parallel hash join is quite slow at negotiating an increase in the
number of hash bucket, if all of the input tuples are being filtered
out by quals, because of the choice of where workers check for
PHJ_GROWTH_NEED_MORE_BUCKETS. That could be improved quite easily I
think. I have put that on my todo list 'cause that's also my code,
but it's not a new issue it's just one that is now highlighted...
3. This bit of read_stream.c is exacerbating unfairness in the
underlying scan, so that 1 and 2 come together and produce a nasty
slowdown, which goes away if you change it like so:

-       BlockNumber blocknums[16];
+       BlockNumber blocknums[1];

I will follow up after some more study.

So, if you are seeing the slow-down mostly go away by reducing
blocknums array size, does the regression only appear when the scan
data is fully in shared buffers? Or is this blocknums' other use
(dealing with short reads)?

Is your theory that one worker ends up reading 16 blocks that should
have been distributed across multiple workers?

- Melanie

#45Thomas Munro
thomas.munro@gmail.com
In reply to: Melanie Plageman (#44)
Re: Streaming read-ready sequential scan code

On Tue, May 21, 2024 at 9:11 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

So, if you are seeing the slow-down mostly go away by reducing
blocknums array size, does the regression only appear when the scan
data is fully in shared buffers? Or is this blocknums' other use
(dealing with short reads)?

That must be true (that blocknums array is normally only "filled" in
the "fast path", where all buffers are found in cache).

Is your theory that one worker ends up reading 16 blocks that should
have been distributed across multiple workers?

Yes, it just jiggles the odds around a bit, introducing a bit of extra
unfairness by calling the callback in a tighter loop to build a little
batch, revealing a pre-existing problem.
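
To make the mechanism behind #3 concrete, the all-cached fast path
effectively does something like the following (a conceptual sketch, not
the real read_stream.c code): it fills a small batch of block numbers by
calling the callback in a tight loop, and for a parallel seq scan each of
those calls is another allocation from the shared block allocator, so one
worker can swallow up to 16 allocations back-to-back right when the
allocator is trying to ramp down to small chunks near the end of the scan.

/*
 * Conceptual sketch of the blocknums[16] batching, not the real code.
 * Each loop iteration is one call into the scan's next-block callback
 * (table_block_parallelscan_nextpage() for a parallel seq scan).
 */
BlockNumber blocknums[16];
int			nblocks = 0;

while (nblocks < lengthof(blocknums))
{
	BlockNumber blkno = callback(stream, callback_private_data, NULL);

	if (blkno == InvalidBlockNumber)
		break;					/* this worker has no more blocks */
	blocknums[nblocks++] = blkno;
}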

The mistake in PHJ (problem #2 above) is that, once a worker decides
it would like all workers to stop inserting so it can increase the
number of buckets, it sets a flag to ask them to do that, and waits
for them to see it, but if there is a worker filtering all tuples out,
it never checks the "growth" flag. So it scans all the way to the end
while the other guy waits. Normally it checks that flag when it is
time to allocate a new chunk of memory, which seemed to make sense to
me at the time: if we've hit the needs-more-buckets (or
needs-more-batches) logic, then surely workers are inserting tuples
and will soon allocate a new chunk! But, of course, here is the edge
case where that isn't true: we had bad estimates so hash table too
small (problem #1), we got lots of tuples clustered over a few heap
pages and decided to expand the hash table, but right at that moment,
matching tuples ran out so somebody had to finish the whole scan
without ever checking the flag (problem #2), and that someone happened
to have all the rest of the pages because we made the lookahead a bit
less fair (problem #3). Nice confluence of problems. I expect #2 and
#3 to be easy to fix, and I didn't look at the estimation problem #1
at all (perhaps a stats puzzle designed by the TPC to trip us up?).
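
A simplified sketch of the control flow behind problem #2 (hedged: this
is not the real nodeHash.c/nodeHashjoin.c code, and build_loop,
growth_requested and the other names are invented; only
PHJ_GROWTH_NEED_MORE_BUCKETS comes from the actual code):

/*
 * Illustration only -- not the real PHJ build loop.  The growth flag is
 * consulted only on the insert path (in the real code, only when a new
 * memory chunk is allocated), so a worker whose quals reject every tuple
 * scans to the end without ever noticing the request to stop.
 */
#include <stdbool.h>
#include <stdio.h>

static bool growth_requested;   /* stands in for PHJ_GROWTH_NEED_MORE_BUCKETS */

static void
build_loop(int nblocks, bool any_tuple_matches)
{
    for (int blk = 0; blk < nblocks; blk++)
    {
        if (!any_tuple_matches)
            continue;           /* tuple filtered out: flag never checked */

        if (growth_requested)
        {
            printf("pausing at block %d to help resize buckets\n", blk);
            growth_requested = false;
        }
        /* ... insert the tuple into the hash table ... */
    }

    printf("finished scan; growth flag still set: %s\n",
           growth_requested ? "yes" : "no");
}

int
main(void)
{
    growth_requested = true;    /* another worker asked everyone to pause */
    build_loop(1000, false);    /* this worker's quals filter everything out */
    return 0;
}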

#46Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#45)
1 attachment(s)
Re: Streaming read-ready sequential scan code

Here's a really simple way to see the new unfairness at the end of a
parallel scan:

drop table if exists t;
create table t (i int);
insert into t select generate_series(1, 100000);
alter table t set (parallel_workers = 2);
set parallel_setup_cost = 0;
set parallel_leader_participation = off;
explain (analyze, buffers, verbose) select count(*) from t;

On my machine, unpatched master shows:

Worker 0: actual time=0.036..12.452 rows=51076 loops=1
Buffers: shared hit=226
Worker 1: actual time=0.037..12.003 rows=48924 loops=1
Buffers: shared hit=217

The attached patch, which I'd like to push, is effectively what
Alexander tested (blocknums[16] -> blocknums[1]). There's no point in
using an array of size 1, so I've turned it into a simple variable and
deleted the relevant comments. My machine shows:

Worker 0: actual time=0.038..12.115 rows=49946 loops=1
Buffers: shared hit=221
Worker 1: actual time=0.038..12.109 rows=50054 loops=1
Buffers: shared hit=222

That difference may not seem huge, but other pre-existing things are
going pathologically wrong in the reported query that magnify it (see
my earlier analysis). It's an interesting problem that will require
more study (my earlier analysis missed a detail that I'll write about
separately), but it doesn't seem to be new or have easy fixes, so that
will have to be for later work.

Attachments:

0001-Fix-unfairness-in-all-cached-parallel-seq-scan.patchtext/x-patch; charset=US-ASCII; name=0001-Fix-unfairness-in-all-cached-parallel-seq-scan.patchDownload
From 07ff31ad30bf9e383e42336e28143852e3793c5b Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 27 Aug 2024 15:11:53 +1200
Subject: [PATCH] Fix unfairness in all-cached parallel seq scan.

Commit b5a9b18c introduced block streaming infrastructure with a special
fast path for all-cached blocks, and commit b7b0f3f2 connected the
infrastructure up to sequential scans.  One of the fast path
optimizations had an unintended consequence: it interfered with the
underlying parallel sequential scan's block allocator, which has its own
ramp-up and ramp-down algorithm.  A scan of a small all-cached table
could give more blocks to one worker.  In some plans (probably already
very bad plans, such as the one reported by Alexander), the unfairness
could be magnified.

Now all-cached scans will call the next-block-number callback just once
each time it wants a new block, instead of trying to buffer 16 block
numbers at once.

Back-patch to 17.

Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/63a63690-dd92-c809-0b47-af05459e95d1%40gmail.com
---
 src/backend/storage/aio/read_stream.c | 82 ++++++++-------------------
 1 file changed, 24 insertions(+), 58 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index a83c18c2a4b..57d9e93c001 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -117,13 +117,10 @@ struct ReadStream
 	bool		advice_enabled;
 
 	/*
-	 * Small buffer of block numbers, useful for 'ungetting' to resolve flow
-	 * control problems when I/Os are split.  Also useful for batch-loading
-	 * block numbers in the fast path.
+	 * One-block buffer to support 'ungetting' a block number, to resolve flow
+	 * control problems when I/Os are split.
 	 */
-	BlockNumber blocknums[16];
-	int16		blocknums_count;
-	int16		blocknums_next;
+	BlockNumber buffered_blocknum;
 
 	/*
 	 * The callback that will tell us which block numbers to read, and an
@@ -167,68 +164,39 @@ get_per_buffer_data(ReadStream *stream, int16 buffer_index)
 }
 
 /*
- * Ask the callback which block it would like us to read next, with a small
- * buffer in front to allow read_stream_unget_block() to work and to allow the
- * fast path to skip this function and work directly from the array.
+ * Ask the callback which block it would like us to read next, with a one-block
+ * buffer in front to allow read_stream_unget_block() to work.
  */
 static inline BlockNumber
 read_stream_get_block(ReadStream *stream, void *per_buffer_data)
 {
-	if (stream->blocknums_next < stream->blocknums_count)
-		return stream->blocknums[stream->blocknums_next++];
+	BlockNumber blocknum;
 
-	/*
-	 * We only bother to fetch one at a time here (but see the fast path which
-	 * uses more).
-	 */
-	return stream->callback(stream,
-							stream->callback_private_data,
-							per_buffer_data);
+	blocknum = stream->buffered_blocknum;
+	if (blocknum != InvalidBlockNumber)
+		stream->buffered_blocknum = InvalidBlockNumber;
+	else
+		blocknum = stream->callback(stream,
+									stream->callback_private_data,
+									per_buffer_data);
+
+	return blocknum;
 }
 
 /*
  * In order to deal with short reads in StartReadBuffers(), we sometimes need
- * to defer handling of a block until later.
+ * to defer handling of a block number we've already received from the callback
+ * until later.
  */
 static inline void
 read_stream_unget_block(ReadStream *stream, BlockNumber blocknum)
 {
-	if (stream->blocknums_next == stream->blocknums_count)
-	{
-		/* Never initialized or entirely consumed.  Re-initialize. */
-		stream->blocknums[0] = blocknum;
-		stream->blocknums_count = 1;
-		stream->blocknums_next = 0;
-	}
-	else
-	{
-		/* Must be the last value return from blocknums array. */
-		Assert(stream->blocknums_next > 0);
-		stream->blocknums_next--;
-		Assert(stream->blocknums[stream->blocknums_next] == blocknum);
-	}
+	/* We shouldn't ever unget more than one block. */
+	Assert(stream->buffered_blocknum == InvalidBlockNumber);
+	Assert(blocknum != InvalidBlockNumber);
+	stream->buffered_blocknum = blocknum;
 }
 
-#ifndef READ_STREAM_DISABLE_FAST_PATH
-static void
-read_stream_fill_blocknums(ReadStream *stream)
-{
-	BlockNumber blocknum;
-	int			i = 0;
-
-	do
-	{
-		blocknum = stream->callback(stream,
-									stream->callback_private_data,
-									NULL);
-		stream->blocknums[i++] = blocknum;
-	} while (i < lengthof(stream->blocknums) &&
-			 blocknum != InvalidBlockNumber);
-	stream->blocknums_count = i;
-	stream->blocknums_next = 0;
-}
-#endif
-
 static void
 read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
 {
@@ -530,6 +498,7 @@ read_stream_begin_impl(int flags,
 	stream->queue_size = queue_size;
 	stream->callback = callback;
 	stream->callback_private_data = callback_private_data;
+	stream->buffered_blocknum = InvalidBlockNumber;
 
 	/*
 	 * Skip the initial ramp-up phase if the caller says we're going to be
@@ -649,9 +618,7 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 		Assert(buffer != InvalidBuffer);
 
 		/* Choose the next block to pin. */
-		if (unlikely(stream->blocknums_next == stream->blocknums_count))
-			read_stream_fill_blocknums(stream);
-		next_blocknum = stream->blocknums[stream->blocknums_next++];
+		next_blocknum = read_stream_get_block(stream, NULL);
 
 		if (likely(next_blocknum != InvalidBlockNumber))
 		{
@@ -828,8 +795,7 @@ read_stream_reset(ReadStream *stream)
 	stream->distance = 0;
 
 	/* Forget buffered block numbers and fast path state. */
-	stream->blocknums_next = 0;
-	stream->blocknums_count = 0;
+	stream->buffered_blocknum = InvalidBlockNumber;
 	stream->fast_path = false;
 
 	/* Unpin anything that wasn't consumed. */
-- 
2.46.0

#47Alexander Lakhin
exclusion@gmail.com
In reply to: Thomas Munro (#46)
Re: Streaming read-ready sequential scan code

Hello Thomas,

27.08.2024 09:52, Thomas Munro wrote:

Here's a really simple way to see the new unfairness at the end of a
parallel scan:

drop table if exists t;
create table t (i int);
insert into t select generate_series(1, 100000);
alter table t set (parallel_workers = 2);
set parallel_setup_cost = 0;
set parallel_leader_participation = off;
explain (analyze, buffers, verbose) select count(*) from t;

On my machine, unpatched master shows:

Worker 0: actual time=0.036..12.452 rows=51076 loops=1
Buffers: shared hit=226
Worker 1: actual time=0.037..12.003 rows=48924 loops=1
Buffers: shared hit=217

The attached patch, which I'd like to push, is effectively what
Alexander tested (blocknums[16] -> blocknums[1]). There's no point in
using an array of size 1, so I've turned it into a simple variable and
deleted the relevant comments. My machine shows:

Worker 0: actual time=0.038..12.115 rows=49946 loops=1
Buffers: shared hit=221
Worker 1: actual time=0.038..12.109 rows=50054 loops=1
Buffers: shared hit=222

That difference may not seem huge, but other pre-existing things are
going pathologically wrong in the reported query that magnify it (see
my earlier analysis). It's an interesting problem that will require
more study (my earlier analysis missed a detail that I'll write about
separately), but it doesn't seem to be new or have easy fixes, so that
will have to be for later work.

I've tried your query and could not get stable results, unfortunately.
The following script:
rm -rf "$PGDATA"; initdb -D "$PGDATA" >initdb.log 2>&1

pg_ctl -s -l server.log start

cat << EOF | psql | grep 'Parallel Seq Scan' -A10 | grep 'Worker ' -A1
create table t (i int);
insert into t select generate_series(1, 100000);
alter table t set (parallel_workers = 2);
set parallel_setup_cost = 0;
set parallel_leader_participation = off;
explain (analyze, buffers, verbose) select count(*) from t;
EOF

pg_ctl -s stop

gives me unstable numbers on unpatched master:
                     Worker 0:  actual time=0.024..5.814 rows=51076 loops=1
                       Buffers: shared hit=226
                     Worker 1:  actual time=0.023..5.614 rows=48924 loops=1
                       Buffers: shared hit=217
---
                     Worker 0:  actual time=0.027..5.130 rows=36612 loops=1
                       Buffers: shared hit=162
                     Worker 1:  actual time=0.013..5.605 rows=63388 loops=1
                       Buffers: shared hit=281
---
                     Worker 0:  actual time=0.025..5.447 rows=47460 loops=1
                       Buffers: shared hit=210
                     Worker 1:  actual time=0.019..5.688 rows=52540 loops=1
                       Buffers: shared hit=233

and also with the patch applied:
                     Worker 0:  actual time=0.012..4.486 rows=55478 loops=1
                       Buffers: shared hit=246
                     Worker 1:  actual time=0.014..4.430 rows=44522 loops=1
                       Buffers: shared hit=197
---
                     Worker 0:  actual time=0.013..4.269 rows=55822 loops=1
                       Buffers: shared hit=247
                     Worker 1:  actual time=0.017..4.238 rows=44178 loops=1
                       Buffers: shared hit=196
---
                     Worker 0:  actual time=0.014..4.974 rows=50624 loops=1
                       Buffers: shared hit=224
                     Worker 1:  actual time=0.016..4.932 rows=49376 loops=1
                       Buffers: shared hit=219
---
                     Worker 0:  actual time=0.012..5.459 rows=65648 loops=1
                       Buffers: shared hit=291
                     Worker 1:  actual time=0.022..5.109 rows=34352 loops=1
                       Buffers: shared hit=152

Please correct me if I'm doing something wrong.

Best regards,
Alexander

#48Thomas Munro
thomas.munro@gmail.com
In reply to: Alexander Lakhin (#47)
Re: Streaming read-ready sequential scan code

On Wed, Aug 28, 2024 at 1:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:

gives me unstable numbers on unpatched master:
Buffers: shared hit=226
Buffers: shared hit=217

Buffers: shared hit=162
Buffers: shared hit=281

Buffers: shared hit=210
Buffers: shared hit=233

and also with the patch applied:
Buffers: shared hit=246
Buffers: shared hit=197

Buffers: shared hit=247
Buffers: shared hit=196

Buffers: shared hit=224
Buffers: shared hit=219

Buffers: shared hit=291
Buffers: shared hit=152

Please correct me if I'm doing something wrong.

Huh. I can reproduce what I showed with pretty low variance, on my
FreeBSD, Linux and macOS systems. I included
parallel_leader_participation=off so that the workers would hopefully
start as closely together in time as possible, and hopefully allow
only a block or so of variation in the outcome. If I do:

create or replace function f(i int) returns int language plpgsql
parallel safe as $$
begin
raise notice '% pid %', clock_timestamp(), pg_backend_pid();
return i;
end;
$$;

then

postgres=# explain (analyze) select f(i) from t limit 1;
NOTICE: 2024-08-28 16:41:32.845747+12 pid 47019
NOTICE: 2024-08-28 16:41:32.845746+12 pid 47018

shows start times differ by only a few microseconds at most, often 0.
I wonder if your system is more variable at the beginning of execution?
Maybe you're on a multi-socket system and forking/startup times vary
in some way I can't see on small systems or something like that?

#49Alexander Lakhin
exclusion@gmail.com
In reply to: Thomas Munro (#48)
Re: Streaming read-ready sequential scan code

28.08.2024 07:45, Thomas Munro wrote:

Huh. I can reproduce what I showed with pretty low variance, on my
FreeBSD, Linux and macOS systems. I included
parallel_leader_participation=off so that the workers would hopefully
start as closely together in time as possible, and hopefully allow
only a block or so of variation in the outcome. If I do:

create or replace function f(i int) returns int language plpgsql
parallel safe as $$
begin
raise notice '% pid %', clock_timestamp(), pg_backend_pid();
return i;
end;
$$;

then

postgres=# explain (analyze) select f(i) from t limit 1;
NOTICE: 2024-08-28 16:41:32.845747+12 pid 47019
NOTICE: 2024-08-28 16:41:32.845746+12 pid 47018

shows start times differ by only a few microseconds at most, often 0.
I wonder if your system is more variable at the beginning of execution?
Maybe you're on a multi-socket system and forking/startup times vary
in some way I can't see on small systems or something like that?

I use a single-socket system with an AMD Ryzen 5900X, running Ubuntu 20.04.
With no active background tasks running, I'm seeing:
NOTICE:  2024-08-28 08:17:36.917162+00 pid 320103
NOTICE:  2024-08-28 08:17:36.917163+00 pid 320102

NOTICE:  2024-08-28 08:17:40.592333+00 pid 320143
NOTICE:  2024-08-28 08:17:40.592645+00 pid 320144

With log_min_messages = DEBUG3, I get:
NOTICE:  2024-08-28 08:41:59.309364+00 pid 338263
NOTICE:  2024-08-28 08:41:59.310079+00 pid 338264

And the following messages in the server.log:
2024-08-28 08:41:59.304 UTC [338251] DEBUG:  starting background worker process "parallel worker for PID 338262"
2024-08-28 08:41:59.304 UTC [338251] DEBUG:  starting background worker process "parallel worker for PID 338262"
2024-08-28 08:41:59.305 UTC [338263] DEBUG:  InitPostgres
2024-08-28 08:41:59.305 UTC [338264] DEBUG:  InitPostgres
2024-08-28 08:41:59.309 UTC [338263] NOTICE:  2024-08-28 08:41:59.309364+00 pid 338263
2024-08-28 08:41:59.309 UTC [338263] CONTEXT:  PL/pgSQL function f(integer) line 3 at RAISE
2024-08-28 08:41:59.309 UTC [338262] NOTICE:  2024-08-28 08:41:59.309364+00 pid 338263
2024-08-28 08:41:59.309 UTC [338262] CONTEXT:  PL/pgSQL function f(integer) line 3 at RAISE
    parallel worker
2024-08-28 08:41:59.309 UTC [338263] DEBUG:  shmem_exit(0): 5 before_shmem_exit callbacks to make
2024-08-28 08:41:59.309 UTC [338263] DEBUG:  shmem_exit(0): 6 on_shmem_exit callbacks to make
2024-08-28 08:41:59.309 UTC [338263] DEBUG:  proc_exit(0): 1 callbacks to make
2024-08-28 08:41:59.309 UTC [338263] DEBUG:  exit(0)
2024-08-28 08:41:59.309 UTC [338263] DEBUG:  shmem_exit(-1): 0 before_shmem_exit callbacks to make
2024-08-28 08:41:59.309 UTC [338263] DEBUG:  shmem_exit(-1): 0 on_shmem_exit callbacks to make
2024-08-28 08:41:59.309 UTC [338263] DEBUG:  proc_exit(-1): 0 callbacks to make
2024-08-28 08:41:59.310 UTC [338264] NOTICE:  2024-08-28 08:41:59.310079+00 pid 338264
2024-08-28 08:41:59.310 UTC [338264] CONTEXT:  PL/pgSQL function f(integer) line 3 at RAISE

It looks like the two parallel workers were started simultaneously, but
then the second one lagged behind...

Best regards,
Alexander

#50Thomas Munro
thomas.munro@gmail.com
In reply to: Alexander Lakhin (#49)
Re: Streaming read-ready sequential scan code

On Wed, Aug 28, 2024 at 9:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:

2024-08-28 08:41:59.304 UTC [338251] DEBUG: starting background worker process "parallel worker for PID 338262"
2024-08-28 08:41:59.304 UTC [338251] DEBUG: starting background worker process "parallel worker for PID 338262"

2024-08-28 08:41:59.305 UTC [338263] DEBUG: InitPostgres
2024-08-28 08:41:59.305 UTC [338264] DEBUG: InitPostgres

2024-08-28 08:41:59.309 UTC [338263] NOTICE: 2024-08-28 08:41:59.309364+00 pid 338263
2024-08-28 08:41:59.310 UTC [338264] NOTICE: 2024-08-28 08:41:59.310079+00 pid 338264

It looks like the two parallel workers were started simultaneously, but
then the second one lagged behind...

Yeah. That's quite interesting, and must destabilise that
simple-minded demo. I'm curious to know exactly what contention is
causing that (about 3/4 of a millisecond that I don't see and now I
want to know what it's waiting for), but it's a very crude test
lacking timer resolution in the earlier messages, and it's an
unrelated topic and a distraction. Perhaps it explains why you saw
two different behaviours in Q15 with the patch and I didn't, though.
Really it shouldn't be so sensitive to such variations, it's obviously
a terrible plan, and TPC-DS needs a planner hacker mega-brain applied
to it; I'm going to try to nerd-snipe one...

#51Alexander Lakhin
exclusion@gmail.com
In reply to: Thomas Munro (#50)
Re: Streaming read-ready sequential scan code

Hello Thomas,

29.08.2024 01:16, Thomas Munro wrote:

Yeah. That's quite interesting, and must destabilise that
simple-minded demo. I'm curious to know exactly what contention is
causing that (about 3/4 of a millisecond that I don't see and now I
want to know what it's waiting for), but it's a very crude test
lacking timer resolution in the earlier messages, and it's an
unrelated topic and a distraction. Perhaps it explains why you saw
two different behaviours in Q15 with the patch and I didn't, though.
Really it shouldn't be so sensitive to such variations, it's obviously
a terrible plan, and TPC-DS needs a planner hacker mega-brain applied
to it; I'm going to try to nerd-snipe one...

I looked at two perf profiles of such out-of-sync processes and found no
extra calls whatsoever in the slow one; it just has the number of perf
samples increased proportionally. It made me suspect CPU frequency
scaling... Indeed, with the "performance" governor set and the boost mode
disabled, I'm now seeing much more stable numbers (I do this tuning before
running performance tests, but I had forgotten about that when I ran your
test, my bad).

I'm sorry for the noise and the distraction.

Still, now I can confirm your results. Without the patch, two parallel
workers gave "Buffers: shared hit=217 / Buffers: shared hit=226" 10 times
out of 10. And with the patch, I've got
"shared hit=219 / shared hit=224" 3 times,
"shared hit=220 / shared hit=223" 4 times,
"shared hit=221 / shared hit=222" 3 times of 10.

On b7b0f3f27~1, my results are:
"shared hit=219 / shared hit=224": 2
"shared hit=220 / shared hit=223": 3
"shared hit=221 / shared hit=222": 4
"shared hit=218 / shared hit=225": 1

Best regards,
Alexander

#52Thomas Munro
thomas.munro@gmail.com
In reply to: Alexander Lakhin (#51)
Re: Streaming read-ready sequential scan code

On Fri, Aug 30, 2024 at 1:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:

I looked at two perf profiles of such out-of-sync processes and found no
extra calls whatsoever in the slow one; it just has the number of perf
samples increased proportionally. It made me suspect CPU frequency
scaling... Indeed, with the "performance" governor set and the boost mode
disabled, I'm now seeing much more stable numbers (I do this tuning before
running performance tests, but I had forgotten about that when I ran your
test, my bad).

Aha, mystery solved.

I have pushed the fix. Thanks!