Use streaming read API in ANALYZE

Started by Nazir Bilal Yavuz · almost 2 years ago · 39 messages
#1 Nazir Bilal Yavuz
byavuz81@gmail.com
2 attachment(s)

Hi,

I worked on using the currently proposed streaming read API [1] in ANALYZE.
The patch is attached. 0001 contains the not-yet-merged streaming read API
changes that can be applied to master; 0002 is the actual code.

The blocks to analyze are obtained by using the streaming read API now.

- Since the streaming read API already does prefetching, I removed the
#ifdef USE_PREFETCH code from acquire_sample_rows().

- Changed 'while (BlockSampler_HasMore(&bs))' to 'while (nblocks)' because
the prefetch mechanism in the streaming read API will advance 'bs' before
returning buffers.

- Removed BlockNumber and BufferAccessStrategy from the declaration of
scan_analyze_next_block(), passing a PgStreamingRead (pgsr) instead; see the
condensed sketch below.
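
For quick review, here is the consumer-side flow condensed from the attached
0002 patch (error handling and the tableam indirection through
table_scan_analyze_next_block() are omitted):

/* Callback: the stream asks which block number to read next. */
static BlockNumber
pg_block_sampling_streaming_read_next(PgStreamingRead *pgsr,
                                      void *pgsr_private,
                                      void *per_io_data)
{
	BlockSamplerData *bs = pgsr_private;
	BlockNumber *current_block = per_io_data;

	if (BlockSampler_HasMore(bs))
		*current_block = BlockSampler_Next(bs);
	else
		*current_block = InvalidBlockNumber;	/* end of stream */

	return *current_block;
}

/* In acquire_sample_rows(): set up the stream, then consume buffers. */
pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
                                      &bs,	/* callback state */
                                      sizeof(BlockSamplerData),
                                      vac_strategy,
                                      BMR_REL(scan->rs_rd),
                                      MAIN_FORKNUM,
                                      pg_block_sampling_streaming_read_next);

while (nblocks)
{
	/* Returns the next pinned buffer; prefetching happens inside. */
	buf = pg_streaming_read_buffer_get_next(pgsr, (void **) &current_block);
	/* ... lock the buffer, sample its tuples, release the pin ... */
	nblocks--;
}
pg_streaming_read_free(pgsr);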

I counted syscalls while analyzing a ~5GB table. The patched version made
~1300 fewer read calls.
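
The tables below are strace's -c summary output; an invocation along these
lines (illustrative, attached to the backend session that runs ANALYZE)
collects them:

strace -c -p <backend_pid>    # then run ANALYZE in that session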

Patched:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 39.67    0.012128           0     29809           pwrite64
 36.96    0.011299           0     28594           pread64
 23.24    0.007104           0     27611           fadvise64

Master (21a71648d3):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 38.94    0.016457           0     29816           pwrite64
 36.79    0.015549           0     29850           pread64
 23.91    0.010106           0     29848           fadvise64

Any kind of feedback would be appreciated.

[1]: /messages/by-id/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v1-0001-Streaming-read-API-changes-that-are-not-committed.patch (text/x-diff)
From 509e55997c084f831fcbcb46cabe79d4f97aa93e Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 22 Jul 2023 17:31:54 +1200
Subject: [PATCH v1 1/2] Streaming read API changes that are not committed to
 master yet

Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
---
 src/include/storage/bufmgr.h             |  22 +
 src/include/storage/streaming_read.h     |  52 ++
 src/backend/storage/Makefile             |   2 +-
 src/backend/storage/aio/Makefile         |  14 +
 src/backend/storage/aio/meson.build      |   5 +
 src/backend/storage/aio/streaming_read.c | 472 ++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c      | 592 +++++++++++++++--------
 src/backend/storage/buffer/localbuf.c    |  14 +-
 src/backend/storage/meson.build          |   1 +
 src/tools/pgindent/typedefs.list         |   2 +
 10 files changed, 953 insertions(+), 223 deletions(-)
 create mode 100644 src/include/storage/streaming_read.h
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c

diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..a38f1acb37a 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
+/*
+ * Maximum number of buffers for multi-buffer I/O functions.  This is set to
+ * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a smaller maximum.
+ */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
 
 /*
  * prototypes for functions in bufmgr.c
@@ -177,6 +183,18 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+extern Buffer PrepareReadBuffer(BufferManagerRelation bmr,
+								ForkNumber forkNum,
+								BlockNumber blockNum,
+								BufferAccessStrategy strategy,
+								bool *foundPtr);
+extern void CompleteReadBuffers(BufferManagerRelation bmr,
+								Buffer *buffers,
+								ForkNumber forknum,
+								BlockNumber blocknum,
+								int nblocks,
+								bool zero_on_error,
+								BufferAccessStrategy strategy);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -247,9 +265,13 @@ extern void LockBufferForCleanup(Buffer buffer);
 extern bool ConditionalLockBufferForCleanup(Buffer buffer);
 extern bool IsBufferCleanupOK(Buffer buffer);
 extern bool HoldingBufferPinThatDelaysRecovery(void);
+extern void ZeroBuffer(Buffer buffer, ReadBufferMode mode);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 00000000000..c4d3892bb26
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,52 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define PGSR_FLAG_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users.
+ */
+#define PGSR_FLAG_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define PGSR_FLAG_SEQUENTIAL 0x02
+
+/*
+ * We usually ramp up from smaller reads to larger ones, to support users who
+ * don't know if it's worth reading lots of buffers yet.  This flag disables
+ * that, declaring ahead of time that we'll be reading all available buffers.
+ */
+#define PGSR_FLAG_FULL 0x04
+
+struct PgStreamingRead;
+typedef struct PgStreamingRead PgStreamingRead;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*PgStreamingReadBufferCB) (PgStreamingRead *pgsr,
+												void *pgsr_private,
+												void *per_buffer_private);
+
+extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags,
+													   void *pgsr_private,
+													   size_t per_buffer_private_size,
+													   BufferAccessStrategy strategy,
+													   BufferManagerRelation bmr,
+													   ForkNumber forknum,
+													   PgStreamingReadBufferCB next_block_cb);
+
+extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
+extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
+extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+
+#endif
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 00000000000..bcab44c802f
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 00000000000..39aef2a84a2
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 00000000000..bb5b863be76
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,472 @@
+#include "postgres.h"
+
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+
+/*
+ * Element type for PgStreamingRead's circular array of block ranges.
+ *
+ * For hits, need_complete is false and there is just one block per
+ * range, already pinned and ready for use.
+ *
+ * For misses, need_complete is true and buffers[] holds a range of
+ * blocks that are contiguous in storage (though the buffers may not be
+ * contiguous in memory), so we can complete them with a single call to
+ * CompleteReadBuffers().
+ */
+typedef struct PgStreamingReadRange
+{
+	bool		advice_issued;
+	bool		need_complete;
+	BlockNumber blocknum;
+	int			nblocks;
+	int			per_buffer_data_index[MAX_BUFFERS_PER_TRANSFER];
+	Buffer		buffers[MAX_BUFFERS_PER_TRANSFER];
+} PgStreamingReadRange;
+
+struct PgStreamingRead
+{
+	int			max_ios;
+	int			ios_in_progress;
+	int			ios_in_progress_trigger;
+	int			max_pinned_buffers;
+	int			pinned_buffers;
+	int			pinned_buffers_trigger;
+	int			next_tail_buffer;
+	int			ramp_up_pin_limit;
+	int			ramp_up_pin_stall;
+	bool		finished;
+	void	   *pgsr_private;
+	PgStreamingReadBufferCB callback;
+	BufferAccessStrategy strategy;
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+
+	bool		advice_enabled;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* Space for optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+	int			per_buffer_data_next;
+
+	/* Circular buffer of ranges. */
+	int			size;
+	int			head;
+	int			tail;
+	PgStreamingReadRange ranges[FLEXIBLE_ARRAY_MEMBER];
+};
+
+static PgStreamingRead *
+pg_streaming_read_buffer_alloc_internal(int flags,
+										void *pgsr_private,
+										size_t per_buffer_data_size,
+										BufferAccessStrategy strategy)
+{
+	PgStreamingRead *pgsr;
+	int			size;
+	int			max_ios;
+	uint32		max_pinned_buffers;
+
+
+	/*
+	 * Decide how many assumed I/Os we will allow to run concurrently.  That
+	 * is, advice to the kernel to tell it that we will soon read.  This
+	 * number also affects how far we look ahead for opportunities to start
+	 * more I/Os.
+	 */
+	if (flags & PGSR_FLAG_MAINTENANCE)
+		max_ios = maintenance_io_concurrency;
+	else
+		max_ios = effective_io_concurrency;
+
+	/*
+	 * The desired level of I/O concurrency controls how far we are willing
+	 * to look ahead.  We also clamp it to at least MAX_BUFFERS_PER_TRANSFER
+	 * so that we can have a chance to build up a full-sized read, even when
+	 * max_ios is zero.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, MAX_BUFFERS_PER_TRANSFER);
+
+	/*
+	 * With the *_io_concurrency GUCs, we might have 0.  We want to allow at
+	 * least one, to keep our gating logic simple.
+	 */
+	max_ios = Max(max_ios, 1);
+
+	/*
+	 * Don't allow this backend to pin too many buffers.  For now we'll apply
+	 * the limit for the shared buffer pool and the local buffer pool, without
+	 * worrying which it is.
+	 */
+	LimitAdditionalPins(&max_pinned_buffers);
+	LimitAdditionalLocalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * pgsr->ranges is a circular buffer.  When it is empty, head == tail.
+	 * When it is full, there is an empty element between head and tail.  Head
+	 * can also be empty (nblocks == 0), therefore we need two extra elements
+	 * for non-occupied ranges, on top of max_pinned_buffers to allow for the
+	 * maximum possible number of occupied ranges of the smallest possible
+	 * size of one.
+	 */
+	size = max_pinned_buffers + 2;
+
+	pgsr = (PgStreamingRead *)
+		palloc0(offsetof(PgStreamingRead, ranges) +
+				sizeof(pgsr->ranges[0]) * size);
+
+	pgsr->max_ios = max_ios;
+	pgsr->per_buffer_data_size = per_buffer_data_size;
+	pgsr->max_pinned_buffers = max_pinned_buffers;
+	pgsr->pgsr_private = pgsr_private;
+	pgsr->strategy = strategy;
+	pgsr->size = size;
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  As long as direct I/O isn't
+	 * enabled, and the caller hasn't promised sequential access, we can use
+	 * it.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & PGSR_FLAG_SEQUENTIAL) == 0)
+		pgsr->advice_enabled = true;
+#endif
+
+	/*
+	 * We start off building small ranges, but double that quickly, for the
+	 * benefit of users that don't know how far ahead they'll read.  This can
+	 * be disabled by users that already know they'll read all the way.
+	 */
+	if (flags & PGSR_FLAG_FULL)
+		pgsr->ramp_up_pin_limit = INT_MAX;
+	else
+		pgsr->ramp_up_pin_limit = 1;
+
+	/*
+	 * We want to avoid creating ranges that are smaller than they could be
+	 * just because we hit max_pinned_buffers.  We only look ahead when the
+	 * number of pinned buffers falls below this trigger number, or put
+	 * another way, we stop looking ahead when we wouldn't be able to build a
+	 * "full sized" range.
+	 */
+	pgsr->pinned_buffers_trigger =
+		Max(1, (int) max_pinned_buffers - MAX_BUFFERS_PER_TRANSFER);
+
+	/* Space for the callback to store extra data along with each block. */
+	if (per_buffer_data_size)
+		pgsr->per_buffer_data = palloc(per_buffer_data_size * max_pinned_buffers);
+
+	return pgsr;
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.
+ */
+PgStreamingRead *
+pg_streaming_read_buffer_alloc(int flags,
+							   void *pgsr_private,
+							   size_t per_buffer_data_size,
+							   BufferAccessStrategy strategy,
+							   BufferManagerRelation bmr,
+							   ForkNumber forknum,
+							   PgStreamingReadBufferCB next_block_cb)
+{
+	PgStreamingRead *result;
+
+	result = pg_streaming_read_buffer_alloc_internal(flags,
+													 pgsr_private,
+													 per_buffer_data_size,
+													 strategy);
+	result->callback = next_block_cb;
+	result->bmr = bmr;
+	result->forknum = forknum;
+
+	return result;
+}
+
+/*
+ * Start building a new range.  This is called after the previous one
+ * reached maximum size, or the callback's next block can't be merged with it.
+ *
+ * Since the previous head range has now reached its full potential size, this
+ * is also a good time to issue 'prefetch' advice, because we know we'll
+ * soon be reading.  In future, we could start an actual I/O here.
+ */
+static PgStreamingReadRange *
+pg_streaming_read_new_range(PgStreamingRead *pgsr)
+{
+	PgStreamingReadRange *head_range;
+
+	head_range = &pgsr->ranges[pgsr->head];
+	Assert(head_range->nblocks > 0);
+
+	/*
+	 * If a call to CompleteReadBuffers() will be needed, we can issue
+	 * advice to the kernel to get the read started.  We suppress it if the
+	 * access pattern appears to be completely sequential, though, because on
+	 * some systems that interferes with the kernel's own sequential
+	 * read-ahead heuristics and hurts performance.
+	 */
+	if (pgsr->advice_enabled)
+	{
+		BlockNumber blocknum = head_range->blocknum;
+		int			nblocks = head_range->nblocks;
+
+		if (head_range->need_complete && blocknum != pgsr->seq_blocknum)
+		{
+			SMgrRelation smgr =
+				pgsr->bmr.smgr ? pgsr->bmr.smgr :
+				RelationGetSmgr(pgsr->bmr.rel);
+
+			Assert(!head_range->advice_issued);
+
+			smgrprefetch(smgr, pgsr->forknum, blocknum, nblocks);
+
+			/*
+			 * Count this as an I/O that is concurrently in progress, though
+			 * we don't really know if the kernel generates a physical I/O.
+			 */
+			head_range->advice_issued = true;
+			pgsr->ios_in_progress++;
+		}
+
+		/* Remember the block after this range, for sequence detection. */
+		pgsr->seq_blocknum = blocknum + nblocks;
+	}
+
+	/* Create a new head range.  There must be space. */
+	Assert(pgsr->size > pgsr->max_pinned_buffers);
+	Assert((pgsr->head + 1) % pgsr->size != pgsr->tail);
+	if (++pgsr->head == pgsr->size)
+		pgsr->head = 0;
+	head_range = &pgsr->ranges[pgsr->head];
+	head_range->nblocks = 0;
+
+	return head_range;
+}
+
+static void
+pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
+{
+	/*
+	 * If we're still ramping up, we may have to stall to wait for buffers to
+	 * be consumed first before we do any more prefetching.
+	 */
+	if (pgsr->ramp_up_pin_stall > 0)
+	{
+		Assert(pgsr->pinned_buffers > 0);
+		return;
+	}
+
+	/*
+	 * If we're finished or can't start more I/O, then don't look ahead.
+	 */
+	if (pgsr->finished || pgsr->ios_in_progress == pgsr->max_ios)
+		return;
+
+	/*
+	 * We'll also wait until the number of pinned buffers falls below our
+	 * trigger level, so that we have the chance to create a full range.
+	 */
+	if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+		return;
+
+	do
+	{
+		BufferManagerRelation bmr;
+		ForkNumber	forknum;
+		BlockNumber blocknum;
+		Buffer		buffer;
+		bool		found;
+		bool		need_complete;
+		PgStreamingReadRange *head_range;
+		void	   *per_buffer_data;
+
+		/* Do we have a full-sized range? */
+		head_range = &pgsr->ranges[pgsr->head];
+		if (head_range->nblocks == lengthof(head_range->buffers))
+		{
+			Assert(head_range->need_complete);
+			head_range = pg_streaming_read_new_range(pgsr);
+
+			/*
+			 * Give up now if I/O is saturated, or we wouldn't be able to form
+			 * another full range after this due to the pin limit.
+			 */
+			if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger ||
+				pgsr->ios_in_progress == pgsr->max_ios)
+				break;
+		}
+
+		per_buffer_data = (char *) pgsr->per_buffer_data +
+			pgsr->per_buffer_data_size * pgsr->per_buffer_data_next;
+
+		/* Find out which block the callback wants to read next. */
+		blocknum = pgsr->callback(pgsr, pgsr->pgsr_private, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			pgsr->finished = true;
+			break;
+		}
+		bmr = pgsr->bmr;
+		forknum = pgsr->forknum;
+
+		Assert(pgsr->pinned_buffers < pgsr->max_pinned_buffers);
+
+		buffer = PrepareReadBuffer(bmr,
+								   forknum,
+								   blocknum,
+								   pgsr->strategy,
+								   &found);
+		pgsr->pinned_buffers++;
+
+		need_complete = !found;
+
+		/* Is there a head range that we can't extend? */
+		head_range = &pgsr->ranges[pgsr->head];
+		if (head_range->nblocks > 0 &&
+			(!need_complete ||
+			 !head_range->need_complete ||
+			 head_range->blocknum + head_range->nblocks != blocknum))
+		{
+			/* Yes, time to start building a new one. */
+			head_range = pg_streaming_read_new_range(pgsr);
+			Assert(head_range->nblocks == 0);
+		}
+
+		if (head_range->nblocks == 0)
+		{
+			/* Initialize a new range beginning at this block. */
+			head_range->blocknum = blocknum;
+			head_range->need_complete = need_complete;
+			head_range->advice_issued = false;
+		}
+		else
+		{
+			/* We can extend an existing range by one block. */
+			Assert(head_range->blocknum + head_range->nblocks == blocknum);
+			Assert(head_range->need_complete);
+		}
+
+		head_range->per_buffer_data_index[head_range->nblocks] = pgsr->per_buffer_data_next++;
+		head_range->buffers[head_range->nblocks] = buffer;
+		head_range->nblocks++;
+
+		if (pgsr->per_buffer_data_next == pgsr->max_pinned_buffers)
+			pgsr->per_buffer_data_next = 0;
+
+	} while (pgsr->pinned_buffers < pgsr->max_pinned_buffers &&
+			 pgsr->ios_in_progress < pgsr->max_ios &&
+			 pgsr->pinned_buffers < pgsr->ramp_up_pin_limit);
+
+	/* If we've hit the ramp-up limit, insert a stall. */
+	if (pgsr->pinned_buffers >= pgsr->ramp_up_pin_limit)
+	{
+		/* Can't get here if an earlier stall hasn't finished. */
+		Assert(pgsr->ramp_up_pin_stall == 0);
+		/* Don't do any more prefetching until these buffers are consumed. */
+		pgsr->ramp_up_pin_stall = pgsr->ramp_up_pin_limit;
+		/* Double it.  It will soon be out of the way. */
+		pgsr->ramp_up_pin_limit *= 2;
+	}
+
+	if (pgsr->ranges[pgsr->head].nblocks > 0)
+		pg_streaming_read_new_range(pgsr);
+}
+
+Buffer
+pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
+{
+	pg_streaming_read_look_ahead(pgsr);
+
+	/* See if we have one buffer to return. */
+	while (pgsr->tail != pgsr->head)
+	{
+		PgStreamingReadRange *tail_range;
+
+		tail_range = &pgsr->ranges[pgsr->tail];
+
+		/*
+		 * Do we need to perform an I/O before returning the buffers from this
+		 * range?
+		 */
+		if (tail_range->need_complete)
+		{
+			CompleteReadBuffers(pgsr->bmr,
+								tail_range->buffers,
+								pgsr->forknum,
+								tail_range->blocknum,
+								tail_range->nblocks,
+								false,
+								pgsr->strategy);
+			tail_range->need_complete = false;
+
+			/*
+			 * We don't really know if the kernel generated a physical I/O
+			 * when we issued advice, let alone when it finished, but it has
+			 * certainly finished after a read call returns.
+			 */
+			if (tail_range->advice_issued)
+				pgsr->ios_in_progress--;
+		}
+
+		/* Are there more buffers available in this range? */
+		if (pgsr->next_tail_buffer < tail_range->nblocks)
+		{
+			int			buffer_index;
+			Buffer		buffer;
+
+			buffer_index = pgsr->next_tail_buffer++;
+			buffer = tail_range->buffers[buffer_index];
+
+			Assert(BufferIsValid(buffer));
+
+			/* We are giving away ownership of this pinned buffer. */
+			Assert(pgsr->pinned_buffers > 0);
+			pgsr->pinned_buffers--;
+
+			if (pgsr->ramp_up_pin_stall > 0)
+				pgsr->ramp_up_pin_stall--;
+
+			if (per_buffer_data)
+				*per_buffer_data = (char *) pgsr->per_buffer_data +
+					tail_range->per_buffer_data_index[buffer_index] *
+					pgsr->per_buffer_data_size;
+
+			return buffer;
+		}
+
+		/* Advance tail to next range, if there is one. */
+		if (++pgsr->tail == pgsr->size)
+			pgsr->tail = 0;
+		pgsr->next_tail_buffer = 0;
+	}
+
+	Assert(pgsr->pinned_buffers == 0);
+
+	return InvalidBuffer;
+}
+
+void
+pg_streaming_read_free(PgStreamingRead *pgsr)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead, and unpin anything that wasn't consumed. */
+	pgsr->finished = true;
+	while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	if (pgsr->per_buffer_data)
+		pfree(pgsr->per_buffer_data);
+	pfree(pgsr);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 07575ef3129..ef5367c6cc7 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -472,7 +472,7 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
 								ReadBufferMode mode, BufferAccessStrategy strategy,
 								bool *hit);
@@ -795,15 +795,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
+	buf = ReadBuffer_common(BMR_REL(reln),
 							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+
 	return buf;
 }
 
@@ -827,8 +821,9 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 
 	SMgrRelation smgr = smgropen(rlocator, InvalidBackendId);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -998,7 +993,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 		bool		hit;
 
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
+		buffer = ReadBuffer_common(bmr,
 								   fork, extend_to - 1, mode, strategy,
 								   &hit);
 	}
@@ -1012,18 +1007,11 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
 static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
 				  BufferAccessStrategy strategy, bool *hit)
 {
-	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
-	IOContext	io_context;
-	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
-
-	*hit = false;
+	Buffer		buffer;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1042,177 +1030,333 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		*hit = false;
+
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
-
-	if (isLocalBuf)
-	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
-		io_context = IOCONTEXT_NORMAL;
-		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
-	}
-	else
-	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
-		io_context = IOContextForStrategy(strategy);
-		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
-	}
+	buffer = PrepareReadBuffer(bmr,
+							   forkNum,
+							   blockNum,
+							   strategy,
+							   hit);
 
 	/* At this point we do NOT hold any locks. */
 
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+	{
+		/* if we just want zeroes and a lock, we're done */
+		ZeroBuffer(buffer, mode);
+	}
+	else if (!*hit)
+	{
+		/* we might need to perform I/O */
+		CompleteReadBuffers(bmr,
+							&buffer,
+							forkNum,
+							blockNum,
+							1,
+							mode == RBM_ZERO_ON_ERROR,
+							strategy);
+	}
+
+	return buffer;
+}
+
+/*
+ * Prepare to read a block.  The buffer is pinned.  If this is a 'hit', then
+ * the returned buffer can be used immediately.  Otherwise, a physical read
+ * should be completed with CompleteReadBuffers(), or the buffer should be
+ * zeroed with ZeroBuffer().  PrepareReadBuffer() followed by
+ * CompleteReadBuffers() or ZeroBuffer() is equivalent to ReadBuffer(), but
+ * the caller has the opportunity to combine reads of multiple neighboring
+ * blocks into one CompleteReadBuffers() call.
+ *
+ * *foundPtr is set to true for a hit, and false for a miss.
+ */
+Buffer
+PrepareReadBuffer(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	Assert(blockNum != P_NEW);
+
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
+
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
+
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * CompleteReadBuffers() (so, not for hits, and not for buffers that
+		 * are zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
 		VacuumPageHit++;
 		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageHit;
 
 		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
-
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
-		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
-		}
-
-		return BufferDescriptorGetBuffer(bufHdr);
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
 	}
 
-	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
-	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
-
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
-
-	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
-	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
-	else
-	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
-
-		smgrread(smgr, forkNum, blockNum, bufBlock);
-
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
-
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
-		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
-			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
-			}
-			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-		}
-	}
-
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
-	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
-	}
-
-	if (isLocalBuf)
-	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
-
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
-	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
-	}
-
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
-
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
-
 	return BufferDescriptorGetBuffer(bufHdr);
 }
 
+static inline bool
+CompleteReadBuffersCanStartIO(Buffer buffer)
+{
+	if (BufferIsLocal(buffer))
+	{
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
+	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true);
+}
+
+/*
+ * Complete a set of reads prepared with PrepareReadBuffer().  The buffers
+ * must cover a cluster of neighboring block numbers.
+ *
+ * Typically this performs one physical vector read covering the block range,
+ * but if some of the buffers have already been read in the meantime by any
+ * backend, zero or multiple reads may be performed.
+ */
+void
+CompleteReadBuffers(BufferManagerRelation bmr,
+					Buffer *buffers,
+					ForkNumber forknum,
+					BlockNumber blocknum,
+					int nblocks,
+					bool zero_on_error,
+					BufferAccessStrategy strategy)
+{
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
+
+	/*
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out not to be true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
+	 */
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
+	else
+		pgBufferUsage.shared_blks_read += nblocks;
+
+	for (int i = 0; i < nblocks; ++i)
+	{
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_BUFFERS_PER_TRANSFER];
+		void	   *io_pages[MAX_BUFFERS_PER_TRANSFER];
+		instr_time	io_start;
+		BlockNumber io_first_block;
+
+#ifdef USE_ASSERT_CHECKING
+
+		/*
+		 * We could get all the information from buffer headers, but it can be
+		 * expensive to access buffer header cache lines so we make the caller
+		 * provide all the information we need, and assert that it is
+		 * consistent.
+		 */
+		{
+			RelFileLocator xlocator;
+			ForkNumber	xforknum;
+			BlockNumber xblocknum;
+
+			BufferGetTag(buffers[i], &xlocator, &xforknum, &xblocknum);
+			Assert(RelFileLocatorEquals(bmr.smgr->smgr_rlocator.locator, xlocator));
+			Assert(xforknum == forknum);
+			Assert(xblocknum == blocknum + i);
+		}
+#endif
+
+		/* Skip this block if someone else has already completed it. */
+		if (!CompleteReadBuffersCanStartIO(buffers[i]))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PrepareReadBuffer().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
+
+		/*
+		 * How many neighboring-on-disk blocks can we scatter-read into
+		 * other buffers at the same time?
+		 */
+		while ((i + 1) < nblocks &&
+			   CompleteReadBuffersCanStartIO(buffers[i + 1]))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
+
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
+
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
+		{
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
+			{
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
+			}
+			else
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
+
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if (zero_on_error || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+			}
+
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
+
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
+
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
+}
+
 /*
  * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
  *		buffer.  If no buffer exists already, selects a replacement
@@ -1224,11 +1368,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.  A read should be
+ * performed with CompleteReadBuffers().
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1287,19 +1428,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called PrepareReadBuffer() but not yet CompleteReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1364,19 +1496,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called PrepareReadBuffer() but not yet CompleteReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1408,15 +1531,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1770,7 +1887,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2373,7 +2490,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if PrepareReadBuffer() was called and
+		 * CompleteReadBuffers() hasn't been called yet.  We'll check by
+		 * loading the flags without locking.  This is racy, but it's OK to
+		 * return false spuriously: when CompleteReadBuffers() calls
+		 * StartBufferIO(), it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2382,7 +2504,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -4833,6 +4955,46 @@ ConditionalLockBuffer(Buffer buffer)
 									LW_EXCLUSIVE);
 }
 
+/*
+ * Zero a buffer, and lock it as RBM_ZERO_AND_LOCK or
+ * RBM_ZERO_AND_CLEANUP_LOCK would.  The buffer must already be pinned.  It
+ * does not have to be valid, but it is valid and locked on return.
+ */
+void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
 /*
  * Verify that this backend is pinning the buffer exactly once.
  *
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 1f02fed250e..6956d4e5b49 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -109,10 +109,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -288,7 +287,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -298,9 +297,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but it might not be
+	 * initialized yet so read num_temp_buffers.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d808aad8b05..35657008761 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2094,6 +2094,8 @@ PgStat_TableCounts
 PgStat_TableStatus
 PgStat_TableXactStatus
 PgStat_WalStats
+PgStreamingRead
+PgStreamingReadRange
 PgXmlErrorContext
 PgXmlStrictness
 Pg_finfo_record
-- 
2.43.0

v1-0002-Use-streaming-read-API-in-ANALYZE.patch (text/x-diff)
From ead994a29826be6033c3acc8fca90a6f6cc58a30 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Mon, 19 Feb 2024 14:30:47 +0300
Subject: [PATCH v1 2/2] Use streaming read API in ANALYZE

The ANALYZE command samples random blocks using the BlockSampler
algorithm. Fetch those blocks with streaming reads, by running the
BlockSampler algorithm in the streaming read API's prefetch callback.
---
 src/include/access/tableam.h             | 16 ++--
 src/backend/access/heap/heapam_handler.c | 11 +--
 src/backend/commands/analyze.c           | 97 ++++++++----------------
 3 files changed, 45 insertions(+), 79 deletions(-)

diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 5f8474871d2..7e6e99ba71d 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
 #include "access/sdir.h"
 #include "access/xact.h"
 #include "executor/tuptable.h"
+#include "storage/streaming_read.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
 
@@ -648,9 +649,9 @@ typedef struct TableAmRoutine
 									BufferAccessStrategy bstrategy);
 
 	/*
-	 * Prepare to analyze block `blockno` of `scan`. The scan has been started
-	 * with table_beginscan_analyze().  See also
-	 * table_scan_analyze_next_block().
+	 * Prepare to analyze the next block of `scan`.  The next block is chosen
+	 * by the callback function of `pgsr`.  The scan has been started with
+	 * table_beginscan_analyze().  See also table_scan_analyze_next_block().
 	 *
 	 * The callback may acquire resources like locks that are held until
 	 * table_scan_analyze_next_tuple() returns false. It e.g. can make sense
@@ -665,8 +666,7 @@ typedef struct TableAmRoutine
 	 * isn't one yet.
 	 */
 	bool		(*scan_analyze_next_block) (TableScanDesc scan,
-											BlockNumber blockno,
-											BufferAccessStrategy bstrategy);
+											PgStreamingRead *pgsr);
 
 	/*
 	 * See table_scan_analyze_next_tuple().
@@ -1714,11 +1714,9 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
  * Returns false if block is unsuitable for sampling, true otherwise.
  */
 static inline bool
-table_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-							  BufferAccessStrategy bstrategy)
+table_scan_analyze_next_block(TableScanDesc scan, PgStreamingRead *pgsr)
 {
-	return scan->rs_rd->rd_tableam->scan_analyze_next_block(scan, blockno,
-															bstrategy);
+	return scan->rs_rd->rd_tableam->scan_analyze_next_block(scan, pgsr);
 }
 
 /*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 680a50bf8b1..7ffaf8ac402 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -993,10 +993,10 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 }
 
 static bool
-heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-							   BufferAccessStrategy bstrategy)
+heapam_scan_analyze_next_block(TableScanDesc scan, PgStreamingRead *pgsr)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
+	BlockNumber *current_block;
 
 	/*
 	 * We must maintain a pin on the target page's buffer to ensure that
@@ -1007,10 +1007,11 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
 	 * doing much work per tuple, the extra lock traffic is probably better
 	 * avoided.
 	 */
-	hscan->rs_cblock = blockno;
+	hscan->rs_cbuf = pg_streaming_read_buffer_get_next(pgsr, (void **) &current_block);
+	hscan->rs_cblock = *current_block;
 	hscan->rs_cindex = FirstOffsetNumber;
-	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
-										blockno, RBM_NORMAL, bstrategy);
+
+	Assert(BufferIsValid(hscan->rs_cbuf));
 	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
 
 	/* in heap all blocks can contain tuples, so always return true */
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index a03495d6c95..7576afcf655 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1112,6 +1112,26 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 	return stats;
 }
 
+/*
+ * Prefetch callback function that returns the next block number chosen by
+ * the BlockSampler algorithm
+ */
+static BlockNumber
+pg_block_sampling_streaming_read_next(PgStreamingRead *pgsr,
+									  void *pgsr_private,
+									  void *per_io_data)
+{
+	BlockSamplerData *bs = pgsr_private;
+	BlockNumber *current_block = per_io_data;
+
+	if (BlockSampler_HasMore(bs))
+		*current_block = BlockSampler_Next(bs);
+	else
+		*current_block = InvalidBlockNumber;
+
+	return *current_block;
+}
+
 /*
  * acquire_sample_rows -- acquire a random sample of rows from the table
  *
@@ -1164,10 +1184,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	TableScanDesc scan;
 	BlockNumber nblocks;
 	BlockNumber blksdone = 0;
-#ifdef USE_PREFETCH
-	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
-	BlockSamplerData prefetch_bs;
-#endif
+	PgStreamingRead *pgsr = NULL;
 
 	Assert(targrows > 0);
 
@@ -1180,13 +1197,6 @@ acquire_sample_rows(Relation onerel, int elevel,
 	randseed = pg_prng_uint32(&pg_global_prng_state);
 	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
 
-#ifdef USE_PREFETCH
-	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
-	/* Create another BlockSampler, using the same seed, for prefetching */
-	if (prefetch_maximum)
-		(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
-#endif
-
 	/* Report sampling block numbers */
 	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
 								 nblocks);
@@ -1197,68 +1207,23 @@ acquire_sample_rows(Relation onerel, int elevel,
 	scan = table_beginscan_analyze(onerel);
 	slot = table_slot_create(onerel, NULL);
 
-#ifdef USE_PREFETCH
-
-	/*
-	 * If we are doing prefetching, then go ahead and tell the kernel about
-	 * the first set of pages we are going to want.  This also moves our
-	 * iterator out ahead of the main one being used, where we will keep it so
-	 * that we're always pre-fetching out prefetch_maximum number of blocks
-	 * ahead.
-	 */
-	if (prefetch_maximum)
-	{
-		for (int i = 0; i < prefetch_maximum; i++)
-		{
-			BlockNumber prefetch_block;
-
-			if (!BlockSampler_HasMore(&prefetch_bs))
-				break;
-
-			prefetch_block = BlockSampler_Next(&prefetch_bs);
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
-		}
-	}
-#endif
+	pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+										  &bs,
+										  sizeof(BlockSamplerData),
+										  vac_strategy,
+										  BMR_REL(scan->rs_rd),
+										  MAIN_FORKNUM,
+										  pg_block_sampling_streaming_read_next);
 
 	/* Outer loop over blocks to sample */
-	while (BlockSampler_HasMore(&bs))
+	while (nblocks)
 	{
 		bool		block_accepted;
-		BlockNumber targblock = BlockSampler_Next(&bs);
-#ifdef USE_PREFETCH
-		BlockNumber prefetch_targblock = InvalidBlockNumber;
-
-		/*
-		 * Make sure that every time the main BlockSampler is moved forward
-		 * that our prefetch BlockSampler also gets moved forward, so that we
-		 * always stay out ahead.
-		 */
-		if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
-			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
-#endif
 
 		vacuum_delay_point();
 
-		block_accepted = table_scan_analyze_next_block(scan, targblock, vac_strategy);
+		block_accepted = table_scan_analyze_next_block(scan, pgsr);
 
-#ifdef USE_PREFETCH
-
-		/*
-		 * When pre-fetching, after we get a block, tell the kernel about the
-		 * next one we will want, if there's any left.
-		 *
-		 * We want to do this even if the table_scan_analyze_next_block() call
-		 * above decides against analyzing the block it picked.
-		 */
-		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
-#endif
-
-		/*
-		 * Don't analyze if table_scan_analyze_next_block() indicated this
-		 * block is unsuitable for analyzing.
-		 */
 		if (!block_accepted)
 			continue;
 
@@ -1309,7 +1274,9 @@ acquire_sample_rows(Relation onerel, int elevel,
 
 		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
 									 ++blksdone);
+		nblocks--;
 	}
+	pg_streaming_read_free(pgsr);
 
 	ExecDropSingleTupleTableSlot(slot);
 	table_endscan(scan);
-- 
2.43.0

#2 Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Nazir Bilal Yavuz (#1)
2 attachment(s)
Re: Use streaming read API in ANALYZE

Hi,

On Mon, 19 Feb 2024 at 18:13, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

> I worked on using the currently proposed streaming read API [1] in ANALYZE. The patch is attached. 0001 contains the not-yet-merged streaming read API changes that can be applied to master; 0002 is the actual code.
>
> The blocks to analyze are obtained by using the streaming read API now.
>
> - Since the streaming read API already does prefetching, I removed the #ifdef USE_PREFETCH code from acquire_sample_rows().
>
> - Changed 'while (BlockSampler_HasMore(&bs))' to 'while (nblocks)' because the prefetch mechanism in the streaming read API will advance 'bs' before returning buffers.
>
> - Removed BlockNumber and BufferAccessStrategy from the declaration of scan_analyze_next_block(), passing a PgStreamingRead (pgsr) instead.
>
> I counted syscalls while analyzing a ~5GB table. The patched version made ~1300 fewer read calls.
>
> Patched:
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  39.67    0.012128           0     29809           pwrite64
>  36.96    0.011299           0     28594           pread64
>  23.24    0.007104           0     27611           fadvise64
>
> Master (21a71648d3):
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  38.94    0.016457           0     29816           pwrite64
>  36.79    0.015549           0     29850           pread64
>  23.91    0.010106           0     29848           fadvise64
>
> Any kind of feedback would be appreciated.
>
> [1]: /messages/by-id/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com

The new version of the streaming read API [1] is posted. I updated the
streaming read API changes patch (0001); the patch that uses the
streaming read API in ANALYZE (0002) remains the same. This should make
it easier to review, as it can be applied on top of master.
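
To give a feel for the API, the consumer side used by 0002 boils down to
a block-number callback plus an alloc/get_next/free loop. Here is a
minimal sketch (DemoState and the demo_* functions are made up for
illustration and are not part of the patches; error handling omitted):

typedef struct DemoState		/* hypothetical example state */
{
	BlockNumber next;
	BlockNumber nblocks;
} DemoState;

/* Callback: hand back block numbers until the stream is exhausted. */
static BlockNumber
demo_next_block(PgStreamingRead *pgsr, void *pgsr_private,
				void *per_buffer_data)
{
	DemoState  *state = (DemoState *) pgsr_private;

	if (state->next >= state->nblocks)
		return InvalidBlockNumber;	/* ends the stream */
	return state->next++;
}

static void
demo_read_all(Relation rel)		/* hypothetical caller */
{
	DemoState	state = {0, RelationGetNumberOfBlocks(rel)};
	PgStreamingRead *pgsr;
	Buffer		buf;

	pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
										  &state,	/* pgsr_private */
										  0,		/* no per-buffer data */
										  NULL,		/* default strategy */
										  BMR_REL(rel),
										  MAIN_FORKNUM,
										  demo_next_block);

	/* Each returned buffer comes back pinned; the consumer releases it. */
	while ((buf = pg_streaming_read_buffer_get_next(pgsr, NULL)) !=
		   InvalidBuffer)
	{
		/* ... examine the pinned buffer ... */
		ReleaseBuffer(buf);
	}

	pg_streaming_read_free(pgsr);
}

All of the look-ahead, advice and vectored reads happen behind
pg_streaming_read_buffer_get_next(), which is why acquire_sample_rows()
no longer needs its own #ifdef USE_PREFETCH logic.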

[1]: /messages/by-id/CA+hUKGJtLyxcAEvLhVUhgD4fMQkOu3PDaj8Qb9SR_UsmzgsBpQ@mail.gmail.com

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v2-0001-Streaming-read-API-changes-that-are-not-committed.patch (text/x-diff; charset=US-ASCII)
From 21d9043501284c6bae996522ff2f3ac693f81986 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v2 1/2] Streaming read API changes that are not committed to
 master yet

Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
---
 src/include/storage/bufmgr.h             |  45 ++
 src/include/storage/streaming_read.h     |  52 ++
 src/backend/storage/Makefile             |   2 +-
 src/backend/storage/aio/Makefile         |  14 +
 src/backend/storage/aio/meson.build      |   5 +
 src/backend/storage/aio/streaming_read.c | 612 ++++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c      | 687 +++++++++++++++--------
 src/backend/storage/buffer/localbuf.c    |  14 +-
 src/backend/storage/meson.build          |   1 +
 src/tools/pgindent/typedefs.list         |   3 +
 10 files changed, 1202 insertions(+), 233 deletions(-)
 create mode 100644 src/include/storage/streaming_read.h
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c

diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..b57f71f97e3 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
+/*
+ * Maximum number of buffers for multi-buffer I/O functions.  This is set to
+ * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a smaller maximum.
+ */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
 
 /*
  * prototypes for functions in bufmgr.c
@@ -177,6 +183,42 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+/*
+ * Private state used by StartReadBuffers() and WaitReadBuffers().  Declared
+ * in public header only to allow inclusion in other structs, but contents
+ * should not be accessed.
+ */
+struct ReadBuffersOperation
+{
+	/* Parameters passed in to StartReadBuffers(). */
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	ForkNumber	forknum;
+	BlockNumber blocknum;
+	int			nblocks;
+	BufferAccessStrategy strategy;
+	int			flags;
+
+	/* Range of buffers, if we need to perform a read. */
+	int			io_buffers_len;
+};
+
+typedef struct ReadBuffersOperation ReadBuffersOperation;
+
+extern bool StartReadBuffers(BufferManagerRelation bmr,
+							 Buffer *buffers,
+							 ForkNumber forknum,
+							 BlockNumber blocknum,
+							 int *nblocks,
+							 BufferAccessStrategy strategy,
+							 int flags,
+							 ReadBuffersOperation *operation);
+extern void WaitReadBuffers(ReadBuffersOperation *operation);
+
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -250,6 +292,9 @@ extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 00000000000..c4d3892bb26
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,52 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define PGSR_FLAG_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users.
+ */
+#define PGSR_FLAG_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define PGSR_FLAG_SEQUENTIAL 0x02
+
+/*
+ * We usually ramp up from smaller reads to larger ones, to support users who
+ * don't know if it's worth reading lots of buffers yet.  This flag disables
+ * that, declaring ahead of time that we'll be reading all available buffers.
+ */
+#define PGSR_FLAG_FULL 0x04
+
+struct PgStreamingRead;
+typedef struct PgStreamingRead PgStreamingRead;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*PgStreamingReadBufferCB) (PgStreamingRead *pgsr,
+												void *pgsr_private,
+												void *per_buffer_private);
+
+extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags,
+													   void *pgsr_private,
+													   size_t per_buffer_private_size,
+													   BufferAccessStrategy strategy,
+													   BufferManagerRelation bmr,
+													   ForkNumber forknum,
+													   PgStreamingReadBufferCB next_block_cb);
+
+extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr);
+extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private);
+extern void pg_streaming_read_free(PgStreamingRead *pgsr);
+
+#endif
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 00000000000..bcab44c802f
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 00000000000..39aef2a84a2
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 00000000000..71f2c4a70b6
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,612 @@
+#include "postgres.h"
+
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+
+/*
+ * Element type for PgStreamingRead's circular array of block ranges.
+ */
+typedef struct PgStreamingReadRange
+{
+	bool		need_wait;
+	bool		advice_issued;
+	BlockNumber blocknum;
+	int			nblocks;
+	int			per_buffer_data_index;
+	Buffer		buffers[MAX_BUFFERS_PER_TRANSFER];
+	ReadBuffersOperation operation;
+} PgStreamingReadRange;
+
+/*
+ * Streaming read object.
+ */
+struct PgStreamingRead
+{
+	int			max_ios;
+	int			ios_in_progress;
+	int			max_pinned_buffers;
+	int			pinned_buffers;
+	int			pinned_buffers_trigger;
+	int			next_tail_buffer;
+	int			ramp_up_pin_limit;
+	int			ramp_up_pin_stall;
+	bool		finished;
+	bool		advice_enabled;
+	void	   *pgsr_private;
+	PgStreamingReadBufferCB callback;
+
+	BufferAccessStrategy strategy;
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+
+	/* Sometimes we need to buffer one block for flow control. */
+	BlockNumber unget_blocknum;
+	void	   *unget_per_buffer_data;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* Space for optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+
+	/* Circular buffer of ranges. */
+	int			size;
+	int			head;
+	int			tail;
+	PgStreamingReadRange ranges[FLEXIBLE_ARRAY_MEMBER];
+};
+
+static PgStreamingRead *
+pg_streaming_read_buffer_alloc_internal(int flags,
+										void *pgsr_private,
+										size_t per_buffer_data_size,
+										BufferAccessStrategy strategy)
+{
+	PgStreamingRead *pgsr;
+	int			size;
+	int			max_ios;
+	uint32		max_pinned_buffers;
+
+
+	/*
+	 * Decide how many assumed I/Os we will allow to run concurrently.  That
+	 * is, advice to the kernel to tell it that we will soon read.  This
+	 * number also affects how far we look ahead for opportunities to start
+	 * more I/Os.
+	 */
+	if (flags & PGSR_FLAG_MAINTENANCE)
+		max_ios = maintenance_io_concurrency;
+	else
+		max_ios = effective_io_concurrency;
+
+	/*
+	 * The desired level of I/O concurrency controls how far ahead we are
+	 * willing to look.  We also clamp it to at least
+	 * MAX_BUFFERS_PER_TRANSFER so that we have a chance to build up a
+	 * full-sized read, even when max_ios is zero.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, MAX_BUFFERS_PER_TRANSFER);
+
+	/*
+	 * The *_io_concurrency GUCs might be set to 0, but we want to allow at
+	 * least one, to keep our gating logic simple.
+	 */
+	max_ios = Max(max_ios, 1);
+
+	/*
+	 * Don't allow this backend to pin too many buffers.  For now we'll apply
+	 * the limit for the shared buffer pool and the local buffer pool, without
+	 * worrying which it is.
+	 */
+	LimitAdditionalPins(&max_pinned_buffers);
+	LimitAdditionalLocalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * pgsr->ranges is a circular buffer.  When it is empty, head == tail.
+	 * When it is full, there is an empty element between head and tail.  Head
+	 * can also be empty (nblocks == 0), therefore we need two extra elements
+	 * for non-occupied ranges, on top of max_pinned_buffers to allow for the
+	 * maximum possible number of occupied ranges of the smallest possible
+	 * size of one.
+	 */
+	size = max_pinned_buffers + 2;
+
+	pgsr = (PgStreamingRead *)
+		palloc0(offsetof(PgStreamingRead, ranges) +
+				sizeof(pgsr->ranges[0]) * size);
+
+	pgsr->max_ios = max_ios;
+	pgsr->per_buffer_data_size = per_buffer_data_size;
+	pgsr->max_pinned_buffers = max_pinned_buffers;
+	pgsr->pgsr_private = pgsr_private;
+	pgsr->strategy = strategy;
+	pgsr->size = size;
+
+	pgsr->unget_blocknum = InvalidBlockNumber;
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  As long as direct I/O isn't
+	 * enabled, and the caller hasn't promised sequential access, we can use
+	 * it.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & PGSR_FLAG_SEQUENTIAL) == 0)
+		pgsr->advice_enabled = true;
+#endif
+
+	/*
+	 * We start off building small ranges, but double that quickly, for the
+	 * benefit of users that don't know how far ahead they'll read.  This can
+	 * be disabled by users that already know they'll read all the way.
+	 */
+	if (flags & PGSR_FLAG_FULL)
+		pgsr->ramp_up_pin_limit = INT_MAX;
+	else
+		pgsr->ramp_up_pin_limit = 1;
+
+	/*
+	 * We want to avoid creating ranges that are smaller than they could be
+	 * just because we hit max_pinned_buffers.  We only look ahead when the
+	 * number of pinned buffers falls below this trigger number, or put
+	 * another way, we stop looking ahead when we wouldn't be able to build a
+	 * "full sized" range.
+	 */
+	pgsr->pinned_buffers_trigger =
+		Max(1, (int) max_pinned_buffers - MAX_BUFFERS_PER_TRANSFER);
+
+	/* Space for the callback to store extra data along with each block. */
+	if (per_buffer_data_size)
+		pgsr->per_buffer_data = palloc(per_buffer_data_size * max_pinned_buffers);
+
+	return pgsr;
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.
+ */
+PgStreamingRead *
+pg_streaming_read_buffer_alloc(int flags,
+							   void *pgsr_private,
+							   size_t per_buffer_data_size,
+							   BufferAccessStrategy strategy,
+							   BufferManagerRelation bmr,
+							   ForkNumber forknum,
+							   PgStreamingReadBufferCB next_block_cb)
+{
+	PgStreamingRead *result;
+
+	result = pg_streaming_read_buffer_alloc_internal(flags,
+													 pgsr_private,
+													 per_buffer_data_size,
+													 strategy);
+	result->callback = next_block_cb;
+	result->bmr = bmr;
+	result->forknum = forknum;
+
+	return result;
+}
+
+/*
+ * Find the per-buffer data index for the Nth block of a range.
+ */
+static int
+get_per_buffer_data_index(PgStreamingRead *pgsr, PgStreamingReadRange *range, int n)
+{
+	int			result;
+
+	/*
+	 * Find slot in the circular buffer of per-buffer data, without using the
+	 * expensive % operator.
+	 */
+	result = range->per_buffer_data_index + n;
+	if (result >= pgsr->max_pinned_buffers)
+		result -= pgsr->max_pinned_buffers;
+	Assert(result == (range->per_buffer_data_index + n) % pgsr->max_pinned_buffers);
+
+	return result;
+}
+
+/*
+ * Return a pointer to the per-buffer data by index.
+ */
+static void *
+get_per_buffer_data_by_index(PgStreamingRead *pgsr, int per_buffer_data_index)
+{
+	return (char *) pgsr->per_buffer_data +
+		pgsr->per_buffer_data_size * per_buffer_data_index;
+}
+
+/*
+ * Return a pointer to the per-buffer data for the Nth block of a range.
+ */
+static void *
+get_per_buffer_data(PgStreamingRead *pgsr, PgStreamingReadRange *range, int n)
+{
+	return get_per_buffer_data_by_index(pgsr,
+										get_per_buffer_data_index(pgsr,
+																  range,
+																  n));
+}
+
+/*
+ * Start reading the head range, and create a new head range.  The new head
+ * range is returned.  It may not be empty, if StartReadBuffers() couldn't
+ * start the entire range; in that case the returned range contains the
+ * remaining portion of the range.
+ */
+static PgStreamingReadRange *
+pg_streaming_read_start_head_range(PgStreamingRead *pgsr)
+{
+	PgStreamingReadRange *head_range;
+	PgStreamingReadRange *new_head_range;
+	int			nblocks_pinned;
+	int			flags;
+
+	/* Caller should make sure we never exceed max_ios. */
+	Assert(pgsr->ios_in_progress < pgsr->max_ios);
+
+	/* Should only call if the head range has some blocks to read. */
+	head_range = &pgsr->ranges[pgsr->head];
+	Assert(head_range->nblocks > 0);
+
+	/*
+	 * If advice hasn't been suppressed, this system supports it, and this
+	 * isn't a strictly sequential pattern, then we'll issue advice.
+	 */
+	if (pgsr->advice_enabled && head_range->blocknum != pgsr->seq_blocknum)
+		flags = READ_BUFFERS_ISSUE_ADVICE;
+	else
+		flags = 0;
+
+
+	/* Start reading as many blocks as we can from the head range. */
+	nblocks_pinned = head_range->nblocks;
+	head_range->need_wait =
+		StartReadBuffers(pgsr->bmr,
+						 head_range->buffers,
+						 pgsr->forknum,
+						 head_range->blocknum,
+						 &nblocks_pinned,
+						 pgsr->strategy,
+						 flags,
+						 &head_range->operation);
+
+	/* Did that start an I/O? */
+	if (head_range->need_wait && (flags & READ_BUFFERS_ISSUE_ADVICE))
+	{
+		head_range->advice_issued = true;
+		pgsr->ios_in_progress++;
+		Assert(pgsr->ios_in_progress <= pgsr->max_ios);
+	}
+
+	/*
+	 * StartReadBuffers() might have pinned fewer blocks than we asked it to,
+	 * but always at least one.
+	 */
+	Assert(nblocks_pinned <= head_range->nblocks);
+	Assert(nblocks_pinned >= 1);
+	pgsr->pinned_buffers += nblocks_pinned;
+
+	/*
+	 * Remember where the next block would be after that, so we can detect
+	 * sequential access next time.
+	 */
+	pgsr->seq_blocknum = head_range->blocknum + nblocks_pinned;
+
+	/*
+	 * Create a new head range.  There must be space, because we have enough
+	 * elements for every range to hold just one block, up to the pin limit.
+	 */
+	Assert(pgsr->size > pgsr->max_pinned_buffers);
+	Assert((pgsr->head + 1) % pgsr->size != pgsr->tail);
+	if (++pgsr->head == pgsr->size)
+		pgsr->head = 0;
+	new_head_range = &pgsr->ranges[pgsr->head];
+	new_head_range->nblocks = 0;
+	new_head_range->advice_issued = false;
+
+	/*
+	 * If we didn't manage to start the whole read above, we split the range,
+	 * moving the remainder into the new head range.
+	 */
+	if (nblocks_pinned < head_range->nblocks)
+	{
+		int			nblocks_remaining = head_range->nblocks - nblocks_pinned;
+
+		head_range->nblocks = nblocks_pinned;
+
+		new_head_range->blocknum = head_range->blocknum + nblocks_pinned;
+		new_head_range->nblocks = nblocks_remaining;
+	}
+
+	/* The new range has per-buffer data starting after the previous range. */
+	new_head_range->per_buffer_data_index =
+		get_per_buffer_data_index(pgsr, head_range, nblocks_pinned);
+
+	return new_head_range;
+}
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow pg_streaming_unget_block() to work.
+ */
+static BlockNumber
+pg_streaming_get_block(PgStreamingRead *pgsr, void *per_buffer_data)
+{
+	BlockNumber result;
+
+	if (unlikely(pgsr->unget_blocknum != InvalidBlockNumber))
+	{
+		/*
+		 * If we had to unget a block, now it is time to return that one
+		 * again.
+		 */
+		result = pgsr->unget_blocknum;
+		pgsr->unget_blocknum = InvalidBlockNumber;
+
+		/*
+		 * The same per_buffer_data element must have been used, and still
+		 * contains whatever data the callback wrote into it.  So we just
+		 * sanity-check that we were called with the value that
+		 * pg_streaming_unget_block() pushed back.
+		 */
+		Assert(per_buffer_data == pgsr->unget_per_buffer_data);
+	}
+	else
+	{
+		/* Use the installed callback directly. */
+		result = pgsr->callback(pgsr, pgsr->pgsr_private, per_buffer_data);
+	}
+
+	return result;
+}
+
+/*
+ * In order to deal with short reads in StartReadBuffers(), we sometimes need
+ * to defer handling of a block until later.  This *must* be called with the
+ * last value returned by pg_streaming_get_block().
+ */
+static void
+pg_streaming_unget_block(PgStreamingRead *pgsr, BlockNumber blocknum, void *per_buffer_data)
+{
+	Assert(pgsr->unget_blocknum == InvalidBlockNumber);
+	pgsr->unget_blocknum = blocknum;
+	pgsr->unget_per_buffer_data = per_buffer_data;
+}
+
+static void
+pg_streaming_read_look_ahead(PgStreamingRead *pgsr)
+{
+	PgStreamingReadRange *range;
+
+	/*
+	 * If we're still ramping up, we may have to stall to wait for buffers to
+	 * be consumed first before we do any more prefetching.
+	 */
+	if (pgsr->ramp_up_pin_stall > 0)
+	{
+		Assert(pgsr->pinned_buffers > 0);
+		return;
+	}
+
+	/*
+	 * If we're finished or can't start more I/O, then don't look ahead.
+	 */
+	if (pgsr->finished || pgsr->ios_in_progress == pgsr->max_ios)
+		return;
+
+	/*
+	 * We'll also wait until the number of pinned buffers falls below our
+	 * trigger level, so that we have the chance to create a full range.
+	 */
+	if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+		return;
+
+	do
+	{
+		BlockNumber blocknum;
+		void	   *per_buffer_data;
+
+		/* Do we have a full-sized range? */
+		range = &pgsr->ranges[pgsr->head];
+		if (range->nblocks == lengthof(range->buffers))
+		{
+			/* Start as much of it as we can. */
+			range = pg_streaming_read_start_head_range(pgsr);
+
+			/* If we're now at the I/O limit, stop here. */
+			if (pgsr->ios_in_progress == pgsr->max_ios)
+				return;
+
+			/*
+			 * If we couldn't form a full range, then stop here to avoid
+			 * creating small I/O.
+			 */
+			if (pgsr->pinned_buffers >= pgsr->pinned_buffers_trigger)
+				return;
+
+			/*
+			 * That might have been only partially started, but it always
+			 * processes at least one block, so that'll do for now.
+			 */
+			Assert(range->nblocks < lengthof(range->buffers));
+		}
+
+		/* Find per-buffer data slot for the next block. */
+		per_buffer_data = get_per_buffer_data(pgsr, range, range->nblocks);
+
+		/* Find out which block the callback wants to read next. */
+		blocknum = pg_streaming_get_block(pgsr, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			/* End of stream. */
+			pgsr->finished = true;
+			break;
+		}
+
+		/*
+		 * Is there a head range that we cannot extend, because the requested
+		 * block is not consecutive?
+		 */
+		if (range->nblocks > 0 &&
+			range->blocknum + range->nblocks != blocknum)
+		{
+			/* Yes.  Start it, so we can begin building a new one. */
+			range = pg_streaming_read_start_head_range(pgsr);
+
+			/*
+			 * It's possible that it was only partially started, and we have a
+			 * new range with the remainder.  Keep starting I/Os until we get
+			 * it all out of the way, or we hit the I/O limit.
+			 */
+			while (range->nblocks > 0 && pgsr->ios_in_progress < pgsr->max_ios)
+				range = pg_streaming_read_start_head_range(pgsr);
+
+			/*
+			 * We have to 'unget' the block returned by the callback if we
+			 * don't have enough I/O capacity left to start something.
+			 */
+			if (pgsr->ios_in_progress == pgsr->max_ios)
+			{
+				pg_streaming_unget_block(pgsr, blocknum, per_buffer_data);
+				return;
+			}
+		}
+
+		/* If we have a new, empty range, initialize the start block. */
+		if (range->nblocks == 0)
+		{
+			range->blocknum = blocknum;
+		}
+
+		/* This block extends the range by one. */
+		Assert(range->blocknum + range->nblocks == blocknum);
+		range->nblocks++;
+
+	} while (pgsr->pinned_buffers + range->nblocks < pgsr->max_pinned_buffers &&
+			 pgsr->pinned_buffers + range->nblocks < pgsr->ramp_up_pin_limit);
+
+	/* If we've hit the ramp-up limit, insert a stall. */
+	if (pgsr->pinned_buffers + range->nblocks >= pgsr->ramp_up_pin_limit)
+	{
+		/* Can't get here if an earlier stall hasn't finished. */
+		Assert(pgsr->ramp_up_pin_stall == 0);
+		/* Don't do any more prefetching until these buffers are consumed. */
+		pgsr->ramp_up_pin_stall = pgsr->ramp_up_pin_limit;
+		/* Double it.  It will soon be out of the way. */
+		pgsr->ramp_up_pin_limit *= 2;
+	}
+
+	/* Start as much as we can. */
+	while (range->nblocks > 0)
+	{
+		range = pg_streaming_read_start_head_range(pgsr);
+		if (pgsr->ios_in_progress == pgsr->max_ios)
+			break;
+	}
+}
+
+Buffer
+pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_data)
+{
+	pg_streaming_read_look_ahead(pgsr);
+
+	/* See if we have one buffer to return. */
+	while (pgsr->tail != pgsr->head)
+	{
+		PgStreamingReadRange *tail_range;
+
+		tail_range = &pgsr->ranges[pgsr->tail];
+
+		/*
+		 * Do we need to perform an I/O before returning the buffers from this
+		 * range?
+		 */
+		if (tail_range->need_wait)
+		{
+			WaitReadBuffers(&tail_range->operation);
+			tail_range->need_wait = false;
+
+			/*
+			 * We don't really know if the kernel generated a physical I/O
+			 * when we issued advice, let alone when it finished, but it has
+			 * certainly finished now because we've performed the read.
+			 */
+			if (tail_range->advice_issued)
+			{
+				Assert(pgsr->ios_in_progress > 0);
+				pgsr->ios_in_progress--;
+			}
+		}
+
+		/* Are there more buffers available in this range? */
+		if (pgsr->next_tail_buffer < tail_range->nblocks)
+		{
+			int			buffer_index;
+			Buffer		buffer;
+
+			buffer_index = pgsr->next_tail_buffer++;
+			buffer = tail_range->buffers[buffer_index];
+
+			Assert(BufferIsValid(buffer));
+
+			/* We are giving away ownership of this pinned buffer. */
+			Assert(pgsr->pinned_buffers > 0);
+			pgsr->pinned_buffers--;
+
+			if (pgsr->ramp_up_pin_stall > 0)
+				pgsr->ramp_up_pin_stall--;
+
+			if (per_buffer_data)
+				*per_buffer_data = get_per_buffer_data(pgsr, tail_range, buffer_index);
+
+			return buffer;
+		}
+
+		/* Advance tail to next range, if there is one. */
+		if (++pgsr->tail == pgsr->size)
+			pgsr->tail = 0;
+		pgsr->next_tail_buffer = 0;
+
+		/*
+		 * If tail crashed into head, and head is not empty, then it is time
+		 * to start that range.
+		 */
+		if (pgsr->tail == pgsr->head &&
+			pgsr->ranges[pgsr->head].nblocks > 0)
+			pg_streaming_read_start_head_range(pgsr);
+	}
+
+	Assert(pgsr->pinned_buffers == 0);
+
+	return InvalidBuffer;
+}
+
+void
+pg_streaming_read_free(PgStreamingRead *pgsr)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead. */
+	pgsr->finished = true;
+
+	/* Unpin anything that wasn't consumed. */
+	while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	Assert(pgsr->pinned_buffers == 0);
+	Assert(pgsr->ios_in_progress == 0);
+
+	/* Release memory. */
+	if (pgsr->per_buffer_data)
+		pfree(pgsr->per_buffer_data);
+
+	pfree(pgsr);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index bdf89bbc4dc..3b1b0ad99df 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,11 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffers() -- as above, but for multiple contiguous blocks in
+ *		two steps.
+ *
+ * WaitReadBuffers() -- second step of StartReadBuffers().
+ *
  * ReleaseBuffer() -- unpin a buffer
  *
  * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -472,10 +477,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
-								ReadBufferMode mode, BufferAccessStrategy strategy,
-								bool *hit);
+								ReadBufferMode mode, BufferAccessStrategy strategy);
 static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
 										   ForkNumber fork,
 										   BufferAccessStrategy strategy,
@@ -501,7 +505,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -782,7 +786,6 @@ Buffer
 ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				   ReadBufferMode mode, BufferAccessStrategy strategy)
 {
-	bool		hit;
 	Buffer		buf;
 
 	/*
@@ -795,15 +798,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
-							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+	buf = ReadBuffer_common(BMR_REL(reln),
+							forkNum, blockNum, mode, strategy);
+
 	return buf;
 }
 
@@ -823,13 +820,12 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
 						  BufferAccessStrategy strategy, bool permanent)
 {
-	bool		hit;
-
 	SMgrRelation smgr = smgropen(rlocator, InvalidBackendId);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
-							 mode, strategy, &hit);
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
+							 mode, strategy);
 }
 
 /*
@@ -995,35 +991,68 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	 */
 	if (buffer == InvalidBuffer)
 	{
-		bool		hit;
-
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
-								   fork, extend_to - 1, mode, strategy,
-								   &hit);
+		buffer = ReadBuffer_common(bmr, fork, extend_to - 1, mode, strategy);
 	}
 
 	return buffer;
 }
 
+/*
+ * Zero a buffer and lock it, as part of the implementation of
+ * RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK.  The buffer must be already
+ * pinned.  It does not have to be valid, but it is valid and locked on
+ * return.
+ */
+static void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
 /*
  * ReadBuffer_common -- common logic for all ReadBuffer variants
  *
  * *hit is set to true if the request was satisfied from shared buffer cache.
  */
 static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
 				  BlockNumber blockNum, ReadBufferMode mode,
-				  BufferAccessStrategy strategy, bool *hit)
+				  BufferAccessStrategy strategy)
 {
-	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
-	IOContext	io_context;
-	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
-
-	*hit = false;
+	ReadBuffersOperation operation;
+	Buffer		buffer;
+	int			nblocks;
+	int			flags;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1042,181 +1071,404 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
+	nblocks = 1;
+	if (mode == RBM_ZERO_ON_ERROR)
+		flags = READ_BUFFERS_ZERO_ON_ERROR;
+	else
+		flags = 0;
+	if (StartReadBuffers(bmr,
+						 &buffer,
+						 forkNum,
+						 blockNum,
+						 &nblocks,
+						 strategy,
+						 flags,
+						 &operation))
+		WaitReadBuffers(&operation);
+	Assert(nblocks == 1);		/* single block can't be short */
 
+	if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+		ZeroBuffer(buffer, mode);
+
+	return buffer;
+}
+
+static Buffer
+PrepareReadBuffer(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	Assert(blockNum != P_NEW);
+
+	Assert(bmr.smgr);
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
 	}
 	else
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
 		io_context = IOContextForStrategy(strategy);
 		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
 	}
 
-	/* At this point we do NOT hold any locks. */
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
 
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * WaitReadBuffers() (so, not for hits, and not for buffers that are
+		 * zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
 		VacuumPageHit++;
 		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageHit;
 
 		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
-
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
-		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
-		}
-
-		return BufferDescriptorGetBuffer(bufHdr);
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
 	}
 
-	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
-	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
-
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
-
-	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
-	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
-	else
-	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
-
-		smgrread(smgr, forkNum, blockNum, bufBlock);
-
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
-
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
-		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
-			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
-			}
-			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-		}
-	}
-
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
-	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
-	}
-
-	if (isLocalBuf)
-	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
-
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
-	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
-	}
-
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
-
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
-
 	return BufferDescriptorGetBuffer(bufHdr);
 }
 
 /*
- * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
- *		buffer.  If no buffer exists already, selects a replacement
- *		victim and evicts the old page, but does NOT read in new page.
+ * Begin reading a range of blocks beginning at blockNum and extending for
+ * *nblocks.  On return, up to *nblocks pinned buffers holding those blocks
+ * are written into the buffers array, and *nblocks is updated to contain the
+ * actual number, which may be fewer than requested.
+ *
+ * If false is returned, no I/O is necessary and WaitReadBuffers() need not
+ * be called.  If true is returned, one I/O has been started, and
+ * WaitReadBuffers() must be called with the same operation object before the
+ * buffers are accessed.  Along with the operation object, the caller-supplied
+ * array of buffers must remain valid until WaitReadBuffers() is called.
+ *
+ * Currently the I/O is only started with optional operating system advice,
+ * and the real I/O happens in WaitReadBuffers().  In future work, true I/O
+ * could be initiated here.
+ */
+bool
+StartReadBuffers(BufferManagerRelation bmr,
+				 Buffer *buffers,
+				 ForkNumber forkNum,
+				 BlockNumber blockNum,
+				 int *nblocks,
+				 BufferAccessStrategy strategy,
+				 int flags,
+				 ReadBuffersOperation *operation)
+{
+	int			actual_nblocks = *nblocks;
+
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	operation->bmr = bmr;
+	operation->forknum = forkNum;
+	operation->blocknum = blockNum;
+	operation->buffers = buffers;
+	operation->nblocks = actual_nblocks;
+	operation->strategy = strategy;
+	operation->flags = flags;
+
+	operation->io_buffers_len = 0;
+
+	for (int i = 0; i < actual_nblocks; ++i)
+	{
+		bool		found;
+
+		buffers[i] = PrepareReadBuffer(bmr,
+									   forkNum,
+									   blockNum + i,
+									   strategy,
+									   &found);
+
+		if (found)
+		{
+			/*
+			 * Terminate the read as soon as we get a hit.  It could be a
+			 * single buffer hit, or it could be a hit that follows a readable
+			 * range.  We don't want to create more than one readable range,
+			 * so we stop here.
+			 */
+			actual_nblocks = operation->nblocks = *nblocks = i + 1;
+		}
+		else
+		{
+			/* Extend the readable range to cover this block. */
+			operation->io_buffers_len++;
+		}
+	}
+
+	if (operation->io_buffers_len > 0)
+	{
+		if (flags & READ_BUFFERS_ISSUE_ADVICE)
+		{
+			/*
+			 * In theory we should only do this if PrepareReadBuffer() had to
+			 * allocate new buffers above.  That way, if two calls to
+			 * StartReadBuffers() were made for the same blocks before
+			 * WaitReadBuffers(), only the first would issue the advice.
+			 * That'd be a better simulation of true asynchronous I/O, which
+			 * would only start the I/O once, but isn't done here for
+			 * simplicity.  Note also that the following call might actually
+			 * issue two advice calls if we cross a segment boundary; in a
+			 * true asynchronous version we might choose to process only one
+			 * real I/O at a time in that case.
+			 */
+			smgrprefetch(bmr.smgr, forkNum, blockNum, operation->io_buffers_len);
+		}
+
+		/* Indicate that WaitReadBuffers() should be called. */
+		return true;
+	}
+	else
+	{
+		return false;
+	}
+}
+
+static inline bool
+WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
+	{
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
+	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
+
+void
+WaitReadBuffers(ReadBuffersOperation *operation)
+{
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	int			nblocks;
+	BlockNumber blocknum;
+	ForkNumber	forknum;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	/*
+	 * Currently operations are only allowed to include a read of some range,
+	 * with an optional extra buffer that is already pinned at the end.  So
+	 * nblocks can be at most one more than io_buffers_len.
+	 */
+	Assert((operation->nblocks == operation->io_buffers_len) ||
+		   (operation->nblocks == operation->io_buffers_len + 1));
+
+	/* Find the range of the physical read we need to perform. */
+	nblocks = operation->io_buffers_len;
+	if (nblocks == 0)
+		return;					/* nothing to do */
+
+	buffers = &operation->buffers[0];
+	blocknum = operation->blocknum;
+	forknum = operation->forknum;
+	bmr = operation->bmr;
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(operation->strategy);
+		io_object = IOOBJECT_RELATION;
+	}
+
+	/*
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out not to be true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
+	 */
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
+	else
+		pgBufferUsage.shared_blks_read += nblocks;
+
+	for (int i = 0; i < nblocks; ++i)
+	{
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_BUFFERS_PER_TRANSFER];
+		void	   *io_pages[MAX_BUFFERS_PER_TRANSFER];
+		instr_time	io_start;
+		BlockNumber io_first_block;
+
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!WaitReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PrepareReadBuffer().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
+
+		/*
+		 * How many neighboring-on-disk blocks can we scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   WaitReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
+
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
+
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
+		{
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
+			{
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
+			}
+			else
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
+
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+			}
+
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
+
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
+
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
+}
+
+/*
+ * BufferAlloc -- subroutine for StartReadBuffers.  Handles lookup of a shared
+ *		buffer.  If no buffer exists already, selects a replacement victim and
+ *		evicts the old page, but does NOT read in new page.
  *
  * "strategy" can be a buffer replacement strategy object, or NULL for
  * the default strategy.  The selected buffer's usage_count is advanced when
@@ -1224,11 +1476,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1287,19 +1535,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1364,19 +1603,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1408,15 +1638,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1770,7 +1994,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2035,7 +2259,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2058,7 +2282,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2373,7 +2597,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if StartReadBuffers() was called and
+		 * WaitReadBuffers() hasn't been called yet.  We'll check by loading
+		 * the flags without locking.  This is racy, but it's OK to return
+		 * false spuriously: when WaitReadBuffers() calls StartBufferIO(),
+		 * it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2382,7 +2611,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3450,7 +3679,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -5185,9 +5414,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5200,6 +5435,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 1f02fed250e..6956d4e5b49 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -109,10 +109,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -288,7 +287,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -298,9 +297,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but NLocBuffer might
+	 * not be initialized yet, so read num_temp_buffers instead.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fc8b15d0cf2..cfb58cf4836 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2097,6 +2097,8 @@ PgStat_TableCounts
 PgStat_TableStatus
 PgStat_TableXactStatus
 PgStat_WalStats
+PgStreamingRead
+PgStreamingReadRange
 PgXmlErrorContext
 PgXmlStrictness
 Pg_finfo_record
@@ -2267,6 +2269,7 @@ ReInitializeDSMForeignScan_function
 ReScanForeignScan_function
 ReadBufPtrType
 ReadBufferMode
+ReadBuffersOperation
 ReadBytePtrType
 ReadExtraTocPtrType
 ReadFunc
-- 
2.43.0

v2-0002-Use-streaming-read-API-in-ANALYZE.patch (text/x-diff)
From 6bca02c986e29e86bcda21675f285a99149b6f74 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Mon, 19 Feb 2024 14:30:47 +0300
Subject: [PATCH v2 2/2] Use streaming read API in ANALYZE

The ANALYZE command acquires a random sample of tuples using the
BlockSampler algorithm. Fetch the sampled blocks with streaming reads by
running the BlockSampler inside the streaming read API's prefetch callback.
---
 src/include/access/tableam.h             | 16 ++--
 src/backend/access/heap/heapam_handler.c | 11 +--
 src/backend/commands/analyze.c           | 97 ++++++++----------------
 3 files changed, 45 insertions(+), 79 deletions(-)

diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 5f8474871d2..7e6e99ba71d 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
 #include "access/sdir.h"
 #include "access/xact.h"
 #include "executor/tuptable.h"
+#include "storage/streaming_read.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
 
@@ -648,9 +649,9 @@ typedef struct TableAmRoutine
 									BufferAccessStrategy bstrategy);
 
 	/*
-	 * Prepare to analyze block `blockno` of `scan`. The scan has been started
-	 * with table_beginscan_analyze().  See also
-	 * table_scan_analyze_next_block().
+	 * Prepare to analyze the next block of `scan`. The next block is chosen
+	 * by the callback function of `pgsr`. The scan has been started with
+	 * table_beginscan_analyze(). See also table_scan_analyze_next_block().
 	 *
 	 * The callback may acquire resources like locks that are held until
 	 * table_scan_analyze_next_tuple() returns false. It e.g. can make sense
@@ -665,8 +666,7 @@ typedef struct TableAmRoutine
 	 * isn't one yet.
 	 */
 	bool		(*scan_analyze_next_block) (TableScanDesc scan,
-											BlockNumber blockno,
-											BufferAccessStrategy bstrategy);
+											PgStreamingRead *pgsr);
 
 	/*
 	 * See table_scan_analyze_next_tuple().
@@ -1714,11 +1714,9 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
  * Returns false if block is unsuitable for sampling, true otherwise.
  */
 static inline bool
-table_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-							  BufferAccessStrategy bstrategy)
+table_scan_analyze_next_block(TableScanDesc scan, PgStreamingRead *pgsr)
 {
-	return scan->rs_rd->rd_tableam->scan_analyze_next_block(scan, blockno,
-															bstrategy);
+	return scan->rs_rd->rd_tableam->scan_analyze_next_block(scan, pgsr);
 }
 
 /*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 680a50bf8b1..7ffaf8ac402 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -993,10 +993,10 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 }
 
 static bool
-heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-							   BufferAccessStrategy bstrategy)
+heapam_scan_analyze_next_block(TableScanDesc scan, PgStreamingRead *pgsr)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
+	BlockNumber *current_block;
 
 	/*
 	 * We must maintain a pin on the target page's buffer to ensure that
@@ -1007,10 +1007,11 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
 	 * doing much work per tuple, the extra lock traffic is probably better
 	 * avoided.
 	 */
-	hscan->rs_cblock = blockno;
+	hscan->rs_cbuf = pg_streaming_read_buffer_get_next(pgsr, (void **) &current_block);
+	hscan->rs_cblock = *current_block;
 	hscan->rs_cindex = FirstOffsetNumber;
-	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
-										blockno, RBM_NORMAL, bstrategy);
+
+	Assert(BufferIsValid(hscan->rs_cbuf));
 	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
 
 	/* in heap all blocks can contain tuples, so always return true */
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index a03495d6c95..7576afcf655 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1112,6 +1112,26 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 	return stats;
 }
 
+/*
+ * Prefetch callback function to get the next block number while using the
+ * BlockSampler algorithm
+ */
+static BlockNumber
+pg_block_sampling_streaming_read_next(PgStreamingRead *pgsr,
+									  void *pgsr_private,
+									  void *per_io_data)
+{
+	BlockSamplerData *bs = pgsr_private;
+	BlockNumber *current_block = per_io_data;
+
+	if (BlockSampler_HasMore(bs))
+		*current_block = BlockSampler_Next(bs);
+	else
+		*current_block = InvalidBlockNumber;
+
+	return *current_block;
+}
+
 /*
  * acquire_sample_rows -- acquire a random sample of rows from the table
  *
@@ -1164,10 +1184,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	TableScanDesc scan;
 	BlockNumber nblocks;
 	BlockNumber blksdone = 0;
-#ifdef USE_PREFETCH
-	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
-	BlockSamplerData prefetch_bs;
-#endif
+	PgStreamingRead *pgsr = NULL;
 
 	Assert(targrows > 0);
 
@@ -1180,13 +1197,6 @@ acquire_sample_rows(Relation onerel, int elevel,
 	randseed = pg_prng_uint32(&pg_global_prng_state);
 	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
 
-#ifdef USE_PREFETCH
-	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
-	/* Create another BlockSampler, using the same seed, for prefetching */
-	if (prefetch_maximum)
-		(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
-#endif
-
 	/* Report sampling block numbers */
 	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
 								 nblocks);
@@ -1197,68 +1207,23 @@ acquire_sample_rows(Relation onerel, int elevel,
 	scan = table_beginscan_analyze(onerel);
 	slot = table_slot_create(onerel, NULL);
 
-#ifdef USE_PREFETCH
-
-	/*
-	 * If we are doing prefetching, then go ahead and tell the kernel about
-	 * the first set of pages we are going to want.  This also moves our
-	 * iterator out ahead of the main one being used, where we will keep it so
-	 * that we're always pre-fetching out prefetch_maximum number of blocks
-	 * ahead.
-	 */
-	if (prefetch_maximum)
-	{
-		for (int i = 0; i < prefetch_maximum; i++)
-		{
-			BlockNumber prefetch_block;
-
-			if (!BlockSampler_HasMore(&prefetch_bs))
-				break;
-
-			prefetch_block = BlockSampler_Next(&prefetch_bs);
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
-		}
-	}
-#endif
+	pgsr = pg_streaming_read_buffer_alloc(PGSR_FLAG_DEFAULT,
+										  &bs,
+										  sizeof(BlockSamplerData),
+										  vac_strategy,
+										  BMR_REL(scan->rs_rd),
+										  MAIN_FORKNUM,
+										  pg_block_sampling_streaming_read_next);
 
 	/* Outer loop over blocks to sample */
-	while (BlockSampler_HasMore(&bs))
+	while (nblocks)
 	{
 		bool		block_accepted;
-		BlockNumber targblock = BlockSampler_Next(&bs);
-#ifdef USE_PREFETCH
-		BlockNumber prefetch_targblock = InvalidBlockNumber;
-
-		/*
-		 * Make sure that every time the main BlockSampler is moved forward
-		 * that our prefetch BlockSampler also gets moved forward, so that we
-		 * always stay out ahead.
-		 */
-		if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
-			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
-#endif
 
 		vacuum_delay_point();
 
-		block_accepted = table_scan_analyze_next_block(scan, targblock, vac_strategy);
+		block_accepted = table_scan_analyze_next_block(scan, pgsr);
 
-#ifdef USE_PREFETCH
-
-		/*
-		 * When pre-fetching, after we get a block, tell the kernel about the
-		 * next one we will want, if there's any left.
-		 *
-		 * We want to do this even if the table_scan_analyze_next_block() call
-		 * above decides against analyzing the block it picked.
-		 */
-		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
-#endif
-
-		/*
-		 * Don't analyze if table_scan_analyze_next_block() indicated this
-		 * block is unsuitable for analyzing.
-		 */
 		if (!block_accepted)
 			continue;
 
@@ -1309,7 +1274,9 @@ acquire_sample_rows(Relation onerel, int elevel,
 
 		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
 									 ++blksdone);
+		nblocks--;
 	}
+	pg_streaming_read_free(pgsr);
 
 	ExecDropSingleTupleTableSlot(slot);
 	table_endscan(scan);
-- 
2.43.0

#3Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Nazir Bilal Yavuz (#2)
2 attachment(s)
Re: Use streaming read API in ANALYZE

Hi,

On Wed, 28 Feb 2024 at 14:42, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

> The new version of the streaming read API [1] is posted. I updated the
> streaming read API changes patch (0001), using the streaming read API
> in ANALYZE patch (0002) remains the same. This should make it easier
> to review as it can be applied on top of master.

The new version of the streaming read API is posted [1]. I rebased the
patch on top of master and v9 of the streaming read API.

There is a minimal change in the 'use streaming read API in ANALYZE'
patch (0002): I changed STREAMING_READ_FULL to STREAMING_READ_MAINTENANCE
to copy exactly the same behavior as before; the call site is sketched
below.
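
Roughly, the allocation in acquire_sample_rows() now becomes the
following (only a sketch; the callback name and the per-buffer data size
are illustrative here, the authoritative code is in the attached 0002
patch):

    /* per-buffer data carries the BlockNumber chosen by the callback */
    pgsr = streaming_read_buffer_begin(STREAMING_READ_MAINTENANCE,
                                       vac_strategy,
                                       BMR_REL(scan->rs_rd),
                                       MAIN_FORKNUM,
                                       block_sampling_read_next,
                                       &bs,
                                       sizeof(BlockNumber));

Also, some benchmarking results: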

I created a 22 GB table and set shared_buffers to 30 GB; the rest of the
settings are defaults.

╔═══════════════════════════╦═════════════════════╦════════════╗
║                           ║  Avg Timings in ms  ║            ║
╠═══════════════════════════╬══════════╦══════════╬════════════╣
║                           ║  master  ║ patched  ║ percentage ║
╠═══════════════════════════╬══════════╬══════════╬════════════╣
║ Both OS cache and         ║          ║          ║            ║
║ shared buffers are clear  ║ 513.9247 ║ 463.1019 ║    9.9%    ║
╠═══════════════════════════╬══════════╬══════════╬════════════╣
║ OS cache is loaded but    ║          ║          ║            ║
║ shared buffers are clear  ║ 423.1097 ║ 354.3277 ║   16.3%    ║
╠═══════════════════════════╬══════════╬══════════╬════════════╣
║ Shared buffers are loaded ║          ║          ║            ║
║                           ║  89.2846 ║  84.6952 ║    5.1%    ║
╚═══════════════════════════╩══════════╩══════════╩════════════╝

Any kind of feedback would be appreciated.

[1]: /messages/by-id/CA+hUKGL-ONQnnnp-SONCFfLJzqcpAheuzZ+-yTrD9WBM-GmAcg@mail.gmail.com

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v3-0001-Streaming-read-API-changes-that-are-not-committed.patch (text/x-patch)
From 7278c354d80e4d1f21c6fa0d810a723789f2d722 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v3 1/2] Streaming read API changes that are not committed to
 master yet

Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
---
 src/include/storage/bufmgr.h                  |  45 +-
 src/include/storage/streaming_read.h          |  50 ++
 src/backend/storage/Makefile                  |   2 +-
 src/backend/storage/aio/Makefile              |  14 +
 src/backend/storage/aio/meson.build           |   5 +
 src/backend/storage/aio/streaming_read.c      | 678 +++++++++++++++++
 src/backend/storage/buffer/bufmgr.c           | 706 ++++++++++++------
 src/backend/storage/buffer/localbuf.c         |  14 +-
 src/backend/storage/meson.build               |   1 +
 src/backend/utils/misc/guc_tables.c           |  14 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 doc/src/sgml/config.sgml                      |  14 +
 src/tools/pgindent/typedefs.list              |   2 +
 13 files changed, 1309 insertions(+), 237 deletions(-)
 create mode 100644 src/include/storage/streaming_read.h
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c

diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..1cc198bde21 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -133,6 +134,10 @@ extern PGDLLIMPORT bool track_io_timing;
 extern PGDLLIMPORT int effective_io_concurrency;
 extern PGDLLIMPORT int maintenance_io_concurrency;
 
+#define MAX_BUFFER_IO_SIZE PG_IOV_MAX
+#define DEFAULT_BUFFER_IO_SIZE Min(MAX_BUFFER_IO_SIZE, (128 * 1024) / BLCKSZ)
+extern PGDLLIMPORT int buffer_io_size;
+
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
 extern PGDLLIMPORT int bgwriter_flush_after;
@@ -158,7 +163,6 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
-
 /*
  * prototypes for functions in bufmgr.c
  */
@@ -177,6 +181,42 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+/*
+ * Private state used by StartReadBuffers() and WaitReadBuffers().  Declared
+ * in public header only to allow inclusion in other structs, but contents
+ * should not be accessed.
+ */
+struct ReadBuffersOperation
+{
+	/* Parameters passed in to StartReadBuffers(). */
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	ForkNumber	forknum;
+	BlockNumber blocknum;
+	int16		nblocks;
+	BufferAccessStrategy strategy;
+	int			flags;
+
+	/* Range of buffers, if we need to perform a read. */
+	int16		io_buffers_len;
+};
+
+typedef struct ReadBuffersOperation ReadBuffersOperation;
+
+extern bool StartReadBuffers(BufferManagerRelation bmr,
+							 Buffer *buffers,
+							 ForkNumber forknum,
+							 BlockNumber blocknum,
+							 int *nblocks,
+							 BufferAccessStrategy strategy,
+							 int flags,
+							 ReadBuffersOperation *operation);
+extern void WaitReadBuffers(ReadBuffersOperation *operation);
+
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -250,6 +290,9 @@ extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h
new file mode 100644
index 00000000000..7991402631a
--- /dev/null
+++ b/src/include/storage/streaming_read.h
@@ -0,0 +1,50 @@
+#ifndef STREAMING_READ_H
+#define STREAMING_READ_H
+
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define STREAMING_READ_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users.
+ */
+#define STREAMING_READ_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define STREAMING_READ_SEQUENTIAL 0x02
+
+/*
+ * We usually ramp up from smaller reads to larger ones, to support users who
+ * don't know if it's worth reading lots of buffers yet.  This flag disables
+ * that, declaring ahead of time that we'll be reading all available buffers.
+ */
+#define STREAMING_READ_FULL 0x04
+
+struct StreamingRead;
+typedef struct StreamingRead StreamingRead;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*StreamingReadBufferCB) (StreamingRead *stream,
+											  void *callback_private_data,
+											  void *per_buffer_data);
+
+extern StreamingRead *streaming_read_buffer_begin(int flags,
+												  BufferAccessStrategy strategy,
+												  BufferManagerRelation bmr,
+												  ForkNumber forknum,
+												  StreamingReadBufferCB callback,
+												  void *callback_private_data,
+												  size_t per_buffer_data_size);
+extern Buffer streaming_read_buffer_next(StreamingRead *stream, void **per_buffer_private);
+extern void streaming_read_buffer_end(StreamingRead *stream);
+
+#endif
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 00000000000..bcab44c802f
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	streaming_read.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 00000000000..39aef2a84a2
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'streaming_read.c',
+)
diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c
new file mode 100644
index 00000000000..760a231500a
--- /dev/null
+++ b/src/backend/storage/aio/streaming_read.c
@@ -0,0 +1,678 @@
+/*-------------------------------------------------------------------------
+ *
+ * streaming_read.c
+ *	  Mechanism for buffer access with look-ahead
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * Code that needs to access relation data typically pins blocks one at a
+ * time, often in a predictable order that might be sequential or data-driven.
+ * Calling the simple ReadBuffer() function for each block is inefficient,
+ * because blocks that are not yet in the buffer pool require I/O operations
+ * that are small and might stall waiting for storage.  This mechanism looks
+ * into the future and calls StartReadBuffers() and WaitReadBuffers() to read
+ * neighboring blocks together and ahead of time, with an adaptive look-ahead
+ * distance.
+ *
+ * A user-provided callback generates a stream of block numbers that is used
+ * to form reads of up to size buffer_io_size, by attempting to merge them
+ * with a pending read.  When that isn't possible, the existing pending read
+ * is sent to StartReadBuffers() so that a new one can begin to form.
+ *
+ * The algorithm for controlling the look-ahead distance tries to classify the
+ * stream into three ideal behaviors:
+ *
+ * A) No I/O is necessary, because the requested blocks are fully cached
+ * already.  There is no benefit to looking ahead more than one block, so
+ * distance is 1.  This is the default initial assumption.
+ *
+ * B) I/O is necessary, but fadvise is undesirable because the access is
+ * sequential, or impossible because direct I/O is enabled or the system
+ * doesn't support advice.  There is no benefit in looking ahead more than
+ * buffer_io_size (the GUC controlling physical read size), because in this
+ * case only goal is larger read system calls.  Looking further ahead would
+ * pin many buffers and perform speculative work looking ahead for no benefit.
+ *
+ * C) I/O is necessary, it appears random, and this system supports fadvise.
+ * We'll look further ahead in order to reach the configured level of I/O
+ * concurrency.
+ *
+ * The distance increases rapidly and decays slowly, so that it moves towards
+ * those levels as different I/O patterns are discovered.  For example, a
+ * sequential scan of fully cached data doesn't bother looking ahead, but a
+ * sequential scan that hits a region of uncached blocks will start issuing
+ * increasingly wide read calls until it plateaus at buffer_io_size.
+ *
+ * The main data structure is a circular queue of buffers of size
+ * max_pinned_buffers, ready to be returned by streaming_read_buffer_next().
+ * Each buffer also has an optional variable sized object that is passed from
+ * the callback to the consumer of buffers.  A third array records whether
+ * WaitReadBuffers() must be called before returning the buffer, and if so,
+ * points to the relevant ReadBuffersOperation object.
+ *
+ * For example, if the callback returns block numbers 10, 42, 43, 60 in
+ * successive calls, then these data structures might appear as follows:
+ *
+ *                          buffers buf/data buf/io       ios
+ *
+ *                          +----+  +-----+  +---+        +--------+
+ *                          |    |  |     |  |   |  +---->| 42..44 |
+ *                          +----+  +-----+  +---+  |     +--------+
+ *   oldest_buffer_index -> | 10 |  |  ?  |  |   |  | +-->| 60..60 |
+ *                          +----+  +-----+  +---+  | |   +--------+
+ *                          | 42 |  |  ?  |  | 0 +--+ |   |        |
+ *                          +----+  +-----+  +---+    |   +--------+
+ *                          | 43 |  |  ?  |  |   |    |   |        |
+ *                          +----+  +-----+  +---+    |   +--------+
+ *                          | 44 |  |  ?  |  |   |    |   |        |
+ *                          +----+  +-----+  +---+    |   +--------+
+ *                          | 60 |  |  ?  |  | 1 +----+
+ *                          +----+  +-----+  +---+
+ *     next_buffer_index -> |    |  |     |  |   |
+ *                          +----+  +-----+  +---+
+ *
+ * In the example, 5 buffers are pinned, and the next buffer to be streamed to
+ * the client is block 10.  Block 10 was a hit and has no associated I/O, but
+ * the range 42..44 requires an I/O wait before its buffers are returned, as
+ * does block 60.
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/aio/streaming_read.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "catalog/pg_tablespace.h"
+#include "miscadmin.h"
+#include "storage/streaming_read.h"
+#include "utils/rel.h"
+#include "utils/spccache.h"
+
+/*
+ * Streaming read object.
+ */
+struct StreamingRead
+{
+	int16		max_ios;
+	int16		ios_in_progress;
+	int16		max_pinned_buffers;
+	int16		pinned_buffers;
+	int16		distance;
+	bool		started;
+	bool		finished;
+	bool		advice_enabled;
+
+	/*
+	 * The callback that will tell us which block numbers to read, and an
+	 * opaque pointer that will be passed to it for its own purposes.
+	 */
+	StreamingReadBufferCB callback;
+	void	   *callback_private_data;
+
+	/* The relation we will read. */
+	BufferAccessStrategy strategy;
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+
+	/* Sometimes we need to buffer one block for flow control. */
+	BlockNumber unget_blocknum;
+	void	   *unget_per_buffer_data;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* The read operation we are currently preparing. */
+	BlockNumber pending_read_blocknum;
+	int16		pending_read_nblocks;
+
+	/* Space for buffers and optional per-buffer private data. */
+	Buffer	   *buffers;
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+	int16	   *buffer_io_indexes;
+
+	/* Read operations that have been started but not waited for yet. */
+	ReadBuffersOperation *ios;
+	int16		next_io_index;
+
+	/* Head and tail of the circular queue of buffers. */
+	int16		oldest_buffer_index;	/* Next pinned buffer to return */
+	int16		next_buffer_index;	/* Index of next buffer to pin */
+};
+
+/*
+ * Return a pointer to the per-buffer data by index.
+ */
+static void *
+get_per_buffer_data(StreamingRead *stream, int16 buffer_index)
+{
+	return (char *) stream->per_buffer_data +
+		stream->per_buffer_data_size * buffer_index;
+}
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow streaming_read_unget_block() to work.
+ */
+static BlockNumber
+streaming_read_get_block(StreamingRead *stream, void *per_buffer_data)
+{
+	BlockNumber result;
+
+	if (unlikely(stream->unget_blocknum != InvalidBlockNumber))
+	{
+		/*
+		 * If we had to unget a block, now it is time to return that one
+		 * again.
+		 */
+		result = stream->unget_blocknum;
+		stream->unget_blocknum = InvalidBlockNumber;
+
+		/*
+		 * The same per_buffer_data element must have been used, and still
+		 * contains whatever data the callback wrote into it.  So we just
+		 * sanity-check that we were called with the value that
+		 * streaming_read_unget_block() pushed back.
+		 */
+		Assert(per_buffer_data == stream->unget_per_buffer_data);
+	}
+	else
+	{
+		/* Use the installed callback directly. */
+		result = stream->callback(stream,
+								  stream->callback_private_data,
+								  per_buffer_data);
+	}
+
+	return result;
+}
+
+/*
+ * In order to deal with short reads in StartReadBuffers(), we sometimes need
+ * to defer handling of a block until later.  This *must* be called with the
+ * last value returned by streaming_read_get_block().
+ */
+static void
+streaming_read_unget_block(StreamingRead *stream, BlockNumber blocknum, void *per_buffer_data)
+{
+	Assert(stream->unget_blocknum == InvalidBlockNumber);
+	stream->unget_blocknum = blocknum;
+	stream->unget_per_buffer_data = per_buffer_data;
+}
+
+static void
+streaming_read_start_pending_read(StreamingRead *stream)
+{
+	bool		need_wait;
+	int			nblocks;
+	int16		io_index;
+	int16		overflow;
+	int			flags;
+
+	/* This should only be called with a pending read. */
+	Assert(stream->pending_read_nblocks > 0);
+	Assert(stream->pending_read_nblocks <= buffer_io_size);
+
+	/* We had better not exceed the pin limit by starting this read. */
+	Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
+		   stream->max_pinned_buffers);
+
+	/* We had better not be overwriting an existing pinned buffer. */
+	if (stream->pinned_buffers > 0)
+		Assert(stream->next_buffer_index != stream->oldest_buffer_index);
+	else
+		Assert(stream->next_buffer_index == stream->oldest_buffer_index);
+
+	/*
+	 * If advice hasn't been suppressed, this system supports it, and this
+	 * isn't a strictly sequential pattern, then we'll issue advice.
+	 */
+	if (stream->advice_enabled &&
+		stream->started &&
+		stream->pending_read_blocknum != stream->seq_blocknum)
+		flags = READ_BUFFERS_ISSUE_ADVICE;
+	else
+		flags = 0;
+
+	/* Suppress advice on the first call, because it's too late to benefit. */
+	if (!stream->started)
+		stream->started = true;
+
+	/* We say how many blocks we want to read, but it may be fewer on return. */
+	nblocks = stream->pending_read_nblocks;
+	need_wait =
+		StartReadBuffers(stream->bmr,
+						 &stream->buffers[stream->next_buffer_index],
+						 stream->forknum,
+						 stream->pending_read_blocknum,
+						 &nblocks,
+						 stream->strategy,
+						 flags,
+						 &stream->ios[stream->next_io_index]);
+	stream->pinned_buffers += nblocks;
+
+	/* Remember whether we need to wait before returning this buffer. */
+	if (!need_wait)
+	{
+		io_index = -1;
+
+		/* Look-ahead distance decays, no I/O necessary (behavior A). */
+		if (stream->distance > 1)
+			stream->distance--;
+	}
+	else
+	{
+		/*
+		 * Remember to call WaitReadBuffers() before returning head buffer.
+		 * Look-ahead distance will be adjusted after waiting.
+		 */
+		io_index = stream->next_io_index;
+		if (++stream->next_io_index == stream->max_ios)
+			stream->next_io_index = 0;
+
+		Assert(stream->ios_in_progress < stream->max_ios);
+		stream->ios_in_progress++;
+	}
+
+	/* Set up the pointer to the I/O for the head buffer, if there is one. */
+	stream->buffer_io_indexes[stream->next_buffer_index] = io_index;
+
+	/*
+	 * We gave a contiguous range of buffer space to StartReadBuffers(), but
+	 * we want it to wrap around at max_pinned_buffers.  Move values that
+	 * overflowed into the extra space.  At the same time, put -1 in the I/O
+	 * slots for the rest of the buffers to indicate no I/O.  They are covered
+	 * by the head buffer's I/O, if there is one.  We avoid a % operator.
+	 */
+	overflow = (stream->next_buffer_index + nblocks) - stream->max_pinned_buffers;
+	if (overflow > 0)
+	{
+		memmove(&stream->buffers[0],
+				&stream->buffers[stream->max_pinned_buffers],
+				sizeof(stream->buffers[0]) * overflow);
+		for (int i = 0; i < overflow; ++i)
+			stream->buffer_io_indexes[i] = -1;
+		for (int i = 1; i < nblocks - overflow; ++i)
+			stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
+	}
+	else
+	{
+		for (int i = 1; i < nblocks; ++i)
+			stream->buffer_io_indexes[stream->next_buffer_index + i] = -1;
+	}
+
+	/*
+	 * Remember where the next block would be after that, so we can detect
+	 * sequential access next time and suppress advice.
+	 */
+	stream->seq_blocknum = stream->pending_read_blocknum + nblocks;
+
+	/* Compute location of start of next read, without using % operator. */
+	stream->next_buffer_index += nblocks;
+	if (stream->next_buffer_index >= stream->max_pinned_buffers)
+		stream->next_buffer_index -= stream->max_pinned_buffers;
+	Assert(stream->next_buffer_index >= 0);
+	Assert(stream->next_buffer_index < stream->max_pinned_buffers);
+
+	/* Adjust the pending read to cover the remaining portion, if any. */
+	stream->pending_read_blocknum += nblocks;
+	stream->pending_read_nblocks -= nblocks;
+}
+
+static void
+streaming_read_look_ahead(StreamingRead *stream)
+{
+	while (!stream->finished &&
+		   stream->ios_in_progress < stream->max_ios &&
+		   stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
+	{
+		BlockNumber blocknum;
+		int16		buffer_index;
+		void	   *per_buffer_data;
+
+		if (stream->pending_read_nblocks == buffer_io_size)
+		{
+			streaming_read_start_pending_read(stream);
+			continue;
+		}
+
+		/*
+		 * See which block the callback wants next in the stream.  We need to
+		 * compute the index of the Nth block of the pending read including
+		 * wrap-around, but we don't want to use the expensive % operator.
+		 */
+		buffer_index = stream->next_buffer_index + stream->pending_read_nblocks;
+		if (buffer_index > stream->max_pinned_buffers)
+			buffer_index -= stream->max_pinned_buffers;
+		per_buffer_data = get_per_buffer_data(stream, buffer_index);
+		blocknum = streaming_read_get_block(stream, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			stream->finished = true;
+			continue;
+		}
+
+		/* Can we merge it with the pending read? */
+		if (stream->pending_read_nblocks > 0 &&
+			stream->pending_read_blocknum + stream->pending_read_nblocks == blocknum)
+		{
+			stream->pending_read_nblocks++;
+			continue;
+		}
+
+		/* We have to start the pending read before we can build another. */
+		if (stream->pending_read_nblocks > 0)
+		{
+			streaming_read_start_pending_read(stream);
+			if (stream->ios_in_progress == stream->max_ios)
+			{
+				/* And we've hit the limit.  Rewind, and stop here. */
+				streaming_read_unget_block(stream, blocknum, per_buffer_data);
+				return;
+			}
+		}
+
+		/* This is the start of a new pending read. */
+		stream->pending_read_blocknum = blocknum;
+		stream->pending_read_nblocks = 1;
+	}
+
+	/*
+	 * Normally we don't start the pending read just because we've hit a
+	 * limit, preferring to give it another chance to grow to a larger size
+	 * once more buffers have been consumed.  However, in cases where that
+	 * can't possibly happen, we might as well start the read immediately.
+	 */
+	if (((stream->pending_read_nblocks > 0 && stream->finished) ||
+		 (stream->pending_read_nblocks == stream->distance)) &&
+		stream->ios_in_progress < stream->max_ios)
+		streaming_read_start_pending_read(stream);
+}
+
+/*
+ * Create a new streaming read object that can be used to perform the
+ * equivalent of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.  The callback should return block numbers or InvalidBlockNumber to
+ * signal end-of-stream, and if per_buffer_data_size is non-zero, it may also
+ * write extra data for each block into the space provided to it.  It will
+ * also receive callback_private_data for its own purposes.
+ */
+StreamingRead *
+streaming_read_buffer_begin(int flags,
+							BufferAccessStrategy strategy,
+							BufferManagerRelation bmr,
+							ForkNumber forknum,
+							StreamingReadBufferCB callback,
+							void *callback_private_data,
+							size_t per_buffer_data_size)
+{
+	StreamingRead *stream;
+	int16		max_ios;
+	uint32		max_pinned_buffers;
+	Oid			tablespace_id;
+
+	/*
+	 * Make sure our bmr's smgr and relpersistence are populated.  The caller
+	 * asserts that the storage manager will remain valid.
+	 */
+	if (!bmr.smgr)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	/*
+	 * Decide how many assumed I/Os we will allow to run concurrently.  That
+	 * is, advice issued to the kernel about blocks we will soon read.  This
+	 * number also affects how far we look ahead for opportunities to start
+	 * more I/Os.
+	 */
+	tablespace_id = bmr.smgr->smgr_rlocator.locator.spcOid;
+	if (!OidIsValid(MyDatabaseId) ||
+		(bmr.rel && IsCatalogRelation(bmr.rel)) ||
+		IsCatalogRelationOid(bmr.smgr->smgr_rlocator.locator.relNumber))
+	{
+		/*
+		 * Avoid circularity while trying to look up tablespace settings or
+		 * before spccache.c is ready.
+		 */
+		max_ios = effective_io_concurrency;
+	}
+	else if (flags & STREAMING_READ_MAINTENANCE)
+		max_ios = get_tablespace_maintenance_io_concurrency(tablespace_id);
+	else
+		max_ios = get_tablespace_io_concurrency(tablespace_id);
+
+	/*
+	 * Choose a maximum number of buffers we're prepared to pin.  We try to
+	 * pin fewer if we can, though.  We clamp it to at least buffer_io_size so
+	 * that we can have a chance to build up a full sized read, even when
+	 * max_ios is zero.
+	 */
+	max_pinned_buffers = Max(max_ios * 4, buffer_io_size);
+
+	/* Don't allow this backend to pin more than its share of buffers. */
+	if (SmgrIsTemp(bmr.smgr))
+		LimitAdditionalLocalPins(&max_pinned_buffers);
+	else
+		LimitAdditionalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	stream = (StreamingRead *) palloc0(sizeof(StreamingRead));
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  We can use it as long as
+	 * direct I/O isn't enabled, the caller hasn't promised sequential access
+	 * (overriding our detection heuristics), and max_ios hasn't been set to
+	 * zero.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & STREAMING_READ_SEQUENTIAL) == 0 &&
+		max_ios > 0)
+		stream->advice_enabled = true;
+#endif
+
+	/*
+	 * For now, max_ios = 0 is interpreted as max_ios = 1 with advice disabled
+	 * above.  If we had real asynchronous I/O we might need a slightly
+	 * different definition.
+	 */
+	if (max_ios == 0)
+		max_ios = 1;
+
+	stream->max_ios = max_ios;
+	stream->per_buffer_data_size = per_buffer_data_size;
+	stream->max_pinned_buffers = max_pinned_buffers;
+	stream->strategy = strategy;
+
+	stream->bmr = bmr;
+	stream->forknum = forknum;
+	stream->callback = callback;
+	stream->callback_private_data = callback_private_data;
+
+	stream->unget_blocknum = InvalidBlockNumber;
+
+	/*
+	 * Skip the initial ramp-up phase if the caller says we're going to be
+	 * reading the whole relation.  This way we start out doing full-sized
+	 * reads.
+	 */
+	if (flags & STREAMING_READ_FULL)
+		stream->distance = stream->max_pinned_buffers;
+	else
+		stream->distance = 1;
+
+	/*
+	 * Space for the buffers we pin.  Though we never pin more than
+	 * max_pinned_buffers, we want to be able to assume that all the buffers
+	 * for a single read are contiguous (i.e. don't wrap around halfway
+	 * through), so we let the final one run past that position temporarily by
+	 * allocating buffer_io_size - 1 extra elements.
+	 */
+	stream->buffers = palloc((max_pinned_buffers + buffer_io_size - 1) *
+							 sizeof(stream->buffers[0]));
+
+	/* Space for per-buffer data, if configured. */
+	if (per_buffer_data_size)
+		stream->per_buffer_data =
+			palloc(per_buffer_data_size * (max_pinned_buffers +
+										   buffer_io_size - 1));
+
+	/* Space for the IOs that we might run. */
+	stream->buffer_io_indexes = palloc(max_pinned_buffers * sizeof(stream->buffer_io_indexes[0]));
+	stream->ios = palloc(max_ios * sizeof(ReadBuffersOperation));
+
+	return stream;
+}
+
+/*
+ * Pull one pinned buffer out of a stream created with
+ * streaming_read_buffer_begin().  Each call returns successive blocks in the
+ * order specified by the callback.  If per_buffer_data_size was set to a
+ * non-zero size, *per_buffer_data receives a pointer to the extra per-buffer
+ * data that the callback had a chance to populate.  When the stream runs out
+ * of data, InvalidBuffer is returned.  The caller may decide to end the
+ * stream early at any time by calling streaming_read_buffer_end().
+ */
+Buffer
+streaming_read_buffer_next(StreamingRead *stream, void **per_buffer_data)
+{
+	Buffer		buffer;
+	int16		io_index;
+	int16		oldest_buffer_index;
+
+	if (unlikely(stream->pinned_buffers == 0))
+	{
+		Assert(stream->oldest_buffer_index == stream->next_buffer_index);
+
+		if (stream->finished)
+			return InvalidBuffer;
+
+		/*
+		 * The usual order of operations is that we look ahead at the bottom
+		 * of this function after potentially finishing an I/O and making
+		 * space for more, but we need a special case to prime the stream when
+		 * we're getting started.
+		 */
+		Assert(!stream->started);
+		streaming_read_look_ahead(stream);
+		if (stream->pinned_buffers == 0)
+			return InvalidBuffer;
+	}
+
+	/* Grab the oldest pinned buffer and associated per-buffer data. */
+	oldest_buffer_index = stream->oldest_buffer_index;
+	Assert(oldest_buffer_index >= 0 &&
+		   oldest_buffer_index < stream->max_pinned_buffers);
+	buffer = stream->buffers[oldest_buffer_index];
+	if (per_buffer_data)
+		*per_buffer_data = get_per_buffer_data(stream, oldest_buffer_index);
+
+	Assert(BufferIsValid(buffer));
+
+	/* Do we have to wait for an associated I/O first? */
+	io_index = stream->buffer_io_indexes[oldest_buffer_index];
+	Assert(io_index >= -1 && io_index < stream->max_ios);
+	if (io_index >= 0)
+	{
+		int			distance;
+
+		/* Sanity check that we still agree on the buffers. */
+		Assert(stream->ios[io_index].buffers == &stream->buffers[oldest_buffer_index]);
+
+		WaitReadBuffers(&stream->ios[io_index]);
+
+		Assert(stream->ios_in_progress > 0);
+		stream->ios_in_progress--;
+
+		if (stream->ios[io_index].flags & READ_BUFFERS_ISSUE_ADVICE)
+		{
+			/* Distance ramps up fast (behavior C). */
+			distance = stream->distance * 2;
+			distance = Min(distance, stream->max_pinned_buffers);
+			stream->distance = distance;
+		}
+		else
+		{
+			/* No advice; move towards full I/O size (behavior B). */
+			if (stream->distance > buffer_io_size)
+			{
+				stream->distance--;
+			}
+			else
+			{
+				distance = stream->distance * 2;
+				distance = Min(distance, buffer_io_size);
+				distance = Min(distance, stream->max_pinned_buffers);
+				stream->distance = distance;
+			}
+		}
+	}
+
+	/* Advance the oldest buffer, but clobber it first for debugging. */
+#ifdef USE_ASSERT_CHECKING
+	stream->buffers[oldest_buffer_index] = InvalidBuffer;
+	stream->buffer_io_indexes[oldest_buffer_index] = -1;
+	if (stream->per_buffer_data)
+		memset(get_per_buffer_data(stream, oldest_buffer_index),
+			   0xff,
+			   stream->per_buffer_data_size);
+#endif
+	if (++stream->oldest_buffer_index == stream->max_pinned_buffers)
+		stream->oldest_buffer_index = 0;
+
+	/* We are transferring ownership of the pin to the caller. */
+	Assert(stream->pinned_buffers > 0);
+	stream->pinned_buffers--;
+
+	/*
+	 * When distance is minimal, we finish up with no queued buffers.  As a
+	 * micro-optimization, we can then reset our circular queues, so that
+	 * all-cached streams re-use the same elements instead of rotating through
+	 * memory.
+	 */
+	if (stream->pinned_buffers == 0)
+	{
+		Assert(stream->oldest_buffer_index == stream->next_buffer_index);
+		stream->oldest_buffer_index = 0;
+		stream->next_buffer_index = 0;
+		stream->next_io_index = 0;
+	}
+
+	/* Prepare for the next call. */
+	streaming_read_look_ahead(stream);
+
+	return buffer;
+}
+
+/*
+ * Finish streaming blocks and release all resources.
+ */
+void
+streaming_read_buffer_end(StreamingRead *stream)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead. */
+	stream->finished = true;
+
+	/* Unpin anything that wasn't consumed. */
+	while ((buffer = streaming_read_buffer_next(stream, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	Assert(stream->pinned_buffers == 0);
+	Assert(stream->ios_in_progress == 0);
+
+	/* Release memory. */
+	pfree(stream->buffers);
+	if (stream->per_buffer_data)
+		pfree(stream->per_buffer_data);
+	pfree(stream->ios);
+
+	pfree(stream);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c5..b5347678726 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,11 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffers() -- as above, but for multiple contiguous blocks in
+ *		two steps.
+ *
+ * WaitReadBuffers() -- second step of StartReadBuffers().
+ *
  * ReleaseBuffer() -- unpin a buffer
  *
  * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -160,6 +165,12 @@ int			checkpoint_flush_after = DEFAULT_CHECKPOINT_FLUSH_AFTER;
 int			bgwriter_flush_after = DEFAULT_BGWRITER_FLUSH_AFTER;
 int			backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER;
 
+/*
+ * How many buffers should be coalesced into single I/O operations where
+ * possible.
+ */
+int			buffer_io_size = DEFAULT_BUFFER_IO_SIZE;
+
 /* local state for LockBufferForCleanup */
 static BufferDesc *PinCountWaitBuf = NULL;
 
@@ -471,10 +482,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
-								ReadBufferMode mode, BufferAccessStrategy strategy,
-								bool *hit);
+								ReadBufferMode mode, BufferAccessStrategy strategy);
 static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
 										   ForkNumber fork,
 										   BufferAccessStrategy strategy,
@@ -500,7 +510,7 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
@@ -781,7 +791,6 @@ Buffer
 ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				   ReadBufferMode mode, BufferAccessStrategy strategy)
 {
-	bool		hit;
 	Buffer		buf;
 
 	/*
@@ -794,15 +803,9 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
-							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+	buf = ReadBuffer_common(BMR_REL(reln),
+							forkNum, blockNum, mode, strategy);
+
 	return buf;
 }
 
@@ -822,13 +825,12 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
 						  BufferAccessStrategy strategy, bool permanent)
 {
-	bool		hit;
-
 	SMgrRelation smgr = smgropen(rlocator, INVALID_PROC_NUMBER);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
-							 mode, strategy, &hit);
+	return ReadBuffer_common(BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									  RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
+							 mode, strategy);
 }
 
 /*
@@ -994,35 +996,66 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	 */
 	if (buffer == InvalidBuffer)
 	{
-		bool		hit;
-
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
-								   fork, extend_to - 1, mode, strategy,
-								   &hit);
+		buffer = ReadBuffer_common(bmr, fork, extend_to - 1, mode, strategy);
 	}
 
 	return buffer;
 }
 
 /*
- * ReadBuffer_common -- common logic for all ReadBuffer variants
- *
- * *hit is set to true if the request was satisfied from shared buffer cache.
+ * Zero a buffer and lock it, as part of the implementation of
+ * RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK.  The buffer must already
+ * be pinned.  It does not have to be valid, but it is valid and locked on
+ * return.
  */
-static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
-				  BlockNumber blockNum, ReadBufferMode mode,
-				  BufferAccessStrategy strategy, bool *hit)
+static void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
 {
 	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
-	IOContext	io_context;
-	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
+	uint32		buf_state;
 
-	*hit = false;
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
+/*
+ * ReadBuffer_common -- common logic for all ReadBuffer variants
+ */
+static Buffer
+ReadBuffer_common(BufferManagerRelation bmr, ForkNumber forkNum,
+				  BlockNumber blockNum, ReadBufferMode mode,
+				  BufferAccessStrategy strategy)
+{
+	ReadBuffersOperation operation;
+	Buffer		buffer;
+	int			nblocks;
+	int			flags;
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1041,181 +1074,413 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		return ExtendBufferedRel(bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
+	nblocks = 1;
+	if (mode == RBM_ZERO_ON_ERROR)
+		flags = READ_BUFFERS_ZERO_ON_ERROR;
+	else
+		flags = 0;
+	if (StartReadBuffers(bmr,
+						 &buffer,
+						 forkNum,
+						 blockNum,
+						 &nblocks,
+						 strategy,
+						 flags,
+						 &operation))
+		WaitReadBuffers(&operation);
+	Assert(nblocks == 1);		/* single block can't be short */
 
+	if (mode == RBM_ZERO_AND_CLEANUP_LOCK || mode == RBM_ZERO_AND_LOCK)
+		ZeroBuffer(buffer, mode);
+
+	return buffer;
+}
+
+/*
+ * Pin a buffer for a given block.  *foundPtr is set to true if the block was
+ * already present, or false if more work is required to either read it in or
+ * zero it.
+ */
+static inline Buffer
+PinBufferForBlock(BufferManagerRelation bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	Assert(blockNum != P_NEW);
+
+	Assert(bmr.smgr);
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
 	if (isLocalBuf)
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
 		io_context = IOCONTEXT_NORMAL;
 		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
 	}
 	else
 	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
 		io_context = IOContextForStrategy(strategy);
 		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
 	}
 
-	/* At this point we do NOT hold any locks. */
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr.smgr->smgr_rlocator.locator.spcOid,
+									   bmr.smgr->smgr_rlocator.locator.dbOid,
+									   bmr.smgr->smgr_rlocator.locator.relNumber,
+									   bmr.smgr->smgr_rlocator.backend);
 
-	/* if it was already in the buffer pool, we're done */
-	if (found)
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr.smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr.smgr, bmr.relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr.rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * WaitReadBuffers() (so, not for hits, and not for buffers that are
+		 * zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr.rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr.rel);
+	}
+	if (*foundPtr)
 	{
-		/* Just need to update stats before we exit */
-		*hit = true;
 		VacuumPageHit++;
 		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageHit;
 
 		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
-
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
-		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
-		}
-
-		return BufferDescriptorGetBuffer(bufHdr);
+										  bmr.smgr->smgr_rlocator.locator.spcOid,
+										  bmr.smgr->smgr_rlocator.locator.dbOid,
+										  bmr.smgr->smgr_rlocator.locator.relNumber,
+										  bmr.smgr->smgr_rlocator.backend,
+										  true);
 	}
 
-	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
-	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
-
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
-
-	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
-	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
-	else
-	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
-
-		smgrread(smgr, forkNum, blockNum, bufBlock);
-
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
-
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
-		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
-			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
-			}
-			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-		}
-	}
-
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
-	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
-	}
-
-	if (isLocalBuf)
-	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
-
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
-	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
-	}
-
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
-
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
-
 	return BufferDescriptorGetBuffer(bufHdr);
 }
 
 /*
- * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
- *		buffer.  If no buffer exists already, selects a replacement
- *		victim and evicts the old page, but does NOT read in new page.
+ * Begin reading a range of blocks beginning at blockNum and extending for
+ * *nblocks.  On return, up to *nblocks pinned buffers holding those blocks
+ * are written into the buffers array, and *nblocks is updated to contain the
+ * actual number, which may be fewer than requested.
+ *
+ * If false is returned, no I/O is necessary and WaitReadBuffers() is not
+ * necessary.  If true is returned, one I/O has been started, and
+ * WaitReadBuffers() must be called with the same operation object before the
+ * buffers are accessed.  Along with the operation object, the caller-supplied
+ * array of buffers must remain valid until WaitReadBuffers() is called.
+ *
+ * Currently the I/O is only started with optional operating system advice,
+ * and the real I/O happens in WaitReadBuffers().  In future work, true I/O
+ * could be initiated here.
+ */
+bool
+StartReadBuffers(BufferManagerRelation bmr,
+				 Buffer *buffers,
+				 ForkNumber forkNum,
+				 BlockNumber blockNum,
+				 int *nblocks,
+				 BufferAccessStrategy strategy,
+				 int flags,
+				 ReadBuffersOperation *operation)
+{
+	int			actual_nblocks = *nblocks;
+	int			io_buffers_len = 0;
+
+	Assert(*nblocks > 0);
+	Assert(*nblocks <= MAX_BUFFER_IO_SIZE);
+
+	if (bmr.rel)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	for (int i = 0; i < actual_nblocks; ++i)
+	{
+		bool		found;
+
+		buffers[i] = PinBufferForBlock(bmr,
+									   forkNum,
+									   blockNum + i,
+									   strategy,
+									   &found);
+
+		if (found)
+		{
+			/*
+			 * Terminate the read as soon as we get a hit.  It could be a
+			 * single buffer hit, or it could be a hit that follows a readable
+			 * range.  We don't want to create more than one readable range,
+			 * so we stop here.
+			 */
+			actual_nblocks = operation->nblocks = *nblocks = i + 1;
+			break;
+		}
+		else
+		{
+			/* Extend the readable range to cover this block. */
+			io_buffers_len++;
+		}
+	}
+
+	if (io_buffers_len > 0)
+	{
+		/* Populate extra information needed for I/O. */
+		operation->io_buffers_len = io_buffers_len;
+		operation->blocknum = blockNum;
+		operation->buffers = buffers;
+		operation->nblocks = actual_nblocks;
+		operation->bmr = bmr;
+		operation->forknum = forkNum;
+		operation->strategy = strategy;
+		operation->flags = flags;
+
+		if (flags & READ_BUFFERS_ISSUE_ADVICE)
+		{
+			/*
+			 * In theory we should only do this if PinBufferForBlock() had to
+			 * allocate new buffers above.  That way, if two calls to
+			 * StartReadBuffers() were made for the same blocks before
+			 * WaitReadBuffers(), only the first would issue the advice.
+			 * That'd be a better simulation of true asynchronous I/O, which
+			 * would only start the I/O once, but isn't done here for
+			 * simplicity.  Note also that the following call might actually
+			 * issue two advice calls if we cross a segment boundary; in a
+			 * true asynchronous version we might choose to process only one
+			 * real I/O at a time in that case.
+			 */
+			smgrprefetch(bmr.smgr, forkNum, blockNum, operation->io_buffers_len);
+		}
+
+		/* Indicate that WaitReadBuffers() should be called. */
+		return true;
+	}
+	else
+	{
+		return false;
+	}
+}
+
+static inline bool
+WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
+	{
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
+	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
+
+void
+WaitReadBuffers(ReadBuffersOperation *operation)
+{
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	int			nblocks;
+	BlockNumber blocknum;
+	ForkNumber	forknum;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	/*
+	 * Currently operations are only allowed to include a read of some range,
+	 * with an optional extra buffer that is already pinned at the end.  So
+	 * nblocks can be at most one more than io_buffers_len.
+	 */
+	Assert((operation->nblocks == operation->io_buffers_len) ||
+		   (operation->nblocks == operation->io_buffers_len + 1));
+
+	/* Find the range of the physical read we need to perform. */
+	nblocks = operation->io_buffers_len;
+	if (nblocks == 0)
+		return;					/* nothing to do */
+
+	buffers = &operation->buffers[0];
+	blocknum = operation->blocknum;
+	forknum = operation->forknum;
+	bmr = operation->bmr;
+
+	isLocalBuf = SmgrIsTemp(bmr.smgr);
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(operation->strategy);
+		io_object = IOOBJECT_RELATION;
+	}
+
+	/*
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
+	 */
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
+	else
+		pgBufferUsage.shared_blks_read += nblocks;
+
+	for (int i = 0; i < nblocks; ++i)
+	{
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_BUFFER_IO_SIZE];
+		void	   *io_pages[MAX_BUFFER_IO_SIZE];
+		instr_time	io_start;
+		BlockNumber io_first_block;
+
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!WaitReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PinBufferForBlock().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
+
+		/*
+		 * How many neighboring-on-disk blocks can we scatter-read into
+		 * other buffers at the same time?  In this case we don't wait if we
+		 * see an I/O already in progress.  We already hold BM_IO_IN_PROGRESS
+		 * for the head block, so we should get on with that I/O as soon as
+		 * possible.  We'll come back to this block again, above.
+		 */
+		while ((i + 1) < nblocks &&
+			   WaitReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
+
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
+
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
+		{
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
+			{
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
+			}
+			else
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
+
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(bmr.smgr->smgr_rlocator, forknum))));
+			}
+
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
+
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  bmr.smgr->smgr_rlocator.locator.spcOid,
+											  bmr.smgr->smgr_rlocator.locator.dbOid,
+											  bmr.smgr->smgr_rlocator.locator.relNumber,
+											  bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
+
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
+}
+
+/*
+ * BufferAlloc -- subroutine for PinBufferForBlock.  Handles lookup of a shared
+ *		buffer.  If no buffer exists already, selects a replacement victim and
+ *		evicts the old page, but does NOT read in new page.
  *
  * "strategy" can be a buffer replacement strategy object, or NULL for
  * the default strategy.  The selected buffer's usage_count is advanced when
@@ -1223,11 +1488,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1286,19 +1547,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1363,19 +1615,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1407,15 +1650,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1769,7 +2006,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2034,7 +2271,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2057,7 +2294,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2372,7 +2609,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if StartReadBuffers() was called and
+		 * WaitReadBuffers() hasn't been called yet.  We'll check by loading
+		 * the flags without locking.  This is racy, but it's OK to return
+		 * false spuriously: when WaitReadBuffers() calls StartBufferIO(),
+		 * it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2381,7 +2623,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3449,7 +3691,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -5184,9 +5426,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5199,6 +5447,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index fcfac335a57..985a2c7049c 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -108,10 +108,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -287,7 +286,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -297,9 +296,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but it might not be
+	 * initialized yet so read num_temp_buffers.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1e71e7db4a0..71889471266 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3112,6 +3112,20 @@ struct config_int ConfigureNamesInt[] =
 		NULL
 	},
 
+	{
+		{"buffer_io_size",
+			PGC_USERSET,
+			RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Target size for coalescing reads and writes of buffered data blocks."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&buffer_io_size,
+		DEFAULT_BUFFER_IO_SIZE,
+		1, MAX_BUFFER_IO_SIZE,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
 			gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 2244ee52f79..b7a4143df21 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -203,6 +203,7 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
+#buffer_io_size = 128kB
 #max_worker_processes = 8		# (change requires restart)
 #max_parallel_workers_per_gather = 2	# limited by max_parallel_workers
 #max_parallel_maintenance_workers = 2	# limited by max_parallel_workers
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 65a6e6c4086..3af86c59384 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2719,6 +2719,20 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
+      <varlistentry id="guc-buffer-io-size" xreflabel="buffer_io_size">
+       <term><varname>buffer_io_size</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>buffer_io_size</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Controls the target I/O size in operations that coalesce buffer I/O.
+         The default is 128kB.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-max-worker-processes" xreflabel="max_worker_processes">
        <term><varname>max_worker_processes</varname> (<type>integer</type>)
        <indexterm>
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 4679660837c..a603b52ed70 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2287,6 +2287,7 @@ ReInitializeDSMForeignScan_function
 ReScanForeignScan_function
 ReadBufPtrType
 ReadBufferMode
+ReadBuffersOperation
 ReadBytePtrType
 ReadExtraTocPtrType
 ReadFunc
@@ -2693,6 +2694,7 @@ StopList
 StrategyNumber
 StreamCtl
 StreamStopReason
+StreamingRead
 String
 StringInfo
 StringInfoData
-- 
2.43.0

v3-0002-Use-streaming-read-API-in-ANALYZE.patchtext/x-patch; charset=US-ASCII; name=v3-0002-Use-streaming-read-API-in-ANALYZE.patchDownload
From c7500cc1b9068ff0b704181442999cd8bed58658 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Mon, 19 Feb 2024 14:30:47 +0300
Subject: [PATCH v3 2/2] Use streaming read API in ANALYZE

The ANALYZE command gets random tuples using the BlockSampler algorithm.
Use streaming reads to get these tuples by running the BlockSampler
algorithm in the streaming read API's prefetch callback.
---
 src/include/access/tableam.h             | 16 ++--
 src/backend/access/heap/heapam_handler.c | 11 +--
 src/backend/commands/analyze.c           | 97 ++++++++----------------
 3 files changed, 45 insertions(+), 79 deletions(-)

diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index fc0e7027157..d83c085afca 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
 #include "access/sdir.h"
 #include "access/xact.h"
 #include "executor/tuptable.h"
+#include "storage/streaming_read.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
 
@@ -659,9 +660,9 @@ typedef struct TableAmRoutine
 									BufferAccessStrategy bstrategy);
 
 	/*
-	 * Prepare to analyze block `blockno` of `scan`. The scan has been started
-	 * with table_beginscan_analyze().  See also
-	 * table_scan_analyze_next_block().
+	 * Prepare to analyze the next block of `scan`.  The next block is decided
+	 * by the callback function of `pgsr`.  The scan has been started with
+	 * table_beginscan_analyze().  See also table_scan_analyze_next_block().
 	 *
 	 * The callback may acquire resources like locks that are held until
 	 * table_scan_analyze_next_tuple() returns false. It e.g. can make sense
@@ -676,8 +677,7 @@ typedef struct TableAmRoutine
 	 * isn't one yet.
 	 */
 	bool		(*scan_analyze_next_block) (TableScanDesc scan,
-											BlockNumber blockno,
-											BufferAccessStrategy bstrategy);
+											StreamingRead *stream);
 
 	/*
 	 * See table_scan_analyze_next_tuple().
@@ -1755,11 +1755,9 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
  * Returns false if block is unsuitable for sampling, true otherwise.
  */
 static inline bool
-table_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-							  BufferAccessStrategy bstrategy)
+table_scan_analyze_next_block(TableScanDesc scan, StreamingRead *stream)
 {
-	return scan->rs_rd->rd_tableam->scan_analyze_next_block(scan, blockno,
-															bstrategy);
+	return scan->rs_rd->rd_tableam->scan_analyze_next_block(scan, stream);
 }
 
 /*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 6abfe36dec7..0c285213dc8 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1053,10 +1053,10 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 }
 
 static bool
-heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-							   BufferAccessStrategy bstrategy)
+heapam_scan_analyze_next_block(TableScanDesc scan, StreamingRead *stream)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
+	BlockNumber *current_block;
 
 	/*
 	 * We must maintain a pin on the target page's buffer to ensure that
@@ -1067,10 +1067,11 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
 	 * doing much work per tuple, the extra lock traffic is probably better
 	 * avoided.
 	 */
-	hscan->rs_cblock = blockno;
+	hscan->rs_cbuf = streaming_read_buffer_next(stream, (void **) &current_block);
+	hscan->rs_cblock = *current_block;
 	hscan->rs_cindex = FirstOffsetNumber;
-	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
-										blockno, RBM_NORMAL, bstrategy);
+
+	Assert(BufferIsValid(hscan->rs_cbuf));
 	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
 
 	/* in heap all blocks can contain tuples, so always return true */
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8a82af4a4ca..bdad87c3a3e 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1102,6 +1102,26 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 	return stats;
 }
 
+/*
+ * Prefetch callback function to get next block number while using
+ * BlockSampling algorithm
+ */
+static BlockNumber
+pg_block_sampling_streaming_read_next(StreamingRead *stream,
+									  void *user_data,
+									  void *per_buffer_data)
+{
+	BlockSamplerData *bs = user_data;
+	BlockNumber *current_block = per_buffer_data;
+
+	if (BlockSampler_HasMore(bs))
+		*current_block = BlockSampler_Next(bs);
+	else
+		*current_block = InvalidBlockNumber;
+
+	return *current_block;
+}
+
 /*
  * acquire_sample_rows -- acquire a random sample of rows from the table
  *
@@ -1154,10 +1174,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	TableScanDesc scan;
 	BlockNumber nblocks;
 	BlockNumber blksdone = 0;
-#ifdef USE_PREFETCH
-	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
-	BlockSamplerData prefetch_bs;
-#endif
+	StreamingRead *stream;
 
 	Assert(targrows > 0);
 
@@ -1170,13 +1187,6 @@ acquire_sample_rows(Relation onerel, int elevel,
 	randseed = pg_prng_uint32(&pg_global_prng_state);
 	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
 
-#ifdef USE_PREFETCH
-	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
-	/* Create another BlockSampler, using the same seed, for prefetching */
-	if (prefetch_maximum)
-		(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
-#endif
-
 	/* Report sampling block numbers */
 	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
 								 nblocks);
@@ -1187,68 +1197,23 @@ acquire_sample_rows(Relation onerel, int elevel,
 	scan = table_beginscan_analyze(onerel);
 	slot = table_slot_create(onerel, NULL);
 
-#ifdef USE_PREFETCH
-
-	/*
-	 * If we are doing prefetching, then go ahead and tell the kernel about
-	 * the first set of pages we are going to want.  This also moves our
-	 * iterator out ahead of the main one being used, where we will keep it so
-	 * that we're always pre-fetching out prefetch_maximum number of blocks
-	 * ahead.
-	 */
-	if (prefetch_maximum)
-	{
-		for (int i = 0; i < prefetch_maximum; i++)
-		{
-			BlockNumber prefetch_block;
-
-			if (!BlockSampler_HasMore(&prefetch_bs))
-				break;
-
-			prefetch_block = BlockSampler_Next(&prefetch_bs);
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
-		}
-	}
-#endif
+	stream = streaming_read_buffer_begin(STREAMING_READ_MAINTENANCE,
+										 vac_strategy,
+										 BMR_REL(scan->rs_rd),
+										 MAIN_FORKNUM,
+										 pg_block_sampling_streaming_read_next,
+										 &bs,
+										 sizeof(BlockSamplerData));
 
 	/* Outer loop over blocks to sample */
-	while (BlockSampler_HasMore(&bs))
+	while (nblocks)
 	{
 		bool		block_accepted;
-		BlockNumber targblock = BlockSampler_Next(&bs);
-#ifdef USE_PREFETCH
-		BlockNumber prefetch_targblock = InvalidBlockNumber;
-
-		/*
-		 * Make sure that every time the main BlockSampler is moved forward
-		 * that our prefetch BlockSampler also gets moved forward, so that we
-		 * always stay out ahead.
-		 */
-		if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
-			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
-#endif
 
 		vacuum_delay_point();
 
-		block_accepted = table_scan_analyze_next_block(scan, targblock, vac_strategy);
+		block_accepted = table_scan_analyze_next_block(scan, stream);
 
-#ifdef USE_PREFETCH
-
-		/*
-		 * When pre-fetching, after we get a block, tell the kernel about the
-		 * next one we will want, if there's any left.
-		 *
-		 * We want to do this even if the table_scan_analyze_next_block() call
-		 * above decides against analyzing the block it picked.
-		 */
-		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
-#endif
-
-		/*
-		 * Don't analyze if table_scan_analyze_next_block() indicated this
-		 * block is unsuitable for analyzing.
-		 */
 		if (!block_accepted)
 			continue;
 
@@ -1299,7 +1264,9 @@ acquire_sample_rows(Relation onerel, int elevel,
 
 		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
 									 ++blksdone);
+		nblocks--;
 	}
+	streaming_read_buffer_end(stream);
 
 	ExecDropSingleTupleTableSlot(slot);
 	table_endscan(scan);
-- 
2.43.0

#4Melanie Plageman
melanieplageman@gmail.com
In reply to: Nazir Bilal Yavuz (#3)
Re: Use streaming read API in ANALYZE

On Tue, Mar 26, 2024 at 02:51:27PM +0300, Nazir Bilal Yavuz wrote:

Hi,

On Wed, 28 Feb 2024 at 14:42, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

The new version of the streaming read API [1] is posted. I updated the
streaming read API changes patch (0001); the 'use streaming read API in
ANALYZE' patch (0002) remains the same. This should make it easier to
review as it can be applied on top of master.

The new version of the streaming read API is posted [1]. I rebased the
patch on top of master and v9 of the streaming read API.

There is a minimal change in the 'use streaming read API in ANALYZE'
patch (0002): I changed STREAMING_READ_FULL to STREAMING_READ_MAINTENANCE
to keep exactly the same behavior as before. Also, some benchmarking
results:

I created a 22 GB table and set the size of shared buffers to 30GB; the
rest is default.

╔═══════════════════════════╦═════════════════════╦════════════╗
║                           ║  Avg Timings in ms  ║            ║
╠═══════════════════════════╬══════════╦══════════╬════════════╣
║                           ║  master  ║ patched  ║ percentage ║
╠═══════════════════════════╬══════════╬══════════╬════════════╣
║ Both OS cache and         ║          ║          ║            ║
║ shared buffers are clear  ║ 513.9247 ║ 463.1019 ║ %9.9       ║
╠═══════════════════════════╬══════════╬══════════╬════════════╣
║ OS cache is loaded but    ║          ║          ║            ║
║ shared buffers are clear  ║ 423.1097 ║ 354.3277 ║ %16.3      ║
╠═══════════════════════════╬══════════╬══════════╬════════════╣
║ Shared buffers are loaded ║          ║          ║            ║
║                           ║ 89.2846  ║ 84.6952  ║ %5.1       ║
╚═══════════════════════════╩══════════╩══════════╩════════════╝

Any kind of feedback would be appreciated.

Thanks for working on this!

A general review comment: I noticed you have the old streaming read
(pgsr) naming still in a few places (including comments) -- so I would
just make sure and update everywhere when you rebase in Thomas' latest
version of the read stream API.

From c7500cc1b9068ff0b704181442999cd8bed58658 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Mon, 19 Feb 2024 14:30:47 +0300
Subject: [PATCH v3 2/2] Use streaming read API in ANALYZE

--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1102,6 +1102,26 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
return stats;
}
+/*
+ * Prefetch callback function to get next block number while using
+ * BlockSampling algorithm
+ */
+static BlockNumber
+pg_block_sampling_streaming_read_next(StreamingRead *stream,
+									  void *user_data,
+									  void *per_buffer_data)

I don't think you need the pg_ prefix

+{
+	BlockSamplerData *bs = user_data;
+	BlockNumber *current_block = per_buffer_data;

Why can't you just do BufferGetBlockNumber() on the buffer returned from
the read stream API instead of allocating per_buffer_data for the block
number?
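Roughly what I have in mind -- an untested sketch against the v3 names,
assuming the stream is then created with per_buffer_data_size = 0 and
that passing NULL for the out-pointer is acceptable in that case:

static BlockNumber
block_sampling_streaming_read_next(StreamingRead *stream,
								   void *user_data,
								   void *per_buffer_data)
{
	BlockSamplerData *bs = user_data;

	/* No per-buffer data; just hand back the next sampled block, if any. */
	return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
}

and then in heapam_scan_analyze_next_block():

	hscan->rs_cbuf = streaming_read_buffer_next(stream, NULL);
	Assert(BufferIsValid(hscan->rs_cbuf));
	hscan->rs_cblock = BufferGetBlockNumber(hscan->rs_cbuf);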

+
+	if (BlockSampler_HasMore(bs))
+		*current_block = BlockSampler_Next(bs);
+	else
+		*current_block = InvalidBlockNumber;
+
+	return *current_block;

I think we'd like to keep the read stream code in heapam-specific code.
Instead of doing streaming_read_buffer_begin() here, you could put this
in heap_beginscan() or initscan() guarded by
scan->rs_base.rs_flags & SO_TYPE_ANALYZE

same with streaming_read_buffer_end()/heap_endscan().

You'd also then need to save the reference to the read stream in the
HeapScanDescData.
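In rough outline -- a sketch only; rs_read_stream is a hypothetical new
member, and I'm hand-waving how the BlockSampler state would reach
initscan():

	/* in HeapScanDescData */
	StreamingRead *rs_read_stream;	/* hypothetical: set only for ANALYZE */

	/* in initscan() or heap_beginscan() */
	if (scan->rs_base.rs_flags & SO_TYPE_ANALYZE)
		scan->rs_read_stream =
			streaming_read_buffer_begin(STREAMING_READ_MAINTENANCE,
										scan->rs_strategy,
										BMR_REL(scan->rs_base.rs_rd),
										MAIN_FORKNUM,
										block_sampling_streaming_read_next,
										NULL,	/* BlockSampler state */
										0);

	/* in heap_endscan() */
	if (scan->rs_read_stream)
		streaming_read_buffer_end(scan->rs_read_stream);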

+	stream = streaming_read_buffer_begin(STREAMING_READ_MAINTENANCE,
+										 vac_strategy,
+										 BMR_REL(scan->rs_rd),
+										 MAIN_FORKNUM,
+										 pg_block_sampling_streaming_read_next,
+										 &bs,
+										 sizeof(BlockSamplerData));

/* Outer loop over blocks to sample */

In fact, I think you could use this opportunity to get rid of the block
dependency in acquire_sample_rows() altogether.

Looking at the code now, it seems like you could just invoke
heapam_scan_analyze_next_block() (maybe rename it to
heapam_scan_analyze_next_buffer() or something) from
heapam_scan_analyze_next_tuple() and remove
table_scan_analyze_next_block() entirely.

Then table AMs can figure out how they want to return tuples from
table_scan_analyze_next_tuple().

If you do all this, note that you'll need to update the comments above
acquire_sample_rows() accordingly.
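Very roughly, with a hypothetical helper (same assumptions about the
stream's per-buffer data as above):

static bool
heapam_scan_analyze_next_buffer(HeapScanDesc hscan, StreamingRead *stream)
{
	/* Pin the next sampled block; the stream's callback drives the sampler. */
	hscan->rs_cbuf = streaming_read_buffer_next(stream, NULL);
	if (!BufferIsValid(hscan->rs_cbuf))
		return false;			/* sampler exhausted */

	hscan->rs_cblock = BufferGetBlockNumber(hscan->rs_cbuf);
	hscan->rs_cindex = FirstOffsetNumber;
	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
	return true;
}

heapam_scan_analyze_next_tuple() would then call this whenever it runs off
the end of the current page, and acquire_sample_rows() would never have to
deal with block numbers at all.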

-	while (BlockSampler_HasMore(&bs))
+	while (nblocks)
{
bool		block_accepted;
-		BlockNumber targblock = BlockSampler_Next(&bs);
-#ifdef USE_PREFETCH
-		BlockNumber prefetch_targblock = InvalidBlockNumber;
-
-		/*
-		 * Make sure that every time the main BlockSampler is moved forward
-		 * that our prefetch BlockSampler also gets moved forward, so that we
-		 * always stay out ahead.
-		 */
-		if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
-			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
-#endif

vacuum_delay_point();

-		block_accepted = table_scan_analyze_next_block(scan, targblock, vac_strategy);
+		block_accepted = table_scan_analyze_next_block(scan, stream);

- Melanie

#5Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Melanie Plageman (#4)
2 attachment(s)
Re: Use streaming read API in ANALYZE

Hi,

Thanks for the review!

On Wed, 27 Mar 2024 at 23:15, Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Tue, Mar 26, 2024 at 02:51:27PM +0300, Nazir Bilal Yavuz wrote:

Hi,

On Wed, 28 Feb 2024 at 14:42, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

The new version of the streaming read API [1] is posted. I updated the
streaming read API changes patch (0001); the 'use streaming read API in
ANALYZE' patch (0002) remains the same. This should make it easier to
review as it can be applied on top of master.

The new version of the streaming read API is posted [1]. I rebased the
patch on top of master and v9 of the streaming read API.

There is a minimal change in the 'use streaming read API in ANALYZE'
patch (0002): I changed STREAMING_READ_FULL to STREAMING_READ_MAINTENANCE
to keep exactly the same behavior as before. Also, some benchmarking
results:

I created a 22 GB table and set the size of shared buffers to 30GB; the
rest is default.

╔═══════════════════════════╦═════════════════════╦════════════╗
║                           ║  Avg Timings in ms  ║            ║
╠═══════════════════════════╬══════════╦══════════╬════════════╣
║                           ║  master  ║ patched  ║ percentage ║
╠═══════════════════════════╬══════════╬══════════╬════════════╣
║ Both OS cache and         ║          ║          ║            ║
║ shared buffers are clear  ║ 513.9247 ║ 463.1019 ║ %9.9       ║
╠═══════════════════════════╬══════════╬══════════╬════════════╣
║ OS cache is loaded but    ║          ║          ║            ║
║ shared buffers are clear  ║ 423.1097 ║ 354.3277 ║ %16.3      ║
╠═══════════════════════════╬══════════╬══════════╬════════════╣
║ Shared buffers are loaded ║          ║          ║            ║
║                           ║ 89.2846  ║ 84.6952  ║ %5.1       ║
╚═══════════════════════════╩══════════╩══════════╩════════════╝

Any kind of feedback would be appreciated.

Thanks for working on this!

A general review comment: I noticed you have the old streaming read
(pgsr) naming still in a few places (including comments) -- so I would
just make sure and update everywhere when you rebase in Thomas' latest
version of the read stream API.

Done.

From c7500cc1b9068ff0b704181442999cd8bed58658 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Mon, 19 Feb 2024 14:30:47 +0300
Subject: [PATCH v3 2/2] Use streaming read API in ANALYZE

--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1102,6 +1102,26 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
return stats;
}
+/*
+ * Prefetch callback function to get next block number while using
+ * BlockSampling algorithm
+ */
+static BlockNumber
+pg_block_sampling_streaming_read_next(StreamingRead *stream,
+                                                                       void *user_data,
+                                                                       void *per_buffer_data)

I don't think you need the pg_ prefix

Done.

+{
+     BlockSamplerData *bs = user_data;
+     BlockNumber *current_block = per_buffer_data;

Why can't you just do BufferGetBlockNumber() on the buffer returned from
the read stream API instead of allocating per_buffer_data for the block
number?

Done.

+
+     if (BlockSampler_HasMore(bs))
+             *current_block = BlockSampler_Next(bs);
+     else
+             *current_block = InvalidBlockNumber;
+
+     return *current_block;

I think we'd like to keep the read stream code in heapam-specific code.
Instead of doing streaming_read_buffer_begin() here, you could put this
in heap_beginscan() or initscan() guarded by
scan->rs_base.rs_flags & SO_TYPE_ANALYZE

In the recent changes [1], heapam_scan_analyze_next_[block | tuple]
were removed from tableam. They are directly called from
heapam-specific code now. So, IMO, no need to do this now.

v4 is rebased on top of v14 streaming read API changes.

[1]: 27bc1772fc814946918a5ac8ccb9b5c5ad0380aa

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v4-0001-v14-Streaming-Read-API.patchtext/x-patch; charset=US-ASCII; name=v4-0001-v14-Streaming-Read-API.patchDownload
From 725a3d875fb1d675f5d99d8602a87b5e47b765db Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v4 1/2] v14 Streaming Read API

Part 1:

Provide vectored variant of ReadBuffer().

Break ReadBuffer() up into two steps: StartReadBuffers() and
WaitReadBuffers().  This has two advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the kernel.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.  For now we still only read a
block at a time so there is no change to generated system calls yet, but
later commits will provide infrastructure to help build up larger calls.

Callers should respect the new GUC io_combine_limit, and the limit on
per-backend pins which is now exposed as a public interface.

With some more infrastructure in later work, StartReadBuffers() could
be extended to start real asynchronous I/O instead of advice.

Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com

Part 2:

Provide API for streaming relation data.

Introduce an abstraction where relation data can be accessed as a
stream of buffers, with an implementation that is more efficient than
the equivalent sequence of ReadBuffer() calls.

Client code supplies a callback that can say which block number is
wanted next, and then consumes individual buffers one at a time from the
stream.  This division allows read_stream.c to build up large calls to
StartReadBuffers() up to io_combine_limit, and issue posix_fadvise()
advice ahead of time in a systematic way when random access is detected.

This API is based on an idea from Andres Freund to pave the way for
asynchronous I/O in future work as required to support direct I/O.  The
goal is to have an abstraction that insulates client code from future
changes to the I/O subsystem that would benefit from information about
future needs.

An extended API may be necessary in future for more complicated cases
(for example recovery, whose LsnReadQueue device in xlogprefetcher.c is
a distant cousin of this code and should eventually be replaced by
this), but this basic API is sufficient for many common usage patterns
involving predictable access to a single relation fork.

Discussion: https://postgr.es/m/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
---
 src/include/storage/bufmgr.h                  |  41 +-
 src/include/storage/read_stream.h             |  63 ++
 src/backend/storage/Makefile                  |   2 +-
 src/backend/storage/aio/Makefile              |  14 +
 src/backend/storage/aio/meson.build           |   5 +
 src/backend/storage/aio/read_stream.c         | 819 ++++++++++++++++++
 src/backend/storage/buffer/bufmgr.c           | 776 +++++++++++------
 src/backend/storage/buffer/localbuf.c         |  14 +-
 src/backend/storage/meson.build               |   1 +
 src/backend/utils/misc/guc_tables.c           |  14 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 doc/src/sgml/config.sgml                      |  14 +
 src/tools/pgindent/typedefs.list              |   3 +
 13 files changed, 1509 insertions(+), 258 deletions(-)
 create mode 100644 src/include/storage/read_stream.h
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/read_stream.c

diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..241f68c45e1 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -133,6 +134,10 @@ extern PGDLLIMPORT bool track_io_timing;
 extern PGDLLIMPORT int effective_io_concurrency;
 extern PGDLLIMPORT int maintenance_io_concurrency;
 
+#define MAX_IO_COMBINE_LIMIT PG_IOV_MAX
+#define DEFAULT_IO_COMBINE_LIMIT Min(MAX_IO_COMBINE_LIMIT, (128 * 1024) / BLCKSZ)
+extern PGDLLIMPORT int io_combine_limit;
+
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
 extern PGDLLIMPORT int bgwriter_flush_after;
@@ -158,7 +163,6 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
-
 /*
  * prototypes for functions in bufmgr.c
  */
@@ -177,6 +181,38 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 										ForkNumber forkNum, BlockNumber blockNum,
 										ReadBufferMode mode, BufferAccessStrategy strategy,
 										bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+struct ReadBuffersOperation
+{
+	/* The following members should be set by the caller. */
+	BufferManagerRelation bmr;
+	ForkNumber	forknum;
+	BufferAccessStrategy strategy;
+
+	/* The following private members should not be accessed directly. */
+	Buffer	   *buffers;
+	BlockNumber blocknum;
+	int			flags;
+	int16		nblocks;
+	int16		io_buffers_len;
+};
+
+typedef struct ReadBuffersOperation ReadBuffersOperation;
+
+extern bool StartReadBuffer(ReadBuffersOperation *operation,
+							Buffer *buffer,
+							BlockNumber blocknum,
+							int flags);
+extern bool StartReadBuffers(ReadBuffersOperation *operation,
+							 Buffer *buffers,
+							 BlockNumber blocknum,
+							 int *nblocks,
+							 int flags);
+extern void WaitReadBuffers(ReadBuffersOperation *operation);
+
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern bool BufferIsExclusiveLocked(Buffer buffer);
@@ -250,6 +286,9 @@ extern bool HoldingBufferPinThatDelaysRecovery(void);
 
 extern bool BgBufferSync(struct WritebackContext *wb_context);
 
+extern void LimitAdditionalPins(uint32 *additional_pins);
+extern void LimitAdditionalLocalPins(uint32 *additional_pins);
+
 /* in buf_init.c */
 extern void InitBufferPool(void);
 extern Size BufferShmemSize(void);
diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
new file mode 100644
index 00000000000..f5dbc087b0b
--- /dev/null
+++ b/src/include/storage/read_stream.h
@@ -0,0 +1,63 @@
+/*-------------------------------------------------------------------------
+ *
+ * read_stream.h
+ *	  Mechanism for accessing buffered relation data with look-ahead
+ *
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/read_stream.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef READ_STREAM_H
+#define READ_STREAM_H
+
+#include "storage/bufmgr.h"
+
+/* Default tuning, reasonable for many users. */
+#define READ_STREAM_DEFAULT 0x00
+
+/*
+ * I/O streams that are performing maintenance work on behalf of potentially
+ * many users, and thus should be governed by maintenance_io_concurrency
+ * instead of effective_io_concurrency.  For example, VACUUM or CREATE INDEX.
+ */
+#define READ_STREAM_MAINTENANCE 0x01
+
+/*
+ * We usually avoid issuing prefetch advice automatically when sequential
+ * access is detected, but this flag explicitly disables it, for cases that
+ * might not be correctly detected.  Explicit advice is known to perform worse
+ * than letting the kernel (at least Linux) detect sequential access.
+ */
+#define READ_STREAM_SEQUENTIAL 0x02
+
+/*
+ * We usually ramp up from smaller reads to larger ones, to support users who
+ * don't know if it's worth reading lots of buffers yet.  This flag disables
+ * that, declaring ahead of time that we'll be reading all available buffers.
+ */
+#define READ_STREAM_FULL 0x04
+
+struct ReadStream;
+typedef struct ReadStream ReadStream;
+
+/* Callback that returns the next block number to read. */
+typedef BlockNumber (*ReadStreamBlockNumberCB) (ReadStream *stream,
+												void *callback_private_data,
+												void *per_buffer_data);
+
+extern ReadStream *read_stream_begin_relation(int flags,
+											  BufferAccessStrategy strategy,
+											  BufferManagerRelation bmr,
+											  ForkNumber forknum,
+											  ReadStreamBlockNumberCB callback,
+											  void *callback_private_data,
+											  size_t per_buffer_data_size);
+extern Buffer read_stream_next_buffer(ReadStream *stream, void **per_buffer_private);
+extern void read_stream_reset(ReadStream *stream);
+extern void read_stream_end(ReadStream *stream);
+
+#endif							/* READ_STREAM_H */
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index 8376cdfca20..eec03f6f2b4 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS     = buffer file freespace ipc large_object lmgr page smgr sync
+SUBDIRS     = aio buffer file freespace ipc large_object lmgr page smgr sync
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/Makefile b/src/backend/storage/aio/Makefile
new file mode 100644
index 00000000000..2f29a9ec4d1
--- /dev/null
+++ b/src/backend/storage/aio/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for storage/aio
+#
+# src/backend/storage/aio/Makefile
+#
+
+subdir = src/backend/storage/aio
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	read_stream.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/aio/meson.build b/src/backend/storage/aio/meson.build
new file mode 100644
index 00000000000..10e1aa3b20b
--- /dev/null
+++ b/src/backend/storage/aio/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+backend_sources += files(
+  'read_stream.c',
+)
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
new file mode 100644
index 00000000000..f7e9dc1138b
--- /dev/null
+++ b/src/backend/storage/aio/read_stream.c
@@ -0,0 +1,819 @@
+/*-------------------------------------------------------------------------
+ *
+ * read_stream.c
+ *	  Mechanism for accessing buffered relation data with look-ahead
+ *
+ * Code that needs to access relation data typically pins blocks one at a
+ * time, often in a predictable order that might be sequential or data-driven.
+ * Calling the simple ReadBuffer() function for each block is inefficient,
+ * because blocks that are not yet in the buffer pool require I/O operations
+ * that are small and might stall waiting for storage.  This mechanism looks
+ * into the future and calls StartReadBuffers() and WaitReadBuffers() to read
+ * neighboring blocks together and ahead of time, with an adaptive look-ahead
+ * distance.
+ *
+ * A user-provided callback generates a stream of block numbers that is used
+ * to form reads of up to io_combine_limit, by attempting to merge them with a
+ * pending read.  When that isn't possible, the existing pending read is sent
+ * to StartReadBuffers() so that a new one can begin to form.
+ *
+ * The algorithm for controlling the look-ahead distance tries to classify the
+ * stream into three ideal behaviors:
+ *
+ * A) No I/O is necessary, because the requested blocks are fully cached
+ * already.  There is no benefit to looking ahead more than one block, so
+ * distance is 1.  This is the default initial assumption.
+ *
+ * B) I/O is necessary, but fadvise is undesirable because the access is
+ * sequential, or impossible because direct I/O is enabled or the system
+ * doesn't support advice.  There is no benefit in looking ahead more than
+ * io_combine_limit, because in this case the only goal is larger read system
+ * calls.  Looking further ahead would pin many buffers and perform
+ * speculative work looking ahead for no benefit.
+ *
+ * C) I/O is necessary, it appears random, and this system supports fadvise.
+ * We'll look further ahead in order to reach the configured level of I/O
+ * concurrency.
+ *
+ * The distance increases rapidly and decays slowly, so that it moves towards
+ * those levels as different I/O patterns are discovered.  For example, a
+ * sequential scan of fully cached data doesn't bother looking ahead, but a
+ * sequential scan that hits a region of uncached blocks will start issuing
+ * increasingly wide read calls until it plateaus at io_combine_limit.
+ *
+ * The main data structure is a circular queue of buffers of size
+ * max_pinned_buffers plus some extra space for technical reasons, ready to be
+ * returned by read_stream_next_buffer().  Each buffer also has an optional
+ * variable sized object that is passed from the callback to the consumer of
+ * buffers.
+ *
+ * Parallel to the queue of buffers, there is a circular queue of in-progress
+ * I/Os that have been started with StartReadBuffers(), and for which
+ * WaitReadBuffers() must be called before returning the buffer.
+ *
+ * For example, if the callback returns block numbers 10, 42, 43, 60 in
+ * successive calls, then these data structures might appear as follows:
+ *
+ *                          buffers buf/data       ios
+ *
+ *                          +----+  +-----+       +--------+
+ *                          |    |  |     |  +----+ 42..44 | <- oldest_io_index
+ *                          +----+  +-----+  |    +--------+
+ *   oldest_buffer_index -> | 10 |  |  ?  |  | +--+ 60..60 |
+ *                          +----+  +-----+  | |  +--------+
+ *                          | 42 |  |  ?  |<-+ |  |        | <- next_io_index
+ *                          +----+  +-----+    |  +--------+
+ *                          | 43 |  |  ?  |    |  |        |
+ *                          +----+  +-----+    |  +--------+
+ *                          | 44 |  |  ?  |    |  |        |
+ *                          +----+  +-----+    |  +--------+
+ *                          | 60 |  |  ?  |<---+
+ *                          +----+  +-----+
+ *     next_buffer_index -> |    |  |     |
+ *                          +----+  +-----+
+ *
+ * In the example, 5 buffers are pinned, and the next buffer to be streamed to
+ * the client is block 10.  Block 10 was a hit and has no associated I/O, but
+ * the range 42..44 requires an I/O wait before its buffers are returned, as
+ * does block 60.
+ *
+ *
+ * Portions Copyright (c) 2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/aio/read_stream.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "catalog/pg_tablespace.h"
+#include "miscadmin.h"
+#include "storage/fd.h"
+#include "storage/smgr.h"
+#include "storage/read_stream.h"
+#include "utils/memdebug.h"
+#include "utils/rel.h"
+#include "utils/spccache.h"
+
+typedef struct InProgressIO
+{
+	int16		buffer_index;
+	ReadBuffersOperation op;
+} InProgressIO;
+
+/*
+ * State for managing a stream of reads.
+ */
+struct ReadStream
+{
+	int16		max_ios;
+	int16		ios_in_progress;
+	int16		queue_size;
+	int16		max_pinned_buffers;
+	int16		pinned_buffers;
+	int16		distance;
+	bool		advice_enabled;
+
+	/*
+	 * Small buffer of block numbers, useful for 'ungetting' to resolve flow
+	 * control problems when I/Os are split.  Also useful for batch-loading
+	 * block numbers in the fast path.
+	 */
+	BlockNumber blocknums[16];
+	int16		blocknums_count;
+	int16		blocknums_next;
+
+	/*
+	 * The callback that will tell us which block numbers to read, and an
+	 * opaque pointer that will be passed to it for its own purposes.
+	 */
+	ReadStreamBlockNumberCB callback;
+	void	   *callback_private_data;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* The read operation we are currently preparing. */
+	BlockNumber pending_read_blocknum;
+	int16		pending_read_nblocks;
+
+	/* Space for buffers and optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+
+	/* Read operations that have been started but not waited for yet. */
+	InProgressIO *ios;
+	int16		oldest_io_index;
+	int16		next_io_index;
+
+	bool		fast_path;
+
+	/* Circular queue of buffers. */
+	int16		oldest_buffer_index;	/* Next pinned buffer to return */
+	int16		next_buffer_index;	/* Index of next buffer to pin */
+	Buffer		buffers[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/*
+ * Return a pointer to the per-buffer data by index.
+ */
+static inline void *
+get_per_buffer_data(ReadStream *stream, int16 buffer_index)
+{
+	return (char *) stream->per_buffer_data +
+		stream->per_buffer_data_size * buffer_index;
+}
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow read_stream_unget_block() to work and to allow the
+ * fast path to work in batches.
+ */
+static inline BlockNumber
+read_stream_get_block(ReadStream *stream, void *per_buffer_data)
+{
+	if (stream->blocknums_next < stream->blocknums_count)
+		return stream->blocknums[stream->blocknums_next++];
+
+	/*
+	 * We only bother to fetch one at a time here (but see the fast path which
+	 * uses more).
+	 */
+	return stream->callback(stream,
+							stream->callback_private_data,
+							per_buffer_data);
+}
+
+/*
+ * In order to deal with short reads in StartReadBuffers(), we sometimes need
+ * to defer handling of a block until later.
+ */
+static inline void
+read_stream_unget_block(ReadStream *stream, BlockNumber blocknum)
+{
+	if (stream->blocknums_next == stream->blocknums_count)
+	{
+		/* Never initialized or entirely consumed.  Re-initialize. */
+		stream->blocknums[0] = blocknum;
+		stream->blocknums_count = 1;
+		stream->blocknums_next = 0;
+	}
+	else
+	{
+		/* Must be the last value returned from the blocknums array. */
+		Assert(stream->blocknums_next > 0);
+		stream->blocknums_next--;
+		Assert(stream->blocknums[stream->blocknums_next] == blocknum);
+	}
+}
+
+#ifndef READ_STREAM_DISABLE_FAST_PATH
+static void
+read_stream_fill_blocknums(ReadStream *stream)
+{
+	BlockNumber blocknum;
+	int			i = 0;
+
+	do
+	{
+		blocknum = stream->callback(stream,
+									stream->callback_private_data,
+									NULL);
+		stream->blocknums[i++] = blocknum;
+	} while (i < lengthof(stream->blocknums) &&
+			 blocknum != InvalidBlockNumber);
+	stream->blocknums_count = i;
+	stream->blocknums_next = 0;
+}
+#endif
+
+static void
+read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
+{
+	bool		need_wait;
+	int			nblocks;
+	int			flags;
+	int16		io_index;
+	int16		overflow;
+	int16		buffer_index;
+
+	/* This should only be called with a pending read. */
+	Assert(stream->pending_read_nblocks > 0);
+	Assert(stream->pending_read_nblocks <= io_combine_limit);
+
+	/* We had better not exceed the pin limit by starting this read. */
+	Assert(stream->pinned_buffers + stream->pending_read_nblocks <=
+		   stream->max_pinned_buffers);
+
+	/* We had better not be overwriting an existing pinned buffer. */
+	if (stream->pinned_buffers > 0)
+		Assert(stream->next_buffer_index != stream->oldest_buffer_index);
+	else
+		Assert(stream->next_buffer_index == stream->oldest_buffer_index);
+
+	/*
+	 * If advice hasn't been suppressed, this system supports it, and this
+	 * isn't a strictly sequential pattern, then we'll issue advice.
+	 */
+	if (!suppress_advice &&
+		stream->advice_enabled &&
+		stream->pending_read_blocknum != stream->seq_blocknum)
+		flags = READ_BUFFERS_ISSUE_ADVICE;
+	else
+		flags = 0;
+
+	/*
+	 * We say how many blocks we want to read, but the number may be smaller
+	 * on return.
+	 */
+	buffer_index = stream->next_buffer_index;
+	io_index = stream->next_io_index;
+	nblocks = stream->pending_read_nblocks;
+	need_wait = StartReadBuffers(&stream->ios[io_index].op,
+								 &stream->buffers[buffer_index],
+								 stream->pending_read_blocknum,
+								 &nblocks,
+								 flags);
+	stream->pinned_buffers += nblocks;
+
+	/* Remember whether we need to wait before returning this buffer. */
+	if (!need_wait)
+	{
+		/* Look-ahead distance decays, no I/O necessary (behavior A). */
+		if (stream->distance > 1)
+			stream->distance--;
+	}
+	else
+	{
+		/*
+		 * Remember to call WaitReadBuffers() before returning head buffer.
+		 * Look-ahead distance will be adjusted after waiting.
+		 */
+		stream->ios[io_index].buffer_index = buffer_index;
+		if (++stream->next_io_index == stream->max_ios)
+			stream->next_io_index = 0;
+		Assert(stream->ios_in_progress < stream->max_ios);
+		stream->ios_in_progress++;
+		stream->seq_blocknum = stream->pending_read_blocknum + nblocks;
+	}
+
+	/*
+	 * We gave a contiguous range of buffer space to StartReadBuffers(), but
+	 * we want it to wrap around at queue_size.  Slide overflowing buffers to
+	 * the front of the array.
+	 */
+	overflow = (buffer_index + nblocks) - stream->queue_size;
+	if (overflow > 0)
+		memmove(&stream->buffers[0],
+				&stream->buffers[stream->queue_size],
+				sizeof(stream->buffers[0]) * overflow);
+
+	/* Compute location of start of next read, without using % operator. */
+	buffer_index += nblocks;
+	if (buffer_index >= stream->queue_size)
+		buffer_index -= stream->queue_size;
+	Assert(buffer_index >= 0 && buffer_index < stream->queue_size);
+	stream->next_buffer_index = buffer_index;
+
+	/* Adjust the pending read to cover the remaining portion, if any. */
+	stream->pending_read_blocknum += nblocks;
+	stream->pending_read_nblocks -= nblocks;
+}
+
+static void
+read_stream_look_ahead(ReadStream *stream, bool suppress_advice)
+{
+	while (stream->ios_in_progress < stream->max_ios &&
+		   stream->pinned_buffers + stream->pending_read_nblocks < stream->distance)
+	{
+		BlockNumber blocknum;
+		int16		buffer_index;
+		void	   *per_buffer_data;
+
+		if (stream->pending_read_nblocks == io_combine_limit)
+		{
+			read_stream_start_pending_read(stream, suppress_advice);
+			suppress_advice = false;
+			continue;
+		}
+
+		/*
+		 * See which block the callback wants next in the stream.  We need to
+		 * compute the index of the Nth block of the pending read including
+		 * wrap-around, but we don't want to use the expensive % operator.
+		 */
+		buffer_index = stream->next_buffer_index + stream->pending_read_nblocks;
+		if (buffer_index >= stream->queue_size)
+			buffer_index -= stream->queue_size;
+		Assert(buffer_index >= 0 && buffer_index < stream->queue_size);
+		per_buffer_data = get_per_buffer_data(stream, buffer_index);
+		blocknum = read_stream_get_block(stream, per_buffer_data);
+		if (blocknum == InvalidBlockNumber)
+		{
+			/* End of stream. */
+			stream->distance = 0;
+			break;
+		}
+
+		/* Can we merge it with the pending read? */
+		if (stream->pending_read_nblocks > 0 &&
+			stream->pending_read_blocknum + stream->pending_read_nblocks == blocknum)
+		{
+			stream->pending_read_nblocks++;
+			continue;
+		}
+
+		/* We have to start the pending read before we can build another. */
+		if (stream->pending_read_nblocks > 0)
+		{
+			read_stream_start_pending_read(stream, suppress_advice);
+			suppress_advice = false;
+			if (stream->ios_in_progress == stream->max_ios)
+			{
+				/* And we've hit the limit.  Rewind, and stop here. */
+				read_stream_unget_block(stream, blocknum);
+				return;
+			}
+		}
+
+		/* This is the start of a new pending read. */
+		stream->pending_read_blocknum = blocknum;
+		stream->pending_read_nblocks = 1;
+	}
+
+	/*
+	 * We don't start the pending read just because we've hit the distance
+	 * limit, preferring to give it another chance to grow to full
+	 * io_combine_limit size once more buffers have been consumed.  However,
+	 * if we've already reached io_combine_limit, or we've reached the
+	 * distance limit and there isn't anything pinned yet, or the callback has
+	 * signaled end-of-stream, we start the read immediately.
+	 */
+	if (stream->pending_read_nblocks > 0 &&
+		(stream->pending_read_nblocks == io_combine_limit ||
+		 (stream->pending_read_nblocks == stream->distance &&
+		  stream->pinned_buffers == 0) ||
+		 stream->distance == 0) &&
+		stream->ios_in_progress < stream->max_ios)
+		read_stream_start_pending_read(stream, suppress_advice);
+}
+
+/*
+ * Create a new read stream object that can be used to perform the equivalent
+ * of a series of ReadBuffer() calls for one fork of one relation.
+ * Internally, it generates larger vectored reads where possible by looking
+ * ahead.  The callback should return block numbers or InvalidBlockNumber to
+ * signal end-of-stream, and if per_buffer_data_size is non-zero, it may also
+ * write extra data for each block into the space provided to it.  It will
+ * also receive callback_private_data for its own purposes.
+ */
+ReadStream *
+read_stream_begin_relation(int flags,
+						   BufferAccessStrategy strategy,
+						   BufferManagerRelation bmr,
+						   ForkNumber forknum,
+						   ReadStreamBlockNumberCB callback,
+						   void *callback_private_data,
+						   size_t per_buffer_data_size)
+{
+	ReadStream *stream;
+	size_t		size;
+	int16		queue_size;
+	int16		max_ios;
+	uint32		max_pinned_buffers;
+	Oid			tablespace_id;
+
+	/* Make sure our bmr's smgr and relpersistence are populated. */
+	if (bmr.smgr == NULL)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+
+	/*
+	 * Decide how many I/Os we will allow to run at the same time.  That
+	 * currently means issuing advice to the kernel about blocks we will soon
+	 * read.
+	 * This number also affects how far we look ahead for opportunities to
+	 * start more I/Os.
+	 */
+	tablespace_id = bmr.smgr->smgr_rlocator.locator.spcOid;
+	if (!OidIsValid(MyDatabaseId) ||
+		(bmr.rel && IsCatalogRelation(bmr.rel)) ||
+		IsCatalogRelationOid(bmr.smgr->smgr_rlocator.locator.relNumber))
+	{
+		/*
+		 * Avoid circularity while trying to look up tablespace settings or
+		 * before spccache.c is ready.
+		 */
+		max_ios = effective_io_concurrency;
+	}
+	else if (flags & READ_STREAM_MAINTENANCE)
+		max_ios = get_tablespace_maintenance_io_concurrency(tablespace_id);
+	else
+		max_ios = get_tablespace_io_concurrency(tablespace_id);
+	max_ios = Min(max_ios, PG_INT16_MAX);
+
+	/*
+	 * Choose the maximum number of buffers we're prepared to pin.  We try to
+	 * pin fewer if we can, though.  We clamp it to at least io_combine_limit
+	 * so that we can have a chance to build up a full io_combine_limit sized
+	 * read, even when max_ios is zero.  Be careful not to allow int16 to
+	 * overflow (even though that's not possible with the current GUC range
+	 * limits), allowing also for the spare entry and the overflow space.
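+	 * For example, with effective_io_concurrency = 10 this allows up to 40
+	 * pinned buffers, before the per-backend pin limits below are applied.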
+	 */
+	max_pinned_buffers = Max(max_ios * 4, io_combine_limit);
+	max_pinned_buffers = Min(max_pinned_buffers,
+							 PG_INT16_MAX - io_combine_limit - 1);
+
+	/* Don't allow this backend to pin more than its share of buffers. */
+	if (SmgrIsTemp(bmr.smgr))
+		LimitAdditionalLocalPins(&max_pinned_buffers);
+	else
+		LimitAdditionalPins(&max_pinned_buffers);
+	Assert(max_pinned_buffers > 0);
+
+	/*
+	 * We need one extra entry for buffers and per-buffer data, because users
+	 * of per-buffer data have access to the object until the next call to
+	 * read_stream_next_buffer(), so we need a gap between the head and tail
+	 * of the queue so that we don't clobber it.
+	 */
+	queue_size = max_pinned_buffers + 1;
+
+	/*
+	 * Allocate the object, the buffers, the ios and per_buffer_data space in
+	 * one big chunk.  Though we have queue_size buffers, we want to be able
+	 * to assume that all the buffers for a single read are contiguous (i.e.
+	 * don't wrap around halfway through), so we allow temporary overflows of
+	 * up to the maximum possible read size by allocating an extra
+	 * io_combine_limit - 1 elements.
+	 */
+	size = offsetof(ReadStream, buffers);
+	size += sizeof(Buffer) * (queue_size + io_combine_limit - 1);
+	size += sizeof(InProgressIO) * Max(1, max_ios);
+	size += per_buffer_data_size * queue_size;
+	size += MAXIMUM_ALIGNOF * 2;
+	stream = (ReadStream *) palloc(size);
+	memset(stream, 0, offsetof(ReadStream, buffers));
+	stream->ios = (InProgressIO *)
+		MAXALIGN(&stream->buffers[queue_size + io_combine_limit - 1]);
+	if (per_buffer_data_size > 0)
+		stream->per_buffer_data = (void *)
+			MAXALIGN(&stream->ios[Max(1, max_ios)]);
+
+#ifdef USE_PREFETCH
+
+	/*
+	 * This system supports prefetching advice.  We can use it as long as
+	 * direct I/O isn't enabled, the caller hasn't promised sequential access
+	 * (overriding our detection heuristics), and max_ios hasn't been set to
+	 * zero.
+	 */
+	if ((io_direct_flags & IO_DIRECT_DATA) == 0 &&
+		(flags & READ_STREAM_SEQUENTIAL) == 0 &&
+		max_ios > 0)
+		stream->advice_enabled = true;
+#endif
+
+	/*
+	 * For now, max_ios = 0 is interpreted as max_ios = 1 with advice disabled
+	 * above.  If we had real asynchronous I/O we might need a slightly
+	 * different definition.
+	 */
+	if (max_ios == 0)
+		max_ios = 1;
+
+	stream->max_ios = max_ios;
+	stream->per_buffer_data_size = per_buffer_data_size;
+	stream->max_pinned_buffers = max_pinned_buffers;
+	stream->queue_size = queue_size;
+	stream->callback = callback;
+	stream->callback_private_data = callback_private_data;
+
+	/*
+	 * Skip the initial ramp-up phase if the caller says we're going to be
+	 * reading the whole relation.  This way we start out assuming we'll be
+	 * doing full io_combine_limit sized reads (behavior B).
+	 */
+	if (flags & READ_STREAM_FULL)
+		stream->distance = Min(max_pinned_buffers, io_combine_limit);
+	else
+		stream->distance = 1;
+
+	/*
+	 * Since we currently always access the same relation, we can
+	 * initialize parts of the ReadBuffersOperation objects and leave them
+	 * that way, to avoid wasting CPU cycles writing to them for each read.
+	 */
+	if (!bmr.smgr)
+	{
+		bmr.smgr = RelationGetSmgr(bmr.rel);
+		bmr.relpersistence = bmr.rel->rd_rel->relpersistence;
+	}
+	for (int i = 0; i < max_ios; ++i)
+	{
+		stream->ios[i].op.bmr = bmr;
+		stream->ios[i].op.forknum = forknum;
+		stream->ios[i].op.strategy = strategy;
+	}
+
+	return stream;
+}
+
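+/*
+ * A minimal usage sketch (the callback and state names here are
+ * hypothetical, not part of the API): the callback returns block numbers
+ * and finally InvalidBlockNumber, and the consumer releases each pinned
+ * buffer.
+ *
+ *     stream = read_stream_begin_relation(READ_STREAM_FULL, NULL,
+ *                                         BMR_REL(rel), MAIN_FORKNUM,
+ *                                         my_next_block_cb, my_state, 0);
+ *     while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
+ *     {
+ *         ... examine the page while holding the pin ...
+ *         ReleaseBuffer(buf);
+ *     }
+ *     read_stream_end(stream);
+ */
+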
+/*
+ * Pull one pinned buffer out of a stream.  Each call returns successive
+ * blocks in the order specified by the callback.  If per_buffer_data_size was
+ * set to a non-zero size, *per_buffer_data receives a pointer to the extra
+ * per-buffer data that the callback had a chance to populate, which remains
+ * valid until the next call to read_stream_next_buffer().  When the stream
+ * runs out of data, InvalidBuffer is returned.  The caller may decide to end
+ * the stream early at any time by calling read_stream_end().
+ */
+Buffer
+read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
+{
+	Buffer		buffer;
+	int16		oldest_buffer_index;
+
+#ifndef READ_STREAM_DISABLE_FAST_PATH
+
+	/*
+	 * A fast path for all-cached scans (behavior A).  This is the same as the
+	 * usual algorithm, but it is specialized for no I/O and no per-buffer
+	 * data, so we can skip the queue management code, stay in the same buffer
+	 * slot and use singular StartReadBuffer().
+	 */
+	if (likely(stream->fast_path))
+	{
+		BlockNumber next_blocknum;
+		bool		need_wait;
+
+		/* Fast path assumptions. */
+		Assert(stream->ios_in_progress == 0);
+		Assert(stream->pinned_buffers == 1);
+		Assert(stream->distance == 1);
+		Assert(stream->pending_read_nblocks == 1);
+		Assert(stream->per_buffer_data_size == 0);
+
+		/* We're going to return the buffer we pinned last time. */
+		oldest_buffer_index = stream->oldest_buffer_index;
+		Assert((oldest_buffer_index + 1) % stream->queue_size ==
+			   stream->next_buffer_index);
+		buffer = stream->buffers[oldest_buffer_index];
+		Assert(buffer != InvalidBuffer);
+
+		/*
+		 * Pin a buffer for the next call.  Same buffer entry, and arbitrary
+		 * I/O entry (they're all free).
+		 */
+		need_wait = StartReadBuffer(&stream->ios[0].op,
+									&stream->buffers[oldest_buffer_index],
+									stream->pending_read_blocknum,
+									stream->advice_enabled ?
+									READ_BUFFERS_ISSUE_ADVICE : 0);
+
+		/* Choose the block the next call will pin. */
+		if (unlikely(stream->blocknums_next == stream->blocknums_count))
+			read_stream_fill_blocknums(stream);
+		next_blocknum = stream->blocknums[stream->blocknums_next++];
+
+		/*
+		 * Fast return if the next call doesn't require I/O for the buffer we
+		 * just pinned, and we have a block number to give it as a pending
+		 * read.
+		 */
+		if (likely(!need_wait && next_blocknum != InvalidBlockNumber))
+		{
+			stream->pending_read_blocknum = next_blocknum;
+			return buffer;
+		}
+
+		/*
+		 * For anything more complex, set up some more state and take the slow
+		 * path next time.
+		 */
+		stream->fast_path = false;
+
+		if (need_wait)
+		{
+			/* Next call must wait for I/O for the newly pinned buffer. */
+			stream->oldest_io_index = 0;
+			stream->next_io_index = stream->max_ios > 1 ? 1 : 0;
+			stream->ios_in_progress = 1;
+			stream->ios[0].buffer_index = oldest_buffer_index;
+			stream->seq_blocknum = next_blocknum + 1;
+		}
+		if (next_blocknum == InvalidBlockNumber)
+		{
+			/* Next call hits end of stream and can't pin anything more. */
+			stream->distance = 0;
+			stream->pending_read_nblocks = 0;
+		}
+		else
+		{
+			/* Set up the pending read. */
+			stream->pending_read_blocknum = next_blocknum;
+		}
+		return buffer;
+	}
+#endif
+
+	if (unlikely(stream->pinned_buffers == 0))
+	{
+		Assert(stream->oldest_buffer_index == stream->next_buffer_index);
+
+		/* End of stream reached?  */
+		if (stream->distance == 0)
+			return InvalidBuffer;
+
+		/*
+		 * The usual order of operations is that we look ahead at the bottom
+		 * of this function after potentially finishing an I/O and making
+		 * space for more, but if we're just starting up we'll need to crank
+		 * the handle to get started.
+		 */
+		read_stream_look_ahead(stream, true);
+
+		/* End of stream reached? */
+		if (stream->pinned_buffers == 0)
+		{
+			Assert(stream->distance == 0);
+			return InvalidBuffer;
+		}
+	}
+
+	/* Grab the oldest pinned buffer and associated per-buffer data. */
+	Assert(stream->pinned_buffers > 0);
+	oldest_buffer_index = stream->oldest_buffer_index;
+	Assert(oldest_buffer_index >= 0 &&
+		   oldest_buffer_index < stream->queue_size);
+	buffer = stream->buffers[oldest_buffer_index];
+	if (per_buffer_data)
+		*per_buffer_data = get_per_buffer_data(stream, oldest_buffer_index);
+
+	Assert(BufferIsValid(buffer));
+
+	/* Do we have to wait for an associated I/O first? */
+	if (stream->ios_in_progress > 0 &&
+		stream->ios[stream->oldest_io_index].buffer_index == oldest_buffer_index)
+	{
+		int16		io_index = stream->oldest_io_index;
+		int16		distance;
+
+		/* Sanity check that we still agree on the buffers. */
+		Assert(stream->ios[io_index].op.buffers ==
+			   &stream->buffers[oldest_buffer_index]);
+
+		WaitReadBuffers(&stream->ios[io_index].op);
+
+		Assert(stream->ios_in_progress > 0);
+		stream->ios_in_progress--;
+		if (++stream->oldest_io_index == stream->max_ios)
+			stream->oldest_io_index = 0;
+
+		if (stream->ios[io_index].op.flags & READ_BUFFERS_ISSUE_ADVICE)
+		{
+			/* Distance ramps up fast (behavior C). */
+			distance = stream->distance * 2;
+			distance = Min(distance, stream->max_pinned_buffers);
+			stream->distance = distance;
+		}
+		else
+		{
+			/* No advice; move towards io_combine_limit (behavior B). */
+			if (stream->distance > io_combine_limit)
+			{
+				stream->distance--;
+			}
+			else
+			{
+				distance = stream->distance * 2;
+				distance = Min(distance, io_combine_limit);
+				distance = Min(distance, stream->max_pinned_buffers);
+				stream->distance = distance;
+			}
+		}
+	}
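+
+	/*
+	 * To illustrate the adjustments above with the default io_combine_limit
+	 * of 128kB (16 blocks at the default 8kB block size): behavior C doubles
+	 * the distance on every advised read (1, 2, 4, ... up to
+	 * max_pinned_buffers), while behavior B doubles it only up to 16 and
+	 * decays by one per read beyond that.
+	 */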
+
+#ifdef CLOBBER_FREED_MEMORY
+	/* Clobber old buffer and per-buffer data for debugging purposes. */
+	stream->buffers[oldest_buffer_index] = InvalidBuffer;
+
+	/*
+	 * The caller will get access to the per-buffer data, until the next call.
+	 * We wipe the one before, which is never occupied because queue_size
+	 * allowed one extra element.  This will hopefully trip up client code
+	 * that is holding a dangling pointer to it.
+	 */
+	if (stream->per_buffer_data)
+		wipe_mem(get_per_buffer_data(stream,
+									 oldest_buffer_index == 0 ?
+									 stream->queue_size - 1 :
+									 oldest_buffer_index - 1),
+				 stream->per_buffer_data_size);
+#endif
+
+	/* Pin transferred to caller. */
+	Assert(stream->pinned_buffers > 0);
+	stream->pinned_buffers--;
+
+	/* Advance oldest buffer, with wrap-around. */
+	stream->oldest_buffer_index++;
+	if (stream->oldest_buffer_index == stream->queue_size)
+		stream->oldest_buffer_index = 0;
+
+	/* Prepare for the next call. */
+	read_stream_look_ahead(stream, false);
+
+#ifndef READ_STREAM_DISABLE_FAST_PATH
+	/* See if we can take the fast path for all-cached scans next time. */
+	if (stream->ios_in_progress == 0 &&
+		stream->pinned_buffers == 1 &&
+		stream->distance == 1 &&
+		stream->pending_read_nblocks == 1 &&
+		stream->per_buffer_data_size == 0)
+	{
+		stream->fast_path = true;
+	}
+	else
+	{
+		stream->fast_path = false;
+	}
+#endif
+
+	return buffer;
+}
+
+/*
+ * Reset a read stream by releasing any queued up buffers, allowing the stream
+ * to be used again for different blocks.  This can be used to clear an
+ * end-of-stream condition and start again, or to throw away blocks that were
+ * speculatively read and read some different blocks instead.
+ */
+void
+read_stream_reset(ReadStream *stream)
+{
+	Buffer		buffer;
+
+	/* Stop looking ahead. */
+	stream->distance = 0;
+
+	/* Unpin anything that wasn't consumed. */
+	while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
+		ReleaseBuffer(buffer);
+
+	Assert(stream->pinned_buffers == 0);
+	Assert(stream->ios_in_progress == 0);
+
+	/* Start off assuming data is cached. */
+	stream->distance = 1;
+}
+
+/*
+ * Release and free a read stream.
+ */
+void
+read_stream_end(ReadStream *stream)
+{
+	read_stream_reset(stream);
+	pfree(stream);
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c5..70b19238b78 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -19,6 +19,11 @@
  *		and pin it so that no one can destroy it while this process
  *		is using it.
  *
+ * StartReadBuffers() -- as above, but for multiple contiguous blocks in
+ *		two steps.
+ *
+ * WaitReadBuffers() -- second step of StartReadBuffers().
+ *
  * ReleaseBuffer() -- unpin a buffer
  *
  * MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
@@ -152,6 +157,13 @@ int			effective_io_concurrency = DEFAULT_EFFECTIVE_IO_CONCURRENCY;
  */
 int			maintenance_io_concurrency = DEFAULT_MAINTENANCE_IO_CONCURRENCY;
 
+/*
+ * Limit on how many blocks should be handled in single I/O operations.
+ * StartReadBuffers() callers should respect it, as should other operations
+ * that call smgr APIs directly.
+ */
+int			io_combine_limit = DEFAULT_IO_COMBINE_LIMIT;
+
 /*
  * GUC variables about triggering kernel writeback for buffers written; OS
  * dependent defaults are set via the GUC mechanism.
@@ -471,10 +483,9 @@ ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref)
 )
 
 
-static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
+static Buffer ReadBuffer_common(BufferManagerRelation *bmr,
 								ForkNumber forkNum, BlockNumber blockNum,
-								ReadBufferMode mode, BufferAccessStrategy strategy,
-								bool *hit);
+								ReadBufferMode mode, BufferAccessStrategy strategy);
 static BlockNumber ExtendBufferedRelCommon(BufferManagerRelation bmr,
 										   ForkNumber fork,
 										   BufferAccessStrategy strategy,
@@ -500,18 +511,18 @@ static uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
 						  WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
-static bool StartBufferIO(BufferDesc *buf, bool forInput);
+static bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
 static void TerminateBufferIO(BufferDesc *buf, bool clear_dirty,
 							  uint32 set_flag_bits, bool forget_owner);
 static void AbortBufferIO(Buffer buffer);
 static void shared_buffer_write_error_callback(void *arg);
 static void local_buffer_write_error_callback(void *arg);
-static BufferDesc *BufferAlloc(SMgrRelation smgr,
-							   char relpersistence,
-							   ForkNumber forkNum,
-							   BlockNumber blockNum,
-							   BufferAccessStrategy strategy,
-							   bool *foundPtr, IOContext io_context);
+static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
+									  char relpersistence,
+									  ForkNumber forkNum,
+									  BlockNumber blockNum,
+									  BufferAccessStrategy strategy,
+									  bool *foundPtr, IOContext io_context);
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
@@ -781,8 +792,8 @@ Buffer
 ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				   ReadBufferMode mode, BufferAccessStrategy strategy)
 {
-	bool		hit;
 	Buffer		buf;
+	BufferManagerRelation bmr;
 
 	/*
 	 * Reject attempts to read non-local temporary relations; we would be
@@ -794,15 +805,12 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot access temporary tables of other sessions")));
 
-	/*
-	 * Read the buffer, and update pgstat counters to reflect a cache hit or
-	 * miss.
-	 */
-	pgstat_count_buffer_read(reln);
-	buf = ReadBuffer_common(RelationGetSmgr(reln), reln->rd_rel->relpersistence,
-							forkNum, blockNum, mode, strategy, &hit);
-	if (hit)
-		pgstat_count_buffer_hit(reln);
+	bmr.rel = reln;
+	bmr.smgr = RelationGetSmgr(reln);
+	bmr.relpersistence = reln->rd_rel->relpersistence;
+	buf = ReadBuffer_common(&bmr,
+							forkNum, blockNum, mode, strategy);
+
 	return buf;
 }
 
@@ -822,13 +830,12 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
 						  BufferAccessStrategy strategy, bool permanent)
 {
-	bool		hit;
-
 	SMgrRelation smgr = smgropen(rlocator, INVALID_PROC_NUMBER);
 
-	return ReadBuffer_common(smgr, permanent ? RELPERSISTENCE_PERMANENT :
-							 RELPERSISTENCE_UNLOGGED, forkNum, blockNum,
-							 mode, strategy, &hit);
+	return ReadBuffer_common(&BMR_SMGR(smgr, permanent ? RELPERSISTENCE_PERMANENT :
+									   RELPERSISTENCE_UNLOGGED),
+							 forkNum, blockNum,
+							 mode, strategy);
 }
 
 /*
@@ -875,7 +882,7 @@ ExtendBufferedRelBy(BufferManagerRelation bmr,
 					Buffer *buffers,
 					uint32 *extended_by)
 {
-	Assert((bmr.rel != NULL) != (bmr.smgr != NULL));
+	/* Assert((bmr.rel != NULL) != (bmr.smgr != NULL)); */
 	Assert(bmr.smgr == NULL || bmr.relpersistence != 0);
 	Assert(extend_by > 0);
 
@@ -911,7 +918,7 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	Buffer		buffer = InvalidBuffer;
 	Buffer		buffers[64];
 
-	Assert((bmr.rel != NULL) != (bmr.smgr != NULL));
+	/* Assert((bmr.rel != NULL) != (bmr.smgr != NULL)); */
 	Assert(bmr.smgr == NULL || bmr.relpersistence != 0);
 	Assert(extend_to != InvalidBlockNumber && extend_to > 0);
 
@@ -994,35 +1001,149 @@ ExtendBufferedRelTo(BufferManagerRelation bmr,
 	 */
 	if (buffer == InvalidBuffer)
 	{
-		bool		hit;
-
 		Assert(extended_by == 0);
-		buffer = ReadBuffer_common(bmr.smgr, bmr.relpersistence,
-								   fork, extend_to - 1, mode, strategy,
-								   &hit);
+		buffer = ReadBuffer_common(&bmr, fork, extend_to - 1, mode, strategy);
 	}
 
 	return buffer;
 }
 
 /*
- * ReadBuffer_common -- common logic for all ReadBuffer variants
- *
- * *hit is set to true if the request was satisfied from shared buffer cache.
+ * Zero a buffer and lock it, as part of the implementation of
+ * RBM_ZERO_AND_LOCK or RBM_ZERO_AND_CLEANUP_LOCK.  The buffer must already be
+ * pinned.  It does not have to be valid on entry, but it is valid and locked
+ * on return.
  */
-static Buffer
-ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
-				  BlockNumber blockNum, ReadBufferMode mode,
-				  BufferAccessStrategy strategy, bool *hit)
+static void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
 {
 	BufferDesc *bufHdr;
-	Block		bufBlock;
-	bool		found;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
+/*
+ * Pin a buffer for a given block.  *foundPtr is set to true if the block was
+ * already present, or false if more work is required to either read it in or
+ * zero it.
+ */
+static inline Buffer
+PinBufferForBlock(BufferManagerRelation *bmr,
+				  ForkNumber forkNum,
+				  BlockNumber blockNum,
+				  BufferAccessStrategy strategy,
+				  bool *foundPtr)
+{
+	BufferDesc *bufHdr;
+	bool		isLocalBuf;
 	IOContext	io_context;
 	IOObject	io_object;
-	bool		isLocalBuf = SmgrIsTemp(smgr);
 
-	*hit = false;
+	Assert(blockNum != P_NEW);
+
+	Assert(bmr->smgr);
+
+	isLocalBuf = bmr->relpersistence == RELPERSISTENCE_TEMP;
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(strategy);
+		io_object = IOOBJECT_RELATION;
+	}
+
+	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
+									   bmr->smgr->smgr_rlocator.locator.spcOid,
+									   bmr->smgr->smgr_rlocator.locator.dbOid,
+									   bmr->smgr->smgr_rlocator.locator.relNumber,
+									   bmr->smgr->smgr_rlocator.backend);
+
+	if (isLocalBuf)
+	{
+		bufHdr = LocalBufferAlloc(bmr->smgr, forkNum, blockNum, foundPtr);
+		if (*foundPtr)
+			pgBufferUsage.local_blks_hit++;
+	}
+	else
+	{
+		bufHdr = BufferAlloc(bmr->smgr, bmr->relpersistence, forkNum, blockNum,
+							 strategy, foundPtr, io_context);
+		if (*foundPtr)
+			pgBufferUsage.shared_blks_hit++;
+	}
+	if (bmr->rel)
+	{
+		/*
+		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
+		 * WaitReadBuffers() (so, not for hits, and not for buffers that are
+		 * zeroed instead), the per-relation stats always count them.
+		 */
+		pgstat_count_buffer_read(bmr->rel);
+		if (*foundPtr)
+			pgstat_count_buffer_hit(bmr->rel);
+	}
+	if (*foundPtr)
+	{
+		VacuumPageHit++;
+		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageHit;
+
+		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
+										  bmr->smgr->smgr_rlocator.locator.spcOid,
+										  bmr->smgr->smgr_rlocator.locator.dbOid,
+										  bmr->smgr->smgr_rlocator.locator.relNumber,
+										  bmr->smgr->smgr_rlocator.backend,
+										  true);
+	}
+
+	return BufferDescriptorGetBuffer(bufHdr);
+}
+
+/*
+ * ReadBuffer_common -- common logic for all ReadBuffer variants
+ */
+static pg_attribute_always_inline Buffer
+ReadBuffer_common(BufferManagerRelation *bmr, ForkNumber forkNum,
+				  BlockNumber blockNum, ReadBufferMode mode,
+				  BufferAccessStrategy strategy)
+{
+	ReadBuffersOperation operation;
+	Buffer		buffer;
+	int			flags;
+
+	/* Caller must make sure smgr is initialized. */
+	Assert(bmr->smgr != NULL);
 
 	/*
 	 * Backward compatibility path, most code should use ExtendBufferedRel()
@@ -1041,181 +1162,353 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			flags |= EB_LOCK_FIRST;
 
-		return ExtendBufferedRel(BMR_SMGR(smgr, relpersistence),
-								 forkNum, strategy, flags);
+		return ExtendBufferedRel(*bmr, forkNum, strategy, flags);
 	}
 
-	TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
-									   smgr->smgr_rlocator.locator.spcOid,
-									   smgr->smgr_rlocator.locator.dbOid,
-									   smgr->smgr_rlocator.locator.relNumber,
-									   smgr->smgr_rlocator.backend);
-
-	if (isLocalBuf)
+	if (unlikely(mode == RBM_ZERO_AND_CLEANUP_LOCK ||
+				 mode == RBM_ZERO_AND_LOCK))
 	{
-		/*
-		 * We do not use a BufferAccessStrategy for I/O of temporary tables.
-		 * However, in some cases, the "strategy" may not be NULL, so we can't
-		 * rely on IOContextForStrategy() to set the right IOContext for us.
-		 * This may happen in cases like CREATE TEMPORARY TABLE AS...
-		 */
-		io_context = IOCONTEXT_NORMAL;
-		io_object = IOOBJECT_TEMP_RELATION;
-		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
-		if (found)
-			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
-	}
-	else
-	{
-		/*
-		 * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
-		 * not currently in memory.
-		 */
-		io_context = IOContextForStrategy(strategy);
-		io_object = IOOBJECT_RELATION;
-		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
-							 strategy, &found, io_context);
-		if (found)
-			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
-	}
+		bool		found;
 
-	/* At this point we do NOT hold any locks. */
-
-	/* if it was already in the buffer pool, we're done */
-	if (found)
-	{
-		/* Just need to update stats before we exit */
-		*hit = true;
-		VacuumPageHit++;
-		pgstat_count_io_op(io_object, io_context, IOOP_HIT);
-
-		if (VacuumCostActive)
-			VacuumCostBalance += VacuumCostPageHit;
-
-		TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-										  smgr->smgr_rlocator.locator.spcOid,
-										  smgr->smgr_rlocator.locator.dbOid,
-										  smgr->smgr_rlocator.locator.relNumber,
-										  smgr->smgr_rlocator.backend,
-										  found);
-
-		/*
-		 * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
-		 * on return.
-		 */
-		if (!isLocalBuf)
+#if 0
+		if (bmr->smgr == NULL)
 		{
-			if (mode == RBM_ZERO_AND_LOCK)
-				LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
-							  LW_EXCLUSIVE);
-			else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
-				LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
+			bmr->smgr = RelationGetSmgr(bmr->rel);
+			bmr->relpersistence = bmr->rel->rd_rel->relpersistence;
 		}
+#endif
 
-		return BufferDescriptorGetBuffer(bufHdr);
+		buffer = PinBufferForBlock(bmr, forkNum, blockNum, strategy, &found);
+		ZeroBuffer(buffer, mode);
+		return buffer;
 	}
 
-	/*
-	 * if we have gotten to this point, we have allocated a buffer for the
-	 * page but its contents are not yet valid.  IO_IN_PROGRESS is set for it,
-	 * if it's a shared buffer.
-	 */
-	Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID));	/* spinlock not needed */
-
-	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
-
-	/*
-	 * Read in the page, unless the caller intends to overwrite it and just
-	 * wants us to allocate a buffer.
-	 */
-	if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
-		MemSet((char *) bufBlock, 0, BLCKSZ);
+	if (mode == RBM_ZERO_ON_ERROR)
+		flags = READ_BUFFERS_ZERO_ON_ERROR;
 	else
-	{
-		instr_time	io_start = pgstat_prepare_io_time(track_io_timing);
+		flags = 0;
+	operation.bmr = *bmr;
+	operation.forknum = forkNum;
+	operation.strategy = strategy;
+	if (StartReadBuffer(&operation,
+						&buffer,
+						blockNum,
+						flags))
+		WaitReadBuffers(&operation);
 
-		smgrread(smgr, forkNum, blockNum, bufBlock);
-
-		pgstat_count_io_op_time(io_object, io_context,
-								IOOP_READ, io_start, 1);
-
-		/* check for garbage data */
-		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
-									PIV_LOG_WARNING | PIV_REPORT_STAT))
-		{
-			if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
-			{
-				ereport(WARNING,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s; zeroing out page",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-				MemSet((char *) bufBlock, 0, BLCKSZ);
-			}
-			else
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg("invalid page in block %u of relation %s",
-								blockNum,
-								relpath(smgr->smgr_rlocator, forkNum))));
-		}
-	}
-
-	/*
-	 * In RBM_ZERO_AND_LOCK / RBM_ZERO_AND_CLEANUP_LOCK mode, grab the buffer
-	 * content lock before marking the page as valid, to make sure that no
-	 * other backend sees the zeroed page before the caller has had a chance
-	 * to initialize it.
-	 *
-	 * Since no-one else can be looking at the page contents yet, there is no
-	 * difference between an exclusive lock and a cleanup-strength lock. (Note
-	 * that we cannot use LockBuffer() or LockBufferForCleanup() here, because
-	 * they assert that the buffer is already valid.)
-	 */
-	if ((mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) &&
-		!isLocalBuf)
-	{
-		LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE);
-	}
-
-	if (isLocalBuf)
-	{
-		/* Only need to adjust flags */
-		uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
-
-		buf_state |= BM_VALID;
-		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-	}
-	else
-	{
-		/* Set BM_VALID, terminate IO, and wake up any waiters */
-		TerminateBufferIO(bufHdr, false, BM_VALID, true);
-	}
-
-	VacuumPageMiss++;
-	if (VacuumCostActive)
-		VacuumCostBalance += VacuumCostPageMiss;
-
-	TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
-									  smgr->smgr_rlocator.locator.spcOid,
-									  smgr->smgr_rlocator.locator.dbOid,
-									  smgr->smgr_rlocator.locator.relNumber,
-									  smgr->smgr_rlocator.backend,
-									  found);
-
-	return BufferDescriptorGetBuffer(bufHdr);
+	return buffer;
 }
 
 /*
- * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
- *		buffer.  If no buffer exists already, selects a replacement
- *		victim and evicts the old page, but does NOT read in new page.
+ * Single-block version of StartReadBuffers().  This might save a few
+ * instructions when called from another translation unit, if the compiler
+ * inlines the code and specializes for nblocks == 1.
+ */
+bool
+StartReadBuffer(ReadBuffersOperation *operation,
+				Buffer *buffer,
+				BlockNumber blocknum,
+				int flags)
+{
+	int			nblocks = 1;
+	bool		result;
+
+	result = StartReadBuffers(operation, buffer, blocknum, &nblocks, flags);
+	Assert(nblocks == 1);		/* single block can't be short */
+
+	return result;
+}
+
+/*
+ * Begin reading a range of blocks beginning at blockNum and extending for
+ * *nblocks.  On return, up to *nblocks pinned buffers holding those blocks
+ * are written into the buffers array, and *nblocks is updated to contain the
+ * actual number, which may be fewer than requested.  Caller sets some of the
+ * members of operation; see struct definition.
+ *
+ * If false is returned, no I/O is necessary.  If true is returned, one I/O
+ * has been started, and WaitReadBuffers() must be called with the same
+ * operation object before the buffers are accessed.  Along with the operation
+ * object, the caller-supplied array of buffers must remain valid until
+ * WaitReadBuffers() is called.
+ *
+ * Currently the I/O is only started with optional operating system advice if
+ * requested by the caller with READ_BUFFERS_ISSUE_ADVICE, and the real I/O
+ * happens synchronously in WaitReadBuffers().  In future work, true I/O could
+ * be initiated here.
+ *
+ * The always-inline attribute below makes sure that StartReadBuffer() can
+ * inline this function.
+ */
+pg_attribute_always_inline bool
+StartReadBuffers(ReadBuffersOperation *operation,
+				 Buffer *buffers,
+				 BlockNumber blockNum,
+				 int *nblocks,
+				 int flags)
+{
+	int			actual_nblocks = *nblocks;
+	int			io_buffers_len = 0;
+
+	Assert(*nblocks > 0);
+	Assert(*nblocks <= MAX_IO_COMBINE_LIMIT);
+
+	for (int i = 0; i < actual_nblocks; ++i)
+	{
+		bool		found;
+
+		buffers[i] = PinBufferForBlock(&operation->bmr,
+									   operation->forknum,
+									   blockNum + i,
+									   operation->strategy,
+									   &found);
+
+		if (found)
+		{
+			/*
+			 * Terminate the read as soon as we get a hit.  It could be a
+			 * single buffer hit, or it could be a hit that follows a readable
+			 * range.  We don't want to create more than one readable range,
+			 * so we stop here.
+			 */
+			actual_nblocks = i + 1;
+			break;
+		}
+		else
+		{
+			/* Extend the readable range to cover this block. */
+			io_buffers_len++;
+		}
+	}
+	*nblocks = actual_nblocks;
+
+	if (likely(io_buffers_len == 0))
+		return false;
+
+	/* Populate information needed for I/O. */
+	operation->buffers = buffers;
+	operation->blocknum = blockNum;
+	operation->flags = flags;
+	operation->nblocks = actual_nblocks;
+	operation->io_buffers_len = io_buffers_len;
+
+	if (flags & READ_BUFFERS_ISSUE_ADVICE)
+	{
+		/*
+		 * In theory we should only do this if PinBufferForBlock() had to
+		 * allocate new buffers above.  That way, if two calls to
+		 * StartReadBuffers() were made for the same blocks before
+		 * WaitReadBuffers(), only the first would issue the advice. That'd be
+		 * a better simulation of true asynchronous I/O, which would only
+		 * start the I/O once, but isn't done here for simplicity.  Note also
+		 * that the following call might actually issue two advice calls if we
+		 * cross a segment boundary; in a true asynchronous version we might
+		 * choose to process only one real I/O at a time in that case.
+		 */
+		smgrprefetch(operation->bmr.smgr,
+					 operation->forknum,
+					 blockNum,
+					 operation->io_buffers_len);
+	}
+
+	/* Indicate that WaitReadBuffers() should be called. */
+	return true;
+}
+
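+/*
+ * A minimal sketch of the two-step protocol (local variable names are
+ * hypothetical).  The operation object and the buffers array must stay
+ * valid until WaitReadBuffers() returns:
+ *
+ *     Buffer      bufs[MAX_IO_COMBINE_LIMIT];
+ *     int         nblocks = 4;
+ *     ReadBuffersOperation op;
+ *
+ *     op.bmr = bmr;
+ *     op.forknum = MAIN_FORKNUM;
+ *     op.strategy = strategy;
+ *     if (StartReadBuffers(&op, bufs, blocknum, &nblocks, 0))
+ *         WaitReadBuffers(&op);
+ *     ... nblocks now says how many buffers were actually pinned ...
+ */
+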
+static inline bool
+WaitReadBuffersCanStartIO(Buffer buffer, bool nowait)
+{
+	if (BufferIsLocal(buffer))
+	{
+		BufferDesc *bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+
+		return (pg_atomic_read_u32(&bufHdr->state) & BM_VALID) == 0;
+	}
+	else
+		return StartBufferIO(GetBufferDescriptor(buffer - 1), true, nowait);
+}
+
+void
+WaitReadBuffers(ReadBuffersOperation *operation)
+{
+	Buffer	   *buffers;
+	int			nblocks;
+	BlockNumber blocknum;
+	ForkNumber	forknum;
+	bool		isLocalBuf;
+	IOContext	io_context;
+	IOObject	io_object;
+
+	/*
+	 * Currently operations are only allowed to include a read of some range,
+	 * with an optional extra buffer that is already pinned at the end.  So
+	 * nblocks can be at most one more than io_buffers_len.
+	 */
+	Assert((operation->nblocks == operation->io_buffers_len) ||
+		   (operation->nblocks == operation->io_buffers_len + 1));
+
+	/* Find the range of the physical read we need to perform. */
+	nblocks = operation->io_buffers_len;
+	if (nblocks == 0)
+		return;					/* nothing to do */
+
+	buffers = &operation->buffers[0];
+	blocknum = operation->blocknum;
+	forknum = operation->forknum;
+
+	isLocalBuf = operation->bmr.relpersistence == RELPERSISTENCE_TEMP;
+	if (isLocalBuf)
+	{
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_TEMP_RELATION;
+	}
+	else
+	{
+		io_context = IOContextForStrategy(operation->strategy);
+		io_object = IOOBJECT_RELATION;
+	}
+
+	/*
+	 * We count all these blocks as read by this backend.  This is traditional
+	 * behavior, but might turn out to be not true if we find that someone
+	 * else has beaten us and completed the read of some of these blocks.  In
+	 * that case the system globally double-counts, but we traditionally don't
+	 * count this as a "hit", and we don't have a separate counter for "miss,
+	 * but another backend completed the read".
+	 */
+	if (isLocalBuf)
+		pgBufferUsage.local_blks_read += nblocks;
+	else
+		pgBufferUsage.shared_blks_read += nblocks;
+
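+	/*
+	 * For example, if blocks 10..13 were started and another backend has
+	 * completed block 11 in the meantime, the loop below issues one
+	 * scatter-read for block 10, skips block 11 as a hit, and issues a
+	 * second scatter-read covering blocks 12 and 13.
+	 */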
+	for (int i = 0; i < nblocks; ++i)
+	{
+		int			io_buffers_len;
+		Buffer		io_buffers[MAX_IO_COMBINE_LIMIT];
+		void	   *io_pages[MAX_IO_COMBINE_LIMIT];
+		instr_time	io_start;
+		BlockNumber io_first_block;
+
+		/*
+		 * Skip this block if someone else has already completed it.  If an
+		 * I/O is already in progress in another backend, this will wait for
+		 * the outcome: either done, or something went wrong and we will
+		 * retry.
+		 */
+		if (!WaitReadBuffersCanStartIO(buffers[i], false))
+		{
+			/*
+			 * Report this as a 'hit' for this backend, even though it must
+			 * have started out as a miss in PinBufferForBlock().
+			 */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, blocknum + i,
+											  operation->bmr.smgr->smgr_rlocator.locator.spcOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.dbOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.relNumber,
+											  operation->bmr.smgr->smgr_rlocator.backend,
+											  true);
+			continue;
+		}
+
+		/* We found a buffer that we need to read in. */
+		io_buffers[0] = buffers[i];
+		io_pages[0] = BufferGetBlock(buffers[i]);
+		io_first_block = blocknum + i;
+		io_buffers_len = 1;
+
+		/*
+		 * How many neighboring-on-disk blocks can we scatter-read into other
+		 * buffers at the same time?  In this case we don't wait if we see an
+		 * I/O already in progress.  We already hold BM_IO_IN_PROGRESS for the
+		 * head block, so we should get on with that I/O as soon as possible.
+		 * Any block we skip here will be revisited by the outer loop.
+		 */
+		while ((i + 1) < nblocks &&
+			   WaitReadBuffersCanStartIO(buffers[i + 1], true))
+		{
+			/* Must be consecutive block numbers. */
+			Assert(BufferGetBlockNumber(buffers[i + 1]) ==
+				   BufferGetBlockNumber(buffers[i]) + 1);
+
+			io_buffers[io_buffers_len] = buffers[++i];
+			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
+		}
+
+		io_start = pgstat_prepare_io_time(track_io_timing);
+		smgrreadv(operation->bmr.smgr, forknum, io_first_block, io_pages, io_buffers_len);
+		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
+								io_buffers_len);
+
+		/* Verify each block we read, and terminate the I/O. */
+		for (int j = 0; j < io_buffers_len; ++j)
+		{
+			BufferDesc *bufHdr;
+			Block		bufBlock;
+
+			if (isLocalBuf)
+			{
+				bufHdr = GetLocalBufferDescriptor(-io_buffers[j] - 1);
+				bufBlock = LocalBufHdrGetBlock(bufHdr);
+			}
+			else
+			{
+				bufHdr = GetBufferDescriptor(io_buffers[j] - 1);
+				bufBlock = BufHdrGetBlock(bufHdr);
+			}
+
+			/* check for garbage data */
+			if (!PageIsVerifiedExtended((Page) bufBlock, io_first_block + j,
+										PIV_LOG_WARNING | PIV_REPORT_STAT))
+			{
+				if ((operation->flags & READ_BUFFERS_ZERO_ON_ERROR) || zero_damaged_pages)
+				{
+					ereport(WARNING,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s; zeroing out page",
+									io_first_block + j,
+									relpath(operation->bmr.smgr->smgr_rlocator, forknum))));
+					memset(bufBlock, 0, BLCKSZ);
+				}
+				else
+					ereport(ERROR,
+							(errcode(ERRCODE_DATA_CORRUPTED),
+							 errmsg("invalid page in block %u of relation %s",
+									io_first_block + j,
+									relpath(operation->bmr.smgr->smgr_rlocator, forknum))));
+			}
+
+			/* Terminate I/O and set BM_VALID. */
+			if (isLocalBuf)
+			{
+				uint32		buf_state = pg_atomic_read_u32(&bufHdr->state);
+
+				buf_state |= BM_VALID;
+				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+			}
+			else
+			{
+				/* Set BM_VALID, terminate IO, and wake up any waiters */
+				TerminateBufferIO(bufHdr, false, BM_VALID, true);
+			}
+
+			/* Report I/Os as completing individually. */
+			TRACE_POSTGRESQL_BUFFER_READ_DONE(forknum, io_first_block + j,
+											  operation->bmr.smgr->smgr_rlocator.locator.spcOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.dbOid,
+											  operation->bmr.smgr->smgr_rlocator.locator.relNumber,
+											  operation->bmr.smgr->smgr_rlocator.backend,
+											  false);
+		}
+
+		VacuumPageMiss += io_buffers_len;
+		if (VacuumCostActive)
+			VacuumCostBalance += VacuumCostPageMiss * io_buffers_len;
+	}
+}
+
+/*
+ * BufferAlloc -- subroutine for PinBufferForBlock.  Handles lookup of a shared
+ *		buffer.  If no buffer exists already, selects a replacement victim and
+ *		evicts the old page, but does NOT read in new page.
  *
  * "strategy" can be a buffer replacement strategy object, or NULL for
  * the default strategy.  The selected buffer's usage_count is advanced when
@@ -1223,11 +1516,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * The returned buffer is pinned and is already marked as holding the
  * desired page.  If it already did have the desired page, *foundPtr is
- * set true.  Otherwise, *foundPtr is set false and the buffer is marked
- * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
- *
- * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
- * we keep it for simplicity in ReadBuffer.
+ * set true.  Otherwise, *foundPtr is set false.
  *
  * io_context is passed as an output parameter to avoid calling
  * IOContextForStrategy() when there is a shared buffers hit and no IO
@@ -1235,7 +1524,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  *
  * No locks are held either at entry or exit.
  */
-static BufferDesc *
+static inline BufferDesc *
 BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			BlockNumber blockNum,
 			BufferAccessStrategy strategy,
@@ -1286,19 +1575,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(buf, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return buf;
@@ -1363,19 +1643,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		{
 			/*
 			 * We can only get here if (a) someone else is still reading in
-			 * the page, or (b) a previous read attempt failed.  We have to
-			 * wait for any active read attempt to finish, and then set up our
-			 * own read attempt if the page is still not BM_VALID.
-			 * StartBufferIO does it all.
+			 * the page, (b) a previous read attempt failed, or (c) someone
+			 * called StartReadBuffers() but not yet WaitReadBuffers().
 			 */
-			if (StartBufferIO(existing_buf_hdr, true))
-			{
-				/*
-				 * If we get here, previous attempts to read the buffer must
-				 * have failed ... but we shall bravely try again.
-				 */
-				*foundPtr = false;
-			}
+			*foundPtr = false;
 		}
 
 		return existing_buf_hdr;
@@ -1407,15 +1678,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	LWLockRelease(newPartitionLock);
 
 	/*
-	 * Buffer contents are currently invalid.  Try to obtain the right to
-	 * start I/O.  If StartBufferIO returns false, then someone else managed
-	 * to read it before we did, so there's nothing left for BufferAlloc() to
-	 * do.
+	 * Buffer contents are currently invalid.
 	 */
-	if (StartBufferIO(victim_buf_hdr, true))
-		*foundPtr = false;
-	else
-		*foundPtr = true;
+	*foundPtr = false;
 
 	return victim_buf_hdr;
 }
@@ -1769,7 +2034,7 @@ again:
  * pessimistic, but outside of toy-sized shared_buffers it should allow
  * sufficient pins.
  */
-static void
+void
 LimitAdditionalPins(uint32 *additional_pins)
 {
 	uint32		max_backends;
@@ -2034,7 +2299,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 
 				buf_state &= ~BM_VALID;
 				UnlockBufHdr(existing_hdr, buf_state);
-			} while (!StartBufferIO(existing_hdr, true));
+			} while (!StartBufferIO(existing_hdr, true, false));
 		}
 		else
 		{
@@ -2057,7 +2322,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 			LWLockRelease(partition_lock);
 
 			/* XXX: could combine the locked operations in it with the above */
-			StartBufferIO(victim_buf_hdr, true);
+			StartBufferIO(victim_buf_hdr, true, false);
 		}
 	}
 
@@ -2372,7 +2637,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 	else
 	{
 		/*
-		 * If we previously pinned the buffer, it must surely be valid.
+		 * If we previously pinned the buffer, it is likely to be valid, but
+		 * it may not be if StartReadBuffers() was called and
+		 * WaitReadBuffers() hasn't been called yet.  We'll check by loading
+		 * the flags without locking.  This is racy, but it's OK to return
+		 * false spuriously: when WaitReadBuffers() calls StartBufferIO(),
+		 * it'll see that it's now valid.
 		 *
 		 * Note: We deliberately avoid a Valgrind client request here.
 		 * Individual access methods can optionally superimpose buffer page
@@ -2381,7 +2651,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here.  We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -3449,7 +3719,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false))
+	if (!StartBufferIO(buf, false, false))
 		return;
 
 	/* Setup error traceback support for ereport() */
@@ -5184,9 +5454,15 @@ WaitIO(BufferDesc *buf)
  *
  * Returns true if we successfully marked the buffer as I/O busy,
  * false if someone else already did the work.
+ *
+ * If nowait is true, then we don't wait for an I/O to be finished by another
+ * backend.  In that case, false indicates either that the I/O was already
+ * finished, or is still in progress.  This is useful for callers that want to
+ * find out if they can perform the I/O as part of a larger operation, without
+ * waiting for the answer or distinguishing the reasons why not.
  */
 static bool
-StartBufferIO(BufferDesc *buf, bool forInput)
+StartBufferIO(BufferDesc *buf, bool forInput, bool nowait)
 {
 	uint32		buf_state;
 
@@ -5199,6 +5475,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
 		if (!(buf_state & BM_IO_IN_PROGRESS))
 			break;
 		UnlockBufHdr(buf, buf_state);
+		if (nowait)
+			return false;
 		WaitIO(buf);
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index fcfac335a57..985a2c7049c 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -108,10 +108,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local.   Also, IO_IN_PROGRESS
- * does not get set.  Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local.  We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -287,7 +286,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -297,9 +296,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but NLocBuffer might
+	 * not be initialized yet, so read num_temp_buffers instead.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 92fcd5fa4d5..c12784cbec8 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -3129,6 +3129,20 @@ struct config_int ConfigureNamesInt[] =
 		NULL
 	},
 
+	{
+		{"io_combine_limit",
+			PGC_USERSET,
+			RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Limit on the size of data reads and writes."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&io_combine_limit,
+		DEFAULT_IO_COMBINE_LIMIT,
+		1, MAX_IO_COMBINE_LIMIT,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"backend_flush_after", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
 			gettext_noop("Number of pages after which previously performed writes are flushed to disk."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index adcc0257f91..baecde28410 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -203,6 +203,7 @@
 #backend_flush_after = 0		# measured in pages, 0 disables
 #effective_io_concurrency = 1		# 1-1000; 0 disables prefetching
 #maintenance_io_concurrency = 10	# 1-1000; 0 disables prefetching
+#io_combine_limit = 128kB		# usually 1-32 blocks (depends on OS)
 #max_worker_processes = 8		# (change requires restart)
 #max_parallel_workers_per_gather = 2	# limited by max_parallel_workers
 #max_parallel_maintenance_workers = 2	# limited by max_parallel_workers
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f65c17e5ae4..241a6079688 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2720,6 +2720,20 @@ include_dir 'conf.d'
        </listitem>
       </varlistentry>
 
+      <varlistentry id="guc-io-combine-limit" xreflabel="io_combine_limit">
+       <term><varname>io_combine_limit</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>io_combine_limit</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Controls the largest I/O size in operations that combine I/O.
+         The default is 128kB.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-max-worker-processes" xreflabel="max_worker_processes">
        <term><varname>max_worker_processes</varname> (<type>integer</type>)
        <indexterm>
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9add48f9924..1590dc6fddc 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1215,6 +1215,7 @@ InjectionPointCacheEntry
 InjectionPointEntry
 InjectionPointSharedState
 InlineCodeBlock
+InProgressIO
 InsertStmt
 Instrumentation
 Int128AggState
@@ -2288,11 +2289,13 @@ ReInitializeDSMForeignScan_function
 ReScanForeignScan_function
 ReadBufPtrType
 ReadBufferMode
+ReadBuffersOperation
 ReadBytePtrType
 ReadExtraTocPtrType
 ReadFunc
 ReadLocalXLogPageNoWaitPrivate
 ReadReplicationSlotCmd
+ReadStream
 ReassignOwnedStmt
 RecheckForeignScan_function
 RecordCacheArrayEntry
-- 
2.43.0

v4-0002-Use-streaming-read-API-in-ANALYZE.patchtext/x-patch; charset=US-ASCII; name=v4-0002-Use-streaming-read-API-in-ANALYZE.patchDownload
From 0e8867f6fd20501ef5853581a9c010c1e2c28e23 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Mon, 1 Apr 2024 13:58:21 +0300
Subject: [PATCH v4 2/2] Use streaming read API in ANALYZE

The ANALYZE command gets random tuples using the BlockSampler algorithm.
Use streaming reads to get these tuples by running the BlockSampler
algorithm in the streaming read API's prefetch callback.
---
 src/include/access/heapam.h              |  4 +-
 src/backend/access/heap/heapam_handler.c | 16 ++---
 src/backend/commands/analyze.c           | 85 ++++++++----------------
 3 files changed, 36 insertions(+), 69 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 32a3fbce961..2eadad3360c 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,6 +25,7 @@
 #include "storage/bufpage.h"
 #include "storage/dsm.h"
 #include "storage/lockdefs.h"
+#include "storage/read_stream.h"
 #include "storage/shm_toc.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -371,8 +372,7 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
 
 /* in heap/heapam_handler.c*/
 extern void heapam_scan_analyze_next_block(TableScanDesc scan,
-										   BlockNumber blockno,
-										   BufferAccessStrategy bstrategy);
+										   ReadStream *stream);
 extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
 										   TransactionId OldestXmin,
 										   double *liverows, double *deadrows,
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 41a4bb0981d..c686fd5e122 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1055,16 +1055,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 }
 
 /*
- * Prepare to analyze block `blockno` of `scan`.  The scan has been started
- * with SO_TYPE_ANALYZE option.
+ * Prepare to analyze block returned from streaming object.  The scan has been
+ * started with SO_TYPE_ANALYZE option.
  *
  * This routine holds a buffer pin and lock on the heap page.  They are held
  * until heapam_scan_analyze_next_tuple() returns false.  That is until all the
  * items of the heap page are analyzed.
  */
 void
-heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-							   BufferAccessStrategy bstrategy)
+heapam_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
 
@@ -1077,11 +1076,12 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
 	 * doing much work per tuple, the extra lock traffic is probably better
 	 * avoided.
 	 */
-	hscan->rs_cblock = blockno;
-	hscan->rs_cindex = FirstOffsetNumber;
-	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
-										blockno, RBM_NORMAL, bstrategy);
+	hscan->rs_cbuf = read_stream_next_buffer(stream, NULL);
+	Assert(BufferIsValid(hscan->rs_cbuf));
 	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
+
+	hscan->rs_cblock = BufferGetBlockNumber(hscan->rs_cbuf);
+	hscan->rs_cindex = FirstOffsetNumber;
 }
 
 /*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2fb39f3ede1..da31b6f2d77 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1102,6 +1102,20 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 	return stats;
 }
 
+/*
+ * Prefetch callback function to get next block number while using
+ * BlockSampling algorithm
+ */
+static BlockNumber
+block_sampling_streaming_read_next(ReadStream *stream,
+								   void *user_data,
+								   void *per_buffer_data)
+{
+	BlockSamplerData *bs = user_data;
+
+	return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
+}
+
 /*
  * acquire_sample_rows -- acquire a random sample of rows from the heap
  *
@@ -1154,10 +1168,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	TableScanDesc scan;
 	BlockNumber nblocks;
 	BlockNumber blksdone = 0;
-#ifdef USE_PREFETCH
-	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
-	BlockSamplerData prefetch_bs;
-#endif
+	ReadStream *stream;
 
 	Assert(targrows > 0);
 
@@ -1170,13 +1181,6 @@ acquire_sample_rows(Relation onerel, int elevel,
 	randseed = pg_prng_uint32(&pg_global_prng_state);
 	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
 
-#ifdef USE_PREFETCH
-	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
-	/* Create another BlockSampler, using the same seed, for prefetching */
-	if (prefetch_maximum)
-		(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
-#endif
-
 	/* Report sampling block numbers */
 	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
 								 nblocks);
@@ -1187,59 +1191,20 @@ acquire_sample_rows(Relation onerel, int elevel,
 	scan = heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
 	slot = table_slot_create(onerel, NULL);
 
-#ifdef USE_PREFETCH
-
-	/*
-	 * If we are doing prefetching, then go ahead and tell the kernel about
-	 * the first set of pages we are going to want.  This also moves our
-	 * iterator out ahead of the main one being used, where we will keep it so
-	 * that we're always pre-fetching out prefetch_maximum number of blocks
-	 * ahead.
-	 */
-	if (prefetch_maximum)
-	{
-		for (int i = 0; i < prefetch_maximum; i++)
-		{
-			BlockNumber prefetch_block;
-
-			if (!BlockSampler_HasMore(&prefetch_bs))
-				break;
-
-			prefetch_block = BlockSampler_Next(&prefetch_bs);
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
-		}
-	}
-#endif
+	stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
+										vac_strategy,
+										BMR_REL(scan->rs_rd),
+										MAIN_FORKNUM,
+										block_sampling_streaming_read_next,
+										&bs,
+										0);
 
 	/* Outer loop over blocks to sample */
-	while (BlockSampler_HasMore(&bs))
+	while (nblocks)
 	{
-		BlockNumber targblock = BlockSampler_Next(&bs);
-#ifdef USE_PREFETCH
-		BlockNumber prefetch_targblock = InvalidBlockNumber;
-
-		/*
-		 * Make sure that every time the main BlockSampler is moved forward
-		 * that our prefetch BlockSampler also gets moved forward, so that we
-		 * always stay out ahead.
-		 */
-		if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
-			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
-#endif
-
 		vacuum_delay_point();
 
-		heapam_scan_analyze_next_block(scan, targblock, vac_strategy);
-
-#ifdef USE_PREFETCH
-
-		/*
-		 * When pre-fetching, after we get a block, tell the kernel about the
-		 * next one we will want, if there's any left.
-		 */
-		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
-#endif
+		heapam_scan_analyze_next_block(scan, stream);
 
 		while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
 		{
@@ -1288,7 +1253,9 @@ acquire_sample_rows(Relation onerel, int elevel,
 
 		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
 									 ++blksdone);
+		nblocks--;
 	}
+	read_stream_end(stream);
 
 	ExecDropSingleTupleTableSlot(slot);
 	heap_endscan(scan);
-- 
2.43.0

#6Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Nazir Bilal Yavuz (#5)
Re: Use streaming read API in ANALYZE

On Tue, Apr 2, 2024 at 9:24 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
[..]

v4 is rebased on top of v14 streaming read API changes.

Hi Nazir, with the streaming API committed, I gave this patch a try.
With autovacuum=off and a 30GB table on NVMe (standard 256kB readahead,
ext4, Debian 12, kernel 6.1.0, default shared_buffers = 128MB), created
using: create table t as select repeat('a', 100) || i
|| repeat('b', 500) as filler from generate_series(1, 45000000) as i;

on master, the effect of maintenance_io_concurrency [default 10] is as
follows (resetting the fs cache after each ANALYZE):

m_io_c = 0:
Time: 3137.914 ms (00:03.138)
Time: 3094.540 ms (00:03.095)
Time: 3452.513 ms (00:03.453)

m_io_c = 1:
Time: 2972.751 ms (00:02.973)
Time: 2939.551 ms (00:02.940)
Time: 2904.428 ms (00:02.904)

m_io_c = 2:
Time: 1580.260 ms (00:01.580)
Time: 1572.132 ms (00:01.572)
Time: 1558.334 ms (00:01.558)

m_io_c = 4:
Time: 938.304 ms
Time: 931.772 ms
Time: 920.044 ms

m_io_c = 8:
Time: 666.025 ms
Time: 660.241 ms
Time: 648.848 ms

m_io_c = 16:
Time: 542.450 ms
Time: 561.155 ms
Time: 539.683 ms

m_io_c = 32:
Time: 538.487 ms
Time: 541.705 ms
Time: 538.101 ms

with the patch applied:

m_io_c = 0:
Time: 3106.469 ms (00:03.106)
Time: 3140.343 ms (00:03.140)
Time: 3044.133 ms (00:03.044)

m_io_c = 1:
Time: 2959.817 ms (00:02.960)
Time: 2920.265 ms (00:02.920)
Time: 2911.745 ms (00:02.912)

m_io_c = 2:
Time: 1581.912 ms (00:01.582)
Time: 1561.444 ms (00:01.561)
Time: 1558.251 ms (00:01.558)

m_io_c = 4:
Time: 908.116 ms
Time: 901.245 ms
Time: 901.071 ms

m_io_c = 8:
Time: 619.870 ms
Time: 620.327 ms
Time: 614.266 ms

m_io_c = 16:
Time: 529.885 ms
Time: 526.958 ms
Time: 528.474 ms

m_io_c = 32:
Time: 521.185 ms
Time: 520.713 ms
Time: 517.729 ms

No difference to me, which seems good. I've double-checked that the
patch used the new code path:

acquire_sample_rows -> heapam_scan_analyze_next_block ->
ReadBufferExtended -> ReadBuffer_common (inlined) -> WaitReadBuffers
-> mdreadv -> FileReadV -> pg_preadv (inlined)
acquire_sample_rows -> heapam_scan_analyze_next_block ->
ReadBufferExtended -> ReadBuffer_common (inlined) -> StartReadBuffer
-> ...

I also gave io_combine_limit = 32 (the max, 256kB) a try and got these
slightly better results:

[..]
m_io_c = 16:
Time: 494.599 ms
Time: 496.345 ms
Time: 973.500 ms

m_io_c = 32:
Time: 461.031 ms
Time: 449.037 ms
Time: 443.375 ms

and that (the last one) apparently was able to push it into the ~50-60k
random IOPS range; the rareq-sz was still ~8 (9.9) kB as ANALYZE was
still reading randomly, so I assume no merging was done:

Device: nvme0n1
  reads:    r/s 61212.00  rMB/s 591.82  rrqm/s 0.00  %rrqm 0.00  r_await 0.10  rareq-sz 9.90
  writes:   w/s 2.00      wMB/s 0.02    wrqm/s 0.00  %wrqm 0.00  w_await 0.00  wareq-sz 12.00
  discards: d/s 0.00      dMB/s 0.00    drqm/s 0.00  %drqm 0.00  d_await 0.00  dareq-sz 0.00
  flushes:  f/s 0.00      f_await 0.00
  queue:    aqu-sz 6.28   %util 85.20

So in short it looks good to me.

-Jakub Wartak.

#7Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Nazir Bilal Yavuz (#5)
1 attachment(s)
Re: Use streaming read API in ANALYZE

Hi,

On Tue, 2 Apr 2024 at 10:23, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

v4 is rebased on top of v14 streaming read API changes.

The streaming read API has been committed, but the committed version has
a minor change: the read_stream_begin_relation() function now takes a
Relation instead of a BufferManagerRelation. So here is v5, which
addresses this change.
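
Concretely, the only call-site change between v4 and v5 (taken from the
hunks in those two patches) is the third argument of
read_stream_begin_relation():

/* v4, against the pre-commit API: */
stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
									vac_strategy,
									BMR_REL(scan->rs_rd),
									MAIN_FORKNUM,
									block_sampling_streaming_read_next,
									&bs,
									0);

/* v5, against the committed API, which takes a Relation directly: */
stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
									vac_strategy,
									scan->rs_rd,
									MAIN_FORKNUM,
									block_sampling_streaming_read_next,
									&bs,
									0);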

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v5-0001-Use-streaming-read-API-in-ANALYZE.patchtext/x-patch; charset=US-ASCII; name=v5-0001-Use-streaming-read-API-in-ANALYZE.patchDownload
From 43e2f2b32e2fdb7e1fd787b1d8595768741f4792 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Wed, 3 Apr 2024 01:22:59 +0300
Subject: [PATCH v5] Use streaming read API in ANALYZE

The ANALYZE command gets random tuples using the BlockSampler algorithm.
Use streaming reads to get these tuples by running the BlockSampler
algorithm in the streaming read API's prefetch callback.
---
 src/include/access/heapam.h              |  4 +-
 src/backend/access/heap/heapam_handler.c | 16 ++---
 src/backend/commands/analyze.c           | 85 ++++++++----------------
 3 files changed, 36 insertions(+), 69 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b632fe953c4..4e35caeb42e 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,6 +25,7 @@
 #include "storage/bufpage.h"
 #include "storage/dsm.h"
 #include "storage/lockdefs.h"
+#include "storage/read_stream.h"
 #include "storage/shm_toc.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -374,8 +375,7 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
 
 /* in heap/heapam_handler.c*/
 extern void heapam_scan_analyze_next_block(TableScanDesc scan,
-										   BlockNumber blockno,
-										   BufferAccessStrategy bstrategy);
+										   ReadStream *stream);
 extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
 										   TransactionId OldestXmin,
 										   double *liverows, double *deadrows,
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index c86000d245b..0533d9660c4 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1054,16 +1054,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 }
 
 /*
- * Prepare to analyze block `blockno` of `scan`.  The scan has been started
- * with SO_TYPE_ANALYZE option.
+ * Prepare to analyze block returned from streaming object.  The scan has been
+ * started with SO_TYPE_ANALYZE option.
  *
  * This routine holds a buffer pin and lock on the heap page.  They are held
  * until heapam_scan_analyze_next_tuple() returns false.  That is until all the
  * items of the heap page are analyzed.
  */
 void
-heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-							   BufferAccessStrategy bstrategy)
+heapam_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
 
@@ -1076,11 +1075,12 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
 	 * doing much work per tuple, the extra lock traffic is probably better
 	 * avoided.
 	 */
-	hscan->rs_cblock = blockno;
-	hscan->rs_cindex = FirstOffsetNumber;
-	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
-										blockno, RBM_NORMAL, bstrategy);
+	hscan->rs_cbuf = read_stream_next_buffer(stream, NULL);
+	Assert(BufferIsValid(hscan->rs_cbuf));
 	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
+
+	hscan->rs_cblock = BufferGetBlockNumber(hscan->rs_cbuf);
+	hscan->rs_cindex = FirstOffsetNumber;
 }
 
 /*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2fb39f3ede1..105285c3ea2 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1102,6 +1102,20 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 	return stats;
 }
 
+/*
+ * Prefetch callback function to get next block number while using
+ * BlockSampling algorithm
+ */
+static BlockNumber
+block_sampling_streaming_read_next(ReadStream *stream,
+								   void *user_data,
+								   void *per_buffer_data)
+{
+	BlockSamplerData *bs = user_data;
+
+	return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
+}
+
 /*
  * acquire_sample_rows -- acquire a random sample of rows from the heap
  *
@@ -1154,10 +1168,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	TableScanDesc scan;
 	BlockNumber nblocks;
 	BlockNumber blksdone = 0;
-#ifdef USE_PREFETCH
-	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
-	BlockSamplerData prefetch_bs;
-#endif
+	ReadStream *stream;
 
 	Assert(targrows > 0);
 
@@ -1170,13 +1181,6 @@ acquire_sample_rows(Relation onerel, int elevel,
 	randseed = pg_prng_uint32(&pg_global_prng_state);
 	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
 
-#ifdef USE_PREFETCH
-	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
-	/* Create another BlockSampler, using the same seed, for prefetching */
-	if (prefetch_maximum)
-		(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
-#endif
-
 	/* Report sampling block numbers */
 	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
 								 nblocks);
@@ -1187,59 +1191,20 @@ acquire_sample_rows(Relation onerel, int elevel,
 	scan = heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
 	slot = table_slot_create(onerel, NULL);
 
-#ifdef USE_PREFETCH
-
-	/*
-	 * If we are doing prefetching, then go ahead and tell the kernel about
-	 * the first set of pages we are going to want.  This also moves our
-	 * iterator out ahead of the main one being used, where we will keep it so
-	 * that we're always pre-fetching out prefetch_maximum number of blocks
-	 * ahead.
-	 */
-	if (prefetch_maximum)
-	{
-		for (int i = 0; i < prefetch_maximum; i++)
-		{
-			BlockNumber prefetch_block;
-
-			if (!BlockSampler_HasMore(&prefetch_bs))
-				break;
-
-			prefetch_block = BlockSampler_Next(&prefetch_bs);
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
-		}
-	}
-#endif
+	stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
+										vac_strategy,
+										scan->rs_rd,
+										MAIN_FORKNUM,
+										block_sampling_streaming_read_next,
+										&bs,
+										0);
 
 	/* Outer loop over blocks to sample */
-	while (BlockSampler_HasMore(&bs))
+	while (nblocks)
 	{
-		BlockNumber targblock = BlockSampler_Next(&bs);
-#ifdef USE_PREFETCH
-		BlockNumber prefetch_targblock = InvalidBlockNumber;
-
-		/*
-		 * Make sure that every time the main BlockSampler is moved forward
-		 * that our prefetch BlockSampler also gets moved forward, so that we
-		 * always stay out ahead.
-		 */
-		if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
-			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
-#endif
-
 		vacuum_delay_point();
 
-		heapam_scan_analyze_next_block(scan, targblock, vac_strategy);
-
-#ifdef USE_PREFETCH
-
-		/*
-		 * When pre-fetching, after we get a block, tell the kernel about the
-		 * next one we will want, if there's any left.
-		 */
-		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
-#endif
+		heapam_scan_analyze_next_block(scan, stream);
 
 		while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
 		{
@@ -1288,7 +1253,9 @@ acquire_sample_rows(Relation onerel, int elevel,
 
 		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
 									 ++blksdone);
+		nblocks--;
 	}
+	read_stream_end(stream);
 
 	ExecDropSingleTupleTableSlot(slot);
 	heap_endscan(scan);
-- 
2.43.0

#8Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Nazir Bilal Yavuz (#7)
Re: Use streaming read API in ANALYZE

On 03/04/2024 13:31, Nazir Bilal Yavuz wrote:

The streaming read API has been committed, but the committed version has
a minor change: the read_stream_begin_relation() function now takes a
Relation instead of a BufferManagerRelation. So here is v5, which
addresses this change.

I'm getting a repeatable segfault / assertion failure with this:

postgres=# CREATE TABLE tengiga (i int, filler text) with (fillfactor=10);
CREATE TABLE
postgres=# insert into tengiga select g, repeat('x', 900) from
generate_series(1, 1400000) g;
INSERT 0 1400000
postgres=# set default_statistics_target = 10; ANALYZE tengiga;
SET
ANALYZE
postgres=# set default_statistics_target = 100; ANALYZE tengiga;
SET
ANALYZE
postgres=# set default_statistics_target =1000; ANALYZE tengiga;
SET
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

TRAP: failed Assert("BufferIsValid(hscan->rs_cbuf)"), File:
"heapam_handler.c", Line: 1079, PID: 262232
postgres: heikki postgres [local]
ANALYZE(ExceptionalCondition+0xa8)[0x56488a0de9d8]
postgres: heikki postgres [local]
ANALYZE(heapam_scan_analyze_next_block+0x63)[0x5648899ece34]
postgres: heikki postgres [local] ANALYZE(+0x2d3f34)[0x564889b6af34]
postgres: heikki postgres [local] ANALYZE(+0x2d2a3a)[0x564889b69a3a]
postgres: heikki postgres [local] ANALYZE(analyze_rel+0x33e)[0x564889b68fa9]
postgres: heikki postgres [local] ANALYZE(vacuum+0x4b3)[0x564889c2dcc0]
postgres: heikki postgres [local] ANALYZE(ExecVacuum+0xd6f)[0x564889c2d7fe]
postgres: heikki postgres [local]
ANALYZE(standard_ProcessUtility+0x901)[0x564889f0b8b9]
postgres: heikki postgres [local]
ANALYZE(ProcessUtility+0x136)[0x564889f0afb1]
postgres: heikki postgres [local] ANALYZE(+0x6728c8)[0x564889f098c8]
postgres: heikki postgres [local] ANALYZE(+0x672b3b)[0x564889f09b3b]
postgres: heikki postgres [local] ANALYZE(PortalRun+0x320)[0x564889f09015]
postgres: heikki postgres [local] ANALYZE(+0x66b2c6)[0x564889f022c6]
postgres: heikki postgres [local]
ANALYZE(PostgresMain+0x80c)[0x564889f06fd7]
postgres: heikki postgres [local] ANALYZE(+0x667876)[0x564889efe876]
postgres: heikki postgres [local]
ANALYZE(postmaster_child_launch+0xe6)[0x564889e1f4b3]
postgres: heikki postgres [local] ANALYZE(+0x58e68e)[0x564889e2568e]
postgres: heikki postgres [local] ANALYZE(+0x58b7f0)[0x564889e227f0]
postgres: heikki postgres [local]
ANALYZE(PostmasterMain+0x152b)[0x564889e2214d]
postgres: heikki postgres [local] ANALYZE(+0x4444b4)[0x564889cdb4b4]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7f7d83b6724a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f7d83b67305]
postgres: heikki postgres [local] ANALYZE(_start+0x21)[0x564889971a61]
2024-04-03 20:15:49.157 EEST [262101] LOG: server process (PID 262232)
was terminated by signal 6: Aborted

--
Heikki Linnakangas
Neon (https://neon.tech)

#9Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Jakub Wartak (#6)
Re: Use streaming read API in ANALYZE

Hi Jakub,

Thank you for looking into this and doing a performance analysis.

On Wed, 3 Apr 2024 at 11:42, Jakub Wartak <jakub.wartak@enterprisedb.com> wrote:

[... Jakub's benchmark setup, full timing results, call-path check, and
iostat output quoted in full above ...]

So in short it looks good to me.

My results are similar to yours. Also, I found a bug while working on
your benchmarking cases; I will share the cause and the fix soon.

--
Regards,
Nazir Bilal Yavuz
Microsoft

#10Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Heikki Linnakangas (#8)
1 attachment(s)
Re: Use streaming read API in ANALYZE

Hi,

Thank you for looking into this!

On Wed, 3 Apr 2024 at 20:17, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

[... v5 announcement, reproduction steps, and the assertion-failure
stack trace quoted in full above ...]

I ran into the same error while working on Jakub's benchmarking cases.

Cause: I was using the nblocks variable to check how many blocks would
be returned from the streaming API. But sometimes the number returned by
BlockSampler_Init() is not equal to the number of blocks that
BlockSampler_Next() will return, as the block sampling algorithm decides
on the fly, using random seeds, how many blocks to return.
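
To make the failure mode concrete, here is a small standalone toy (not
PostgreSQL code; all names are made up for illustration): an iterator
whose init step only promises an upper bound, so a consumer that trusts
that count can ask for more items than will ever be produced, which is
what tripped Assert(BufferIsValid(hscan->rs_cbuf)) above. The only safe
stop condition is the iterator's own end marker:

#include <stdio.h>
#include <stdlib.h>

#define END_MARKER (-1L)		/* stands in for InvalidBlockNumber */

static long pool = 100;			/* blocks the sampler may still visit */

/*
 * Like BlockSampler_Next() in spirit: skips blocks probabilistically, so
 * it can run out well before the upper bound that "init" promised.
 */
static long
toy_next(void)
{
	while (pool > 0)
	{
		long		block = --pool;

		if (rand() % 2 == 0)	/* keep roughly every other block */
			return block;
	}
	return END_MARKER;
}

int
main(void)
{
	long		estimate = 100; /* what the toy "init" promised */
	long		produced = 0;

	srand(42);

	/*
	 * Stop on the end marker, not after "estimate" iterations; a loop
	 * bounded by the estimate would keep consuming END_MARKERs, the toy
	 * analogue of the invalid buffer that fired the assertion.
	 */
	while (toy_next() != END_MARKER)
		produced++;

	printf("estimate=%ld, actually produced=%ld\n", estimate, produced);
	return 0;
}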

There are a couple of solutions I thought of:

1- Use BlockSampler_HasMore() instead of nblocks in the main loop in
acquire_sample_rows():

The streaming API uses this function to prefetch block numbers. Since it
is consulted while prefetching, BlockSampler_HasMore() reaches the end
first, so it starts to return false while there are still buffers left
to return from the streaming API. That would cause some buffers at the
end to not be processed (see the toy sketch after this list).

2- Expose something (a function, variable, etc.) from the streaming API
to indicate whether the read is finished and there are no buffers left
to return:

I think this works, but I am not sure if the streaming API allows
something like that.

3- Check every buffer returned from the streaming API and stop the main
loop in acquire_sample_rows() when it is invalid:

This solves the problem, but there will be two if checks for each buffer
returned:
- one in heapam_scan_analyze_next_block() to check whether the returned
buffer is invalid
- one to break the main loop in acquire_sample_rows() if
heapam_scan_analyze_next_block() returns false
One of the if checks can be avoided by moving
heapam_scan_analyze_next_block()'s code into the main loop in
acquire_sample_rows().
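
For option 1, a standalone toy (again not PostgreSQL code, names made
up) shows the problem: the stream's lookahead queue is filled from the
same sampler that the loop guard consults, so the guard turns false
while blocks are still queued and the tail is silently dropped:

#include <stdio.h>
#include <stdbool.h>

#define NSAMPLES  5				/* blocks the sampler will hand out */
#define LOOKAHEAD 2				/* queued-ahead distance, like prefetching */

static int	next_sample = 0;

static bool sampler_has_more(void) { return next_sample < NSAMPLES; }
static int	sampler_next(void)	   { return next_sample++; }

static int	queue[LOOKAHEAD];
static int	qhead = 0, qtail = 0, qcount = 0;

/* The stream pulls ahead of the consumer, advancing the sampler. */
static void
fill_queue(void)
{
	while (qcount < LOOKAHEAD && sampler_has_more())
	{
		queue[qtail] = sampler_next();
		qtail = (qtail + 1) % LOOKAHEAD;
		qcount++;
	}
}

static int
stream_next(void)
{
	int			block;

	fill_queue();
	if (qcount == 0)
		return -1;				/* end marker */
	block = queue[qhead];
	qhead = (qhead + 1) % LOOKAHEAD;
	qcount--;
	return block;
}

int
main(void)
{
	int			consumed = 0;

	/* Option 1's loop guard: it stops while a block is still queued. */
	while (sampler_has_more())
	{
		printf("consumed block %d\n", stream_next());
		consumed++;
	}
	/* Prints "consumed 4 of 5 sampled blocks": block 4 is lost. */
	printf("consumed %d of %d sampled blocks\n", consumed, NSAMPLES);
	return 0;
}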

I implemented the third solution, here is v6.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v6-0001-Use-streaming-read-API-in-ANALYZE.patchtext/x-patch; charset=US-ASCII; name=v6-0001-Use-streaming-read-API-in-ANALYZE.patchDownload
From 8d396a42186325f920d5a05e7092d8e1b66f3cdf Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Wed, 3 Apr 2024 15:14:15 +0300
Subject: [PATCH v6] Use streaming read API in ANALYZE

The ANALYZE command gets random tuples using the BlockSampler algorithm.
Use streaming reads to get these tuples by running the BlockSampler
algorithm in the streaming read API's prefetch callback.
---
 src/include/access/heapam.h              |  6 +-
 src/backend/access/heap/heapam_handler.c | 22 +++---
 src/backend/commands/analyze.c           | 85 ++++++++----------------
 3 files changed, 42 insertions(+), 71 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index a307fb5f245..633caee9d95 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,6 +25,7 @@
 #include "storage/bufpage.h"
 #include "storage/dsm.h"
 #include "storage/lockdefs.h"
+#include "storage/read_stream.h"
 #include "storage/shm_toc.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -388,9 +389,8 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
 								  struct GlobalVisState *vistest);
 
 /* in heap/heapam_handler.c*/
-extern void heapam_scan_analyze_next_block(TableScanDesc scan,
-										   BlockNumber blockno,
-										   BufferAccessStrategy bstrategy);
+extern bool heapam_scan_analyze_next_block(TableScanDesc scan,
+										   ReadStream *stream);
 extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
 										   TransactionId OldestXmin,
 										   double *liverows, double *deadrows,
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 0952d4a98eb..d83fbbe6af3 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1054,16 +1054,16 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 }
 
 /*
- * Prepare to analyze block `blockno` of `scan`.  The scan has been started
- * with SO_TYPE_ANALYZE option.
+ * Prepare to analyze block returned from streaming object.  If the block returned
+ * from streaming object is valid, true is returned; otherwise false is returned.
+ * The scan has been started with SO_TYPE_ANALYZE option.
  *
  * This routine holds a buffer pin and lock on the heap page.  They are held
  * until heapam_scan_analyze_next_tuple() returns false.  That is until all the
  * items of the heap page are analyzed.
  */
-void
-heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-							   BufferAccessStrategy bstrategy)
+bool
+heapam_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
 
@@ -1076,11 +1076,15 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
 	 * doing much work per tuple, the extra lock traffic is probably better
 	 * avoided.
 	 */
-	hscan->rs_cblock = blockno;
-	hscan->rs_cindex = FirstOffsetNumber;
-	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
-										blockno, RBM_NORMAL, bstrategy);
+	hscan->rs_cbuf = read_stream_next_buffer(stream, NULL);
+	if (hscan->rs_cbuf == InvalidBuffer)
+		return false;
+
 	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
+
+	hscan->rs_cblock = BufferGetBlockNumber(hscan->rs_cbuf);
+	hscan->rs_cindex = FirstOffsetNumber;
+	return true;
 }
 
 /*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2fb39f3ede1..764520d5aa2 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1102,6 +1102,20 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 	return stats;
 }
 
+/*
+ * Prefetch callback function to get next block number while using
+ * BlockSampling algorithm
+ */
+static BlockNumber
+block_sampling_streaming_read_next(ReadStream *stream,
+								   void *user_data,
+								   void *per_buffer_data)
+{
+	BlockSamplerData *bs = user_data;
+
+	return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
+}
+
 /*
  * acquire_sample_rows -- acquire a random sample of rows from the heap
  *
@@ -1154,10 +1168,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	TableScanDesc scan;
 	BlockNumber nblocks;
 	BlockNumber blksdone = 0;
-#ifdef USE_PREFETCH
-	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
-	BlockSamplerData prefetch_bs;
-#endif
+	ReadStream *stream;
 
 	Assert(targrows > 0);
 
@@ -1170,13 +1181,6 @@ acquire_sample_rows(Relation onerel, int elevel,
 	randseed = pg_prng_uint32(&pg_global_prng_state);
 	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
 
-#ifdef USE_PREFETCH
-	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
-	/* Create another BlockSampler, using the same seed, for prefetching */
-	if (prefetch_maximum)
-		(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
-#endif
-
 	/* Report sampling block numbers */
 	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
 								 nblocks);
@@ -1187,59 +1191,21 @@ acquire_sample_rows(Relation onerel, int elevel,
 	scan = heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
 	slot = table_slot_create(onerel, NULL);
 
-#ifdef USE_PREFETCH
-
-	/*
-	 * If we are doing prefetching, then go ahead and tell the kernel about
-	 * the first set of pages we are going to want.  This also moves our
-	 * iterator out ahead of the main one being used, where we will keep it so
-	 * that we're always pre-fetching out prefetch_maximum number of blocks
-	 * ahead.
-	 */
-	if (prefetch_maximum)
-	{
-		for (int i = 0; i < prefetch_maximum; i++)
-		{
-			BlockNumber prefetch_block;
-
-			if (!BlockSampler_HasMore(&prefetch_bs))
-				break;
-
-			prefetch_block = BlockSampler_Next(&prefetch_bs);
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
-		}
-	}
-#endif
+	stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
+										vac_strategy,
+										scan->rs_rd,
+										MAIN_FORKNUM,
+										block_sampling_streaming_read_next,
+										&bs,
+										0);
 
 	/* Outer loop over blocks to sample */
-	while (BlockSampler_HasMore(&bs))
+	while (true)
 	{
-		BlockNumber targblock = BlockSampler_Next(&bs);
-#ifdef USE_PREFETCH
-		BlockNumber prefetch_targblock = InvalidBlockNumber;
-
-		/*
-		 * Make sure that every time the main BlockSampler is moved forward
-		 * that our prefetch BlockSampler also gets moved forward, so that we
-		 * always stay out ahead.
-		 */
-		if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
-			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
-#endif
-
 		vacuum_delay_point();
 
-		heapam_scan_analyze_next_block(scan, targblock, vac_strategy);
-
-#ifdef USE_PREFETCH
-
-		/*
-		 * When pre-fetching, after we get a block, tell the kernel about the
-		 * next one we will want, if there's any left.
-		 */
-		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
-#endif
+		if (!heapam_scan_analyze_next_block(scan, stream))
+			break;
 
 		while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
 		{
@@ -1289,6 +1255,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
 									 ++blksdone);
 	}
+	read_stream_end(stream);
 
 	ExecDropSingleTupleTableSlot(slot);
 	heap_endscan(scan);
-- 
2.43.0

#11Melanie Plageman
melanieplageman@gmail.com
In reply to: Nazir Bilal Yavuz (#10)
2 attachment(s)
Re: Use streaming read API in ANALYZE

On Wed, Apr 03, 2024 at 10:25:01PM +0300, Nazir Bilal Yavuz wrote:

[... bug cause and the three candidate solutions quoted in full above ...]

I implemented the third solution, here is v6.

I've reviewed the patches inline below and attached a patch with some of
my ideas on top of yours.

From 8d396a42186325f920d5a05e7092d8e1b66f3cdf Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Wed, 3 Apr 2024 15:14:15 +0300
Subject: [PATCH v6] Use streaming read API in ANALYZE

The ANALYZE command gets random tuples using the BlockSampler algorithm.
Use streaming reads to get these tuples by running the BlockSampler
algorithm in the streaming read API's prefetch callback.
---
src/include/access/heapam.h | 6 +-
src/backend/access/heap/heapam_handler.c | 22 +++---
src/backend/commands/analyze.c | 85 ++++++++----------------
3 files changed, 42 insertions(+), 71 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index a307fb5f245..633caee9d95 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,6 +25,7 @@
#include "storage/bufpage.h"
#include "storage/dsm.h"
#include "storage/lockdefs.h"
+#include "storage/read_stream.h"
#include "storage/shm_toc.h"
#include "utils/relcache.h"
#include "utils/snapshot.h"
@@ -388,9 +389,8 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
struct GlobalVisState *vistest);
/* in heap/heapam_handler.c*/
-extern void heapam_scan_analyze_next_block(TableScanDesc scan,
-										   BlockNumber blockno,
-										   BufferAccessStrategy bstrategy);
+extern bool heapam_scan_analyze_next_block(TableScanDesc scan,
+										   ReadStream *stream);
extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
TransactionId OldestXmin,
double *liverows, double *deadrows,
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 0952d4a98eb..d83fbbe6af3 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1054,16 +1054,16 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
}
/*
- * Prepare to analyze block `blockno` of `scan`.  The scan has been started
- * with SO_TYPE_ANALYZE option.
+ * Prepare to analyze block returned from streaming object.  If the block returned
+ * from streaming object is valid, true is returned; otherwise false is returned.
+ * The scan has been started with SO_TYPE_ANALYZE option.
*
* This routine holds a buffer pin and lock on the heap page.  They are held
* until heapam_scan_analyze_next_tuple() returns false.  That is until all the
* items of the heap page are analyzed.
*/
-void
-heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-							   BufferAccessStrategy bstrategy)
+bool
+heapam_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
{
HeapScanDesc hscan = (HeapScanDesc) scan;

@@ -1076,11 +1076,15 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
* doing much work per tuple, the extra lock traffic is probably better
* avoided.

Personally, I think heapam_scan_analyze_next_block() should be inlined.
It only has a few lines, and I would find it clearer inline. At the very
least, there is no reason for it (or heapam_scan_analyze_next_tuple())
to take a TableScanDesc instead of a HeapScanDesc.

*/
-	hscan->rs_cblock = blockno;
-	hscan->rs_cindex = FirstOffsetNumber;
-	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
-										blockno, RBM_NORMAL, bstrategy);
+	hscan->rs_cbuf = read_stream_next_buffer(stream, NULL);
+	if (hscan->rs_cbuf == InvalidBuffer)
+		return false;
+
LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
+
+	hscan->rs_cblock = BufferGetBlockNumber(hscan->rs_cbuf);
+	hscan->rs_cindex = FirstOffsetNumber;
+	return true;
}
/*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2fb39f3ede1..764520d5aa2 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1102,6 +1102,20 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
return stats;
}
+/*
+ * Prefetch callback function to get next block number while using
+ * BlockSampling algorithm
+ */
+static BlockNumber
+block_sampling_streaming_read_next(ReadStream *stream,
+								   void *user_data,
+								   void *per_buffer_data)
+{
+	BlockSamplerData *bs = user_data;
+
+	return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;

I don't see the point of BlockSampler_HasMore() anymore. I removed it in
the attached and made BlockSampler_Next() return InvalidBlockNumber
under the same conditions. Is there a reason not to do this? There are
no other callers. If BlockSampler_Next() weren't part of an API, we
could just make it the streaming read callback directly, but that might
be weird as it is now.
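
To spell out the "weird" part: the callback the read stream expects
takes three arguments while BlockSampler_Next() takes one, so even once
BlockSampler_Next() returns InvalidBlockNumber itself, a thin adapter
like the one kept in the attached is still needed (or
BlockSampler_Next()'s signature would have to grow two unused
parameters):

static BlockNumber
block_sampling_streaming_read_next(ReadStream *stream,
								   void *user_data,
								   void *per_buffer_data)
{
	/*
	 * user_data is the BlockSamplerData passed to
	 * read_stream_begin_relation(); stream and per_buffer_data are
	 * unused by this callback.
	 */
	return BlockSampler_Next(user_data);
}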

That and my other ideas in attached. Let me know what you think.

- Melanie

Attachments:

v7-0001-Use-streaming-read-API-in-ANALYZE.patchtext/x-diff; charset=us-asciiDownload
From 960e52eca955086e49d80a1dba20db7db4f43b16 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Wed, 3 Apr 2024 15:14:15 +0300
Subject: [PATCH v7 1/2] Use streaming read API in ANALYZE

The ANALYZE command gets random tuples using the BlockSampler algorithm.
Use streaming reads to get these tuples by running the BlockSampler
algorithm in the streaming read API's prefetch callback.
---
 src/backend/access/heap/heapam_handler.c | 22 +++---
 src/backend/commands/analyze.c           | 85 ++++++++----------------
 src/include/access/heapam.h              |  6 +-
 3 files changed, 42 insertions(+), 71 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 0952d4a98eb..d83fbbe6af3 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1054,16 +1054,16 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 }
 
 /*
- * Prepare to analyze block `blockno` of `scan`.  The scan has been started
- * with SO_TYPE_ANALYZE option.
+ * Prepare to analyze block returned from streaming object.  If the block returned
+ * from streaming object is valid, true is returned; otherwise false is returned.
+ * The scan has been started with SO_TYPE_ANALYZE option.
  *
  * This routine holds a buffer pin and lock on the heap page.  They are held
  * until heapam_scan_analyze_next_tuple() returns false.  That is until all the
  * items of the heap page are analyzed.
  */
-void
-heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-							   BufferAccessStrategy bstrategy)
+bool
+heapam_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
 
@@ -1076,11 +1076,15 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
 	 * doing much work per tuple, the extra lock traffic is probably better
 	 * avoided.
 	 */
-	hscan->rs_cblock = blockno;
-	hscan->rs_cindex = FirstOffsetNumber;
-	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
-										blockno, RBM_NORMAL, bstrategy);
+	hscan->rs_cbuf = read_stream_next_buffer(stream, NULL);
+	if (hscan->rs_cbuf == InvalidBuffer)
+		return false;
+
 	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
+
+	hscan->rs_cblock = BufferGetBlockNumber(hscan->rs_cbuf);
+	hscan->rs_cindex = FirstOffsetNumber;
+	return true;
 }
 
 /*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2fb39f3ede1..764520d5aa2 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1102,6 +1102,20 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 	return stats;
 }
 
+/*
+ * Prefetch callback function to get next block number while using
+ * BlockSampling algorithm
+ */
+static BlockNumber
+block_sampling_streaming_read_next(ReadStream *stream,
+								   void *user_data,
+								   void *per_buffer_data)
+{
+	BlockSamplerData *bs = user_data;
+
+	return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
+}
+
 /*
  * acquire_sample_rows -- acquire a random sample of rows from the heap
  *
@@ -1154,10 +1168,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	TableScanDesc scan;
 	BlockNumber nblocks;
 	BlockNumber blksdone = 0;
-#ifdef USE_PREFETCH
-	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
-	BlockSamplerData prefetch_bs;
-#endif
+	ReadStream *stream;
 
 	Assert(targrows > 0);
 
@@ -1170,13 +1181,6 @@ acquire_sample_rows(Relation onerel, int elevel,
 	randseed = pg_prng_uint32(&pg_global_prng_state);
 	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
 
-#ifdef USE_PREFETCH
-	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
-	/* Create another BlockSampler, using the same seed, for prefetching */
-	if (prefetch_maximum)
-		(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
-#endif
-
 	/* Report sampling block numbers */
 	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
 								 nblocks);
@@ -1187,59 +1191,21 @@ acquire_sample_rows(Relation onerel, int elevel,
 	scan = heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
 	slot = table_slot_create(onerel, NULL);
 
-#ifdef USE_PREFETCH
-
-	/*
-	 * If we are doing prefetching, then go ahead and tell the kernel about
-	 * the first set of pages we are going to want.  This also moves our
-	 * iterator out ahead of the main one being used, where we will keep it so
-	 * that we're always pre-fetching out prefetch_maximum number of blocks
-	 * ahead.
-	 */
-	if (prefetch_maximum)
-	{
-		for (int i = 0; i < prefetch_maximum; i++)
-		{
-			BlockNumber prefetch_block;
-
-			if (!BlockSampler_HasMore(&prefetch_bs))
-				break;
-
-			prefetch_block = BlockSampler_Next(&prefetch_bs);
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
-		}
-	}
-#endif
+	stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
+										vac_strategy,
+										scan->rs_rd,
+										MAIN_FORKNUM,
+										block_sampling_streaming_read_next,
+										&bs,
+										0);
 
 	/* Outer loop over blocks to sample */
-	while (BlockSampler_HasMore(&bs))
+	while (true)
 	{
-		BlockNumber targblock = BlockSampler_Next(&bs);
-#ifdef USE_PREFETCH
-		BlockNumber prefetch_targblock = InvalidBlockNumber;
-
-		/*
-		 * Make sure that every time the main BlockSampler is moved forward
-		 * that our prefetch BlockSampler also gets moved forward, so that we
-		 * always stay out ahead.
-		 */
-		if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
-			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
-#endif
-
 		vacuum_delay_point();
 
-		heapam_scan_analyze_next_block(scan, targblock, vac_strategy);
-
-#ifdef USE_PREFETCH
-
-		/*
-		 * When pre-fetching, after we get a block, tell the kernel about the
-		 * next one we will want, if there's any left.
-		 */
-		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
-#endif
+		if (!heapam_scan_analyze_next_block(scan, stream))
+			break;
 
 		while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
 		{
@@ -1289,6 +1255,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
 									 ++blksdone);
 	}
+	read_stream_end(stream);
 
 	ExecDropSingleTupleTableSlot(slot);
 	heap_endscan(scan);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index a307fb5f245..633caee9d95 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,6 +25,7 @@
 #include "storage/bufpage.h"
 #include "storage/dsm.h"
 #include "storage/lockdefs.h"
+#include "storage/read_stream.h"
 #include "storage/shm_toc.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -388,9 +389,8 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
 								  struct GlobalVisState *vistest);
 
 /* in heap/heapam_handler.c*/
-extern void heapam_scan_analyze_next_block(TableScanDesc scan,
-										   BlockNumber blockno,
-										   BufferAccessStrategy bstrategy);
+extern bool heapam_scan_analyze_next_block(TableScanDesc scan,
+										   ReadStream *stream);
 extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
 										   TransactionId OldestXmin,
 										   double *liverows, double *deadrows,
-- 
2.40.1

v7-0002-some-ideas.patchtext/x-diff; charset=us-asciiDownload
From d75cfac7a5ca380ad86569a65b678390b9768d4e Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 3 Apr 2024 16:28:46 -0400
Subject: [PATCH v7 2/2] some ideas

---
 src/backend/access/heap/heapam_handler.c | 57 +++++-------------------
 src/backend/commands/analyze.c           | 21 +++++----
 src/backend/utils/misc/sampling.c        |  9 +---
 src/include/access/heapam.h              |  4 +-
 src/include/utils/sampling.h             |  1 -
 5 files changed, 27 insertions(+), 65 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d83fbbe6af3..aab392f50e5 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1053,40 +1053,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	pfree(isnull);
 }
 
-/*
- * Prepare to analyze block returned from streaming object.  If the block returned
- * from streaming object is valid, true is returned; otherwise false is returned.
- * The scan has been started with SO_TYPE_ANALYZE option.
- *
- * This routine holds a buffer pin and lock on the heap page.  They are held
- * until heapam_scan_analyze_next_tuple() returns false.  That is until all the
- * items of the heap page are analyzed.
- */
-bool
-heapam_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
-{
-	HeapScanDesc hscan = (HeapScanDesc) scan;
-
-	/*
-	 * We must maintain a pin on the target page's buffer to ensure that
-	 * concurrent activity - e.g. HOT pruning - doesn't delete tuples out from
-	 * under us.  Hence, pin the page until we are done looking at it.  We
-	 * also choose to hold sharelock on the buffer throughout --- we could
-	 * release and re-acquire sharelock for each tuple, but since we aren't
-	 * doing much work per tuple, the extra lock traffic is probably better
-	 * avoided.
-	 */
-	hscan->rs_cbuf = read_stream_next_buffer(stream, NULL);
-	if (hscan->rs_cbuf == InvalidBuffer)
-		return false;
-
-	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-
-	hscan->rs_cblock = BufferGetBlockNumber(hscan->rs_cbuf);
-	hscan->rs_cindex = FirstOffsetNumber;
-	return true;
-}
-
 /*
  * Iterate over tuples in the block selected with
  * heapam_scan_analyze_next_block().  If a tuple that's suitable for sampling
@@ -1098,11 +1064,10 @@ heapam_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
  * tuples.
  */
 bool
-heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
+heapam_scan_analyze_next_tuple(HeapScanDesc scan, TransactionId OldestXmin,
 							   double *liverows, double *deadrows,
 							   TupleTableSlot *slot)
 {
-	HeapScanDesc hscan = (HeapScanDesc) scan;
 	Page		targpage;
 	OffsetNumber maxoffset;
 	BufferHeapTupleTableSlot *hslot;
@@ -1110,17 +1075,17 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
 	Assert(TTS_IS_BUFFERTUPLE(slot));
 
 	hslot = (BufferHeapTupleTableSlot *) slot;
-	targpage = BufferGetPage(hscan->rs_cbuf);
+	targpage = BufferGetPage(scan->rs_cbuf);
 	maxoffset = PageGetMaxOffsetNumber(targpage);
 
 	/* Inner loop over all tuples on the selected page */
-	for (; hscan->rs_cindex <= maxoffset; hscan->rs_cindex++)
+	for (; scan->rs_cindex <= maxoffset; scan->rs_cindex++)
 	{
 		ItemId		itemid;
 		HeapTuple	targtuple = &hslot->base.tupdata;
 		bool		sample_it = false;
 
-		itemid = PageGetItemId(targpage, hscan->rs_cindex);
+		itemid = PageGetItemId(targpage, scan->rs_cindex);
 
 		/*
 		 * We ignore unused and redirect line pointers.  DEAD line pointers
@@ -1135,14 +1100,14 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
 			continue;
 		}
 
-		ItemPointerSet(&targtuple->t_self, hscan->rs_cblock, hscan->rs_cindex);
+		ItemPointerSet(&targtuple->t_self, scan->rs_cblock, scan->rs_cindex);
 
-		targtuple->t_tableOid = RelationGetRelid(scan->rs_rd);
+		targtuple->t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
 		targtuple->t_data = (HeapTupleHeader) PageGetItem(targpage, itemid);
 		targtuple->t_len = ItemIdGetLength(itemid);
 
 		switch (HeapTupleSatisfiesVacuum(targtuple, OldestXmin,
-										 hscan->rs_cbuf))
+										 scan->rs_cbuf))
 		{
 			case HEAPTUPLE_LIVE:
 				sample_it = true;
@@ -1222,8 +1187,8 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
 
 		if (sample_it)
 		{
-			ExecStoreBufferHeapTuple(targtuple, slot, hscan->rs_cbuf);
-			hscan->rs_cindex++;
+			ExecStoreBufferHeapTuple(targtuple, slot, scan->rs_cbuf);
+			scan->rs_cindex++;
 
 			/* note that we leave the buffer locked here! */
 			return true;
@@ -1231,8 +1196,8 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
 	}
 
 	/* Now release the lock and pin on the page */
-	UnlockReleaseBuffer(hscan->rs_cbuf);
-	hscan->rs_cbuf = InvalidBuffer;
+	UnlockReleaseBuffer(scan->rs_cbuf);
+	scan->rs_cbuf = InvalidBuffer;
 
 	/* also prevent old slot contents from having pin on page */
 	ExecClearTuple(slot);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 764520d5aa2..7002ece8be8 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1111,9 +1111,7 @@ block_sampling_streaming_read_next(ReadStream *stream,
 								   void *user_data,
 								   void *per_buffer_data)
 {
-	BlockSamplerData *bs = user_data;
-
-	return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
+	return BlockSampler_Next(user_data);
 }
 
 /*
@@ -1165,7 +1163,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	BlockSamplerData bs;
 	ReservoirStateData rstate;
 	TupleTableSlot *slot;
-	TableScanDesc scan;
+	HeapScanDesc scan;
 	BlockNumber nblocks;
 	BlockNumber blksdone = 0;
 	ReadStream *stream;
@@ -1188,12 +1186,12 @@ acquire_sample_rows(Relation onerel, int elevel,
 	/* Prepare for sampling rows */
 	reservoir_init_selection_state(&rstate, targrows);
 
-	scan = heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
+	scan = (HeapScanDesc) heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
 	slot = table_slot_create(onerel, NULL);
 
 	stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
 										vac_strategy,
-										scan->rs_rd,
+										onerel,
 										MAIN_FORKNUM,
 										block_sampling_streaming_read_next,
 										&bs,
@@ -1204,9 +1202,16 @@ acquire_sample_rows(Relation onerel, int elevel,
 	{
 		vacuum_delay_point();
 
-		if (!heapam_scan_analyze_next_block(scan, stream))
+		scan->rs_cbuf = read_stream_next_buffer(stream, NULL);
+
+		if (!BufferIsValid(scan->rs_cbuf))
 			break;
 
+		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
+
+		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+		scan->rs_cindex = FirstOffsetNumber;
+
 		while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
 		{
 			/*
@@ -1258,7 +1263,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	read_stream_end(stream);
 
 	ExecDropSingleTupleTableSlot(slot);
-	heap_endscan(scan);
+	heap_endscan(&scan->rs_base);
 
 	/*
 	 * If we didn't find as many tuples as we wanted then we're done. No sort
diff --git a/src/backend/utils/misc/sampling.c b/src/backend/utils/misc/sampling.c
index 933db06702c..69b73f5b115 100644
--- a/src/backend/utils/misc/sampling.c
+++ b/src/backend/utils/misc/sampling.c
@@ -54,12 +54,6 @@ BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize,
 	return Min(bs->n, bs->N);
 }
 
-bool
-BlockSampler_HasMore(BlockSampler bs)
-{
-	return (bs->t < bs->N) && (bs->m < bs->n);
-}
-
 BlockNumber
 BlockSampler_Next(BlockSampler bs)
 {
@@ -68,7 +62,8 @@ BlockSampler_Next(BlockSampler bs)
 	double		p;				/* probability to skip block */
 	double		V;				/* random */
 
-	Assert(BlockSampler_HasMore(bs));	/* hence K > 0 and k > 0 */
+	if (bs->t >= bs->N || bs->m >= bs->n)
+		return InvalidBlockNumber;
 
 	if ((BlockNumber) k >= K)
 	{
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 633caee9d95..f76dd18ad5d 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -389,9 +389,7 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
 								  struct GlobalVisState *vistest);
 
 /* in heap/heapam_handler.c*/
-extern bool heapam_scan_analyze_next_block(TableScanDesc scan,
-										   ReadStream *stream);
-extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
+extern bool heapam_scan_analyze_next_tuple(HeapScanDesc scan,
 										   TransactionId OldestXmin,
 										   double *liverows, double *deadrows,
 										   TupleTableSlot *slot);
diff --git a/src/include/utils/sampling.h b/src/include/utils/sampling.h
index be48ee52bac..fb5d6820a24 100644
--- a/src/include/utils/sampling.h
+++ b/src/include/utils/sampling.h
@@ -38,7 +38,6 @@ typedef BlockSamplerData *BlockSampler;
 
 extern BlockNumber BlockSampler_Init(BlockSampler bs, BlockNumber nblocks,
 									 int samplesize, uint32 randseed);
-extern bool BlockSampler_HasMore(BlockSampler bs);
 extern BlockNumber BlockSampler_Next(BlockSampler bs);
 
 /* Reservoir sampling methods */
-- 
2.40.1

#12Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Melanie Plageman (#11)
2 attachment(s)
Re: Use streaming read API in ANALYZE

Hi,

On Wed, 3 Apr 2024 at 23:44, Melanie Plageman <melanieplageman@gmail.com> wrote:

I've reviewed the patches inline below and attached a patch that has
some of my ideas on top of your patch.

Thank you!

From 8d396a42186325f920d5a05e7092d8e1b66f3cdf Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Wed, 3 Apr 2024 15:14:15 +0300
Subject: [PATCH v6] Use streaming read API in ANALYZE

ANALYZE command gets random tuples using BlockSampler algorithm. Use
streaming reads to get these tuples by using BlockSampler algorithm in
streaming read API prefetch logic.
---
src/include/access/heapam.h | 6 +-
src/backend/access/heap/heapam_handler.c | 22 +++---
src/backend/commands/analyze.c | 85 ++++++++----------------
3 files changed, 42 insertions(+), 71 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index a307fb5f245..633caee9d95 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,6 +25,7 @@
#include "storage/bufpage.h"
#include "storage/dsm.h"
#include "storage/lockdefs.h"
+#include "storage/read_stream.h"
#include "storage/shm_toc.h"
#include "utils/relcache.h"
#include "utils/snapshot.h"
@@ -388,9 +389,8 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
struct GlobalVisState *vistest);
/* in heap/heapam_handler.c*/
-extern void heapam_scan_analyze_next_block(TableScanDesc scan,
-                                                                                BlockNumber blockno,
-                                                                                BufferAccessStrategy bstrategy);
+extern bool heapam_scan_analyze_next_block(TableScanDesc scan,
+                                                                                ReadStream *stream);
extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
TransactionId OldestXmin,
double *liverows, double *deadrows,
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 0952d4a98eb..d83fbbe6af3 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1054,16 +1054,16 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
}
/*
- * Prepare to analyze block `blockno` of `scan`.  The scan has been started
- * with SO_TYPE_ANALYZE option.
+ * Prepare to analyze block returned from streaming object.  If the block returned
+ * from streaming object is valid, true is returned; otherwise false is returned.
+ * The scan has been started with SO_TYPE_ANALYZE option.
*
* This routine holds a buffer pin and lock on the heap page.  They are held
* until heapam_scan_analyze_next_tuple() returns false.  That is until all the
* items of the heap page are analyzed.
*/
-void
-heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-                                                        BufferAccessStrategy bstrategy)
+bool
+heapam_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
{
HeapScanDesc hscan = (HeapScanDesc) scan;

@@ -1076,11 +1076,15 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
* doing much work per tuple, the extra lock traffic is probably better
* avoided.

Personally I think heapam_scan_analyze_next_block() should be inlined.
It only has a few lines. I would find it clearer inline. At the least,
there is no reason for it (or heapam_scan_analyze_next_tuple()) to take
a TableScanDesc instead of a HeapScanDesc.

I agree.

*/
-     hscan->rs_cblock = blockno;
-     hscan->rs_cindex = FirstOffsetNumber;
-     hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
-                                                                             blockno, RBM_NORMAL, bstrategy);
+     hscan->rs_cbuf = read_stream_next_buffer(stream, NULL);
+     if (hscan->rs_cbuf == InvalidBuffer)
+             return false;
+
LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
+
+     hscan->rs_cblock = BufferGetBlockNumber(hscan->rs_cbuf);
+     hscan->rs_cindex = FirstOffsetNumber;
+     return true;
}
/*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2fb39f3ede1..764520d5aa2 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1102,6 +1102,20 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
return stats;
}
+/*
+ * Prefetch callback function to get next block number while using
+ * BlockSampling algorithm
+ */
+static BlockNumber
+block_sampling_streaming_read_next(ReadStream *stream,
+                                                                void *user_data,
+                                                                void *per_buffer_data)
+{
+     BlockSamplerData *bs = user_data;
+
+     return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;

I don't see the point of BlockSampler_HasMore() anymore. I removed it in
the attached and made BlockSampler_Next() return InvalidBlockNumber
under the same conditions. Is there a reason not to do this? There
aren't other callers. If BlockSampler_Next() weren't part of an API,
we could just make it the streaming read callback, but that might be
weird as it is now.

I agree. There is no reason to have BlockSampler_HasMore() after
streaming read API changes.
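
To spell it out, the guard ends up looking like this (a sketch of the
attached 0002, not a separate change; K and k are the locals that
BlockSampler_Next() already computes from the sampler state, i.e. the
remaining blocks and the blocks still to sample):

BlockNumber
BlockSampler_Next(BlockSampler bs)
{
    BlockNumber K = bs->N - bs->t;  /* remaining blocks */
    int         k = bs->n - bs->m;  /* blocks still to sample */

    /*
     * Formerly Assert(BlockSampler_HasMore(bs)).  An exhausted sampler
     * is no longer a caller error; it simply ends the stream.
     */
    if (K <= 0 || k <= 0)
        return InvalidBlockNumber;

    /* ... selection logic unchanged ... */
}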

That and my other ideas in attached. Let me know what you think.

I agree with your changes, but I am not sure if others agree with all
the changes you have proposed. So, I didn't merge 0001 and your ideas
yet; instead I wrote a commit message, added some comments, changed
'if (bs->t >= bs->N || bs->m >= bs->n)' to 'if (K <= 0 || k <= 0)', and
attached it as 0002.
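
Putting 0001 and 0002 together, the whole mechanism condenses to this
shape (assembled from the attached patches for readability; comments
elided, not new code):

/* Stream callback: pull the next sample block number. */
static BlockNumber
block_sampling_streaming_read_next(ReadStream *stream,
                                   void *user_data,
                                   void *per_buffer_data)
{
    /* returns InvalidBlockNumber once the sampler is exhausted */
    return BlockSampler_Next(user_data);
}

/* Consumer side, in acquire_sample_rows(). */
stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
                                    vac_strategy, onerel, MAIN_FORKNUM,
                                    block_sampling_streaming_read_next,
                                    &bs, 0);
while (true)
{
    vacuum_delay_point();

    scan->rs_cbuf = read_stream_next_buffer(stream, NULL);
    if (!BufferIsValid(scan->rs_cbuf))
        break;

    LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
    scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
    scan->rs_cindex = FirstOffsetNumber;

    /* ... sample the tuples on this page ... */
}
read_stream_end(stream);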

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v8-0001-Use-streaming-read-API-in-ANALYZE.patchtext/x-patch; charset=US-ASCII; name=v8-0001-Use-streaming-read-API-in-ANALYZE.patchDownload
From 9715cc628846697b039d3145fe83082d8756851c Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Wed, 3 Apr 2024 15:14:15 +0300
Subject: [PATCH v8 1/2] Use streaming read API in ANALYZE

ANALYZE command gets random tuples using BlockSampler algorithm. Use
streaming reads to get these tuples by using BlockSampler algorithm in
streaming read API prefetch logic.
---
 src/include/access/heapam.h              |  6 +-
 src/backend/access/heap/heapam_handler.c | 22 +++---
 src/backend/commands/analyze.c           | 85 ++++++++----------------
 3 files changed, 42 insertions(+), 71 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 2765efc4e5e..fbadb9eeea1 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,6 +25,7 @@
 #include "storage/bufpage.h"
 #include "storage/dsm.h"
 #include "storage/lockdefs.h"
+#include "storage/read_stream.h"
 #include "storage/shm_toc.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -388,9 +389,8 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
 								  struct GlobalVisState *vistest);
 
 /* in heap/heapam_handler.c*/
-extern void heapam_scan_analyze_next_block(TableScanDesc scan,
-										   BlockNumber blockno,
-										   BufferAccessStrategy bstrategy);
+extern bool heapam_scan_analyze_next_block(TableScanDesc scan,
+										   ReadStream *stream);
 extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
 										   TransactionId OldestXmin,
 										   double *liverows, double *deadrows,
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 3e7a6b5548b..a32c69cf034 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1054,16 +1054,16 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 }
 
 /*
- * Prepare to analyze block `blockno` of `scan`.  The scan has been started
- * with SO_TYPE_ANALYZE option.
+ * Prepare to analyze block returned from streaming object.  If the block returned
+ * from streaming object is valid, true is returned; otherwise false is returned.
+ * The scan has been started with SO_TYPE_ANALYZE option.
  *
  * This routine holds a buffer pin and lock on the heap page.  They are held
  * until heapam_scan_analyze_next_tuple() returns false.  That is until all the
  * items of the heap page are analyzed.
  */
-void
-heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-							   BufferAccessStrategy bstrategy)
+bool
+heapam_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
 
@@ -1076,11 +1076,15 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
 	 * doing much work per tuple, the extra lock traffic is probably better
 	 * avoided.
 	 */
-	hscan->rs_cblock = blockno;
-	hscan->rs_cindex = FirstOffsetNumber;
-	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
-										blockno, RBM_NORMAL, bstrategy);
+	hscan->rs_cbuf = read_stream_next_buffer(stream, NULL);
+	if (hscan->rs_cbuf == InvalidBuffer)
+		return false;
+
 	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
+
+	hscan->rs_cblock = BufferGetBlockNumber(hscan->rs_cbuf);
+	hscan->rs_cindex = FirstOffsetNumber;
+	return true;
 }
 
 /*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2fb39f3ede1..764520d5aa2 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1102,6 +1102,20 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 	return stats;
 }
 
+/*
+ * Prefetch callback function to get next block number while using
+ * BlockSampling algorithm
+ */
+static BlockNumber
+block_sampling_streaming_read_next(ReadStream *stream,
+								   void *user_data,
+								   void *per_buffer_data)
+{
+	BlockSamplerData *bs = user_data;
+
+	return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
+}
+
 /*
  * acquire_sample_rows -- acquire a random sample of rows from the heap
  *
@@ -1154,10 +1168,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	TableScanDesc scan;
 	BlockNumber nblocks;
 	BlockNumber blksdone = 0;
-#ifdef USE_PREFETCH
-	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
-	BlockSamplerData prefetch_bs;
-#endif
+	ReadStream *stream;
 
 	Assert(targrows > 0);
 
@@ -1170,13 +1181,6 @@ acquire_sample_rows(Relation onerel, int elevel,
 	randseed = pg_prng_uint32(&pg_global_prng_state);
 	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
 
-#ifdef USE_PREFETCH
-	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
-	/* Create another BlockSampler, using the same seed, for prefetching */
-	if (prefetch_maximum)
-		(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
-#endif
-
 	/* Report sampling block numbers */
 	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
 								 nblocks);
@@ -1187,59 +1191,21 @@ acquire_sample_rows(Relation onerel, int elevel,
 	scan = heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
 	slot = table_slot_create(onerel, NULL);
 
-#ifdef USE_PREFETCH
-
-	/*
-	 * If we are doing prefetching, then go ahead and tell the kernel about
-	 * the first set of pages we are going to want.  This also moves our
-	 * iterator out ahead of the main one being used, where we will keep it so
-	 * that we're always pre-fetching out prefetch_maximum number of blocks
-	 * ahead.
-	 */
-	if (prefetch_maximum)
-	{
-		for (int i = 0; i < prefetch_maximum; i++)
-		{
-			BlockNumber prefetch_block;
-
-			if (!BlockSampler_HasMore(&prefetch_bs))
-				break;
-
-			prefetch_block = BlockSampler_Next(&prefetch_bs);
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
-		}
-	}
-#endif
+	stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
+										vac_strategy,
+										scan->rs_rd,
+										MAIN_FORKNUM,
+										block_sampling_streaming_read_next,
+										&bs,
+										0);
 
 	/* Outer loop over blocks to sample */
-	while (BlockSampler_HasMore(&bs))
+	while (true)
 	{
-		BlockNumber targblock = BlockSampler_Next(&bs);
-#ifdef USE_PREFETCH
-		BlockNumber prefetch_targblock = InvalidBlockNumber;
-
-		/*
-		 * Make sure that every time the main BlockSampler is moved forward
-		 * that our prefetch BlockSampler also gets moved forward, so that we
-		 * always stay out ahead.
-		 */
-		if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
-			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
-#endif
-
 		vacuum_delay_point();
 
-		heapam_scan_analyze_next_block(scan, targblock, vac_strategy);
-
-#ifdef USE_PREFETCH
-
-		/*
-		 * When pre-fetching, after we get a block, tell the kernel about the
-		 * next one we will want, if there's any left.
-		 */
-		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
-#endif
+		if (!heapam_scan_analyze_next_block(scan, stream))
+			break;
 
 		while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
 		{
@@ -1289,6 +1255,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
 									 ++blksdone);
 	}
+	read_stream_end(stream);
 
 	ExecDropSingleTupleTableSlot(slot);
 	heap_endscan(scan);
-- 
2.43.0

v8-0002-Refactorings-on-top-of-using-streaming-read-API-i.patchtext/x-patch; charset=US-ASCII; name=v8-0002-Refactorings-on-top-of-using-streaming-read-API-i.patchDownload
From c5c006322f4fb49b4da152ac8222babcb1e24651 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Thu, 4 Apr 2024 13:30:46 +0300
Subject: [PATCH v8 2/2] Refactorings on top of using streaming read API in
 ANALYZE

This patch includes a couple of refactorings on top of using streaming
read API in ANALYZE. So, this patch should be committed after
'Use streaming read API in ANALYZE' patch is committed.

Refactorings:

- heapam_scan_analyze_next_block() is inlined, its content is moved
  under a main loop in acquire_sample_rows(). Comments are updated
  regarding this change.

- heapam_scan_analyze_next_tuple() takes HeapScanDesc instead of
  TableScanDesc now.

- BlockSampler_HasMore() is removed, BlockSampler_Next() returns an
  InvalidBlockNumber if there are no remaining blocks or no blocks to
  sample.

Author: Melanie Plageman <melanieplageman@gmail.com>
---
 src/include/access/heapam.h              |  4 +-
 src/include/utils/sampling.h             |  1 -
 src/backend/access/heap/heapam_handler.c | 66 ++++++------------------
 src/backend/commands/analyze.c           | 33 +++++++++---
 src/backend/utils/misc/sampling.c        | 10 ++--
 5 files changed, 44 insertions(+), 70 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index fbadb9eeea1..8b5fcf4a4a9 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -389,9 +389,7 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
 								  struct GlobalVisState *vistest);
 
 /* in heap/heapam_handler.c*/
-extern bool heapam_scan_analyze_next_block(TableScanDesc scan,
-										   ReadStream *stream);
-extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
+extern bool heapam_scan_analyze_next_tuple(HeapScanDesc scan,
 										   TransactionId OldestXmin,
 										   double *liverows, double *deadrows,
 										   TupleTableSlot *slot);
diff --git a/src/include/utils/sampling.h b/src/include/utils/sampling.h
index be48ee52bac..fb5d6820a24 100644
--- a/src/include/utils/sampling.h
+++ b/src/include/utils/sampling.h
@@ -38,7 +38,6 @@ typedef BlockSamplerData *BlockSampler;
 
 extern BlockNumber BlockSampler_Init(BlockSampler bs, BlockNumber nblocks,
 									 int samplesize, uint32 randseed);
-extern bool BlockSampler_HasMore(BlockSampler bs);
 extern BlockNumber BlockSampler_Next(BlockSampler bs);
 
 /* Reservoir sampling methods */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a32c69cf034..34d99e30c21 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1054,55 +1054,19 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 }
 
 /*
- * Prepare to analyze block returned from streaming object.  If the block returned
- * from streaming object is valid, true is returned; otherwise false is returned.
- * The scan has been started with SO_TYPE_ANALYZE option.
- *
- * This routine holds a buffer pin and lock on the heap page.  They are held
- * until heapam_scan_analyze_next_tuple() returns false.  That is until all the
- * items of the heap page are analyzed.
- */
-bool
-heapam_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
-{
-	HeapScanDesc hscan = (HeapScanDesc) scan;
-
-	/*
-	 * We must maintain a pin on the target page's buffer to ensure that
-	 * concurrent activity - e.g. HOT pruning - doesn't delete tuples out from
-	 * under us.  Hence, pin the page until we are done looking at it.  We
-	 * also choose to hold sharelock on the buffer throughout --- we could
-	 * release and re-acquire sharelock for each tuple, but since we aren't
-	 * doing much work per tuple, the extra lock traffic is probably better
-	 * avoided.
-	 */
-	hscan->rs_cbuf = read_stream_next_buffer(stream, NULL);
-	if (hscan->rs_cbuf == InvalidBuffer)
-		return false;
-
-	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-
-	hscan->rs_cblock = BufferGetBlockNumber(hscan->rs_cbuf);
-	hscan->rs_cindex = FirstOffsetNumber;
-	return true;
-}
-
-/*
- * Iterate over tuples in the block selected with
- * heapam_scan_analyze_next_block().  If a tuple that's suitable for sampling
- * is found, true is returned and a tuple is stored in `slot`.  When no more
- * tuples for sampling, false is returned and the pin and lock acquired by
- * heapam_scan_analyze_next_block() are released.
+ * Iterate over tuples in the block selected to analyze.  If a tuple that's
+ * suitable for sampling is found, true is returned and a tuple is stored in
+ * `slot`.  When no more tuples for sampling, false is returned and the pin
+ * and lock on the current buffer in scan are released.
  *
  * *liverows and *deadrows are incremented according to the encountered
  * tuples.
  */
 bool
-heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
+heapam_scan_analyze_next_tuple(HeapScanDesc scan, TransactionId OldestXmin,
 							   double *liverows, double *deadrows,
 							   TupleTableSlot *slot)
 {
-	HeapScanDesc hscan = (HeapScanDesc) scan;
 	Page		targpage;
 	OffsetNumber maxoffset;
 	BufferHeapTupleTableSlot *hslot;
@@ -1110,17 +1074,17 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
 	Assert(TTS_IS_BUFFERTUPLE(slot));
 
 	hslot = (BufferHeapTupleTableSlot *) slot;
-	targpage = BufferGetPage(hscan->rs_cbuf);
+	targpage = BufferGetPage(scan->rs_cbuf);
 	maxoffset = PageGetMaxOffsetNumber(targpage);
 
 	/* Inner loop over all tuples on the selected page */
-	for (; hscan->rs_cindex <= maxoffset; hscan->rs_cindex++)
+	for (; scan->rs_cindex <= maxoffset; scan->rs_cindex++)
 	{
 		ItemId		itemid;
 		HeapTuple	targtuple = &hslot->base.tupdata;
 		bool		sample_it = false;
 
-		itemid = PageGetItemId(targpage, hscan->rs_cindex);
+		itemid = PageGetItemId(targpage, scan->rs_cindex);
 
 		/*
 		 * We ignore unused and redirect line pointers.  DEAD line pointers
@@ -1135,14 +1099,14 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
 			continue;
 		}
 
-		ItemPointerSet(&targtuple->t_self, hscan->rs_cblock, hscan->rs_cindex);
+		ItemPointerSet(&targtuple->t_self, scan->rs_cblock, scan->rs_cindex);
 
-		targtuple->t_tableOid = RelationGetRelid(scan->rs_rd);
+		targtuple->t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
 		targtuple->t_data = (HeapTupleHeader) PageGetItem(targpage, itemid);
 		targtuple->t_len = ItemIdGetLength(itemid);
 
 		switch (HeapTupleSatisfiesVacuum(targtuple, OldestXmin,
-										 hscan->rs_cbuf))
+										 scan->rs_cbuf))
 		{
 			case HEAPTUPLE_LIVE:
 				sample_it = true;
@@ -1222,8 +1186,8 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
 
 		if (sample_it)
 		{
-			ExecStoreBufferHeapTuple(targtuple, slot, hscan->rs_cbuf);
-			hscan->rs_cindex++;
+			ExecStoreBufferHeapTuple(targtuple, slot, scan->rs_cbuf);
+			scan->rs_cindex++;
 
 			/* note that we leave the buffer locked here! */
 			return true;
@@ -1231,8 +1195,8 @@ heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
 	}
 
 	/* Now release the lock and pin on the page */
-	UnlockReleaseBuffer(hscan->rs_cbuf);
-	hscan->rs_cbuf = InvalidBuffer;
+	UnlockReleaseBuffer(scan->rs_cbuf);
+	scan->rs_cbuf = InvalidBuffer;
 
 	/* also prevent old slot contents from having pin on page */
 	ExecClearTuple(slot);
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 764520d5aa2..ae4f13be572 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1111,9 +1111,7 @@ block_sampling_streaming_read_next(ReadStream *stream,
 								   void *user_data,
 								   void *per_buffer_data)
 {
-	BlockSamplerData *bs = user_data;
-
-	return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
+	return BlockSampler_Next(user_data);
 }
 
 /*
@@ -1165,7 +1163,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	BlockSamplerData bs;
 	ReservoirStateData rstate;
 	TupleTableSlot *slot;
-	TableScanDesc scan;
+	HeapScanDesc scan;
 	BlockNumber nblocks;
 	BlockNumber blksdone = 0;
 	ReadStream *stream;
@@ -1188,12 +1186,12 @@ acquire_sample_rows(Relation onerel, int elevel,
 	/* Prepare for sampling rows */
 	reservoir_init_selection_state(&rstate, targrows);
 
-	scan = heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
+	scan = (HeapScanDesc) heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
 	slot = table_slot_create(onerel, NULL);
 
 	stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
 										vac_strategy,
-										scan->rs_rd,
+										onerel,
 										MAIN_FORKNUM,
 										block_sampling_streaming_read_next,
 										&bs,
@@ -1204,9 +1202,28 @@ acquire_sample_rows(Relation onerel, int elevel,
 	{
 		vacuum_delay_point();
 
-		if (!heapam_scan_analyze_next_block(scan, stream))
+		/*
+		 * We must maintain a pin on the target page's buffer to ensure that
+		 * concurrent activity - e.g. HOT pruning - doesn't delete tuples out
+		 * from under us.  Hence, pin the page until we are done looking at
+		 * it.
+		 */
+		scan->rs_cbuf = read_stream_next_buffer(stream, NULL);
+
+		if (!BufferIsValid(scan->rs_cbuf))
 			break;
 
+		/*
+		 * We choose to hold sharelock on the buffer throughout --- we could
+		 * release and re-acquire sharelock for each tuple, but since we
+		 * aren't doing much work per tuple, the extra lock traffic is
+		 * probably better avoided.
+		 */
+		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
+
+		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+		scan->rs_cindex = FirstOffsetNumber;
+
 		while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
 		{
 			/*
@@ -1258,7 +1275,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	read_stream_end(stream);
 
 	ExecDropSingleTupleTableSlot(slot);
-	heap_endscan(scan);
+	heap_endscan(&scan->rs_base);
 
 	/*
 	 * If we didn't find as many tuples as we wanted then we're done. No sort
diff --git a/src/backend/utils/misc/sampling.c b/src/backend/utils/misc/sampling.c
index 933db06702c..cae89e331fb 100644
--- a/src/backend/utils/misc/sampling.c
+++ b/src/backend/utils/misc/sampling.c
@@ -54,12 +54,6 @@ BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize,
 	return Min(bs->n, bs->N);
 }
 
-bool
-BlockSampler_HasMore(BlockSampler bs)
-{
-	return (bs->t < bs->N) && (bs->m < bs->n);
-}
-
 BlockNumber
 BlockSampler_Next(BlockSampler bs)
 {
@@ -68,7 +62,9 @@ BlockSampler_Next(BlockSampler bs)
 	double		p;				/* probability to skip block */
 	double		V;				/* random */
 
-	Assert(BlockSampler_HasMore(bs));	/* hence K > 0 and k > 0 */
+	/* no remaining blocks or no blocks to sample */
+	if (K <= 0 || k <= 0)
+		return InvalidBlockNumber;
 
 	if ((BlockNumber) k >= K)
 	{
-- 
2.43.0

#13Melanie Plageman
melanieplageman@gmail.com
In reply to: Nazir Bilal Yavuz (#12)
3 attachment(s)
Re: Use streaming read API in ANALYZE

On Thu, Apr 04, 2024 at 02:03:30PM +0300, Nazir Bilal Yavuz wrote:

On Wed, 3 Apr 2024 at 23:44, Melanie Plageman <melanieplageman@gmail.com> wrote:

I don't see the point of BlockSampler_HasMore() anymore. I removed it in
the attached and made BlockSampler_Next() return InvalidBlockNumber
under the same conditions. Is there a reason not to do this? There
aren't other callers. If BlockSampler_Next() weren't part of an API,
we could just make it the streaming read callback, but that might be
weird as it is now.

I agree. There is no reason to have BlockSampler_HasMore() after
streaming read API changes.

That and my other ideas in attached. Let me know what you think.

I agree with your changes, but I am not sure if others agree with all
the changes you have proposed. So, I didn't merge 0001 and your ideas
yet; instead I wrote a commit message, added some comments, changed
'if (bs->t >= bs->N || bs->m >= bs->n)' to 'if (K <= 0 || k <= 0)', and
attached it as 0002.

I couldn't quite let go of those changes to acquire_sample_rows(), so
the attached v9 0001 implements them as a preliminary patch before your
analyze streaming read user. I inlined heapam_scan_analyze_next_block()
entirely and made heapam_scan_analyze_next_tuple() a static function in
commands/analyze.c (and tweaked the name).

I made a few tweaks to your patch since it is on top of those changes
instead of preceding them. Then 0003 removes BlockSampler_HasMore(),
since it wouldn't make sense to remove it before the streaming read
user is added.
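
Condensed, the combination of 0001 and 0002 leaves acquire_sample_rows()
driving everything itself; the outer loop now reads roughly as follows
(a sketch of the attached v9 patches with the comments elided):

scan = (HeapScanDesc) heap_beginscan(onerel, NULL, 0, NULL, NULL,
                                     SO_TYPE_ANALYZE);
...
while (BufferIsValid(scan->rs_cbuf = read_stream_next_buffer(stream, NULL)))
{
    vacuum_delay_point();

    LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
    scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
    scan->rs_cindex = FirstOffsetNumber;

    while (heap_scan_analyze_next_tuple(scan, OldestXmin,
                                        &liverows, &deadrows, slot))
    {
        /* feed the row into the reservoir sampler */
    }

    pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
                                 ++blksdone);
}
read_stream_end(stream);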

- Melanie

Attachments:

v9-0001-Make-heapam_scan_analyze_next_-tuple-block-static.patchtext/x-diff; charset=us-asciiDownload
From b301a73e2ede8eb1d95e6a8c8d89f09959d9c22a Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sun, 7 Apr 2024 14:55:22 -0400
Subject: [PATCH v9 1/3] Make heapam_scan_analyze_next_[tuple|block] static.

27bc1772fc81 removed the table AM callbacks scan_analyze_next_block and
scan_analyze_next_tuple -- leaving their heap AM implementations only
called by acquire_sample_rows().

Move heapam_scan_analyze_next_tuple() to analyze.c as a static helper for
acquire_sample_rows() and inline heapam_scan_analyze_next_block().

Author: Melanie Plageman
Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
---
 src/backend/access/heap/heapam_handler.c | 183 -----------------------
 src/backend/commands/analyze.c           | 181 +++++++++++++++++++++-
 src/include/access/heapam.h              |   9 --
 3 files changed, 174 insertions(+), 199 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 58de2c82a70..5edac76ceb6 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1054,189 +1054,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	pfree(isnull);
 }
 
-/*
- * Prepare to analyze block `blockno` of `scan`.  The scan has been started
- * with SO_TYPE_ANALYZE option.
- *
- * This routine holds a buffer pin and lock on the heap page.  They are held
- * until heapam_scan_analyze_next_tuple() returns false.  That is until all the
- * items of the heap page are analyzed.
- */
-void
-heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-							   BufferAccessStrategy bstrategy)
-{
-	HeapScanDesc hscan = (HeapScanDesc) scan;
-
-	/*
-	 * We must maintain a pin on the target page's buffer to ensure that
-	 * concurrent activity - e.g. HOT pruning - doesn't delete tuples out from
-	 * under us.  Hence, pin the page until we are done looking at it.  We
-	 * also choose to hold sharelock on the buffer throughout --- we could
-	 * release and re-acquire sharelock for each tuple, but since we aren't
-	 * doing much work per tuple, the extra lock traffic is probably better
-	 * avoided.
-	 */
-	hscan->rs_cblock = blockno;
-	hscan->rs_cindex = FirstOffsetNumber;
-	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
-										blockno, RBM_NORMAL, bstrategy);
-	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-}
-
-/*
- * Iterate over tuples in the block selected with
- * heapam_scan_analyze_next_block().  If a tuple that's suitable for sampling
- * is found, true is returned and a tuple is stored in `slot`.  When no more
- * tuples for sampling, false is returned and the pin and lock acquired by
- * heapam_scan_analyze_next_block() are released.
- *
- * *liverows and *deadrows are incremented according to the encountered
- * tuples.
- */
-bool
-heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
-							   double *liverows, double *deadrows,
-							   TupleTableSlot *slot)
-{
-	HeapScanDesc hscan = (HeapScanDesc) scan;
-	Page		targpage;
-	OffsetNumber maxoffset;
-	BufferHeapTupleTableSlot *hslot;
-
-	Assert(TTS_IS_BUFFERTUPLE(slot));
-
-	hslot = (BufferHeapTupleTableSlot *) slot;
-	targpage = BufferGetPage(hscan->rs_cbuf);
-	maxoffset = PageGetMaxOffsetNumber(targpage);
-
-	/* Inner loop over all tuples on the selected page */
-	for (; hscan->rs_cindex <= maxoffset; hscan->rs_cindex++)
-	{
-		ItemId		itemid;
-		HeapTuple	targtuple = &hslot->base.tupdata;
-		bool		sample_it = false;
-
-		itemid = PageGetItemId(targpage, hscan->rs_cindex);
-
-		/*
-		 * We ignore unused and redirect line pointers.  DEAD line pointers
-		 * should be counted as dead, because we need vacuum to run to get rid
-		 * of them.  Note that this rule agrees with the way that
-		 * heap_page_prune_and_freeze() counts things.
-		 */
-		if (!ItemIdIsNormal(itemid))
-		{
-			if (ItemIdIsDead(itemid))
-				*deadrows += 1;
-			continue;
-		}
-
-		ItemPointerSet(&targtuple->t_self, hscan->rs_cblock, hscan->rs_cindex);
-
-		targtuple->t_tableOid = RelationGetRelid(scan->rs_rd);
-		targtuple->t_data = (HeapTupleHeader) PageGetItem(targpage, itemid);
-		targtuple->t_len = ItemIdGetLength(itemid);
-
-		switch (HeapTupleSatisfiesVacuum(targtuple, OldestXmin,
-										 hscan->rs_cbuf))
-		{
-			case HEAPTUPLE_LIVE:
-				sample_it = true;
-				*liverows += 1;
-				break;
-
-			case HEAPTUPLE_DEAD:
-			case HEAPTUPLE_RECENTLY_DEAD:
-				/* Count dead and recently-dead rows */
-				*deadrows += 1;
-				break;
-
-			case HEAPTUPLE_INSERT_IN_PROGRESS:
-
-				/*
-				 * Insert-in-progress rows are not counted.  We assume that
-				 * when the inserting transaction commits or aborts, it will
-				 * send a stats message to increment the proper count.  This
-				 * works right only if that transaction ends after we finish
-				 * analyzing the table; if things happen in the other order,
-				 * its stats update will be overwritten by ours.  However, the
-				 * error will be large only if the other transaction runs long
-				 * enough to insert many tuples, so assuming it will finish
-				 * after us is the safer option.
-				 *
-				 * A special case is that the inserting transaction might be
-				 * our own.  In this case we should count and sample the row,
-				 * to accommodate users who load a table and analyze it in one
-				 * transaction.  (pgstat_report_analyze has to adjust the
-				 * numbers we report to the cumulative stats system to make
-				 * this come out right.)
-				 */
-				if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
-				{
-					sample_it = true;
-					*liverows += 1;
-				}
-				break;
-
-			case HEAPTUPLE_DELETE_IN_PROGRESS:
-
-				/*
-				 * We count and sample delete-in-progress rows the same as
-				 * live ones, so that the stats counters come out right if the
-				 * deleting transaction commits after us, per the same
-				 * reasoning given above.
-				 *
-				 * If the delete was done by our own transaction, however, we
-				 * must count the row as dead to make pgstat_report_analyze's
-				 * stats adjustments come out right.  (Note: this works out
-				 * properly when the row was both inserted and deleted in our
-				 * xact.)
-				 *
-				 * The net effect of these choices is that we act as though an
-				 * IN_PROGRESS transaction hasn't happened yet, except if it
-				 * is our own transaction, which we assume has happened.
-				 *
-				 * This approach ensures that we behave sanely if we see both
-				 * the pre-image and post-image rows for a row being updated
-				 * by a concurrent transaction: we will sample the pre-image
-				 * but not the post-image.  We also get sane results if the
-				 * concurrent transaction never commits.
-				 */
-				if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(targtuple->t_data)))
-					*deadrows += 1;
-				else
-				{
-					sample_it = true;
-					*liverows += 1;
-				}
-				break;
-
-			default:
-				elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
-				break;
-		}
-
-		if (sample_it)
-		{
-			ExecStoreBufferHeapTuple(targtuple, slot, hscan->rs_cbuf);
-			hscan->rs_cindex++;
-
-			/* note that we leave the buffer locked here! */
-			return true;
-		}
-	}
-
-	/* Now release the lock and pin on the page */
-	UnlockReleaseBuffer(hscan->rs_cbuf);
-	hscan->rs_cbuf = InvalidBuffer;
-
-	/* also prevent old slot contents from having pin on page */
-	ExecClearTuple(slot);
-
-	return false;
-}
-
 static double
 heapam_index_build_range_scan(Relation heapRelation,
 							  Relation indexRelation,
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2fb39f3ede1..335ffb24302 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1102,6 +1102,157 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 	return stats;
 }
 
+/*
+ * Iterate over tuples in the block pinned and locked in scan->rs_cbuf by
+ * acquire_sample_rows(). If a tuple that's suitable for sampling is found,
+ * store the tuple in `slot` and return true. When there are no more tuples for
+ * sampling, return false.
+ *
+ * *liverows and *deadrows are incremented according to the encountered
+ * tuples.
+ */
+static bool
+heap_scan_analyze_next_tuple(HeapScanDesc scan, TransactionId OldestXmin,
+							 double *liverows, double *deadrows,
+							 TupleTableSlot *slot)
+{
+	Page		targpage;
+	OffsetNumber maxoffset;
+	BufferHeapTupleTableSlot *hslot;
+
+	Assert(TTS_IS_BUFFERTUPLE(slot));
+
+	hslot = (BufferHeapTupleTableSlot *) slot;
+	targpage = BufferGetPage(scan->rs_cbuf);
+	maxoffset = PageGetMaxOffsetNumber(targpage);
+
+	/* Inner loop over all tuples on the selected page */
+	for (; scan->rs_cindex <= maxoffset; scan->rs_cindex++)
+	{
+		ItemId		itemid;
+		HeapTuple	targtuple = &hslot->base.tupdata;
+		bool		sample_it = false;
+
+		itemid = PageGetItemId(targpage, scan->rs_cindex);
+
+		/*
+		 * We ignore unused and redirect line pointers.  DEAD line pointers
+		 * should be counted as dead, because we need vacuum to run to get rid
+		 * of them.  Note that this rule agrees with the way that
+		 * heap_page_prune_and_freeze() counts things.
+		 */
+		if (!ItemIdIsNormal(itemid))
+		{
+			if (ItemIdIsDead(itemid))
+				*deadrows += 1;
+			continue;
+		}
+
+		ItemPointerSet(&targtuple->t_self, scan->rs_cblock, scan->rs_cindex);
+
+		targtuple->t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
+		targtuple->t_data = (HeapTupleHeader) PageGetItem(targpage, itemid);
+		targtuple->t_len = ItemIdGetLength(itemid);
+
+		switch (HeapTupleSatisfiesVacuum(targtuple, OldestXmin,
+										 scan->rs_cbuf))
+		{
+			case HEAPTUPLE_LIVE:
+				sample_it = true;
+				*liverows += 1;
+				break;
+
+			case HEAPTUPLE_DEAD:
+			case HEAPTUPLE_RECENTLY_DEAD:
+				/* Count dead and recently-dead rows */
+				*deadrows += 1;
+				break;
+
+			case HEAPTUPLE_INSERT_IN_PROGRESS:
+
+				/*
+				 * Insert-in-progress rows are not counted.  We assume that
+				 * when the inserting transaction commits or aborts, it will
+				 * send a stats message to increment the proper count.  This
+				 * works right only if that transaction ends after we finish
+				 * analyzing the table; if things happen in the other order,
+				 * its stats update will be overwritten by ours.  However, the
+				 * error will be large only if the other transaction runs long
+				 * enough to insert many tuples, so assuming it will finish
+				 * after us is the safer option.
+				 *
+				 * A special case is that the inserting transaction might be
+				 * our own.  In this case we should count and sample the row,
+				 * to accommodate users who load a table and analyze it in one
+				 * transaction.  (pgstat_report_analyze has to adjust the
+				 * numbers we report to the cumulative stats system to make
+				 * this come out right.)
+				 */
+				if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
+				{
+					sample_it = true;
+					*liverows += 1;
+				}
+				break;
+
+			case HEAPTUPLE_DELETE_IN_PROGRESS:
+
+				/*
+				 * We count and sample delete-in-progress rows the same as
+				 * live ones, so that the stats counters come out right if the
+				 * deleting transaction commits after us, per the same
+				 * reasoning given above.
+				 *
+				 * If the delete was done by our own transaction, however, we
+				 * must count the row as dead to make pgstat_report_analyze's
+				 * stats adjustments come out right.  (Note: this works out
+				 * properly when the row was both inserted and deleted in our
+				 * xact.)
+				 *
+				 * The net effect of these choices is that we act as though an
+				 * IN_PROGRESS transaction hasn't happened yet, except if it
+				 * is our own transaction, which we assume has happened.
+				 *
+				 * This approach ensures that we behave sanely if we see both
+				 * the pre-image and post-image rows for a row being updated
+				 * by a concurrent transaction: we will sample the pre-image
+				 * but not the post-image.  We also get sane results if the
+				 * concurrent transaction never commits.
+				 */
+				if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(targtuple->t_data)))
+					*deadrows += 1;
+				else
+				{
+					sample_it = true;
+					*liverows += 1;
+				}
+				break;
+
+			default:
+				elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+				break;
+		}
+
+		if (sample_it)
+		{
+			ExecStoreBufferHeapTuple(targtuple, slot, scan->rs_cbuf);
+			scan->rs_cindex++;
+
+			/* note that we leave the buffer locked here! */
+			return true;
+		}
+	}
+
+	/* Now release the lock and pin on the page */
+	UnlockReleaseBuffer(scan->rs_cbuf);
+	scan->rs_cbuf = InvalidBuffer;
+
+	/* Prevent old slot contents from having pin on page */
+	ExecClearTuple(slot);
+
+	return false;
+}
+
 /*
  * acquire_sample_rows -- acquire a random sample of rows from the heap
  *
@@ -1151,7 +1302,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	BlockSamplerData bs;
 	ReservoirStateData rstate;
 	TupleTableSlot *slot;
-	TableScanDesc scan;
+	HeapScanDesc scan;
 	BlockNumber nblocks;
 	BlockNumber blksdone = 0;
 #ifdef USE_PREFETCH
@@ -1184,7 +1335,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	/* Prepare for sampling rows */
 	reservoir_init_selection_state(&rstate, targrows);
 
-	scan = heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
+	scan = (HeapScanDesc) heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
 	slot = table_slot_create(onerel, NULL);
 
 #ifdef USE_PREFETCH
@@ -1206,11 +1357,13 @@ acquire_sample_rows(Relation onerel, int elevel,
 				break;
 
 			prefetch_block = BlockSampler_Next(&prefetch_bs);
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
+			PrefetchBuffer(scan->rs_base.rs_rd, MAIN_FORKNUM, prefetch_block);
 		}
 	}
 #endif
 
+	scan->rs_cbuf = InvalidBuffer;
+
 	/* Outer loop over blocks to sample */
 	while (BlockSampler_HasMore(&bs))
 	{
@@ -1229,7 +1382,21 @@ acquire_sample_rows(Relation onerel, int elevel,
 
 		vacuum_delay_point();
 
-		heapam_scan_analyze_next_block(scan, targblock, vac_strategy);
+		scan->rs_cblock = targblock;
+		scan->rs_cindex = FirstOffsetNumber;
+
+		/*
+		 * We must maintain a pin on the target page's buffer to ensure that
+		 * concurrent activity - e.g. HOT pruning - doesn't delete tuples out
+		 * from under us.  Hence, pin the page until we are done looking at
+		 * it.  We also choose to hold sharelock on the buffer throughout ---
+		 * we could release and re-acquire sharelock for each tuple, but since
+		 * we aren't doing much work per tuple, the extra lock traffic is
+		 * probably better avoided.
+		 */
+		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
+										   targblock, RBM_NORMAL, vac_strategy);
+		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 
 #ifdef USE_PREFETCH
 
@@ -1238,10 +1405,10 @@ acquire_sample_rows(Relation onerel, int elevel,
 		 * next one we will want, if there's any left.
 		 */
 		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
+			PrefetchBuffer(scan->rs_base.rs_rd, MAIN_FORKNUM, prefetch_targblock);
 #endif
 
-		while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
+		while (heap_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
 		{
 			/*
 			 * The first targrows sample rows are simply copied into the
@@ -1291,7 +1458,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
-	heap_endscan(scan);
+	heap_endscan((TableScanDesc) scan);
 
 	/*
 	 * If we didn't find as many tuples as we wanted then we're done. No sort
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 48936826bcc..be630620d0d 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -412,15 +412,6 @@ extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
 extern bool HeapTupleIsSurelyDead(HeapTuple htup,
 								  struct GlobalVisState *vistest);
 
-/* in heap/heapam_handler.c*/
-extern void heapam_scan_analyze_next_block(TableScanDesc scan,
-										   BlockNumber blockno,
-										   BufferAccessStrategy bstrategy);
-extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
-										   TransactionId OldestXmin,
-										   double *liverows, double *deadrows,
-										   TupleTableSlot *slot);
-
 /*
  * To avoid leaking too much knowledge about reorderbuffer implementation
  * details this is implemented in reorderbuffer.c not heapam_visibility.c
-- 
2.40.1

v9-0002-Use-streaming-read-API-in-ANALYZE.patchtext/x-diff; charset=us-asciiDownload
From 8ad10261b84775fde97f71cd273041af5160fc28 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sun, 7 Apr 2024 15:28:32 -0400
Subject: [PATCH v9 2/3] Use streaming read API in ANALYZE

The ANALYZE command prefetches and reads sample blocks chosen by a
BlockSampler algorithm. Instead of calling Prefetch|ReadBuffer() for
each block, ANALYZE now uses the streaming API introduced in b5a9b18cd0.

Author: Nazir Bilal Yavuz
Reviewed-by: Melanie Plageman
Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
---
 src/backend/commands/analyze.c | 89 ++++++++++------------------------
 1 file changed, 26 insertions(+), 63 deletions(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 335ffb24302..3cfad92390d 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1253,6 +1253,20 @@ heap_scan_analyze_next_tuple(HeapScanDesc scan, TransactionId OldestXmin,
 	return false;
 }
 
+/*
+ * Streaming read callback returning the next block number while using
+ * BlockSampling algorithm.
+ */
+static BlockNumber
+block_sampling_streaming_read_next(ReadStream *stream,
+								   void *user_data,
+								   void *per_buffer_data)
+{
+	BlockSamplerData *bs = user_data;
+
+	return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
+}
+
 /*
  * acquire_sample_rows -- acquire a random sample of rows from the heap
  *
@@ -1305,10 +1319,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	HeapScanDesc scan;
 	BlockNumber nblocks;
 	BlockNumber blksdone = 0;
-#ifdef USE_PREFETCH
-	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
-	BlockSamplerData prefetch_bs;
-#endif
+	ReadStream *stream;
 
 	Assert(targrows > 0);
 
@@ -1321,13 +1332,6 @@ acquire_sample_rows(Relation onerel, int elevel,
 	randseed = pg_prng_uint32(&pg_global_prng_state);
 	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
 
-#ifdef USE_PREFETCH
-	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
-	/* Create another BlockSampler, using the same seed, for prefetching */
-	if (prefetch_maximum)
-		(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
-#endif
-
 	/* Report sampling block numbers */
 	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
 								 nblocks);
@@ -1337,54 +1341,21 @@ acquire_sample_rows(Relation onerel, int elevel,
 
 	scan = (HeapScanDesc) heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
 	slot = table_slot_create(onerel, NULL);
-
-#ifdef USE_PREFETCH
-
-	/*
-	 * If we are doing prefetching, then go ahead and tell the kernel about
-	 * the first set of pages we are going to want.  This also moves our
-	 * iterator out ahead of the main one being used, where we will keep it so
-	 * that we're always pre-fetching out prefetch_maximum number of blocks
-	 * ahead.
-	 */
-	if (prefetch_maximum)
-	{
-		for (int i = 0; i < prefetch_maximum; i++)
-		{
-			BlockNumber prefetch_block;
-
-			if (!BlockSampler_HasMore(&prefetch_bs))
-				break;
-
-			prefetch_block = BlockSampler_Next(&prefetch_bs);
-			PrefetchBuffer(scan->rs_base.rs_rd, MAIN_FORKNUM, prefetch_block);
-		}
-	}
-#endif
+	stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
+										vac_strategy,
+										scan->rs_base.rs_rd,
+										MAIN_FORKNUM,
+										block_sampling_streaming_read_next,
+										&bs,
+										0);
 
 	scan->rs_cbuf = InvalidBuffer;
 
 	/* Outer loop over blocks to sample */
-	while (BlockSampler_HasMore(&bs))
+	while (BufferIsValid(scan->rs_cbuf = read_stream_next_buffer(stream, NULL)))
 	{
-		BlockNumber targblock = BlockSampler_Next(&bs);
-#ifdef USE_PREFETCH
-		BlockNumber prefetch_targblock = InvalidBlockNumber;
-
-		/*
-		 * Make sure that every time the main BlockSampler is moved forward
-		 * that our prefetch BlockSampler also gets moved forward, so that we
-		 * always stay out ahead.
-		 */
-		if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
-			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
-#endif
-
 		vacuum_delay_point();
 
-		scan->rs_cblock = targblock;
-		scan->rs_cindex = FirstOffsetNumber;
-
 		/*
 		 * We must maintain a pin on the target page's buffer to ensure that
 		 * concurrent activity - e.g. HOT pruning - doesn't delete tuples out
@@ -1394,19 +1365,10 @@ acquire_sample_rows(Relation onerel, int elevel,
 		 * we aren't doing much work per tuple, the extra lock traffic is
 		 * probably better avoided.
 		 */
-		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
-										   targblock, RBM_NORMAL, vac_strategy);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 
-#ifdef USE_PREFETCH
-
-		/*
-		 * When pre-fetching, after we get a block, tell the kernel about the
-		 * next one we will want, if there's any left.
-		 */
-		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
-			PrefetchBuffer(scan->rs_base.rs_rd, MAIN_FORKNUM, prefetch_targblock);
-#endif
+		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+		scan->rs_cindex = FirstOffsetNumber;
 
 		while (heap_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
 		{
@@ -1456,6 +1418,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
 									 ++blksdone);
 	}
+	read_stream_end(stream);
 
 	ExecDropSingleTupleTableSlot(slot);
 	heap_endscan((TableScanDesc) scan);
-- 
2.40.1

v9-0003-Obsolete-BlockSampler_HasMore.patchtext/x-diff; charset=us-asciiDownload
From 72ea11920c9e32b14326577571046d55a254b0fa Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sun, 7 Apr 2024 15:38:41 -0400
Subject: [PATCH v9 3/3] Obsolete BlockSampler_HasMore()

A previous commit stopped using BlockSampler_HasMore() for flow control
in acquire_sample_rows(). There seems little use now for
BlockSampler_HasMore(). It should be sufficient to return
InvalidBlockNumber from BlockSampler_Next() when BlockSampler_HasMore()
would have returned false. Remove BlockSampler_HasMore().

Author: Melanie Plageman
Reviewed-by: Nazir Bilal Yavuz
Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
---
 src/backend/commands/analyze.c    |  4 +---
 src/backend/utils/misc/sampling.c | 11 ++++-------
 src/include/utils/sampling.h      |  1 -
 3 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 3cfad92390d..95d113bb907 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1262,9 +1262,7 @@ block_sampling_streaming_read_next(ReadStream *stream,
 								   void *user_data,
 								   void *per_buffer_data)
 {
-	BlockSamplerData *bs = user_data;
-
-	return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
+	return BlockSampler_Next(user_data);
 }
 
 /*
diff --git a/src/backend/utils/misc/sampling.c b/src/backend/utils/misc/sampling.c
index 933db06702c..245d826affe 100644
--- a/src/backend/utils/misc/sampling.c
+++ b/src/backend/utils/misc/sampling.c
@@ -54,12 +54,6 @@ BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize,
 	return Min(bs->n, bs->N);
 }
 
-bool
-BlockSampler_HasMore(BlockSampler bs)
-{
-	return (bs->t < bs->N) && (bs->m < bs->n);
-}
-
 BlockNumber
 BlockSampler_Next(BlockSampler bs)
 {
@@ -68,7 +62,10 @@ BlockSampler_Next(BlockSampler bs)
 	double		p;				/* probability to skip block */
 	double		V;				/* random */
 
-	Assert(BlockSampler_HasMore(bs));	/* hence K > 0 and k > 0 */
+
+	/* Return if no remaining blocks or no blocks to sample */
+	if (K <= 0 || k <= 0)
+		return InvalidBlockNumber;
 
 	if ((BlockNumber) k >= K)
 	{
diff --git a/src/include/utils/sampling.h b/src/include/utils/sampling.h
index be48ee52bac..fb5d6820a24 100644
--- a/src/include/utils/sampling.h
+++ b/src/include/utils/sampling.h
@@ -38,7 +38,6 @@ typedef BlockSamplerData *BlockSampler;
 
 extern BlockNumber BlockSampler_Init(BlockSampler bs, BlockNumber nblocks,
 									 int samplesize, uint32 randseed);
-extern bool BlockSampler_HasMore(BlockSampler bs);
 extern BlockNumber BlockSampler_Next(BlockSampler bs);
 
 /* Reservoir sampling methods */
-- 
2.40.1

#14Melanie Plageman
melanieplageman@gmail.com
In reply to: Melanie Plageman (#13)
3 attachment(s)
Re: Use streaming read API in ANALYZE

On Sun, Apr 7, 2024 at 3:57 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Thu, Apr 04, 2024 at 02:03:30PM +0300, Nazir Bilal Yavuz wrote:

On Wed, 3 Apr 2024 at 23:44, Melanie Plageman <melanieplageman@gmail.com> wrote:

I don't see the point of BlockSampler_HasMore() anymore. I removed it in
the attached and made BlockSampler_Next() return InvalidBlockNumber
under the same conditions. Is there a reason not to do this? There
aren't other callers. If BlockSampler_Next() weren't part of an API,
we could just make it the streaming read callback, but that might be
weird as it is now.

I agree. There is no reason to have BlockSampler_HasMore() after
streaming read API changes.

That and my other ideas in attached. Let me know what you think.

I agree with your changes, but I am not sure if others agree with all
the changes you have proposed. So, I didn't merge 0001 and your ideas
yet; instead I wrote a commit message, added some comments, changed
'if (bs->t >= bs->N || bs->m >= bs->n)' to 'if (K <= 0 || k <= 0)', and
attached it as 0002.

I couldn't quite let go of those changes to acquire_sample_rows(), so
the attached v9 0001 implements them as a preliminary patch before your
analyze streaming read user. I inlined heapam_scan_analyze_next_block()
entirely and made heapam_scan_analyze_next_tuple() a static function in
commands/analyze.c (and tweaked the name).

I made a few tweaks to your patch since it is on top of those changes
instead of preceding them. Then 0003 is removing BlockSampler_HasMore()
since it doesn't make sense to remove it before the streaming read user
was added.

I realized there were a few outdated comments. Fixed in attached v10.

- Melanie

Attachments:

v10-0001-Make-heapam_scan_analyze_next_-tuple-block-stati.patchtext/x-patch; charset=US-ASCII; name=v10-0001-Make-heapam_scan_analyze_next_-tuple-block-stati.patchDownload
From 1dc2343661f3edb3b1bc4307afb0e956397eb76c Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sun, 7 Apr 2024 14:55:22 -0400
Subject: [PATCH v10 1/3] Make heapam_scan_analyze_next_[tuple|block] static.

27bc1772fc81 removed the table AM callbacks scan_analyze_next_block and
scan_analyze_next_tuple -- leaving their heap AM implementations only
called by acquire_sample_rows().

Move heapam_scan_analyze_next_tuple() to analyze.c as a static helper for
acquire_sample_rows() and inline heapam_scan_analyze_next_block().

Author: Melanie Plageman
Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
---
 src/backend/access/heap/heapam_handler.c | 193 +----------------------
 src/backend/commands/analyze.c           | 181 ++++++++++++++++++++-
 src/include/access/heapam.h              |   9 --
 3 files changed, 179 insertions(+), 204 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 58de2c82a70..364dd0b165b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1054,189 +1054,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	pfree(isnull);
 }
 
-/*
- * Prepare to analyze block `blockno` of `scan`.  The scan has been started
- * with SO_TYPE_ANALYZE option.
- *
- * This routine holds a buffer pin and lock on the heap page.  They are held
- * until heapam_scan_analyze_next_tuple() returns false.  That is until all the
- * items of the heap page are analyzed.
- */
-void
-heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-							   BufferAccessStrategy bstrategy)
-{
-	HeapScanDesc hscan = (HeapScanDesc) scan;
-
-	/*
-	 * We must maintain a pin on the target page's buffer to ensure that
-	 * concurrent activity - e.g. HOT pruning - doesn't delete tuples out from
-	 * under us.  Hence, pin the page until we are done looking at it.  We
-	 * also choose to hold sharelock on the buffer throughout --- we could
-	 * release and re-acquire sharelock for each tuple, but since we aren't
-	 * doing much work per tuple, the extra lock traffic is probably better
-	 * avoided.
-	 */
-	hscan->rs_cblock = blockno;
-	hscan->rs_cindex = FirstOffsetNumber;
-	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
-										blockno, RBM_NORMAL, bstrategy);
-	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-}
-
-/*
- * Iterate over tuples in the block selected with
- * heapam_scan_analyze_next_block().  If a tuple that's suitable for sampling
- * is found, true is returned and a tuple is stored in `slot`.  When no more
- * tuples for sampling, false is returned and the pin and lock acquired by
- * heapam_scan_analyze_next_block() are released.
- *
- * *liverows and *deadrows are incremented according to the encountered
- * tuples.
- */
-bool
-heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
-							   double *liverows, double *deadrows,
-							   TupleTableSlot *slot)
-{
-	HeapScanDesc hscan = (HeapScanDesc) scan;
-	Page		targpage;
-	OffsetNumber maxoffset;
-	BufferHeapTupleTableSlot *hslot;
-
-	Assert(TTS_IS_BUFFERTUPLE(slot));
-
-	hslot = (BufferHeapTupleTableSlot *) slot;
-	targpage = BufferGetPage(hscan->rs_cbuf);
-	maxoffset = PageGetMaxOffsetNumber(targpage);
-
-	/* Inner loop over all tuples on the selected page */
-	for (; hscan->rs_cindex <= maxoffset; hscan->rs_cindex++)
-	{
-		ItemId		itemid;
-		HeapTuple	targtuple = &hslot->base.tupdata;
-		bool		sample_it = false;
-
-		itemid = PageGetItemId(targpage, hscan->rs_cindex);
-
-		/*
-		 * We ignore unused and redirect line pointers.  DEAD line pointers
-		 * should be counted as dead, because we need vacuum to run to get rid
-		 * of them.  Note that this rule agrees with the way that
-		 * heap_page_prune_and_freeze() counts things.
-		 */
-		if (!ItemIdIsNormal(itemid))
-		{
-			if (ItemIdIsDead(itemid))
-				*deadrows += 1;
-			continue;
-		}
-
-		ItemPointerSet(&targtuple->t_self, hscan->rs_cblock, hscan->rs_cindex);
-
-		targtuple->t_tableOid = RelationGetRelid(scan->rs_rd);
-		targtuple->t_data = (HeapTupleHeader) PageGetItem(targpage, itemid);
-		targtuple->t_len = ItemIdGetLength(itemid);
-
-		switch (HeapTupleSatisfiesVacuum(targtuple, OldestXmin,
-										 hscan->rs_cbuf))
-		{
-			case HEAPTUPLE_LIVE:
-				sample_it = true;
-				*liverows += 1;
-				break;
-
-			case HEAPTUPLE_DEAD:
-			case HEAPTUPLE_RECENTLY_DEAD:
-				/* Count dead and recently-dead rows */
-				*deadrows += 1;
-				break;
-
-			case HEAPTUPLE_INSERT_IN_PROGRESS:
-
-				/*
-				 * Insert-in-progress rows are not counted.  We assume that
-				 * when the inserting transaction commits or aborts, it will
-				 * send a stats message to increment the proper count.  This
-				 * works right only if that transaction ends after we finish
-				 * analyzing the table; if things happen in the other order,
-				 * its stats update will be overwritten by ours.  However, the
-				 * error will be large only if the other transaction runs long
-				 * enough to insert many tuples, so assuming it will finish
-				 * after us is the safer option.
-				 *
-				 * A special case is that the inserting transaction might be
-				 * our own.  In this case we should count and sample the row,
-				 * to accommodate users who load a table and analyze it in one
-				 * transaction.  (pgstat_report_analyze has to adjust the
-				 * numbers we report to the cumulative stats system to make
-				 * this come out right.)
-				 */
-				if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
-				{
-					sample_it = true;
-					*liverows += 1;
-				}
-				break;
-
-			case HEAPTUPLE_DELETE_IN_PROGRESS:
-
-				/*
-				 * We count and sample delete-in-progress rows the same as
-				 * live ones, so that the stats counters come out right if the
-				 * deleting transaction commits after us, per the same
-				 * reasoning given above.
-				 *
-				 * If the delete was done by our own transaction, however, we
-				 * must count the row as dead to make pgstat_report_analyze's
-				 * stats adjustments come out right.  (Note: this works out
-				 * properly when the row was both inserted and deleted in our
-				 * xact.)
-				 *
-				 * The net effect of these choices is that we act as though an
-				 * IN_PROGRESS transaction hasn't happened yet, except if it
-				 * is our own transaction, which we assume has happened.
-				 *
-				 * This approach ensures that we behave sanely if we see both
-				 * the pre-image and post-image rows for a row being updated
-				 * by a concurrent transaction: we will sample the pre-image
-				 * but not the post-image.  We also get sane results if the
-				 * concurrent transaction never commits.
-				 */
-				if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(targtuple->t_data)))
-					*deadrows += 1;
-				else
-				{
-					sample_it = true;
-					*liverows += 1;
-				}
-				break;
-
-			default:
-				elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
-				break;
-		}
-
-		if (sample_it)
-		{
-			ExecStoreBufferHeapTuple(targtuple, slot, hscan->rs_cbuf);
-			hscan->rs_cindex++;
-
-			/* note that we leave the buffer locked here! */
-			return true;
-		}
-	}
-
-	/* Now release the lock and pin on the page */
-	UnlockReleaseBuffer(hscan->rs_cbuf);
-	hscan->rs_cbuf = InvalidBuffer;
-
-	/* also prevent old slot contents from having pin on page */
-	ExecClearTuple(slot);
-
-	return false;
-}
-
 static double
 heapam_index_build_range_scan(Relation heapRelation,
 							  Relation indexRelation,
@@ -1476,7 +1293,7 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 			/*
 			 * The criteria for counting a tuple as live in this block need to
-			 * match what analyze.c's heapam_scan_analyze_next_tuple() does,
+			 * match what analyze.c's heap_scan_analyze_next_tuple() does,
 			 * otherwise CREATE INDEX and ANALYZE may produce wildly different
 			 * reltuples values, e.g. when there are many recently-dead
 			 * tuples.
@@ -1510,7 +1327,7 @@ heapam_index_build_range_scan(Relation heapRelation,
 					 * index as unusable for them.
 					 *
 					 * We don't count recently-dead tuples in reltuples, even
-					 * if we index them; see heapam_scan_analyze_next_tuple().
+					 * if we index them; see heap_scan_analyze_next_tuple().
 					 */
 					if (HeapTupleIsHotUpdated(heapTuple))
 					{
@@ -1575,7 +1392,7 @@ heapam_index_build_range_scan(Relation heapRelation,
 					{
 						/*
 						 * For consistency with
-						 * heapam_scan_analyze_next_tuple(), count
+						 * heap_scan_analyze_next_tuple(), count
 						 * HEAPTUPLE_INSERT_IN_PROGRESS tuples as live only
 						 * when inserted by our own transaction.
 						 */
@@ -1648,8 +1465,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 						/*
 						 * Count HEAPTUPLE_DELETE_IN_PROGRESS tuples as live,
 						 * if they were not deleted by the current
-						 * transaction.  That's what
-						 * heapam_scan_analyze_next_tuple() does, and we want
+						 * transaction. That's what
+						 * heap_scan_analyze_next_tuple() does, and we want
 						 * the behavior to be consistent.
 						 */
 						reltuples += 1;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2fb39f3ede1..335ffb24302 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1102,6 +1102,157 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 	return stats;
 }
 
+/*
+ * Iterate over tuples in the block pinned and locked in scan->rs_cbuf by
+ * acquire_sample_rows(). If a tuple that's suitable for sampling is found,
+ * store the tuple in `slot` and return true. When there are no more tuples for
+ * sampling, return false.
+ *
+ * *liverows and *deadrows are incremented according to the encountered
+ * tuples.
+ */
+static bool
+heap_scan_analyze_next_tuple(HeapScanDesc scan, TransactionId OldestXmin,
+							 double *liverows, double *deadrows,
+							 TupleTableSlot *slot)
+{
+	Page		targpage;
+	OffsetNumber maxoffset;
+	BufferHeapTupleTableSlot *hslot;
+
+	Assert(TTS_IS_BUFFERTUPLE(slot));
+
+	hslot = (BufferHeapTupleTableSlot *) slot;
+	targpage = BufferGetPage(scan->rs_cbuf);
+	maxoffset = PageGetMaxOffsetNumber(targpage);
+
+	/* Inner loop over all tuples on the selected page */
+	for (; scan->rs_cindex <= maxoffset; scan->rs_cindex++)
+	{
+		ItemId		itemid;
+		HeapTuple	targtuple = &hslot->base.tupdata;
+		bool		sample_it = false;
+
+		itemid = PageGetItemId(targpage, scan->rs_cindex);
+
+		/*
+		 * We ignore unused and redirect line pointers.  DEAD line pointers
+		 * should be counted as dead, because we need vacuum to run to get rid
+		 * of them.  Note that this rule agrees with the way that
+		 * heap_page_prune_and_freeze() counts things.
+		 */
+		if (!ItemIdIsNormal(itemid))
+		{
+			if (ItemIdIsDead(itemid))
+				*deadrows += 1;
+			continue;
+		}
+
+		ItemPointerSet(&targtuple->t_self, scan->rs_cblock, scan->rs_cindex);
+
+		targtuple->t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
+		targtuple->t_data = (HeapTupleHeader) PageGetItem(targpage, itemid);
+		targtuple->t_len = ItemIdGetLength(itemid);
+
+		switch (HeapTupleSatisfiesVacuum(targtuple, OldestXmin,
+										 scan->rs_cbuf))
+		{
+			case HEAPTUPLE_LIVE:
+				sample_it = true;
+				*liverows += 1;
+				break;
+
+			case HEAPTUPLE_DEAD:
+			case HEAPTUPLE_RECENTLY_DEAD:
+				/* Count dead and recently-dead rows */
+				*deadrows += 1;
+				break;
+
+			case HEAPTUPLE_INSERT_IN_PROGRESS:
+
+				/*
+				 * Insert-in-progress rows are not counted.  We assume that
+				 * when the inserting transaction commits or aborts, it will
+				 * send a stats message to increment the proper count.  This
+				 * works right only if that transaction ends after we finish
+				 * analyzing the table; if things happen in the other order,
+				 * its stats update will be overwritten by ours.  However, the
+				 * error will be large only if the other transaction runs long
+				 * enough to insert many tuples, so assuming it will finish
+				 * after us is the safer option.
+				 *
+				 * A special case is that the inserting transaction might be
+				 * our own.  In this case we should count and sample the row,
+				 * to accommodate users who load a table and analyze it in one
+				 * transaction.  (pgstat_report_analyze has to adjust the
+				 * numbers we report to the cumulative stats system to make
+				 * this come out right.)
+				 */
+				if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(targtuple->t_data)))
+				{
+					sample_it = true;
+					*liverows += 1;
+				}
+				break;
+
+			case HEAPTUPLE_DELETE_IN_PROGRESS:
+
+				/*
+				 * We count and sample delete-in-progress rows the same as
+				 * live ones, so that the stats counters come out right if the
+				 * deleting transaction commits after us, per the same
+				 * reasoning given above.
+				 *
+				 * If the delete was done by our own transaction, however, we
+				 * must count the row as dead to make pgstat_report_analyze's
+				 * stats adjustments come out right.  (Note: this works out
+				 * properly when the row was both inserted and deleted in our
+				 * xact.)
+				 *
+				 * The net effect of these choices is that we act as though an
+				 * IN_PROGRESS transaction hasn't happened yet, except if it
+				 * is our own transaction, which we assume has happened.
+				 *
+				 * This approach ensures that we behave sanely if we see both
+				 * the pre-image and post-image rows for a row being updated
+				 * by a concurrent transaction: we will sample the pre-image
+				 * but not the post-image.  We also get sane results if the
+				 * concurrent transaction never commits.
+				 */
+				if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(targtuple->t_data)))
+					*deadrows += 1;
+				else
+				{
+					sample_it = true;
+					*liverows += 1;
+				}
+				break;
+
+			default:
+				elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result");
+				break;
+		}
+
+		if (sample_it)
+		{
+			ExecStoreBufferHeapTuple(targtuple, slot, scan->rs_cbuf);
+			scan->rs_cindex++;
+
+			/* note that we leave the buffer locked here! */
+			return true;
+		}
+	}
+
+	/* Now release the lock and pin on the page */
+	UnlockReleaseBuffer(scan->rs_cbuf);
+	scan->rs_cbuf = InvalidBuffer;
+
+	/* Prevent old slot contents from having pin on page */
+	ExecClearTuple(slot);
+
+	return false;
+}
+
 /*
  * acquire_sample_rows -- acquire a random sample of rows from the heap
  *
@@ -1151,7 +1302,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	BlockSamplerData bs;
 	ReservoirStateData rstate;
 	TupleTableSlot *slot;
-	TableScanDesc scan;
+	HeapScanDesc scan;
 	BlockNumber nblocks;
 	BlockNumber blksdone = 0;
 #ifdef USE_PREFETCH
@@ -1184,7 +1335,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	/* Prepare for sampling rows */
 	reservoir_init_selection_state(&rstate, targrows);
 
-	scan = heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
+	scan = (HeapScanDesc) heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
 	slot = table_slot_create(onerel, NULL);
 
 #ifdef USE_PREFETCH
@@ -1206,11 +1357,13 @@ acquire_sample_rows(Relation onerel, int elevel,
 				break;
 
 			prefetch_block = BlockSampler_Next(&prefetch_bs);
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
+			PrefetchBuffer(scan->rs_base.rs_rd, MAIN_FORKNUM, prefetch_block);
 		}
 	}
 #endif
 
+	scan->rs_cbuf = InvalidBuffer;
+
 	/* Outer loop over blocks to sample */
 	while (BlockSampler_HasMore(&bs))
 	{
@@ -1229,7 +1382,21 @@ acquire_sample_rows(Relation onerel, int elevel,
 
 		vacuum_delay_point();
 
-		heapam_scan_analyze_next_block(scan, targblock, vac_strategy);
+		scan->rs_cblock = targblock;
+		scan->rs_cindex = FirstOffsetNumber;
+
+		/*
+		 * We must maintain a pin on the target page's buffer to ensure that
+		 * concurrent activity - e.g. HOT pruning - doesn't delete tuples out
+		 * from under us.  Hence, pin the page until we are done looking at
+		 * it.  We also choose to hold sharelock on the buffer throughout ---
+		 * we could release and re-acquire sharelock for each tuple, but since
+		 * we aren't doing much work per tuple, the extra lock traffic is
+		 * probably better avoided.
+		 */
+		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
+										   targblock, RBM_NORMAL, vac_strategy);
+		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 
 #ifdef USE_PREFETCH
 
@@ -1238,10 +1405,10 @@ acquire_sample_rows(Relation onerel, int elevel,
 		 * next one we will want, if there's any left.
 		 */
 		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
+			PrefetchBuffer(scan->rs_base.rs_rd, MAIN_FORKNUM, prefetch_targblock);
 #endif
 
-		while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
+		while (heap_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
 		{
 			/*
 			 * The first targrows sample rows are simply copied into the
@@ -1291,7 +1458,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
-	heap_endscan(scan);
+	heap_endscan((TableScanDesc) scan);
 
 	/*
 	 * If we didn't find as many tuples as we wanted then we're done. No sort
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 48936826bcc..be630620d0d 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -412,15 +412,6 @@ extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple);
 extern bool HeapTupleIsSurelyDead(HeapTuple htup,
 								  struct GlobalVisState *vistest);
 
-/* in heap/heapam_handler.c*/
-extern void heapam_scan_analyze_next_block(TableScanDesc scan,
-										   BlockNumber blockno,
-										   BufferAccessStrategy bstrategy);
-extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
-										   TransactionId OldestXmin,
-										   double *liverows, double *deadrows,
-										   TupleTableSlot *slot);
-
 /*
  * To avoid leaking too much knowledge about reorderbuffer implementation
  * details this is implemented in reorderbuffer.c not heapam_visibility.c
-- 
2.40.1

v10-0002-Use-streaming-read-API-in-ANALYZE.patchtext/x-patch; charset=US-ASCII; name=v10-0002-Use-streaming-read-API-in-ANALYZE.patchDownload
From 90d115c2401567be65bcf64393a6d3b39286779e Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sun, 7 Apr 2024 15:28:32 -0400
Subject: [PATCH v10 2/3] Use streaming read API in ANALYZE

The ANALYZE command prefetches and reads sample blocks chosen by a
BlockSampler algorithm. Instead of calling Prefetch|ReadBuffer() for
each block, ANALYZE now uses the streaming API introduced in b5a9b18cd0.

Author: Nazir Bilal Yavuz
Reviewed-by: Melanie Plageman
Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
---
 src/backend/commands/analyze.c | 89 ++++++++++------------------------
 1 file changed, 26 insertions(+), 63 deletions(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 335ffb24302..3cfad92390d 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1253,6 +1253,20 @@ heap_scan_analyze_next_tuple(HeapScanDesc scan, TransactionId OldestXmin,
 	return false;
 }
 
+/*
+ * Streaming read callback returning the next block number while using
+ * BlockSampling algorithm.
+ */
+static BlockNumber
+block_sampling_streaming_read_next(ReadStream *stream,
+								   void *user_data,
+								   void *per_buffer_data)
+{
+	BlockSamplerData *bs = user_data;
+
+	return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
+}
+
 /*
  * acquire_sample_rows -- acquire a random sample of rows from the heap
  *
@@ -1305,10 +1319,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	HeapScanDesc scan;
 	BlockNumber nblocks;
 	BlockNumber blksdone = 0;
-#ifdef USE_PREFETCH
-	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
-	BlockSamplerData prefetch_bs;
-#endif
+	ReadStream *stream;
 
 	Assert(targrows > 0);
 
@@ -1321,13 +1332,6 @@ acquire_sample_rows(Relation onerel, int elevel,
 	randseed = pg_prng_uint32(&pg_global_prng_state);
 	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
 
-#ifdef USE_PREFETCH
-	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
-	/* Create another BlockSampler, using the same seed, for prefetching */
-	if (prefetch_maximum)
-		(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
-#endif
-
 	/* Report sampling block numbers */
 	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
 								 nblocks);
@@ -1337,54 +1341,21 @@ acquire_sample_rows(Relation onerel, int elevel,
 
 	scan = (HeapScanDesc) heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
 	slot = table_slot_create(onerel, NULL);
-
-#ifdef USE_PREFETCH
-
-	/*
-	 * If we are doing prefetching, then go ahead and tell the kernel about
-	 * the first set of pages we are going to want.  This also moves our
-	 * iterator out ahead of the main one being used, where we will keep it so
-	 * that we're always pre-fetching out prefetch_maximum number of blocks
-	 * ahead.
-	 */
-	if (prefetch_maximum)
-	{
-		for (int i = 0; i < prefetch_maximum; i++)
-		{
-			BlockNumber prefetch_block;
-
-			if (!BlockSampler_HasMore(&prefetch_bs))
-				break;
-
-			prefetch_block = BlockSampler_Next(&prefetch_bs);
-			PrefetchBuffer(scan->rs_base.rs_rd, MAIN_FORKNUM, prefetch_block);
-		}
-	}
-#endif
+	stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
+										vac_strategy,
+										scan->rs_base.rs_rd,
+										MAIN_FORKNUM,
+										block_sampling_streaming_read_next,
+										&bs,
+										0);
 
 	scan->rs_cbuf = InvalidBuffer;
 
 	/* Outer loop over blocks to sample */
-	while (BlockSampler_HasMore(&bs))
+	while (BufferIsValid(scan->rs_cbuf = read_stream_next_buffer(stream, NULL)))
 	{
-		BlockNumber targblock = BlockSampler_Next(&bs);
-#ifdef USE_PREFETCH
-		BlockNumber prefetch_targblock = InvalidBlockNumber;
-
-		/*
-		 * Make sure that every time the main BlockSampler is moved forward
-		 * that our prefetch BlockSampler also gets moved forward, so that we
-		 * always stay out ahead.
-		 */
-		if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
-			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
-#endif
-
 		vacuum_delay_point();
 
-		scan->rs_cblock = targblock;
-		scan->rs_cindex = FirstOffsetNumber;
-
 		/*
 		 * We must maintain a pin on the target page's buffer to ensure that
 		 * concurrent activity - e.g. HOT pruning - doesn't delete tuples out
@@ -1394,19 +1365,10 @@ acquire_sample_rows(Relation onerel, int elevel,
 		 * we aren't doing much work per tuple, the extra lock traffic is
 		 * probably better avoided.
 		 */
-		scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM,
-										   targblock, RBM_NORMAL, vac_strategy);
 		LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE);
 
-#ifdef USE_PREFETCH
-
-		/*
-		 * When pre-fetching, after we get a block, tell the kernel about the
-		 * next one we will want, if there's any left.
-		 */
-		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
-			PrefetchBuffer(scan->rs_base.rs_rd, MAIN_FORKNUM, prefetch_targblock);
-#endif
+		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+		scan->rs_cindex = FirstOffsetNumber;
 
 		while (heap_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
 		{
@@ -1456,6 +1418,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
 									 ++blksdone);
 	}
+	read_stream_end(stream);
 
 	ExecDropSingleTupleTableSlot(slot);
 	heap_endscan((TableScanDesc) scan);
-- 
2.40.1

v10-0003-Obsolete-BlockSampler_HasMore.patchtext/x-patch; charset=US-ASCII; name=v10-0003-Obsolete-BlockSampler_HasMore.patchDownload
From 862b7ac81cdafcda7b525e02721da14e46265509 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sun, 7 Apr 2024 15:38:41 -0400
Subject: [PATCH v10 3/3] Obsolete BlockSampler_HasMore()

A previous commit stopped using BlockSampler_HasMore() for flow control
in acquire_sample_rows(). There seems little use now for
BlockSampler_HasMore(). It should be sufficient to return
InvalidBlockNumber from BlockSampler_Next() when BlockSample_HasMore()
would have returned false. Remove BlockSampler_HasMore().

Author: Melanie Plageman, Nazir Bilal Yavuz
Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
---
 src/backend/commands/analyze.c    |  4 +---
 src/backend/utils/misc/sampling.c | 11 ++++-------
 src/include/utils/sampling.h      |  1 -
 3 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 3cfad92390d..95d113bb907 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1262,9 +1262,7 @@ block_sampling_streaming_read_next(ReadStream *stream,
 								   void *user_data,
 								   void *per_buffer_data)
 {
-	BlockSamplerData *bs = user_data;
-
-	return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
+	return BlockSampler_Next(user_data);
 }
 
 /*
diff --git a/src/backend/utils/misc/sampling.c b/src/backend/utils/misc/sampling.c
index 933db06702c..245d826affe 100644
--- a/src/backend/utils/misc/sampling.c
+++ b/src/backend/utils/misc/sampling.c
@@ -54,12 +54,6 @@ BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize,
 	return Min(bs->n, bs->N);
 }
 
-bool
-BlockSampler_HasMore(BlockSampler bs)
-{
-	return (bs->t < bs->N) && (bs->m < bs->n);
-}
-
 BlockNumber
 BlockSampler_Next(BlockSampler bs)
 {
@@ -68,7 +62,10 @@ BlockSampler_Next(BlockSampler bs)
 	double		p;				/* probability to skip block */
 	double		V;				/* random */
 
-	Assert(BlockSampler_HasMore(bs));	/* hence K > 0 and k > 0 */
+
+	/* Return if no remaining blocks or no blocks to sample */
+	if (K <= 0 || k <= 0)
+		return InvalidBlockNumber;
 
 	if ((BlockNumber) k >= K)
 	{
diff --git a/src/include/utils/sampling.h b/src/include/utils/sampling.h
index be48ee52bac..fb5d6820a24 100644
--- a/src/include/utils/sampling.h
+++ b/src/include/utils/sampling.h
@@ -38,7 +38,6 @@ typedef BlockSamplerData *BlockSampler;
 
 extern BlockNumber BlockSampler_Init(BlockSampler bs, BlockNumber nblocks,
 									 int samplesize, uint32 randseed);
-extern bool BlockSampler_HasMore(BlockSampler bs);
 extern BlockNumber BlockSampler_Next(BlockSampler bs);
 
 /* Reservoir sampling methods */
-- 
2.40.1

#15Andres Freund
andres@anarazel.de
In reply to: Melanie Plageman (#14)
Re: Use streaming read API in ANALYZE

Hi,

On 2024-04-07 16:59:26 -0400, Melanie Plageman wrote:

From 1dc2343661f3edb3b1bc4307afb0e956397eb76c Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sun, 7 Apr 2024 14:55:22 -0400
Subject: [PATCH v10 1/3] Make heapam_scan_analyze_next_[tuple|block] static.

27bc1772fc81 removed the table AM callbacks scan_analyze_next_block and
scan_analyze_next_tuple -- leaving their heap AM implementations only
called by acquire_sample_rows().

Ugh, I don't think 27bc1772fc81 makes much sense. But that's unrelated to this
thread. I did raise that separately
/messages/by-id/20240407214001.jgpg5q3yv33ve6y3@awork3.anarazel.de

Unless I seriously missed something, I see no alternative to reverting that
commit.

@@ -1206,11 +1357,13 @@ acquire_sample_rows(Relation onerel, int elevel,
 				break;
 
 			prefetch_block = BlockSampler_Next(&prefetch_bs);
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
+			PrefetchBuffer(scan->rs_base.rs_rd, MAIN_FORKNUM, prefetch_block);
 		}
 	}
 #endif
 
+	scan->rs_cbuf = InvalidBuffer;
+
 	/* Outer loop over blocks to sample */
 	while (BlockSampler_HasMore(&bs))
 	{

I don't think it's good to move a lot of code *and* change how it is
structured in the same commit. Makes it much harder to actually see changes /
makes git blame harder to use / etc.

From 90d115c2401567be65bcf64393a6d3b39286779e Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sun, 7 Apr 2024 15:28:32 -0400
Subject: [PATCH v10 2/3] Use streaming read API in ANALYZE

The ANALYZE command prefetches and reads sample blocks chosen by a
BlockSampler algorithm. Instead of calling Prefetch|ReadBuffer() for
each block, ANALYZE now uses the streaming API introduced in b5a9b18cd0.

Author: Nazir Bilal Yavuz
Reviewed-by: Melanie Plageman
Discussion: /messages/by-id/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM+6M+mJg@mail.gmail.com
---
src/backend/commands/analyze.c | 89 ++++++++++------------------------
1 file changed, 26 insertions(+), 63 deletions(-)

That's a very nice demonstration of how this makes good prefetching easier...

From 862b7ac81cdafcda7b525e02721da14e46265509 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sun, 7 Apr 2024 15:38:41 -0400
Subject: [PATCH v10 3/3] Obsolete BlockSampler_HasMore()

A previous commit stopped using BlockSampler_HasMore() for flow control
in acquire_sample_rows(). There seems little use now for
BlockSampler_HasMore(). It should be sufficient to return
InvalidBlockNumber from BlockSampler_Next() when BlockSample_HasMore()
would have returned false. Remove BlockSampler_HasMore().

Author: Melanie Plageman, Nazir Bilal Yavuz
Discussion: /messages/by-id/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM+6M+mJg@mail.gmail.com

The justification here seems somewhat odd. Sure, the previous commit stopped
using BlockSampler_HasMore in acquire_sample_rows - but only because it was
moved to block_sampling_streaming_read_next()?

Greetings,

Andres Freund

#16Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#15)
2 attachment(s)
Re: Use streaming read API in ANALYZE

On Sun, Apr 07, 2024 at 03:00:00PM -0700, Andres Freund wrote:

Hi,

On 2024-04-07 16:59:26 -0400, Melanie Plageman wrote:

From 1dc2343661f3edb3b1bc4307afb0e956397eb76c Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sun, 7 Apr 2024 14:55:22 -0400
Subject: [PATCH v10 1/3] Make heapam_scan_analyze_next_[tuple|block] static.

27bc1772fc81 removed the table AM callbacks scan_analyze_next_block and
scan_analyze_next_tuple -- leaving their heap AM implementations only
called by acquire_sample_rows().

Ugh, I don't think 27bc1772fc81 makes much sense. But that's unrelated to this
thread. I did raise that separately
/messages/by-id/20240407214001.jgpg5q3yv33ve6y3@awork3.anarazel.de

Unless I seriously missed something, I see no alternative to reverting that
commit.

Noted. I'll give up on this refactor then. Lots of churn for no gain.
Attached v11 is just Bilal's v8 patch rebased to apply cleanly and with
a few tweaks (I changed one of the loop conditions. All other changes
are to comments and commit message).

@@ -1206,11 +1357,13 @@ acquire_sample_rows(Relation onerel, int elevel,
 				break;
 
 			prefetch_block = BlockSampler_Next(&prefetch_bs);
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
+			PrefetchBuffer(scan->rs_base.rs_rd, MAIN_FORKNUM, prefetch_block);
 		}
 	}
 #endif
 
+	scan->rs_cbuf = InvalidBuffer;
+
 	/* Outer loop over blocks to sample */
 	while (BlockSampler_HasMore(&bs))
 	{

I don't think it's good to move a lot of code *and* change how it is
structured in the same commit. Makes it much harder to actually see changes /
makes git blame harder to use / etc.

Yep.

From 90d115c2401567be65bcf64393a6d3b39286779e Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sun, 7 Apr 2024 15:28:32 -0400
Subject: [PATCH v10 2/3] Use streaming read API in ANALYZE

The ANALYZE command prefetches and reads sample blocks chosen by a
BlockSampler algorithm. Instead of calling Prefetch|ReadBuffer() for
each block, ANALYZE now uses the streaming API introduced in b5a9b18cd0.

Author: Nazir Bilal Yavuz
Reviewed-by: Melanie Plageman
Discussion: /messages/by-id/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM+6M+mJg@mail.gmail.com
---
src/backend/commands/analyze.c | 89 ++++++++++------------------------
1 file changed, 26 insertions(+), 63 deletions(-)

That's a very nice demonstration of how this makes good prefetching easier...

Agreed. Yay streaming read API and Bilal!

From 862b7ac81cdafcda7b525e02721da14e46265509 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sun, 7 Apr 2024 15:38:41 -0400
Subject: [PATCH v10 3/3] Obsolete BlockSampler_HasMore()

A previous commit stopped using BlockSampler_HasMore() for flow control
in acquire_sample_rows(). There seems little use now for
BlockSampler_HasMore(). It should be sufficient to return
InvalidBlockNumber from BlockSampler_Next() when BlockSample_HasMore()
would have returned false. Remove BlockSampler_HasMore().

Author: Melanie Plageman, Nazir Bilal Yavuz
Discussion: /messages/by-id/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM+6M+mJg@mail.gmail.com

The justification here seems somewhat odd. Sure, the previous commit stopped
using BlockSampler_HasMore in acquire_sample_rows - but only because it was
moved to block_sampling_streaming_read_next()?

It didn't stop using it. It stopped being useful. The reason it existed,
as far as I can tell, was to use it as the while() loop condition in
acquire_sample_rows(). I think it makes much more sense for
BlockSampler_Next() to return InvalidBlockNumber when there are no more
blocks -- not to assert you don't call it when there aren't any more
blocks.
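
To illustrate the difference for callers (a sketch only; process_block()
is a stand-in, not a real function):

	/* before: ask first, and BlockSampler_Next() asserts it wasn't misused */
	while (BlockSampler_HasMore(&bs))
		process_block(BlockSampler_Next(&bs));

	/* after: exhaustion is signalled in-band, which is exactly what a
	 * read stream callback wants to return anyway */
	BlockNumber blkno;

	while ((blkno = BlockSampler_Next(&bs)) != InvalidBlockNumber)
		process_block(blkno);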

I didn't want to change BlockSampler_Next() in the same commit as the
streaming read user and we can't remove BlockSampler_HasMore() without
changing BlockSampler_Next().

- Melanie

Attachments:

v11-0001-Use-streaming-read-API-in-ANALYZE.patchtext/x-diff; charset=us-asciiDownload
From 3cb43693c04554f5d46e0dc9156bef36af642593 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sun, 7 Apr 2024 18:17:01 -0400
Subject: [PATCH v11 1/2] Use streaming read API in ANALYZE

The ANALYZE command prefetches and reads sample blocks chosen by a
BlockSampler algorithm. Instead of calling [Prefetch|Read]Buffer() for
each block, ANALYZE now uses the streaming API introduced in b5a9b18cd0.

Author: Nazir Bilal Yavuz
Reviewed-by: Melanie Plageman, Andres Freund
Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
---
 src/backend/access/heap/heapam_handler.c | 20 +++---
 src/backend/commands/analyze.c           | 85 +++++++-----------------
 src/include/access/heapam.h              |  5 +-
 3 files changed, 39 insertions(+), 71 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 58de2c82a70..d9f053022f2 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1055,16 +1055,16 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 }
 
 /*
- * Prepare to analyze block `blockno` of `scan`.  The scan has been started
+ * Prepare to analyze the next block in a read stream.  Returns false if
+ * the stream is exhausted and true otherwise. The scan must have been started
  * with SO_TYPE_ANALYZE option.
  *
  * This routine holds a buffer pin and lock on the heap page.  They are held
  * until heapam_scan_analyze_next_tuple() returns false.  That is until all the
  * items of the heap page are analyzed.
  */
-void
-heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-							   BufferAccessStrategy bstrategy)
+bool
+heapam_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
 
@@ -1077,11 +1077,15 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
 	 * doing much work per tuple, the extra lock traffic is probably better
 	 * avoided.
 	 */
-	hscan->rs_cblock = blockno;
-	hscan->rs_cindex = FirstOffsetNumber;
-	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
-										blockno, RBM_NORMAL, bstrategy);
+	hscan->rs_cbuf = read_stream_next_buffer(stream, NULL);
+	if (!BufferIsValid(hscan->rs_cbuf))
+		return false;
+
 	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
+
+	hscan->rs_cblock = BufferGetBlockNumber(hscan->rs_cbuf);
+	hscan->rs_cindex = FirstOffsetNumber;
+	return true;
 }
 
 /*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2fb39f3ede1..58f6ad96136 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1102,6 +1102,20 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 	return stats;
 }
 
+/*
+ * Streaming read callback returning the next BlockNumber as chosen by the
+ * BlockSampling algorithm.
+ */
+static BlockNumber
+block_sampling_streaming_read_next(ReadStream *stream,
+								   void *user_data,
+								   void *per_buffer_data)
+{
+	BlockSamplerData *bs = user_data;
+
+	return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
+}
+
 /*
  * acquire_sample_rows -- acquire a random sample of rows from the heap
  *
@@ -1154,10 +1168,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	TableScanDesc scan;
 	BlockNumber nblocks;
 	BlockNumber blksdone = 0;
-#ifdef USE_PREFETCH
-	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
-	BlockSamplerData prefetch_bs;
-#endif
+	ReadStream *stream;
 
 	Assert(targrows > 0);
 
@@ -1170,13 +1181,6 @@ acquire_sample_rows(Relation onerel, int elevel,
 	randseed = pg_prng_uint32(&pg_global_prng_state);
 	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
 
-#ifdef USE_PREFETCH
-	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
-	/* Create another BlockSampler, using the same seed, for prefetching */
-	if (prefetch_maximum)
-		(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
-#endif
-
 	/* Report sampling block numbers */
 	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
 								 nblocks);
@@ -1187,60 +1191,19 @@ acquire_sample_rows(Relation onerel, int elevel,
 	scan = heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE);
 	slot = table_slot_create(onerel, NULL);
 
-#ifdef USE_PREFETCH
-
-	/*
-	 * If we are doing prefetching, then go ahead and tell the kernel about
-	 * the first set of pages we are going to want.  This also moves our
-	 * iterator out ahead of the main one being used, where we will keep it so
-	 * that we're always pre-fetching out prefetch_maximum number of blocks
-	 * ahead.
-	 */
-	if (prefetch_maximum)
-	{
-		for (int i = 0; i < prefetch_maximum; i++)
-		{
-			BlockNumber prefetch_block;
-
-			if (!BlockSampler_HasMore(&prefetch_bs))
-				break;
-
-			prefetch_block = BlockSampler_Next(&prefetch_bs);
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
-		}
-	}
-#endif
+	stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
+										vac_strategy,
+										scan->rs_rd,
+										MAIN_FORKNUM,
+										block_sampling_streaming_read_next,
+										&bs,
+										0);
 
 	/* Outer loop over blocks to sample */
-	while (BlockSampler_HasMore(&bs))
+	while (heapam_scan_analyze_next_block(scan, stream))
 	{
-		BlockNumber targblock = BlockSampler_Next(&bs);
-#ifdef USE_PREFETCH
-		BlockNumber prefetch_targblock = InvalidBlockNumber;
-
-		/*
-		 * Make sure that every time the main BlockSampler is moved forward
-		 * that our prefetch BlockSampler also gets moved forward, so that we
-		 * always stay out ahead.
-		 */
-		if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
-			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
-#endif
-
 		vacuum_delay_point();
 
-		heapam_scan_analyze_next_block(scan, targblock, vac_strategy);
-
-#ifdef USE_PREFETCH
-
-		/*
-		 * When pre-fetching, after we get a block, tell the kernel about the
-		 * next one we will want, if there's any left.
-		 */
-		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
-#endif
-
 		while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
 		{
 			/*
@@ -1290,6 +1253,8 @@ acquire_sample_rows(Relation onerel, int elevel,
 									 ++blksdone);
 	}
 
+	read_stream_end(stream);
+
 	ExecDropSingleTupleTableSlot(slot);
 	heap_endscan(scan);
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 48936826bcc..f84dbe629fe 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -413,9 +413,8 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
 								  struct GlobalVisState *vistest);
 
 /* in heap/heapam_handler.c*/
-extern void heapam_scan_analyze_next_block(TableScanDesc scan,
-										   BlockNumber blockno,
-										   BufferAccessStrategy bstrategy);
+extern bool heapam_scan_analyze_next_block(TableScanDesc scan,
+										   ReadStream *stream);
 extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
 										   TransactionId OldestXmin,
 										   double *liverows, double *deadrows,
-- 
2.40.1

v11-0002-Obsolete-BlockSampler_HasMore.patchtext/x-diff; charset=us-asciiDownload
From e7de413aedc33a77f9758fd37872f2eb2b6478f5 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sun, 7 Apr 2024 18:19:38 -0400
Subject: [PATCH v11 2/2] Obsolete BlockSampler_HasMore()

A previous commit stopped using BlockSampler_HasMore() for flow control
in acquire_sample_rows(). There seems little use now for
BlockSampler_HasMore(). It should be sufficient to return
InvalidBlockNumber from BlockSampler_Next() when BlockSample_HasMore()
would have returned false. Remove BlockSampler_HasMore().

Author: Melanie Plageman, Nazir Bilal Yavuz
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
---
 src/backend/commands/analyze.c    |  4 +---
 src/backend/utils/misc/sampling.c | 10 +++-------
 src/include/utils/sampling.h      |  1 -
 3 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 58f6ad96136..602d0f35e7c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1111,9 +1111,7 @@ block_sampling_streaming_read_next(ReadStream *stream,
 								   void *user_data,
 								   void *per_buffer_data)
 {
-	BlockSamplerData *bs = user_data;
-
-	return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
+	return BlockSampler_Next(user_data);
 }
 
 /*
diff --git a/src/backend/utils/misc/sampling.c b/src/backend/utils/misc/sampling.c
index 933db06702c..6e2bca9739c 100644
--- a/src/backend/utils/misc/sampling.c
+++ b/src/backend/utils/misc/sampling.c
@@ -54,12 +54,6 @@ BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize,
 	return Min(bs->n, bs->N);
 }
 
-bool
-BlockSampler_HasMore(BlockSampler bs)
-{
-	return (bs->t < bs->N) && (bs->m < bs->n);
-}
-
 BlockNumber
 BlockSampler_Next(BlockSampler bs)
 {
@@ -68,7 +62,9 @@ BlockSampler_Next(BlockSampler bs)
 	double		p;				/* probability to skip block */
 	double		V;				/* random */
 
-	Assert(BlockSampler_HasMore(bs));	/* hence K > 0 and k > 0 */
+	/* Return if no remaining blocks or no blocks to sample */
+	if (K <= 0 || k <= 0)
+		return InvalidBlockNumber;
 
 	if ((BlockNumber) k >= K)
 	{
diff --git a/src/include/utils/sampling.h b/src/include/utils/sampling.h
index be48ee52bac..fb5d6820a24 100644
--- a/src/include/utils/sampling.h
+++ b/src/include/utils/sampling.h
@@ -38,7 +38,6 @@ typedef BlockSamplerData *BlockSampler;
 
 extern BlockNumber BlockSampler_Init(BlockSampler bs, BlockNumber nblocks,
 									 int samplesize, uint32 randseed);
-extern bool BlockSampler_HasMore(BlockSampler bs);
 extern BlockNumber BlockSampler_Next(BlockSampler bs);
 
 /* Reservoir sampling methods */
-- 
2.40.1

#17Thomas Munro
thomas.munro@gmail.com
In reply to: Melanie Plageman (#16)
Re: Use streaming read API in ANALYZE

On Mon, Apr 8, 2024 at 10:26 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Sun, Apr 07, 2024 at 03:00:00PM -0700, Andres Freund wrote:

src/backend/commands/analyze.c | 89 ++++++++++------------------------
1 file changed, 26 insertions(+), 63 deletions(-)

That's a very nice demonstration of how this makes good prefetching easier...

Agreed. Yay streaming read API and Bilal!

+1

I found a few comments to tweak, just a couple of places that hadn't
got the memo after we renamed "read stream", and an obsolete mention
of pinning buffers. I adjusted those directly.

I ran some tests on a random basic Linux/ARM cloud box with a 7.6GB
table, and I got:

                                      cold      hot
master:                             9025ms    199ms
patched, io_combine_limit=1:        9025ms    191ms
patched, io_combine_limit=default:  8729ms    191ms

Despite being random, occasionally some I/Os must get merged, allowing
slightly better random throughput when accessing disk blocks through a
3000 IOPS drinking straw. Looking at strace, I see 29144 pread* calls
instead of 30071, which fits that theory. Let's see... if you roll a
fair 973452-sided dice 30071 times, how many times do you expect to
roll consecutive numbers? Each time you roll there is a 1/973452
chance that you get the last number + 1, and we have 30071 tries
giving 30071/973452 = ~3%. 9025ms minus 3% is 8754ms. Seems about
right.
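
(Restating that estimate as a quick throwaway C check, using only the
numbers quoted above -- not output from the benchmark harness:

	#include <stdio.h>

	int
	main(void)
	{
		double		N = 973452;		/* blocks in the ~7.6GB table */
		double		k = 30071;		/* sampled blocks */

		/*
		 * The sample is sorted, so each successive sample lands on the
		 * previous block + 1 with probability ~k/N, i.e. ~3% of reads
		 * should merge into their predecessor.
		 */
		printf("expected merges: %.0f\n", k * k / N);	/* ~929 */
		printf("observed merges: %d\n", 30071 - 29144); /* 927 */
		printf("predicted cold: %.0f ms\n", 9025 * (1 - k / N));	/* ~8746 */
		return 0;
	}

That lands close to the measured 8729ms.)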

I am not sure exactly why the hot number is faster. (Anecdotally, I
did notice that in the cases that beat master semi-unexpectedly like
this, my software memory prefetch patch doesn't help or hurt, whereas
in the cases where there is little difference on some CPUs, that patch
seems to get a speed-up like this instead, which might be a clue.
*Shrug*, investigation needed.)

Pushed. Thanks Bilal and reviewers!

#18Thomas Munro
thomas.munro@gmail.com
In reply to: Melanie Plageman (#16)
1 attachment(s)
Re: Use streaming read API in ANALYZE

On Mon, Apr 8, 2024 at 10:26 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Sun, Apr 07, 2024 at 03:00:00PM -0700, Andres Freund wrote:

On 2024-04-07 16:59:26 -0400, Melanie Plageman wrote:

From 862b7ac81cdafcda7b525e02721da14e46265509 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Sun, 7 Apr 2024 15:38:41 -0400
Subject: [PATCH v10 3/3] Obsolete BlockSampler_HasMore()

A previous commit stopped using BlockSampler_HasMore() for flow control
in acquire_sample_rows(). There seems little use now for
BlockSampler_HasMore(). It should be sufficient to return
InvalidBlockNumber from BlockSampler_Next() when BlockSample_HasMore()
would have returned false. Remove BlockSampler_HasMore().

Author: Melanie Plageman, Nazir Bilal Yavuz
Discussion: /messages/by-id/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM+6M+mJg@mail.gmail.com

The justification here seems somewhat odd. Sure, the previous commit stopped
using BlockSampler_HasMore in acquire_sample_rows - but only because it was
moved to block_sampling_streaming_read_next()?

It didn't stop using it. It stopped being useful. The reason it existed,
as far as I can tell, was to use it as the while() loop condition in
acquire_sample_rows(). I think it makes much more sense for
BlockSampler_Next() to return InvalidBlockNumber when there are no more
blocks -- not to assert you don't call it when there aren't any more
blocks.

I didn't want to change BlockSampler_Next() in the same commit as the
streaming read user and we can't remove BlockSampler_HasMore() without
changing BlockSampler_Next().

I agree that the code looks useless if one condition implies the
other, but isn't it good to keep that cross-check, perhaps
reformulated as an assertion? I didn't look too hard at the maths, I
just saw the words "It is not obvious that this code matches Knuth's
Algorithm S ..." and realised I'm not sure I have time to develop a
good opinion about this today. So I'll leave the 0002 change out for
now, as it's a tidy-up that can easily be applied in the next cycle.
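
To be concrete, the cross-check could survive as an assertion at the top
of BlockSampler_Next(), something like this sketch (K and k declared as
the function already does; an illustration of the idea, not a patch):

	BlockNumber K = bs->N - bs->t;	/* remaining blocks */
	int			k = bs->n - bs->m;	/* blocks still to sample */

	/* The two formulations of "exhausted" must agree. */
	Assert(((bs->t < bs->N) && (bs->m < bs->n)) == (K > 0 && k > 0));

	if (K <= 0 || k <= 0)
		return InvalidBlockNumber;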

Attachments:

v12-0001-Remove-obsolete-BlockSampler_HasMore.patchtext/x-patch; charset=US-ASCII; name=v12-0001-Remove-obsolete-BlockSampler_HasMore.patchDownload
From c3b8df8e4720d8b0dfb4c892c0aa3ddaef8f401f Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 8 Apr 2024 14:38:58 +1200
Subject: [PATCH v12] Remove obsolete BlockSampler_HasMore().

Commit 041b9680 stopped using BlockSampler_HasMore() for flow control in
acquire_sample_rows(). There seems to be little use for it now. We can
just return InvalidBlockNumber from BlockSampler_Next() when
BlockSample_HasMore() would have returned false.

Author: Melanie Plageman <melanieplageman@gmail.com>
Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
---
 src/backend/commands/analyze.c    |  4 +---
 src/backend/utils/misc/sampling.c | 10 +++-------
 src/include/utils/sampling.h      |  1 -
 3 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index da27a13a3f..e9fa3470cf 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1111,9 +1111,7 @@ block_sampling_read_stream_next(ReadStream *stream,
 								void *callback_private_data,
 								void *per_buffer_data)
 {
-	BlockSamplerData *bs = callback_private_data;
-
-	return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
+	return BlockSampler_Next(callback_private_data);
 }
 
 /*
diff --git a/src/backend/utils/misc/sampling.c b/src/backend/utils/misc/sampling.c
index 933db06702..6e2bca9739 100644
--- a/src/backend/utils/misc/sampling.c
+++ b/src/backend/utils/misc/sampling.c
@@ -54,12 +54,6 @@ BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize,
 	return Min(bs->n, bs->N);
 }
 
-bool
-BlockSampler_HasMore(BlockSampler bs)
-{
-	return (bs->t < bs->N) && (bs->m < bs->n);
-}
-
 BlockNumber
 BlockSampler_Next(BlockSampler bs)
 {
@@ -68,7 +62,9 @@ BlockSampler_Next(BlockSampler bs)
 	double		p;				/* probability to skip block */
 	double		V;				/* random */
 
-	Assert(BlockSampler_HasMore(bs));	/* hence K > 0 and k > 0 */
+	/* Return if no remaining blocks or no blocks to sample */
+	if (K <= 0 || k <= 0)
+		return InvalidBlockNumber;
 
 	if ((BlockNumber) k >= K)
 	{
diff --git a/src/include/utils/sampling.h b/src/include/utils/sampling.h
index be48ee52ba..fb5d6820a2 100644
--- a/src/include/utils/sampling.h
+++ b/src/include/utils/sampling.h
@@ -38,7 +38,6 @@ typedef BlockSamplerData *BlockSampler;
 
 extern BlockNumber BlockSampler_Init(BlockSampler bs, BlockNumber nblocks,
 									 int samplesize, uint32 randseed);
-extern bool BlockSampler_HasMore(BlockSampler bs);
 extern BlockNumber BlockSampler_Next(BlockSampler bs);
 
 /* Reservoir sampling methods */
-- 
2.44.0

#19Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Nazir Bilal Yavuz (#10)
Re: Use streaming read API in ANALYZE

Hi,

On Wed, 3 Apr 2024 at 22:25, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

Hi,

Thank you for looking into this!

On Wed, 3 Apr 2024 at 20:17, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 03/04/2024 13:31, Nazir Bilal Yavuz wrote:

Streaming API has been committed but the committed version has a minor
change, the read_stream_begin_relation function takes Relation instead
of BufferManagerRelation now. So, here is a v5 which addresses this
change.

I'm getting a repeatable segfault / assertion failure with this:

postgres=# CREATE TABLE tengiga (i int, filler text) with (fillfactor=10);
CREATE TABLE
postgres=# insert into tengiga select g, repeat('x', 900) from
generate_series(1, 1400000) g;
INSERT 0 1400000
postgres=# set default_statistics_target = 10; ANALYZE tengiga;
SET
ANALYZE
postgres=# set default_statistics_target = 100; ANALYZE tengiga;
SET
ANALYZE
postgres=# set default_statistics_target =1000; ANALYZE tengiga;
SET
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

TRAP: failed Assert("BufferIsValid(hscan->rs_cbuf)"), File:
"heapam_handler.c", Line: 1079, PID: 262232
postgres: heikki postgres [local]
ANALYZE(ExceptionalCondition+0xa8)[0x56488a0de9d8]
postgres: heikki postgres [local]
ANALYZE(heapam_scan_analyze_next_block+0x63)[0x5648899ece34]
postgres: heikki postgres [local] ANALYZE(+0x2d3f34)[0x564889b6af34]
postgres: heikki postgres [local] ANALYZE(+0x2d2a3a)[0x564889b69a3a]
postgres: heikki postgres [local] ANALYZE(analyze_rel+0x33e)[0x564889b68fa9]
postgres: heikki postgres [local] ANALYZE(vacuum+0x4b3)[0x564889c2dcc0]
postgres: heikki postgres [local] ANALYZE(ExecVacuum+0xd6f)[0x564889c2d7fe]
postgres: heikki postgres [local]
ANALYZE(standard_ProcessUtility+0x901)[0x564889f0b8b9]
postgres: heikki postgres [local]
ANALYZE(ProcessUtility+0x136)[0x564889f0afb1]
postgres: heikki postgres [local] ANALYZE(+0x6728c8)[0x564889f098c8]
postgres: heikki postgres [local] ANALYZE(+0x672b3b)[0x564889f09b3b]
postgres: heikki postgres [local] ANALYZE(PortalRun+0x320)[0x564889f09015]
postgres: heikki postgres [local] ANALYZE(+0x66b2c6)[0x564889f022c6]
postgres: heikki postgres [local]
ANALYZE(PostgresMain+0x80c)[0x564889f06fd7]
postgres: heikki postgres [local] ANALYZE(+0x667876)[0x564889efe876]
postgres: heikki postgres [local]
ANALYZE(postmaster_child_launch+0xe6)[0x564889e1f4b3]
postgres: heikki postgres [local] ANALYZE(+0x58e68e)[0x564889e2568e]
postgres: heikki postgres [local] ANALYZE(+0x58b7f0)[0x564889e227f0]
postgres: heikki postgres [local]
ANALYZE(PostmasterMain+0x152b)[0x564889e2214d]
postgres: heikki postgres [local] ANALYZE(+0x4444b4)[0x564889cdb4b4]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7f7d83b6724a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f7d83b67305]
postgres: heikki postgres [local] ANALYZE(_start+0x21)[0x564889971a61]
2024-04-03 20:15:49.157 EEST [262101] LOG: server process (PID 262232)
was terminated by signal 6: Aborted

I noticed the same error while working on Jakub's benchmarking results.

Cause: I was using the nblocks variable to check how many blocks will
be returned from the streaming API. But I realized that sometimes the
number returned from BlockSampler_Init() is not equal to the number of
blocks that BlockSampler_Next() will return, because the block sampling
algorithm decides how many blocks to return on the fly by using some
random seeds.

I wanted to re-check this problem and I realized that I was wrong. I
tried using nblocks again and this time there was no failure. I looked
at the block sampling logic and I am pretty sure that the
BlockSampler_Init() function correctly returns the number of blocks
that BlockSampler_Next() will return. It seems 158f581923 fixed this
issue as well.
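
In other words, the invariant is (a quick sketch, with totalblocks,
targrows and randseed as in acquire_sample_rows(); I have not run this
exact snippet):

	BlockSamplerData bs;
	BlockNumber nblocks;
	BlockNumber seen = 0;

	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);

	/* drain the sampler, counting how many blocks it actually returns */
	while (BlockSampler_HasMore(&bs))
	{
		(void) BlockSampler_Next(&bs);
		seen++;
	}
	Assert(seen == nblocks);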

--
Regards,
Nazir Bilal Yavuz
Microsoft

#20Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Thomas Munro (#17)
Re: Use streaming read API in ANALYZE

Hi,

On Mon, 8 Apr 2024 at 04:21, Thomas Munro <thomas.munro@gmail.com> wrote:

Pushed. Thanks Bilal and reviewers!

I wanted to discuss what will happen to this patch now that
27bc1772fc8 is reverted. I am continuing this thread, but I can create
another thread if you prefer.

After the revert of 27bc1772fc8, acquire_sample_rows() became table-AM
agnostic again. So, the read stream changes might have to be pushed
down into the heap AM now, but there are a couple of roadblocks, as
Melanie mentioned [1] before.

Quote from Melanie [1]:

On Thu, 11 Apr 2024 at 19:19, Melanie Plageman
<melanieplageman@gmail.com> wrote:

I am working on pushing streaming ANALYZE into heap AM code, and I ran
into a few roadblocks.

If we want ANALYZE to make the ReadStream object in heap_beginscan()
(like the read stream implementation of heap sequential and TID range
scans do), I don't see any way around changing the scan_begin table AM
callback to take a BufferAccessStrategy at the least (and perhaps also
the BlockSamplerData).

read_stream_begin_relation() doesn't just save the
BufferAccessStrategy in the ReadStream, it uses it to set various
other things in the ReadStream object. callback_private_data (which in
ANALYZE's case is the BlockSamplerData) is simply saved in the
ReadStream, so it could be set later, but that doesn't sound very
clean to me.

As such, it seems like a cleaner alternative would be to add a table
AM callback for creating a read stream object that takes the
parameters of read_stream_begin_relation(). But, perhaps it is a bit
late for such additions.

If we do not want to add a new table AM callback as Melanie mentioned,
it is pretty much required to pass the BufferAccessStrategy and
BlockSamplerData to initscan().

It also opens us up to the question of whether or not sequential scan
should use such a callback instead of making the read stream object in
heap_beginscan().

I am happy to write a patch that does any of the above. But, I want to
raise these questions, because perhaps I am simply missing an obvious
alternative solution.

I wonder the same; I could not think of any alternative solution to
this problem.
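
To make the suggested alternative concrete, here is a hypothetical
sketch of such a callback; the name and exact parameter list are
invented for illustration and are not taken from any posted patch:

/*
 * Hypothetical table AM callback (invented name and signature): let
 * the AM construct the ReadStream itself from the pieces analyze.c
 * has, instead of analyze.c calling read_stream_begin_relation()
 * directly.
 */
ReadStream *(*scan_analyze_begin_stream) (TableScanDesc scan,
										  int flags,
										  BufferAccessStrategy bstrategy,
										  ReadStreamBlockNumberCB callback,
										  void *callback_private_data);

A heap implementation would presumably just forward these arguments to
read_stream_begin_relation() on scan->rs_rd, while other AMs could
build whatever stream(s) they need.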

Another quote from Melanie [2] in the same thread:

On Thu, 11 Apr 2024 at 20:48, Melanie Plageman
<melanieplageman@gmail.com> wrote:

I will also say that, had this been 6 months ago, I would probably
suggest we restructure ANALYZE's table AM interface to accommodate
read stream setup and to address a few other things I find odd about
the current code. For example, I think creating a scan descriptor for
the analyze scan in acquire_sample_rows() is quite odd. It seems like
it would be better done in the relation_analyze callback. The
relation_analyze callback saves some state like the callbacks for
acquire_sample_rows() and the Buffer Access Strategy. But at least in
the heap implementation, it just saves them in static variables in
analyze.c. It seems like it would be better to save them in a useful
data structure that could be accessed later. We have access to pretty
much everything we need at that point (in the relation_analyze
callback). I also think heap's implementation of
table_beginscan_analyze() doesn't need most of
heap_beginscan()/initscan(), so doing this instead of something
ANALYZE specific seems more confusing than helpful.

If we want to implement ANALYZE-specific counterparts of
heap_beginscan()/initscan(), we may think of passing the
BufferAccessStrategy and BlockSamplerData to them.

Also, there is an ongoing(?) discussion about a few problems with /
improvements to acquire_sample_rows(), mentioned at the end of the
'Table AM Interface Enhancements' thread [3]. Should we wait for these
discussions to be resolved, or can we resume working on this patch?

Any kind of feedback would be appreciated.

[1]: /messages/by-id/CAAKRu_ZxU6hucckrT1SOJxKfyN7q-K4KU1y62GhDwLBZWG+ROg@mail.gmail.com
[2]: /messages/by-id/CAAKRu_YkphAPNbBR2jcLqnxGhDEWTKhYfLFY=0R_oG5LHBH7Gw@mail.gmail.com
[3]: /messages/by-id/CAPpHfdurb9ycV8udYqM=o0sPS66PJ4RCBM1g-bBpvzUfogY0EA@mail.gmail.com

--
Regards,
Nazir Bilal Yavuz
Microsoft

#21Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Nazir Bilal Yavuz (#20)
2 attachment(s)
Re: Use streaming read API in ANALYZE

Hi,

On Mon, 29 Apr 2024 at 18:41, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

Hi,

On Mon, 8 Apr 2024 at 04:21, Thomas Munro <thomas.munro@gmail.com> wrote:

Pushed. Thanks Bilal and reviewers!

I wanted to discuss what will happen to this patch now that
27bc1772fc8 is reverted. I am continuing this thread but I can create
another thread if you prefer so.

041b96802ef is discussed in the 'Table AM Interface Enhancements'
thread [1]. The main problem discussed about this commit is that the
read stream API is not pushed down into the heap-specific code and,
because of that, the other AM implementations need to use read streams.
To push the read stream API down into the heap-specific code, it is
pretty much required to pass the BufferAccessStrategy and
BlockSamplerData to initscan().

I am sharing the alternative version of this patch. The first patch
just reverts 041b96802ef and the second patch is the alternative
version.

In this alternative version, the read stream API is not pushed down
into the heap-specific code, but it is controlled by the heap-specific
code. The SO_USE_READ_STREAMS_IN_ANALYZE flag is introduced and set in
the heap-specific code if the scan type is 'ANALYZE'. This flag decides
whether the streaming API will be used in ANALYZE. If the flag is set,
the heap AM and the read stream API are used; if it is not set, the
heap AM is not in use and the code falls back to the pre-read-stream
version.

Pros of the alternative version:

* The existing AM implementations other than heap can continue to work
without any change.
* AM implementations other than heap do not need to use read streams.
* The in-core code uses the read stream API and benefits from it.

Cons of the alternative version:

* Six if-branches are added to the acquire_sample_rows() function, and
three of them are in the while loop.
* Because of these changes, the code looks messy.

Any kind of feedback would be appreciated.

[1]: /messages/by-id/CAPpHfdurb9ycV8udYqM=o0sPS66PJ4RCBM1g-bBpvzUfogY0EA@mail.gmail.com

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v13-0001-Revert-Use-streaming-I-O-in-ANALYZE.patch
From 323f28ff979cde8e4dbde8b4654bded74abf1fbc Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Wed, 15 May 2024 00:03:56 +0300
Subject: [PATCH v13 1/2] Revert "Use streaming I/O in ANALYZE."

This commit reverts 041b96802ef.

041b96802ef revised the changes of 27bc1772fc8, but 27bc1772fc8 and
dd1f6b0c172 were reverted together in 6377e12a5a5. So, this commit
reverts 27bc1772fc8, 041b96802ef and dd1f6b0c172 together.

Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
---
 src/include/access/tableam.h             | 26 +++----
 src/backend/access/heap/heapam_handler.c | 38 +++++-----
 src/backend/commands/analyze.c           | 96 ++++++++++++++++++------
 3 files changed, 98 insertions(+), 62 deletions(-)

diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8e583b45cd5..e08b9627f30 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,7 +21,6 @@
 #include "access/sdir.h"
 #include "access/xact.h"
 #include "executor/tuptable.h"
-#include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
 
@@ -655,16 +654,6 @@ typedef struct TableAmRoutine
 									struct VacuumParams *params,
 									BufferAccessStrategy bstrategy);
 
-	/*
-	 * Prepare to analyze the next block in the read stream.  Returns false if
-	 * the stream is exhausted and true otherwise. The scan must have been
-	 * started with SO_TYPE_ANALYZE option.
-	 *
-	 * This routine holds a buffer pin and lock on the heap page.  They are
-	 * held until heapam_scan_analyze_next_tuple() returns false.  That is
-	 * until all the items of the heap page are analyzed.
-	 */
-
 	/*
 	 * Prepare to analyze block `blockno` of `scan`. The scan has been started
 	 * with table_beginscan_analyze().  See also
@@ -683,7 +672,8 @@ typedef struct TableAmRoutine
 	 * isn't one yet.
 	 */
 	bool		(*scan_analyze_next_block) (TableScanDesc scan,
-											ReadStream *stream);
+											BlockNumber blockno,
+											BufferAccessStrategy bstrategy);
 
 	/*
 	 * See table_scan_analyze_next_tuple().
@@ -1721,17 +1711,19 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
 }
 
 /*
- * Prepare to analyze the next block in the read stream. The scan needs to
- * have been  started with table_beginscan_analyze().  Note that this routine
- * might acquire resources like locks that are held until
+ * Prepare to analyze block `blockno` of `scan`. The scan needs to have been
+ * started with table_beginscan_analyze().  Note that this routine might
+ * acquire resources like locks that are held until
  * table_scan_analyze_next_tuple() returns false.
  *
  * Returns false if block is unsuitable for sampling, true otherwise.
  */
 static inline bool
-table_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
+table_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
+							  BufferAccessStrategy bstrategy)
 {
-	return scan->rs_rd->rd_tableam->scan_analyze_next_block(scan, stream);
+	return scan->rs_rd->rd_tableam->scan_analyze_next_block(scan, blockno,
+															bstrategy);
 }
 
 /*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 6f8b1b79298..a9c8cd4306c 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -46,6 +46,12 @@
 #include "utils/builtins.h"
 #include "utils/rel.h"
 
+static TM_Result heapam_tuple_lock(Relation relation, ItemPointer tid,
+								   Snapshot snapshot, TupleTableSlot *slot,
+								   CommandId cid, LockTupleMode mode,
+								   LockWaitPolicy wait_policy, uint8 flags,
+								   TM_FailureData *tmfd);
+
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 									 Relation OldHeap, Relation NewHeap,
 									 Datum *values, bool *isnull, RewriteState rwstate);
@@ -993,36 +999,28 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	pfree(isnull);
 }
 
-/*
- * Prepare to analyze the next block in the read stream.  Returns false if
- * the stream is exhausted and true otherwise. The scan must have been started
- * with SO_TYPE_ANALYZE option.
- *
- * This routine holds a buffer pin and lock on the heap page.  They are held
- * until heapam_scan_analyze_next_tuple() returns false.  That is until all the
- * items of the heap page are analyzed.
- */
 static bool
-heapam_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
+heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
+							   BufferAccessStrategy bstrategy)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
 
 	/*
 	 * We must maintain a pin on the target page's buffer to ensure that
 	 * concurrent activity - e.g. HOT pruning - doesn't delete tuples out from
-	 * under us.  It comes from the stream already pinned.   We also choose to
-	 * hold sharelock on the buffer throughout --- we could release and
-	 * re-acquire sharelock for each tuple, but since we aren't doing much
-	 * work per tuple, the extra lock traffic is probably better avoided.
+	 * under us.  Hence, pin the page until we are done looking at it.  We
+	 * also choose to hold sharelock on the buffer throughout --- we could
+	 * release and re-acquire sharelock for each tuple, but since we aren't
+	 * doing much work per tuple, the extra lock traffic is probably better
+	 * avoided.
 	 */
-	hscan->rs_cbuf = read_stream_next_buffer(stream, NULL);
-	if (!BufferIsValid(hscan->rs_cbuf))
-		return false;
-
+	hscan->rs_cblock = blockno;
+	hscan->rs_cindex = FirstOffsetNumber;
+	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
+										blockno, RBM_NORMAL, bstrategy);
 	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
 
-	hscan->rs_cblock = BufferGetBlockNumber(hscan->rs_cbuf);
-	hscan->rs_cindex = FirstOffsetNumber;
+	/* in heap all blocks can contain tuples, so always return true */
 	return true;
 }
 
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7d2cd249972..8a82af4a4ca 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1102,20 +1102,6 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 	return stats;
 }
 
-/*
- * Read stream callback returning the next BlockNumber as chosen by the
- * BlockSampling algorithm.
- */
-static BlockNumber
-block_sampling_read_stream_next(ReadStream *stream,
-								void *callback_private_data,
-								void *per_buffer_data)
-{
-	BlockSamplerData *bs = callback_private_data;
-
-	return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
-}
-
 /*
  * acquire_sample_rows -- acquire a random sample of rows from the table
  *
@@ -1168,7 +1154,10 @@ acquire_sample_rows(Relation onerel, int elevel,
 	TableScanDesc scan;
 	BlockNumber nblocks;
 	BlockNumber blksdone = 0;
-	ReadStream *stream;
+#ifdef USE_PREFETCH
+	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
+	BlockSamplerData prefetch_bs;
+#endif
 
 	Assert(targrows > 0);
 
@@ -1181,6 +1170,13 @@ acquire_sample_rows(Relation onerel, int elevel,
 	randseed = pg_prng_uint32(&pg_global_prng_state);
 	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
 
+#ifdef USE_PREFETCH
+	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
+	/* Create another BlockSampler, using the same seed, for prefetching */
+	if (prefetch_maximum)
+		(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
+#endif
+
 	/* Report sampling block numbers */
 	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
 								 nblocks);
@@ -1191,19 +1187,71 @@ acquire_sample_rows(Relation onerel, int elevel,
 	scan = table_beginscan_analyze(onerel);
 	slot = table_slot_create(onerel, NULL);
 
-	stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
-										vac_strategy,
-										scan->rs_rd,
-										MAIN_FORKNUM,
-										block_sampling_read_stream_next,
-										&bs,
-										0);
+#ifdef USE_PREFETCH
+
+	/*
+	 * If we are doing prefetching, then go ahead and tell the kernel about
+	 * the first set of pages we are going to want.  This also moves our
+	 * iterator out ahead of the main one being used, where we will keep it so
+	 * that we're always pre-fetching out prefetch_maximum number of blocks
+	 * ahead.
+	 */
+	if (prefetch_maximum)
+	{
+		for (int i = 0; i < prefetch_maximum; i++)
+		{
+			BlockNumber prefetch_block;
+
+			if (!BlockSampler_HasMore(&prefetch_bs))
+				break;
+
+			prefetch_block = BlockSampler_Next(&prefetch_bs);
+			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
+		}
+	}
+#endif
 
 	/* Outer loop over blocks to sample */
-	while (table_scan_analyze_next_block(scan, stream))
+	while (BlockSampler_HasMore(&bs))
 	{
+		bool		block_accepted;
+		BlockNumber targblock = BlockSampler_Next(&bs);
+#ifdef USE_PREFETCH
+		BlockNumber prefetch_targblock = InvalidBlockNumber;
+
+		/*
+		 * Make sure that every time the main BlockSampler is moved forward
+		 * that our prefetch BlockSampler also gets moved forward, so that we
+		 * always stay out ahead.
+		 */
+		if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
+			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
+#endif
+
 		vacuum_delay_point();
 
+		block_accepted = table_scan_analyze_next_block(scan, targblock, vac_strategy);
+
+#ifdef USE_PREFETCH
+
+		/*
+		 * When pre-fetching, after we get a block, tell the kernel about the
+		 * next one we will want, if there's any left.
+		 *
+		 * We want to do this even if the table_scan_analyze_next_block() call
+		 * above decides against analyzing the block it picked.
+		 */
+		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
+			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
+#endif
+
+		/*
+		 * Don't analyze if table_scan_analyze_next_block() indicated this
+		 * block is unsuitable for analyzing.
+		 */
+		if (!block_accepted)
+			continue;
+
 		while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
 		{
 			/*
@@ -1253,8 +1301,6 @@ acquire_sample_rows(Relation onerel, int elevel,
 									 ++blksdone);
 	}
 
-	read_stream_end(stream);
-
 	ExecDropSingleTupleTableSlot(slot);
 	table_endscan(scan);
 
-- 
2.43.0

v13-0002-Use-streaming-I-O-in-ANALYZE-alternative.patch
From 923f0bd32e2d30bc2a2eec4a5b0e308fc45dc5cf Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Wed, 15 May 2024 16:25:32 +0300
Subject: [PATCH v13 2/2] Use streaming I/O in ANALYZE (alternative).

The ANALYZE command prefetches and reads sample blocks chosen by a
BlockSampler algorithm. Instead of calling [Prefetch|Read]Buffer() for
each block, ANALYZE now uses the streaming API introduced in b5a9b18cd0.

SO_USE_READ_STREAMS_IN_ANALYZE is introduced and used to decide whether
the streaming API will be used in ANALYZE. This flag is set in heap AM
code; other AMs do not set it. If the flag is not set, the ANALYZE code
falls back to the version before the read stream API.

Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
---
 src/include/access/tableam.h             |   3 +
 src/backend/access/heap/heapam.c         |   8 ++
 src/backend/access/heap/heapam_handler.c |  21 ++--
 src/backend/commands/analyze.c           | 131 ++++++++++++++++-------
 4 files changed, 114 insertions(+), 49 deletions(-)

diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e08b9627f30..e30ab077ade 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -69,6 +69,9 @@ typedef enum ScanOptions
 	 * needed. If table data may be needed, set SO_NEED_TUPLES.
 	 */
 	SO_NEED_TUPLES = 1 << 10,
+
+	/* use read streams in ANALYZE */
+	SO_USE_READ_STREAMS_IN_ANALYZE = 1 << 11,
 }			ScanOptions;
 
 /*
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4be0dee4de0..41a5b19d0a2 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1143,6 +1143,14 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 														  0);
 	}
 
+	/*
+	 * Set USE_READ_STREAMS_IN_ANALYZE flag here, then use this flag to decide
+	 * if read streams will be used in the acquire_sample_rows() function.
+	 */
+	if (scan->rs_base.rs_flags & SO_TYPE_ANALYZE)
+	{
+		scan->rs_base.rs_flags |= SO_USE_READ_STREAMS_IN_ANALYZE;
+	}
 
 	return (TableScanDesc) scan;
 }
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a9c8cd4306c..6200a9d9068 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -42,6 +42,7 @@
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
 #include "storage/procarray.h"
+#include "storage/read_stream.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
@@ -1008,19 +1009,19 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
 	/*
 	 * We must maintain a pin on the target page's buffer to ensure that
 	 * concurrent activity - e.g. HOT pruning - doesn't delete tuples out from
-	 * under us.  Hence, pin the page until we are done looking at it.  We
-	 * also choose to hold sharelock on the buffer throughout --- we could
-	 * release and re-acquire sharelock for each tuple, but since we aren't
-	 * doing much work per tuple, the extra lock traffic is probably better
-	 * avoided.
+	 * under us.  It comes from the stream already pinned.   We also choose to
+	 * hold sharelock on the buffer throughout --- we could release and
+	 * re-acquire sharelock for each tuple, but since we aren't doing much
+	 * work per tuple, the extra lock traffic is probably better avoided.
 	 */
-	hscan->rs_cblock = blockno;
-	hscan->rs_cindex = FirstOffsetNumber;
-	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
-										blockno, RBM_NORMAL, bstrategy);
+	hscan->rs_cbuf = read_stream_next_buffer(hscan->rs_read_stream, NULL);
+	if (!BufferIsValid(hscan->rs_cbuf))
+		return false;
+
 	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
 
-	/* in heap all blocks can contain tuples, so always return true */
+	hscan->rs_cblock = BufferGetBlockNumber(hscan->rs_cbuf);
+	hscan->rs_cindex = FirstOffsetNumber;
 	return true;
 }
 
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8a82af4a4ca..d858cc8160a 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -18,6 +18,7 @@
 
 #include "access/detoast.h"
 #include "access/genam.h"
+#include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/relation.h"
 #include "access/table.h"
@@ -1102,6 +1103,20 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 	return stats;
 }
 
+/*
+ * Read stream callback returning the next BlockNumber as chosen by the
+ * BlockSampling algorithm.
+ */
+static BlockNumber
+block_sampling_read_stream_next(ReadStream *stream,
+								void *callback_private_data,
+								void *per_buffer_data)
+{
+	BlockSamplerData *bs = callback_private_data;
+
+	return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
+}
+
 /*
  * acquire_sample_rows -- acquire a random sample of rows from the table
  *
@@ -1146,6 +1161,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	double		deadrows = 0;	/* # dead rows seen */
 	double		rowstoskip = -1;	/* -1 means not set yet */
 	uint32		randseed;		/* Seed for block sampler(s) */
+	bool		use_read_streams = false;
 	BlockNumber totalblocks;
 	TransactionId OldestXmin;
 	BlockSamplerData bs;
@@ -1170,13 +1186,6 @@ acquire_sample_rows(Relation onerel, int elevel,
 	randseed = pg_prng_uint32(&pg_global_prng_state);
 	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
 
-#ifdef USE_PREFETCH
-	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
-	/* Create another BlockSampler, using the same seed, for prefetching */
-	if (prefetch_maximum)
-		(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
-#endif
-
 	/* Report sampling block numbers */
 	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
 								 nblocks);
@@ -1187,62 +1196,105 @@ acquire_sample_rows(Relation onerel, int elevel,
 	scan = table_beginscan_analyze(onerel);
 	slot = table_slot_create(onerel, NULL);
 
-#ifdef USE_PREFETCH
+	/*
+	 * SO_USE_READ_STREAMS_IN_ANALYZE flag is set in heap AM constructor,
+	 * which means heap AMs will be used. So, it is safe to use read streams
+	 * now.
+	 */
+	use_read_streams = scan->rs_flags & SO_USE_READ_STREAMS_IN_ANALYZE;
+
+	if (use_read_streams)
+	{
+		((HeapScanDesc) scan)->rs_read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
+																		   vac_strategy,
+																		   scan->rs_rd,
+																		   MAIN_FORKNUM,
+																		   block_sampling_read_stream_next,
+																		   &bs,
+																		   0);
+	}
 
 	/*
-	 * If we are doing prefetching, then go ahead and tell the kernel about
-	 * the first set of pages we are going to want.  This also moves our
-	 * iterator out ahead of the main one being used, where we will keep it so
-	 * that we're always pre-fetching out prefetch_maximum number of blocks
-	 * ahead.
+	 * Read streams have their own prefetch mechanism, so do not prefetch when
+	 * the read streams are used. This applies for all of the prefetch code in
+	 * this function.
 	 */
-	if (prefetch_maximum)
+#ifdef USE_PREFETCH
+	if (!use_read_streams)
 	{
-		for (int i = 0; i < prefetch_maximum; i++)
+		prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
+		/* Create another BlockSampler, using the same seed, for prefetching */
+		if (prefetch_maximum)
+			(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
+	}
+#endif
+
+#ifdef USE_PREFETCH
+	if (!use_read_streams)
+	{
+		/*
+		 * If we are doing prefetching, then go ahead and tell the kernel
+		 * about the first set of pages we are going to want.  This also moves
+		 * our iterator out ahead of the main one being used, where we will
+		 * keep it so that we're always pre-fetching out prefetch_maximum
+		 * number of blocks ahead.
+		 */
+		if (prefetch_maximum)
 		{
-			BlockNumber prefetch_block;
+			for (int i = 0; i < prefetch_maximum; i++)
+			{
+				BlockNumber prefetch_block;
 
-			if (!BlockSampler_HasMore(&prefetch_bs))
-				break;
+				if (!BlockSampler_HasMore(&prefetch_bs))
+					break;
 
-			prefetch_block = BlockSampler_Next(&prefetch_bs);
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
+				prefetch_block = BlockSampler_Next(&prefetch_bs);
+				PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
+			}
 		}
 	}
 #endif
 
 	/* Outer loop over blocks to sample */
-	while (BlockSampler_HasMore(&bs))
+	while (nblocks)
 	{
 		bool		block_accepted;
-		BlockNumber targblock = BlockSampler_Next(&bs);
+		BlockNumber targblock = InvalidBlockNumber;
 #ifdef USE_PREFETCH
 		BlockNumber prefetch_targblock = InvalidBlockNumber;
 
-		/*
-		 * Make sure that every time the main BlockSampler is moved forward
-		 * that our prefetch BlockSampler also gets moved forward, so that we
-		 * always stay out ahead.
-		 */
-		if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
-			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
+		if (!use_read_streams)
+		{
+			/*
+			 * Make sure that every time the main BlockSampler is moved
+			 * forward that our prefetch BlockSampler also gets moved forward,
+			 * so that we always stay out ahead.
+			 */
+			if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs))
+				prefetch_targblock = BlockSampler_Next(&prefetch_bs);
+		}
 #endif
 
+		if (!use_read_streams)
+			targblock = BlockSampler_Next(&bs);
+
 		vacuum_delay_point();
 
 		block_accepted = table_scan_analyze_next_block(scan, targblock, vac_strategy);
 
 #ifdef USE_PREFETCH
-
-		/*
-		 * When pre-fetching, after we get a block, tell the kernel about the
-		 * next one we will want, if there's any left.
-		 *
-		 * We want to do this even if the table_scan_analyze_next_block() call
-		 * above decides against analyzing the block it picked.
-		 */
-		if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
-			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
+		if (!use_read_streams)
+		{
+			/*
+			 * When pre-fetching, after we get a block, tell the kernel about
+			 * the next one we will want, if there's any left.
+			 *
+			 * We want to do this even if the table_scan_analyze_next_block()
+			 * call above decides against analyzing the block it picked.
+			 */
+			if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber)
+				PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
+		}
 #endif
 
 		/*
@@ -1299,6 +1351,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 
 		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
 									 ++blksdone);
+		nblocks--;
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
-- 
2.43.0

#22Melanie Plageman
melanieplageman@gmail.com
In reply to: Nazir Bilal Yavuz (#21)
Re: Use streaming read API in ANALYZE

On Wed, May 15, 2024 at 2:18 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

On Mon, 29 Apr 2024 at 18:41, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

On Mon, 8 Apr 2024 at 04:21, Thomas Munro <thomas.munro@gmail.com> wrote:
I wanted to discuss what will happen to this patch now that
27bc1772fc8 is reverted. I am continuing this thread but I can create
another thread if you prefer so.

041b96802ef is discussed in the 'Table AM Interface Enhancements'
thread [1]. The main problems discussed about this commit is that the
read stream API is not pushed to the heap-specific code and, because
of that, the other AM implementations need to use read streams. To
push read stream API to the heap-specific code, it is pretty much
required to pass BufferAccessStrategy and BlockSamplerData to the
initscan().

I am sharing the alternative version of this patch. The first patch
just reverts 041b96802ef and the second patch is the alternative
version.

In this alternative version, the read stream API is not pushed to the
heap-specific code, but it is controlled by the heap-specific code.
The SO_USE_READ_STREAMS_IN_ANALYZE flag is introduced and set in the
heap-specific code if the scan type is 'ANALYZE'. This flag is used to
decide whether streaming API in ANALYZE will be used or not. If this
flag is set, this means heap AMs and read stream API will be used. If
it is not set, this means heap AMs will not be used and code falls
back to the version before read streams.

Personally, I think the alternative version here is the best option
other than leaving what is in master. However, I would vote for keeping
what is in master because of 1) where we are in the release timeline
and 2) the fact that the acquire_sample_rows() code, before streaming
read, was totally block-based anyway.

If we kept what was in master, do we need to document for table AMs
how to use read_stream_next_buffer() or can we assume they will look
at the heap AM implementation?
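
For context, a minimal sketch of the consumer side under what is in
master, mirroring the heap implementation in the patch hunks upthread
(MyScanDesc and its fields are placeholders):

/*
 * Sketch of a table AM's scan_analyze_next_block() under the in-master
 * interface: pull the next pinned buffer from the stream and return
 * false when the sampler is exhausted.
 */
static bool
my_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
{
	MyScanDesc *mscan = (MyScanDesc *) scan;

	mscan->cbuf = read_stream_next_buffer(stream, NULL);
	if (!BufferIsValid(mscan->cbuf))
		return false;			/* stream (and sampler) exhausted */

	LockBuffer(mscan->cbuf, BUFFER_LOCK_SHARE);
	mscan->cblock = BufferGetBlockNumber(mscan->cbuf);
	return true;
}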

- Melanie

#23Mats Kindahl
mats@timescale.com
In reply to: Melanie Plageman (#22)
1 attachment(s)
Re: Use streaming read API in ANALYZE

On Mon, May 20, 2024 at 10:46 PM Melanie Plageman <melanieplageman@gmail.com>
wrote:

On Wed, May 15, 2024 at 2:18 PM Nazir Bilal Yavuz <byavuz81@gmail.com>
wrote:

On Mon, 29 Apr 2024 at 18:41, Nazir Bilal Yavuz <byavuz81@gmail.com>
wrote:

On Mon, 8 Apr 2024 at 04:21, Thomas Munro <thomas.munro@gmail.com>
wrote:

I wanted to discuss what will happen to this patch now that
27bc1772fc8 is reverted. I am continuing this thread but I can create
another thread if you prefer so.

041b96802ef is discussed in the 'Table AM Interface Enhancements'
thread [1]. The main problems discussed about this commit is that the
read stream API is not pushed to the heap-specific code and, because
of that, the other AM implementations need to use read streams. To
push read stream API to the heap-specific code, it is pretty much
required to pass BufferAccessStrategy and BlockSamplerData to the
initscan().

I am sharing the alternative version of this patch. The first patch
just reverts 041b96802ef and the second patch is the alternative
version.

In this alternative version, the read stream API is not pushed to the
heap-specific code, but it is controlled by the heap-specific code.
The SO_USE_READ_STREAMS_IN_ANALYZE flag is introduced and set in the
heap-specific code if the scan type is 'ANALYZE'. This flag is used to
decide whether streaming API in ANALYZE will be used or not. If this
flag is set, this means heap AMs and read stream API will be used. If
it is not set, this means heap AMs will not be used and code falls
back to the version before read streams.

Personally, I think the alternative version here is the best option
other than leaving what is in master. However, I would vote for
keeping what is in master because 1) where we are in the release
timeline and 2) the acquire_sample_rows() code, before streaming read,
was totally block-based anyway.

If we kept what was in master, do we need to document for table AMs
how to use read_stream_next_buffer() or can we assume they will look
at the heap AM implementation?

Hi all,

I ran into this with the PG17 beta3, and for our use-case we need to
set up another stream (using a different relation and/or fork, but the
same strategy) in addition to the one that is passed in to
scan_analyze_next_block(), so to be able to do that it is necessary to
have the block sampler and the strategy from the original stream. Given
that this makes it very difficult (see below) to set up a different
ReadStream inside the TAM unless you have the BlockSampler and the
BufferAccessStrategy, and that the old interface did not have this
problem, I would consider this a regression.

This would be possible to solve in a few different ways:

1. The alternate version proposed by Nazir allows you to decide which
interface to use.
2. Reverting the patch entirely would also solve the problem.
3. Passing down the block sampler and the strategy to scan_begin() and
move the ReadStream setup in analyze.c into initscan() in heapam.c, but
this requires adding new parameters to this function.
4. Having accessors that allow you to get the block sampler and strategy
from the ReadStream object.

The proposed solution 1 above would still not solve the problem of
allowing a user to set up a different or extra ReadStream if they want
to use the new ReadStream interface. Reverting the ReadStream patch
entirely would also deal with the regression, but I find the ReadStream
interface very elegant, since it moves the block sampling into a
separate abstraction, and I would like to use it; right now, however,
there are some limitations if you want to use it fully. The third
solution above would allow that, but it requires a change in the
signature of scan_begin(), which might not be the best at this stage of
development. Proposal 4 would allow you to construct a new stream based
on the old one and might be a simple alternative solution as well, with
fewer changes to the current code.

It is possible to capture the information in ProcessUtility() and
re-compute all the parameters, but that is quite a lot of work to get
right, especially since these computations are scattered across
different functions at different stages. (For example, the variable
ring_size, needed to set up the buffer access strategy, is computed in
ExecVacuum(); the variable targrows, used to set up the block sampler,
is computed inside acquire_sample_rows(), which in turn requires
deciding which attributes to analyze, which is done in
do_analyze_rel().)

It would be great if this could be fixed before the PG17 release now that
27bc1772fc8 was reverted.
--
Best wishes,
Mats Kindahl, Timescale

Attachments:

0001-Add-accessors-for-ReadStream.v1.patch
From f684ba1a941912bfe57c256a080cb8f6e49e8871 Mon Sep 17 00:00:00 2001
From: Mats Kindahl <mats@timescale.com>
Date: Thu, 22 Aug 2024 09:25:30 +0200
Subject: Add accessors for ReadStream

Add accessors to be able to retrieve the buffer access strategy as well
as the callback with its private data.
---
 src/backend/storage/aio/read_stream.c | 19 +++++++++++++++++++
 src/include/storage/read_stream.h     |  4 ++++
 2 files changed, 23 insertions(+)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 684b6d6fc0..02c9ad229f 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -852,3 +852,22 @@ read_stream_end(ReadStream *stream)
 	read_stream_reset(stream);
 	pfree(stream);
 }
+
+BufferAccessStrategy
+read_stream_get_access_strategy(ReadStream *stream)
+{
+	/*
+	 * All the streams have the same strategy and there is at least one, so we
+	 * can safely get the first one.
+	 */
+	return stream->ios[0].op.strategy;
+}
+
+void
+read_stream_get_callback(ReadStream *stream,
+						 ReadStreamBlockNumberCB *callback,
+						 void **callback_private_data)
+{
+	*callback = stream->callback;
+	*callback_private_data = stream->callback_private_data;
+}
diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
index 4e599904f2..0da73719e5 100644
--- a/src/include/storage/read_stream.h
+++ b/src/include/storage/read_stream.h
@@ -69,5 +69,9 @@ extern ReadStream *read_stream_begin_smgr_relation(int flags,
 extern Buffer read_stream_next_buffer(ReadStream *stream, void **per_buffer_private);
 extern void read_stream_reset(ReadStream *stream);
 extern void read_stream_end(ReadStream *stream);
+extern BufferAccessStrategy read_stream_get_access_strategy(ReadStream *stream);
+extern void read_stream_get_callback(ReadStream *stream,
+									 ReadStreamBlockNumberCB *callback,
+									 void **callback_private_data);
 
 #endif							/* READ_STREAM_H */
-- 
2.34.1
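
For illustration, a sketch of how a table AM might use the proposed
accessors to set up a second stream over a side relation, reusing the
strategy of the stream that ANALYZE created; everything other than the
two accessors above is a placeholder:

/*
 * Sketch: build a second stream over a side relation, reusing the
 * strategy of the stream that analyze.c created.  side_cb and
 * side_cb_data are placeholders for a callback driving a second block
 * sampler.
 */
static ReadStream *
begin_side_stream(ReadStream *main_stream, Relation side_rel,
				  ReadStreamBlockNumberCB side_cb, void *side_cb_data)
{
	BufferAccessStrategy strategy;

	strategy = read_stream_get_access_strategy(main_stream);

	return read_stream_begin_relation(READ_STREAM_MAINTENANCE,
									  strategy,
									  side_rel,
									  MAIN_FORKNUM,
									  side_cb,
									  side_cb_data,
									  0);
}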

#24Thomas Munro
thomas.munro@gmail.com
In reply to: Mats Kindahl (#23)
Re: Use streaming read API in ANALYZE

On Thu, Aug 22, 2024 at 7:31 PM Mats Kindahl <mats@timescale.com> wrote:

The alternate version proposed by Nazir allows you to decide which interface to use.
Reverting the patch entirely would also solve the problem.
Passing down the block sampler and the strategy to scan_begin() and move the ReadStream setup in analyze.c into initscan() in heapam.c, but this requires adding new parameters to this function.
Having accessors that allow you to get the block sampler and strategy from the ReadStream object.

I'm a bit confused about how it can make sense to use the same
BlockSampler with a side relation/fork. Could you point me at the
code?

It would be great if this could be fixed before the PG17 release now that 27bc1772fc8 was reverted.

Ack. Thinking...

Random thought: is there a wiki page or something where we can find
out about all the table AM projects? For the successor to
27bc1772fc8, I hope they'll be following along.

#25Mats Kindahl
mats@timescale.com
In reply to: Thomas Munro (#24)
1 attachment(s)
Re: Use streaming read API in ANALYZE

On Sat, Aug 24, 2024 at 5:31 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Aug 22, 2024 at 7:31 PM Mats Kindahl <mats@timescale.com> wrote:

The alternate version proposed by Nazir allows you to decide which
interface to use.

Reverting the patch entirely would also solve the problem.

After digging through the code a little more I discovered that
there actually is another one: move the ReadStream struct into
read_stream.h.

Passing down the block sampler and the strategy to scan_begin() and
move the ReadStream setup in analyze.c into initscan() in heapam.c, but
this requires adding new parameters to this function.

Having accessors that allow you to get the block sampler and strategy
from the ReadStream object.

I'm a bit confused about how it can make sense to use the same
BlockSampler with a side relation/fork. Could you point me at the
code?

Sorry, that was a bit unclear. The intention was not to re-use the
block sampler but to set up a new one with parameters from the original
block sampler, which would require access to it. (The strategy is less
of a problem since only one is used.)

To elaborate on the situation:

For the TAM in question we have two different storage areas, both of
which are heaps. Both relations use the same attributes "publicly"
(they are internally different, but we transform them to look the
same). One of the relations is the "default" one and is stored in
rd_rel. In order to run ANALYZE, we need to sample blocks from both
relations, in slightly different ways.

With the old interface, we faked the number of blocks in the
relation_size() callback and claimed that there were N + M blocks. When
asked for a block by block number, we could then easily pick the
correct relation and just forward the call.
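
Roughly like this, as a sketch of the trick under the v16 interface
(all names except the callback's signature are placeholders):

/*
 * Sketch of the v16-era "N + M blocks" trick: relation_size() reports
 * the combined block count, and the per-block callback demultiplexes
 * the block number onto the right relation.  Placeholder names.
 */
static bool
tsdb_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
							 BufferAccessStrategy bstrategy)
{
	TsdbScanDesc *tscan = (TsdbScanDesc *) scan;

	if (blockno < tscan->nblocks_first)
		return analyze_block(tscan->first_rel, blockno, bstrategy);

	/* the remaining M blocks belong to the side relation */
	return analyze_block(tscan->side_rel,
						 blockno - tscan->nblocks_first,
						 bstrategy);
}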

With the new ReadStream API, a read-stream is (automatically) set up on the
"default" relation, but we can set up a separate read-stream inside the TAM
for the other relation. However, the difficulty is in setting it up
correctly:

We cannot use the "fake number of blocks" trick since the read stream
does not only compute the block number but actually tries to read the
buffer from the relation provided when setting up the read stream, so a
block number outside the range of this relation will not be found,
since it is in a different relation.

If we could create our own read stream over both relations, that would
solve it: we could implement the same logic, but direct each read to
the correct relation depending on where we want to read the block.
Unless I am mistaken, there is already support for this, since there is
an array of in-progress I/Os, and it would be trivial to extend this to
more relations+forks if you have access to the structure definition.
The ReadStream struct is, however, opaque, so it is hard to hack around
with it. Just making the struct declaration public would potentially
solve a lot of problems here. (See the attached patch, which is close
to the minimum of what is needed to allow extension writers to tweak
the contents.)

Since both relations are using the same attributes with the same
"analyzability", having that information would be useful to compute the
targrows for setting up the additional stream, but it is computed in
do_analyze_rel() and not further propagated, so it needs to be re-computed
if we want to set up a separate read-stream.

It would be great if this could be fixed before the PG17 release now
that 27bc1772fc8 was reverted.

Ack. Thinking...

Right now I think that just making the ReadStream struct available in
the header file is the best approach. It is a safe and low-risk fix (so
something that can be added in a beta) and will allow extension writers
to hack to their hearts' content. In addition to that, being able to
select which interface to use would also help.

Random thought: is there a wiki page or something where we can find
out about all the table AM projects? For the successor to
27bc1772fc8, I hope they'll be following along.

At this point, unfortunately not; we are quite early in this. Once I
have something, I'll share.
--
Best wishes,
Mats Kindahl, Timescale

Attachments:

0001-Make-ReadStream-struct-non-opaque.v1.patch
From ea4bb194e0dcccac8465b3aa13950f721bde3860 Mon Sep 17 00:00:00 2001
From: Mats Kindahl <mats@timescale.com>
Date: Thu, 29 Aug 2024 10:39:34 +0200
Subject: Make ReadStream struct non-opaque

Move the ReadStream struct and two utility functions from read_stream.c
to read_stream.h to allow extensions to modify the structure if they
need to.
---
 src/backend/storage/aio/read_stream.c |  59 ---------------
 src/include/storage/read_stream.h     | 105 ++++++++++++++++++++++++++
 2 files changed, 105 insertions(+), 59 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index a83c18c2a4..bf2a679037 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -97,65 +97,6 @@
 #include "utils/rel.h"
 #include "utils/spccache.h"
 
-typedef struct InProgressIO
-{
-	int16		buffer_index;
-	ReadBuffersOperation op;
-} InProgressIO;
-
-/*
- * State for managing a stream of reads.
- */
-struct ReadStream
-{
-	int16		max_ios;
-	int16		ios_in_progress;
-	int16		queue_size;
-	int16		max_pinned_buffers;
-	int16		pinned_buffers;
-	int16		distance;
-	bool		advice_enabled;
-
-	/*
-	 * Small buffer of block numbers, useful for 'ungetting' to resolve flow
-	 * control problems when I/Os are split.  Also useful for batch-loading
-	 * block numbers in the fast path.
-	 */
-	BlockNumber blocknums[16];
-	int16		blocknums_count;
-	int16		blocknums_next;
-
-	/*
-	 * The callback that will tell us which block numbers to read, and an
-	 * opaque pointer that will be pass to it for its own purposes.
-	 */
-	ReadStreamBlockNumberCB callback;
-	void	   *callback_private_data;
-
-	/* Next expected block, for detecting sequential access. */
-	BlockNumber seq_blocknum;
-
-	/* The read operation we are currently preparing. */
-	BlockNumber pending_read_blocknum;
-	int16		pending_read_nblocks;
-
-	/* Space for buffers and optional per-buffer private data. */
-	size_t		per_buffer_data_size;
-	void	   *per_buffer_data;
-
-	/* Read operations that have been started but not waited for yet. */
-	InProgressIO *ios;
-	int16		oldest_io_index;
-	int16		next_io_index;
-
-	bool		fast_path;
-
-	/* Circular queue of buffers. */
-	int16		oldest_buffer_index;	/* Next pinned buffer to return */
-	int16		next_buffer_index;	/* Index of next buffer to pin */
-	Buffer		buffers[FLEXIBLE_ARRAY_MEMBER];
-};
-
 /*
  * Return a pointer to the per-buffer data by index.
  */
diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
index 4e599904f2..006ec3feb1 100644
--- a/src/include/storage/read_stream.h
+++ b/src/include/storage/read_stream.h
@@ -50,6 +50,111 @@ typedef BlockNumber (*ReadStreamBlockNumberCB) (ReadStream *stream,
 												void *callback_private_data,
 												void *per_buffer_data);
 
+/*
+ * State of an in-progress read buffer operation.
+ */
+typedef struct InProgressIO
+{
+	int16		buffer_index;
+	ReadBuffersOperation op;
+} InProgressIO;
+
+/*
+ * State for managing a stream of reads.
+ */
+struct ReadStream
+{
+	int16		max_ios;
+	int16		ios_in_progress;
+	int16		queue_size;
+	int16		max_pinned_buffers;
+	int16		pinned_buffers;
+	int16		distance;
+	bool		advice_enabled;
+
+	/*
+	 * Small buffer of block numbers, useful for 'ungetting' to resolve flow
+	 * control problems when I/Os are split.  Also useful for batch-loading
+	 * block numbers in the fast path.
+	 */
+	BlockNumber blocknums[16];
+	int16		blocknums_count;
+	int16		blocknums_next;
+
+	/*
+	 * The callback that will tell us which block numbers to read, and an
+	 * opaque pointer that will be pass to it for its own purposes.
+	 */
+	ReadStreamBlockNumberCB callback;
+	void	   *callback_private_data;
+
+	/* Next expected block, for detecting sequential access. */
+	BlockNumber seq_blocknum;
+
+	/* The read operation we are currently preparing. */
+	BlockNumber pending_read_blocknum;
+	int16		pending_read_nblocks;
+
+	/* Space for buffers and optional per-buffer private data. */
+	size_t		per_buffer_data_size;
+	void	   *per_buffer_data;
+
+	/* Read operations that have been started but not waited for yet. */
+	InProgressIO *ios;
+	int16		oldest_io_index;
+	int16		next_io_index;
+
+	bool		fast_path;
+
+	/* Circular queue of buffers. */
+	int16		oldest_buffer_index;	/* Next pinned buffer to return */
+	int16		next_buffer_index;	/* Index of next buffer to pin */
+	Buffer		buffers[FLEXIBLE_ARRAY_MEMBER];
+};
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow read_stream_unget_block() to work and to allow the
+ * fast path to skip this function and work directly from the array.
+ */
+static inline BlockNumber
+read_stream_get_block(ReadStream *stream, void *per_buffer_data)
+{
+	if (stream->blocknums_next < stream->blocknums_count)
+		return stream->blocknums[stream->blocknums_next++];
+
+	/*
+	 * We only bother to fetch one at a time here (but see the fast path which
+	 * uses more).
+	 */
+	return stream->callback(stream,
+							stream->callback_private_data,
+							per_buffer_data);
+}
+
+/*
+ * In order to deal with short reads in StartReadBuffers(), we sometimes need
+ * to defer handling of a block until later.
+ */
+static inline void
+read_stream_unget_block(ReadStream *stream, BlockNumber blocknum)
+{
+	if (stream->blocknums_next == stream->blocknums_count)
+	{
+		/* Never initialized or entirely consumed.  Re-initialize. */
+		stream->blocknums[0] = blocknum;
+		stream->blocknums_count = 1;
+		stream->blocknums_next = 0;
+	}
+	else
+	{
+		/* Must be the last value return from blocknums array. */
+		Assert(stream->blocknums_next > 0);
+		stream->blocknums_next--;
+		Assert(stream->blocknums[stream->blocknums_next] == blocknum);
+	}
+}
+
 extern ReadStream *read_stream_begin_relation(int flags,
 											  BufferAccessStrategy strategy,
 											  Relation rel,
-- 
2.34.1

#26Thomas Munro
thomas.munro@gmail.com
In reply to: Mats Kindahl (#25)
Re: Use streaming read API in ANALYZE

Thanks for the explanation. I think we should revert it. IMHO it was
a nice clean example of a streaming transformation, but unfortunately
it transformed an API that nobody liked in the first place, and broke
some weird and wonderful workarounds. Let's try again in 18.

#27Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#26)
Re: Use streaming read API in ANALYZE

On Wed, Sep 4, 2024 at 6:38 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Thanks for the explanation. I think we should revert it. IMHO it was
a nice clean example of a streaming transformation, but unfortunately
it transformed an API that nobody liked in the first place, and broke
some weird and wonderful workarounds. Let's try again in 18.

The problem I have with this is that we just released RC1. I suppose
if we have to make this change, it's better to do it sooner rather than
later, but are we sure we want to whack this around this close to the
final release?

--
Robert Haas
EDB: http://www.enterprisedb.com

#28Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#27)
Re: Use streaming read API in ANALYZE

On Thu, Sep 5, 2024 at 3:36 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 4, 2024 at 6:38 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Thanks for the explanation. I think we should revert it. IMHO it was
a nice clean example of a streaming transformation, but unfortunately
it transformed an API that nobody liked in the first place, and broke
some weird and wonderful workarounds. Let's try again in 18.

The problem I have with this is that we just released RC1. I suppose
if we have to make this change it's better to do it sooner than later,
but are we sure we want to whack this around this close to final
release?

I hear you. But I definitely don't want to (and likely can't at this
point) make any of the other proposed changes, and I also don't want
to break Timescale. That seems to leave only one option: go back to
the v16 API for RC2, and hope that the ongoing table AM discussions
for v18 (CF #4866) will fix all the problems for the people whose TAMs
don't quack like a "heap", and the people whose TAMs do and who would
not like to duplicate the code, and the people who want streaming I/O.

#29Mats Kindahl
mats@timescale.com
In reply to: Thomas Munro (#28)
Re: Use streaming read API in ANALYZE

On Thu, Sep 5, 2024 at 1:34 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Sep 5, 2024 at 3:36 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 4, 2024 at 6:38 AM Thomas Munro <thomas.munro@gmail.com>
wrote:

Thanks for the explanation. I think we should revert it. IMHO it was
a nice clean example of a streaming transformation, but unfortunately
it transformed an API that nobody liked in the first place, and broke
some weird and wonderful workarounds. Let's try again in 18.

The problem I have with this is that we just released RC1. I suppose
if we have to make this change it's better to do it sooner than later,
but are we sure we want to whack this around this close to final
release?

I hear you. But I definitely don't want to (and likely can't at this
point) make any of the other proposed changes, and I also don't want
to break Timescale. That seems to leave only one option: go back to
the v16 API for RC2, and hope that the ongoing table AM discussions
for v18 (CF #4866) will fix all the problems for the people whose TAMs
don't quack like a "heap", and the people whose TAMs do and who would
not like to duplicate the code, and the people who want streaming I/O.

Forgive me for asking, but I am not entirely sure why the ReadStream struct
is opaque. The usual reasons are:

- You want to provide an ABI to allow extensions to work with new major
versions without re-compiling. Right now it is necessary to recompile
extensions anyway, so this does not seem to apply. (There are a lot of
other changes that you need when switching versions because of the lack
of a stable ABI for other parts of the code. However, it might be that
the goal is to support it eventually, and then it would make sense to
start making structs opaque.)
- You want to ensure that you can make modifications *inside* a major
version without breaking ABIs and requiring a re-compile. In this case,
you could still follow the safe practice of adding new fields last, not
relying on the size of the struct for anything (e.g., no arrays of
these structures, just pointers to them), etc. However, if you want to
be *very* safe and support very drastic changes inside a major version,
it needs to be opaque, so this could be the reason.

Is it either of these reasons, or is there another reason?

Making the ReadStream API non-opaque (that is, moving the definition to the
header file) would at least solve our problem (unless I am mistaken).
However, I am ignorant about long-term plans which might affect this, so
there might be a good reason to revert it for reasons I am not aware of.
--
Best wishes,
Mats Kindahl, Timescale

#30Thomas Munro
thomas.munro@gmail.com
In reply to: Mats Kindahl (#29)
Re: Use streaming read API in ANALYZE

On Thu, Sep 5, 2024 at 6:45 PM Mats Kindahl <mats@timescale.com> wrote:

Forgive me for asking, but I am not entirely sure why the ReadStream struct is opaque. The usual reasons are:

You want to provide an ABI to allow extensions to work with new major versions without re-compiling. Right now it is necessary to recompile extensions anyway, this does not seem to apply. (Because there are a lot of other changes that you need when switching versions because of the lack of a stable ABI for other parts of the code. However, it might be that the goal is to support it eventually, and then it would make sense to start making structs opaque.)
You want to ensure that you can make modifications inside a major version without breaking ABIs and requiring a re-compile. In this case, you could still follow safe practice of adding new fields last, not relying on the size of the struct for anything (e.g., no arrays of these structures, just pointers to them), etc. However, if you want to be very safe and support very drastic changes inside a major version, it needs to be opaque, so this could be the reason.

Is it either of these reasons, or is there another reason?

Making the ReadStream API non-opaque (that is, moving the definition to the header file) would at least solve our problem (unless I am mistaken). However, I am ignorant about long-term plans which might affect this, so there might be a good reason to revert it for reasons I am not aware of.

The second thing. Also, there are very active plans [1] to change the
internal design of ReadStream in 18, since the goal is to drive true
asynchronous I/O. The idea of ReadStream was to create a simple API to
let many consumers start using it, so that we can drive efficient
modern system interfaces below that API, so having people depend on how
it works internally would not be great.

But let's talk about how that would actually look, for example if we
exposed the struct or you took a photocopy of it... I think your idea
must be something like: if you could access struct ReadStream's
internals, you could replace stream->callback with an interceptor
callback, and if the BlockSampler had been given the fake N + M
relation size, the interceptor could overwrite
stream->ios[next_io_index].op.smgr and return x - N if the intercepted
callback returned x >= N. (Small detail: we'd need to check
stream->fast_path and use 0 instead, or something like that, but maybe
we could change that.) One minor problem that jumps out is that
read_stream.c could inappropriately merge blocks from the two relations
into one I/O. Hmm, I guess you'd have to teach the interceptor not to
allow that: if switching between the two relations, and if the block
number would coincide with stream->pending_read_blocknum +
stream->pending_read_nblocks, it would need to pick a new block instead
(interfering with the block sampling algorithm, but only very rarely).
Is this what you had in mind, or something else?
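
Spelled out, the interceptor might look something like this untested
sketch (placeholder names; it assumes the ReadStream internals posted
earlier in this thread, and it ignores the fast_path and I/O-merging
caveats just mentioned):

/*
 * Untested sketch of the interceptor idea.  The BlockSampler was
 * primed with N + M blocks; block numbers >= N are redirected to the
 * side relation's storage.  InterceptorData and its fields are
 * placeholders.
 */
static BlockNumber
interceptor_cb(ReadStream *stream, void *callback_private_data,
			   void *per_buffer_data)
{
	InterceptorData *d = callback_private_data;
	BlockNumber blkno;

	blkno = d->orig_callback(stream, d->orig_private_data, per_buffer_data);
	if (blkno == InvalidBlockNumber || blkno < d->nblocks_first)
		return blkno;			/* main relation, pass through */

	/* redirect the I/O being formed to the side relation's storage */
	stream->ios[stream->next_io_index].op.smgr = d->side_smgr;
	return blkno - d->nblocks_first;
}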

(BTW I have a patch to teach read_stream.c about multi-smgr-relation
streams, by adding a different constructor with a different callback
that returns smgr, fork, block instead of just the block, but it
didn't make it into 17.)
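
For reference, a guess at the shape of that unposted constructor's
callback; the signature here is invented:

/*
 * Invented sketch of a multi-relation stream callback: instead of just
 * a block number, the callback also reports which storage and fork the
 * block lives in.
 */
typedef BlockNumber (*ReadStreamSMgrBlockCB) (ReadStream *stream,
											  void *callback_private_data,
											  void *per_buffer_data,
											  SMgrRelation *smgr,
											  ForkNumber *forknum);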

[1]: /messages/by-id/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah@brqs62irg4dt

#31Michael Banck
mbanck@gmx.net
In reply to: Thomas Munro (#30)
Re: Use streaming read API in ANALYZE

Hi,

On Thu, Sep 05, 2024 at 09:12:07PM +1200, Thomas Munro wrote:

On Thu, Sep 5, 2024 at 6:45 PM Mats Kindahl <mats@timescale.com> wrote:

Making the ReadStream API non-opaque (that is, moving the definition
to the header file) would at least solve our problem (unless I am
mistaken). However, I am ignorant about long-term plans which might
affect this, so there might be a good reason to revert it for
reasons I am not aware of.

The second thing.

I am a bit confused about the status of this thread. Robert mentioned
RC1, so I guess it pertains to v17 but I don't see it on the open item
wiki list?

Does the above mean you are going to revert it for v17, Thomas? And if
so, what exactly? The ANALYZE changes on top of the streaming read API
or something else about that API that is being discussed on this thread?

I am also asking because this feature (i.e. Use streaming read API in
ANALYZE) is being mentioned in the release announcement and that was
just frozen for translations.

Michael

#32Thomas Munro
thomas.munro@gmail.com
In reply to: Michael Banck (#31)
1 attachment(s)
Re: Use streaming read API in ANALYZE

On Tue, Sep 10, 2024 at 3:36 AM Michael Banck <mbanck@gmx.net> wrote:

I am a bit confused about the status of this thread. Robert mentioned
RC1, so I guess it pertains to v17 but I don't see it on the open item
wiki list?

Yes, v17. Alright, I'll add an item.

Does the above mean you are going to revert it for v17, Thomas? And if
so, what exactly? The ANALYZE changes on top of the streaming read API
or something else about that API that is being discussed on this thread?

I might have been a little pessimistic in that assessment. Another
workaround that seems an awful lot cleaner and less invasive would be
to offer a new ReadStream API function that provides access to block
numbers and the strategy, ie the arguments of v16's
scan_analyze_next_block() function. Mats, what do you think about
this? (I haven't tried to preserve the prefetching behaviour, which
probably didn't actually work for you in v16 anyway, at a guess;
I'm just looking for the absolute simplest thing we can do to resolve
this API mismatch.) Timescale could then continue to use its v16
coding to handle the two-relations-in-a-trenchcoat problem, and we
could continue discussing how to make v18 better.

I looked briefly at another non-heap-like table AM, the Citus Columnar
TAM. I am not familiar with that code and haven't studied it deeply
this morning, but its _next_block() currently just returns true, so I
think it will somehow need to change to counting calls and returning
false when it thinks it's been called enough times (otherwise the loop
in acquire_sample_rows() won't terminate, I think?). I suppose an
easy way to do that without generating extra I/O or having to think
hard about how to preserve the loop count from v16 would be to use
this function.
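
For concreteness, a sketch of how such a TAM's callback could use it,
with the v17 scan_analyze_next_block() signature; ColumnarAnalyzeState
and current_block are hypothetical:

static bool
columnar_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
{
	ColumnarAnalyzeState *state = (ColumnarAnalyzeState *) scan;
	BufferAccessStrategy strategy;
	BlockNumber blockno = read_stream_next_block(stream, &strategy);

	/* No buffered I/O here; the stream only drives loop termination. */
	if (blockno == InvalidBlockNumber)
		return false;			/* sampling loop ends */

	state->current_block = blockno;
	return true;
}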

I think there are broadly three categories of TAMs with respect to
ANALYZE block sampling: those that are very heap-like (blocks of one
SMgrRelation) and can just use the stream directly, those that are not
at all heap-like (doing something completely different to sample
tuples and ignoring the block aspect but using _next_block() to
control the loop), and then Timescale's case which is sort of
somewhere in between: almost heap-like from the point of view of this
sampling code, ie working with blocks, but fudging the meaning of
block numbers, which we didn't anticipate. (I wonder if it fails to
sample fairly across the underlying relation boundary anyway because
their data densities must surely be quite different, but that's not
what we're here to talk about.)

. o O { We need that wiki page listing TAMs with links to the open
source ones... }

Attachments:

0001-Allow-raw-block-numbers-to-be-read-from-ReadStream.patch (text/x-patch)
From db05a33e742cb292b1a2d44665582042fcd05d2f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 10 Sep 2024 10:15:33 +1200
Subject: [PATCH] Allow raw block numbers to be read from ReadStream.

---
 src/backend/storage/aio/read_stream.c | 14 ++++++++++++++
 src/include/storage/read_stream.h     |  3 +++
 2 files changed, 17 insertions(+)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 7f0e07d9586..cf914a7712f 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -835,3 +835,17 @@ read_stream_end(ReadStream *stream)
 	read_stream_reset(stream);
 	pfree(stream);
 }
+
+/*
+ * Transitional support for code that would like to perform a read directly,
+ * without using the stream.  Returns, and skips, the next block number that
+ * would be read by the stream's look-ahead algorithm, or InvalidBlockNumber
+ * if the end of the stream is reached.  Also reports the strategy that would
+ * be used to read it.
+ */
+BlockNumber
+read_stream_next_block(ReadStream *stream, BufferAccessStrategy *strategy)
+{
+	*strategy = stream->ios[0].op.strategy;
+	return read_stream_get_block(stream, NULL);
+}
diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
index 2f94ee744b9..37b51934f8d 100644
--- a/src/include/storage/read_stream.h
+++ b/src/include/storage/read_stream.h
@@ -68,6 +68,9 @@ extern ReadStream *read_stream_begin_relation(int flags,
 											  void *callback_private_data,
 											  size_t per_buffer_data_size);
 extern Buffer read_stream_next_buffer(ReadStream *stream, void **per_buffer_data);
+extern BlockNumber read_stream_next_block(ReadStream *stream,
+										  BufferAccessStrategy *strategy);
+
 extern ReadStream *read_stream_begin_smgr_relation(int flags,
 												   BufferAccessStrategy strategy,
 												   SMgrRelation smgr,
-- 
2.46.0

#33Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#32)
Re: Use streaming read API in ANALYZE

On Tue, Sep 10, 2024 at 10:27 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Mats, what do you think about
this? (I haven't tried to preserve the prefetching behaviour, which
probably didn't actually too work for you in v16 anyway at a guess,
I'm just looking for the absolute simplest thing we can do to resolve
this API mismatch.) TimeScale could then continue to use its v16
coding to handle the two-relations-in-a-trenchcoat problem, and we
could continue discussing how to make v18 better.

. o O { Spitballing here: if we add that tiny function I showed to get
you unstuck for v17, then later in v18, if we add a multi-relation
ReadStream constructor/callback (I have a patch somewhere, I want to
propose that as it is needed for streaming recovery), you could
construct a new ReadStream of your own that is daisy-chained from that
one. You could keep using your N + M block numbering scheme if you
want to, and the callback of the new stream could decode the block
numbers and redirect to the appropriate relation + real block number.
That way you'd get I/O concurrency for both relations (for now just
read-ahead advice, but see Andres's AIO v2 thread). That'd
essentially be a more supported version of the 'access the struct
internals' idea (or at least my understanding of what you had in
mind), through daisy-chained streams. A little weird maybe, and maybe
the redesign work will result in something completely
different/better... just a thought... }
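
For illustration, a spitball-level sketch of that daisy-chained
callback; ChainState is hypothetical and the smgr-reporting callback
signature is an assumption about a future multi-relation constructor:

typedef struct ChainState
{
	ReadStream *outer_stream;	/* the stream core ANALYZE set up */
	SMgrRelation plain_smgr;	/* blocks 0 .. N-1 */
	SMgrRelation compressed_smgr;	/* blocks N .. N+M-1 */
	BlockNumber N;
} ChainState;

static BlockNumber
chain_cb(ReadStream *stream, void *callback_private_data,
		 void *per_buffer_data, SMgrRelation *smgr, ForkNumber *forknum)
{
	ChainState *cs = callback_private_data;
	BufferAccessStrategy strategy;
	BlockNumber x = read_stream_next_block(cs->outer_stream, &strategy);

	if (x == InvalidBlockNumber)
		return InvalidBlockNumber;	/* outer sampler is exhausted */

	*forknum = MAIN_FORKNUM;
	if (x >= cs->N)
	{
		*smgr = cs->compressed_smgr;	/* the "hidden" relation */
		return x - cs->N;
	}
	*smgr = cs->plain_smgr;
	return x;
}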

#34Mats Kindahl
mats@timescale.com
In reply to: Thomas Munro (#30)
Re: Use streaming read API in ANALYZE

On Thu, Sep 5, 2024 at 11:12 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Sep 5, 2024 at 6:45 PM Mats Kindahl <mats@timescale.com> wrote:

Forgive me for asking, but I am not entirely sure why the ReadStream
struct is opaque. The usual reasons are:

You want to provide an ABI to allow extensions to work with new major
versions without re-compiling. Right now it is necessary to recompile
extensions anyway, so this does not seem to apply. (Because there are a
lot of other changes that you need when switching versions because of
the lack of a stable ABI for other parts of the code. However, it might
be that the goal is to support it eventually, and then it would make
sense to start making structs opaque.)

You want to ensure that you can make modifications inside a major
version without breaking ABIs and requiring a re-compile. In this case,
you could still follow the safe practice of adding new fields last, not
relying on the size of the struct for anything (e.g., no arrays of
these structures, just pointers to them), etc. However, if you want to
be very safe and support very drastic changes inside a major version,
it needs to be opaque, so this could be the reason.

Is it either of these reasons, or is there another reason?

Making the ReadStream API non-opaque (that is, moving the definition to
the header file) would at least solve our problem (unless I am
mistaken). However, I am ignorant about long-term plans which might
affect this, so there might be good reasons against it that I am not
aware of.

The second thing. Also, there are very active plans[1] to change the
internal design of ReadStream in 18, since the goal is to drive true
asynchronous I/O. The idea of ReadStream was to create a simple
API that many consumers could start using, so that we can drive
efficient modern system interfaces below that API; having people
depend on how it works internally would not be great.

That is understandable, since you usually do not want to have to re-compile
the extension for different minor versions. However, it would be a rare
case with extensions that are meddling with this, so it might not turn out
to be a big problem in reality, as long as it is very clear to all involved
that this might change, and an effort is made to avoid binary
incompatibility from removed fields or changed field types.

But let's talk about how that would actually look, for example if we
exposed the struct or you took a photocopy of it... I think your idea
must be something like: if you could access struct ReadStream's
internals, you could replace stream->callback with an interceptor
callback, and if the BlockSampler had been given the fake N + M
relation size, the interceptor could overwrite
stream->ios[next_io_index].op.smgr and return x - N if the intercepted
callback returned x >= N. (Small detail: need to check
stream->fast_path and use 0 instead or something like that, but maybe
we could change that.)

Yes, this is what I had in mind, but I did not dig too deeply into the code.

One minor problem that jumps out is that
read_stream.c could inappropriately merge blocks from the two
relations into one I/O. Hmm, I guess you'd have to teach the
interceptor not to allow that: if switching between the two relations,
and if the block number would coincide with
stream->pending_read_blocknum + stream->pending_read_nblocks, it would
need to pick a new block instead (interfering with the block sampling
algorithm, but only very rarely). Is this what you had in mind, or
something else?

Hmmm... I didn't look too closely at this. Since the block number comes
from the callback, I guess we could make sure to have a "padding" block
between the regions so that we "break" any run of blocks, which I think
is what you mean by "teach the interceptor not to allow that", but I
would have to write a patch to make sure.
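
A tiny sketch of that guard, for concreteness; pending_read_blocknum
and pending_read_nblocks are private stream fields today, and
switching_relation and pick_other_block() are hypothetical:

/* Never let a block from the other relation extend the pending read,
 * so read_stream.c cannot merge the two relations into one I/O. */
if (switching_relation &&
	blocknum == stream->pending_read_blocknum + stream->pending_read_nblocks)
	blocknum = pick_other_block(sampler);	/* rarely perturbs sampling */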

(BTW I have a patch to teach read_stream.c about multi-smgr-relation
streams, by adding a different constructor with a different callback
that returns smgr, fork, block instead of just the block, but it
didn't make it into 17.)

Without having looked at the patch, this sounds like the correct way to do
it.

[1]: /messages/by-id/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah@brqs62irg4dt

--
Best wishes,
Mats Kindahl, Timescale

#35Mats Kindahl
mats@timescale.com
In reply to: Thomas Munro (#32)
1 attachment(s)
Re: Use streaming read API in ANALYZE

On Tue, Sep 10, 2024 at 12:28 AM Thomas Munro <thomas.munro@gmail.com>
wrote:

On Tue, Sep 10, 2024 at 3:36 AM Michael Banck <mbanck@gmx.net> wrote:

I am a bit confused about the status of this thread. Robert mentioned
RC1, so I guess it pertains to v17 but I don't see it on the open item
wiki list?

Yes, v17. Alright, I'll add an item.

Does the above mean you are going to revert it for v17, Thomas? And if
so, what exactly? The ANALYZE changes on top of the streaming read API
or something else about that API that is being discussed on this thread?

I might have been a little pessimistic in that assessment. Another
workaround that seems an awful lot cleaner and less invasive would be
to offer a new ReadStream API function that provides access to block
numbers and the strategy, ie the arguments of v16's
scan_analyze_next_block() function. Mats, what do you think about
this? (I haven't tried to preserve the prefetching behaviour, which
probably didn't actually work for you in v16 anyway, at a guess;
I'm just looking for the absolute simplest thing we can do to resolve
this API mismatch.) Timescale could then continue to use its v16
coding to handle the two-relations-in-a-trenchcoat problem, and we
could continue discussing how to make v18 better.

In the original code we could call the methods with an "adjusted" block
number, so the entire logic worked as before because we could just
recursively forward the call with modified parameters. This is a little
different with the new API.

I looked briefly at another non-heap-like table AM, the Citus Columnar
TAM. I am not familiar with that code and haven't studied it deeply
this morning, but its _next_block() currently just returns true, so I
think it will somehow need to change to counting calls and returning
false when it thinks it's been called enough times (otherwise the loop
in acquire_sample_rows() won't terminate, I think?). I suppose an
easy way to do that without generating extra I/O or having to think
hard about how to preserve the loop count from v16 would be to use
this function.

Yes, but we are re-using the heapam, so we forward the call to it, which not
only fetches the next block but also reads the buffer. Since you could just
pass in the block number before, it just worked.

As mentioned, we intended to set up a new ReadStream for the "internal"
relation ourselves (I think this is what you mean with "daisy-chain" in the
followup to this mail), but then you need targrows, which is based on
vacattrstats, which is computed with code that is currently either inline
(the loop over the attributes in do_analyze_rel) or static (the
examine_attribute function). We can write our own code for this, but it
would help to have the code that does this work callable, or to be able to
extract parameters from the existing ReadStream to at least get a hint.
This would allow us to just get the vacuum attribute stats for an arbitrary
relation and then run the same computations as in do_analyze_rel. Being
able to do the same for the indexes is less important, since this is an
"internal" relation and the "public" indexes are the ones that matter.
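
For reference, this is roughly what do_analyze_rel() does with those
stats to size the sample, condensed from v17's analyze.c
(std_typanalyze() sets minrows to 300 * attstattarget):

/* Sample size is the maximum minrows over all analyzable columns
 * (the real code also folds in analyzable index expression columns),
 * with a floor of 100 rows. */
int			targrows = 100;

for (int i = 0; i < attr_cnt; i++)
{
	if (targrows < vacattrstats[i]->minrows)
		targrows = vacattrstats[i]->minrows;
}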

I attached a tentative patch for this, just doing some refactorings, and
will see if that is sufficient for the current work by trying to use it. (I
thought I would be able to verify this today, but am a little delayed so
I'm sending this anyway.)

A patch like this is a minimal refactoring so should be safe even in an RC.
I have deliberately not tried to do a more serious refactoring although I
see that there are some duplications when doing the same work with the
indexes and it would probably be possible to make a more generic function
for this.

I think there are broadly three categories of TAMs with respect to
ANALYZE block sampling: those that are very heap-like (blocks of one
SMgrRelation) and can just use the stream directly, those that are not
at all heap-like (doing something completely different to sample
tuples and ignoring the block aspect but using _next_block() to
control the loop), and then Timescale's case which is sort of
somewhere in between: almost heap-like from the point of view of this
sampling code, ie working with blocks, but fudging the meaning of
block numbers, which we didn't anticipate.

In this case the block numbers are only from a different relation, so they
are still valid blocks, just encoded in a funny way. The block numbers
trick is just a hack, but the gist is that we want to sample an
arbitrary number of relations/forks when running analysis, not just the
"front-facing" one.

(I wonder if it fails to
sample fairly across the underlying relation boundary anyway because
their data densities must surely be quite different, but that's not
what we're here to talk about.)

Yes, they are, so this is kind-of-a-hack-to-get-it-roughly-correct. The
ideal scenario would be to be able to run the same analysis as is done
in do_analyze_rel on the "hidden" relation to get an accurate targrows.
This is what I am trying now with the attached patch.

. o O { We need that wiki page listing TAMs with links to the open
source ones... }

--
Best wishes,
Mats Kindahl, Timescale

Attachments:

0001-Add-function-to-compute-vacuum-attribute-statistics.patch.v1 (application/octet-stream)
From fee4f5882971cc353e0b68f6974ab7af0e344ddd Mon Sep 17 00:00:00 2001
From: Mats Kindahl <mats@timescale.com>
Date: Thu, 12 Sep 2024 18:14:04 +0200
Subject: [PATCH] Add function to compute vacuum attribute statistics

Refactor analyze.c so that the code that computes vacuum attribute
statistics is a separate function and available for extensions to use
to set up their own ReadStream for analyze.
---
 src/backend/commands/analyze.c | 120 ++++++++++++++++++---------------
 src/include/commands/vacuum.h  |   2 +
 2 files changed, 69 insertions(+), 53 deletions(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index c590a2adc35..8a402ad15c6 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -269,6 +269,72 @@ analyze_rel(Oid relid, RangeVar *relation,
 	pgstat_progress_end_command();
 }
 
+/*
+ * Determine which columns to analyze
+ *
+ * Note that system attributes are never analyzed, so we just reject them
+ * at the lookup stage.  We also reject duplicate column mentions.  (We
+ * could alternatively ignore duplicates, but analyzing a column twice
+ * won't work; we'd end up making a conflicting update in pg_statistic.)
+ */
+int
+analyze_compute_vacattrstats(Relation onerel, List *va_cols, VacAttrStats ***vacattrstats_out)
+{
+	int			tcnt,
+				i,
+				attr_cnt;
+	VacAttrStats **vacattrstats;
+
+	if (va_cols != NIL)
+	{
+		Bitmapset  *unique_cols = NULL;
+		ListCell   *le;
+
+		vacattrstats = (VacAttrStats **) palloc(list_length(va_cols) *
+												sizeof(VacAttrStats *));
+		tcnt = 0;
+		foreach(le, va_cols)
+		{
+			char	   *col = strVal(lfirst(le));
+
+			i = attnameAttNum(onerel, col, false);
+			if (i == InvalidAttrNumber)
+				ereport(ERROR,
+						(errcode(ERRCODE_UNDEFINED_COLUMN),
+						 errmsg("column \"%s\" of relation \"%s\" does not exist",
+								col, RelationGetRelationName(onerel))));
+			if (bms_is_member(i, unique_cols))
+				ereport(ERROR,
+						(errcode(ERRCODE_DUPLICATE_COLUMN),
+						 errmsg("column \"%s\" of relation \"%s\" appears more than once",
+								col, RelationGetRelationName(onerel))));
+			unique_cols = bms_add_member(unique_cols, i);
+
+			vacattrstats[tcnt] = examine_attribute(onerel, i, NULL);
+			if (vacattrstats[tcnt] != NULL)
+				tcnt++;
+		}
+		attr_cnt = tcnt;
+	}
+	else
+	{
+		attr_cnt = onerel->rd_att->natts;
+		vacattrstats = (VacAttrStats **)
+			palloc(attr_cnt * sizeof(VacAttrStats *));
+		tcnt = 0;
+		for (i = 1; i <= attr_cnt; i++)
+		{
+			vacattrstats[tcnt] = examine_attribute(onerel, i, NULL);
+			if (vacattrstats[tcnt] != NULL)
+				tcnt++;
+		}
+		attr_cnt = tcnt;
+	}
+
+	*vacattrstats_out = vacattrstats;
+	return attr_cnt;
+}
+
 /*
  *	do_analyze_rel() -- analyze one relation, recursively or not
  *
@@ -353,59 +419,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 		starttime = GetCurrentTimestamp();
 	}
 
-	/*
-	 * Determine which columns to analyze
-	 *
-	 * Note that system attributes are never analyzed, so we just reject them
-	 * at the lookup stage.  We also reject duplicate column mentions.  (We
-	 * could alternatively ignore duplicates, but analyzing a column twice
-	 * won't work; we'd end up making a conflicting update in pg_statistic.)
-	 */
-	if (va_cols != NIL)
-	{
-		Bitmapset  *unique_cols = NULL;
-		ListCell   *le;
-
-		vacattrstats = (VacAttrStats **) palloc(list_length(va_cols) *
-												sizeof(VacAttrStats *));
-		tcnt = 0;
-		foreach(le, va_cols)
-		{
-			char	   *col = strVal(lfirst(le));
-
-			i = attnameAttNum(onerel, col, false);
-			if (i == InvalidAttrNumber)
-				ereport(ERROR,
-						(errcode(ERRCODE_UNDEFINED_COLUMN),
-						 errmsg("column \"%s\" of relation \"%s\" does not exist",
-								col, RelationGetRelationName(onerel))));
-			if (bms_is_member(i, unique_cols))
-				ereport(ERROR,
-						(errcode(ERRCODE_DUPLICATE_COLUMN),
-						 errmsg("column \"%s\" of relation \"%s\" appears more than once",
-								col, RelationGetRelationName(onerel))));
-			unique_cols = bms_add_member(unique_cols, i);
-
-			vacattrstats[tcnt] = examine_attribute(onerel, i, NULL);
-			if (vacattrstats[tcnt] != NULL)
-				tcnt++;
-		}
-		attr_cnt = tcnt;
-	}
-	else
-	{
-		attr_cnt = onerel->rd_att->natts;
-		vacattrstats = (VacAttrStats **)
-			palloc(attr_cnt * sizeof(VacAttrStats *));
-		tcnt = 0;
-		for (i = 1; i <= attr_cnt; i++)
-		{
-			vacattrstats[tcnt] = examine_attribute(onerel, i, NULL);
-			if (vacattrstats[tcnt] != NULL)
-				tcnt++;
-		}
-		attr_cnt = tcnt;
-	}
+	attr_cnt = analyze_compute_vacattrstats(onerel, va_cols, &vacattrstats);
 
 	/*
 	 * Open all indexes of the relation, and see if there are any analyzable
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 759f9a87d38..f456734855d 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -378,6 +378,8 @@ extern void parallel_vacuum_main(dsm_segment *seg, shm_toc *toc);
 extern void analyze_rel(Oid relid, RangeVar *relation,
 						VacuumParams *params, List *va_cols, bool in_outer_xact,
 						BufferAccessStrategy bstrategy);
+extern int	analyze_compute_vacattrstats(Relation onerel, List *va_cols,
+										 VacAttrStats ***vacattrstats_out);
 extern bool std_typanalyze(VacAttrStats *stats);
 
 /* in utils/misc/sampling.c --- duplicate of declarations in utils/sampling.h */
-- 
2.43.0

#36Mats Kindahl
mats@timescale.com
In reply to: Thomas Munro (#33)
Re: Use streaming read API in ANALYZE

On Tue, Sep 10, 2024 at 6:04 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Tue, Sep 10, 2024 at 10:27 AM Thomas Munro <thomas.munro@gmail.com>
wrote:

Mats, what do you think about
this? (I haven't tried to preserve the prefetching behaviour, which
probably didn't actually work for you in v16 anyway, at a guess;
I'm just looking for the absolute simplest thing we can do to resolve
this API mismatch.) Timescale could then continue to use its v16
coding to handle the two-relations-in-a-trenchcoat problem, and we
could continue discussing how to make v18 better.

. o O { Spitballing here: if we add that tiny function I showed to get
you unstuck for v17, then later in v18, if we add a multi-relation
ReadStream constructor/callback (I have a patch somewhere, I want to
propose that as it is needed for streaming recovery), you could
construct a new ReadStream of your own that is daisy-chained from that
one. You could keep using your N + M block numbering scheme if you
want to, and the callback of the new stream could decode the block
numbers and redirect to the appropriate relation + real block number.

I think it is good to make changes to the RC as small as possible, so I
agree with this approach. Looking at the patch, I think it will work, but
I'll do some experimentation with it.

Just asking, is there any particular reason why you do not want to *add*
new functions for opaque objects inside a major release? After all, that
was the reason they were opaque from the beginning and extending with new
functions would not break any existing code, not even from the ABI
perspective.

That way you'd get I/O concurrency for both relations (for now just
read-ahead advice, but see Andres's AIO v2 thread). That'd
essentially be a more supported version of the 'access the struct
internals' idea (or at least my understanding of what you had in
mind), through daisy-chained streams. A little weird maybe, and maybe
the redesign work will result in something completely
different/better... just a thought... }

I'll take a look at the thread. I really think the ReadStream abstraction
is a good step in the right direction.
--
Best wishes,
Mats Kindahl, Timescale

#37Mats Kindahl
mats@timescale.com
In reply to: Mats Kindahl (#36)
1 attachment(s)
Re: Use streaming read API in ANALYZE

On Fri, Sep 13, 2024 at 10:10 AM Mats Kindahl <mats@timescale.com> wrote:

On Tue, Sep 10, 2024 at 6:04 AM Thomas Munro <thomas.munro@gmail.com>
wrote:

On Tue, Sep 10, 2024 at 10:27 AM Thomas Munro <thomas.munro@gmail.com>
wrote:

Mats, what do you think about
this? (I haven't tried to preserve the prefetching behaviour, which
probably didn't actually work for you in v16 anyway, at a guess;
I'm just looking for the absolute simplest thing we can do to resolve
this API mismatch.) Timescale could then continue to use its v16
coding to handle the two-relations-in-a-trenchcoat problem, and we
could continue discussing how to make v18 better.

. o O { Spitballing here: if we add that tiny function I showed to get
you unstuck for v17, then later in v18, if we add a multi-relation
ReadStream constructor/callback (I have a patch somewhere, I want to
propose that as it is needed for streaming recovery), you could
construct a new ReadStream of your own that is daisy-chained from that
one. You could keep using your N + M block numbering scheme if you
want to, and the callback of the new stream could decode the block
numbers and redirect to the appropriate relation + real block number.

I think it is good to make changes to the RC as small as possible, so I
agree with this approach. Looking at the patch, I think it will work, but
I'll do some experimentation with it.

Just asking, is there any particular reason why you do not want to *add*
new functions for opaque objects inside a major release? After all, that
was the reason they were opaque from the beginning and extending with new
functions would not break any existing code, not even from the ABI
perspective.

That way you'd get I/O concurrency for both relations (for now just
read-ahead advice, but see Andres's AIO v2 thread). That'd
essentially be a more supported version of the 'access the struct
internals' idea (or at least my understanding of what you had in
mind), through daisy-chained streams. A little weird maybe, and maybe
the redesign work will result in something completely
different/better... just a thought... }

I'll take a look at the thread. I really think the ReadStream abstraction
is a good step in the right direction.
--
Best wishes,
Mats Kindahl, Timescale

Hi Thomas,

I used the combination of your patch and making the computation of
vacattrstats for a relation available through the API and managed to
implement something that I think does the right thing. (I just sampled a
few different statistics to check if they seem reasonable, like most common
vals and most common freqs.) See attached patch.

I need the vacattrstats to set up the two streams for the internal
relations. I can just re-implement them in the same way as is already done,
but this seems like a small change that avoids unnecessary code
duplication.
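
For illustration, a minimal sketch of how the two pieces could combine,
assuming the committed read_stream_next_block() and the proposed
analyze_compute_vacattrstats(); sample_block() and the surrounding
scaffolding are hypothetical:

VacAttrStats **vacattrstats;
BufferAccessStrategy strategy;
BlockNumber blockno;
int			attr_cnt;
int			targrows = 100;

/* Size the sample from the attribute stats, as do_analyze_rel() does. */
attr_cnt = analyze_compute_vacattrstats(onerel, NIL, &vacattrstats);
for (int i = 0; i < attr_cnt; i++)
	targrows = Max(targrows, vacattrstats[i]->minrows);

/* Pull raw block numbers from the stream core ANALYZE hands us. */
while ((blockno = read_stream_next_block(stream, &strategy)) !=
	   InvalidBlockNumber)
	sample_block(onerel, blockno, strategy);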
--
Best wishes,
Mats Kindahl, Timescale

Attachments:

0001-Support-extensions-wanting-to-do-more-advanced-analy.patch (text/x-patch)
From 8acb707cc859f1570046919295817e27f0d8b4f6 Mon Sep 17 00:00:00 2001
From: Mats Kindahl <mats@timescale.com>
Date: Sat, 14 Sep 2024 14:03:35 +0200
Subject: [PATCH] Support extensions wanting to do more advanced analyze

To support extensions that want to do more advanced analysis
implementations, this commit adds two functions:

`analyze_compute_vacattrstats` to compute vacuum attribute stats that
can be used to compute the number of target rows to analyze.

`read_stream_next_block` which allows an extension to just use the
provided analyze stream to sample blocks.
---
 src/backend/commands/analyze.c        | 120 ++++++++++++++------------
 src/backend/storage/aio/read_stream.c |  14 +++
 src/include/commands/vacuum.h         |   2 +
 src/include/storage/read_stream.h     |   2 +
 4 files changed, 85 insertions(+), 53 deletions(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index c590a2adc35..8a402ad15c6 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -269,6 +269,72 @@ analyze_rel(Oid relid, RangeVar *relation,
 	pgstat_progress_end_command();
 }
 
+/*
+ * Determine which columns to analyze
+ *
+ * Note that system attributes are never analyzed, so we just reject them
+ * at the lookup stage.  We also reject duplicate column mentions.  (We
+ * could alternatively ignore duplicates, but analyzing a column twice
+ * won't work; we'd end up making a conflicting update in pg_statistic.)
+ */
+int
+analyze_compute_vacattrstats(Relation onerel, List *va_cols, VacAttrStats ***vacattrstats_out)
+{
+	int			tcnt,
+				i,
+				attr_cnt;
+	VacAttrStats **vacattrstats;
+
+	if (va_cols != NIL)
+	{
+		Bitmapset  *unique_cols = NULL;
+		ListCell   *le;
+
+		vacattrstats = (VacAttrStats **) palloc(list_length(va_cols) *
+												sizeof(VacAttrStats *));
+		tcnt = 0;
+		foreach(le, va_cols)
+		{
+			char	   *col = strVal(lfirst(le));
+
+			i = attnameAttNum(onerel, col, false);
+			if (i == InvalidAttrNumber)
+				ereport(ERROR,
+						(errcode(ERRCODE_UNDEFINED_COLUMN),
+						 errmsg("column \"%s\" of relation \"%s\" does not exist",
+								col, RelationGetRelationName(onerel))));
+			if (bms_is_member(i, unique_cols))
+				ereport(ERROR,
+						(errcode(ERRCODE_DUPLICATE_COLUMN),
+						 errmsg("column \"%s\" of relation \"%s\" appears more than once",
+								col, RelationGetRelationName(onerel))));
+			unique_cols = bms_add_member(unique_cols, i);
+
+			vacattrstats[tcnt] = examine_attribute(onerel, i, NULL);
+			if (vacattrstats[tcnt] != NULL)
+				tcnt++;
+		}
+		attr_cnt = tcnt;
+	}
+	else
+	{
+		attr_cnt = onerel->rd_att->natts;
+		vacattrstats = (VacAttrStats **)
+			palloc(attr_cnt * sizeof(VacAttrStats *));
+		tcnt = 0;
+		for (i = 1; i <= attr_cnt; i++)
+		{
+			vacattrstats[tcnt] = examine_attribute(onerel, i, NULL);
+			if (vacattrstats[tcnt] != NULL)
+				tcnt++;
+		}
+		attr_cnt = tcnt;
+	}
+
+	*vacattrstats_out = vacattrstats;
+	return attr_cnt;
+}
+
 /*
  *	do_analyze_rel() -- analyze one relation, recursively or not
  *
@@ -353,59 +419,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 		starttime = GetCurrentTimestamp();
 	}
 
-	/*
-	 * Determine which columns to analyze
-	 *
-	 * Note that system attributes are never analyzed, so we just reject them
-	 * at the lookup stage.  We also reject duplicate column mentions.  (We
-	 * could alternatively ignore duplicates, but analyzing a column twice
-	 * won't work; we'd end up making a conflicting update in pg_statistic.)
-	 */
-	if (va_cols != NIL)
-	{
-		Bitmapset  *unique_cols = NULL;
-		ListCell   *le;
-
-		vacattrstats = (VacAttrStats **) palloc(list_length(va_cols) *
-												sizeof(VacAttrStats *));
-		tcnt = 0;
-		foreach(le, va_cols)
-		{
-			char	   *col = strVal(lfirst(le));
-
-			i = attnameAttNum(onerel, col, false);
-			if (i == InvalidAttrNumber)
-				ereport(ERROR,
-						(errcode(ERRCODE_UNDEFINED_COLUMN),
-						 errmsg("column \"%s\" of relation \"%s\" does not exist",
-								col, RelationGetRelationName(onerel))));
-			if (bms_is_member(i, unique_cols))
-				ereport(ERROR,
-						(errcode(ERRCODE_DUPLICATE_COLUMN),
-						 errmsg("column \"%s\" of relation \"%s\" appears more than once",
-								col, RelationGetRelationName(onerel))));
-			unique_cols = bms_add_member(unique_cols, i);
-
-			vacattrstats[tcnt] = examine_attribute(onerel, i, NULL);
-			if (vacattrstats[tcnt] != NULL)
-				tcnt++;
-		}
-		attr_cnt = tcnt;
-	}
-	else
-	{
-		attr_cnt = onerel->rd_att->natts;
-		vacattrstats = (VacAttrStats **)
-			palloc(attr_cnt * sizeof(VacAttrStats *));
-		tcnt = 0;
-		for (i = 1; i <= attr_cnt; i++)
-		{
-			vacattrstats[tcnt] = examine_attribute(onerel, i, NULL);
-			if (vacattrstats[tcnt] != NULL)
-				tcnt++;
-		}
-		attr_cnt = tcnt;
-	}
+	attr_cnt = analyze_compute_vacattrstats(onerel, va_cols, &vacattrstats);
 
 	/*
 	 * Open all indexes of the relation, and see if there are any analyzable
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 74b9bae6313..7e25793ea7e 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -804,3 +804,17 @@ read_stream_end(ReadStream *stream)
 	read_stream_reset(stream);
 	pfree(stream);
 }
+
+/*
+ * Transitional support for code that would like to perform a read directly,
+ * without using the stream.  Returns, and skips, the next block number that
+ * would be read by the stream's look-ahead algorithm, or InvalidBlockNumber
+ * if the end of the stream is reached.  Also reports the strategy that would
+ * be used to read it.
+ */
+BlockNumber
+read_stream_next_block(ReadStream *stream, BufferAccessStrategy *strategy)
+{
+	*strategy = stream->ios[0].op.strategy;
+	return read_stream_get_block(stream, NULL);
+}
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 759f9a87d38..f456734855d 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -378,6 +378,8 @@ extern void parallel_vacuum_main(dsm_segment *seg, shm_toc *toc);
 extern void analyze_rel(Oid relid, RangeVar *relation,
 						VacuumParams *params, List *va_cols, bool in_outer_xact,
 						BufferAccessStrategy bstrategy);
+extern int	analyze_compute_vacattrstats(Relation onerel, List *va_cols,
+										 VacAttrStats ***vacattrstats_out);
 extern bool std_typanalyze(VacAttrStats *stats);
 
 /* in utils/misc/sampling.c --- duplicate of declarations in utils/sampling.h */
diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
index f676d2cc20a..7b9005e87bc 100644
--- a/src/include/storage/read_stream.h
+++ b/src/include/storage/read_stream.h
@@ -57,6 +57,8 @@ extern ReadStream *read_stream_begin_relation(int flags,
 											  void *callback_private_data,
 											  size_t per_buffer_data_size);
 extern Buffer read_stream_next_buffer(ReadStream *stream, void **per_buffer_data);
+extern BlockNumber read_stream_next_block(ReadStream *stream,
+										  BufferAccessStrategy *strategy);
 extern void read_stream_reset(ReadStream *stream);
 extern void read_stream_end(ReadStream *stream);
 
-- 
2.43.0

#38Thomas Munro
thomas.munro@gmail.com
In reply to: Mats Kindahl (#37)
Re: Use streaming read API in ANALYZE

On Sun, Sep 15, 2024 at 12:14 AM Mats Kindahl <mats@timescale.com> wrote:

I used the combination of your patch and making the computation of vacattrstats for a relation available through the API and managed to implement something that I think does the right thing. (I just sampled a few different statistics to check if they seem reasonable, like most common vals and most common freqs.) See attached patch.

Cool. I went ahead and committed that small new function and will
mark the open item closed.

I need the vacattrstats to set up the two streams for the internal relations. I can just re-implement them in the same way as is already done, but this seems like a small change that avoids unnecessary code duplication.

Unfortunately we're not in a phase where we can make non-essential
changes; we're right about to release and we're only committing fixes,
and it seems like you have a way forward (albeit with some
duplication). We can keep talking about that for v18.

From your earlier email:

I'll take a look at the thread. I really think the ReadStream abstraction is a good step in the right direction.

Here's something you or your colleagues might be interested in: I was
looking around for a fun extension to streamify as a demo of the
technology, and I finished up writing a quick patch to streamify
pgvector's HNSW index scan, which worked well enough to share[1] (I
think it should in principle be able to scale with the number of graph
connections, at least 16x), but then people told me that it's of
limited interest because everybody knows that HNSW indexes have to fit
in memory (I think there may also be memory prefetch streaming
opportunities, unexamined for now). But that made me wonder what the
people with the REALLY big indexes do for hyperdimensional graph
search on a scale required to build Skynet, and that led me back to
Timescale pgvectorscale[2]. I see two obvious signs that this thing
is eminently and profitably streamifiable: (1) The stated aim is
optimising for indexes that don't fit in memory, hence "Disk" in the
name of the research project it is inspired by, (2) I see that
DiskANN[3] is aggressively using libaio (Linux) and overlapped/IOCP
(Windows). So now I am waiting patiently for a Rustacean to show up
with patches for pgvectorscale to use ReadStream, which would already
get read-ahead advice and vectored I/O (Linux, macOS, FreeBSD soon
hopefully), and hopefully also provide a nice test case for the AIO
patch set which redirects buffer reads through io_uring (Linux,
basically the newer better libaio) or background I/O workers (other
OSes, which works surprisingly competitively). Just BTW for
comparison with DiskANN we have also had early POC-quality patches
that drive AIO with overlapped/IOCP (Windows) which will eventually be
rebased and proposed (Windows isn't really a primary target but we
wanted to validate that the stuff we're working on has abstractions
that will map to the obvious system APIs found in the systems
PostgreSQL targets). For completeness, I've also had it mostly
working on the POSIX AIO of FreeBSD, HP-UX and AIX (though we dropped
support for those last two so that was a bit of a dead end).

[1]: /messages/by-id/CA+hUKGJ_7NKd46nx1wbyXWriuZSNzsTfm+rhEuvU6nxZi3-KVw@mail.gmail.com
[2]: https://github.com/timescale/pgvectorscale
[3]: https://github.com/microsoft/DiskANN

#39Mats Kindahl
mats@timescale.com
In reply to: Thomas Munro (#38)
Re: Use streaming read API in ANALYZE

On Wed, Sep 18, 2024 at 5:13 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sun, Sep 15, 2024 at 12:14 AM Mats Kindahl <mats@timescale.com> wrote:

I used the combination of your patch and making the computation of
vacattrstats for a relation available through the API and managed to
implement something that I think does the right thing. (I just sampled a
few different statistics to check if they seem reasonable, like most common
vals and most common freqs.) See attached patch.

Cool. I went ahead and committed that small new function and will
mark the open item closed.

Thank you Thomas, this will help a lot.

I need the vacattrstats to set up the two streams for the internal
relations. I can just re-implement them in the same way as is already done,
but this seems like a small change that avoids unnecessary code duplication.

Unfortunately we're not in a phase where we can make non-essential
changes; we're right about to release and we're only committing fixes,
and it seems like you have a way forward (albeit with some
duplication). We can keep talking about that for v18.

Yes, I can work around this by re-implementing the same code that is
present in PostgreSQL.

From your earlier email:

I'll take a look at the thread. I really think the ReadStream
abstraction is a good step in the right direction.

Here's something you or your colleagues might be interested in: I was
looking around for a fun extension to streamify as a demo of the
technology, and I finished up writing a quick patch to streamify
pgvector's HNSW index scan, which worked well enough to share[1] (I
think it should in principle be able to scale with the number of graph
connections, at least 16x), but then people told me that it's of
limited interest because everybody knows that HNSW indexes have to fit
in memory (I think there may also be memory prefetch streaming
opportunities, unexamined for now). But that made me wonder what the
people with the REALLY big indexes do for hyperdimensional graph
search on a scale required to build Skynet, and that led me back to
Timescale pgvectorscale[2]. I see two obvious signs that this thing
is eminently and profitably streamifiable: (1) The stated aim is
optimising for indexes that don't fit in memory, hence "Disk" in the
name of the research project it is inspired by, (2) I see that
DiskANN[3] is aggressively using libaio (Linux) and overlapped/IOCP
(Windows). So now I am waiting patiently for a Rustacean to show up
with patches for pgvectorscale to use ReadStream, which would already
get read-ahead advice and vectored I/O (Linux, macOS, FreeBSD soon
hopefully), and hopefully also provide a nice test case for the AIO
patch set which redirects buffer reads through io_uring (Linux,
basically the newer better libaio) or background I/O workers (other
OSes, which works surprisingly competitively). Just BTW for
comparison with DiskANN we have also had early POC-quality patches
that drive AIO with overlapped/IOCP (Windows) which will eventually be
rebased and proposed (Windows isn't really a primary target but we
wanted to validate that the stuff we're working on has abstractions
that will map to the obvious system APIs found in the systems
PostgreSQL targets). For completeness, I've also had it mostly
working on the POSIX AIO of FreeBSD, HP-UX and AIX (though we dropped
support for those last two so that was a bit of a dead end).

[1]: /messages/by-id/CA+hUKGJ_7NKd46nx1wbyXWriuZSNzsTfm+rhEuvU6nxZi3-KVw@mail.gmail.com
[2]: https://github.com/timescale/pgvectorscale
[3]: https://github.com/microsoft/DiskANN

Thanks Thomas, this looks really interesting. I've forwarded it to the
pgvectorscale team.
--
Best wishes,
Mats Kindahl, Timescale