Use read streams in pg_visibility

Started by Nazir Bilal Yavuzover 1 year ago21 messages

Nazir Bilal Yavuz

byavuz81@gmail.com

over 1 year ago

2 attachment(s)

Hi,

I am working on using the read stream in pg_visibility. There are two
places to use it:

1- collect_visibility_data()

This one is straightforward. I created a read stream object if
'include_pd' is true because read I/O is done when 'include_pd' is
true. There is ~4% timing improvement with this change. I started the
server with the default settings and created a 6 GB table. Then run
100 times pg_visibility() by clearing the OS cache between each run.
----------

2- collect_corrupt_items()

This one is more complicated. The read stream callback function loops
until it finds a suitable block to read. So, if the callback returns
an InvalidBlockNumber; it means that the stream processed all possible
blocks and the stream should be finished. There is ~3% timing
improvement with this change. I started the server with the default
settings and created a 6 GB table. Then run 100 times
pg_check_visible() by clearing the OS cache between each run.

The downside of this approach is there are too many "vmbuffer is valid
but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so
vmbuffer needs to be read again" cases in the read stream version (700
vs 20 for the 6 GB table). This is caused by the callback function of
the read stream reading a new vmbuf while getting next block numbers.
So, vmbuf is wrong when we are checking visibility map bits that might
have changed while we were acquiring the page lock.
----------

Both patches are attached.

Any feedback would be appreciated.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v1-0001-Use-read-stream-in-pg_visibility-in-collect_visib.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Use-read-stream-in-pg_visibility-in-collect_visib.patchDownload

From c178e010759f042dc089fd4c5b2283a04b95f017 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Fri, 9 Aug 2024 11:41:52 +0300
Subject: [PATCH v1 1/2] Use read stream in pg_visibility in
 collect_visibility_data()

Instead of reading blocks with ReadBufferExtended() in
collect_visibility_data(), use read stream.

This change provides about 4% performance improvement.
---
 contrib/pg_visibility/pg_visibility.c | 53 ++++++++++++++++++++++++++-
 1 file changed, 51 insertions(+), 2 deletions(-)

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 1a1a4ff7be7..10f961c58f0 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -21,6 +21,7 @@
 #include "storage/bufmgr.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
+#include "storage/read_stream.h"
 #include "storage/smgr.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
@@ -41,6 +42,16 @@ typedef struct corrupt_items
 	ItemPointer tids;
 } corrupt_items;
 
+/*
+ * Helper struct for read stream object used in
+ * collect_visibility_data() function.
+ */
+struct collect_visibility_data_read_stream_private
+{
+	BlockNumber blocknum;
+	BlockNumber nblocks;
+};
+
 PG_FUNCTION_INFO_V1(pg_visibility_map);
 PG_FUNCTION_INFO_V1(pg_visibility_map_rel);
 PG_FUNCTION_INFO_V1(pg_visibility);
@@ -463,6 +474,23 @@ pg_visibility_tupdesc(bool include_blkno, bool include_pd)
 	return BlessTupleDesc(tupdesc);
 }
 
+/*
+ * Callback function to get next block for read stream object used in
+ * collect_visibility_data() function.
+ */
+static BlockNumber
+collect_visibility_data_read_stream_next_block(ReadStream *stream,
+											   void *callback_private_data,
+											   void *per_buffer_data)
+{
+	struct collect_visibility_data_read_stream_private *p = callback_private_data;
+
+	if (p->blocknum < p->nblocks)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * Collect visibility data about a relation.
  *
@@ -478,6 +506,8 @@ collect_visibility_data(Oid relid, bool include_pd)
 	BlockNumber blkno;
 	Buffer		vmbuffer = InvalidBuffer;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+	struct collect_visibility_data_read_stream_private p;
+	ReadStream *stream = NULL;
 
 	rel = relation_open(relid, AccessShareLock);
 
@@ -489,6 +519,20 @@ collect_visibility_data(Oid relid, bool include_pd)
 	info->next = 0;
 	info->count = nblocks;
 
+	/* Crate a read stream if read will be done. */
+	if (include_pd)
+	{
+		p.blocknum = 0;
+		p.nblocks = nblocks;
+		stream = read_stream_begin_relation(READ_STREAM_FULL,
+											bstrategy,
+											rel,
+											MAIN_FORKNUM,
+											collect_visibility_data_read_stream_next_block,
+											&p,
+											0);
+	}
+
 	for (blkno = 0; blkno < nblocks; ++blkno)
 	{
 		int32		mapbits;
@@ -513,8 +557,7 @@ collect_visibility_data(Oid relid, bool include_pd)
 			Buffer		buffer;
 			Page		page;
 
-			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-										bstrategy);
+			buffer = read_stream_next_buffer(stream, NULL);
 			LockBuffer(buffer, BUFFER_LOCK_SHARE);
 
 			page = BufferGetPage(buffer);
@@ -525,6 +568,12 @@ collect_visibility_data(Oid relid, bool include_pd)
 		}
 	}
 
+	if (include_pd && stream)
+	{
+		Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+		read_stream_end(stream);
+	}
+
 	/* Clean up. */
 	if (vmbuffer != InvalidBuffer)
 		ReleaseBuffer(vmbuffer);
-- 
2.45.2

v1-0002-Use-read-stream-in-pg_visibility-in-collect_corru.patchtext/x-patch; charset=US-ASCII; name=v1-0002-Use-read-stream-in-pg_visibility-in-collect_corru.patchDownload

From 456d1df641fd6e81658fd9288e1f81b9ce068708 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Tue, 13 Aug 2024 13:03:39 +0300
Subject: [PATCH v1 2/2] Use read stream in pg_visibility in
 collect_corrupt_items()

Instead of reading blocks with ReadBufferExtended() in
collect_corrupt_items(), use read stream.

This change provides about 3% performance improvement.
---
 contrib/pg_visibility/pg_visibility.c | 93 +++++++++++++++++++++++----
 1 file changed, 81 insertions(+), 12 deletions(-)

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 10f961c58f0..4729ee95991 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -52,6 +52,20 @@ struct collect_visibility_data_read_stream_private
 	BlockNumber nblocks;
 };
 
+/*
+ * Helper struct for read stream object used in
+ * collect_corrupt_items() function.
+ */
+struct collect_corrupt_items_read_stream_private
+{
+	bool		all_frozen;
+	bool		all_visible;
+	BlockNumber blocknum;
+	BlockNumber nblocks;
+	Relation	rel;
+	Buffer	   *vmbuffer;
+};
+
 PG_FUNCTION_INFO_V1(pg_visibility_map);
 PG_FUNCTION_INFO_V1(pg_visibility_map_rel);
 PG_FUNCTION_INFO_V1(pg_visibility);
@@ -639,6 +653,35 @@ GetStrictOldestNonRemovableTransactionId(Relation rel)
 	}
 }
 
+/*
+ * Callback function to get next block for read stream object used in
+ * collect_corrupt_items() function.
+ */
+static BlockNumber
+collect_corrupt_items_read_stream_next_block(ReadStream *stream,
+											 void *callback_private_data,
+											 void *per_buffer_data)
+{
+	struct collect_corrupt_items_read_stream_private *p = callback_private_data;
+
+	for (; p->blocknum < p->nblocks; p->blocknum++)
+	{
+		bool		check_frozen = false;
+		bool		check_visible = false;
+
+		if (p->all_frozen && VM_ALL_FROZEN(p->rel, p->blocknum, p->vmbuffer))
+			check_frozen = true;
+		if (p->all_visible && VM_ALL_VISIBLE(p->rel, p->blocknum, p->vmbuffer))
+			check_visible = true;
+		if (!check_visible && !check_frozen)
+			continue;
+
+		return p->blocknum++;
+	}
+
+	return InvalidBlockNumber;
+}
+
 /*
  * Returns a list of items whose visibility map information does not match
  * the status of the tuples on the page.
@@ -663,6 +706,10 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	Buffer		vmbuffer = InvalidBuffer;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 	TransactionId OldestXmin = InvalidTransactionId;
+	struct collect_corrupt_items_read_stream_private p;
+	ReadStream *stream;
+
+	Assert(all_visible || all_frozen);
 
 	rel = relation_open(relid, AccessShareLock);
 
@@ -687,6 +734,20 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	items->count = 64;
 	items->tids = palloc(items->count * sizeof(ItemPointerData));
 
+	p.blocknum = 0;
+	p.nblocks = nblocks;
+	p.rel = rel;
+	p.vmbuffer = &vmbuffer;
+	p.all_frozen = all_frozen;
+	p.all_visible = all_visible;
+	stream = read_stream_begin_relation(READ_STREAM_FULL,
+										bstrategy,
+										rel,
+										MAIN_FORKNUM,
+										collect_corrupt_items_read_stream_next_block,
+										&p,
+										0);
+
 	/* Loop over every block in the relation. */
 	for (blkno = 0; blkno < nblocks; ++blkno)
 	{
@@ -700,17 +761,20 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 		/* Make sure we are interruptible. */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Use the visibility map to decide whether to check this page. */
-		if (all_frozen && VM_ALL_FROZEN(rel, blkno, &vmbuffer))
-			check_frozen = true;
-		if (all_visible && VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
-			check_visible = true;
-		if (!check_visible && !check_frozen)
-			continue;
-
 		/* Read and lock the page. */
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-									bstrategy);
+		buffer = read_stream_next_buffer(stream, NULL);
+
+		/*
+		 * If the read stream returns an InvalidBuffer, this means all the
+		 * blocks are processed. So, end the stream and loop.
+		 */
+		if (buffer == InvalidBuffer)
+		{
+			read_stream_end(stream);
+			stream = NULL;
+			break;
+		}
+
 		LockBuffer(buffer, BUFFER_LOCK_SHARE);
 
 		page = BufferGetPage(buffer);
@@ -720,9 +784,9 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 		 * The visibility map bits might have changed while we were acquiring
 		 * the page lock.  Recheck to avoid returning spurious results.
 		 */
-		if (check_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
+		if (all_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
 			check_frozen = false;
-		if (check_visible && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
+		if (all_visible && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
 			check_visible = false;
 		if (!check_visible && !check_frozen)
 		{
@@ -807,6 +871,11 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 
 		UnlockReleaseBuffer(buffer);
 	}
+	if (stream)
+	{
+		Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+		read_stream_end(stream);
+	}
 
 	/* Clean up. */
 	if (vmbuffer != InvalidBuffer)
-- 
2.45.2

Michael Paquier

michael@paquier.xyz

over 1 year ago

In reply to: Nazir Bilal Yavuz (#1)

Re: Use read streams in pg_visibility

On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:

Hi,

I am working on using the read stream in pg_visibility. There are two
places to use it:

1- collect_visibility_data()

This one is straightforward. I created a read stream object if
'include_pd' is true because read I/O is done when 'include_pd' is
true. There is ~4% timing improvement with this change. I started the
server with the default settings and created a 6 GB table. Then run
100 times pg_visibility() by clearing the OS cache between each run.
----------

Mind sharing a script for reproducibility? Except for the drop_caches
part, of course..
--
Michael

Nazir Bilal Yavuz

byavuz81@gmail.com

over 1 year ago

In reply to: Michael Paquier (#2)

1 attachment(s)

Re: Use read streams in pg_visibility

Hi,

On Mon, 19 Aug 2024 at 09:30, Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:

Hi,

I am working on using the read stream in pg_visibility. There are two
places to use it:

1- collect_visibility_data()

This one is straightforward. I created a read stream object if
'include_pd' is true because read I/O is done when 'include_pd' is
true. There is ~4% timing improvement with this change. I started the
server with the default settings and created a 6 GB table. Then run
100 times pg_visibility() by clearing the OS cache between each run.
----------

Mind sharing a script for reproducibility? Except for the drop_caches
part, of course..

Sure, the script is attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Noah Misch

noah@leadboat.com

over 1 year ago

In reply to: Nazir Bilal Yavuz (#1)

Re: Use read streams in pg_visibility

On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:

2- collect_corrupt_items()

This one is more complicated. The read stream callback function loops
until it finds a suitable block to read. So, if the callback returns
an InvalidBlockNumber; it means that the stream processed all possible
blocks and the stream should be finished. There is ~3% timing
improvement with this change. I started the server with the default
settings and created a 6 GB table. Then run 100 times
pg_check_visible() by clearing the OS cache between each run.

The downside of this approach is there are too many "vmbuffer is valid
but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so
vmbuffer needs to be read again" cases in the read stream version (700
vs 20 for the 6 GB table). This is caused by the callback function of
the read stream reading a new vmbuf while getting next block numbers.
So, vmbuf is wrong when we are checking visibility map bits that might
have changed while we were acquiring the page lock.

Did the test that found 700 "read again" use a different procedure than the
one shared as setup.sh down-thread? Using that script alone, none of the vm
bits are set, so I'd not expect any heap page reads.

The "vmbuffer needs to be read again" regression fits what I would expect the
v1 patch to do with a table having many vm bits set. In general, I think the
fix is to keep two heap Buffer vars, so the callback can work on one vmbuffer
while collect_corrupt_items() works on another vmbuffer. Much of the time,
they'll be the same buffer. It could be as simple as that, but you could
consider further optimizations like these:

- Use ReadRecentBuffer() or a similar technique, to avoid a buffer mapping
lookup when the other Buffer var already has the block you want.

- [probably not worth it] Add APIs for pg_visibility.c to tell read_stream.c
to stop calling the ReadStreamBlockNumberCB for awhile. This could help if
nonzero vm bits are infrequent, causing us to visit few heap blocks per vm
block. For example, if one block out of every 33000 is all-visible, every
heap block we visit has a different vmbuffer. It's likely not optimal for
the callback to pin and unpin 20 vmbuffers, then have
collect_corrupt_items() pin and unpin the same 20 vmbuffers. pg_visibility
could notice this trend and request a stop of the callbacks until more of
the heap block work completes. If pg_visibility is going to be the only
place in the code with this use case, it's probably not worth carrying the
extra API just for pg_visibility. However, if we get a stronger use case
later, pg_visibility could be another beneficiary.

+/*
+ * Callback function to get next block for read stream object used in
+ * collect_visibility_data() function.
+ */
+static BlockNumber
+collect_visibility_data_read_stream_next_block(ReadStream *stream,
+											   void *callback_private_data,
+											   void *per_buffer_data)
+{
+	struct collect_visibility_data_read_stream_private *p = callback_private_data;
+
+	if (p->blocknum < p->nblocks)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;

This is the third callback that just iterates over a block range, after
pg_prewarm_read_stream_next_block() and
copy_storage_using_buffer_read_stream_next_block(). While not a big problem,
I think it's time to have a general-use callback for block range scans. The
quantity of duplicate code is low, but the existing function names are long
and less informative than a behavior-based name.

+static BlockNumber
+collect_corrupt_items_read_stream_next_block(ReadStream *stream,
+											 void *callback_private_data,
+											 void *per_buffer_data)
+{
+	struct collect_corrupt_items_read_stream_private *p = callback_private_data;
+
+	for (; p->blocknum < p->nblocks; p->blocknum++)
+	{
+		bool		check_frozen = false;
+		bool		check_visible = false;
+
+		if (p->all_frozen && VM_ALL_FROZEN(p->rel, p->blocknum, p->vmbuffer))
+			check_frozen = true;
+		if (p->all_visible && VM_ALL_VISIBLE(p->rel, p->blocknum, p->vmbuffer))
+			check_visible = true;
+		if (!check_visible && !check_frozen)
+			continue;

If a vm has zero bits set, this loop will scan the entire vm before returning.
Hence, this loop deserves a CHECK_FOR_INTERRUPTS() or a comment about how
VM_ALL_FROZEN/VM_ALL_VISIBLE reaches a CHECK_FOR_INTERRUPTS().

@@ -687,6 +734,20 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
items->count = 64;
items->tids = palloc(items->count * sizeof(ItemPointerData));

+	p.blocknum = 0;
+	p.nblocks = nblocks;
+	p.rel = rel;
+	p.vmbuffer = &vmbuffer;
+	p.all_frozen = all_frozen;
+	p.all_visible = all_visible;
+	stream = read_stream_begin_relation(READ_STREAM_FULL,
+										bstrategy,
+										rel,
+										MAIN_FORKNUM,
+										collect_corrupt_items_read_stream_next_block,
+										&p,
+										0);
+
/* Loop over every block in the relation. */
for (blkno = 0; blkno < nblocks; ++blkno)

The callback doesn't return blocks having zero vm bits, so the blkno variable
is not accurate. I didn't test, but I think the loop's "Recheck to avoid
returning spurious results." looks at the bit for the wrong block. If that's
what v1 does, could you expand the test file to catch that? For example, make
a two-block table with only the second block all-visible.

Since the callback skips blocks, this function should use the
read_stream_reset() style of looping:

while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer) ...

Thanks,
nm

Nazir Bilal Yavuz

byavuz81@gmail.com

over 1 year ago

In reply to: Noah Misch (#4)

2 attachment(s)

Re: Use read streams in pg_visibility

Hi,

Thanks for the review and feedback!

On Tue, 20 Aug 2024 at 21:47, Noah Misch <noah@leadboat.com> wrote:

On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:

2- collect_corrupt_items()

This one is more complicated. The read stream callback function loops
until it finds a suitable block to read. So, if the callback returns
an InvalidBlockNumber; it means that the stream processed all possible
blocks and the stream should be finished. There is ~3% timing
improvement with this change. I started the server with the default
settings and created a 6 GB table. Then run 100 times
pg_check_visible() by clearing the OS cache between each run.

The downside of this approach is there are too many "vmbuffer is valid
but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so
vmbuffer needs to be read again" cases in the read stream version (700
vs 20 for the 6 GB table). This is caused by the callback function of
the read stream reading a new vmbuf while getting next block numbers.
So, vmbuf is wrong when we are checking visibility map bits that might
have changed while we were acquiring the page lock.

Did the test that found 700 "read again" use a different procedure than the
one shared as setup.sh down-thread? Using that script alone, none of the vm
bits are set, so I'd not expect any heap page reads.

Yes, it is basically the same script but the query is 'SELECT
pg_check_visible('${TABLE}'::regclass);'.

The "vmbuffer needs to be read again" regression fits what I would expect the
v1 patch to do with a table having many vm bits set. In general, I think the
fix is to keep two heap Buffer vars, so the callback can work on one vmbuffer
while collect_corrupt_items() works on another vmbuffer. Much of the time,
they'll be the same buffer. It could be as simple as that, but you could
consider further optimizations like these:

Thanks for the suggestion. Keeping two vmbuffers lowered the count of
"read-again" cases to ~40. I ran the test again and the timing
improvement is ~5% now.

- Use ReadRecentBuffer() or a similar technique, to avoid a buffer mapping
lookup when the other Buffer var already has the block you want.

- [probably not worth it] Add APIs for pg_visibility.c to tell read_stream.c
to stop calling the ReadStreamBlockNumberCB for awhile. This could help if
nonzero vm bits are infrequent, causing us to visit few heap blocks per vm
block. For example, if one block out of every 33000 is all-visible, every
heap block we visit has a different vmbuffer. It's likely not optimal for
the callback to pin and unpin 20 vmbuffers, then have
collect_corrupt_items() pin and unpin the same 20 vmbuffers. pg_visibility
could notice this trend and request a stop of the callbacks until more of
the heap block work completes. If pg_visibility is going to be the only
place in the code with this use case, it's probably not worth carrying the
extra API just for pg_visibility. However, if we get a stronger use case
later, pg_visibility could be another beneficiary.

I think the number of "read-again" cases are low enough now. I am
working on 'using ReadRecentBuffer() or a similar technique' but that
may take more time, so I attached patches without these optimizations.

+/*
+ * Callback function to get next block for read stream object used in
+ * collect_visibility_data() function.
+ */
+static BlockNumber
+collect_visibility_data_read_stream_next_block(ReadStream *stream,
+                                                                                        void *callback_private_data,
+                                                                                        void *per_buffer_data)
+{
+     struct collect_visibility_data_read_stream_private *p = callback_private_data;
+
+     if (p->blocknum < p->nblocks)
+             return p->blocknum++;
+
+     return InvalidBlockNumber;
This is the third callback that just iterates over a block range, after
pg_prewarm_read_stream_next_block() and
copy_storage_using_buffer_read_stream_next_block(). While not a big problem,
I think it's time to have a general-use callback for block range scans. The
quantity of duplicate code is low, but the existing function names are long
and less informative than a behavior-based name.

I agree. Does creating something like
pg_general_read_stream_next_block() in read_stream code and exporting
it makes sense?

+static BlockNumber
+collect_corrupt_items_read_stream_next_block(ReadStream *stream,
+                                                                                      void *callback_private_data,
+                                                                                      void *per_buffer_data)
+{
+     struct collect_corrupt_items_read_stream_private *p = callback_private_data;
+
+     for (; p->blocknum < p->nblocks; p->blocknum++)
+     {
+             bool            check_frozen = false;
+             bool            check_visible = false;
+
+             if (p->all_frozen && VM_ALL_FROZEN(p->rel, p->blocknum, p->vmbuffer))
+                     check_frozen = true;
+             if (p->all_visible && VM_ALL_VISIBLE(p->rel, p->blocknum, p->vmbuffer))
+                     check_visible = true;
+             if (!check_visible && !check_frozen)
+                     continue;

Done. I added CHECK_FOR_INTERRUPTS() to the loop in callback.

@@ -687,6 +734,20 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
items->count = 64;
items->tids = palloc(items->count * sizeof(ItemPointerData));
+     p.blocknum = 0;
+     p.nblocks = nblocks;
+     p.rel = rel;
+     p.vmbuffer = &vmbuffer;
+     p.all_frozen = all_frozen;
+     p.all_visible = all_visible;
+     stream = read_stream_begin_relation(READ_STREAM_FULL,
+                                                                             bstrategy,
+                                                                             rel,
+                                                                             MAIN_FORKNUM,
+                                                                             collect_corrupt_items_read_stream_next_block,
+                                                                             &p,
+                                                                             0);
+
/* Loop over every block in the relation. */
for (blkno = 0; blkno < nblocks; ++blkno)
The callback doesn't return blocks having zero vm bits, so the blkno variable
is not accurate. I didn't test, but I think the loop's "Recheck to avoid
returning spurious results." looks at the bit for the wrong block. If that's
what v1 does, could you expand the test file to catch that? For example, make
a two-block table with only the second block all-visible.

Yes, it was not accurate. I am getting blockno from the buffer now. I
checked and confirmed it is working as expected by manually logging
blocknos returned from the read stream. I am not sure how to add a
test case for this.

Since the callback skips blocks, this function should use the
read_stream_reset() style of looping:

Done.

There is one behavioral change introduced with the patches. It could
happen when collect_corrupt_items() is called with both all_visible
and all_frozen being true.
-> Let's say VM_ALL_FROZEN() returns true but VM_ALL_VISIBLE() returns
false in callback. Callback returns this block number because
VM_ALL_FROZEN() is true.
-> At the /* Recheck to avoid returning spurious results. */ part, we
should only check VM_ALL_FROZEN() again as this was returning true in
the callback. But we check both VM_ALL_FROZEN() and VM_ALL_VISIBLE().
VM_ALL_FROZEN() returns false and VM_ALL_VISIBLE() returns true now
(vice versa of callback).
-> We were skipping this block in the master but the patched version
does not skip that.

Is this a problem? If this is a problem, a solution might be to
callback return values of VM_ALL_FROZEN() and VM_ALL_VISIBLE() in the
per_buffer_data.

v2 of the patches are attached, only 0002 is updated.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v2-0001-Use-read-stream-in-pg_visibility-in-collect_visib.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Use-read-stream-in-pg_visibility-in-collect_visib.patchDownload

From e7ec4f3af86499b8433379b3b102eb58f75cf361 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Fri, 9 Aug 2024 11:41:52 +0300
Subject: [PATCH v2 1/2] Use read stream in pg_visibility in
 collect_visibility_data()

Instead of reading blocks with ReadBufferExtended() in
collect_visibility_data(), use read stream.

This change provides about 4% performance improvement.
---
 contrib/pg_visibility/pg_visibility.c | 53 ++++++++++++++++++++++++++-
 1 file changed, 51 insertions(+), 2 deletions(-)

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 773ba92e454..ece67ccbe5f 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -21,6 +21,7 @@
 #include "storage/bufmgr.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
+#include "storage/read_stream.h"
 #include "storage/smgr.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
@@ -41,6 +42,16 @@ typedef struct corrupt_items
 	ItemPointer tids;
 } corrupt_items;
 
+/*
+ * Helper struct for read stream object used in
+ * collect_visibility_data() function.
+ */
+struct collect_visibility_data_read_stream_private
+{
+	BlockNumber blocknum;
+	BlockNumber nblocks;
+};
+
 PG_FUNCTION_INFO_V1(pg_visibility_map);
 PG_FUNCTION_INFO_V1(pg_visibility_map_rel);
 PG_FUNCTION_INFO_V1(pg_visibility);
@@ -463,6 +474,23 @@ pg_visibility_tupdesc(bool include_blkno, bool include_pd)
 	return BlessTupleDesc(tupdesc);
 }
 
+/*
+ * Callback function to get next block for read stream object used in
+ * collect_visibility_data() function.
+ */
+static BlockNumber
+collect_visibility_data_read_stream_next_block(ReadStream *stream,
+											   void *callback_private_data,
+											   void *per_buffer_data)
+{
+	struct collect_visibility_data_read_stream_private *p = callback_private_data;
+
+	if (p->blocknum < p->nblocks)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * Collect visibility data about a relation.
  *
@@ -478,6 +506,8 @@ collect_visibility_data(Oid relid, bool include_pd)
 	BlockNumber blkno;
 	Buffer		vmbuffer = InvalidBuffer;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+	struct collect_visibility_data_read_stream_private p;
+	ReadStream *stream = NULL;
 
 	rel = relation_open(relid, AccessShareLock);
 
@@ -489,6 +519,20 @@ collect_visibility_data(Oid relid, bool include_pd)
 	info->next = 0;
 	info->count = nblocks;
 
+	/* Crate a read stream if read will be done. */
+	if (include_pd)
+	{
+		p.blocknum = 0;
+		p.nblocks = nblocks;
+		stream = read_stream_begin_relation(READ_STREAM_FULL,
+											bstrategy,
+											rel,
+											MAIN_FORKNUM,
+											collect_visibility_data_read_stream_next_block,
+											&p,
+											0);
+	}
+
 	for (blkno = 0; blkno < nblocks; ++blkno)
 	{
 		int32		mapbits;
@@ -513,8 +557,7 @@ collect_visibility_data(Oid relid, bool include_pd)
 			Buffer		buffer;
 			Page		page;
 
-			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-										bstrategy);
+			buffer = read_stream_next_buffer(stream, NULL);
 			LockBuffer(buffer, BUFFER_LOCK_SHARE);
 
 			page = BufferGetPage(buffer);
@@ -525,6 +568,12 @@ collect_visibility_data(Oid relid, bool include_pd)
 		}
 	}
 
+	if (include_pd && stream)
+	{
+		Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+		read_stream_end(stream);
+	}
+
 	/* Clean up. */
 	if (vmbuffer != InvalidBuffer)
 		ReleaseBuffer(vmbuffer);
-- 
2.45.2

v2-0002-Use-read-stream-in-pg_visibility-in-collect_corru.patchtext/x-patch; charset=US-ASCII; name=v2-0002-Use-read-stream-in-pg_visibility-in-collect_corru.patchDownload

From 1ed469a6301d8ac889228c49198368028a6b6a0e Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Fri, 23 Aug 2024 10:15:09 +0300
Subject: [PATCH v2 2/2] Use read stream in pg_visibility in
 collect_corrupt_items()

Instead of reading blocks with ReadBufferExtended() in
collect_corrupt_items(), use read stream.

This change provides about 5% performance improvement.
---
 contrib/pg_visibility/pg_visibility.c | 87 ++++++++++++++++++++++-----
 1 file changed, 72 insertions(+), 15 deletions(-)

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index ece67ccbe5f..cfd94bd3be9 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -52,6 +52,20 @@ struct collect_visibility_data_read_stream_private
 	BlockNumber nblocks;
 };
 
+/*
+ * Helper struct for read stream object used in
+ * collect_corrupt_items() function.
+ */
+struct collect_corrupt_items_read_stream_private
+{
+	bool		all_frozen;
+	bool		all_visible;
+	BlockNumber blocknum;
+	BlockNumber nblocks;
+	Relation	rel;
+	Buffer	   *vmbuffer;
+};
+
 PG_FUNCTION_INFO_V1(pg_visibility_map);
 PG_FUNCTION_INFO_V1(pg_visibility_map_rel);
 PG_FUNCTION_INFO_V1(pg_visibility);
@@ -659,6 +673,38 @@ GetStrictOldestNonRemovableTransactionId(Relation rel)
 	}
 }
 
+/*
+ * Callback function to get next block for read stream object used in
+ * collect_corrupt_items() function.
+ */
+static BlockNumber
+collect_corrupt_items_read_stream_next_block(ReadStream *stream,
+											 void *callback_private_data,
+											 void *per_buffer_data)
+{
+	struct collect_corrupt_items_read_stream_private *p = callback_private_data;
+
+	for (; p->blocknum < p->nblocks; p->blocknum++)
+	{
+		bool		check_frozen = false;
+		bool		check_visible = false;
+
+		/* Make sure we are interruptible. */
+		CHECK_FOR_INTERRUPTS();
+
+		if (p->all_frozen && VM_ALL_FROZEN(p->rel, p->blocknum, p->vmbuffer))
+			check_frozen = true;
+		if (p->all_visible && VM_ALL_VISIBLE(p->rel, p->blocknum, p->vmbuffer))
+			check_visible = true;
+		if (!check_visible && !check_frozen)
+			continue;
+
+		return p->blocknum++;
+	}
+
+	return InvalidBlockNumber;
+}
+
 /*
  * Returns a list of items whose visibility map information does not match
  * the status of the tuples on the page.
@@ -681,8 +727,12 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	corrupt_items *items;
 	BlockNumber blkno;
 	Buffer		vmbuffer = InvalidBuffer;
+	Buffer		stream_vmbuffer = InvalidBuffer;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 	TransactionId OldestXmin = InvalidTransactionId;
+	struct collect_corrupt_items_read_stream_private p;
+	ReadStream *stream;
+	Buffer		buffer;
 
 	rel = relation_open(relid, AccessShareLock);
 
@@ -707,12 +757,25 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	items->count = 64;
 	items->tids = palloc(items->count * sizeof(ItemPointerData));
 
+	p.blocknum = 0;
+	p.nblocks = nblocks;
+	p.rel = rel;
+	p.vmbuffer = &stream_vmbuffer;
+	p.all_frozen = all_frozen;
+	p.all_visible = all_visible;
+	stream = read_stream_begin_relation(READ_STREAM_FULL,
+										bstrategy,
+										rel,
+										MAIN_FORKNUM,
+										collect_corrupt_items_read_stream_next_block,
+										&p,
+										0);
+
 	/* Loop over every block in the relation. */
-	for (blkno = 0; blkno < nblocks; ++blkno)
+	while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
 	{
 		bool		check_frozen = false;
 		bool		check_visible = false;
-		Buffer		buffer;
 		Page		page;
 		OffsetNumber offnum,
 					maxoff;
@@ -720,29 +783,19 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 		/* Make sure we are interruptible. */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Use the visibility map to decide whether to check this page. */
-		if (all_frozen && VM_ALL_FROZEN(rel, blkno, &vmbuffer))
-			check_frozen = true;
-		if (all_visible && VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
-			check_visible = true;
-		if (!check_visible && !check_frozen)
-			continue;
-
-		/* Read and lock the page. */
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-									bstrategy);
 		LockBuffer(buffer, BUFFER_LOCK_SHARE);
 
 		page = BufferGetPage(buffer);
 		maxoff = PageGetMaxOffsetNumber(page);
+		blkno = BufferGetBlockNumber(buffer);
 
 		/*
 		 * The visibility map bits might have changed while we were acquiring
 		 * the page lock.  Recheck to avoid returning spurious results.
 		 */
-		if (check_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
+		if (all_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
 			check_frozen = false;
-		if (check_visible && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
+		if (all_visible && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
 			check_visible = false;
 		if (!check_visible && !check_frozen)
 		{
@@ -827,10 +880,14 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 
 		UnlockReleaseBuffer(buffer);
 	}
+	Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+	read_stream_end(stream);
 
 	/* Clean up. */
 	if (vmbuffer != InvalidBuffer)
 		ReleaseBuffer(vmbuffer);
+	if (stream_vmbuffer != InvalidBuffer)
+		ReleaseBuffer(stream_vmbuffer);
 	relation_close(rel, AccessShareLock);
 
 	/*
-- 
2.45.2

Noah Misch

noah@leadboat.com

over 1 year ago

In reply to: Nazir Bilal Yavuz (#5)

Re: Use read streams in pg_visibility

On Fri, Aug 23, 2024 at 02:20:06PM +0300, Nazir Bilal Yavuz wrote:

On Tue, 20 Aug 2024 at 21:47, Noah Misch <noah@leadboat.com> wrote:

On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:

The downside of this approach is there are too many "vmbuffer is valid
but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so
vmbuffer needs to be read again" cases in the read stream version (700
vs 20 for the 6 GB table). This is caused by the callback function of
the read stream reading a new vmbuf while getting next block numbers.
So, vmbuf is wrong when we are checking visibility map bits that might
have changed while we were acquiring the page lock.

Did the test that found 700 "read again" use a different procedure than the
one shared as setup.sh down-thread? Using that script alone, none of the vm
bits are set, so I'd not expect any heap page reads.

Yes, it is basically the same script but the query is 'SELECT
pg_check_visible('${TABLE}'::regclass);'.

I gather the scripts deal exclusively in tables with no vm bits set, so
pg_visibility did no heap page reads under those scripts. But the '700 "read
again"' result suggests either I'm guessing wrong, or that result came from a
different test procedure. Thoughts?

The "vmbuffer needs to be read again" regression fits what I would expect the
v1 patch to do with a table having many vm bits set. In general, I think the
fix is to keep two heap Buffer vars, so the callback can work on one vmbuffer
while collect_corrupt_items() works on another vmbuffer. Much of the time,
they'll be the same buffer. It could be as simple as that, but you could
consider further optimizations like these:

Thanks for the suggestion. Keeping two vmbuffers lowered the count of
"read-again" cases to ~40. I ran the test again and the timing
improvement is ~5% now.

I think the number of "read-again" cases are low enough now.

Agreed. The increase from 20 to 40 is probably an increase in buffer mapping
lookups, not an increase in I/O.

Does creating something like
pg_general_read_stream_next_block() in read_stream code and exporting
it makes sense?

To me, that name isn't conveying the function's behavior, and the words "pg_"
and "general_" aren't adding information there. How about one of these names
or similar:

seq_read_stream_cb
sequential_read_stream_cb
block_range_read_stream_cb

The callback doesn't return blocks having zero vm bits, so the blkno variable
is not accurate. I didn't test, but I think the loop's "Recheck to avoid
returning spurious results." looks at the bit for the wrong block. If that's
what v1 does, could you expand the test file to catch that? For example, make
a two-block table with only the second block all-visible.

Yes, it was not accurate. I am getting blockno from the buffer now. I
checked and confirmed it is working as expected by manually logging
blocknos returned from the read stream. I am not sure how to add a
test case for this.

VACUUM FREEZE makes an all-visible, all-frozen table. DELETE of a particular
TID, even if rolled back, clears both vm bits for the TID's page. Past tests
like that had instability problems. One cause is a concurrent session's XID
or snapshot, which can prevent VACUUM setting vm bits. Using a TEMP table may
have been one of the countermeasures, but I don't remember clearly. Hence,
please search the archives or the existing pg_visibility tests for how we
dealt with that. It may not be problem for this particular test.

There is one behavioral change introduced with the patches. It could
happen when collect_corrupt_items() is called with both all_visible
and all_frozen being true.
-> Let's say VM_ALL_FROZEN() returns true but VM_ALL_VISIBLE() returns
false in callback. Callback returns this block number because
VM_ALL_FROZEN() is true.
-> At the /* Recheck to avoid returning spurious results. */ part, we
should only check VM_ALL_FROZEN() again as this was returning true in
the callback. But we check both VM_ALL_FROZEN() and VM_ALL_VISIBLE().
VM_ALL_FROZEN() returns false and VM_ALL_VISIBLE() returns true now
(vice versa of callback).
-> We were skipping this block in the master but the patched version
does not skip that.

Is this a problem? If this is a problem, a solution might be to
callback return values of VM_ALL_FROZEN() and VM_ALL_VISIBLE() in the
per_buffer_data.

No, I don't consider that a problem. That's not making us do additional I/O,
just testing more bits within the pages we're already loading. The difference
in behavior is user-visible, but it's the same behavior change the user would
get if the timing had been a bit different.

Nazir Bilal Yavuz

byavuz81@gmail.com

over 1 year ago

In reply to: Noah Misch (#6)

6 attachment(s)

Re: Use read streams in pg_visibility

Hi,

On Fri, 23 Aug 2024 at 22:01, Noah Misch <noah@leadboat.com> wrote:

On Fri, Aug 23, 2024 at 02:20:06PM +0300, Nazir Bilal Yavuz wrote:

On Tue, 20 Aug 2024 at 21:47, Noah Misch <noah@leadboat.com> wrote:

On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:

The downside of this approach is there are too many "vmbuffer is valid
but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so
vmbuffer needs to be read again" cases in the read stream version (700
vs 20 for the 6 GB table). This is caused by the callback function of
the read stream reading a new vmbuf while getting next block numbers.
So, vmbuf is wrong when we are checking visibility map bits that might
have changed while we were acquiring the page lock.

Did the test that found 700 "read again" use a different procedure than the
one shared as setup.sh down-thread? Using that script alone, none of the vm
bits are set, so I'd not expect any heap page reads.

Yes, it is basically the same script but the query is 'SELECT
pg_check_visible('${TABLE}'::regclass);'.

I gather the scripts deal exclusively in tables with no vm bits set, so
pg_visibility did no heap page reads under those scripts. But the '700 "read
again"' result suggests either I'm guessing wrong, or that result came from a
different test procedure. Thoughts?

Sorry, yes. You need to run VACUUM on the test table before running
the query. The script is attached. You can run the script with
[corrupt | visibility] arguments to test the related patches.

The "vmbuffer needs to be read again" regression fits what I would expect the
v1 patch to do with a table having many vm bits set. In general, I think the
fix is to keep two heap Buffer vars, so the callback can work on one vmbuffer
while collect_corrupt_items() works on another vmbuffer. Much of the time,
they'll be the same buffer. It could be as simple as that, but you could
consider further optimizations like these:

Thanks for the suggestion. Keeping two vmbuffers lowered the count of
"read-again" cases to ~40. I ran the test again and the timing
improvement is ~5% now.

I think the number of "read-again" cases are low enough now.

Agreed. The increase from 20 to 40 is probably an increase in buffer mapping
lookups, not an increase in I/O.

Does creating something like
pg_general_read_stream_next_block() in read_stream code and exporting
it makes sense?

To me, that name isn't conveying the function's behavior, and the words "pg_"
and "general_" aren't adding information there. How about one of these names
or similar:

seq_read_stream_cb
sequential_read_stream_cb
block_range_read_stream_cb

I liked the block_range_read_stream_cb. Attached patches for that
(first 3 patches). I chose an nblocks way instead of last_blocks in
the struct.

The callback doesn't return blocks having zero vm bits, so the blkno variable
is not accurate. I didn't test, but I think the loop's "Recheck to avoid
returning spurious results." looks at the bit for the wrong block. If that's
what v1 does, could you expand the test file to catch that? For example, make
a two-block table with only the second block all-visible.

Yes, it was not accurate. I am getting blockno from the buffer now. I
checked and confirmed it is working as expected by manually logging
blocknos returned from the read stream. I am not sure how to add a
test case for this.

VACUUM FREEZE makes an all-visible, all-frozen table. DELETE of a particular
TID, even if rolled back, clears both vm bits for the TID's page. Past tests
like that had instability problems. One cause is a concurrent session's XID
or snapshot, which can prevent VACUUM setting vm bits. Using a TEMP table may
have been one of the countermeasures, but I don't remember clearly. Hence,
please search the archives or the existing pg_visibility tests for how we
dealt with that. It may not be problem for this particular test.

Thanks for the information, I will check these. What I still do not
understand is how to make sure that only the second block is processed
and the first one is skipped. pg_check_visible() and pg_check_frozen()
returns TIDs that cause corruption in the visibility map, there is no
information about block numbers.

There is one behavioral change introduced with the patches. It could
happen when collect_corrupt_items() is called with both all_visible
and all_frozen being true.
-> Let's say VM_ALL_FROZEN() returns true but VM_ALL_VISIBLE() returns
false in callback. Callback returns this block number because
VM_ALL_FROZEN() is true.
-> At the /* Recheck to avoid returning spurious results. */ part, we
should only check VM_ALL_FROZEN() again as this was returning true in
the callback. But we check both VM_ALL_FROZEN() and VM_ALL_VISIBLE().
VM_ALL_FROZEN() returns false and VM_ALL_VISIBLE() returns true now
(vice versa of callback).
-> We were skipping this block in the master but the patched version
does not skip that.

Is this a problem? If this is a problem, a solution might be to
callback return values of VM_ALL_FROZEN() and VM_ALL_VISIBLE() in the
per_buffer_data.

No, I don't consider that a problem. That's not making us do additional I/O,
just testing more bits within the pages we're already loading. The difference
in behavior is user-visible, but it's the same behavior change the user would
get if the timing had been a bit different.

Thanks for the clarification.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v3-0001-Add-general-use-struct-and-callback-to-read-strea.patchapplication/x-patch; name=v3-0001-Add-general-use-struct-and-callback-to-read-strea.patchDownload

From 55e8e78d05b135babf8ff8c84fb9747ad19c58c1 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Mon, 26 Aug 2024 12:12:52 +0300
Subject: [PATCH v3 1/5] Add general-use struct and callback to read stream

Number of callbacks that just iterates over block range in read stream
are increasing. So, add general-use struct and callback to read stream.
---
 src/include/storage/read_stream.h     | 12 ++++++++++++
 src/backend/storage/aio/read_stream.c | 17 +++++++++++++++++
 src/tools/pgindent/typedefs.list      |  1 +
 3 files changed, 30 insertions(+)

diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
index 4e599904f26..7613c1e361c 100644
--- a/src/include/storage/read_stream.h
+++ b/src/include/storage/read_stream.h
@@ -45,11 +45,23 @@
 struct ReadStream;
 typedef struct ReadStream ReadStream;
 
+/*
+ * General-use struct to use in callback functions for block range scans.
+ */
+typedef struct BlockRangeReadStreamPrivate
+{
+	BlockNumber blocknum;
+	BlockNumber nblocks;
+} BlockRangeReadStreamPrivate;
+
 /* Callback that returns the next block number to read. */
 typedef BlockNumber (*ReadStreamBlockNumberCB) (ReadStream *stream,
 												void *callback_private_data,
 												void *per_buffer_data);
 
+extern BlockNumber block_range_read_stream_cb(ReadStream *stream,
+											  void *callback_private_data,
+											  void *per_buffer_data);
 extern ReadStream *read_stream_begin_relation(int flags,
 											  BufferAccessStrategy strategy,
 											  Relation rel,
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index a83c18c2a4b..937d001ef08 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -166,6 +166,23 @@ get_per_buffer_data(ReadStream *stream, int16 buffer_index)
 		stream->per_buffer_data_size * buffer_index;
 }
 
+/*
+ * General-use callback function for block range scans. Callback loops between
+ * blocknum (inclusive) and nblocks (exclusive).
+ */
+BlockNumber
+block_range_read_stream_cb(ReadStream *stream,
+						   void *callback_private_data,
+						   void *per_buffer_data)
+{
+	BlockRangeReadStreamPrivate *p = callback_private_data;
+
+	if (p->blocknum < p->nblocks)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;
+}
+
 /*
  * Ask the callback which block it would like us to read next, with a small
  * buffer in front to allow read_stream_unget_block() to work and to allow the
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9e951a9e6f3..df3f336bec0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -275,6 +275,7 @@ BlockId
 BlockIdData
 BlockInfoRecord
 BlockNumber
+BlockRangeReadStreamPrivate
 BlockRefTable
 BlockRefTableBuffer
 BlockRefTableChunk
-- 
2.45.2

v3-0002-pg_prewarm-Use-generic-use-read-stream-struct-and.patchapplication/x-patch; name=v3-0002-pg_prewarm-Use-generic-use-read-stream-struct-and.patchDownload

From 01de5ff954f7db6a749ebfc0b1281ed010b5e8f9 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Mon, 26 Aug 2024 12:16:06 +0300
Subject: [PATCH v3 2/5] pg_prewarm: Use generic-use read stream struct and
 callback

Instead of creating another read stream struct and callback, use generic
one that is exported from read stream.
---
 contrib/pg_prewarm/pg_prewarm.c | 25 +++----------------------
 1 file changed, 3 insertions(+), 22 deletions(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 5c859e983c5..889639a09a3 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -39,25 +39,6 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
-struct pg_prewarm_read_stream_private
-{
-	BlockNumber blocknum;
-	int64		last_block;
-};
-
-static BlockNumber
-pg_prewarm_read_stream_next_block(ReadStream *stream,
-								  void *callback_private_data,
-								  void *per_buffer_data)
-{
-	struct pg_prewarm_read_stream_private *p = callback_private_data;
-
-	if (p->blocknum <= p->last_block)
-		return p->blocknum++;
-
-	return InvalidBlockNumber;
-}
-
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -203,7 +184,7 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
-		struct pg_prewarm_read_stream_private p;
+		BlockRangeReadStreamPrivate p;
 		ReadStream *stream;
 
 		/*
@@ -212,13 +193,13 @@ pg_prewarm(PG_FUNCTION_ARGS)
 
 		/* Set up the private state for our streaming buffer read callback. */
 		p.blocknum = first_block;
-		p.last_block = last_block;
+		p.nblocks = last_block + 1;
 
 		stream = read_stream_begin_relation(READ_STREAM_FULL,
 											NULL,
 											rel,
 											forkNumber,
-											pg_prewarm_read_stream_next_block,
+											block_range_read_stream_cb,
 											&p,
 											0);
 
-- 
2.45.2

v3-0003-RelationCopyStorageUsingBuffer-Use-generic-use-re.patchapplication/x-patch; name=v3-0003-RelationCopyStorageUsingBuffer-Use-generic-use-re.patchDownload

From afc8c50c39212e016f71ee287062c287fffbacc2 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Mon, 26 Aug 2024 12:24:26 +0300
Subject: [PATCH v3 3/5] RelationCopyStorageUsingBuffer: Use generic-use read
 stream struct and callback

Instead of creating another read stream struct and callback, use generic
one that is exported from read stream.
---
 src/backend/storage/buffer/bufmgr.c | 31 ++---------------------------
 1 file changed, 2 insertions(+), 29 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fb38c7bdd45..c04e4733921 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -136,33 +136,6 @@ typedef struct SMgrSortArray
 	SMgrRelation srel;
 } SMgrSortArray;
 
-/*
- * Helper struct for read stream object used in
- * RelationCopyStorageUsingBuffer() function.
- */
-struct copy_storage_using_buffer_read_stream_private
-{
-	BlockNumber blocknum;
-	BlockNumber nblocks;
-};
-
-/*
- * Callback function to get next block for read stream object used in
- * RelationCopyStorageUsingBuffer() function.
- */
-static BlockNumber
-copy_storage_using_buffer_read_stream_next_block(ReadStream *stream,
-												 void *callback_private_data,
-												 void *per_buffer_data)
-{
-	struct copy_storage_using_buffer_read_stream_private *p = callback_private_data;
-
-	if (p->blocknum < p->nblocks)
-		return p->blocknum++;
-
-	return InvalidBlockNumber;
-}
-
 /* GUC variables */
 bool		zero_damaged_pages = false;
 int			bgwriter_lru_maxpages = 100;
@@ -4710,7 +4683,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 	PGIOAlignedBlock buf;
 	BufferAccessStrategy bstrategy_src;
 	BufferAccessStrategy bstrategy_dst;
-	struct copy_storage_using_buffer_read_stream_private p;
+	BlockRangeReadStreamPrivate p;
 	ReadStream *src_stream;
 	SMgrRelation src_smgr;
 
@@ -4750,7 +4723,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 												 src_smgr,
 												 permanent ? RELPERSISTENCE_PERMANENT : RELPERSISTENCE_UNLOGGED,
 												 forkNum,
-												 copy_storage_using_buffer_read_stream_next_block,
+												 block_range_read_stream_cb,
 												 &p,
 												 0);
 
-- 
2.45.2

v3-0004-Use-read-stream-in-pg_visibility-in-collect_visib.patchapplication/x-patch; name=v3-0004-Use-read-stream-in-pg_visibility-in-collect_visib.patchDownload

From fbbca6e1c27e45afae9137228811d9fa11a7a197 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Mon, 26 Aug 2024 15:45:20 +0300
Subject: [PATCH v3 4/5] Use read stream in pg_visibility in
 collect_visibility_data()

Instead of reading blocks with ReadBufferExtended() in
collect_visibility_data(), use read stream.

This change provides about 4% performance improvement.
---
 contrib/pg_visibility/pg_visibility.c | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 773ba92e454..6c64f5ead7e 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -21,6 +21,7 @@
 #include "storage/bufmgr.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
+#include "storage/read_stream.h"
 #include "storage/smgr.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
@@ -478,6 +479,8 @@ collect_visibility_data(Oid relid, bool include_pd)
 	BlockNumber blkno;
 	Buffer		vmbuffer = InvalidBuffer;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+	BlockRangeReadStreamPrivate p;
+	ReadStream *stream = NULL;
 
 	rel = relation_open(relid, AccessShareLock);
 
@@ -489,6 +492,20 @@ collect_visibility_data(Oid relid, bool include_pd)
 	info->next = 0;
 	info->count = nblocks;
 
+	/* Crate a read stream if read will be done. */
+	if (include_pd)
+	{
+		p.blocknum = 0;
+		p.nblocks = nblocks;
+		stream = read_stream_begin_relation(READ_STREAM_FULL,
+											bstrategy,
+											rel,
+											MAIN_FORKNUM,
+											block_range_read_stream_cb,
+											&p,
+											0);
+	}
+
 	for (blkno = 0; blkno < nblocks; ++blkno)
 	{
 		int32		mapbits;
@@ -513,8 +530,7 @@ collect_visibility_data(Oid relid, bool include_pd)
 			Buffer		buffer;
 			Page		page;
 
-			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-										bstrategy);
+			buffer = read_stream_next_buffer(stream, NULL);
 			LockBuffer(buffer, BUFFER_LOCK_SHARE);
 
 			page = BufferGetPage(buffer);
@@ -525,6 +541,12 @@ collect_visibility_data(Oid relid, bool include_pd)
 		}
 	}
 
+	if (include_pd && stream)
+	{
+		Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+		read_stream_end(stream);
+	}
+
 	/* Clean up. */
 	if (vmbuffer != InvalidBuffer)
 		ReleaseBuffer(vmbuffer);
-- 
2.45.2

v3-0005-Use-read-stream-in-pg_visibility-in-collect_corru.patchapplication/x-patch; name=v3-0005-Use-read-stream-in-pg_visibility-in-collect_corru.patchDownload

From aae5ded7c3b3f7600f01674a6200645d10d07116 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Fri, 23 Aug 2024 10:15:09 +0300
Subject: [PATCH v3 5/5] Use read stream in pg_visibility in
 collect_corrupt_items()

Instead of reading blocks with ReadBufferExtended() in
collect_corrupt_items(), use read stream.

This change provides about 5% performance improvement.
---
 contrib/pg_visibility/pg_visibility.c | 87 ++++++++++++++++++++++-----
 1 file changed, 72 insertions(+), 15 deletions(-)

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 6c64f5ead7e..75a5b362fe2 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -42,6 +42,20 @@ typedef struct corrupt_items
 	ItemPointer tids;
 } corrupt_items;
 
+/*
+ * Helper struct for read stream object used in
+ * collect_corrupt_items() function.
+ */
+struct collect_corrupt_items_read_stream_private
+{
+	bool		all_frozen;
+	bool		all_visible;
+	BlockNumber blocknum;
+	BlockNumber nblocks;
+	Relation	rel;
+	Buffer	   *vmbuffer;
+};
+
 PG_FUNCTION_INFO_V1(pg_visibility_map);
 PG_FUNCTION_INFO_V1(pg_visibility_map_rel);
 PG_FUNCTION_INFO_V1(pg_visibility);
@@ -632,6 +646,38 @@ GetStrictOldestNonRemovableTransactionId(Relation rel)
 	}
 }
 
+/*
+ * Callback function to get next block for read stream object used in
+ * collect_corrupt_items() function.
+ */
+static BlockNumber
+collect_corrupt_items_read_stream_next_block(ReadStream *stream,
+											 void *callback_private_data,
+											 void *per_buffer_data)
+{
+	struct collect_corrupt_items_read_stream_private *p = callback_private_data;
+
+	for (; p->blocknum < p->nblocks; p->blocknum++)
+	{
+		bool		check_frozen = false;
+		bool		check_visible = false;
+
+		/* Make sure we are interruptible. */
+		CHECK_FOR_INTERRUPTS();
+
+		if (p->all_frozen && VM_ALL_FROZEN(p->rel, p->blocknum, p->vmbuffer))
+			check_frozen = true;
+		if (p->all_visible && VM_ALL_VISIBLE(p->rel, p->blocknum, p->vmbuffer))
+			check_visible = true;
+		if (!check_visible && !check_frozen)
+			continue;
+
+		return p->blocknum++;
+	}
+
+	return InvalidBlockNumber;
+}
+
 /*
  * Returns a list of items whose visibility map information does not match
  * the status of the tuples on the page.
@@ -654,8 +700,12 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	corrupt_items *items;
 	BlockNumber blkno;
 	Buffer		vmbuffer = InvalidBuffer;
+	Buffer		stream_vmbuffer = InvalidBuffer;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 	TransactionId OldestXmin = InvalidTransactionId;
+	struct collect_corrupt_items_read_stream_private p;
+	ReadStream *stream;
+	Buffer		buffer;
 
 	rel = relation_open(relid, AccessShareLock);
 
@@ -680,12 +730,25 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	items->count = 64;
 	items->tids = palloc(items->count * sizeof(ItemPointerData));
 
+	p.blocknum = 0;
+	p.nblocks = nblocks;
+	p.rel = rel;
+	p.vmbuffer = &stream_vmbuffer;
+	p.all_frozen = all_frozen;
+	p.all_visible = all_visible;
+	stream = read_stream_begin_relation(READ_STREAM_FULL,
+										bstrategy,
+										rel,
+										MAIN_FORKNUM,
+										collect_corrupt_items_read_stream_next_block,
+										&p,
+										0);
+
 	/* Loop over every block in the relation. */
-	for (blkno = 0; blkno < nblocks; ++blkno)
+	while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
 	{
 		bool		check_frozen = false;
 		bool		check_visible = false;
-		Buffer		buffer;
 		Page		page;
 		OffsetNumber offnum,
 					maxoff;
@@ -693,29 +756,19 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 		/* Make sure we are interruptible. */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Use the visibility map to decide whether to check this page. */
-		if (all_frozen && VM_ALL_FROZEN(rel, blkno, &vmbuffer))
-			check_frozen = true;
-		if (all_visible && VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
-			check_visible = true;
-		if (!check_visible && !check_frozen)
-			continue;
-
-		/* Read and lock the page. */
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-									bstrategy);
 		LockBuffer(buffer, BUFFER_LOCK_SHARE);
 
 		page = BufferGetPage(buffer);
 		maxoff = PageGetMaxOffsetNumber(page);
+		blkno = BufferGetBlockNumber(buffer);
 
 		/*
 		 * The visibility map bits might have changed while we were acquiring
 		 * the page lock.  Recheck to avoid returning spurious results.
 		 */
-		if (check_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
+		if (all_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
 			check_frozen = false;
-		if (check_visible && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
+		if (all_visible && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
 			check_visible = false;
 		if (!check_visible && !check_frozen)
 		{
@@ -800,10 +853,14 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 
 		UnlockReleaseBuffer(buffer);
 	}
+	Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+	read_stream_end(stream);
 
 	/* Clean up. */
 	if (vmbuffer != InvalidBuffer)
 		ReleaseBuffer(vmbuffer);
+	if (stream_vmbuffer != InvalidBuffer)
+		ReleaseBuffer(stream_vmbuffer);
 	relation_close(rel, AccessShareLock);
 
 	/*
-- 
2.45.2

setup.shapplication/x-sh; name=setup.shDownload

Noah Misch

noah@leadboat.com

over 1 year ago

In reply to: Nazir Bilal Yavuz (#7)

Re: Use read streams in pg_visibility

On Tue, Aug 27, 2024 at 10:49:19AM +0300, Nazir Bilal Yavuz wrote:

On Fri, 23 Aug 2024 at 22:01, Noah Misch <noah@leadboat.com> wrote:

On Fri, Aug 23, 2024 at 02:20:06PM +0300, Nazir Bilal Yavuz wrote:

On Tue, 20 Aug 2024 at 21:47, Noah Misch <noah@leadboat.com> wrote:

On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:

I liked the block_range_read_stream_cb. Attached patches for that
(first 3 patches). I chose an nblocks way instead of last_blocks in
the struct.

To read blocks 10 and 11, I would expect to initialize the struct with one of:

{ .first=10, .nblocks=2 }
{ .first=10, .last_inclusive=11 }
{ .first=10, .last_exclusive=12 }

With the patch's API, I would need {.first=10,.nblocks=12}. The struct field
named "nblocks" behaves like a last_block_exclusive. Please either make the
behavior an "nblocks" behavior or change the field name to replace the term
"nblocks" with something matching the behavior. (I used longer field names in
my examples here, to disambiguate those examples. It's okay if the final
field names aren't those, as long as the field names and the behavior align.)

The callback doesn't return blocks having zero vm bits, so the blkno variable
is not accurate. I didn't test, but I think the loop's "Recheck to avoid
returning spurious results." looks at the bit for the wrong block. If that's
what v1 does, could you expand the test file to catch that? For example, make
a two-block table with only the second block all-visible.

Yes, it was not accurate. I am getting blockno from the buffer now. I
checked and confirmed it is working as expected by manually logging
blocknos returned from the read stream. I am not sure how to add a
test case for this.

VACUUM FREEZE makes an all-visible, all-frozen table. DELETE of a particular
TID, even if rolled back, clears both vm bits for the TID's page. Past tests
like that had instability problems. One cause is a concurrent session's XID
or snapshot, which can prevent VACUUM setting vm bits. Using a TEMP table may
have been one of the countermeasures, but I don't remember clearly. Hence,
please search the archives or the existing pg_visibility tests for how we
dealt with that. It may not be problem for this particular test.

Thanks for the information, I will check these. What I still do not
understand is how to make sure that only the second block is processed
and the first one is skipped. pg_check_visible() and pg_check_frozen()
returns TIDs that cause corruption in the visibility map, there is no
information about block numbers.

I see what you're saying. collect_corrupt_items() needs a corrupt table to
report anything; all corruption-free tables get the same output. Testing this
would need extra C code or techniques like corrupt_page_checksum() to create
the corrupt state. That wouldn't be a bad thing to have, but it's big enough
that I'll consider it out of scope for $SUBJECT. With the callback change
above, I'll be ready to push all this.

Nazir Bilal Yavuz

byavuz81@gmail.com

over 1 year ago

In reply to: Noah Misch (#8)

5 attachment(s)

Re: Use read streams in pg_visibility

Hi,

On Sat, 31 Aug 2024 at 02:51, Noah Misch <noah@leadboat.com> wrote:

To read blocks 10 and 11, I would expect to initialize the struct with one of:

{ .first=10, .nblocks=2 }
{ .first=10, .last_inclusive=11 }
{ .first=10, .last_exclusive=12 }

With the patch's API, I would need {.first=10,.nblocks=12}. The struct field
named "nblocks" behaves like a last_block_exclusive. Please either make the
behavior an "nblocks" behavior or change the field name to replace the term
"nblocks" with something matching the behavior. (I used longer field names in
my examples here, to disambiguate those examples. It's okay if the final
field names aren't those, as long as the field names and the behavior align.)

I decided to use 'current_blocknum' and 'last_exclusive'. I think
these are easier to understand and use.

Thanks for the information, I will check these. What I still do not
understand is how to make sure that only the second block is processed
and the first one is skipped. pg_check_visible() and pg_check_frozen()
returns TIDs that cause corruption in the visibility map, there is no
information about block numbers.

I see what you're saying. collect_corrupt_items() needs a corrupt table to
report anything; all corruption-free tables get the same output. Testing this
would need extra C code or techniques like corrupt_page_checksum() to create
the corrupt state. That wouldn't be a bad thing to have, but it's big enough
that I'll consider it out of scope for $SUBJECT. With the callback change
above, I'll be ready to push all this.

Thanks, updated patches are attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v4-0001-Add-general-use-struct-and-callback-to-read-strea.patchapplication/x-patch; name=v4-0001-Add-general-use-struct-and-callback-to-read-strea.patchDownload

From 1dbdeacc54c3575a2cb82e95aab5b335b0fa1d2f Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Mon, 26 Aug 2024 12:12:52 +0300
Subject: [PATCH v4 1/5] Add general-use struct and callback to read stream

Number of callbacks that just iterates over block range in read stream
are increasing. So, add general-use struct and callback to read stream.
---
 src/include/storage/read_stream.h     | 13 +++++++++++++
 src/backend/storage/aio/read_stream.c | 21 +++++++++++++++++++--
 src/tools/pgindent/typedefs.list      |  1 +
 3 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
index 4e599904f26..0a2398fd7df 100644
--- a/src/include/storage/read_stream.h
+++ b/src/include/storage/read_stream.h
@@ -45,11 +45,24 @@
 struct ReadStream;
 typedef struct ReadStream ReadStream;
 
+/*
+ * General-use struct to use in callback functions for block range scans.
+ * Callback loops between current_blocknum (inclusive) and last_exclusive.
+ */
+typedef struct BlockRangeReadStreamPrivate
+{
+	BlockNumber current_blocknum;
+	BlockNumber last_exclusive;
+} BlockRangeReadStreamPrivate;
+
 /* Callback that returns the next block number to read. */
 typedef BlockNumber (*ReadStreamBlockNumberCB) (ReadStream *stream,
 												void *callback_private_data,
 												void *per_buffer_data);
 
+extern BlockNumber block_range_read_stream_cb(ReadStream *stream,
+											  void *callback_private_data,
+											  void *per_buffer_data);
 extern ReadStream *read_stream_begin_relation(int flags,
 											  BufferAccessStrategy strategy,
 											  Relation rel,
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 93cdd35fea0..8a449bab8a0 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -164,8 +164,25 @@ get_per_buffer_data(ReadStream *stream, int16 buffer_index)
 }
 
 /*
- * Ask the callback which block it would like us to read next, with a one block
- * buffer in front to allow read_stream_unget_block() to work.
+ * General-use callback function for block range scans.
+ */
+BlockNumber
+block_range_read_stream_cb(ReadStream *stream,
+						   void *callback_private_data,
+						   void *per_buffer_data)
+{
+	BlockRangeReadStreamPrivate *p = callback_private_data;
+
+	if (p->current_blocknum < p->last_exclusive)
+		return p->current_blocknum++;
+
+	return InvalidBlockNumber;
+}
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow read_stream_unget_block() to work and to allow the
+ * fast path to skip this function and work directly from the array.
  */
 static inline BlockNumber
 read_stream_get_block(ReadStream *stream, void *per_buffer_data)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9e951a9e6f3..df3f336bec0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -275,6 +275,7 @@ BlockId
 BlockIdData
 BlockInfoRecord
 BlockNumber
+BlockRangeReadStreamPrivate
 BlockRefTable
 BlockRefTableBuffer
 BlockRefTableChunk
-- 
2.45.2

v4-0002-pg_prewarm-Use-generic-use-read-stream-struct-and.patchapplication/x-patch; name=v4-0002-pg_prewarm-Use-generic-use-read-stream-struct-and.patchDownload

From 84bc78cc36814309634c3686827b8a222bdfbfef Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Mon, 26 Aug 2024 12:16:06 +0300
Subject: [PATCH v4 2/5] pg_prewarm: Use generic-use read stream struct and
 callback

Instead of creating another read stream struct and callback, use generic
one that is exported from read stream.
---
 contrib/pg_prewarm/pg_prewarm.c | 27 ++++-----------------------
 1 file changed, 4 insertions(+), 23 deletions(-)

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 5c859e983c5..243d36c46e8 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -39,25 +39,6 @@ typedef enum
 
 static PGIOAlignedBlock blockbuffer;
 
-struct pg_prewarm_read_stream_private
-{
-	BlockNumber blocknum;
-	int64		last_block;
-};
-
-static BlockNumber
-pg_prewarm_read_stream_next_block(ReadStream *stream,
-								  void *callback_private_data,
-								  void *per_buffer_data)
-{
-	struct pg_prewarm_read_stream_private *p = callback_private_data;
-
-	if (p->blocknum <= p->last_block)
-		return p->blocknum++;
-
-	return InvalidBlockNumber;
-}
-
 /*
  * pg_prewarm(regclass, mode text, fork text,
  *			  first_block int8, last_block int8)
@@ -203,7 +184,7 @@ pg_prewarm(PG_FUNCTION_ARGS)
 	}
 	else if (ptype == PREWARM_BUFFER)
 	{
-		struct pg_prewarm_read_stream_private p;
+		BlockRangeReadStreamPrivate p;
 		ReadStream *stream;
 
 		/*
@@ -211,14 +192,14 @@ pg_prewarm(PG_FUNCTION_ARGS)
 		 */
 
 		/* Set up the private state for our streaming buffer read callback. */
-		p.blocknum = first_block;
-		p.last_block = last_block;
+		p.current_blocknum = first_block;
+		p.last_exclusive = last_block + 1;
 
 		stream = read_stream_begin_relation(READ_STREAM_FULL,
 											NULL,
 											rel,
 											forkNumber,
-											pg_prewarm_read_stream_next_block,
+											block_range_read_stream_cb,
 											&p,
 											0);
 
-- 
2.45.2

v4-0003-RelationCopyStorageUsingBuffer-Use-generic-use-re.patchapplication/x-patch; name=v4-0003-RelationCopyStorageUsingBuffer-Use-generic-use-re.patchDownload

From d5a743b4541d43abb5808696e72d50333a755ef7 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Mon, 26 Aug 2024 12:24:26 +0300
Subject: [PATCH v4 3/5] RelationCopyStorageUsingBuffer: Use generic-use read
 stream struct and callback

Instead of creating another read stream struct and callback, use generic
one that is exported from read stream.
---
 src/backend/storage/buffer/bufmgr.c | 35 ++++-------------------------
 1 file changed, 4 insertions(+), 31 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5cdd2f10fc8..7dc800888e0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -136,33 +136,6 @@ typedef struct SMgrSortArray
 	SMgrRelation srel;
 } SMgrSortArray;
 
-/*
- * Helper struct for read stream object used in
- * RelationCopyStorageUsingBuffer() function.
- */
-struct copy_storage_using_buffer_read_stream_private
-{
-	BlockNumber blocknum;
-	BlockNumber nblocks;
-};
-
-/*
- * Callback function to get next block for read stream object used in
- * RelationCopyStorageUsingBuffer() function.
- */
-static BlockNumber
-copy_storage_using_buffer_read_stream_next_block(ReadStream *stream,
-												 void *callback_private_data,
-												 void *per_buffer_data)
-{
-	struct copy_storage_using_buffer_read_stream_private *p = callback_private_data;
-
-	if (p->blocknum < p->nblocks)
-		return p->blocknum++;
-
-	return InvalidBlockNumber;
-}
-
 /* GUC variables */
 bool		zero_damaged_pages = false;
 int			bgwriter_lru_maxpages = 100;
@@ -4710,7 +4683,7 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 	PGIOAlignedBlock buf;
 	BufferAccessStrategy bstrategy_src;
 	BufferAccessStrategy bstrategy_dst;
-	struct copy_storage_using_buffer_read_stream_private p;
+	BlockRangeReadStreamPrivate p;
 	ReadStream *src_stream;
 	SMgrRelation src_smgr;
 
@@ -4742,15 +4715,15 @@ RelationCopyStorageUsingBuffer(RelFileLocator srclocator,
 	bstrategy_dst = GetAccessStrategy(BAS_BULKWRITE);
 
 	/* Initalize streaming read */
-	p.blocknum = 0;
-	p.nblocks = nblocks;
+	p.current_blocknum = 0;
+	p.last_exclusive = nblocks;
 	src_smgr = smgropen(srclocator, INVALID_PROC_NUMBER);
 	src_stream = read_stream_begin_smgr_relation(READ_STREAM_FULL,
 												 bstrategy_src,
 												 src_smgr,
 												 permanent ? RELPERSISTENCE_PERMANENT : RELPERSISTENCE_UNLOGGED,
 												 forkNum,
-												 copy_storage_using_buffer_read_stream_next_block,
+												 block_range_read_stream_cb,
 												 &p,
 												 0);
 
-- 
2.45.2

v4-0004-Use-read-stream-in-pg_visibility-in-collect_visib.patchapplication/x-patch; name=v4-0004-Use-read-stream-in-pg_visibility-in-collect_visib.patchDownload

From 03d07e5a20b8a1c74074861f8b16f2c88b48667a Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Mon, 26 Aug 2024 15:45:20 +0300
Subject: [PATCH v4 4/5] Use read stream in pg_visibility in
 collect_visibility_data()

Instead of reading blocks with ReadBufferExtended() in
collect_visibility_data(), use read stream.

This change provides about 4% performance improvement.
---
 contrib/pg_visibility/pg_visibility.c | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 773ba92e454..087cd23d1b0 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -21,6 +21,7 @@
 #include "storage/bufmgr.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
+#include "storage/read_stream.h"
 #include "storage/smgr.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
@@ -478,6 +479,8 @@ collect_visibility_data(Oid relid, bool include_pd)
 	BlockNumber blkno;
 	Buffer		vmbuffer = InvalidBuffer;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+	BlockRangeReadStreamPrivate p;
+	ReadStream *stream = NULL;
 
 	rel = relation_open(relid, AccessShareLock);
 
@@ -489,6 +492,20 @@ collect_visibility_data(Oid relid, bool include_pd)
 	info->next = 0;
 	info->count = nblocks;
 
+	/* Crate a read stream if read will be done. */
+	if (include_pd)
+	{
+		p.current_blocknum = 0;
+		p.last_exclusive = nblocks;
+		stream = read_stream_begin_relation(READ_STREAM_FULL,
+											bstrategy,
+											rel,
+											MAIN_FORKNUM,
+											block_range_read_stream_cb,
+											&p,
+											0);
+	}
+
 	for (blkno = 0; blkno < nblocks; ++blkno)
 	{
 		int32		mapbits;
@@ -513,8 +530,7 @@ collect_visibility_data(Oid relid, bool include_pd)
 			Buffer		buffer;
 			Page		page;
 
-			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-										bstrategy);
+			buffer = read_stream_next_buffer(stream, NULL);
 			LockBuffer(buffer, BUFFER_LOCK_SHARE);
 
 			page = BufferGetPage(buffer);
@@ -525,6 +541,12 @@ collect_visibility_data(Oid relid, bool include_pd)
 		}
 	}
 
+	if (include_pd && stream)
+	{
+		Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+		read_stream_end(stream);
+	}
+
 	/* Clean up. */
 	if (vmbuffer != InvalidBuffer)
 		ReleaseBuffer(vmbuffer);
-- 
2.45.2

v4-0005-Use-read-stream-in-pg_visibility-in-collect_corru.patchapplication/x-patch; name=v4-0005-Use-read-stream-in-pg_visibility-in-collect_corru.patchDownload

From d8017fb647a7438a7e061380e0859a62d7095ab7 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Fri, 23 Aug 2024 10:15:09 +0300
Subject: [PATCH v4 5/5] Use read stream in pg_visibility in
 collect_corrupt_items()

Instead of reading blocks with ReadBufferExtended() in
collect_corrupt_items(), use read stream.

This change provides about 5% performance improvement.
---
 contrib/pg_visibility/pg_visibility.c | 87 ++++++++++++++++++++++-----
 1 file changed, 72 insertions(+), 15 deletions(-)

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 087cd23d1b0..3b6ead83c45 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -42,6 +42,20 @@ typedef struct corrupt_items
 	ItemPointer tids;
 } corrupt_items;
 
+/*
+ * Helper struct for read stream object used in
+ * collect_corrupt_items() function.
+ */
+struct collect_corrupt_items_read_stream_private
+{
+	bool		all_frozen;
+	bool		all_visible;
+	BlockNumber current_blocknum;
+	BlockNumber last_exclusive;
+	Relation	rel;
+	Buffer	   *vmbuffer;
+};
+
 PG_FUNCTION_INFO_V1(pg_visibility_map);
 PG_FUNCTION_INFO_V1(pg_visibility_map_rel);
 PG_FUNCTION_INFO_V1(pg_visibility);
@@ -632,6 +646,38 @@ GetStrictOldestNonRemovableTransactionId(Relation rel)
 	}
 }
 
+/*
+ * Callback function to get next block for read stream object used in
+ * collect_corrupt_items() function.
+ */
+static BlockNumber
+collect_corrupt_items_read_stream_next_block(ReadStream *stream,
+											 void *callback_private_data,
+											 void *per_buffer_data)
+{
+	struct collect_corrupt_items_read_stream_private *p = callback_private_data;
+
+	for (; p->current_blocknum < p->last_exclusive; p->current_blocknum++)
+	{
+		bool		check_frozen = false;
+		bool		check_visible = false;
+
+		/* Make sure we are interruptible. */
+		CHECK_FOR_INTERRUPTS();
+
+		if (p->all_frozen && VM_ALL_FROZEN(p->rel, p->current_blocknum, p->vmbuffer))
+			check_frozen = true;
+		if (p->all_visible && VM_ALL_VISIBLE(p->rel, p->current_blocknum, p->vmbuffer))
+			check_visible = true;
+		if (!check_visible && !check_frozen)
+			continue;
+
+		return p->current_blocknum++;
+	}
+
+	return InvalidBlockNumber;
+}
+
 /*
  * Returns a list of items whose visibility map information does not match
  * the status of the tuples on the page.
@@ -654,8 +700,12 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	corrupt_items *items;
 	BlockNumber blkno;
 	Buffer		vmbuffer = InvalidBuffer;
+	Buffer		stream_vmbuffer = InvalidBuffer;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 	TransactionId OldestXmin = InvalidTransactionId;
+	struct collect_corrupt_items_read_stream_private p;
+	ReadStream *stream;
+	Buffer		buffer;
 
 	rel = relation_open(relid, AccessShareLock);
 
@@ -680,12 +730,25 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	items->count = 64;
 	items->tids = palloc(items->count * sizeof(ItemPointerData));
 
+	p.current_blocknum = 0;
+	p.last_exclusive = nblocks;
+	p.rel = rel;
+	p.vmbuffer = &stream_vmbuffer;
+	p.all_frozen = all_frozen;
+	p.all_visible = all_visible;
+	stream = read_stream_begin_relation(READ_STREAM_FULL,
+										bstrategy,
+										rel,
+										MAIN_FORKNUM,
+										collect_corrupt_items_read_stream_next_block,
+										&p,
+										0);
+
 	/* Loop over every block in the relation. */
-	for (blkno = 0; blkno < nblocks; ++blkno)
+	while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
 	{
 		bool		check_frozen = false;
 		bool		check_visible = false;
-		Buffer		buffer;
 		Page		page;
 		OffsetNumber offnum,
 					maxoff;
@@ -693,29 +756,19 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 		/* Make sure we are interruptible. */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Use the visibility map to decide whether to check this page. */
-		if (all_frozen && VM_ALL_FROZEN(rel, blkno, &vmbuffer))
-			check_frozen = true;
-		if (all_visible && VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
-			check_visible = true;
-		if (!check_visible && !check_frozen)
-			continue;
-
-		/* Read and lock the page. */
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-									bstrategy);
 		LockBuffer(buffer, BUFFER_LOCK_SHARE);
 
 		page = BufferGetPage(buffer);
 		maxoff = PageGetMaxOffsetNumber(page);
+		blkno = BufferGetBlockNumber(buffer);
 
 		/*
 		 * The visibility map bits might have changed while we were acquiring
 		 * the page lock.  Recheck to avoid returning spurious results.
 		 */
-		if (check_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
+		if (all_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
 			check_frozen = false;
-		if (check_visible && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
+		if (all_visible && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
 			check_visible = false;
 		if (!check_visible && !check_frozen)
 		{
@@ -800,10 +853,14 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 
 		UnlockReleaseBuffer(buffer);
 	}
+	Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+	read_stream_end(stream);
 
 	/* Clean up. */
 	if (vmbuffer != InvalidBuffer)
 		ReleaseBuffer(vmbuffer);
+	if (stream_vmbuffer != InvalidBuffer)
+		ReleaseBuffer(stream_vmbuffer);
 	relation_close(rel, AccessShareLock);
 
 	/*
-- 
2.45.2

#10

Noah Misch

noah@leadboat.com

over 1 year ago

In reply to: Nazir Bilal Yavuz (#9)

Re: Use read streams in pg_visibility

On Mon, Sep 02, 2024 at 03:20:12PM +0300, Nazir Bilal Yavuz wrote:

Thanks, updated patches are attached.

+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow read_stream_unget_block() to work and to allow the
+ * fast path to skip this function and work directly from the array.
*/
static inline BlockNumber
read_stream_get_block(ReadStream *stream, void *per_buffer_data)

v4-0001-Add-general-use-struct-and-callback-to-read-strea.patch introduced
this update to the read_stream_get_block() header comment, but we're not
changing read_stream_get_block() here. I reverted this.

Pushed with some other cosmetic changes.

#11

Noah Misch

noah@leadboat.com

over 1 year ago

In reply to: Noah Misch (#10)

Re: Use read streams in pg_visibility

On Tue, Sep 03, 2024 at 10:50:11AM -0700, Noah Misch wrote:

On Mon, Sep 02, 2024 at 03:20:12PM +0300, Nazir Bilal Yavuz wrote:

Thanks, updated patches are attached.
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow read_stream_unget_block() to work and to allow the
+ * fast path to skip this function and work directly from the array.
*/
static inline BlockNumber
read_stream_get_block(ReadStream *stream, void *per_buffer_data)
v4-0001-Add-general-use-struct-and-callback-to-read-strea.patch introduced
this update to the read_stream_get_block() header comment, but we're not
changing read_stream_get_block() here. I reverted this.

Pushed with some other cosmetic changes.

I see I pushed something unacceptable under ASAN. I will look into that:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-09-03%2017%3A47%3A20

#12

Nazir Bilal Yavuz

byavuz81@gmail.com

over 1 year ago

In reply to: Noah Misch (#11)

1 attachment(s)

Re: Use read streams in pg_visibility

Hi,

On Tue, 3 Sept 2024 at 22:20, Noah Misch <noah@leadboat.com> wrote:

On Tue, Sep 03, 2024 at 10:50:11AM -0700, Noah Misch wrote:
On Mon, Sep 02, 2024 at 03:20:12PM +0300, Nazir Bilal Yavuz wrote:

Thanks, updated patches are attached.
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow read_stream_unget_block() to work and to allow the
+ * fast path to skip this function and work directly from the array.
*/
static inline BlockNumber
read_stream_get_block(ReadStream *stream, void *per_buffer_data)
v4-0001-Add-general-use-struct-and-callback-to-read-strea.patch introduced
this update to the read_stream_get_block() header comment, but we're not
changing read_stream_get_block() here. I reverted this.

Sorry, it should be left from rebase. Thanks for reverting it.

Pushed with some other cosmetic changes.

Thanks!

I see I pushed something unacceptable under ASAN. I will look into that:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-09-03%2017%3A47%3A20

I think it is related to the scope of BlockRangeReadStreamPrivate in
the collect_visibility_data() function. Attached a small fix for that.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v5-0001-Fix-ASAN-error-introduced-in-ed1b1ee59f.patchtext/x-patch; charset=US-ASCII; name=v5-0001-Fix-ASAN-error-introduced-in-ed1b1ee59f.patchDownload

From 5393dfe6e2a0888cb33ad15f31def1691637564f Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Tue, 3 Sep 2024 22:35:00 +0300
Subject: [PATCH v5] Fix ASAN error introduced in ed1b1ee59f

---
 contrib/pg_visibility/pg_visibility.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 9975e8876e6..db796e35cb2 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -490,6 +490,7 @@ collect_visibility_data(Oid relid, bool include_pd)
 	BlockNumber blkno;
 	Buffer		vmbuffer = InvalidBuffer;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+	BlockRangeReadStreamPrivate p;
 	ReadStream *stream = NULL;
 
 	rel = relation_open(relid, AccessShareLock);
@@ -505,8 +506,6 @@ collect_visibility_data(Oid relid, bool include_pd)
 	/* Create a stream if reading main fork. */
 	if (include_pd)
 	{
-		BlockRangeReadStreamPrivate p;
-
 		p.current_blocknum = 0;
 		p.last_exclusive = nblocks;
 		stream = read_stream_begin_relation(READ_STREAM_FULL,
-- 
2.45.2

#13

Noah Misch

noah@leadboat.com

over 1 year ago

In reply to: Nazir Bilal Yavuz (#12)

Re: Use read streams in pg_visibility

On Tue, Sep 03, 2024 at 10:46:34PM +0300, Nazir Bilal Yavuz wrote:

On Tue, 3 Sept 2024 at 22:20, Noah Misch <noah@leadboat.com> wrote:

On Tue, Sep 03, 2024 at 10:50:11AM -0700, Noah Misch wrote:

Pushed with some other cosmetic changes.

Thanks!

I see I pushed something unacceptable under ASAN. I will look into that:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-09-03%2017%3A47%3A20

I think it is related to the scope of BlockRangeReadStreamPrivate in
the collect_visibility_data() function. Attached a small fix for that.

Right.

/messages/by-id/CAEudQAozv3wTY5TV2t29JcwPydbmKbiWQkZD42S2OgzdixPMDQ@mail.gmail.com
then observed that collect_corrupt_items() was now guaranteed to never detect
corruption. I have pushed revert ddfc556 of the pg_visibility.c changes. For
the next try, could you add that testing we discussed?

#14

Nazir Bilal Yavuz

byavuz81@gmail.com

over 1 year ago

In reply to: Noah Misch (#13)

Re: Use read streams in pg_visibility

Hi,

On Wed, 4 Sept 2024 at 21:43, Noah Misch <noah@leadboat.com> wrote:

/messages/by-id/CAEudQAozv3wTY5TV2t29JcwPydbmKbiWQkZD42S2OgzdixPMDQ@mail.gmail.com
then observed that collect_corrupt_items() was now guaranteed to never detect
corruption. I have pushed revert ddfc556 of the pg_visibility.c changes. For
the next try, could you add that testing we discussed?

Do you think that corrupting the visibility map and then seeing if
pg_check_visible() and pg_check_frozen() report something is enough?

--
Regards,
Nazir Bilal Yavuz
Microsoft

#15

Noah Misch

noah@leadboat.com

over 1 year ago

In reply to: Nazir Bilal Yavuz (#14)

Re: Use read streams in pg_visibility

On Thu, Sep 05, 2024 at 03:59:53PM +0300, Nazir Bilal Yavuz wrote:

On Wed, 4 Sept 2024 at 21:43, Noah Misch <noah@leadboat.com> wrote:

/messages/by-id/CAEudQAozv3wTY5TV2t29JcwPydbmKbiWQkZD42S2OgzdixPMDQ@mail.gmail.com
then observed that collect_corrupt_items() was now guaranteed to never detect
corruption. I have pushed revert ddfc556 of the pg_visibility.c changes. For
the next try, could you add that testing we discussed?

Do you think that corrupting the visibility map and then seeing if
pg_check_visible() and pg_check_frozen() report something is enough?

I think so. Please check that it would have caught both the blkno bug and the
above bug.

#16

Nazir Bilal Yavuz

byavuz81@gmail.com

over 1 year ago

In reply to: Noah Misch (#15)

2 attachment(s)

Re: Use read streams in pg_visibility

Hi,

On Thu, 5 Sept 2024 at 18:54, Noah Misch <noah@leadboat.com> wrote:

On Thu, Sep 05, 2024 at 03:59:53PM +0300, Nazir Bilal Yavuz wrote:

On Wed, 4 Sept 2024 at 21:43, Noah Misch <noah@leadboat.com> wrote:

/messages/by-id/CAEudQAozv3wTY5TV2t29JcwPydbmKbiWQkZD42S2OgzdixPMDQ@mail.gmail.com
then observed that collect_corrupt_items() was now guaranteed to never detect
corruption. I have pushed revert ddfc556 of the pg_visibility.c changes. For
the next try, could you add that testing we discussed?

Do you think that corrupting the visibility map and then seeing if
pg_check_visible() and pg_check_frozen() report something is enough?

I think so. Please check that it would have caught both the blkno bug and the
above bug.

The test and updated patch files are attached. In that test I
overwrite the visibility map file with its older state. Something like
this:

1- Create the table, then run VACUUM FREEZE on the table.
2- Shutdown the server, create a copy of the vm file, restart the server.
3- Run the DELETE command on some $random_tuples.
4- Shutdown the server, overwrite the vm file with the #2 vm file,
restart the server.
5- pg_check_visible and pg_check_frozen must report $random_tuples as corrupted.

Do you think this test makes sense and enough?

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v6-0001-Add-tests-that-pg_check_-visible-frozen-report-co.patchtext/x-patch; charset=US-ASCII; name=v6-0001-Add-tests-that-pg_check_-visible-frozen-report-co.patchDownload

From 1f7b2cfe0d2247eb4768ce79d14982701437d473 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Mon, 9 Sep 2024 15:27:53 +0300
Subject: [PATCH v6 1/2] Add tests that pg_check_[visible|frozen] report
 corrupted tuples

Currently, there are no tests that pg_check_[visible|frozen] need to
report corrupted tuples. Add that kind of tests.

To add these tests, visibility map needs to be corrupted. It is done by
overwriting visibility map by its older state.
---
 contrib/pg_visibility/meson.build         |   1 +
 contrib/pg_visibility/t/002_corrupt_vm.pl | 100 ++++++++++++++++++++++
 2 files changed, 101 insertions(+)
 create mode 100644 contrib/pg_visibility/t/002_corrupt_vm.pl

diff --git a/contrib/pg_visibility/meson.build b/contrib/pg_visibility/meson.build
index f3c1263313a..609fc5f9f3e 100644
--- a/contrib/pg_visibility/meson.build
+++ b/contrib/pg_visibility/meson.build
@@ -36,6 +36,7 @@ tests += {
   'tap': {
     'tests': [
       't/001_concurrent_transaction.pl',
+      't/002_corrupt_vm.pl',
     ],
   },
 }
diff --git a/contrib/pg_visibility/t/002_corrupt_vm.pl b/contrib/pg_visibility/t/002_corrupt_vm.pl
new file mode 100644
index 00000000000..cc6d043b8b7
--- /dev/null
+++ b/contrib/pg_visibility/t/002_corrupt_vm.pl
@@ -0,0 +1,100 @@
+# Copyright (c) 2021-2024, PostgreSQL Global Development Group
+
+# Check that pg_check_visible() and pg_check_frozen() reports
+# correct TIDs when there is a corruption.
+use strict;
+use warnings FATAL => 'all';
+use File::Copy;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->start;
+
+my $blck_size = $node->safe_psql("postgres", "SHOW block_size;");
+
+# Create a sample table with at least 10 pages and then run VACUUM. 10 is
+# selected manually as it is big enough to select 5 random tuples from the
+# relation.
+$node->safe_psql(
+	'postgres', qq(
+		CREATE EXTENSION pg_visibility;
+	    CREATE TABLE corruption_test
+            WITH (autovacuum_enabled = false) AS
+           SELECT
+				i,
+				repeat('a', 10) AS data
+			FROM
+    			generate_series(1, $blck_size) i;
+	    VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) corruption_test;
+    )
+);
+
+# VACUUM is run, it is safe to get the number of pages from the relpages.
+my $npages = $node->safe_psql(
+	"postgres",
+	"SELECT relpages FROM pg_class
+		WHERE  relname = 'corruption_test';"
+);
+
+ok($npages >= 10, "There should be at least 10 pages in the table");
+
+my $file = $node->safe_psql("postgres",
+	"SELECT pg_relation_filepath('corruption_test');");
+
+# Copy visibility map.
+$node->stop;
+my $vm_file = $node->data_dir . '/' . $file . '_vm';
+copy("$vm_file", "${vm_file}_temp");
+$node->start;
+
+# Select 5 random tuples from the relation.
+my $tuples = $node->safe_psql(
+	"postgres",
+	"SELECT ctid FROM (
+        SELECT ctid FROM corruption_test
+            ORDER BY random() LIMIT 5)
+        ORDER BY ctid ASC;"
+);
+
+# Do the changes below to use tuples in the query.
+# "\n" -> ","
+# "(" -> "'("
+# ")" -> ")'"
+(my $tuples_query = $tuples) =~ s/\n/,/g;
+$tuples_query =~ s/\(/\'\(/g;
+$tuples_query =~ s/\)/\)\'/g;
+
+$node->safe_psql(
+	"postgres",
+	"DELETE FROM corruption_test
+		WHERE ctid in ($tuples_query);"
+);
+
+# Overwrite visibility map with the old one.
+$node->stop;
+move("${vm_file}_temp", "$vm_file");
+$node->start;
+
+my $result = $node->safe_psql(
+	"postgres",
+	"SELECT DISTINCT t_ctid
+		FROM pg_check_visible('corruption_test')
+		ORDER BY t_ctid ASC;"
+);
+
+is($result, $tuples, 'pg_check_visible must report tuples as corrupted');
+
+$result = $node->safe_psql(
+	"postgres",
+	"SELECT DISTINCT t_ctid
+		FROM pg_check_frozen('corruption_test')
+		ORDER BY t_ctid ASC;"
+);
+
+is($result, $tuples, 'pg_check_frozen must report tuples as corrupted');
+
+$node->stop;
+done_testing();
-- 
2.45.2

v6-0002-Optimize-pg_visibility-with-read-streams.patchtext/x-patch; charset=US-ASCII; name=v6-0002-Optimize-pg_visibility-with-read-streams.patchDownload

From 69d095e9050f7f747cc2586fc9d2900f3bee7380 Mon Sep 17 00:00:00 2001
From: Noah Misch <noah@leadboat.com>
Date: Tue, 3 Sep 2024 10:46:20 -0700
Subject: [PATCH v6 2/2] Optimize pg_visibility with read streams.

We've measured 5% performance improvement, and this arranges to benefit
automatically from future optimizations to the read_stream subsystem.

Nazir Bilal Yavuz

Discussion: https://postgr.es/m/CAN55FZ1_Ru3XpMgTwsU67FTH2fs_FrRROmb7x6zs+F44QBEiww@mail.gmail.com
---
 contrib/pg_visibility/pg_visibility.c | 113 +++++++++++++++++++++-----
 1 file changed, 92 insertions(+), 21 deletions(-)

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 773ba92e454..724122b1bc5 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -21,6 +21,7 @@
 #include "storage/bufmgr.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
+#include "storage/read_stream.h"
 #include "storage/smgr.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
@@ -41,6 +42,17 @@ typedef struct corrupt_items
 	ItemPointer tids;
 } corrupt_items;
 
+/* for collect_corrupt_items_read_stream_next_block */
+struct collect_corrupt_items_read_stream_private
+{
+	bool		all_frozen;
+	bool		all_visible;
+	BlockNumber current_blocknum;
+	BlockNumber last_exclusive;
+	Relation	rel;
+	Buffer		vmbuffer;
+};
+
 PG_FUNCTION_INFO_V1(pg_visibility_map);
 PG_FUNCTION_INFO_V1(pg_visibility_map_rel);
 PG_FUNCTION_INFO_V1(pg_visibility);
@@ -478,6 +490,8 @@ collect_visibility_data(Oid relid, bool include_pd)
 	BlockNumber blkno;
 	Buffer		vmbuffer = InvalidBuffer;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+	BlockRangeReadStreamPrivate p;
+	ReadStream *stream = NULL;
 
 	rel = relation_open(relid, AccessShareLock);
 
@@ -489,6 +503,20 @@ collect_visibility_data(Oid relid, bool include_pd)
 	info->next = 0;
 	info->count = nblocks;
 
+	/* Create a stream if reading main fork. */
+	if (include_pd)
+	{
+		p.current_blocknum = 0;
+		p.last_exclusive = nblocks;
+		stream = read_stream_begin_relation(READ_STREAM_FULL,
+											bstrategy,
+											rel,
+											MAIN_FORKNUM,
+											block_range_read_stream_cb,
+											&p,
+											0);
+	}
+
 	for (blkno = 0; blkno < nblocks; ++blkno)
 	{
 		int32		mapbits;
@@ -513,8 +541,7 @@ collect_visibility_data(Oid relid, bool include_pd)
 			Buffer		buffer;
 			Page		page;
 
-			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-										bstrategy);
+			buffer = read_stream_next_buffer(stream, NULL);
 			LockBuffer(buffer, BUFFER_LOCK_SHARE);
 
 			page = BufferGetPage(buffer);
@@ -525,6 +552,12 @@ collect_visibility_data(Oid relid, bool include_pd)
 		}
 	}
 
+	if (include_pd)
+	{
+		Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+		read_stream_end(stream);
+	}
+
 	/* Clean up. */
 	if (vmbuffer != InvalidBuffer)
 		ReleaseBuffer(vmbuffer);
@@ -610,6 +643,38 @@ GetStrictOldestNonRemovableTransactionId(Relation rel)
 	}
 }
 
+/*
+ * Callback function to get next block for read stream object used in
+ * collect_corrupt_items() function.
+ */
+static BlockNumber
+collect_corrupt_items_read_stream_next_block(ReadStream *stream,
+											 void *callback_private_data,
+											 void *per_buffer_data)
+{
+	struct collect_corrupt_items_read_stream_private *p = callback_private_data;
+
+	for (; p->current_blocknum < p->last_exclusive; p->current_blocknum++)
+	{
+		bool		check_frozen = false;
+		bool		check_visible = false;
+
+		/* Make sure we are interruptible. */
+		CHECK_FOR_INTERRUPTS();
+
+		if (p->all_frozen && VM_ALL_FROZEN(p->rel, p->current_blocknum, &p->vmbuffer))
+			check_frozen = true;
+		if (p->all_visible && VM_ALL_VISIBLE(p->rel, p->current_blocknum, &p->vmbuffer))
+			check_visible = true;
+		if (!check_visible && !check_frozen)
+			continue;
+
+		return p->current_blocknum++;
+	}
+
+	return InvalidBlockNumber;
+}
+
 /*
  * Returns a list of items whose visibility map information does not match
  * the status of the tuples on the page.
@@ -628,12 +693,13 @@ static corrupt_items *
 collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 {
 	Relation	rel;
-	BlockNumber nblocks;
 	corrupt_items *items;
-	BlockNumber blkno;
 	Buffer		vmbuffer = InvalidBuffer;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 	TransactionId OldestXmin = InvalidTransactionId;
+	struct collect_corrupt_items_read_stream_private p;
+	ReadStream *stream;
+	Buffer		buffer;
 
 	rel = relation_open(relid, AccessShareLock);
 
@@ -643,8 +709,6 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	if (all_visible)
 		OldestXmin = GetStrictOldestNonRemovableTransactionId(rel);
 
-	nblocks = RelationGetNumberOfBlocks(rel);
-
 	/*
 	 * Guess an initial array size. We don't expect many corrupted tuples, so
 	 * start with a small array.  This function uses the "next" field to track
@@ -658,34 +722,38 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	items->count = 64;
 	items->tids = palloc(items->count * sizeof(ItemPointerData));
 
+	p.current_blocknum = 0;
+	p.last_exclusive = RelationGetNumberOfBlocks(rel);
+	p.rel = rel;
+	p.vmbuffer = InvalidBuffer;
+	p.all_frozen = all_frozen;
+	p.all_visible = all_visible;
+	stream = read_stream_begin_relation(READ_STREAM_FULL,
+										bstrategy,
+										rel,
+										MAIN_FORKNUM,
+										collect_corrupt_items_read_stream_next_block,
+										&p,
+										0);
+
 	/* Loop over every block in the relation. */
-	for (blkno = 0; blkno < nblocks; ++blkno)
+	while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
 	{
-		bool		check_frozen = false;
-		bool		check_visible = false;
-		Buffer		buffer;
+		bool		check_frozen = all_frozen;
+		bool		check_visible = all_visible;
 		Page		page;
 		OffsetNumber offnum,
 					maxoff;
+		BlockNumber blkno;
 
 		/* Make sure we are interruptible. */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Use the visibility map to decide whether to check this page. */
-		if (all_frozen && VM_ALL_FROZEN(rel, blkno, &vmbuffer))
-			check_frozen = true;
-		if (all_visible && VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
-			check_visible = true;
-		if (!check_visible && !check_frozen)
-			continue;
-
-		/* Read and lock the page. */
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-									bstrategy);
 		LockBuffer(buffer, BUFFER_LOCK_SHARE);
 
 		page = BufferGetPage(buffer);
 		maxoff = PageGetMaxOffsetNumber(page);
+		blkno = BufferGetBlockNumber(buffer);
 
 		/*
 		 * The visibility map bits might have changed while we were acquiring
@@ -778,10 +846,13 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 
 		UnlockReleaseBuffer(buffer);
 	}
+	read_stream_end(stream);
 
 	/* Clean up. */
 	if (vmbuffer != InvalidBuffer)
 		ReleaseBuffer(vmbuffer);
+	if (p.vmbuffer != InvalidBuffer)
+		ReleaseBuffer(p.vmbuffer);
 	relation_close(rel, AccessShareLock);
 
 	/*
-- 
2.45.2

#17

Noah Misch

noah@leadboat.com

over 1 year ago

In reply to: Nazir Bilal Yavuz (#16)

1 attachment(s)

Re: Use read streams in pg_visibility

On Mon, Sep 09, 2024 at 06:25:07PM +0300, Nazir Bilal Yavuz wrote:

On Thu, 5 Sept 2024 at 18:54, Noah Misch <noah@leadboat.com> wrote:

On Thu, Sep 05, 2024 at 03:59:53PM +0300, Nazir Bilal Yavuz wrote:

On Wed, 4 Sept 2024 at 21:43, Noah Misch <noah@leadboat.com> wrote:

/messages/by-id/CAEudQAozv3wTY5TV2t29JcwPydbmKbiWQkZD42S2OgzdixPMDQ@mail.gmail.com
then observed that collect_corrupt_items() was now guaranteed to never detect
corruption. I have pushed revert ddfc556 of the pg_visibility.c changes. For
the next try, could you add that testing we discussed?

Do you think that corrupting the visibility map and then seeing if
pg_check_visible() and pg_check_frozen() report something is enough?

I think so. Please check that it would have caught both the blkno bug and the
above bug.

The test and updated patch files are attached. In that test I
overwrite the visibility map file with its older state. Something like
this:

1- Create the table, then run VACUUM FREEZE on the table.
2- Shutdown the server, create a copy of the vm file, restart the server.
3- Run the DELETE command on some $random_tuples.
4- Shutdown the server, overwrite the vm file with the #2 vm file,
restart the server.
5- pg_check_visible and pg_check_frozen must report $random_tuples as corrupted.

Copying the vm file is a good technique. In my test runs, this does catch the
"never detect" bug, but it doesn't catch the blkno bug. Can you make it catch
both? It's possible my hand-patching to recreate the blkno bug is what went
wrong, so I'm attaching what I used. It consists of
v1-0002-Use-read-stream-in-pg_visibility-in-collect_corru.patch plus these
fixes for the "never detect" bug from your v6-0002:

--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -724,4 +724,4 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	{
-		bool		check_frozen = false;
-		bool		check_visible = false;
+		bool		check_frozen = all_frozen;
+		bool		check_visible = all_visible;
 		Buffer		buffer;
@@ -757,5 +757,5 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 		 */
-		if (all_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
+		if (check_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
 			check_frozen = false;
-		if (all_visible && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
+		if (check_visible && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
 			check_visible = false;

Attachments:

blkno-bug-v0.patchtext/plain; charset=us-asciiDownload

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 773ba92..ac575a1 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -41,6 +41,20 @@ typedef struct corrupt_items
 	ItemPointer tids;
 } corrupt_items;
 
+/*
+ * Helper struct for read stream object used in
+ * collect_corrupt_items() function.
+ */
+struct collect_corrupt_items_read_stream_private
+{
+	bool		all_frozen;
+	bool		all_visible;
+	BlockNumber blocknum;
+	BlockNumber nblocks;
+	Relation	rel;
+	Buffer	   *vmbuffer;
+};
+
 PG_FUNCTION_INFO_V1(pg_visibility_map);
 PG_FUNCTION_INFO_V1(pg_visibility_map_rel);
 PG_FUNCTION_INFO_V1(pg_visibility);
@@ -611,6 +625,35 @@ GetStrictOldestNonRemovableTransactionId(Relation rel)
 }
 
 /*
+ * Callback function to get next block for read stream object used in
+ * collect_corrupt_items() function.
+ */
+static BlockNumber
+collect_corrupt_items_read_stream_next_block(ReadStream *stream,
+											 void *callback_private_data,
+											 void *per_buffer_data)
+{
+	struct collect_corrupt_items_read_stream_private *p = callback_private_data;
+
+	for (; p->blocknum < p->nblocks; p->blocknum++)
+	{
+		bool		check_frozen = false;
+		bool		check_visible = false;
+
+		if (p->all_frozen && VM_ALL_FROZEN(p->rel, p->blocknum, p->vmbuffer))
+			check_frozen = true;
+		if (p->all_visible && VM_ALL_VISIBLE(p->rel, p->blocknum, p->vmbuffer))
+			check_visible = true;
+		if (!check_visible && !check_frozen)
+			continue;
+
+		return p->blocknum++;
+	}
+
+	return InvalidBlockNumber;
+}
+
+/*
  * Returns a list of items whose visibility map information does not match
  * the status of the tuples on the page.
  *
@@ -634,6 +677,10 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	Buffer		vmbuffer = InvalidBuffer;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 	TransactionId OldestXmin = InvalidTransactionId;
+	struct collect_corrupt_items_read_stream_private p;
+	ReadStream *stream;
+
+	Assert(all_visible || all_frozen);
 
 	rel = relation_open(relid, AccessShareLock);
 
@@ -658,11 +705,25 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	items->count = 64;
 	items->tids = palloc(items->count * sizeof(ItemPointerData));
 
+	p.blocknum = 0;
+	p.nblocks = nblocks;
+	p.rel = rel;
+	p.vmbuffer = &vmbuffer;
+	p.all_frozen = all_frozen;
+	p.all_visible = all_visible;
+	stream = read_stream_begin_relation(READ_STREAM_FULL,
+										bstrategy,
+										rel,
+										MAIN_FORKNUM,
+										collect_corrupt_items_read_stream_next_block,
+										&p,
+										0);
+
 	/* Loop over every block in the relation. */
 	for (blkno = 0; blkno < nblocks; ++blkno)
 	{
-		bool		check_frozen = false;
-		bool		check_visible = false;
+		bool		check_frozen = all_frozen;
+		bool		check_visible = all_visible;
 		Buffer		buffer;
 		Page		page;
 		OffsetNumber offnum,
@@ -671,17 +732,20 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 		/* Make sure we are interruptible. */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Use the visibility map to decide whether to check this page. */
-		if (all_frozen && VM_ALL_FROZEN(rel, blkno, &vmbuffer))
-			check_frozen = true;
-		if (all_visible && VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
-			check_visible = true;
-		if (!check_visible && !check_frozen)
-			continue;
-
 		/* Read and lock the page. */
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-									bstrategy);
+		buffer = read_stream_next_buffer(stream, NULL);
+
+		/*
+		 * If the read stream returns an InvalidBuffer, this means all the
+		 * blocks are processed. So, end the stream and loop.
+		 */
+		if (buffer == InvalidBuffer)
+		{
+			read_stream_end(stream);
+			stream = NULL;
+			break;
+		}
+
 		LockBuffer(buffer, BUFFER_LOCK_SHARE);
 
 		page = BufferGetPage(buffer);
@@ -778,6 +842,11 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 
 		UnlockReleaseBuffer(buffer);
 	}
+	if (stream)
+	{
+		Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+		read_stream_end(stream);
+	}
 
 	/* Clean up. */
 	if (vmbuffer != InvalidBuffer)

#18

Nazir Bilal Yavuz

byavuz81@gmail.com

over 1 year ago

In reply to: Noah Misch (#17)

2 attachment(s)

Re: Use read streams in pg_visibility

Hi,

On Tue, 10 Sept 2024 at 00:32, Noah Misch <noah@leadboat.com> wrote:

Copying the vm file is a good technique. In my test runs, this does catch the
"never detect" bug, but it doesn't catch the blkno bug. Can you make it catch
both? It's possible my hand-patching to recreate the blkno bug is what went
wrong, so I'm attaching what I used. It consists of
v1-0002-Use-read-stream-in-pg_visibility-in-collect_corru.patch plus these
fixes for the "never detect" bug from your v6-0002:

Your patch is correct. I wrongly assumed it would catch blockno bug,
the attached version catches it. I made blockno = 0 invisible and not
frozen before copying the vm file. So, in the blockno buggy version;
callback will skip that block but the main loop in the
collect_corrupt_items() will not skip it. I tested it with your patch
and there is exactly 1 blockno difference between expected and result
output.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v7-0001-Add-tests-that-pg_check_-visible-frozen-report-co.patchapplication/x-patch; name=v7-0001-Add-tests-that-pg_check_-visible-frozen-report-co.patchDownload

From 342bd8ede69942424b6d0568574db52cb574c479 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Tue, 10 Sep 2024 13:15:11 +0300
Subject: [PATCH v7 1/2] Add tests that pg_check_[visible|frozen] report
 corrupted tuples

Currently, there are no tests that pg_check_[visible|frozen] need to
report corrupted tuples. Add that kind of tests.

To add these tests, visibility map needs to be corrupted. It is done by
overwriting visibility map by its older state.
---
 contrib/pg_visibility/meson.build         |   1 +
 contrib/pg_visibility/t/002_corrupt_vm.pl | 107 ++++++++++++++++++++++
 2 files changed, 108 insertions(+)
 create mode 100644 contrib/pg_visibility/t/002_corrupt_vm.pl

diff --git a/contrib/pg_visibility/meson.build b/contrib/pg_visibility/meson.build
index f3c1263313a..609fc5f9f3e 100644
--- a/contrib/pg_visibility/meson.build
+++ b/contrib/pg_visibility/meson.build
@@ -36,6 +36,7 @@ tests += {
   'tap': {
     'tests': [
       't/001_concurrent_transaction.pl',
+      't/002_corrupt_vm.pl',
     ],
   },
 }
diff --git a/contrib/pg_visibility/t/002_corrupt_vm.pl b/contrib/pg_visibility/t/002_corrupt_vm.pl
new file mode 100644
index 00000000000..e57f365576f
--- /dev/null
+++ b/contrib/pg_visibility/t/002_corrupt_vm.pl
@@ -0,0 +1,107 @@
+# Copyright (c) 2021-2024, PostgreSQL Global Development Group
+
+# Check that pg_check_visible() and pg_check_frozen() reports
+# correct TIDs when there is a corruption.
+use strict;
+use warnings FATAL => 'all';
+use File::Copy;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->start;
+
+my $blck_size = $node->safe_psql("postgres", "SHOW block_size;");
+
+# Create a sample table with at least 10 pages and then run VACUUM. 10 is
+# selected manually as it is big enough to select 5 random tuples from the
+# relation.
+$node->safe_psql(
+	'postgres', qq(
+		CREATE EXTENSION pg_visibility;
+	    CREATE TABLE corruption_test
+            WITH (autovacuum_enabled = false) AS
+           SELECT
+				i,
+				repeat('a', 10) AS data
+			FROM
+    			generate_series(1, $blck_size) i;
+	    VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) corruption_test;
+    )
+);
+
+# VACUUM is run, it is safe to get the number of pages from the relpages.
+my $npages = $node->safe_psql(
+	"postgres",
+	"SELECT relpages FROM pg_class
+		WHERE  relname = 'corruption_test';"
+);
+ok($npages >= 10, "There should be at least 10 pages in the table");
+
+my $file = $node->safe_psql("postgres",
+	"SELECT pg_relation_filepath('corruption_test');");
+
+# Delete the first block to make sure that it will be skipped as it is
+# not visible nor frozen.
+$node->safe_psql(
+	"postgres",
+	"DELETE FROM corruption_test
+		WHERE (ctid::text::point)[0] = 0;"
+);
+
+# Copy visibility map.
+$node->stop;
+my $vm_file = $node->data_dir . '/' . $file . '_vm';
+copy("$vm_file", "${vm_file}_temp");
+$node->start;
+
+# Select 5 random tuples that are starting from the second block of the
+# relation. The first block is skipped because it is deleted above.
+my $tuples = $node->safe_psql(
+	"postgres",
+	"SELECT ctid FROM (
+        SELECT ctid FROM corruption_test
+			WHERE (ctid::text::point)[0] != 0
+            ORDER BY random() LIMIT 5)
+        ORDER BY ctid ASC;"
+);
+
+# Do the changes below to use tuples in the query.
+# "\n" -> ","
+# "(" -> "'("
+# ")" -> ")'"
+(my $tuples_query = $tuples) =~ s/\n/,/g;
+$tuples_query =~ s/\(/\'\(/g;
+$tuples_query =~ s/\)/\)\'/g;
+
+$node->safe_psql(
+	"postgres",
+	"DELETE FROM corruption_test
+		WHERE ctid in ($tuples_query);"
+);
+
+# Overwrite visibility map with the old one.
+$node->stop;
+move("${vm_file}_temp", "$vm_file");
+$node->start;
+
+my $result = $node->safe_psql(
+	"postgres",
+	"SELECT DISTINCT t_ctid
+		FROM pg_check_visible('corruption_test')
+		ORDER BY t_ctid ASC;"
+);
+is($result, $tuples, 'pg_check_visible must report tuples as corrupted');
+
+$result = $node->safe_psql(
+	"postgres",
+	"SELECT DISTINCT t_ctid
+		FROM pg_check_frozen('corruption_test')
+		ORDER BY t_ctid ASC;"
+);
+is($result, $tuples, 'pg_check_frozen must report tuples as corrupted');
+
+$node->stop;
+done_testing();
-- 
2.45.2

v7-0002-Optimize-pg_visibility-with-read-streams.patchapplication/x-patch; name=v7-0002-Optimize-pg_visibility-with-read-streams.patchDownload

From ec0a22440971e92a696a4c2f52d1eae7b1f62078 Mon Sep 17 00:00:00 2001
From: Noah Misch <noah@leadboat.com>
Date: Tue, 3 Sep 2024 10:46:20 -0700
Subject: [PATCH v7 2/2] Optimize pg_visibility with read streams.

We've measured 5% performance improvement, and this arranges to benefit
automatically from future optimizations to the read_stream subsystem.

Nazir Bilal Yavuz

Discussion: https://postgr.es/m/CAN55FZ1_Ru3XpMgTwsU67FTH2fs_FrRROmb7x6zs+F44QBEiww@mail.gmail.com
---
 contrib/pg_visibility/pg_visibility.c | 113 +++++++++++++++++++++-----
 1 file changed, 92 insertions(+), 21 deletions(-)

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 773ba92e454..724122b1bc5 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -21,6 +21,7 @@
 #include "storage/bufmgr.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
+#include "storage/read_stream.h"
 #include "storage/smgr.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
@@ -41,6 +42,17 @@ typedef struct corrupt_items
 	ItemPointer tids;
 } corrupt_items;
 
+/* for collect_corrupt_items_read_stream_next_block */
+struct collect_corrupt_items_read_stream_private
+{
+	bool		all_frozen;
+	bool		all_visible;
+	BlockNumber current_blocknum;
+	BlockNumber last_exclusive;
+	Relation	rel;
+	Buffer		vmbuffer;
+};
+
 PG_FUNCTION_INFO_V1(pg_visibility_map);
 PG_FUNCTION_INFO_V1(pg_visibility_map_rel);
 PG_FUNCTION_INFO_V1(pg_visibility);
@@ -478,6 +490,8 @@ collect_visibility_data(Oid relid, bool include_pd)
 	BlockNumber blkno;
 	Buffer		vmbuffer = InvalidBuffer;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+	BlockRangeReadStreamPrivate p;
+	ReadStream *stream = NULL;
 
 	rel = relation_open(relid, AccessShareLock);
 
@@ -489,6 +503,20 @@ collect_visibility_data(Oid relid, bool include_pd)
 	info->next = 0;
 	info->count = nblocks;
 
+	/* Create a stream if reading main fork. */
+	if (include_pd)
+	{
+		p.current_blocknum = 0;
+		p.last_exclusive = nblocks;
+		stream = read_stream_begin_relation(READ_STREAM_FULL,
+											bstrategy,
+											rel,
+											MAIN_FORKNUM,
+											block_range_read_stream_cb,
+											&p,
+											0);
+	}
+
 	for (blkno = 0; blkno < nblocks; ++blkno)
 	{
 		int32		mapbits;
@@ -513,8 +541,7 @@ collect_visibility_data(Oid relid, bool include_pd)
 			Buffer		buffer;
 			Page		page;
 
-			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-										bstrategy);
+			buffer = read_stream_next_buffer(stream, NULL);
 			LockBuffer(buffer, BUFFER_LOCK_SHARE);
 
 			page = BufferGetPage(buffer);
@@ -525,6 +552,12 @@ collect_visibility_data(Oid relid, bool include_pd)
 		}
 	}
 
+	if (include_pd)
+	{
+		Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+		read_stream_end(stream);
+	}
+
 	/* Clean up. */
 	if (vmbuffer != InvalidBuffer)
 		ReleaseBuffer(vmbuffer);
@@ -610,6 +643,38 @@ GetStrictOldestNonRemovableTransactionId(Relation rel)
 	}
 }
 
+/*
+ * Callback function to get next block for read stream object used in
+ * collect_corrupt_items() function.
+ */
+static BlockNumber
+collect_corrupt_items_read_stream_next_block(ReadStream *stream,
+											 void *callback_private_data,
+											 void *per_buffer_data)
+{
+	struct collect_corrupt_items_read_stream_private *p = callback_private_data;
+
+	for (; p->current_blocknum < p->last_exclusive; p->current_blocknum++)
+	{
+		bool		check_frozen = false;
+		bool		check_visible = false;
+
+		/* Make sure we are interruptible. */
+		CHECK_FOR_INTERRUPTS();
+
+		if (p->all_frozen && VM_ALL_FROZEN(p->rel, p->current_blocknum, &p->vmbuffer))
+			check_frozen = true;
+		if (p->all_visible && VM_ALL_VISIBLE(p->rel, p->current_blocknum, &p->vmbuffer))
+			check_visible = true;
+		if (!check_visible && !check_frozen)
+			continue;
+
+		return p->current_blocknum++;
+	}
+
+	return InvalidBlockNumber;
+}
+
 /*
  * Returns a list of items whose visibility map information does not match
  * the status of the tuples on the page.
@@ -628,12 +693,13 @@ static corrupt_items *
 collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 {
 	Relation	rel;
-	BlockNumber nblocks;
 	corrupt_items *items;
-	BlockNumber blkno;
 	Buffer		vmbuffer = InvalidBuffer;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 	TransactionId OldestXmin = InvalidTransactionId;
+	struct collect_corrupt_items_read_stream_private p;
+	ReadStream *stream;
+	Buffer		buffer;
 
 	rel = relation_open(relid, AccessShareLock);
 
@@ -643,8 +709,6 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	if (all_visible)
 		OldestXmin = GetStrictOldestNonRemovableTransactionId(rel);
 
-	nblocks = RelationGetNumberOfBlocks(rel);
-
 	/*
 	 * Guess an initial array size. We don't expect many corrupted tuples, so
 	 * start with a small array.  This function uses the "next" field to track
@@ -658,34 +722,38 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	items->count = 64;
 	items->tids = palloc(items->count * sizeof(ItemPointerData));
 
+	p.current_blocknum = 0;
+	p.last_exclusive = RelationGetNumberOfBlocks(rel);
+	p.rel = rel;
+	p.vmbuffer = InvalidBuffer;
+	p.all_frozen = all_frozen;
+	p.all_visible = all_visible;
+	stream = read_stream_begin_relation(READ_STREAM_FULL,
+										bstrategy,
+										rel,
+										MAIN_FORKNUM,
+										collect_corrupt_items_read_stream_next_block,
+										&p,
+										0);
+
 	/* Loop over every block in the relation. */
-	for (blkno = 0; blkno < nblocks; ++blkno)
+	while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
 	{
-		bool		check_frozen = false;
-		bool		check_visible = false;
-		Buffer		buffer;
+		bool		check_frozen = all_frozen;
+		bool		check_visible = all_visible;
 		Page		page;
 		OffsetNumber offnum,
 					maxoff;
+		BlockNumber blkno;
 
 		/* Make sure we are interruptible. */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Use the visibility map to decide whether to check this page. */
-		if (all_frozen && VM_ALL_FROZEN(rel, blkno, &vmbuffer))
-			check_frozen = true;
-		if (all_visible && VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
-			check_visible = true;
-		if (!check_visible && !check_frozen)
-			continue;
-
-		/* Read and lock the page. */
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-									bstrategy);
 		LockBuffer(buffer, BUFFER_LOCK_SHARE);
 
 		page = BufferGetPage(buffer);
 		maxoff = PageGetMaxOffsetNumber(page);
+		blkno = BufferGetBlockNumber(buffer);
 
 		/*
 		 * The visibility map bits might have changed while we were acquiring
@@ -778,10 +846,13 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 
 		UnlockReleaseBuffer(buffer);
 	}
+	read_stream_end(stream);
 
 	/* Clean up. */
 	if (vmbuffer != InvalidBuffer)
 		ReleaseBuffer(vmbuffer);
+	if (p.vmbuffer != InvalidBuffer)
+		ReleaseBuffer(p.vmbuffer);
 	relation_close(rel, AccessShareLock);
 
 	/*
-- 
2.45.2

#19

Noah Misch

noah@leadboat.com

over 1 year ago

In reply to: Nazir Bilal Yavuz (#18)

Re: Use read streams in pg_visibility

On Tue, Sep 10, 2024 at 02:35:46PM +0300, Nazir Bilal Yavuz wrote:

Your patch is correct. I wrongly assumed it would catch blockno bug,
the attached version catches it. I made blockno = 0 invisible and not
frozen before copying the vm file. So, in the blockno buggy version;
callback will skip that block but the main loop in the
collect_corrupt_items() will not skip it. I tested it with your patch
and there is exactly 1 blockno difference between expected and result
output.

Pushed. I added autovacuum=off so auto-analyze of a system catalog can't take
a snapshot that blocks VACUUM updating the vismap. I doubt that could happen
under default settings, but this lets us disregard the possibility entirely.

I also fixed the mix of tabs and spaces inside test file string literals.

#20

Nazir Bilal Yavuz

byavuz81@gmail.com

over 1 year ago

In reply to: Noah Misch (#19)

Re: Use read streams in pg_visibility

Hi,

On Wed, 11 Sept 2024 at 01:38, Noah Misch <noah@leadboat.com> wrote:

On Tue, Sep 10, 2024 at 02:35:46PM +0300, Nazir Bilal Yavuz wrote:

Your patch is correct. I wrongly assumed it would catch blockno bug,
the attached version catches it. I made blockno = 0 invisible and not
frozen before copying the vm file. So, in the blockno buggy version;
callback will skip that block but the main loop in the
collect_corrupt_items() will not skip it. I tested it with your patch
and there is exactly 1 blockno difference between expected and result
output.

Pushed. I added autovacuum=off so auto-analyze of a system catalog can't take
a snapshot that blocks VACUUM updating the vismap. I doubt that could happen
under default settings, but this lets us disregard the possibility entirely.

Thanks!

I also fixed the mix of tabs and spaces inside test file string literals.

I ran both pgindent and pgperltidy but they didn't catch them. Is
there an automated way to catch them?

--
Regards,
Nazir Bilal Yavuz
Microsoft

#21

Noah Misch

noah@leadboat.com

over 1 year ago

In reply to: Nazir Bilal Yavuz (#20)

Re: Use read streams in pg_visibility

On Wed, Sep 11, 2024 at 09:19:09AM +0300, Nazir Bilal Yavuz wrote:

On Wed, 11 Sept 2024 at 01:38, Noah Misch <noah@leadboat.com> wrote:

I also fixed the mix of tabs and spaces inside test file string literals.

I ran both pgindent and pgperltidy but they didn't catch them. Is
there an automated way to catch them?

git diff --check