Use read streams in pg_visibility

Started by Nazir Bilal Yavuzover 1 year ago21 messageshackers
Jump to latest
#1Nazir Bilal Yavuz
byavuz81@gmail.com

Hi,

I am working on using the read stream in pg_visibility. There are two
places to use it:

1- collect_visibility_data()

This one is straightforward. I created a read stream object if
'include_pd' is true because read I/O is done when 'include_pd' is
true. There is ~4% timing improvement with this change. I started the
server with the default settings and created a 6 GB table. Then run
100 times pg_visibility() by clearing the OS cache between each run.
----------

2- collect_corrupt_items()

This one is more complicated. The read stream callback function loops
until it finds a suitable block to read. So, if the callback returns
an InvalidBlockNumber; it means that the stream processed all possible
blocks and the stream should be finished. There is ~3% timing
improvement with this change. I started the server with the default
settings and created a 6 GB table. Then run 100 times
pg_check_visible() by clearing the OS cache between each run.

The downside of this approach is there are too many "vmbuffer is valid
but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so
vmbuffer needs to be read again" cases in the read stream version (700
vs 20 for the 6 GB table). This is caused by the callback function of
the read stream reading a new vmbuf while getting next block numbers.
So, vmbuf is wrong when we are checking visibility map bits that might
have changed while we were acquiring the page lock.
----------

Both patches are attached.

Any feedback would be appreciated.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v1-0001-Use-read-stream-in-pg_visibility-in-collect_visib.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Use-read-stream-in-pg_visibility-in-collect_visib.patchDownload+51-3
v1-0002-Use-read-stream-in-pg_visibility-in-collect_corru.patchtext/x-patch; charset=US-ASCII; name=v1-0002-Use-read-stream-in-pg_visibility-in-collect_corru.patchDownload+81-13
#2Michael Paquier
michael@paquier.xyz
In reply to: Nazir Bilal Yavuz (#1)
Re: Use read streams in pg_visibility

On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:

Hi,

I am working on using the read stream in pg_visibility. There are two
places to use it:

1- collect_visibility_data()

This one is straightforward. I created a read stream object if
'include_pd' is true because read I/O is done when 'include_pd' is
true. There is ~4% timing improvement with this change. I started the
server with the default settings and created a 6 GB table. Then run
100 times pg_visibility() by clearing the OS cache between each run.
----------

Mind sharing a script for reproducibility? Except for the drop_caches
part, of course..
--
Michael

#3Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Michael Paquier (#2)
Re: Use read streams in pg_visibility

Hi,

On Mon, 19 Aug 2024 at 09:30, Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:

Hi,

I am working on using the read stream in pg_visibility. There are two
places to use it:

1- collect_visibility_data()

This one is straightforward. I created a read stream object if
'include_pd' is true because read I/O is done when 'include_pd' is
true. There is ~4% timing improvement with this change. I started the
server with the default settings and created a 6 GB table. Then run
100 times pg_visibility() by clearing the OS cache between each run.
----------

Mind sharing a script for reproducibility? Except for the drop_caches
part, of course..

Sure, the script is attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

setup.shapplication/x-shellscript; name=setup.shDownload
#4Noah Misch
noah@leadboat.com
In reply to: Nazir Bilal Yavuz (#1)
Re: Use read streams in pg_visibility

On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:

2- collect_corrupt_items()

This one is more complicated. The read stream callback function loops
until it finds a suitable block to read. So, if the callback returns
an InvalidBlockNumber; it means that the stream processed all possible
blocks and the stream should be finished. There is ~3% timing
improvement with this change. I started the server with the default
settings and created a 6 GB table. Then run 100 times
pg_check_visible() by clearing the OS cache between each run.

The downside of this approach is there are too many "vmbuffer is valid
but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so
vmbuffer needs to be read again" cases in the read stream version (700
vs 20 for the 6 GB table). This is caused by the callback function of
the read stream reading a new vmbuf while getting next block numbers.
So, vmbuf is wrong when we are checking visibility map bits that might
have changed while we were acquiring the page lock.

Did the test that found 700 "read again" use a different procedure than the
one shared as setup.sh down-thread? Using that script alone, none of the vm
bits are set, so I'd not expect any heap page reads.

The "vmbuffer needs to be read again" regression fits what I would expect the
v1 patch to do with a table having many vm bits set. In general, I think the
fix is to keep two heap Buffer vars, so the callback can work on one vmbuffer
while collect_corrupt_items() works on another vmbuffer. Much of the time,
they'll be the same buffer. It could be as simple as that, but you could
consider further optimizations like these:

- Use ReadRecentBuffer() or a similar technique, to avoid a buffer mapping
lookup when the other Buffer var already has the block you want.

- [probably not worth it] Add APIs for pg_visibility.c to tell read_stream.c
to stop calling the ReadStreamBlockNumberCB for awhile. This could help if
nonzero vm bits are infrequent, causing us to visit few heap blocks per vm
block. For example, if one block out of every 33000 is all-visible, every
heap block we visit has a different vmbuffer. It's likely not optimal for
the callback to pin and unpin 20 vmbuffers, then have
collect_corrupt_items() pin and unpin the same 20 vmbuffers. pg_visibility
could notice this trend and request a stop of the callbacks until more of
the heap block work completes. If pg_visibility is going to be the only
place in the code with this use case, it's probably not worth carrying the
extra API just for pg_visibility. However, if we get a stronger use case
later, pg_visibility could be another beneficiary.

+/*
+ * Callback function to get next block for read stream object used in
+ * collect_visibility_data() function.
+ */
+static BlockNumber
+collect_visibility_data_read_stream_next_block(ReadStream *stream,
+											   void *callback_private_data,
+											   void *per_buffer_data)
+{
+	struct collect_visibility_data_read_stream_private *p = callback_private_data;
+
+	if (p->blocknum < p->nblocks)
+		return p->blocknum++;
+
+	return InvalidBlockNumber;

This is the third callback that just iterates over a block range, after
pg_prewarm_read_stream_next_block() and
copy_storage_using_buffer_read_stream_next_block(). While not a big problem,
I think it's time to have a general-use callback for block range scans. The
quantity of duplicate code is low, but the existing function names are long
and less informative than a behavior-based name.

+static BlockNumber
+collect_corrupt_items_read_stream_next_block(ReadStream *stream,
+											 void *callback_private_data,
+											 void *per_buffer_data)
+{
+	struct collect_corrupt_items_read_stream_private *p = callback_private_data;
+
+	for (; p->blocknum < p->nblocks; p->blocknum++)
+	{
+		bool		check_frozen = false;
+		bool		check_visible = false;
+
+		if (p->all_frozen && VM_ALL_FROZEN(p->rel, p->blocknum, p->vmbuffer))
+			check_frozen = true;
+		if (p->all_visible && VM_ALL_VISIBLE(p->rel, p->blocknum, p->vmbuffer))
+			check_visible = true;
+		if (!check_visible && !check_frozen)
+			continue;

If a vm has zero bits set, this loop will scan the entire vm before returning.
Hence, this loop deserves a CHECK_FOR_INTERRUPTS() or a comment about how
VM_ALL_FROZEN/VM_ALL_VISIBLE reaches a CHECK_FOR_INTERRUPTS().

@@ -687,6 +734,20 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
items->count = 64;
items->tids = palloc(items->count * sizeof(ItemPointerData));

+	p.blocknum = 0;
+	p.nblocks = nblocks;
+	p.rel = rel;
+	p.vmbuffer = &vmbuffer;
+	p.all_frozen = all_frozen;
+	p.all_visible = all_visible;
+	stream = read_stream_begin_relation(READ_STREAM_FULL,
+										bstrategy,
+										rel,
+										MAIN_FORKNUM,
+										collect_corrupt_items_read_stream_next_block,
+										&p,
+										0);
+
/* Loop over every block in the relation. */
for (blkno = 0; blkno < nblocks; ++blkno)

The callback doesn't return blocks having zero vm bits, so the blkno variable
is not accurate. I didn't test, but I think the loop's "Recheck to avoid
returning spurious results." looks at the bit for the wrong block. If that's
what v1 does, could you expand the test file to catch that? For example, make
a two-block table with only the second block all-visible.

Since the callback skips blocks, this function should use the
read_stream_reset() style of looping:

while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer) ...

Thanks,
nm

#5Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Noah Misch (#4)
Re: Use read streams in pg_visibility

Hi,

Thanks for the review and feedback!

On Tue, 20 Aug 2024 at 21:47, Noah Misch <noah@leadboat.com> wrote:

On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:

2- collect_corrupt_items()

This one is more complicated. The read stream callback function loops
until it finds a suitable block to read. So, if the callback returns
an InvalidBlockNumber; it means that the stream processed all possible
blocks and the stream should be finished. There is ~3% timing
improvement with this change. I started the server with the default
settings and created a 6 GB table. Then run 100 times
pg_check_visible() by clearing the OS cache between each run.

The downside of this approach is there are too many "vmbuffer is valid
but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so
vmbuffer needs to be read again" cases in the read stream version (700
vs 20 for the 6 GB table). This is caused by the callback function of
the read stream reading a new vmbuf while getting next block numbers.
So, vmbuf is wrong when we are checking visibility map bits that might
have changed while we were acquiring the page lock.

Did the test that found 700 "read again" use a different procedure than the
one shared as setup.sh down-thread? Using that script alone, none of the vm
bits are set, so I'd not expect any heap page reads.

Yes, it is basically the same script but the query is 'SELECT
pg_check_visible('${TABLE}'::regclass);'.

The "vmbuffer needs to be read again" regression fits what I would expect the
v1 patch to do with a table having many vm bits set. In general, I think the
fix is to keep two heap Buffer vars, so the callback can work on one vmbuffer
while collect_corrupt_items() works on another vmbuffer. Much of the time,
they'll be the same buffer. It could be as simple as that, but you could
consider further optimizations like these:

Thanks for the suggestion. Keeping two vmbuffers lowered the count of
"read-again" cases to ~40. I ran the test again and the timing
improvement is ~5% now.

- Use ReadRecentBuffer() or a similar technique, to avoid a buffer mapping
lookup when the other Buffer var already has the block you want.

- [probably not worth it] Add APIs for pg_visibility.c to tell read_stream.c
to stop calling the ReadStreamBlockNumberCB for awhile. This could help if
nonzero vm bits are infrequent, causing us to visit few heap blocks per vm
block. For example, if one block out of every 33000 is all-visible, every
heap block we visit has a different vmbuffer. It's likely not optimal for
the callback to pin and unpin 20 vmbuffers, then have
collect_corrupt_items() pin and unpin the same 20 vmbuffers. pg_visibility
could notice this trend and request a stop of the callbacks until more of
the heap block work completes. If pg_visibility is going to be the only
place in the code with this use case, it's probably not worth carrying the
extra API just for pg_visibility. However, if we get a stronger use case
later, pg_visibility could be another beneficiary.

I think the number of "read-again" cases are low enough now. I am
working on 'using ReadRecentBuffer() or a similar technique' but that
may take more time, so I attached patches without these optimizations.

+/*
+ * Callback function to get next block for read stream object used in
+ * collect_visibility_data() function.
+ */
+static BlockNumber
+collect_visibility_data_read_stream_next_block(ReadStream *stream,
+                                                                                        void *callback_private_data,
+                                                                                        void *per_buffer_data)
+{
+     struct collect_visibility_data_read_stream_private *p = callback_private_data;
+
+     if (p->blocknum < p->nblocks)
+             return p->blocknum++;
+
+     return InvalidBlockNumber;

This is the third callback that just iterates over a block range, after
pg_prewarm_read_stream_next_block() and
copy_storage_using_buffer_read_stream_next_block(). While not a big problem,
I think it's time to have a general-use callback for block range scans. The
quantity of duplicate code is low, but the existing function names are long
and less informative than a behavior-based name.

I agree. Does creating something like
pg_general_read_stream_next_block() in read_stream code and exporting
it makes sense?

+static BlockNumber
+collect_corrupt_items_read_stream_next_block(ReadStream *stream,
+                                                                                      void *callback_private_data,
+                                                                                      void *per_buffer_data)
+{
+     struct collect_corrupt_items_read_stream_private *p = callback_private_data;
+
+     for (; p->blocknum < p->nblocks; p->blocknum++)
+     {
+             bool            check_frozen = false;
+             bool            check_visible = false;
+
+             if (p->all_frozen && VM_ALL_FROZEN(p->rel, p->blocknum, p->vmbuffer))
+                     check_frozen = true;
+             if (p->all_visible && VM_ALL_VISIBLE(p->rel, p->blocknum, p->vmbuffer))
+                     check_visible = true;
+             if (!check_visible && !check_frozen)
+                     continue;

If a vm has zero bits set, this loop will scan the entire vm before returning.
Hence, this loop deserves a CHECK_FOR_INTERRUPTS() or a comment about how
VM_ALL_FROZEN/VM_ALL_VISIBLE reaches a CHECK_FOR_INTERRUPTS().

Done. I added CHECK_FOR_INTERRUPTS() to the loop in callback.

@@ -687,6 +734,20 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
items->count = 64;
items->tids = palloc(items->count * sizeof(ItemPointerData));

+     p.blocknum = 0;
+     p.nblocks = nblocks;
+     p.rel = rel;
+     p.vmbuffer = &vmbuffer;
+     p.all_frozen = all_frozen;
+     p.all_visible = all_visible;
+     stream = read_stream_begin_relation(READ_STREAM_FULL,
+                                                                             bstrategy,
+                                                                             rel,
+                                                                             MAIN_FORKNUM,
+                                                                             collect_corrupt_items_read_stream_next_block,
+                                                                             &p,
+                                                                             0);
+
/* Loop over every block in the relation. */
for (blkno = 0; blkno < nblocks; ++blkno)

The callback doesn't return blocks having zero vm bits, so the blkno variable
is not accurate. I didn't test, but I think the loop's "Recheck to avoid
returning spurious results." looks at the bit for the wrong block. If that's
what v1 does, could you expand the test file to catch that? For example, make
a two-block table with only the second block all-visible.

Yes, it was not accurate. I am getting blockno from the buffer now. I
checked and confirmed it is working as expected by manually logging
blocknos returned from the read stream. I am not sure how to add a
test case for this.

Since the callback skips blocks, this function should use the
read_stream_reset() style of looping:

Done.

There is one behavioral change introduced with the patches. It could
happen when collect_corrupt_items() is called with both all_visible
and all_frozen being true.
-> Let's say VM_ALL_FROZEN() returns true but VM_ALL_VISIBLE() returns
false in callback. Callback returns this block number because
VM_ALL_FROZEN() is true.
-> At the /* Recheck to avoid returning spurious results. */ part, we
should only check VM_ALL_FROZEN() again as this was returning true in
the callback. But we check both VM_ALL_FROZEN() and VM_ALL_VISIBLE().
VM_ALL_FROZEN() returns false and VM_ALL_VISIBLE() returns true now
(vice versa of callback).
-> We were skipping this block in the master but the patched version
does not skip that.

Is this a problem? If this is a problem, a solution might be to
callback return values of VM_ALL_FROZEN() and VM_ALL_VISIBLE() in the
per_buffer_data.

v2 of the patches are attached, only 0002 is updated.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v2-0001-Use-read-stream-in-pg_visibility-in-collect_visib.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Use-read-stream-in-pg_visibility-in-collect_visib.patchDownload+51-3
v2-0002-Use-read-stream-in-pg_visibility-in-collect_corru.patchtext/x-patch; charset=US-ASCII; name=v2-0002-Use-read-stream-in-pg_visibility-in-collect_corru.patchDownload+72-16
#6Noah Misch
noah@leadboat.com
In reply to: Nazir Bilal Yavuz (#5)
Re: Use read streams in pg_visibility

On Fri, Aug 23, 2024 at 02:20:06PM +0300, Nazir Bilal Yavuz wrote:

On Tue, 20 Aug 2024 at 21:47, Noah Misch <noah@leadboat.com> wrote:

On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:

The downside of this approach is there are too many "vmbuffer is valid
but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so
vmbuffer needs to be read again" cases in the read stream version (700
vs 20 for the 6 GB table). This is caused by the callback function of
the read stream reading a new vmbuf while getting next block numbers.
So, vmbuf is wrong when we are checking visibility map bits that might
have changed while we were acquiring the page lock.

Did the test that found 700 "read again" use a different procedure than the
one shared as setup.sh down-thread? Using that script alone, none of the vm
bits are set, so I'd not expect any heap page reads.

Yes, it is basically the same script but the query is 'SELECT
pg_check_visible('${TABLE}'::regclass);'.

I gather the scripts deal exclusively in tables with no vm bits set, so
pg_visibility did no heap page reads under those scripts. But the '700 "read
again"' result suggests either I'm guessing wrong, or that result came from a
different test procedure. Thoughts?

The "vmbuffer needs to be read again" regression fits what I would expect the
v1 patch to do with a table having many vm bits set. In general, I think the
fix is to keep two heap Buffer vars, so the callback can work on one vmbuffer
while collect_corrupt_items() works on another vmbuffer. Much of the time,
they'll be the same buffer. It could be as simple as that, but you could
consider further optimizations like these:

Thanks for the suggestion. Keeping two vmbuffers lowered the count of
"read-again" cases to ~40. I ran the test again and the timing
improvement is ~5% now.

I think the number of "read-again" cases are low enough now.

Agreed. The increase from 20 to 40 is probably an increase in buffer mapping
lookups, not an increase in I/O.

Does creating something like
pg_general_read_stream_next_block() in read_stream code and exporting
it makes sense?

To me, that name isn't conveying the function's behavior, and the words "pg_"
and "general_" aren't adding information there. How about one of these names
or similar:

seq_read_stream_cb
sequential_read_stream_cb
block_range_read_stream_cb

The callback doesn't return blocks having zero vm bits, so the blkno variable
is not accurate. I didn't test, but I think the loop's "Recheck to avoid
returning spurious results." looks at the bit for the wrong block. If that's
what v1 does, could you expand the test file to catch that? For example, make
a two-block table with only the second block all-visible.

Yes, it was not accurate. I am getting blockno from the buffer now. I
checked and confirmed it is working as expected by manually logging
blocknos returned from the read stream. I am not sure how to add a
test case for this.

VACUUM FREEZE makes an all-visible, all-frozen table. DELETE of a particular
TID, even if rolled back, clears both vm bits for the TID's page. Past tests
like that had instability problems. One cause is a concurrent session's XID
or snapshot, which can prevent VACUUM setting vm bits. Using a TEMP table may
have been one of the countermeasures, but I don't remember clearly. Hence,
please search the archives or the existing pg_visibility tests for how we
dealt with that. It may not be problem for this particular test.

There is one behavioral change introduced with the patches. It could
happen when collect_corrupt_items() is called with both all_visible
and all_frozen being true.
-> Let's say VM_ALL_FROZEN() returns true but VM_ALL_VISIBLE() returns
false in callback. Callback returns this block number because
VM_ALL_FROZEN() is true.
-> At the /* Recheck to avoid returning spurious results. */ part, we
should only check VM_ALL_FROZEN() again as this was returning true in
the callback. But we check both VM_ALL_FROZEN() and VM_ALL_VISIBLE().
VM_ALL_FROZEN() returns false and VM_ALL_VISIBLE() returns true now
(vice versa of callback).
-> We were skipping this block in the master but the patched version
does not skip that.

Is this a problem? If this is a problem, a solution might be to
callback return values of VM_ALL_FROZEN() and VM_ALL_VISIBLE() in the
per_buffer_data.

No, I don't consider that a problem. That's not making us do additional I/O,
just testing more bits within the pages we're already loading. The difference
in behavior is user-visible, but it's the same behavior change the user would
get if the timing had been a bit different.

#7Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Noah Misch (#6)
Re: Use read streams in pg_visibility

Hi,

On Fri, 23 Aug 2024 at 22:01, Noah Misch <noah@leadboat.com> wrote:

On Fri, Aug 23, 2024 at 02:20:06PM +0300, Nazir Bilal Yavuz wrote:

On Tue, 20 Aug 2024 at 21:47, Noah Misch <noah@leadboat.com> wrote:

On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:

The downside of this approach is there are too many "vmbuffer is valid
but BufferGetBlockNumber(*vmbuffer) is not equal to mapBlock, so
vmbuffer needs to be read again" cases in the read stream version (700
vs 20 for the 6 GB table). This is caused by the callback function of
the read stream reading a new vmbuf while getting next block numbers.
So, vmbuf is wrong when we are checking visibility map bits that might
have changed while we were acquiring the page lock.

Did the test that found 700 "read again" use a different procedure than the
one shared as setup.sh down-thread? Using that script alone, none of the vm
bits are set, so I'd not expect any heap page reads.

Yes, it is basically the same script but the query is 'SELECT
pg_check_visible('${TABLE}'::regclass);'.

I gather the scripts deal exclusively in tables with no vm bits set, so
pg_visibility did no heap page reads under those scripts. But the '700 "read
again"' result suggests either I'm guessing wrong, or that result came from a
different test procedure. Thoughts?

Sorry, yes. You need to run VACUUM on the test table before running
the query. The script is attached. You can run the script with
[corrupt | visibility] arguments to test the related patches.

The "vmbuffer needs to be read again" regression fits what I would expect the
v1 patch to do with a table having many vm bits set. In general, I think the
fix is to keep two heap Buffer vars, so the callback can work on one vmbuffer
while collect_corrupt_items() works on another vmbuffer. Much of the time,
they'll be the same buffer. It could be as simple as that, but you could
consider further optimizations like these:

Thanks for the suggestion. Keeping two vmbuffers lowered the count of
"read-again" cases to ~40. I ran the test again and the timing
improvement is ~5% now.

I think the number of "read-again" cases are low enough now.

Agreed. The increase from 20 to 40 is probably an increase in buffer mapping
lookups, not an increase in I/O.

Does creating something like
pg_general_read_stream_next_block() in read_stream code and exporting
it makes sense?

To me, that name isn't conveying the function's behavior, and the words "pg_"
and "general_" aren't adding information there. How about one of these names
or similar:

seq_read_stream_cb
sequential_read_stream_cb
block_range_read_stream_cb

I liked the block_range_read_stream_cb. Attached patches for that
(first 3 patches). I chose an nblocks way instead of last_blocks in
the struct.

The callback doesn't return blocks having zero vm bits, so the blkno variable
is not accurate. I didn't test, but I think the loop's "Recheck to avoid
returning spurious results." looks at the bit for the wrong block. If that's
what v1 does, could you expand the test file to catch that? For example, make
a two-block table with only the second block all-visible.

Yes, it was not accurate. I am getting blockno from the buffer now. I
checked and confirmed it is working as expected by manually logging
blocknos returned from the read stream. I am not sure how to add a
test case for this.

VACUUM FREEZE makes an all-visible, all-frozen table. DELETE of a particular
TID, even if rolled back, clears both vm bits for the TID's page. Past tests
like that had instability problems. One cause is a concurrent session's XID
or snapshot, which can prevent VACUUM setting vm bits. Using a TEMP table may
have been one of the countermeasures, but I don't remember clearly. Hence,
please search the archives or the existing pg_visibility tests for how we
dealt with that. It may not be problem for this particular test.

Thanks for the information, I will check these. What I still do not
understand is how to make sure that only the second block is processed
and the first one is skipped. pg_check_visible() and pg_check_frozen()
returns TIDs that cause corruption in the visibility map, there is no
information about block numbers.

There is one behavioral change introduced with the patches. It could
happen when collect_corrupt_items() is called with both all_visible
and all_frozen being true.
-> Let's say VM_ALL_FROZEN() returns true but VM_ALL_VISIBLE() returns
false in callback. Callback returns this block number because
VM_ALL_FROZEN() is true.
-> At the /* Recheck to avoid returning spurious results. */ part, we
should only check VM_ALL_FROZEN() again as this was returning true in
the callback. But we check both VM_ALL_FROZEN() and VM_ALL_VISIBLE().
VM_ALL_FROZEN() returns false and VM_ALL_VISIBLE() returns true now
(vice versa of callback).
-> We were skipping this block in the master but the patched version
does not skip that.

Is this a problem? If this is a problem, a solution might be to
callback return values of VM_ALL_FROZEN() and VM_ALL_VISIBLE() in the
per_buffer_data.

No, I don't consider that a problem. That's not making us do additional I/O,
just testing more bits within the pages we're already loading. The difference
in behavior is user-visible, but it's the same behavior change the user would
get if the timing had been a bit different.

Thanks for the clarification.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v3-0001-Add-general-use-struct-and-callback-to-read-strea.patchapplication/x-patch; name=v3-0001-Add-general-use-struct-and-callback-to-read-strea.patchDownload+30-1
v3-0002-pg_prewarm-Use-generic-use-read-stream-struct-and.patchapplication/x-patch; name=v3-0002-pg_prewarm-Use-generic-use-read-stream-struct-and.patchDownload+3-23
v3-0003-RelationCopyStorageUsingBuffer-Use-generic-use-re.patchapplication/x-patch; name=v3-0003-RelationCopyStorageUsingBuffer-Use-generic-use-re.patchDownload+2-30
v3-0004-Use-read-stream-in-pg_visibility-in-collect_visib.patchapplication/x-patch; name=v3-0004-Use-read-stream-in-pg_visibility-in-collect_visib.patchDownload+24-3
v3-0005-Use-read-stream-in-pg_visibility-in-collect_corru.patchapplication/x-patch; name=v3-0005-Use-read-stream-in-pg_visibility-in-collect_corru.patchDownload+72-16
setup.shapplication/x-sh; name=setup.shDownload
#8Noah Misch
noah@leadboat.com
In reply to: Nazir Bilal Yavuz (#7)
Re: Use read streams in pg_visibility

On Tue, Aug 27, 2024 at 10:49:19AM +0300, Nazir Bilal Yavuz wrote:

On Fri, 23 Aug 2024 at 22:01, Noah Misch <noah@leadboat.com> wrote:

On Fri, Aug 23, 2024 at 02:20:06PM +0300, Nazir Bilal Yavuz wrote:

On Tue, 20 Aug 2024 at 21:47, Noah Misch <noah@leadboat.com> wrote:

On Tue, Aug 13, 2024 at 03:22:27PM +0300, Nazir Bilal Yavuz wrote:

I liked the block_range_read_stream_cb. Attached patches for that
(first 3 patches). I chose an nblocks way instead of last_blocks in
the struct.

To read blocks 10 and 11, I would expect to initialize the struct with one of:

{ .first=10, .nblocks=2 }
{ .first=10, .last_inclusive=11 }
{ .first=10, .last_exclusive=12 }

With the patch's API, I would need {.first=10,.nblocks=12}. The struct field
named "nblocks" behaves like a last_block_exclusive. Please either make the
behavior an "nblocks" behavior or change the field name to replace the term
"nblocks" with something matching the behavior. (I used longer field names in
my examples here, to disambiguate those examples. It's okay if the final
field names aren't those, as long as the field names and the behavior align.)

The callback doesn't return blocks having zero vm bits, so the blkno variable
is not accurate. I didn't test, but I think the loop's "Recheck to avoid
returning spurious results." looks at the bit for the wrong block. If that's
what v1 does, could you expand the test file to catch that? For example, make
a two-block table with only the second block all-visible.

Yes, it was not accurate. I am getting blockno from the buffer now. I
checked and confirmed it is working as expected by manually logging
blocknos returned from the read stream. I am not sure how to add a
test case for this.

VACUUM FREEZE makes an all-visible, all-frozen table. DELETE of a particular
TID, even if rolled back, clears both vm bits for the TID's page. Past tests
like that had instability problems. One cause is a concurrent session's XID
or snapshot, which can prevent VACUUM setting vm bits. Using a TEMP table may
have been one of the countermeasures, but I don't remember clearly. Hence,
please search the archives or the existing pg_visibility tests for how we
dealt with that. It may not be problem for this particular test.

Thanks for the information, I will check these. What I still do not
understand is how to make sure that only the second block is processed
and the first one is skipped. pg_check_visible() and pg_check_frozen()
returns TIDs that cause corruption in the visibility map, there is no
information about block numbers.

I see what you're saying. collect_corrupt_items() needs a corrupt table to
report anything; all corruption-free tables get the same output. Testing this
would need extra C code or techniques like corrupt_page_checksum() to create
the corrupt state. That wouldn't be a bad thing to have, but it's big enough
that I'll consider it out of scope for $SUBJECT. With the callback change
above, I'll be ready to push all this.

#9Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Noah Misch (#8)
Re: Use read streams in pg_visibility

Hi,

On Sat, 31 Aug 2024 at 02:51, Noah Misch <noah@leadboat.com> wrote:

To read blocks 10 and 11, I would expect to initialize the struct with one of:

{ .first=10, .nblocks=2 }
{ .first=10, .last_inclusive=11 }
{ .first=10, .last_exclusive=12 }

With the patch's API, I would need {.first=10,.nblocks=12}. The struct field
named "nblocks" behaves like a last_block_exclusive. Please either make the
behavior an "nblocks" behavior or change the field name to replace the term
"nblocks" with something matching the behavior. (I used longer field names in
my examples here, to disambiguate those examples. It's okay if the final
field names aren't those, as long as the field names and the behavior align.)

I decided to use 'current_blocknum' and 'last_exclusive'. I think
these are easier to understand and use.

Thanks for the information, I will check these. What I still do not
understand is how to make sure that only the second block is processed
and the first one is skipped. pg_check_visible() and pg_check_frozen()
returns TIDs that cause corruption in the visibility map, there is no
information about block numbers.

I see what you're saying. collect_corrupt_items() needs a corrupt table to
report anything; all corruption-free tables get the same output. Testing this
would need extra C code or techniques like corrupt_page_checksum() to create
the corrupt state. That wouldn't be a bad thing to have, but it's big enough
that I'll consider it out of scope for $SUBJECT. With the callback change
above, I'll be ready to push all this.

Thanks, updated patches are attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v4-0001-Add-general-use-struct-and-callback-to-read-strea.patchapplication/x-patch; name=v4-0001-Add-general-use-struct-and-callback-to-read-strea.patchDownload+33-3
v4-0002-pg_prewarm-Use-generic-use-read-stream-struct-and.patchapplication/x-patch; name=v4-0002-pg_prewarm-Use-generic-use-read-stream-struct-and.patchDownload+4-24
v4-0003-RelationCopyStorageUsingBuffer-Use-generic-use-re.patchapplication/x-patch; name=v4-0003-RelationCopyStorageUsingBuffer-Use-generic-use-re.patchDownload+4-32
v4-0004-Use-read-stream-in-pg_visibility-in-collect_visib.patchapplication/x-patch; name=v4-0004-Use-read-stream-in-pg_visibility-in-collect_visib.patchDownload+24-3
v4-0005-Use-read-stream-in-pg_visibility-in-collect_corru.patchapplication/x-patch; name=v4-0005-Use-read-stream-in-pg_visibility-in-collect_corru.patchDownload+72-16
#10Noah Misch
noah@leadboat.com
In reply to: Nazir Bilal Yavuz (#9)
Re: Use read streams in pg_visibility

On Mon, Sep 02, 2024 at 03:20:12PM +0300, Nazir Bilal Yavuz wrote:

Thanks, updated patches are attached.

+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow read_stream_unget_block() to work and to allow the
+ * fast path to skip this function and work directly from the array.
*/
static inline BlockNumber
read_stream_get_block(ReadStream *stream, void *per_buffer_data)

v4-0001-Add-general-use-struct-and-callback-to-read-strea.patch introduced
this update to the read_stream_get_block() header comment, but we're not
changing read_stream_get_block() here. I reverted this.

Pushed with some other cosmetic changes.

#11Noah Misch
noah@leadboat.com
In reply to: Noah Misch (#10)
Re: Use read streams in pg_visibility

On Tue, Sep 03, 2024 at 10:50:11AM -0700, Noah Misch wrote:

On Mon, Sep 02, 2024 at 03:20:12PM +0300, Nazir Bilal Yavuz wrote:

Thanks, updated patches are attached.

+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow read_stream_unget_block() to work and to allow the
+ * fast path to skip this function and work directly from the array.
*/
static inline BlockNumber
read_stream_get_block(ReadStream *stream, void *per_buffer_data)

v4-0001-Add-general-use-struct-and-callback-to-read-strea.patch introduced
this update to the read_stream_get_block() header comment, but we're not
changing read_stream_get_block() here. I reverted this.

Pushed with some other cosmetic changes.

I see I pushed something unacceptable under ASAN. I will look into that:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&amp;dt=2024-09-03%2017%3A47%3A20

#12Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Noah Misch (#11)
Re: Use read streams in pg_visibility

Hi,

On Tue, 3 Sept 2024 at 22:20, Noah Misch <noah@leadboat.com> wrote:

On Tue, Sep 03, 2024 at 10:50:11AM -0700, Noah Misch wrote:

On Mon, Sep 02, 2024 at 03:20:12PM +0300, Nazir Bilal Yavuz wrote:

Thanks, updated patches are attached.

+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow read_stream_unget_block() to work and to allow the
+ * fast path to skip this function and work directly from the array.
*/
static inline BlockNumber
read_stream_get_block(ReadStream *stream, void *per_buffer_data)

v4-0001-Add-general-use-struct-and-callback-to-read-strea.patch introduced
this update to the read_stream_get_block() header comment, but we're not
changing read_stream_get_block() here. I reverted this.

Sorry, it should be left from rebase. Thanks for reverting it.

Pushed with some other cosmetic changes.

Thanks!

I see I pushed something unacceptable under ASAN. I will look into that:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&amp;dt=2024-09-03%2017%3A47%3A20

I think it is related to the scope of BlockRangeReadStreamPrivate in
the collect_visibility_data() function. Attached a small fix for that.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v5-0001-Fix-ASAN-error-introduced-in-ed1b1ee59f.patchtext/x-patch; charset=US-ASCII; name=v5-0001-Fix-ASAN-error-introduced-in-ed1b1ee59f.patchDownload+1-3
#13Noah Misch
noah@leadboat.com
In reply to: Nazir Bilal Yavuz (#12)
Re: Use read streams in pg_visibility

On Tue, Sep 03, 2024 at 10:46:34PM +0300, Nazir Bilal Yavuz wrote:

On Tue, 3 Sept 2024 at 22:20, Noah Misch <noah@leadboat.com> wrote:

On Tue, Sep 03, 2024 at 10:50:11AM -0700, Noah Misch wrote:

Pushed with some other cosmetic changes.

Thanks!

I see I pushed something unacceptable under ASAN. I will look into that:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&amp;dt=2024-09-03%2017%3A47%3A20

I think it is related to the scope of BlockRangeReadStreamPrivate in
the collect_visibility_data() function. Attached a small fix for that.

Right.

/messages/by-id/CAEudQAozv3wTY5TV2t29JcwPydbmKbiWQkZD42S2OgzdixPMDQ@mail.gmail.com
then observed that collect_corrupt_items() was now guaranteed to never detect
corruption. I have pushed revert ddfc556 of the pg_visibility.c changes. For
the next try, could you add that testing we discussed?

#14Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Noah Misch (#13)
Re: Use read streams in pg_visibility

Hi,

On Wed, 4 Sept 2024 at 21:43, Noah Misch <noah@leadboat.com> wrote:

/messages/by-id/CAEudQAozv3wTY5TV2t29JcwPydbmKbiWQkZD42S2OgzdixPMDQ@mail.gmail.com
then observed that collect_corrupt_items() was now guaranteed to never detect
corruption. I have pushed revert ddfc556 of the pg_visibility.c changes. For
the next try, could you add that testing we discussed?

Do you think that corrupting the visibility map and then seeing if
pg_check_visible() and pg_check_frozen() report something is enough?

--
Regards,
Nazir Bilal Yavuz
Microsoft

#15Noah Misch
noah@leadboat.com
In reply to: Nazir Bilal Yavuz (#14)
Re: Use read streams in pg_visibility

On Thu, Sep 05, 2024 at 03:59:53PM +0300, Nazir Bilal Yavuz wrote:

On Wed, 4 Sept 2024 at 21:43, Noah Misch <noah@leadboat.com> wrote:

/messages/by-id/CAEudQAozv3wTY5TV2t29JcwPydbmKbiWQkZD42S2OgzdixPMDQ@mail.gmail.com
then observed that collect_corrupt_items() was now guaranteed to never detect
corruption. I have pushed revert ddfc556 of the pg_visibility.c changes. For
the next try, could you add that testing we discussed?

Do you think that corrupting the visibility map and then seeing if
pg_check_visible() and pg_check_frozen() report something is enough?

I think so. Please check that it would have caught both the blkno bug and the
above bug.

#16Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Noah Misch (#15)
Re: Use read streams in pg_visibility

Hi,

On Thu, 5 Sept 2024 at 18:54, Noah Misch <noah@leadboat.com> wrote:

On Thu, Sep 05, 2024 at 03:59:53PM +0300, Nazir Bilal Yavuz wrote:

On Wed, 4 Sept 2024 at 21:43, Noah Misch <noah@leadboat.com> wrote:

/messages/by-id/CAEudQAozv3wTY5TV2t29JcwPydbmKbiWQkZD42S2OgzdixPMDQ@mail.gmail.com
then observed that collect_corrupt_items() was now guaranteed to never detect
corruption. I have pushed revert ddfc556 of the pg_visibility.c changes. For
the next try, could you add that testing we discussed?

Do you think that corrupting the visibility map and then seeing if
pg_check_visible() and pg_check_frozen() report something is enough?

I think so. Please check that it would have caught both the blkno bug and the
above bug.

The test and updated patch files are attached. In that test I
overwrite the visibility map file with its older state. Something like
this:

1- Create the table, then run VACUUM FREEZE on the table.
2- Shutdown the server, create a copy of the vm file, restart the server.
3- Run the DELETE command on some $random_tuples.
4- Shutdown the server, overwrite the vm file with the #2 vm file,
restart the server.
5- pg_check_visible and pg_check_frozen must report $random_tuples as corrupted.

Do you think this test makes sense and enough?

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v6-0001-Add-tests-that-pg_check_-visible-frozen-report-co.patchtext/x-patch; charset=US-ASCII; name=v6-0001-Add-tests-that-pg_check_-visible-frozen-report-co.patchDownload+101-1
v6-0002-Optimize-pg_visibility-with-read-streams.patchtext/x-patch; charset=US-ASCII; name=v6-0002-Optimize-pg_visibility-with-read-streams.patchDownload+92-22
#17Noah Misch
noah@leadboat.com
In reply to: Nazir Bilal Yavuz (#16)
Re: Use read streams in pg_visibility

On Mon, Sep 09, 2024 at 06:25:07PM +0300, Nazir Bilal Yavuz wrote:

On Thu, 5 Sept 2024 at 18:54, Noah Misch <noah@leadboat.com> wrote:

On Thu, Sep 05, 2024 at 03:59:53PM +0300, Nazir Bilal Yavuz wrote:

On Wed, 4 Sept 2024 at 21:43, Noah Misch <noah@leadboat.com> wrote:

/messages/by-id/CAEudQAozv3wTY5TV2t29JcwPydbmKbiWQkZD42S2OgzdixPMDQ@mail.gmail.com
then observed that collect_corrupt_items() was now guaranteed to never detect
corruption. I have pushed revert ddfc556 of the pg_visibility.c changes. For
the next try, could you add that testing we discussed?

Do you think that corrupting the visibility map and then seeing if
pg_check_visible() and pg_check_frozen() report something is enough?

I think so. Please check that it would have caught both the blkno bug and the
above bug.

The test and updated patch files are attached. In that test I
overwrite the visibility map file with its older state. Something like
this:

1- Create the table, then run VACUUM FREEZE on the table.
2- Shutdown the server, create a copy of the vm file, restart the server.
3- Run the DELETE command on some $random_tuples.
4- Shutdown the server, overwrite the vm file with the #2 vm file,
restart the server.
5- pg_check_visible and pg_check_frozen must report $random_tuples as corrupted.

Copying the vm file is a good technique. In my test runs, this does catch the
"never detect" bug, but it doesn't catch the blkno bug. Can you make it catch
both? It's possible my hand-patching to recreate the blkno bug is what went
wrong, so I'm attaching what I used. It consists of
v1-0002-Use-read-stream-in-pg_visibility-in-collect_corru.patch plus these
fixes for the "never detect" bug from your v6-0002:

--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -724,4 +724,4 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	{
-		bool		check_frozen = false;
-		bool		check_visible = false;
+		bool		check_frozen = all_frozen;
+		bool		check_visible = all_visible;
 		Buffer		buffer;
@@ -757,5 +757,5 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 		 */
-		if (all_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
+		if (check_frozen && !VM_ALL_FROZEN(rel, blkno, &vmbuffer))
 			check_frozen = false;
-		if (all_visible && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
+		if (check_visible && !VM_ALL_VISIBLE(rel, blkno, &vmbuffer))
 			check_visible = false;

Attachments:

blkno-bug-v0.patchtext/plain; charset=us-asciiDownload+81-12
#18Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Noah Misch (#17)
Re: Use read streams in pg_visibility

Hi,

On Tue, 10 Sept 2024 at 00:32, Noah Misch <noah@leadboat.com> wrote:

Copying the vm file is a good technique. In my test runs, this does catch the
"never detect" bug, but it doesn't catch the blkno bug. Can you make it catch
both? It's possible my hand-patching to recreate the blkno bug is what went
wrong, so I'm attaching what I used. It consists of
v1-0002-Use-read-stream-in-pg_visibility-in-collect_corru.patch plus these
fixes for the "never detect" bug from your v6-0002:

Your patch is correct. I wrongly assumed it would catch blockno bug,
the attached version catches it. I made blockno = 0 invisible and not
frozen before copying the vm file. So, in the blockno buggy version;
callback will skip that block but the main loop in the
collect_corrupt_items() will not skip it. I tested it with your patch
and there is exactly 1 blockno difference between expected and result
output.

--
Regards,
Nazir Bilal Yavuz
Microsoft

Attachments:

v7-0001-Add-tests-that-pg_check_-visible-frozen-report-co.patchapplication/x-patch; name=v7-0001-Add-tests-that-pg_check_-visible-frozen-report-co.patchDownload+108-1
v7-0002-Optimize-pg_visibility-with-read-streams.patchapplication/x-patch; name=v7-0002-Optimize-pg_visibility-with-read-streams.patchDownload+92-22
#19Noah Misch
noah@leadboat.com
In reply to: Nazir Bilal Yavuz (#18)
Re: Use read streams in pg_visibility

On Tue, Sep 10, 2024 at 02:35:46PM +0300, Nazir Bilal Yavuz wrote:

Your patch is correct. I wrongly assumed it would catch blockno bug,
the attached version catches it. I made blockno = 0 invisible and not
frozen before copying the vm file. So, in the blockno buggy version;
callback will skip that block but the main loop in the
collect_corrupt_items() will not skip it. I tested it with your patch
and there is exactly 1 blockno difference between expected and result
output.

Pushed. I added autovacuum=off so auto-analyze of a system catalog can't take
a snapshot that blocks VACUUM updating the vismap. I doubt that could happen
under default settings, but this lets us disregard the possibility entirely.

I also fixed the mix of tabs and spaces inside test file string literals.

#20Nazir Bilal Yavuz
byavuz81@gmail.com
In reply to: Noah Misch (#19)
Re: Use read streams in pg_visibility

Hi,

On Wed, 11 Sept 2024 at 01:38, Noah Misch <noah@leadboat.com> wrote:

On Tue, Sep 10, 2024 at 02:35:46PM +0300, Nazir Bilal Yavuz wrote:

Your patch is correct. I wrongly assumed it would catch blockno bug,
the attached version catches it. I made blockno = 0 invisible and not
frozen before copying the vm file. So, in the blockno buggy version;
callback will skip that block but the main loop in the
collect_corrupt_items() will not skip it. I tested it with your patch
and there is exactly 1 blockno difference between expected and result
output.

Pushed. I added autovacuum=off so auto-analyze of a system catalog can't take
a snapshot that blocks VACUUM updating the vismap. I doubt that could happen
under default settings, but this lets us disregard the possibility entirely.

Thanks!

I also fixed the mix of tabs and spaces inside test file string literals.

I ran both pgindent and pgperltidy but they didn't catch them. Is
there an automated way to catch them?

--
Regards,
Nazir Bilal Yavuz
Microsoft

#21Noah Misch
noah@leadboat.com
In reply to: Nazir Bilal Yavuz (#20)