Using read_stream in index vacuum
Hi hackers!
At a recent hacking workshop [0], Thomas mentioned that patches using the new read stream API would be welcome.
So I prototyped streamlining of B-tree vacuum for discussion.
When cleaning an index we must visit every index tuple, so we uphold a special invariant:
After checking a trailing block, it must still be the last block according to a subsequent RelationGetNumberOfBlocks(rel) call.
This invariant does not allow us to completely replace the block loop with streamlining. That's why streamlining is done only for the number of blocks returned by the first RelationGetNumberOfBlocks(rel) call. The tail is processed with regular ReadBufferExtended().
Also, it's worth mentioning that we have to jump back to the left siblings of recently split pages. We also do that with regular ReadBufferExtended(). That's why the signature of btvacuumpage() now accepts a buffer, not a block number.
I've benchmarked the patch on my laptop (MacBook Air M3) with the following workload:
1. Initialization
create unlogged table x as select random() r from generate_series(1,1e7);
create index on x(r);
create index on x(r);
create index on x(r);
create index on x(r);
create index on x(r);
create index on x(r);
create index on x(r);
vacuum;
2. pgbench with 1 client
insert into x select random() from generate_series(0,10) x;
vacuum x;
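For reproducibility, the run could be driven with something like the sketch below; the exact pgbench invocation (flags, duration, database name) is my assumption, since it is not given above — only the per-transaction script is.

```shell
# Write the per-transaction script described above.
cat > workload.sql <<'EOF'
insert into x select random() from generate_series(0,10) x;
vacuum x;
EOF

# Assumed invocation (not from the thread): 1 client, custom script, 60 s run.
# pgbench -n -c 1 -T 60 -f workload.sql postgres

# Sanity check that the script was written as intended.
grep -c 'vacuum x' workload.sql
```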
On my laptop I see a ~3% increase in pgbench TPS (roughly from 101 to 104), but the statistical noise is significant, bigger than the performance change itself. Perhaps a less noisy benchmark can be devised.
What do you think? If this approach seems worthwhile, I can adapt the same technique to other AMs.
Best regards, Andrey Borodin.
[0]: https://rhaas.blogspot.com/2024/08/postgresql-hacking-workshop-september.html
Attachments:
0001-Prototype-B-tree-vacuum-streamlineing.patch
From f7fb9d54c1a663cded37c0186ed79514bd133b6f Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sat, 19 Oct 2024 12:46:29 +0500
Subject: [PATCH] Prototype B-tree vacuum streamlineing
---
src/backend/access/nbtree/nbtree.c | 66 ++++++++++++++++++++++++------
1 file changed, 54 insertions(+), 12 deletions(-)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 56e502c4fc..5784d97f8f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -86,7 +86,7 @@ typedef struct BTParallelScanDescData *BTParallelScanDesc;
static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
IndexBulkDeleteCallback callback, void *callback_state,
BTCycleId cycleid);
-static void btvacuumpage(BTVacState *vstate, BlockNumber scanblkno);
+static void btvacuumpage(BTVacState *vstate, Buffer buf);
static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
IndexTuple posting,
OffsetNumber updatedoffset,
@@ -1013,6 +1013,45 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
needLock = !RELATION_IS_LOCAL(rel);
scanblkno = BTREE_METAPAGE + 1;
+
+ {
+ /* streamline reading most of index data */
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream = NULL;
+ p.current_blocknum = scanblkno;
+ if (needLock)
+ LockRelationForExtension(rel, ExclusiveLock);
+ /* We only streamline number of blocks that are know at the beginning */
+ p.last_exclusive = RelationGetNumberOfBlocks(rel);
+ if (needLock)
+ UnlockRelationForExtension(rel, ExclusiveLock);
+ stream = read_stream_begin_relation(READ_STREAM_FULL,
+ info->strategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
+ for (; scanblkno < p.last_exclusive; scanblkno++)
+ {
+ Buffer buf = read_stream_next_buffer(stream, NULL);
+ /*
+ * We expect that blocks are returned in order.
+ * However, we do not depent on it much, and in future ths
+ * expetation might change.
+ * Currently, this expectation only matters for progress reporting.
+ */
+ Assert(BufferGetBlockNumber(buf) == scanblkno);
+ btvacuumpage(&vstate, buf);
+ if (info->report_progress)
+ pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
+ scanblkno);
+ }
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ read_stream_end(stream);
+ }
+
+ /* Now we have to process pages created after streamlined vacuum */
for (;;)
{
/* Get the current relation length */
@@ -1032,7 +1071,9 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
/* Iterate over pages, then loop back to recheck length */
for (; scanblkno < num_pages; scanblkno++)
{
- btvacuumpage(&vstate, scanblkno);
+ Buffer buf = ReadBufferExtended(rel, MAIN_FORKNUM, scanblkno, RBM_NORMAL,
+ info->strategy);
+ btvacuumpage(&vstate, buf);
if (info->report_progress)
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
scanblkno);
@@ -1069,7 +1110,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* recycled (i.e. before the page split).
*/
static void
-btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
+btvacuumpage(BTVacState *vstate, Buffer buf)
{
IndexVacuumInfo *info = vstate->info;
IndexBulkDeleteResult *stats = vstate->stats;
@@ -1080,7 +1121,7 @@ btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
bool attempt_pagedel;
BlockNumber blkno,
backtrack_to;
- Buffer buf;
+ BlockNumber scanblkno = BufferGetBlockNumber(buf);
Page page;
BTPageOpaque opaque;
@@ -1094,14 +1135,6 @@ backtrack:
/* call vacuum_delay_point while not holding any buffer lock */
vacuum_delay_point();
- /*
- * We can't use _bt_getbuf() here because it always applies
- * _bt_checkpage(), which will barf on an all-zero page. We want to
- * recycle all-zero pages, not fail. Also, we want to use a nondefault
- * buffer access strategy.
- */
- buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
- info->strategy);
_bt_lockbuf(rel, buf, BT_READ);
page = BufferGetPage(buf);
opaque = NULL;
@@ -1388,6 +1421,15 @@ backtrack:
if (backtrack_to != P_NONE)
{
blkno = backtrack_to;
+
+ /*
+ * We can't use _bt_getbuf() here because it always applies
+ * _bt_checkpage(), which will barf on an all-zero page. We want to
+ * recycle all-zero pages, not fail. Also, we want to use a nondefault
+ * buffer access strategy.
+ */
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+ info->strategy);
goto backtrack;
}
}
--
2.39.5 (Apple Git-154)
Hi Andrey,
On Sat, Oct 19, 2024 at 5:39 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
Hi hackers!
On a recent hacking workshop [0] Thomas mentioned that patches using new API would be welcomed.
So I prototyped streamlining of B-tree vacuum for a discussion.
When cleaning an index we must visit every index tuple, thus we uphold a special invariant:
After checking a trailing block, it must be last according to subsequent RelationGetNumberOfBlocks(rel) call.
This invariant does not allow us to completely replace block loop with streamlining. That's why streamlining is done only for number of blocks returned by first RelationGetNumberOfBlocks(rel) call. A tail is processed with regular ReadBufferExtended().
I'm wondering why this is the case; ISTM that we can do p.current_blocknum = scanblkno and p.last_exclusive = num_pages in each iteration of the outer for loop?
+ /* We only streamline number of blocks that are know at the beginning */
know -> known
+ * However, we do not depent on it much, and in future ths
+ * expetation might change.
depent -> depend
ths -> this
expetation -> expectation
Also, it's worth mentioning that we have to jump to the left blocks from a recently split pages. We also do it with regular ReadBufferExtended(). That's why signature btvacuumpage() now accepts a buffer, not a block number.
I've benchmarked the patch on my laptop (MacBook Air M3) with following workload:
1. Initialization
create unlogged table x as select random() r from generate_series(1,1e7);
create index on x(r);
create index on x(r);
create index on x(r);
create index on x(r);
create index on x(r);
create index on x(r);
create index on x(r);
vacuum;
2. pgbench with 1 client
insert into x select random() from generate_series(0,10) x;
vacuum x;
On my laptop I see ~3% increase in TPS of the pgbench (~ from 101 to 104), but statistical noise is very significant, bigger than performance change. Perhaps, a less noisy benchmark can be devised.
What do you think? If this approach seems worthwhile, I can adapt same technology to other AMs.
I think this is a use case where the read stream api fits very well, thanks.
Best regards, Andrey Borodin.
[0] https://rhaas.blogspot.com/2024/08/postgresql-hacking-workshop-september.html
--
Regards
Junwang Zhao
On 19 Oct 2024, at 20:41, Junwang Zhao <zhjwpku@gmail.com> wrote:
I'm wondering why is the case, ISTM that we can do *p.current_blocknum
= scanblkno*
and *p.last_exclusive = num_pages* in each loop of the outer for?
Thanks for reviewing!
AFAIK we cannot restart a stream once it's finished, so we have a race condition between the main loop and the callback caller.
Resolving this race condition would make the code much more complex for a relatively small benefit.
I'll address typos in next patch version, thank you for looking into this.
Best regards, Andrey Borodin.
On Sun, Oct 20, 2024 at 3:19 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
On 19 Oct 2024, at 20:41, Junwang Zhao <zhjwpku@gmail.com> wrote:
I'm wondering why this is the case; ISTM that we can do p.current_blocknum = scanblkno and p.last_exclusive = num_pages in each iteration of the outer for loop?
Thanks for reviewing!
AFAIK we cannot restart stream if it's finished, so we have a race condition of main loop and callback caller.
Resolving this race condition would make code much more complex for a relatively small benefit.
I'm not sure I expressed myself correctly; I didn't mean to restart the stream.
I meant that we can create a new stream in each iteration of the outer loop. I attached a refactoring as 0002 on top of your 0001; correct me if I'm wrong.
I'll address typos in next patch version, thank you for looking into this.
Best regards, Andrey Borodin.
--
Regards
Junwang Zhao
Attachments:
v2-0001-Prototype-B-tree-vacuum-streamlineing.patch
From c21498c4c18432a6c13ef1989182e0f84a171093 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sat, 19 Oct 2024 12:46:29 +0500
Subject: [PATCH v2 1/2] Prototype B-tree vacuum streamlineing
---
src/backend/access/nbtree/nbtree.c | 66 ++++++++++++++++++++++++------
1 file changed, 54 insertions(+), 12 deletions(-)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index f4f79f2706..b6762e9c30 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -88,7 +88,7 @@ typedef struct BTParallelScanDescData *BTParallelScanDesc;
static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
IndexBulkDeleteCallback callback, void *callback_state,
BTCycleId cycleid);
-static void btvacuumpage(BTVacState *vstate, BlockNumber scanblkno);
+static void btvacuumpage(BTVacState *vstate, Buffer buf);
static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
IndexTuple posting,
OffsetNumber updatedoffset,
@@ -1040,6 +1040,45 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
needLock = !RELATION_IS_LOCAL(rel);
scanblkno = BTREE_METAPAGE + 1;
+
+ {
+ /* streamline reading most of index data */
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream = NULL;
+ p.current_blocknum = scanblkno;
+ if (needLock)
+ LockRelationForExtension(rel, ExclusiveLock);
+ /* We only streamline number of blocks that are know at the beginning */
+ p.last_exclusive = RelationGetNumberOfBlocks(rel);
+ if (needLock)
+ UnlockRelationForExtension(rel, ExclusiveLock);
+ stream = read_stream_begin_relation(READ_STREAM_FULL,
+ info->strategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
+ for (; scanblkno < p.last_exclusive; scanblkno++)
+ {
+ Buffer buf = read_stream_next_buffer(stream, NULL);
+ /*
+ * We expect that blocks are returned in order.
+ * However, we do not depent on it much, and in future ths
+ * expetation might change.
+ * Currently, this expectation only matters for progress reporting.
+ */
+ Assert(BufferGetBlockNumber(buf) == scanblkno);
+ btvacuumpage(&vstate, buf);
+ if (info->report_progress)
+ pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
+ scanblkno);
+ }
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ read_stream_end(stream);
+ }
+
+ /* Now we have to process pages created after streamlined vacuum */
for (;;)
{
/* Get the current relation length */
@@ -1059,7 +1098,9 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
/* Iterate over pages, then loop back to recheck length */
for (; scanblkno < num_pages; scanblkno++)
{
- btvacuumpage(&vstate, scanblkno);
+ Buffer buf = ReadBufferExtended(rel, MAIN_FORKNUM, scanblkno, RBM_NORMAL,
+ info->strategy);
+ btvacuumpage(&vstate, buf);
if (info->report_progress)
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
scanblkno);
@@ -1096,7 +1137,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* recycled (i.e. before the page split).
*/
static void
-btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
+btvacuumpage(BTVacState *vstate, Buffer buf)
{
IndexVacuumInfo *info = vstate->info;
IndexBulkDeleteResult *stats = vstate->stats;
@@ -1107,7 +1148,7 @@ btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
bool attempt_pagedel;
BlockNumber blkno,
backtrack_to;
- Buffer buf;
+ BlockNumber scanblkno = BufferGetBlockNumber(buf);
Page page;
BTPageOpaque opaque;
@@ -1121,14 +1162,6 @@ backtrack:
/* call vacuum_delay_point while not holding any buffer lock */
vacuum_delay_point();
- /*
- * We can't use _bt_getbuf() here because it always applies
- * _bt_checkpage(), which will barf on an all-zero page. We want to
- * recycle all-zero pages, not fail. Also, we want to use a nondefault
- * buffer access strategy.
- */
- buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
- info->strategy);
_bt_lockbuf(rel, buf, BT_READ);
page = BufferGetPage(buf);
opaque = NULL;
@@ -1415,6 +1448,15 @@ backtrack:
if (backtrack_to != P_NONE)
{
blkno = backtrack_to;
+
+ /*
+ * We can't use _bt_getbuf() here because it always applies
+ * _bt_checkpage(), which will barf on an all-zero page. We want to
+ * recycle all-zero pages, not fail. Also, we want to use a nondefault
+ * buffer access strategy.
+ */
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+ info->strategy);
goto backtrack;
}
}
--
2.39.5
v2-0002-refactor.patch
From cf650befbbe48a14f212fb847a12aee16c2e826f Mon Sep 17 00:00:00 2001
From: Junwang Zhao <zhjwpku@gmail.com>
Date: Sun, 20 Oct 2024 10:09:59 +0000
Subject: [PATCH v2 2/2] refactor
Signed-off-by: Junwang Zhao <zhjwpku@gmail.com>
---
src/backend/access/nbtree/nbtree.c | 56 +++++++++---------------------
1 file changed, 16 insertions(+), 40 deletions(-)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b6762e9c30..1d493aa680 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -971,6 +971,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BlockNumber num_pages;
BlockNumber scanblkno;
bool needLock;
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream = NULL;
/*
* Reset fields that track information about the entire index now. This
@@ -1041,44 +1043,6 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
scanblkno = BTREE_METAPAGE + 1;
- {
- /* streamline reading most of index data */
- BlockRangeReadStreamPrivate p;
- ReadStream *stream = NULL;
- p.current_blocknum = scanblkno;
- if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
- /* We only streamline number of blocks that are know at the beginning */
- p.last_exclusive = RelationGetNumberOfBlocks(rel);
- if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
- stream = read_stream_begin_relation(READ_STREAM_FULL,
- info->strategy,
- rel,
- MAIN_FORKNUM,
- block_range_read_stream_cb,
- &p,
- 0);
- for (; scanblkno < p.last_exclusive; scanblkno++)
- {
- Buffer buf = read_stream_next_buffer(stream, NULL);
- /*
- * We expect that blocks are returned in order.
- * However, we do not depent on it much, and in future ths
- * expetation might change.
- * Currently, this expectation only matters for progress reporting.
- */
- Assert(BufferGetBlockNumber(buf) == scanblkno);
- btvacuumpage(&vstate, buf);
- if (info->report_progress)
- pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
- scanblkno);
- }
- Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
- read_stream_end(stream);
- }
-
- /* Now we have to process pages created after streamlined vacuum */
for (;;)
{
/* Get the current relation length */
@@ -1095,16 +1059,28 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
/* Quit if we've scanned the whole relation */
if (scanblkno >= num_pages)
break;
+
+ p.current_blocknum = scanblkno;
+ p.last_exclusive = num_pages;
+ stream = read_stream_begin_relation(READ_STREAM_FULL,
+ info->strategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
+
/* Iterate over pages, then loop back to recheck length */
for (; scanblkno < num_pages; scanblkno++)
{
- Buffer buf = ReadBufferExtended(rel, MAIN_FORKNUM, scanblkno, RBM_NORMAL,
- info->strategy);
+ Buffer buf = read_stream_next_buffer(stream, NULL);
btvacuumpage(&vstate, buf);
if (info->report_progress)
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
scanblkno);
}
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ read_stream_end(stream);
}
/* Set statistics num_pages field to final size of index */
--
2.39.5
On 20 Oct 2024, at 15:16, Junwang Zhao <zhjwpku@gmail.com> wrote:
I'm not sure if I did not express myself correctly, I didn't mean to
restart the stream,
I mean we can create a new stream for each outer loop, I attached a
refactor 0002
based on your 0001, correct me if I'm wrong.
I really like how the code looks with this refactoring. But I think we need some input from Thomas.
Is it OK if we start a handful of streams for one page each at the end of the vacuum scan? How costly is it to start a new stream?
Best regards, Andrey Borodin.
On Sun, Oct 20, 2024 at 10:19 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
On 20 Oct 2024, at 15:16, Junwang Zhao <zhjwpku@gmail.com> wrote:
I'm not sure if I did not express myself correctly, I didn't mean to
restart the stream,
I mean we can create a new stream for each outer loop, I attached a
refactor 0002
based on your 0001, correct me if I'm wrong.
I really like how the code looks with this refactoring. But I think we need some input from Thomas.
Is it OK if we start handful of streams for 1 page at the end of vacuum scan? How costly is to start a new scan?
You shouldn't need either loop in btvacuumscan().
For the inner loop:
for (; scanblkno < num_pages; scanblkno++)
{
Buffer buf = read_stream_next_buffer(stream, NULL);
btvacuumpage(&vstate, buf);
if (info->report_progress)
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
scanblkno);
}
you should just be able to do something like
Buffer buf = read_stream_next_buffer(stream, NULL);
if (BufferIsValid(buf))
btvacuumpage(&vstate, buf);
Obviously I am eliding some details and clean-up and such. But your
read stream callback should be responsible for advancing the block
number and thus you shouldn't need to loop like this in
btvacuumscan().
The whole point of the read stream callback provided by the caller is
that the logic to get the next block should be contained there
(read_stream_get_block() is an exception to this).
For the outer loop, I feel like we have options. For example, maybe
the read stream callback can call RelationGetNumberOfBlocks(). I mean
maybe we don't want to have to take a relation extension lock in a
callback.
So, alternatively, we could add some kind of restartable flag to the
read stream API. So, after the callback returns InvalidBlockNumber and
the read_stream_next_buffer() returns NULL, we could call something
like read_stream_update() or read_stream_restart() or something. We
would have updated the BlockRangeReadStreamPrivate->last_exclusive
value. In your case it might not be substantially different
operationally than making new read streams (since you are not
allocating memory for a per-buffer data structure). But, I think the
code would read much better than making new read stream objects in a
loop.
- Melanie
On Mon, Oct 21, 2024 at 3:34 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
For the outer loop, I feel like we have options. For example, maybe
the read stream callback can call RelationGetNumberOfBlocks(). I mean
maybe we don't want to have to take a relation extension lock in a
callback.
Also, given this note in btvacuumscan:
* XXX: Now that new pages are locked with RBM_ZERO_AND_LOCK, I don't
* think the use of the extension lock is still required.
Maybe we can stop requiring the extension lock and then I think it
might be okay to call RelationGetNumberOfBlocks() in the callback.
Probably needs more thought though.
- Melanie
Melanie, thanks for your comments.
21.10.2024, 22:34, "Melanie Plageman" <melanieplageman@gmail.com>:
The whole point of the read stream callback provided by the caller is that the logic to get the next block should be there
We must get the number of blocks after examining the last block. But the callback returning EOF might be called before that. With the current API we have to restart.
Removing the extension lock will not change this.
Best regards, Andrey Borodin.
On Mon, Oct 21, 2024 at 4:49 PM Andrei Borodin <x4mmm@yandex-team.ru> wrote:
21.10.2024, 22:34, "Melanie Plageman" <melanieplageman@gmail.com>:
The whole point of the read stream callback provided by the caller is
that the logic to get the next block should be there
We must get the number of blocks after examining the last block. But the callback returning EOF might be called before that. With the current API we have to restart.
Removing the extension lock will not change this.
I was suggesting you call RelationGetNumberOfBlocks() once
current_block == last_exclusive in the callback itself.
- Melanie
On 22 Oct 2024, at 00:05, Melanie Plageman <melanieplageman@gmail.com> wrote:
I was suggesting you call RelationGetNumberOfBlocks() once
current_block == last_exclusive in the callback itself.
Consider the following sequence of events:
0. We schedule some buffers for IO
1. We call RelationGetNumberOfBlocks() in the callback when current_block == last_exclusive and return InvalidBlockNumber to signal EOF
After this:
2. Some page gets split into a new page with number last_exclusive
3. The buffers from IO are returned and vacuumed, but not the one with number last_exclusive, because it was never scheduled
Maybe I'm missing something...
Best regards, Andrey Borodin.
On Tue, Oct 22, 2024 at 2:30 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
On 22 Oct 2024, at 00:05, Melanie Plageman <melanieplageman@gmail.com> wrote:
I was suggesting you call RelationGetNumberOfBlocks() once
current_block == last_exclusive in the callback itself.
Consider following sequence of events:
0. We schedule some buffers for IO
1. We call RelationGetNumberOfBlocks() in callback when current_block == last_exclusive and return InvalidBlockNumber to signal EOF
After this:
2. Some page is getting split into new page with number last_exclusive
3. Buffers from IO are returned and vacuumed, but not with number last_exclusive, because it was not scheduled
Ah, right, the callback might return InvalidBlockNumber far before
we've actually read (and vacuumed) the blocks it is specifying.
I ran into something similar when trying to use the read stream API
for index prefetching. I added TIDs from the index to a queue that was
passed to the read stream and available in the callback. When the
queue was empty, I needed to check if there were more index entries
and, if so, add more TIDs to the queue (instead of ending the read
stream). So, I wanted some way for the callback to return
InvalidBlockNumber when there might actually be more blocks to
request. This is a kind of "restarting" behavior.
In that case, though, the reason the callback couldn't get more TIDs
when the queue was empty was because of layering violations and not,
like in the case of btree vacuum, because the index might be in a
different state after vacuuming the "last" block. Perhaps there is a
way to make the read stream restartable, though.
I just can't help wondering if there is a way to refactor the code
(potentially in a more invasive way) to make it more natural to use
the read stream API here. I usually hate when people give me such
unhelpful review feedback, though. So, carry on.
- Melanie
On 22 Oct 2024, at 16:42, Melanie Plageman <melanieplageman@gmail.com> wrote:
Ah, right, the callback might return InvalidBlockNumber far before
we've actually read (and vacuumed) the blocks it is specifying.
I've discussed the issue with Thomas at PGConf.EU and he proposed using a stream reset.
PFA v3.
Best regards, Andrey Borodin.
Attachments:
v3-0001-Prototype-B-tree-vacuum-streamlineing.patch
From 185c9469feff2deca7801882a3747008aab22966 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sat, 19 Oct 2024 12:46:29 +0500
Subject: [PATCH v3] Prototype B-tree vacuum streamlineing
Signed-off-by: Andrey Borodin <x4mmm@yandex-team.ru>
Signed-off-by: Junwang Zhao <zhjwpku@gmail.com>
---
src/backend/access/nbtree/nbtree.c | 46 ++++++++++++++++++++++--------
1 file changed, 34 insertions(+), 12 deletions(-)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index f4f79f2..2dc89cb 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -88,7 +88,7 @@ typedef struct BTParallelScanDescData *BTParallelScanDesc;
static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
IndexBulkDeleteCallback callback, void *callback_state,
BTCycleId cycleid);
-static void btvacuumpage(BTVacState *vstate, BlockNumber scanblkno);
+static void btvacuumpage(BTVacState *vstate, Buffer buf);
static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
IndexTuple posting,
OffsetNumber updatedoffset,
@@ -971,6 +971,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BlockNumber num_pages;
BlockNumber scanblkno;
bool needLock;
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream = NULL;
/*
* Reset fields that track information about the entire index now. This
@@ -1040,6 +1042,13 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
needLock = !RELATION_IS_LOCAL(rel);
scanblkno = BTREE_METAPAGE + 1;
+ stream = read_stream_begin_relation(READ_STREAM_FULL,
+ info->strategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
for (;;)
{
/* Get the current relation length */
@@ -1056,15 +1065,27 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
/* Quit if we've scanned the whole relation */
if (scanblkno >= num_pages)
break;
+
+ p.current_blocknum = scanblkno;
+ p.last_exclusive = num_pages;
+ /*
+ * After reaching the end we have to reset stream to use it again.
+ * Extra restart in case of just one iteration does not cost us much.
+ */
+ read_stream_reset(stream);
+
/* Iterate over pages, then loop back to recheck length */
for (; scanblkno < num_pages; scanblkno++)
{
- btvacuumpage(&vstate, scanblkno);
+ Buffer buf = read_stream_next_buffer(stream, NULL);
+ btvacuumpage(&vstate, buf);
if (info->report_progress)
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
scanblkno);
}
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
}
+ read_stream_end(stream);
/* Set statistics num_pages field to final size of index */
stats->num_pages = num_pages;
@@ -1096,7 +1117,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* recycled (i.e. before the page split).
*/
static void
-btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
+btvacuumpage(BTVacState *vstate, Buffer buf)
{
IndexVacuumInfo *info = vstate->info;
IndexBulkDeleteResult *stats = vstate->stats;
@@ -1107,7 +1128,7 @@ btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
bool attempt_pagedel;
BlockNumber blkno,
backtrack_to;
- Buffer buf;
+ BlockNumber scanblkno = BufferGetBlockNumber(buf);
Page page;
BTPageOpaque opaque;
@@ -1121,14 +1142,6 @@ backtrack:
/* call vacuum_delay_point while not holding any buffer lock */
vacuum_delay_point();
- /*
- * We can't use _bt_getbuf() here because it always applies
- * _bt_checkpage(), which will barf on an all-zero page. We want to
- * recycle all-zero pages, not fail. Also, we want to use a nondefault
- * buffer access strategy.
- */
- buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
- info->strategy);
_bt_lockbuf(rel, buf, BT_READ);
page = BufferGetPage(buf);
opaque = NULL;
@@ -1415,6 +1428,15 @@ backtrack:
if (backtrack_to != P_NONE)
{
blkno = backtrack_to;
+
+ /*
+ * We can't use _bt_getbuf() here because it always applies
+ * _bt_checkpage(), which will barf on an all-zero page. We want to
+ * recycle all-zero pages, not fail. Also, we want to use a nondefault
+ * buffer access strategy.
+ */
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+ info->strategy);
goto backtrack;
}
}
--
2.39.5 (Apple Git-154)
On Wed, Oct 23, 2024 at 9:32 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
On 22 Oct 2024, at 16:42, Melanie Plageman <melanieplageman@gmail.com> wrote:
Ah, right, the callback might return InvalidBlockNumber far before
we've actually read (and vacuumed) the blocks it is specifying.
I've discussed the issue with Thomas on PGConf.eu and he proposed to use stream reset.
PFA v3.
Yeah, read_stream_reset fits better.
The patch LGTM, thanks.
Best regards, Andrey Borodin.
--
Regards
Junwang Zhao
On Wed, Oct 23, 2024 at 9:32 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
On 22 Oct 2024, at 16:42, Melanie Plageman <melanieplageman@gmail.com> wrote:
Ah, right, the callback might return InvalidBlockNumber far before
we've actually read (and vacuumed) the blocks it is specifying.
I've discussed the issue with Thomas on PGConf.eu and he proposed to use stream reset.
That approach seems promising.
PFA v3.
Note that you don't check if buf is valid here and break out of the
inner loop when it is invalid.
for (; scanblkno < num_pages; scanblkno++)
{
Buffer buf = read_stream_next_buffer(stream, NULL);
btvacuumpage(&vstate, buf);
...
}
By doing read_stream_reset() before you first invoke read_stream_next_buffer(), it seems like you invalidate the distance set in read_stream_begin_relation():
if (flags & READ_STREAM_FULL)
stream->distance = Min(max_pinned_buffers, io_combine_limit);
->
stream->distance = 1 in read_stream_reset()
I still don't really like the inner loop using scanblkno:
for (; scanblkno < num_pages; scanblkno++)
{
Buffer buf = read_stream_next_buffer(stream, NULL);
btvacuumpage(&vstate, buf);
if (info->report_progress)
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
scanblkno);
}
Since you already advance a block number in the callback, I find it
confusing to also use the block number as a loop condition here. I
think it would be clearer to loop on read_stream_next_buffer()
returning a valid buffer (and then move the progress reporting into
btvacuumpage()).
But, I think I would need to study this btree code more to do a more
informed review of the patch.
- Melanie
Thanks!
On 23 Oct 2024, at 18:17, Melanie Plageman <melanieplageman@gmail.com> wrote:
Note that you don't check if buf is valid here and break out of the
inner loop when it is invalid.
I've added two asserts to clarify expectations.
By doing read_stream_reset() before you first invoke
read_stream_next_buffer(), it seems like you invalidate the distance set
in read_stream_begin_relation()
I've moved the reset so the happy path has a proper distance.
Since you already advance a block number in the callback, I find it
confusing to also use the block number as a loop condition here. I
think it would be clearer to loop on read_stream_next_buffer()
returning a valid buffer (and then move the progress reporting into
btvacuumpage()).
I'll think about how to restructure the flow there...
I've also added a TAP test to check the stream-reset logic.
Best regards, Andrey Borodin.
Attachments:
v4-0001-Prototype-B-tree-vacuum-streamlineing.patchapplication/octet-stream; name=v4-0001-Prototype-B-tree-vacuum-streamlineing.patch; x-unix-mode=0644Download
From 185c9469feff2deca7801882a3747008aab22966 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sat, 19 Oct 2024 12:46:29 +0500
Subject: [PATCH v4 1/4] Prototype B-tree vacuum streamlineing
Signed-off-by: Andrey Borodin <x4mmm@yandex-team.ru>
Signed-off-by: Junwang Zhao <zhjwpku@gmail.com>
---
src/backend/access/nbtree/nbtree.c | 46 ++++++++++++++++++++++--------
1 file changed, 34 insertions(+), 12 deletions(-)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index f4f79f2..2dc89cb 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -88,7 +88,7 @@ typedef struct BTParallelScanDescData *BTParallelScanDesc;
static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
IndexBulkDeleteCallback callback, void *callback_state,
BTCycleId cycleid);
-static void btvacuumpage(BTVacState *vstate, BlockNumber scanblkno);
+static void btvacuumpage(BTVacState *vstate, Buffer buf);
static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
IndexTuple posting,
OffsetNumber updatedoffset,
@@ -971,6 +971,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
BlockNumber num_pages;
BlockNumber scanblkno;
bool needLock;
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream = NULL;
/*
* Reset fields that track information about the entire index now. This
@@ -1040,6 +1042,13 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
needLock = !RELATION_IS_LOCAL(rel);
scanblkno = BTREE_METAPAGE + 1;
+ stream = read_stream_begin_relation(READ_STREAM_FULL,
+ info->strategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
for (;;)
{
/* Get the current relation length */
@@ -1056,15 +1065,27 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
/* Quit if we've scanned the whole relation */
if (scanblkno >= num_pages)
break;
+
+ p.current_blocknum = scanblkno;
+ p.last_exclusive = num_pages;
+ /*
+ * After reaching the end we have to reset stream to use it again.
+ * Extra restart in case of just one iteration does not cost us much.
+ */
+ read_stream_reset(stream);
+
/* Iterate over pages, then loop back to recheck length */
for (; scanblkno < num_pages; scanblkno++)
{
- btvacuumpage(&vstate, scanblkno);
+ Buffer buf = read_stream_next_buffer(stream, NULL);
+ btvacuumpage(&vstate, buf);
if (info->report_progress)
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
scanblkno);
}
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
}
+ read_stream_end(stream);
/* Set statistics num_pages field to final size of index */
stats->num_pages = num_pages;
@@ -1096,7 +1117,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* recycled (i.e. before the page split).
*/
static void
-btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
+btvacuumpage(BTVacState *vstate, Buffer buf)
{
IndexVacuumInfo *info = vstate->info;
IndexBulkDeleteResult *stats = vstate->stats;
@@ -1107,7 +1128,7 @@ btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
bool attempt_pagedel;
BlockNumber blkno,
backtrack_to;
- Buffer buf;
+ BlockNumber scanblkno = BufferGetBlockNumber(buf);
Page page;
BTPageOpaque opaque;
@@ -1121,14 +1142,6 @@ backtrack:
/* call vacuum_delay_point while not holding any buffer lock */
vacuum_delay_point();
- /*
- * We can't use _bt_getbuf() here because it always applies
- * _bt_checkpage(), which will barf on an all-zero page. We want to
- * recycle all-zero pages, not fail. Also, we want to use a nondefault
- * buffer access strategy.
- */
- buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
- info->strategy);
_bt_lockbuf(rel, buf, BT_READ);
page = BufferGetPage(buf);
opaque = NULL;
@@ -1415,6 +1428,15 @@ backtrack:
if (backtrack_to != P_NONE)
{
blkno = backtrack_to;
+
+ /*
+ * We can't use _bt_getbuf() here because it always applies
+ * _bt_checkpage(), which will barf on an all-zero page. We want to
+ * recycle all-zero pages, not fail. Also, we want to use a nondefault
+ * buffer access strategy.
+ */
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+ info->strategy);
goto backtrack;
}
}
--
2.39.5 (Apple Git-154)
v4-0004-Move-reset.patchapplication/octet-stream; name=v4-0004-Move-reset.patch; x-unix-mode=0644Download
From 17d14e6ccf9b74dffdb4bdf36c7bcf933418c625 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Wed, 23 Oct 2024 20:54:02 +0300
Subject: [PATCH v4 4/4] Move reset
---
src/backend/access/nbtree/nbtree.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 3a775e8..c9b8107 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1073,11 +1073,6 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
p.current_blocknum = scanblkno;
p.last_exclusive = num_pages;
- /*
- * After reaching the end we have to reset stream to use it again.
- * Extra restart in case of just one iteration does not cost us much.
- */
- read_stream_reset(stream);
/* Iterate over pages, then loop back to recheck length */
for (; scanblkno < num_pages; scanblkno++)
@@ -1090,6 +1085,11 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
scanblkno);
}
+ /*
+ * After reaching the end we have to reset stream to use it again.
+ * Extra restart in case of just one iteration does not cost us much.
+ */
+ read_stream_reset(stream);
Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
}
read_stream_end(stream);
--
2.39.5 (Apple Git-154)
v4-0003-Add-asserts-to-address-review-feedback.patchapplication/octet-stream; name=v4-0003-Add-asserts-to-address-review-feedback.patch; x-unix-mode=0644Download
From e180131f802cb5ba61d8dbc04baf0b5f65f8b149 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Wed, 23 Oct 2024 20:50:33 +0300
Subject: [PATCH v4 3/4] Add asserts to address review feedback
---
src/backend/access/nbtree/nbtree.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index bd88b33..3a775e8 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1083,6 +1083,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
for (; scanblkno < num_pages; scanblkno++)
{
Buffer buf = read_stream_next_buffer(stream, NULL);
+ Assert(BufferIsValid(buf));
+ Assert(BufferGetBlockNumber(buf) == scanblkno);
btvacuumpage(&vstate, buf);
if (info->report_progress)
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
--
2.39.5 (Apple Git-154)
v4-0002-Add-TAP-tests-that-verifies-B-tree-vacuum.patchapplication/octet-stream; name=v4-0002-Add-TAP-tests-that-verifies-B-tree-vacuum.patch; x-unix-mode=0644Download
From e1a050a893b6f4ee2edf91367ea76b6c2aa16c4a Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Wed, 23 Oct 2024 20:46:50 +0300
Subject: [PATCH v4 2/4] Add TAP tests that verifies B-tree vacuum
---
src/backend/access/nbtree/nbtree.c | 5 ++
src/test/modules/test_misc/meson.build | 1 +
.../modules/test_misc/t/007_vacuum_btree.pl | 81 +++++++++++++++++++
3 files changed, 87 insertions(+)
create mode 100644 src/test/modules/test_misc/t/007_vacuum_btree.pl
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 2dc89cb..bd88b33 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -34,6 +34,7 @@
#include "storage/smgr.h"
#include "utils/fmgrprotos.h"
#include "utils/index_selfuncs.h"
+#include "utils/injection_point.h"
#include "utils/memutils.h"
@@ -1066,6 +1067,10 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
if (scanblkno >= num_pages)
break;
+ /* In 007_vacuum_btree test we need to coordinate two points here */
+ INJECTION_POINT("nbtree-vacuum-1");
+ INJECTION_POINT("nbtree-vacuum-2");
+
p.current_blocknum = scanblkno;
p.last_exclusive = num_pages;
/*
diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index 283ffa7..a456a5e 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -15,6 +15,7 @@ tests += {
't/004_io_direct.pl',
't/005_timeouts.pl',
't/006_signal_autovacuum.pl',
+ 't/007_vacuum_btree.pl',
],
},
}
diff --git a/src/test/modules/test_misc/t/007_vacuum_btree.pl b/src/test/modules/test_misc/t/007_vacuum_btree.pl
new file mode 100644
index 0000000..e8df4c0
--- /dev/null
+++ b/src/test/modules/test_misc/t/007_vacuum_btree.pl
@@ -0,0 +1,81 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# This test verifies that B-tree vacuum can restart read stream.
+# To do so we need to insert some data during vacuum. So we wait in injection point
+# after first vacuum scan. During this wait we insert some data forcing page split.
+# this split will trigger relation extension and subsequent read_stream_reset().
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+# Initialize postgres
+my $psql_err = '';
+my $psql_out = '';
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+
+# This ensures autovacuum do not run
+$node->append_conf('postgresql.conf', 'autovacuum = off');
+$node->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+$node->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# From this point, vacuum worker will wait at startup.
+$node->safe_psql('postgres',
+ "SELECT injection_points_attach('nbtree-vacuum-2', 'wait');");
+
+my $psql_session = $node->background_psql('postgres');
+
+$psql_session->query_until(
+ qr/starting_bg_psql/,
+ q(\echo starting_bg_psql
+ create table a as select random() r from generate_series(1,100) x;
+ create index on a(r);
+ delete from a;
+ vacuum a;
+ ));
+
+#print $node->safe_psql('postgres','select * from pg_stat_activity');
+
+# Wait until an vacuum worker starts.
+$node->wait_for_event('client backend', 'nbtree-vacuum-2');
+
+$node->safe_psql('postgres',
+ "SELECT injection_points_attach('nbtree-vacuum-1', 'wait');");
+
+# Here's the key point of a test: during vacuum we add some page splits.
+# This will force vacuum into doing another scan thus reseting read stream.
+$node->safe_psql('postgres',
+ "insert into a select x from generate_series(1,3000) x;");
+
+$node->safe_psql('postgres',
+ "SELECT injection_points_detach('nbtree-vacuum-2');");
+$node->safe_psql('postgres',
+ "SELECT injection_points_wakeup('nbtree-vacuum-2');");
+
+# Observe that second scan is reached.
+$node->wait_for_event('client backend', 'nbtree-vacuum-1');
+
+$node->safe_psql('postgres',
+ "SELECT injection_points_detach('nbtree-vacuum-1');");
+$node->safe_psql('postgres',
+ "SELECT injection_points_wakeup('nbtree-vacuum-1');");
+
+ok($psql_session->quit);
+
+done_testing();
--
2.39.5 (Apple Git-154)
On 23 Oct 2024, at 20:57, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
I'll think about how to restructure the flow there...
OK, I've understood how it should be structured. PFA v5. Sorry for the noise.
Best regards, Andrey Borodin.
Attachments:
v5-0001-Prototype-B-tree-vacuum-streamlineing.patchapplication/octet-stream; name=v5-0001-Prototype-B-tree-vacuum-streamlineing.patch; x-unix-mode=0644Download
From af5aaad258125481ef9cce781feef6db9cec3cfa Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sat, 19 Oct 2024 12:46:29 +0500
Subject: [PATCH v5] Prototype B-tree vacuum streamlineing
Signed-off-by: Andrey Borodin <x4mmm@yandex-team.ru>
Signed-off-by: Junwang Zhao <zhjwpku@gmail.com>
---
src/backend/access/nbtree/nbtree.c | 66 ++++++++++-----
src/test/modules/test_misc/meson.build | 1 +
.../modules/test_misc/t/007_vacuum_btree.pl | 81 +++++++++++++++++++
3 files changed, 130 insertions(+), 18 deletions(-)
create mode 100644 src/test/modules/test_misc/t/007_vacuum_btree.pl
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index f4f79f2..5e7bfbf 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -34,6 +34,7 @@
#include "storage/smgr.h"
#include "utils/fmgrprotos.h"
#include "utils/index_selfuncs.h"
+#include "utils/injection_point.h"
#include "utils/memutils.h"
@@ -88,7 +89,7 @@ typedef struct BTParallelScanDescData *BTParallelScanDesc;
static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
IndexBulkDeleteCallback callback, void *callback_state,
BTCycleId cycleid);
-static void btvacuumpage(BTVacState *vstate, BlockNumber scanblkno);
+static void btvacuumpage(BTVacState *vstate, Buffer buf);
static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
IndexTuple posting,
OffsetNumber updatedoffset,
@@ -969,8 +970,9 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
Relation rel = info->index;
BTVacState vstate;
BlockNumber num_pages;
- BlockNumber scanblkno;
bool needLock;
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream = NULL;
/*
* Reset fields that track information about the entire index now. This
@@ -1039,7 +1041,14 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
*/
needLock = !RELATION_IS_LOCAL(rel);
- scanblkno = BTREE_METAPAGE + 1;
+ p.current_blocknum = BTREE_METAPAGE + 1;
+ stream = read_stream_begin_relation(READ_STREAM_FULL,
+ info->strategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
for (;;)
{
/* Get the current relation length */
@@ -1054,17 +1063,37 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
num_pages);
/* Quit if we've scanned the whole relation */
- if (scanblkno >= num_pages)
+ if (p.current_blocknum >= num_pages)
break;
- /* Iterate over pages, then loop back to recheck length */
- for (; scanblkno < num_pages; scanblkno++)
+
+ /* In 007_vacuum_btree test we need to coordinate two distinguishable points here */
+ INJECTION_POINT("nbtree-vacuum-1");
+ INJECTION_POINT("nbtree-vacuum-2");
+
+ p.last_exclusive = num_pages;
+
+ /* Iterate over pages, then loop back to recheck relation length */
+ while(true)
{
- btvacuumpage(&vstate, scanblkno);
+ BlockNumber current_block;
+ Buffer buf = read_stream_next_buffer(stream, NULL);
+ if (!BufferIsValid(buf))
+ break;
+
+ current_block = BufferGetBlockNumber(buf);
+ btvacuumpage(&vstate, buf);
if (info->report_progress)
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
- scanblkno);
+ current_block);
}
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ /*
+ * After reaching the end we have to reset stream to use it again.
+ * Extra restart in case of just one iteration does not cost us much.
+ */
+ read_stream_reset(stream);
}
+ read_stream_end(stream);
/* Set statistics num_pages field to final size of index */
stats->num_pages = num_pages;
@@ -1096,7 +1125,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* recycled (i.e. before the page split).
*/
static void
-btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
+btvacuumpage(BTVacState *vstate, Buffer buf)
{
IndexVacuumInfo *info = vstate->info;
IndexBulkDeleteResult *stats = vstate->stats;
@@ -1107,7 +1136,7 @@ btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
bool attempt_pagedel;
BlockNumber blkno,
backtrack_to;
- Buffer buf;
+ BlockNumber scanblkno = BufferGetBlockNumber(buf);
Page page;
BTPageOpaque opaque;
@@ -1121,14 +1150,6 @@ backtrack:
/* call vacuum_delay_point while not holding any buffer lock */
vacuum_delay_point();
- /*
- * We can't use _bt_getbuf() here because it always applies
- * _bt_checkpage(), which will barf on an all-zero page. We want to
- * recycle all-zero pages, not fail. Also, we want to use a nondefault
- * buffer access strategy.
- */
- buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
- info->strategy);
_bt_lockbuf(rel, buf, BT_READ);
page = BufferGetPage(buf);
opaque = NULL;
@@ -1415,6 +1436,15 @@ backtrack:
if (backtrack_to != P_NONE)
{
blkno = backtrack_to;
+
+ /*
+ * We can't use _bt_getbuf() here because it always applies
+ * _bt_checkpage(), which will barf on an all-zero page. We want to
+ * recycle all-zero pages, not fail. Also, we want to use a nondefault
+ * buffer access strategy.
+ */
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+ info->strategy);
goto backtrack;
}
}
diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index 283ffa7..a456a5e 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -15,6 +15,7 @@ tests += {
't/004_io_direct.pl',
't/005_timeouts.pl',
't/006_signal_autovacuum.pl',
+ 't/007_vacuum_btree.pl',
],
},
}
diff --git a/src/test/modules/test_misc/t/007_vacuum_btree.pl b/src/test/modules/test_misc/t/007_vacuum_btree.pl
new file mode 100644
index 0000000..e8df4c0
--- /dev/null
+++ b/src/test/modules/test_misc/t/007_vacuum_btree.pl
@@ -0,0 +1,81 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# This test verifies that B-tree vacuum can restart read stream.
+# To do so we need to insert some data during vacuum. So we wait in injection point
+# after first vacuum scan. During this wait we insert some data forcing page split.
+# this split will trigger relation extension and subsequent read_stream_reset().
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+# Initialize postgres
+my $psql_err = '';
+my $psql_out = '';
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+
+# This ensures autovacuum do not run
+$node->append_conf('postgresql.conf', 'autovacuum = off');
+$node->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+$node->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# From this point, vacuum worker will wait at startup.
+$node->safe_psql('postgres',
+ "SELECT injection_points_attach('nbtree-vacuum-2', 'wait');");
+
+my $psql_session = $node->background_psql('postgres');
+
+$psql_session->query_until(
+ qr/starting_bg_psql/,
+ q(\echo starting_bg_psql
+ create table a as select random() r from generate_series(1,100) x;
+ create index on a(r);
+ delete from a;
+ vacuum a;
+ ));
+
+#print $node->safe_psql('postgres','select * from pg_stat_activity');
+
+# Wait until an vacuum worker starts.
+$node->wait_for_event('client backend', 'nbtree-vacuum-2');
+
+$node->safe_psql('postgres',
+ "SELECT injection_points_attach('nbtree-vacuum-1', 'wait');");
+
+# Here's the key point of a test: during vacuum we add some page splits.
+# This will force vacuum into doing another scan thus reseting read stream.
+$node->safe_psql('postgres',
+ "insert into a select x from generate_series(1,3000) x;");
+
+$node->safe_psql('postgres',
+ "SELECT injection_points_detach('nbtree-vacuum-2');");
+$node->safe_psql('postgres',
+ "SELECT injection_points_wakeup('nbtree-vacuum-2');");
+
+# Observe that second scan is reached.
+$node->wait_for_event('client backend', 'nbtree-vacuum-1');
+
+$node->safe_psql('postgres',
+ "SELECT injection_points_detach('nbtree-vacuum-1');");
+$node->safe_psql('postgres',
+ "SELECT injection_points_wakeup('nbtree-vacuum-1');");
+
+ok($psql_session->quit);
+
+done_testing();
--
2.39.5 (Apple Git-154)
On Wed, Oct 23, 2024 at 4:29 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
On 23 Oct 2024, at 20:57, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
I'll think about how to restructure the flow there...
OK, I've understood how it should be structured. PFA v5. Sorry for the noise.
I think this would be a bit nicer:
while (BufferIsValid(buf = read_stream_next_buffer(stream, NULL)))
{
block = btvacuumpage(&vstate, buf);
if (info->report_progress)
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE, block);
}
Maybe change btvacuumpage() to return the block number to avoid the
extra BufferGetBlockNumber() calls (those add up).
While looking at this, I started to wonder if it isn't incorrect that
we are not calling pgstat_progress_update_param() for the blocks that
we backtrack and read in btvacuumpage() too (on master as well).
btvacuumpage() may actually have scanned more than one block, so...
Unrelated to code review, but btree index vacuum has the same issues
that kept us from committing streaming heap vacuum last release --
interactions between the buffer access strategy ring buffer size and
the larger reads -- one of which is an increase in the number of WAL
syncs and writes required. Thomas mentions it here [1] and here [2] is
the thread where he proposes adding vectored writeback to address some
of these issues.
- Melanie
[1]: /messages/by-id/CA+hUKGKN3oy0bN_3yv8hd78a4+M1tJC9z7mD8+f+yA+GeoFUwQ@mail.gmail.com
[2]: /messages/by-id/CA+hUKGK1in4FiWtisXZ+Jo-cNSbWjmBcPww3w3DBM+whJTABXA@mail.gmail.com
I've also added GiST vacuum to the patchset.
On 24 Oct 2024, at 01:04, Melanie Plageman <melanieplageman@gmail.com> wrote:
On Wed, Oct 23, 2024 at 4:29 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
On 23 Oct 2024, at 20:57, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
I'll think about how to restructure the flow there...
OK, I've understood how it should be structured. PFA v5. Sorry for the noise.
I think this would be a bit nicer:
while (BufferIsValid(buf = read_stream_next_buffer(stream, NULL)))
{
block = btvacuumpage(&vstate, buf);
if (info->report_progress)
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE, block);
}
OK.
Maybe change btvacuumpage() to return the block number to avoid the
extra BufferGetBlockNumber() calls (those add up).
Makes sense... now we have a non-obvious function prototype, but I think it's worth it.
While looking at this, I started to wonder if it isn't incorrect that
we are not calling pgstat_progress_update_param() for the blocks that
we backtrack and read in btvacuumpage() too (on master as well).
btvacuumpage() may actually have scanned more than one block, so...
It's correct: the user wants to know progress, and backtracked blocks do not count towards progress.
Though it must be a relatively infrequent case.
Unrelated to code review, but btree index vacuum has the same issues
that kept us from committing streaming heap vacuum last release --
interactions between the buffer access strategy ring buffer size and
the larger reads -- one of which is an increase in the number of WAL
syncs and writes required. Thomas mentions it here [1] and here [2] is
the thread where he proposes adding vectored writeback to address some
of these issues.
Do I need to do anything about it?
Thank you!
Best regards, Andrey Borodin.
Attachments:
v6-0001-Prototype-B-tree-vacuum-streamlineing.patchapplication/octet-stream; name=v6-0001-Prototype-B-tree-vacuum-streamlineing.patch; x-unix-mode=0644Download
From 7719ab1ebdc3d56711505238eabe4f3f6af56af1 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sat, 19 Oct 2024 12:46:29 +0500
Subject: [PATCH v6 1/2] Prototype B-tree vacuum streamlineing
Signed-off-by: Andrey Borodin <x4mmm@yandex-team.ru>
Signed-off-by: Junwang Zhao <zhjwpku@gmail.com>
---
src/backend/access/nbtree/nbtree.c | 70 +++++++++++-----
src/test/modules/test_misc/meson.build | 1 +
.../modules/test_misc/t/007_vacuum_btree.pl | 81 +++++++++++++++++++
3 files changed, 131 insertions(+), 21 deletions(-)
create mode 100644 src/test/modules/test_misc/t/007_vacuum_btree.pl
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index f4f79f2..0ee2b2b 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -34,6 +34,7 @@
#include "storage/smgr.h"
#include "utils/fmgrprotos.h"
#include "utils/index_selfuncs.h"
+#include "utils/injection_point.h"
#include "utils/memutils.h"
@@ -88,7 +89,7 @@ typedef struct BTParallelScanDescData *BTParallelScanDesc;
static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
IndexBulkDeleteCallback callback, void *callback_state,
BTCycleId cycleid);
-static void btvacuumpage(BTVacState *vstate, BlockNumber scanblkno);
+static BlockNumber btvacuumpage(BTVacState *vstate, Buffer buf);
static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
IndexTuple posting,
OffsetNumber updatedoffset,
@@ -969,8 +970,9 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
Relation rel = info->index;
BTVacState vstate;
BlockNumber num_pages;
- BlockNumber scanblkno;
bool needLock;
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream = NULL;
/*
* Reset fields that track information about the entire index now. This
@@ -1039,9 +1041,17 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
*/
needLock = !RELATION_IS_LOCAL(rel);
- scanblkno = BTREE_METAPAGE + 1;
+ p.current_blocknum = BTREE_METAPAGE + 1;
+ stream = read_stream_begin_relation(READ_STREAM_FULL,
+ info->strategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
for (;;)
{
+ Buffer buf;
/* Get the current relation length */
if (needLock)
LockRelationForExtension(rel, ExclusiveLock);
@@ -1054,17 +1064,31 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
num_pages);
/* Quit if we've scanned the whole relation */
- if (scanblkno >= num_pages)
+ if (p.current_blocknum >= num_pages)
break;
- /* Iterate over pages, then loop back to recheck length */
- for (; scanblkno < num_pages; scanblkno++)
+
+ /* In 007_vacuum_btree test we need to coordinate two distinguishable points here */
+ INJECTION_POINT("nbtree-vacuum-1");
+ INJECTION_POINT("nbtree-vacuum-2");
+
+ p.last_exclusive = num_pages;
+
+ /* Iterate over pages, then loop back to recheck relation length */
+ while(BufferIsValid(buf = read_stream_next_buffer(stream, NULL)))
{
- btvacuumpage(&vstate, scanblkno);
+ BlockNumber current_block = btvacuumpage(&vstate, buf);
if (info->report_progress)
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
- scanblkno);
+ current_block);
}
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ /*
+ * After reaching the end we have to reset stream to use it again.
+ * Extra restart in case of just one iteration does not cost us much.
+ */
+ read_stream_reset(stream);
}
+ read_stream_end(stream);
/* Set statistics num_pages field to final size of index */
stats->num_pages = num_pages;
@@ -1094,9 +1118,11 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* after our cycleid was acquired) whose right half page happened to reuse
* a block that we might have processed at some point before it was
* recycled (i.e. before the page split).
+ *
+ * Returns BlockNumber of a scanned page (not backtracked).
*/
-static void
-btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
+static BlockNumber
+btvacuumpage(BTVacState *vstate, Buffer buf)
{
IndexVacuumInfo *info = vstate->info;
IndexBulkDeleteResult *stats = vstate->stats;
@@ -1107,7 +1133,7 @@ btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
bool attempt_pagedel;
BlockNumber blkno,
backtrack_to;
- Buffer buf;
+ BlockNumber scanblkno = BufferGetBlockNumber(buf);
Page page;
BTPageOpaque opaque;
@@ -1121,14 +1147,6 @@ backtrack:
/* call vacuum_delay_point while not holding any buffer lock */
vacuum_delay_point();
- /*
- * We can't use _bt_getbuf() here because it always applies
- * _bt_checkpage(), which will barf on an all-zero page. We want to
- * recycle all-zero pages, not fail. Also, we want to use a nondefault
- * buffer access strategy.
- */
- buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
- info->strategy);
_bt_lockbuf(rel, buf, BT_READ);
page = BufferGetPage(buf);
opaque = NULL;
@@ -1164,7 +1182,7 @@ backtrack:
errmsg_internal("right sibling %u of scanblkno %u unexpectedly in an inconsistent state in index \"%s\"",
blkno, scanblkno, RelationGetRelationName(rel))));
_bt_relbuf(rel, buf);
- return;
+ return scanblkno;
}
/*
@@ -1184,7 +1202,7 @@ backtrack:
{
/* Done with current scanblkno (and all lower split pages) */
_bt_relbuf(rel, buf);
- return;
+ return scanblkno;
}
}
@@ -1415,8 +1433,18 @@ backtrack:
if (backtrack_to != P_NONE)
{
blkno = backtrack_to;
+
+ /*
+ * We can't use _bt_getbuf() here because it always applies
+ * _bt_checkpage(), which will barf on an all-zero page. We want to
+ * recycle all-zero pages, not fail. Also, we want to use a nondefault
+ * buffer access strategy.
+ */
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+ info->strategy);
goto backtrack;
}
+ return scanblkno;
}
/*
diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index 283ffa7..a456a5e 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -15,6 +15,7 @@ tests += {
't/004_io_direct.pl',
't/005_timeouts.pl',
't/006_signal_autovacuum.pl',
+ 't/007_vacuum_btree.pl',
],
},
}
diff --git a/src/test/modules/test_misc/t/007_vacuum_btree.pl b/src/test/modules/test_misc/t/007_vacuum_btree.pl
new file mode 100644
index 0000000..e8df4c0
--- /dev/null
+++ b/src/test/modules/test_misc/t/007_vacuum_btree.pl
@@ -0,0 +1,81 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# This test verifies that B-tree vacuum can restart read stream.
+# To do so we need to insert some data during vacuum. So we wait in injection point
+# after first vacuum scan. During this wait we insert some data forcing page split.
+# this split will trigger relation extension and subsequent read_stream_reset().
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+# Initialize postgres
+my $psql_err = '';
+my $psql_out = '';
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+
+# This ensures autovacuum do not run
+$node->append_conf('postgresql.conf', 'autovacuum = off');
+$node->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+$node->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# From this point, vacuum worker will wait at startup.
+$node->safe_psql('postgres',
+ "SELECT injection_points_attach('nbtree-vacuum-2', 'wait');");
+
+my $psql_session = $node->background_psql('postgres');
+
+$psql_session->query_until(
+ qr/starting_bg_psql/,
+ q(\echo starting_bg_psql
+ create table a as select random() r from generate_series(1,100) x;
+ create index on a(r);
+ delete from a;
+ vacuum a;
+ ));
+
+#print $node->safe_psql('postgres','select * from pg_stat_activity');
+
+# Wait until the vacuum worker starts.
+$node->wait_for_event('client backend', 'nbtree-vacuum-2');
+
+$node->safe_psql('postgres',
+ "SELECT injection_points_attach('nbtree-vacuum-1', 'wait');");
+
+# Here's the key point of the test: during vacuum we cause some page splits.
+# This will force vacuum into doing another scan, thus resetting the read stream.
+$node->safe_psql('postgres',
+ "insert into a select x from generate_series(1,3000) x;");
+
+$node->safe_psql('postgres',
+ "SELECT injection_points_detach('nbtree-vacuum-2');");
+$node->safe_psql('postgres',
+ "SELECT injection_points_wakeup('nbtree-vacuum-2');");
+
+# Observe that second scan is reached.
+$node->wait_for_event('client backend', 'nbtree-vacuum-1');
+
+$node->safe_psql('postgres',
+ "SELECT injection_points_detach('nbtree-vacuum-1');");
+$node->safe_psql('postgres',
+ "SELECT injection_points_wakeup('nbtree-vacuum-1');");
+
+ok($psql_session->quit);
+
+done_testing();
--
2.39.5 (Apple Git-154)
v6-0002-Use-read_stream-in-GiST-vacuum.patch (application/octet-stream)
From 375904cfb828f5feb84dd800b581b9e006e4cc78 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Thu, 24 Oct 2024 10:00:49 +0300
Subject: [PATCH v6 2/2] Use read_stream in GiST vacuum
---
src/backend/access/gist/gistvacuum.c | 46 ++++++++++++++++++++--------
1 file changed, 33 insertions(+), 13 deletions(-)
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 24fb94f..a86122a 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -22,6 +22,7 @@
#include "miscadmin.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
+#include "storage/read_stream.h"
#include "utils/memutils.h"
/* Working state needed by gistbulkdelete */
@@ -44,8 +45,7 @@ typedef struct
static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
IndexBulkDeleteCallback callback, void *callback_state);
-static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
- BlockNumber orig_blkno);
+static void gistvacuumpage(GistVacState *vstate, Buffer buffer);
static void gistvacuum_delete_empty_pages(IndexVacuumInfo *info,
GistVacState *vstate);
static bool gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
@@ -129,8 +129,9 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
GistVacState vstate;
BlockNumber num_pages;
bool needLock;
- BlockNumber blkno;
MemoryContext oldctx;
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream = NULL;
/*
* Reset fields that track information about the entire index now. This
@@ -208,9 +209,17 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
*/
needLock = !RELATION_IS_LOCAL(rel);
- blkno = GIST_ROOT_BLKNO;
+ p.current_blocknum = GIST_ROOT_BLKNO;
+ stream = read_stream_begin_relation(READ_STREAM_FULL,
+ info->strategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
for (;;)
{
+ Buffer buf;
/* Get the current relation length */
if (needLock)
LockRelationForExtension(rel, ExclusiveLock);
@@ -219,12 +228,23 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
UnlockRelationForExtension(rel, ExclusiveLock);
/* Quit if we've scanned the whole relation */
- if (blkno >= num_pages)
+ if (p.current_blocknum >= num_pages)
break;
- /* Iterate over pages, then loop back to recheck length */
- for (; blkno < num_pages; blkno++)
- gistvacuumpage(&vstate, blkno, blkno);
+ p.last_exclusive = num_pages;
+
+ /* Iterate over pages, then loop back to recheck relation length */
+ while (BufferIsValid(buf = read_stream_next_buffer(stream, NULL)))
+ {
+ gistvacuumpage(&vstate, buf);
+ }
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ /*
+ * After reaching the end we have to reset the stream to use it again.
+ * An extra reset in the case of just one iteration does not cost us much.
+ */
+ read_stream_reset(stream);
}
+ read_stream_end(stream);
/*
* If we found any recyclable pages (and recorded them in the FSM), then
@@ -269,15 +289,16 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* are recursing to re-examine a previous page).
*/
static void
-gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
+gistvacuumpage(GistVacState *vstate, Buffer buffer)
{
IndexVacuumInfo *info = vstate->info;
IndexBulkDeleteCallback callback = vstate->callback;
void *callback_state = vstate->callback_state;
Relation rel = info->index;
- Buffer buffer;
+ BlockNumber orig_blkno = BufferGetBlockNumber(buffer);
Page page;
BlockNumber recurse_to;
+ BlockNumber blkno = orig_blkno;
restart:
recurse_to = InvalidBlockNumber;
@@ -285,9 +306,6 @@ restart:
/* call vacuum_delay_point while not holding any buffer lock */
vacuum_delay_point();
- buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
- info->strategy);
-
/*
* We are not going to stay here for a long time, aggressively grab an
* exclusive lock.
@@ -450,6 +468,8 @@ restart:
if (recurse_to != InvalidBlockNumber)
{
blkno = recurse_to;
+ buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+ info->strategy);
goto restart;
}
}
--
2.39.5 (Apple Git-154)
On 24 Oct 2024, at 10:15, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
I've also added GiST vacuum to the patchset.
I decided to add SP-GiST support while at pgconf.eu.
Best regards, Andrey Borodin.
Attachments:
v7-0001-Prototype-B-tree-vacuum-streamlineing.patch (application/octet-stream)
From 7719ab1ebdc3d56711505238eabe4f3f6af56af1 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sat, 19 Oct 2024 12:46:29 +0500
Subject: [PATCH v7 1/3] Prototype B-tree vacuum streamlineing
Signed-off-by: Andrey Borodin <x4mmm@yandex-team.ru>
Signed-off-by: Junwang Zhao <zhjwpku@gmail.com>
---
src/backend/access/nbtree/nbtree.c | 70 +++++++++++-----
src/test/modules/test_misc/meson.build | 1 +
.../modules/test_misc/t/007_vacuum_btree.pl | 81 +++++++++++++++++++
3 files changed, 131 insertions(+), 21 deletions(-)
create mode 100644 src/test/modules/test_misc/t/007_vacuum_btree.pl
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index f4f79f2..0ee2b2b 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -34,6 +34,7 @@
#include "storage/smgr.h"
#include "utils/fmgrprotos.h"
#include "utils/index_selfuncs.h"
+#include "utils/injection_point.h"
#include "utils/memutils.h"
@@ -88,7 +89,7 @@ typedef struct BTParallelScanDescData *BTParallelScanDesc;
static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
IndexBulkDeleteCallback callback, void *callback_state,
BTCycleId cycleid);
-static void btvacuumpage(BTVacState *vstate, BlockNumber scanblkno);
+static BlockNumber btvacuumpage(BTVacState *vstate, Buffer buf);
static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
IndexTuple posting,
OffsetNumber updatedoffset,
@@ -969,8 +970,9 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
Relation rel = info->index;
BTVacState vstate;
BlockNumber num_pages;
- BlockNumber scanblkno;
bool needLock;
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream = NULL;
/*
* Reset fields that track information about the entire index now. This
@@ -1039,9 +1041,17 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
*/
needLock = !RELATION_IS_LOCAL(rel);
- scanblkno = BTREE_METAPAGE + 1;
+ p.current_blocknum = BTREE_METAPAGE + 1;
+ stream = read_stream_begin_relation(READ_STREAM_FULL,
+ info->strategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
for (;;)
{
+ Buffer buf;
/* Get the current relation length */
if (needLock)
LockRelationForExtension(rel, ExclusiveLock);
@@ -1054,17 +1064,31 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
num_pages);
/* Quit if we've scanned the whole relation */
- if (scanblkno >= num_pages)
+ if (p.current_blocknum >= num_pages)
break;
- /* Iterate over pages, then loop back to recheck length */
- for (; scanblkno < num_pages; scanblkno++)
+
+ /* In the 007_vacuum_btree test we need to coordinate two distinct wait points here */
+ INJECTION_POINT("nbtree-vacuum-1");
+ INJECTION_POINT("nbtree-vacuum-2");
+
+ p.last_exclusive = num_pages;
+
+ /* Iterate over pages, then loop back to recheck relation length */
+ while (BufferIsValid(buf = read_stream_next_buffer(stream, NULL)))
{
- btvacuumpage(&vstate, scanblkno);
+ BlockNumber current_block = btvacuumpage(&vstate, buf);
if (info->report_progress)
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
- scanblkno);
+ current_block);
}
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ /*
+ * After reaching the end we have to reset the stream to use it again.
+ * An extra reset in the case of just one iteration does not cost us much.
+ */
+ read_stream_reset(stream);
}
+ read_stream_end(stream);
/* Set statistics num_pages field to final size of index */
stats->num_pages = num_pages;
@@ -1094,9 +1118,11 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* after our cycleid was acquired) whose right half page happened to reuse
* a block that we might have processed at some point before it was
* recycled (i.e. before the page split).
+ *
+ * Returns BlockNumber of a scanned page (not backtracked).
*/
-static void
-btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
+static BlockNumber
+btvacuumpage(BTVacState *vstate, Buffer buf)
{
IndexVacuumInfo *info = vstate->info;
IndexBulkDeleteResult *stats = vstate->stats;
@@ -1107,7 +1133,7 @@ btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
bool attempt_pagedel;
BlockNumber blkno,
backtrack_to;
- Buffer buf;
+ BlockNumber scanblkno = BufferGetBlockNumber(buf);
Page page;
BTPageOpaque opaque;
@@ -1121,14 +1147,6 @@ backtrack:
/* call vacuum_delay_point while not holding any buffer lock */
vacuum_delay_point();
- /*
- * We can't use _bt_getbuf() here because it always applies
- * _bt_checkpage(), which will barf on an all-zero page. We want to
- * recycle all-zero pages, not fail. Also, we want to use a nondefault
- * buffer access strategy.
- */
- buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
- info->strategy);
_bt_lockbuf(rel, buf, BT_READ);
page = BufferGetPage(buf);
opaque = NULL;
@@ -1164,7 +1182,7 @@ backtrack:
errmsg_internal("right sibling %u of scanblkno %u unexpectedly in an inconsistent state in index \"%s\"",
blkno, scanblkno, RelationGetRelationName(rel))));
_bt_relbuf(rel, buf);
- return;
+ return scanblkno;
}
/*
@@ -1184,7 +1202,7 @@ backtrack:
{
/* Done with current scanblkno (and all lower split pages) */
_bt_relbuf(rel, buf);
- return;
+ return scanblkno;
}
}
@@ -1415,8 +1433,18 @@ backtrack:
if (backtrack_to != P_NONE)
{
blkno = backtrack_to;
+
+ /*
+ * We can't use _bt_getbuf() here because it always applies
+ * _bt_checkpage(), which will barf on an all-zero page. We want to
+ * recycle all-zero pages, not fail. Also, we want to use a nondefault
+ * buffer access strategy.
+ */
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+ info->strategy);
goto backtrack;
}
+ return scanblkno;
}
/*
diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index 283ffa7..a456a5e 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -15,6 +15,7 @@ tests += {
't/004_io_direct.pl',
't/005_timeouts.pl',
't/006_signal_autovacuum.pl',
+ 't/007_vacuum_btree.pl',
],
},
}
diff --git a/src/test/modules/test_misc/t/007_vacuum_btree.pl b/src/test/modules/test_misc/t/007_vacuum_btree.pl
new file mode 100644
index 0000000..e8df4c0
--- /dev/null
+++ b/src/test/modules/test_misc/t/007_vacuum_btree.pl
@@ -0,0 +1,81 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# This test verifies that B-tree vacuum can restart its read stream.
+# To do so we need to insert some data during vacuum, so we wait at an injection
+# point after the first vacuum scan. During this wait we insert some data,
+# forcing a page split. This triggers relation extension and a read_stream_reset().
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+# Initialize postgres
+my $psql_err = '';
+my $psql_out = '';
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+
+# This ensures autovacuum does not run
+$node->append_conf('postgresql.conf', 'autovacuum = off');
+$node->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+$node->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# From this point, vacuum worker will wait at startup.
+$node->safe_psql('postgres',
+ "SELECT injection_points_attach('nbtree-vacuum-2', 'wait');");
+
+my $psql_session = $node->background_psql('postgres');
+
+$psql_session->query_until(
+ qr/starting_bg_psql/,
+ q(\echo starting_bg_psql
+ create table a as select random() r from generate_series(1,100) x;
+ create index on a(r);
+ delete from a;
+ vacuum a;
+ ));
+
+#print $node->safe_psql('postgres','select * from pg_stat_activity');
+
+# Wait until the vacuum worker starts.
+$node->wait_for_event('client backend', 'nbtree-vacuum-2');
+
+$node->safe_psql('postgres',
+ "SELECT injection_points_attach('nbtree-vacuum-1', 'wait');");
+
+# Here's the key point of the test: during vacuum we cause some page splits.
+# This will force vacuum into doing another scan, thus resetting the read stream.
+$node->safe_psql('postgres',
+ "insert into a select x from generate_series(1,3000) x;");
+
+$node->safe_psql('postgres',
+ "SELECT injection_points_detach('nbtree-vacuum-2');");
+$node->safe_psql('postgres',
+ "SELECT injection_points_wakeup('nbtree-vacuum-2');");
+
+# Observe that second scan is reached.
+$node->wait_for_event('client backend', 'nbtree-vacuum-1');
+
+$node->safe_psql('postgres',
+ "SELECT injection_points_detach('nbtree-vacuum-1');");
+$node->safe_psql('postgres',
+ "SELECT injection_points_wakeup('nbtree-vacuum-1');");
+
+ok($psql_session->quit);
+
+done_testing();
--
2.39.5 (Apple Git-154)
v7-0002-Use-read_stream-in-GiST-vacuum.patch (application/octet-stream)
From 375904cfb828f5feb84dd800b581b9e006e4cc78 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Thu, 24 Oct 2024 10:00:49 +0300
Subject: [PATCH v7 2/3] Use read_stream in GiST vacuum
---
src/backend/access/gist/gistvacuum.c | 46 ++++++++++++++++++++--------
1 file changed, 33 insertions(+), 13 deletions(-)
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 24fb94f..a86122a 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -22,6 +22,7 @@
#include "miscadmin.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
+#include "storage/read_stream.h"
#include "utils/memutils.h"
/* Working state needed by gistbulkdelete */
@@ -44,8 +45,7 @@ typedef struct
static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
IndexBulkDeleteCallback callback, void *callback_state);
-static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
- BlockNumber orig_blkno);
+static void gistvacuumpage(GistVacState *vstate, Buffer buffer);
static void gistvacuum_delete_empty_pages(IndexVacuumInfo *info,
GistVacState *vstate);
static bool gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
@@ -129,8 +129,9 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
GistVacState vstate;
BlockNumber num_pages;
bool needLock;
- BlockNumber blkno;
MemoryContext oldctx;
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream = NULL;
/*
* Reset fields that track information about the entire index now. This
@@ -208,9 +209,17 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
*/
needLock = !RELATION_IS_LOCAL(rel);
- blkno = GIST_ROOT_BLKNO;
+ p.current_blocknum = GIST_ROOT_BLKNO;
+ stream = read_stream_begin_relation(READ_STREAM_FULL,
+ info->strategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
for (;;)
{
+ Buffer buf;
/* Get the current relation length */
if (needLock)
LockRelationForExtension(rel, ExclusiveLock);
@@ -219,12 +228,23 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
UnlockRelationForExtension(rel, ExclusiveLock);
/* Quit if we've scanned the whole relation */
- if (blkno >= num_pages)
+ if (p.current_blocknum >= num_pages)
break;
- /* Iterate over pages, then loop back to recheck length */
- for (; blkno < num_pages; blkno++)
- gistvacuumpage(&vstate, blkno, blkno);
+ p.last_exclusive = num_pages;
+
+ /* Iterate over pages, then loop back to recheck relation length */
+ while (BufferIsValid(buf = read_stream_next_buffer(stream, NULL)))
+ {
+ gistvacuumpage(&vstate, buf);
+ }
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ /*
+ * After reaching the end we have to reset the stream to use it again.
+ * An extra reset in the case of just one iteration does not cost us much.
+ */
+ read_stream_reset(stream);
}
+ read_stream_end(stream);
/*
* If we found any recyclable pages (and recorded them in the FSM), then
@@ -269,15 +289,16 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* are recursing to re-examine a previous page).
*/
static void
-gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
+gistvacuumpage(GistVacState *vstate, Buffer buffer)
{
IndexVacuumInfo *info = vstate->info;
IndexBulkDeleteCallback callback = vstate->callback;
void *callback_state = vstate->callback_state;
Relation rel = info->index;
- Buffer buffer;
+ BlockNumber orig_blkno = BufferGetBlockNumber(buffer);
Page page;
BlockNumber recurse_to;
+ BlockNumber blkno = orig_blkno;
restart:
recurse_to = InvalidBlockNumber;
@@ -285,9 +306,6 @@ restart:
/* call vacuum_delay_point while not holding any buffer lock */
vacuum_delay_point();
- buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
- info->strategy);
-
/*
* We are not going to stay here for a long time, aggressively grab an
* exclusive lock.
@@ -450,6 +468,8 @@ restart:
if (recurse_to != InvalidBlockNumber)
{
blkno = recurse_to;
+ buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+ info->strategy);
goto restart;
}
}
--
2.39.5 (Apple Git-154)
v7-0003-Use-read_stream-in-SP-GiST-vacuum.patch (application/octet-stream)
From 7a5a52d298670ef8a88b455c0f05e25069510d43 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Thu, 24 Oct 2024 13:53:39 +0300
Subject: [PATCH v7 3/3] Use read_stream in SP-GiST vacuum
---
src/backend/access/spgist/spgvacuum.c | 38 ++++++++++++++++++++-------
1 file changed, 28 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 0da069f..7ae163c 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -25,6 +25,7 @@
#include "storage/bufmgr.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
+#include "storage/read_stream.h"
#include "utils/snapmgr.h"
@@ -618,17 +619,15 @@ vacuumRedirectAndPlaceholder(Relation index, Relation heaprel, Buffer buffer)
* Process one page during a bulkdelete scan
*/
static void
-spgvacuumpage(spgBulkDeleteState *bds, BlockNumber blkno)
+spgvacuumpage(spgBulkDeleteState *bds, Buffer buffer)
{
Relation index = bds->info->index;
- Buffer buffer;
+ BlockNumber blkno = BufferGetBlockNumber(buffer);
Page page;
/* call vacuum_delay_point while not holding any buffer lock */
vacuum_delay_point();
- buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
- RBM_NORMAL, bds->info->strategy);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
page = (Page) BufferGetPage(buffer);
@@ -805,8 +804,10 @@ spgvacuumscan(spgBulkDeleteState *bds)
{
Relation index = bds->info->index;
bool needLock;
- BlockNumber num_pages,
- blkno;
+ BlockNumber num_pages;
+ Buffer buf;
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream = NULL;
/* Finish setting up spgBulkDeleteState */
initSpGistState(&bds->spgstate, index);
@@ -833,7 +834,14 @@ spgvacuumscan(spgBulkDeleteState *bds)
* delete some deletable tuples. See more extensive comments about this
* in btvacuumscan().
*/
- blkno = SPGIST_METAPAGE_BLKNO + 1;
+ p.current_blocknum = SPGIST_METAPAGE_BLKNO + 1;
+ stream = read_stream_begin_relation(READ_STREAM_FULL,
+ bds->info->strategy,
+ index,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
for (;;)
{
/* Get the current relation length */
@@ -844,17 +852,27 @@ spgvacuumscan(spgBulkDeleteState *bds)
UnlockRelationForExtension(index, ExclusiveLock);
/* Quit if we've scanned the whole relation */
- if (blkno >= num_pages)
+ if (p.current_blocknum >= num_pages)
break;
+
+ p.last_exclusive = num_pages;
+
/* Iterate over pages, then loop back to recheck length */
- for (; blkno < num_pages; blkno++)
+ while (BufferIsValid(buf = read_stream_next_buffer(stream, NULL)))
{
- spgvacuumpage(bds, blkno);
+ spgvacuumpage(bds, buf);
/* empty the pending-list after each page */
if (bds->pendingList != NULL)
spgprocesspending(bds);
}
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ /*
+ * After reaching the end we have to reset the stream to use it again.
+ * An extra reset in the case of just one iteration does not cost us much.
+ */
+ read_stream_reset(stream);
}
+ read_stream_end(stream);
/* Propagate local lastUsedPages cache to metablock */
SpGistUpdateMetaPage(index);
--
2.39.5 (Apple Git-154)
Hi Andrey,
I ran the following test with
v7-0001-Prototype-B-tree-vacuum-streamlineing.patch
to measure the performance improvement.
--Table size of approx 2GB (Fits in RAM)
postgres=# create unlogged table x_big as select i from
generate_series(1,6e7) i;
SELECT 60000000
postgres=# create index on x_big(i);
CREATE INDEX
-- Perform updates to create dead tuples.
postgres=# do $$
declare
var int := 0;
begin
for counter in 1 .. 1e7 loop
var := (SELECT floor(random() * (1e7 - 1 + 1) * 1));
UPDATE x_big SET i = i + 5 WHERE i = var;
end loop;
end;
$$;
postgres=# CREATE EXTENSION pg_buffercache;
CREATE EXTENSION
-- Evict Postgres buffer cache for this relation.
postgres=# SELECT DISTINCT pg_buffercache_evict(bufferid)
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('x_big');
pg_buffercache_evict
----------------------
t
(1 row)
postgres=# \timing on
Timing is on.
postgres=# vacuum x_big;
VACUUM
The timing does not seem to have improved with the patch.
Timing with the patch: Time: 9525.696 ms (00:09.526)
Timing without the patch: Time: 9468.739 ms (00:09.469)
While writing this email, I realized I evicted buffers for the table
and not the index. I will perform the test again. However,
I would like to know your opinion on whether this looks like
a valid test.
Thank you,
Rahila Syed
On Thu, Oct 24, 2024 at 4:45 PM Andrey M. Borodin <x4mmm@yandex-team.ru>
wrote:
On 24 Oct 2024, at 10:15, Andrey M. Borodin <x4mmm@yandex-team.ru>
wrote:
I've also added GiST vacuum to the patchset.
I decided to add SP-GiST support while at pgconf.eu.
Best regards, Andrey Borodin.
On 25 Oct 2024, at 00:55, Rahila Syed <rahilasyed90@gmail.com> wrote:
While writing this email, I realized I evicted buffers for the table
and not the index. I will perform the test again. However,
I would like to know your opinion on whether this looks like
a valid test.
Well, yes, kind of: you need to drop the index's pages from cache, too. And, perhaps, you can have more indexes. You can also disable autovacuum and just restart postgres instead of iterating through the buffer cache.
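A sketch of the suggested eviction, this time targeting the index rather than the table; the index name below is the one Postgres would generate by default for `create index on x_big(i)`, so verify it with `\d x_big` first:

```sql
-- Evict the *index* pages from shared buffers before timing the vacuum.
-- x_big_i_idx is the default generated index name; adjust if yours differs.
SELECT DISTINCT pg_buffercache_evict(bufferid)
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('x_big_i_idx');
```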
I've asked Thomas about the performance implications, and he told me that converting code to the streaming API is not expected to improve performance on its own. It's needed to keep decent performance once direct I/O is involved.
Thanks!
Best regards, Andrey Borodin.
Hi!
On Fri, 25 Oct 2024 at 17:01, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
On 25 Oct 2024, at 00:55, Rahila Syed <rahilasyed90@gmail.com> wrote:
While writing this email, I realized I evicted buffers for the table
and not the index. I will perform the test again. However,
I would like to know your opinion on whether this looks like
a valid test.
Well, yes, kind of: you need to drop the index's pages from cache, too. And, perhaps, you can have more indexes. You can also disable autovacuum and just restart postgres instead of iterating through the buffer cache.
I've asked Thomas about the performance implications, and he told me that converting code to the streaming API is not expected to improve performance on its own. It's needed to keep decent performance once direct I/O is involved.
Thanks!
Best regards, Andrey Borodin.
I noticed a CI failure for this patch. This does not look like a flake.
[0]: https://cirrus-ci.com/task/4527545259917312
[1]: https://api.cirrus-ci.com/v1/artifact/task/4527545259917312/log/src/test/modules/test_misc/tmp_check/log/regress_log_007_vacuum_btree
--
Best regards,
Kirill Reshke
On 2 Nov 2024, at 02:36, Kirill Reshke <reshkekirill@gmail.com> wrote:
I noticed a CI failure for this patch. This does not look like a flake.
Seems like vacuum did not start index cleanup. I’ve added "index_cleanup on".
Thanks!
Best regards, Andrey Borodin.
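Presumably the test's VACUUM now reads something like the sketch below (the actual patch may spell it differently):

```sql
-- Force index cleanup even when there are few dead tuples, so the
-- B-tree vacuum scan (and thus the read stream path) always runs.
VACUUM (INDEX_CLEANUP ON) a;
```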
Attachments:
v8-0001-Prototype-B-tree-vacuum-streamlineing.patch (application/octet-stream)
From f34aa92fa93cf492828d5931958e6f8075acfcbb Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sat, 19 Oct 2024 12:46:29 +0500
Subject: [PATCH v8 1/3] Prototype B-tree vacuum streamlineing
Signed-off-by: Andrey Borodin <x4mmm@yandex-team.ru>
Signed-off-by: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
---
src/backend/access/nbtree/nbtree.c | 70 +++++++++++-----
src/test/modules/test_misc/meson.build | 1 +
.../modules/test_misc/t/007_vacuum_btree.pl | 79 +++++++++++++++++++
3 files changed, 129 insertions(+), 21 deletions(-)
create mode 100644 src/test/modules/test_misc/t/007_vacuum_btree.pl
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index dd76fe1da90..0849ad6d819 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -31,6 +31,7 @@
#include "storage/lmgr.h"
#include "utils/fmgrprotos.h"
#include "utils/index_selfuncs.h"
+#include "utils/injection_point.h"
#include "utils/memutils.h"
@@ -85,7 +86,7 @@ typedef struct BTParallelScanDescData *BTParallelScanDesc;
static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
IndexBulkDeleteCallback callback, void *callback_state,
BTCycleId cycleid);
-static void btvacuumpage(BTVacState *vstate, BlockNumber scanblkno);
+static BlockNumber btvacuumpage(BTVacState *vstate, Buffer buf);
static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
IndexTuple posting,
OffsetNumber updatedoffset,
@@ -984,8 +985,9 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
Relation rel = info->index;
BTVacState vstate;
BlockNumber num_pages;
- BlockNumber scanblkno;
bool needLock;
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream = NULL;
/*
* Reset fields that track information about the entire index now. This
@@ -1054,9 +1056,17 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
*/
needLock = !RELATION_IS_LOCAL(rel);
- scanblkno = BTREE_METAPAGE + 1;
+ p.current_blocknum = BTREE_METAPAGE + 1;
+ stream = read_stream_begin_relation(READ_STREAM_FULL,
+ info->strategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
for (;;)
{
+ Buffer buf;
/* Get the current relation length */
if (needLock)
LockRelationForExtension(rel, ExclusiveLock);
@@ -1069,17 +1079,31 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
num_pages);
/* Quit if we've scanned the whole relation */
- if (scanblkno >= num_pages)
+ if (p.current_blocknum >= num_pages)
break;
- /* Iterate over pages, then loop back to recheck length */
- for (; scanblkno < num_pages; scanblkno++)
+
+ /* In the 007_vacuum_btree test we need to coordinate two distinct wait points here */
+ INJECTION_POINT("nbtree-vacuum-1");
+ INJECTION_POINT("nbtree-vacuum-2");
+
+ p.last_exclusive = num_pages;
+
+ /* Iterate over pages, then loop back to recheck relation length */
+ while (BufferIsValid(buf = read_stream_next_buffer(stream, NULL)))
{
- btvacuumpage(&vstate, scanblkno);
+ BlockNumber current_block = btvacuumpage(&vstate, buf);
if (info->report_progress)
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
- scanblkno);
+ current_block);
}
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ /*
+ * After reaching the end we have to reset the stream to use it again.
+ * An extra reset in the case of just one iteration does not cost us much.
+ */
+ read_stream_reset(stream);
}
+ read_stream_end(stream);
/* Set statistics num_pages field to final size of index */
stats->num_pages = num_pages;
@@ -1109,9 +1133,11 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* after our cycleid was acquired) whose right half page happened to reuse
* a block that we might have processed at some point before it was
* recycled (i.e. before the page split).
+ *
+ * Returns BlockNumber of a scanned page (not backtracked).
*/
-static void
-btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
+static BlockNumber
+btvacuumpage(BTVacState *vstate, Buffer buf)
{
IndexVacuumInfo *info = vstate->info;
IndexBulkDeleteResult *stats = vstate->stats;
@@ -1122,7 +1148,7 @@ btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
bool attempt_pagedel;
BlockNumber blkno,
backtrack_to;
- Buffer buf;
+ BlockNumber scanblkno = BufferGetBlockNumber(buf);
Page page;
BTPageOpaque opaque;
@@ -1136,14 +1162,6 @@ backtrack:
/* call vacuum_delay_point while not holding any buffer lock */
vacuum_delay_point();
- /*
- * We can't use _bt_getbuf() here because it always applies
- * _bt_checkpage(), which will barf on an all-zero page. We want to
- * recycle all-zero pages, not fail. Also, we want to use a nondefault
- * buffer access strategy.
- */
- buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
- info->strategy);
_bt_lockbuf(rel, buf, BT_READ);
page = BufferGetPage(buf);
opaque = NULL;
@@ -1179,7 +1197,7 @@ backtrack:
errmsg_internal("right sibling %u of scanblkno %u unexpectedly in an inconsistent state in index \"%s\"",
blkno, scanblkno, RelationGetRelationName(rel))));
_bt_relbuf(rel, buf);
- return;
+ return scanblkno;
}
/*
@@ -1199,7 +1217,7 @@ backtrack:
{
/* Done with current scanblkno (and all lower split pages) */
_bt_relbuf(rel, buf);
- return;
+ return scanblkno;
}
}
@@ -1430,8 +1448,18 @@ backtrack:
if (backtrack_to != P_NONE)
{
blkno = backtrack_to;
+
+ /*
+ * We can't use _bt_getbuf() here because it always applies
+ * _bt_checkpage(), which will barf on an all-zero page. We want to
+ * recycle all-zero pages, not fail. Also, we want to use a nondefault
+ * buffer access strategy.
+ */
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+ info->strategy);
goto backtrack;
}
+ return scanblkno;
}
/*
diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index 283ffa751aa..a456a5e77f8 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -15,6 +15,7 @@ tests += {
't/004_io_direct.pl',
't/005_timeouts.pl',
't/006_signal_autovacuum.pl',
+ 't/007_vacuum_btree.pl',
],
},
}
diff --git a/src/test/modules/test_misc/t/007_vacuum_btree.pl b/src/test/modules/test_misc/t/007_vacuum_btree.pl
new file mode 100644
index 00000000000..2c69d2b477d
--- /dev/null
+++ b/src/test/modules/test_misc/t/007_vacuum_btree.pl
@@ -0,0 +1,79 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# This test verifies that B-tree vacuum can restart its read stream.
+# To do so we need to insert some data during vacuum, so we wait at an injection point
+# after the first vacuum scan. During this wait we insert some data, forcing a page split.
+# This split will trigger relation extension and a subsequent read_stream_reset().
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+# Initialize postgres
+my $psql_err = '';
+my $psql_out = '';
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+
+# This ensures autovacuum does not run
+$node->append_conf('postgresql.conf', 'autovacuum = off');
+$node->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+$node->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# From this point, vacuum worker will wait at startup.
+$node->safe_psql('postgres',
+ "SELECT injection_points_attach('nbtree-vacuum-2', 'wait');");
+
+my $psql_session = $node->background_psql('postgres');
+
+$psql_session->query_until(
+ qr/starting_bg_psql/,
+ q(\echo starting_bg_psql
+ create table a as select random() r from generate_series(1,100) x;
+ create index on a(r);
+ delete from a;
+ vacuum (index_cleanup on) a;
+ ));
+
+# Wait until a vacuum worker starts.
+$node->wait_for_event('client backend', 'nbtree-vacuum-2');
+
+$node->safe_psql('postgres',
+ "SELECT injection_points_attach('nbtree-vacuum-1', 'wait');");
+
+# Here's the key point of the test: during vacuum we add some page splits.
+# This will force vacuum into doing another scan, thus resetting the read stream.
+$node->safe_psql('postgres',
+ "insert into a select x from generate_series(1,3000) x;");
+
+$node->safe_psql('postgres',
+ "SELECT injection_points_detach('nbtree-vacuum-2');");
+$node->safe_psql('postgres',
+ "SELECT injection_points_wakeup('nbtree-vacuum-2');");
+
+# Observe that second scan is reached.
+$node->wait_for_event('client backend', 'nbtree-vacuum-1');
+
+$node->safe_psql('postgres',
+ "SELECT injection_points_detach('nbtree-vacuum-1');");
+$node->safe_psql('postgres',
+ "SELECT injection_points_wakeup('nbtree-vacuum-1');");
+
+ok($psql_session->quit);
+
+done_testing();
--
2.42.0
v8-0003-Use-read_stream-in-SP-GiST-vacuum.patchapplication/octet-stream; name=v8-0003-Use-read_stream-in-SP-GiST-vacuum.patch; x-unix-mode=0644Download
From 7b0e4b5d6b91b14c032cd5ec81fb4cc4ddf07894 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Thu, 24 Oct 2024 13:53:39 +0300
Subject: [PATCH v8 3/3] Use read_stream in SP-GiST vacuum
---
src/backend/access/spgist/spgvacuum.c | 38 ++++++++++++++++++++-------
1 file changed, 28 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 0da069fd4d7..7ae163c92dc 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -25,6 +25,7 @@
#include "storage/bufmgr.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
+#include "storage/read_stream.h"
#include "utils/snapmgr.h"
@@ -618,17 +619,15 @@ vacuumRedirectAndPlaceholder(Relation index, Relation heaprel, Buffer buffer)
* Process one page during a bulkdelete scan
*/
static void
-spgvacuumpage(spgBulkDeleteState *bds, BlockNumber blkno)
+spgvacuumpage(spgBulkDeleteState *bds, Buffer buffer)
{
Relation index = bds->info->index;
- Buffer buffer;
+ BlockNumber blkno = BufferGetBlockNumber(buffer);
Page page;
/* call vacuum_delay_point while not holding any buffer lock */
vacuum_delay_point();
- buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
- RBM_NORMAL, bds->info->strategy);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
page = (Page) BufferGetPage(buffer);
@@ -805,8 +804,10 @@ spgvacuumscan(spgBulkDeleteState *bds)
{
Relation index = bds->info->index;
bool needLock;
- BlockNumber num_pages,
- blkno;
+ BlockNumber num_pages;
+ Buffer buf;
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream = NULL;
/* Finish setting up spgBulkDeleteState */
initSpGistState(&bds->spgstate, index);
@@ -833,7 +834,14 @@ spgvacuumscan(spgBulkDeleteState *bds)
* delete some deletable tuples. See more extensive comments about this
* in btvacuumscan().
*/
- blkno = SPGIST_METAPAGE_BLKNO + 1;
+ p.current_blocknum = SPGIST_METAPAGE_BLKNO + 1;
+ stream = read_stream_begin_relation(READ_STREAM_FULL,
+ bds->info->strategy,
+ index,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
for (;;)
{
/* Get the current relation length */
@@ -844,17 +852,27 @@ spgvacuumscan(spgBulkDeleteState *bds)
UnlockRelationForExtension(index, ExclusiveLock);
/* Quit if we've scanned the whole relation */
- if (blkno >= num_pages)
+ if (p.current_blocknum >= num_pages)
break;
+
+ p.last_exclusive = num_pages;
+
/* Iterate over pages, then loop back to recheck length */
- for (; blkno < num_pages; blkno++)
+ while (BufferIsValid(buf = read_stream_next_buffer(stream, NULL)))
{
- spgvacuumpage(bds, blkno);
+ spgvacuumpage(bds, buf);
/* empty the pending-list after each page */
if (bds->pendingList != NULL)
spgprocesspending(bds);
}
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ /*
+ * After reaching the end we have to reset the stream to use it again.
+ * An extra reset in the case of just one iteration does not cost us much.
+ */
+ read_stream_reset(stream);
}
+ read_stream_end(stream);
/* Propagate local lastUsedPages cache to metablock */
SpGistUpdateMetaPage(index);
--
2.42.0
v8-0002-Use-read_stream-in-GiST-vacuum.patchapplication/octet-stream; name=v8-0002-Use-read_stream-in-GiST-vacuum.patch; x-unix-mode=0644Download
From ecf95f53bacbfa55c48c4e66d6e275c49a0555bb Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Thu, 24 Oct 2024 10:00:49 +0300
Subject: [PATCH v8 2/3] Use read_stream in GiST vacuum
---
src/backend/access/gist/gistvacuum.c | 46 ++++++++++++++++++++--------
1 file changed, 33 insertions(+), 13 deletions(-)
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 24fb94f473e..a86122a53d1 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -22,6 +22,7 @@
#include "miscadmin.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
+#include "storage/read_stream.h"
#include "utils/memutils.h"
/* Working state needed by gistbulkdelete */
@@ -44,8 +45,7 @@ typedef struct
static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
IndexBulkDeleteCallback callback, void *callback_state);
-static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
- BlockNumber orig_blkno);
+static void gistvacuumpage(GistVacState *vstate, Buffer buffer);
static void gistvacuum_delete_empty_pages(IndexVacuumInfo *info,
GistVacState *vstate);
static bool gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
@@ -129,8 +129,9 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
GistVacState vstate;
BlockNumber num_pages;
bool needLock;
- BlockNumber blkno;
MemoryContext oldctx;
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream = NULL;
/*
* Reset fields that track information about the entire index now. This
@@ -208,9 +209,17 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
*/
needLock = !RELATION_IS_LOCAL(rel);
- blkno = GIST_ROOT_BLKNO;
+ p.current_blocknum = GIST_ROOT_BLKNO;
+ stream = read_stream_begin_relation(READ_STREAM_FULL,
+ info->strategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
for (;;)
{
+ Buffer buf;
/* Get the current relation length */
if (needLock)
LockRelationForExtension(rel, ExclusiveLock);
@@ -219,12 +228,23 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
UnlockRelationForExtension(rel, ExclusiveLock);
/* Quit if we've scanned the whole relation */
- if (blkno >= num_pages)
+ if (p.current_blocknum >= num_pages)
break;
- /* Iterate over pages, then loop back to recheck length */
- for (; blkno < num_pages; blkno++)
- gistvacuumpage(&vstate, blkno, blkno);
+ p.last_exclusive = num_pages;
+
+ /* Iterate over pages, then loop back to recheck relation length */
+ while (BufferIsValid(buf = read_stream_next_buffer(stream, NULL)))
+ {
+ gistvacuumpage(&vstate, buf);
+ }
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ /*
+ * After reaching the end we have to reset the stream to use it again.
+ * An extra reset in the case of just one iteration does not cost us much.
+ */
+ read_stream_reset(stream);
}
+ read_stream_end(stream);
/*
* If we found any recyclable pages (and recorded them in the FSM), then
@@ -269,15 +289,16 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* are recursing to re-examine a previous page).
*/
static void
-gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
+gistvacuumpage(GistVacState *vstate, Buffer buffer)
{
IndexVacuumInfo *info = vstate->info;
IndexBulkDeleteCallback callback = vstate->callback;
void *callback_state = vstate->callback_state;
Relation rel = info->index;
- Buffer buffer;
+ BlockNumber orig_blkno = BufferGetBlockNumber(buffer);
Page page;
BlockNumber recurse_to;
+ BlockNumber blkno = orig_blkno;
restart:
recurse_to = InvalidBlockNumber;
@@ -285,9 +306,6 @@ restart:
/* call vacuum_delay_point while not holding any buffer lock */
vacuum_delay_point();
- buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
- info->strategy);
-
/*
* We are not going to stay here for a long time, aggressively grab an
* exclusive lock.
@@ -450,6 +468,8 @@ restart:
if (recurse_to != InvalidBlockNumber)
{
blkno = recurse_to;
+ buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+ info->strategy);
goto restart;
}
}
--
2.42.0
On Mon, 18 Nov 2024 at 16:34, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
On 2 Nov 2024, at 02:36, Kirill Reshke <reshkekirill@gmail.com> wrote:
I noticed CI failure for this patch. This does not look like a flap.
Seems like vacuum did not start index cleanup. I’ve added "index_cleanup on".
Thanks!
Best regards, Andrey Borodin.
Hi!
0001 Looks mature. Some comments:
1)
+# This ensures autovacuum do not run
+$node->append_conf('postgresql.conf', 'autovacuum = off');
The other option is to set `autovacuum = off` in relation DDL. I'm not
sure which option is better.
2) Are these used?
my $psql_err = '';
my $psql_out = '';
Should we add tap testing to 0002 & 0003 like 0001 already has?
--
Best regards,
Kirill Reshke
On Thu, Nov 28, 2024 at 10:29 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
0001 Looks mature. Some comments:
1)
+# This ensures autovacuum do not run
+$node->append_conf('postgresql.conf', 'autovacuum = off');
The other option is to set `autovacuum = off` in relation DDL. I'm not
sure which option is better.
Either is fine. Though perhaps it is better to turn it off globally in
this case since we might as well avoid wasting any resources.
2) Are these used?
my $psql_err = '';
my $psql_out = '';
They don't seem to be.
I'm looking at 0001 with the intent of committing it soon. Today I've
just been studying the test with the injection points.
My main thought is that you should rename the injection points to
something more descriptive. We want to make it clear why there are
two. Also, nbtree-vacuum-2 is actually where we wait before the first
iteration of the loop, so that is confusing.
Perhaps we also ought to pass parallel false in the vacuum command.
I spent time thinking about if there was a way to exercise the actual
backtracking code and test that index entries that should have been
cleaned up were left behind because they were moved from a page we
didn't process before the split started to one we did. I thought maybe
we could do something with pageinspect. But getting the timing right
seems too hard.
Anyway, it seems worth it to have a bit of coverage for needing
multiple passes in btvacuumscan(). And given the injection points
infrastructure, perhaps other btree vacuum tests could be added to
this file in the future.
- Melanie
On Tue, Mar 18, 2025 at 11:12 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
I'm looking at 0001 with the intent of committing it soon. Today I've
just been studying the test with the injection points.
Now, I've reviewed 0001. I manually validated it does read combining etc.
A few comments:
I know I was the one that advocated calling read_stream_next_buffer()
in the while loop, but I've realized that I think we have to change it
so that we can call vacuum_delay_point() before calling
read_stream_next_buffer(). Otherwise we hold the buffer pin during
vacuum_delay_point(). We'll want to remove it from the top of
btvacuumpage() and add it in this loop:
/* Iterate over pages, then loop back to recheck relation length */
while (BufferIsValid(buf = read_stream_next_buffer(stream, NULL)))
{
BlockNumber current_block = btvacuumpage(&vstate, buf);
if (info->report_progress)
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
current_block);
}
and before the call to ReadBufferExtended() in the backtrack code path.
buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
info->strategy);
Also, I think this comment could use a slight update.
/*
* btvacuumpage --- VACUUM one page
*
* This processes a single page for btvacuumscan(). In some cases we must
* backtrack to re-examine and VACUUM pages that were the scanblkno during
*/
scanblkno is a local variable but the comment was written when it was
a parameter. I think it would be more clear to refer to the passed in
"buf" variable instead.
- Melanie
On 18 Mar 2025, at 20:37, Melanie Plageman <melanieplageman@gmail.com> wrote:
I've reviewed 0001
Thanks!
I've added all suggested fixes as a separate patch step.
Except I did not rename injection points... I can't figure out descriptive names.
And delay point is now after the page is processed, not before.
FWIW I do not insist on committing the test; it was mostly necessary to validate that backtracking still works, which I could not check manually. But other injection-point tests, surprisingly, seem to be stable enough across the buildfarm.
Best regards, Andrey Borodin.
Attachments:
v9-0001-Prototype-B-tree-vacuum-streamlineing.patchapplication/octet-stream; name=v9-0001-Prototype-B-tree-vacuum-streamlineing.patch; x-unix-mode=0644Download
From 4129b1164696ad5ca62a709728db6c5d9010a795 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sat, 19 Oct 2024 12:46:29 +0500
Subject: [PATCH v9 1/4] Prototype B-tree vacuum streamlining
Signed-off-by: Andrey Borodin <x4mmm@yandex-team.ru>
Signed-off-by: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
---
src/backend/access/nbtree/nbtree.c | 70 +++++++++++-----
src/test/modules/test_misc/meson.build | 1 +
.../modules/test_misc/t/008_vacuum_btree.pl | 79 +++++++++++++++++++
3 files changed, 129 insertions(+), 21 deletions(-)
create mode 100644 src/test/modules/test_misc/t/008_vacuum_btree.pl
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c0a8833e068..fe7084edacd 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -32,6 +32,7 @@
#include "storage/lmgr.h"
#include "utils/fmgrprotos.h"
#include "utils/index_selfuncs.h"
+#include "utils/injection_point.h"
#include "utils/memutils.h"
@@ -86,7 +87,7 @@ typedef struct BTParallelScanDescData *BTParallelScanDesc;
static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
IndexBulkDeleteCallback callback, void *callback_state,
BTCycleId cycleid);
-static void btvacuumpage(BTVacState *vstate, BlockNumber scanblkno);
+static BlockNumber btvacuumpage(BTVacState *vstate, Buffer buf);
static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
IndexTuple posting,
OffsetNumber updatedoffset,
@@ -991,8 +992,9 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
Relation rel = info->index;
BTVacState vstate;
BlockNumber num_pages;
- BlockNumber scanblkno;
bool needLock;
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream = NULL;
/*
* Reset fields that track information about the entire index now. This
@@ -1061,9 +1063,17 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
*/
needLock = !RELATION_IS_LOCAL(rel);
- scanblkno = BTREE_METAPAGE + 1;
+ p.current_blocknum = BTREE_METAPAGE + 1;
+ stream = read_stream_begin_relation(READ_STREAM_FULL,
+ info->strategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
for (;;)
{
+ Buffer buf;
/* Get the current relation length */
if (needLock)
LockRelationForExtension(rel, ExclusiveLock);
@@ -1076,17 +1086,31 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
num_pages);
/* Quit if we've scanned the whole relation */
- if (scanblkno >= num_pages)
+ if (p.current_blocknum >= num_pages)
break;
- /* Iterate over pages, then loop back to recheck length */
- for (; scanblkno < num_pages; scanblkno++)
+
+ /* In 007_vacuum_btree test we need to coordinate two distinguishable points here */
+ INJECTION_POINT("nbtree-vacuum-1");
+ INJECTION_POINT("nbtree-vacuum-2");
+
+ p.last_exclusive = num_pages;
+
+ /* Iterate over pages, then loop back to recheck relation length */
+ while (BufferIsValid(buf = read_stream_next_buffer(stream, NULL)))
{
- btvacuumpage(&vstate, scanblkno);
+ BlockNumber current_block = btvacuumpage(&vstate, buf);
if (info->report_progress)
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
- scanblkno);
+ current_block);
}
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ /*
+ * After reaching the end we have to reset the stream to use it again.
+ * An extra reset in the case of just one iteration does not cost us much.
+ */
+ read_stream_reset(stream);
}
+ read_stream_end(stream);
/* Set statistics num_pages field to final size of index */
stats->num_pages = num_pages;
@@ -1116,9 +1140,11 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* after our cycleid was acquired) whose right half page happened to reuse
* a block that we might have processed at some point before it was
* recycled (i.e. before the page split).
+ *
+ * Returns BlockNumber of a scanned page (not backtracked).
*/
-static void
-btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
+static BlockNumber
+btvacuumpage(BTVacState *vstate, Buffer buf)
{
IndexVacuumInfo *info = vstate->info;
IndexBulkDeleteResult *stats = vstate->stats;
@@ -1129,7 +1155,7 @@ btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
bool attempt_pagedel;
BlockNumber blkno,
backtrack_to;
- Buffer buf;
+ BlockNumber scanblkno = BufferGetBlockNumber(buf);
Page page;
BTPageOpaque opaque;
@@ -1143,14 +1169,6 @@ backtrack:
/* call vacuum_delay_point while not holding any buffer lock */
vacuum_delay_point(false);
- /*
- * We can't use _bt_getbuf() here because it always applies
- * _bt_checkpage(), which will barf on an all-zero page. We want to
- * recycle all-zero pages, not fail. Also, we want to use a nondefault
- * buffer access strategy.
- */
- buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
- info->strategy);
_bt_lockbuf(rel, buf, BT_READ);
page = BufferGetPage(buf);
opaque = NULL;
@@ -1186,7 +1204,7 @@ backtrack:
errmsg_internal("right sibling %u of scanblkno %u unexpectedly in an inconsistent state in index \"%s\"",
blkno, scanblkno, RelationGetRelationName(rel))));
_bt_relbuf(rel, buf);
- return;
+ return scanblkno;
}
/*
@@ -1206,7 +1224,7 @@ backtrack:
{
/* Done with current scanblkno (and all lower split pages) */
_bt_relbuf(rel, buf);
- return;
+ return scanblkno;
}
}
@@ -1437,8 +1455,18 @@ backtrack:
if (backtrack_to != P_NONE)
{
blkno = backtrack_to;
+
+ /*
+ * We can't use _bt_getbuf() here because it always applies
+ * _bt_checkpage(), which will barf on an all-zero page. We want to
+ * recycle all-zero pages, not fail. Also, we want to use a nondefault
+ * buffer access strategy.
+ */
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+ info->strategy);
goto backtrack;
}
+ return scanblkno;
}
/*
diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index 9c50de7efb0..ed0822f19ee 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -16,6 +16,7 @@ tests += {
't/005_timeouts.pl',
't/006_signal_autovacuum.pl',
't/007_catcache_inval.pl',
+ 't/008_vacuum_btree.pl',
],
},
}
diff --git a/src/test/modules/test_misc/t/008_vacuum_btree.pl b/src/test/modules/test_misc/t/008_vacuum_btree.pl
new file mode 100644
index 00000000000..2c69d2b477d
--- /dev/null
+++ b/src/test/modules/test_misc/t/008_vacuum_btree.pl
@@ -0,0 +1,79 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# This test verifies that B-tree vacuum can restart its read stream.
+# To do so we need to insert some data during vacuum, so we wait at an injection point
+# after the first vacuum scan. During this wait we insert some data, forcing a page split.
+# This split will trigger relation extension and a subsequent read_stream_reset().
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+# Initialize postgres
+my $psql_err = '';
+my $psql_out = '';
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+
+# This ensures autovacuum does not run
+$node->append_conf('postgresql.conf', 'autovacuum = off');
+$node->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+$node->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+# From this point, vacuum worker will wait at startup.
+$node->safe_psql('postgres',
+ "SELECT injection_points_attach('nbtree-vacuum-2', 'wait');");
+
+my $psql_session = $node->background_psql('postgres');
+
+$psql_session->query_until(
+ qr/starting_bg_psql/,
+ q(\echo starting_bg_psql
+ create table a as select random() r from generate_series(1,100) x;
+ create index on a(r);
+ delete from a;
+ vacuum (index_cleanup on) a;
+ ));
+
+# Wait until a vacuum worker starts.
+$node->wait_for_event('client backend', 'nbtree-vacuum-2');
+
+$node->safe_psql('postgres',
+ "SELECT injection_points_attach('nbtree-vacuum-1', 'wait');");
+
+# Here's the key point of the test: during vacuum we add some page splits.
+# This will force vacuum into doing another scan, thus resetting the read stream.
+$node->safe_psql('postgres',
+ "insert into a select x from generate_series(1,3000) x;");
+
+$node->safe_psql('postgres',
+ "SELECT injection_points_detach('nbtree-vacuum-2');");
+$node->safe_psql('postgres',
+ "SELECT injection_points_wakeup('nbtree-vacuum-2');");
+
+# Observe that second scan is reached.
+$node->wait_for_event('client backend', 'nbtree-vacuum-1');
+
+$node->safe_psql('postgres',
+ "SELECT injection_points_detach('nbtree-vacuum-1');");
+$node->safe_psql('postgres',
+ "SELECT injection_points_wakeup('nbtree-vacuum-1');");
+
+ok($psql_session->quit);
+
+done_testing();
--
2.39.5 (Apple Git-154)
v9-0004-Use-read_stream-in-SP-GiST-vacuum.patchapplication/octet-stream; name=v9-0004-Use-read_stream-in-SP-GiST-vacuum.patch; x-unix-mode=0644Download
From 4a630aeb27f42a1be6b3d34007c3e5749373f09c Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Thu, 24 Oct 2024 13:53:39 +0300
Subject: [PATCH v9 4/4] Use read_stream in SP-GiST vacuum
---
src/backend/access/spgist/spgvacuum.c | 38 ++++++++++++++++++++-------
1 file changed, 28 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index eeddacd0d52..f0d8428d909 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -25,6 +25,7 @@
#include "storage/bufmgr.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
+#include "storage/read_stream.h"
#include "utils/snapmgr.h"
@@ -618,17 +619,15 @@ vacuumRedirectAndPlaceholder(Relation index, Relation heaprel, Buffer buffer)
* Process one page during a bulkdelete scan
*/
static void
-spgvacuumpage(spgBulkDeleteState *bds, BlockNumber blkno)
+spgvacuumpage(spgBulkDeleteState *bds, Buffer buffer)
{
Relation index = bds->info->index;
- Buffer buffer;
+ BlockNumber blkno = BufferGetBlockNumber(buffer);
Page page;
/* call vacuum_delay_point while not holding any buffer lock */
vacuum_delay_point(false);
- buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
- RBM_NORMAL, bds->info->strategy);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
page = (Page) BufferGetPage(buffer);
@@ -805,8 +804,10 @@ spgvacuumscan(spgBulkDeleteState *bds)
{
Relation index = bds->info->index;
bool needLock;
- BlockNumber num_pages,
- blkno;
+ BlockNumber num_pages;
+ Buffer buf;
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream = NULL;
/* Finish setting up spgBulkDeleteState */
initSpGistState(&bds->spgstate, index);
@@ -833,7 +834,14 @@ spgvacuumscan(spgBulkDeleteState *bds)
* delete some deletable tuples. See more extensive comments about this
* in btvacuumscan().
*/
- blkno = SPGIST_METAPAGE_BLKNO + 1;
+ p.current_blocknum = SPGIST_METAPAGE_BLKNO + 1;
+ stream = read_stream_begin_relation(READ_STREAM_FULL,
+ bds->info->strategy,
+ index,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
for (;;)
{
/* Get the current relation length */
@@ -844,17 +852,27 @@ spgvacuumscan(spgBulkDeleteState *bds)
UnlockRelationForExtension(index, ExclusiveLock);
/* Quit if we've scanned the whole relation */
- if (blkno >= num_pages)
+ if (p.current_blocknum >= num_pages)
break;
+
+ p.last_exclusive = num_pages;
+
/* Iterate over pages, then loop back to recheck length */
- for (; blkno < num_pages; blkno++)
+ while (BufferIsValid(buf = read_stream_next_buffer(stream, NULL)))
{
- spgvacuumpage(bds, blkno);
+ spgvacuumpage(bds, buf);
/* empty the pending-list after each page */
if (bds->pendingList != NULL)
spgprocesspending(bds);
}
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ /*
+ * After reaching the end we have to reset the stream to use it again.
+ * An extra reset in the case of just one iteration does not cost us much.
+ */
+ read_stream_reset(stream);
}
+ read_stream_end(stream);
/* Propagate local lastUsedPages cache to metablock */
SpGistUpdateMetaPage(index);
--
2.39.5 (Apple Git-154)
v9-0003-Use-read_stream-in-GiST-vacuum.patchapplication/octet-stream; name=v9-0003-Use-read_stream-in-GiST-vacuum.patch; x-unix-mode=0644Download
From 96b7286f94b392af888d50df1a143dd9c03a9a53 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Thu, 24 Oct 2024 10:00:49 +0300
Subject: [PATCH v9 3/4] Use read_stream in GiST vacuum
---
src/backend/access/gist/gistvacuum.c | 46 ++++++++++++++++++++--------
1 file changed, 33 insertions(+), 13 deletions(-)
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index dd0d9d5006c..806bd0a4345 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -22,6 +22,7 @@
#include "miscadmin.h"
#include "storage/indexfsm.h"
#include "storage/lmgr.h"
+#include "storage/read_stream.h"
#include "utils/memutils.h"
/* Working state needed by gistbulkdelete */
@@ -44,8 +45,7 @@ typedef struct
static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
IndexBulkDeleteCallback callback, void *callback_state);
-static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
- BlockNumber orig_blkno);
+static void gistvacuumpage(GistVacState *vstate, Buffer buffer);
static void gistvacuum_delete_empty_pages(IndexVacuumInfo *info,
GistVacState *vstate);
static bool gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
@@ -129,8 +129,9 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
GistVacState vstate;
BlockNumber num_pages;
bool needLock;
- BlockNumber blkno;
MemoryContext oldctx;
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream = NULL;
/*
* Reset fields that track information about the entire index now. This
@@ -208,9 +209,17 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
*/
needLock = !RELATION_IS_LOCAL(rel);
- blkno = GIST_ROOT_BLKNO;
+ p.current_blocknum = GIST_ROOT_BLKNO;
+ stream = read_stream_begin_relation(READ_STREAM_FULL,
+ info->strategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
for (;;)
{
+ Buffer buf;
/* Get the current relation length */
if (needLock)
LockRelationForExtension(rel, ExclusiveLock);
@@ -219,12 +228,23 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
UnlockRelationForExtension(rel, ExclusiveLock);
/* Quit if we've scanned the whole relation */
- if (blkno >= num_pages)
+ if (p.current_blocknum >= num_pages)
break;
- /* Iterate over pages, then loop back to recheck length */
- for (; blkno < num_pages; blkno++)
- gistvacuumpage(&vstate, blkno, blkno);
+ p.last_exclusive = num_pages;
+
+ /* Iterate over pages, then loop back to recheck relation length */
+ while (BufferIsValid(buf = read_stream_next_buffer(stream, NULL)))
+ {
+ gistvacuumpage(&vstate, buf);
+ }
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+ /*
+ * After reaching the end we have to reset the stream to use it again.
+ * An extra reset in the case of just one iteration does not cost us much.
+ */
+ read_stream_reset(stream);
}
+ read_stream_end(stream);
/*
* If we found any recyclable pages (and recorded them in the FSM), then
@@ -269,15 +289,16 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* are recursing to re-examine a previous page).
*/
static void
-gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
+gistvacuumpage(GistVacState *vstate, Buffer buffer)
{
IndexVacuumInfo *info = vstate->info;
IndexBulkDeleteCallback callback = vstate->callback;
void *callback_state = vstate->callback_state;
Relation rel = info->index;
- Buffer buffer;
+ BlockNumber orig_blkno = BufferGetBlockNumber(buffer);
Page page;
BlockNumber recurse_to;
+ BlockNumber blkno = orig_blkno;
restart:
recurse_to = InvalidBlockNumber;
@@ -285,9 +306,6 @@ restart:
/* call vacuum_delay_point while not holding any buffer lock */
vacuum_delay_point(false);
- buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
- info->strategy);
-
/*
* We are not going to stay here for a long time, aggressively grab an
* exclusive lock.
@@ -450,6 +468,8 @@ restart:
if (recurse_to != InvalidBlockNumber)
{
blkno = recurse_to;
+ buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+ info->strategy);
goto restart;
}
}
--
2.39.5 (Apple Git-154)
v9-0002-Review-fixes.patchapplication/octet-stream; name=v9-0002-Review-fixes.patch; x-unix-mode=0644Download
From aa7bf69a232c5adbb2dca791cd76ea7e92b15cd8 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Tue, 18 Mar 2025 23:11:14 +0500
Subject: [PATCH v9 2/4] Review fixes
---
src/backend/access/nbtree/nbtree.c | 13 ++++++++-----
src/test/modules/test_misc/t/008_vacuum_btree.pl | 4 +---
2 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index fe7084edacd..02688d90fe7 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1089,7 +1089,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
if (p.current_blocknum >= num_pages)
break;
- /* In 007_vacuum_btree test we need to coordinate two distinguishable points here */
+ /* In 008_vacuum_btree test we need to coordinate two distinguishable points here */
INJECTION_POINT("nbtree-vacuum-1");
INJECTION_POINT("nbtree-vacuum-2");
@@ -1135,7 +1135,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* btvacuumpage --- VACUUM one page
*
* This processes a single page for btvacuumscan(). In some cases we must
- * backtrack to re-examine and VACUUM pages that were the scanblkno during
+ * backtrack to re-examine and VACUUM pages that were on buf's page during
* a previous call here. This is how we handle page splits (that happened
* after our cycleid was acquired) whose right half page happened to reuse
* a block that we might have processed at some point before it was
@@ -1166,9 +1166,6 @@ backtrack:
attempt_pagedel = false;
backtrack_to = P_NONE;
- /* call vacuum_delay_point while not holding any buffer lock */
- vacuum_delay_point(false);
-
_bt_lockbuf(rel, buf, BT_READ);
page = BufferGetPage(buf);
opaque = NULL;
@@ -1456,6 +1453,9 @@ backtrack:
{
blkno = backtrack_to;
+ /* call vacuum_delay_point while not holding any buffer lock */
+ vacuum_delay_point(false);
+
/*
* We can't use _bt_getbuf() here because it always applies
* _bt_checkpage(), which will barf on an all-zero page. We want to
@@ -1466,6 +1466,9 @@ backtrack:
info->strategy);
goto backtrack;
}
+
+ /* call vacuum_delay_point while not holding any buffer lock */
+ vacuum_delay_point(false);
return scanblkno;
}
diff --git a/src/test/modules/test_misc/t/008_vacuum_btree.pl b/src/test/modules/test_misc/t/008_vacuum_btree.pl
index 2c69d2b477d..8ec56c419a3 100644
--- a/src/test/modules/test_misc/t/008_vacuum_btree.pl
+++ b/src/test/modules/test_misc/t/008_vacuum_btree.pl
@@ -16,8 +16,6 @@ if ($ENV{enable_injection_points} ne 'yes')
}
# Initialize postgres
-my $psql_err = '';
-my $psql_out = '';
my $node = PostgreSQL::Test::Cluster->new('node');
$node->init;
@@ -47,7 +45,7 @@ $psql_session->query_until(
create table a as select random() r from generate_series(1,100) x;
create index on a(r);
delete from a;
- vacuum (index_cleanup on) a;
+ vacuum (index_cleanup on, parallel 0) a;
));
# Wait until an vacuum worker starts.
--
2.39.5 (Apple Git-154)
On Tue, Mar 18, 2025 at 11:12 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
I'm looking at 0001 with the intent of committing it soon. Today I've
just been studying the test with the injection points.

My main thought is that you should rename the injection points to
something more descriptive. We want to make it clear why there are
two. Also, nbtree-vacuum-2 is actually where we wait during the first
iteration of the loop, so that is confusing.
I actually think you could do the test with one injection point and
just wait on it twice
diff --git a/src/backend/access/nbtree/nbtree.c
b/src/backend/access/nbtree/nbtree.c
index fe7084edacd..79f4323f887 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1091,7 +1091,6 @@ btvacuumscan(IndexVacuumInfo *info,
IndexBulkDeleteResult *stats,
/* In 007_vacuum_btree test we need to coordinate two
distinguishable points here */
INJECTION_POINT("nbtree-vacuum-1");
- INJECTION_POINT("nbtree-vacuum-2");
p.last_exclusive = num_pages;
diff --git a/src/test/modules/test_misc/t/007_vacuum_btree.pl
b/src/test/modules/test_misc/t/007_vacuum_btree.pl
index 2c69d2b477d..6d5f474b99a 100644
--- a/src/test/modules/test_misc/t/007_vacuum_btree.pl
+++ b/src/test/modules/test_misc/t/007_vacuum_btree.pl
@@ -37,7 +37,7 @@ $node->safe_psql('postgres', 'CREATE EXTENSION
injection_points;');
# From this point, vacuum worker will wait at startup.
$node->safe_psql('postgres',
- "SELECT injection_points_attach('nbtree-vacuum-2', 'wait');");
+ "SELECT injection_points_attach('nbtree-vacuum-1', 'wait');");
my $psql_session = $node->background_psql('postgres');
@@ -51,10 +51,7 @@ $psql_session->query_until(
));
# Wait until an vacuum worker starts.
-$node->wait_for_event('client backend', 'nbtree-vacuum-2');
-
-$node->safe_psql('postgres',
- "SELECT injection_points_attach('nbtree-vacuum-1', 'wait');");
+$node->wait_for_event('client backend', 'nbtree-vacuum-1');
# Here's the key point of a test: during vacuum we add some page splits.
# This will force vacuum into doing another scan thus reseting read stream.
@@ -62,18 +59,16 @@ $node->safe_psql('postgres',
"insert into a select x from generate_series(1,3000) x;");
$node->safe_psql('postgres',
- "SELECT injection_points_detach('nbtree-vacuum-2');");
-$node->safe_psql('postgres',
- "SELECT injection_points_wakeup('nbtree-vacuum-2');");
+ "SELECT injection_points_wakeup('nbtree-vacuum-1');");
-# Observe that second scan is reached.
$node->wait_for_event('client backend', 'nbtree-vacuum-1');
-$node->safe_psql('postgres',
- "SELECT injection_points_detach('nbtree-vacuum-1');");
$node->safe_psql('postgres',
"SELECT injection_points_wakeup('nbtree-vacuum-1');");
+$node->safe_psql('postgres',
+ "SELECT injection_points_detach('nbtree-vacuum-1');");
+
ok($psql_session->quit);
done_testing();
On 18 Mar 2025, at 23:21, Melanie Plageman <melanieplageman@gmail.com> wrote:
I actually think you could do the test with one injection point and
just wait on it twice
I was not sure we could safely wake up on an injection point and stop at the same point on the next iteration.
But, apparently, it is safe, due to wait_counts in InjectionPointSharedState.
So, yes, your change to the test seems correct to me. We can do the test with just one injection point.
Best regards, Andrey Borodin.
On Wed, Mar 19, 2025 at 5:26 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
So, yes, your change to the test seems correct to me. We can do the test with just one injection point.
Attached 0001 is what I plan to commit first thing tomorrow morning. I
moved the vacuum_delay_point() so that we would call
pgstat_progress_update_param() as soon as we were done processing the
block (instead of doing vacuum_delay_point() first). This also makes
it consistent with the other vacuum streaming read users. What do we
even use the progress update for here, though? For heap vacuuming it
makes sense because we report heap blocks vacuumed in
pg_stat_progress_vacuum, but we don't do that for index blocks
vacuumed...
I plan to commit 0001 (the read stream user) without 0002 and then
solicit more feedback on 0002.
0002 is a draft of the test. I want us to actually test the
backtracking behavior. To do this, I think we need two injection
points with the current injection point infrastructure. Which I don't
love. I think if we could pass variables into INJECTION_POINT(), we
could do the test with one injection point.
I also don't love how "failing" for this test is just it hanging
because the nbtree-vacuum-page-backtrack wait event never happens
(there is no specific assertion or anything in the test).
I do like that I moved the nbtree-vacuum-page injection point to the
top of nbtreevacuumpage because I think it can be used for other tests
in the future.
I also think it is worth making a btree directory in src/test instead
of adding this to src/test/modules/test_misc. In fact it might be
worth moving the gin and brin tests out of src/test/modules and making
a new "indexes" directory in src/test with gin, brin, and btree (also
some spgist tests are in there somewhere) subdirectories.
- Melanie
Attachments:
v10-0002-test-backtrack.patchtext/x-patch; charset=US-ASCII; name=v10-0002-test-backtrack.patchDownload
From 06689ada5011e1abeadaa0d33538e0b4069e5723 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 20 Mar 2025 16:09:41 -0400
Subject: [PATCH v10 2/2] test backtrack
---
src/backend/access/nbtree/nbtree.c | 4 +
src/test/modules/test_misc/meson.build | 1 +
.../modules/test_misc/t/008_vacuum_btree.pl | 96 +++++++++++++++++++
3 files changed, 101 insertions(+)
create mode 100644 src/test/modules/test_misc/t/008_vacuum_btree.pl
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 353600414a2..be29c4e70e2 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -32,6 +32,7 @@
#include "storage/lmgr.h"
#include "utils/fmgrprotos.h"
#include "utils/index_selfuncs.h"
+#include "utils/injection_point.h"
#include "utils/memutils.h"
@@ -1178,6 +1179,7 @@ backtrack:
attempt_pagedel = false;
backtrack_to = P_NONE;
+ INJECTION_POINT("nbtree-vacuum-page");
_bt_lockbuf(rel, buf, BT_READ);
page = BufferGetPage(buf);
opaque = NULL;
@@ -1468,6 +1470,8 @@ backtrack:
/* check for vacuum delay while not holding any buffer lock */
vacuum_delay_point(false);
+ INJECTION_POINT("nbtree-vacuum-page-backtrack");
+
/*
* We can't use _bt_getbuf() here because it always applies
* _bt_checkpage(), which will barf on an all-zero page. We want to
diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index 9c50de7efb0..ed0822f19ee 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -16,6 +16,7 @@ tests += {
't/005_timeouts.pl',
't/006_signal_autovacuum.pl',
't/007_catcache_inval.pl',
+ 't/008_vacuum_btree.pl',
],
},
}
diff --git a/src/test/modules/test_misc/t/008_vacuum_btree.pl b/src/test/modules/test_misc/t/008_vacuum_btree.pl
new file mode 100644
index 00000000000..d393e1809ce
--- /dev/null
+++ b/src/test/modules/test_misc/t/008_vacuum_btree.pl
@@ -0,0 +1,96 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test btree vacuum
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+ plan skip_all => 'Injection points not supported by this build';
+}
+
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+
+# Turn autovacuum off for the cluster. This could be done at the table level
+# instead. However, since this file exercises vacuum, turn off autovacuum
+# globally. This also allows use of non-local injection points in vacuum code.
+$node->append_conf('postgresql.conf', 'autovacuum = off');
+$node->start;
+
+# Check if extension injection_points is available
+if (!$node->check_extension('injection_points'))
+{
+ plan skip_all => 'Extension injection_points not installed';
+}
+
+$node->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+my $psql_session = $node->background_psql('postgres');
+
+# Create a table with an index filled with dead index entries
+$node->safe_psql('postgres', q[
+ create table test_backtrack (col1 int);
+ insert into test_backtrack select generate_series(1,800);
+ create index on test_backtrack(col1);
+ delete from test_backtrack;
+ ]
+);
+
+# Attach to two injection points. The first one will allow us to stop between
+# vacuuming each index page. The second is our validation that we did backtrack.
+$psql_session->query_safe(
+ qq[
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('nbtree-vacuum-page', 'wait');
+ SELECT injection_points_attach('nbtree-vacuum-page-backtrack', 'wait');
+]);
+
+# Start a vacuum of the table and index.
+$psql_session->query_until(
+ qr/starting_bg_psql/,
+ q(\echo starting_bg_psql
+ vacuum (index_cleanup on, parallel 0) test_backtrack;
+ ));
+
+# The index vacuum should be waiting on our first injection point and is yet to
+# process any pages.
+$node->wait_for_event('client backend', 'nbtree-vacuum-page');
+
+# Wake up vacuum so it can process the first index leaf page.
+$node->safe_psql('postgres', "SELECT injection_points_wakeup('nbtree-vacuum-page');");
+
+# The first index leaf page is now vacuumed and vacuum should be waiting again
+# on the first injection point.
+$node->wait_for_event('client backend', 'nbtree-vacuum-page');
+
+# Insert data while vacuum is waiting to process the next leaf page. The
+# inserted data will force a page split in which some tuples from unprocessed
+# leaf pages will be moved to the first already vacuumed leaf page.
+$node->safe_psql('postgres',
+ "insert into test_backtrack select generate_series(1,300);");
+
+# Now we want the vacuum to continue. We don't want to wait on our first break
+# point again.
+# We need to make sure we are waiting before detaching and issuing a wakeup.
+# Otherwise there could be a race and the backend may not get woken up.
+$node->wait_for_event('client backend', 'nbtree-vacuum-page');
+$node->safe_psql('postgres', "SELECT injection_points_detach('nbtree-vacuum-page');");
+$node->safe_psql('postgres', "SELECT injection_points_wakeup('nbtree-vacuum-page');");
+
+# Wait on our second break point. Vacuum should have been forced to backtrack
+# and vacuum the first leaf page again to ensure it removed all dead index
+# entries.
+$node->wait_for_event('client backend', 'nbtree-vacuum-page-backtrack');
+
+# Once we wait on our second break point, we're done. Time to tell the backend
+# to detach and wake it up.
+$node->safe_psql('postgres', "SELECT injection_points_detach('nbtree-vacuum-page-backtrack');");
+$node->safe_psql('postgres', "SELECT injection_points_wakeup('nbtree-vacuum-page-backtrack');");
+
+ok($psql_session->quit);
+
+done_testing();
--
2.34.1
v10-0001-Use-streaming-read-I-O-in-btree-vacuuming.patchtext/x-patch; charset=US-ASCII; name=v10-0001-Use-streaming-read-I-O-in-btree-vacuuming.patchDownload
From e558d78483d36ee510f13389029663e6bace9a27 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Thu, 20 Mar 2025 16:09:11 -0400
Subject: [PATCH v10 1/2] Use streaming read I/O in btree vacuuming
Btree vacuum processes all index pages in physical order. Now it uses
the read stream API to get the next buffer instead of explicitly
invoking ReadBuffer().
It is possible for concurrent indexes to cause page splits during index
vacuuming. This can lead to index entries that have yet to be vacuumed
being moved to pages that have already been vacuumed. Btree vacuum code
handles this by backtracking to reprocess those pages. So, while normal,
sequentially encountered pages are now read through the read stream API,
backtracked pages are still read with explicit ReadBuffer() calls.
Author: Andrey Borodin <x4mmm@yandex-team.ru>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_bW1UOyup%3DjdFw%2BkOF9bCaAm%3D9UpiyZtbPMn8n_vnP%2Big%40mail.gmail.com#3b3a84132fc683b3ee5b40bc4c2ea2a5
---
src/backend/access/nbtree/nbtree.c | 91 ++++++++++++++++++++++--------
1 file changed, 66 insertions(+), 25 deletions(-)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c0a8833e068..353600414a2 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -86,7 +86,7 @@ typedef struct BTParallelScanDescData *BTParallelScanDesc;
static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
IndexBulkDeleteCallback callback, void *callback_state,
BTCycleId cycleid);
-static void btvacuumpage(BTVacState *vstate, BlockNumber scanblkno);
+static BlockNumber btvacuumpage(BTVacState *vstate, Buffer buf);
static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
IndexTuple posting,
OffsetNumber updatedoffset,
@@ -991,8 +991,9 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
Relation rel = info->index;
BTVacState vstate;
BlockNumber num_pages;
- BlockNumber scanblkno;
bool needLock;
+ BlockRangeReadStreamPrivate p;
+ ReadStream *stream = NULL;
/*
* Reset fields that track information about the entire index now. This
@@ -1061,9 +1062,18 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
*/
needLock = !RELATION_IS_LOCAL(rel);
- scanblkno = BTREE_METAPAGE + 1;
+ p.current_blocknum = BTREE_METAPAGE + 1;
+ stream = read_stream_begin_relation(READ_STREAM_FULL,
+ info->strategy,
+ rel,
+ MAIN_FORKNUM,
+ block_range_read_stream_cb,
+ &p,
+ 0);
for (;;)
{
+ Buffer buf;
+
/* Get the current relation length */
if (needLock)
LockRelationForExtension(rel, ExclusiveLock);
@@ -1076,18 +1086,44 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
num_pages);
/* Quit if we've scanned the whole relation */
- if (scanblkno >= num_pages)
+ if (p.current_blocknum >= num_pages)
break;
- /* Iterate over pages, then loop back to recheck length */
- for (; scanblkno < num_pages; scanblkno++)
+
+
+ p.last_exclusive = num_pages;
+
+ /* Iterate over pages, then loop back to recheck relation length */
+ while (true)
{
- btvacuumpage(&vstate, scanblkno);
+ BlockNumber current_block;
+
+ /* call vacuum_delay_point while not holding any buffer lock */
+ vacuum_delay_point(false);
+
+ buf = read_stream_next_buffer(stream, NULL);
+
+ if (!BufferIsValid(buf))
+ break;
+
+ current_block = btvacuumpage(&vstate, buf);
+
if (info->report_progress)
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
- scanblkno);
+ current_block);
}
+
+ Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
+
+ /*
+ * We have to reset the read stream to use it again. After returning
+ * InvalidBuffer, the read stream API won't invoke our callback again
+ * until the stream has been reset.
+ */
+ read_stream_reset(stream);
}
+ read_stream_end(stream);
+
/* Set statistics num_pages field to final size of index */
stats->num_pages = num_pages;
@@ -1111,14 +1147,16 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* btvacuumpage --- VACUUM one page
*
* This processes a single page for btvacuumscan(). In some cases we must
- * backtrack to re-examine and VACUUM pages that were the scanblkno during
+ * backtrack to re-examine and VACUUM pages that were on buf's page during
* a previous call here. This is how we handle page splits (that happened
* after our cycleid was acquired) whose right half page happened to reuse
* a block that we might have processed at some point before it was
* recycled (i.e. before the page split).
+ *
+ * Returns BlockNumber of a scanned page (not backtracked).
*/
-static void
-btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
+static BlockNumber
+btvacuumpage(BTVacState *vstate, Buffer buf)
{
IndexVacuumInfo *info = vstate->info;
IndexBulkDeleteResult *stats = vstate->stats;
@@ -1129,7 +1167,7 @@ btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
bool attempt_pagedel;
BlockNumber blkno,
backtrack_to;
- Buffer buf;
+ BlockNumber scanblkno = BufferGetBlockNumber(buf);
Page page;
BTPageOpaque opaque;
@@ -1140,17 +1178,6 @@ backtrack:
attempt_pagedel = false;
backtrack_to = P_NONE;
- /* call vacuum_delay_point while not holding any buffer lock */
- vacuum_delay_point(false);
-
- /*
- * We can't use _bt_getbuf() here because it always applies
- * _bt_checkpage(), which will barf on an all-zero page. We want to
- * recycle all-zero pages, not fail. Also, we want to use a nondefault
- * buffer access strategy.
- */
- buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
- info->strategy);
_bt_lockbuf(rel, buf, BT_READ);
page = BufferGetPage(buf);
opaque = NULL;
@@ -1186,7 +1213,7 @@ backtrack:
errmsg_internal("right sibling %u of scanblkno %u unexpectedly in an inconsistent state in index \"%s\"",
blkno, scanblkno, RelationGetRelationName(rel))));
_bt_relbuf(rel, buf);
- return;
+ return scanblkno;
}
/*
@@ -1206,7 +1233,7 @@ backtrack:
{
/* Done with current scanblkno (and all lower split pages) */
_bt_relbuf(rel, buf);
- return;
+ return scanblkno;
}
}
@@ -1437,8 +1464,22 @@ backtrack:
if (backtrack_to != P_NONE)
{
blkno = backtrack_to;
+
+ /* check for vacuum delay while not holding any buffer lock */
+ vacuum_delay_point(false);
+
+ /*
+ * We can't use _bt_getbuf() here because it always applies
+ * _bt_checkpage(), which will barf on an all-zero page. We want to
+ * recycle all-zero pages, not fail. Also, we want to use a
+ * nondefault buffer access strategy.
+ */
+ buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+ info->strategy);
goto backtrack;
}
+
+ return scanblkno;
}
/*
--
2.34.1
On 21 Mar 2025, at 05:54, Melanie Plageman <melanieplageman@gmail.com> wrote:
On Wed, Mar 19, 2025 at 5:26 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
So, yes, your change to the test seems correct to me. We can do the test with just one injection point.
Attached 0001 is what I plan to commit first thing tomorrow morning. I
moved the vacuum_delay_point() so that we would call
pgstat_progress_update_param() as soon as we were done processing the
block (instead of doing vacuum_delay_point() first). This also makes
it consistent with the other vacuum streaming read users. What do we
even use the progress update for here, though? For heap vacuuming it
makes sense because we report heap blocks vacuumed in
pg_stat_progress_vacuum, but we don't do that for index blocks
vacuumed...
OK, the patch looks good to me. But there is no test for stream restart in the current patch. I'm OK with it, just noting.
I plan to commit 0001 (the read stream user) without 0002 and then
solicit more feedback on 0002.

0002 is a draft of the test. I want us to actually test the
backtracking behavior. To do this, I think we need two injection
points with the current injection point infrastructure. Which I don't
love. I think if we could pass variables into INJECTION_POINT(), we
could do the test with one injection point.

I also don't love how "failing" for this test is just it hanging
because the nbtree-vacuum-page-backtrack wait event never happens
(there is no specific assertion or anything in the test).
I think timing out in 180s is OK as long as failure is not expected.
I do like that I moved the nbtree-vacuum-page injection point to the
top of nbtreevacuumpage because I think it can be used for other tests
in the future.
Cool!
I also think it is worth making a btree directory in src/test instead
of adding this to src/test/modules/test_misc. In fact it might be
worth moving the gin and brin tests out of src/test/modules and making
a new "indexes" directory in src/test with gin, brin, and btree (also
some spgist tests are in there somewhere) subdirectories.
Previously we grouped injection point tests in a small number of modules, because it required additional changes to build files. And we only had like 3 or 5 tests.
I think now that we have a handful of tests, it's time to start organizing them in a structured way.
Thanks!
On Fri, Mar 21, 2025 at 8:19 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
On 21 Mar 2025, at 05:54, Melanie Plageman <melanieplageman@gmail.com> wrote:
I also think it is worth making a btree directory in src/test instead
of adding this to src/test/modules/test_misc. In fact it might be
worth moving the gin and brin tests out of src/test/modules and making
a new "indexes" directory in src/test with gin, brin, and btree (also
some spgist tests are in there somewhere) subdirectories.

Previously we grouped injection point tests in a small number of modules, because it required additional changes to build files. And we only had like 3 or 5 tests.
I think if we have a handful of tests - it’s time to start organizing them in a structured way.
There are tests with injection points in all different test
directories now. I think regress is the only one where it would be too
much of a pain.
I've committed the btree and gist read stream users. I think we can
come back to the test after feature freeze and make sure it is super
solid.
Looking at the spgist read stream user, I see you didn't convert
spgprocesspending(). It seems like you could write a callback that
uses the posting list and streamify this as well.
- Melanie
On Fri, Mar 21, 2025 at 3:23 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
I've committed the btree and gist read stream users. I think we can
come back to the test after feature freeze and make sure it is super
solid.
I've now committed the spgist vacuum user as well. I'll mark the CF
entry as completed.
I wonder if we should do GIN?
Looking at the spgist read stream user, I see you didn't convert
spgprocesspending(). It seems like you could write a callback that
uses the posting list and streamify this as well.
It's probably not worth it -- since we process the pending list for
each page of the index.
- Melanie
On 22 Mar 2025, at 00:23, Melanie Plageman <melanieplageman@gmail.com> wrote:
I've committed the btree and gist read stream users.
Cool! Thanks!
I think we can
come back to the test after feature freeze and make sure it is super
solid.
+1.
On 22 Mar 2025, at 02:54, Melanie Plageman <melanieplageman@gmail.com> wrote:
On Fri, Mar 21, 2025 at 3:23 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I've committed the btree and gist read stream users. I think we can
come back to the test after feature freeze and make sure it is super
solid.

I've now committed the spgist vacuum user as well. I'll mark the CF
entry as completed.
That's great! Thank you!
I wonder if we should do GIN?
GIN vacuum is a logical scan. Back in 2017 I started to work on it, but made some mistakes that were reverted by fd83c83 from the released version, and I decided to back off for some time. Perhaps now I can implement a physical scan for GIN that could benefit from the read stream. But I doubt I will find a committer for this in 19, let alone 18.
We can add some read stream support for hashbulkdelete(): it's not as linear as B-tree, GiST, and SP-GiST, since it scans only the beginning of hash buckets, but if buckets are small it might be more efficient.
Looking at the spgist read stream user, I see you didn't convert
spgprocesspending(). It seems like you could write a callback that
uses the posting list and streamify this as well.

It's probably not worth it -- since we process the pending list for
each page of the index.
My understanding is that pending lists should be small in real workloads.
Thank you!
Best regards, Andrey Borodin.
On 22 Mar 2025, at 02:54, Melanie Plageman <melanieplageman@gmail.com> wrote:
committed
There's a BF failure with just these changes [0]. But IMO it's unrelated.
There are 2 failed tests:
1. 'activeslot slot invalidation is logged with vacuum on pg_authid' is very similar to what is discussed here [1]
2. '+ERROR: tuple concurrently deleted' in injection_points/isolation seems to be discussed here [2]
Thanks!
Best regards, Andrey Borodin.
[0]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-03-21%2019%3A09%3A59
[1]: /messages/by-id/386386.1737736935@sss.pgh.pa.us
[2]: /messages/by-id/20250304033442.11.nmisch@google.com
On Sun, Mar 23, 2025 at 1:02 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
There's a BF failure with just these changes [0]. But IMO it's unrelated.
There are 2 failed tests:
1. 'activeslot slot invalidation is logged with vacuum on pg_authid' is very similar to what is discussed here [1]
2. '+ERROR: tuple concurrently deleted' in injection_points/isolation seems to be discussed here [2]
Yep, I saw the recovery/035_standby_logical_decoding skink failure and
concluded it was caused by a pre-existing issue that is already being
discussed. I hadn't yet investigated the injection point isolation
test failure. So, it is good to know it seems like it is a known
issue. Thanks for being on the lookout.
- Melanie