nbtree VACUUM's REDO routine doesn't clear page's VACUUM cycle ID

Started by Peter Geoghegan, about 1 year ago. 8 messages.
1 attachment(s)

Attached patch teaches btree_xlog_vacuum, nbtree VACUUM's REDO
routine, to reset the target page's opaque->btpo_cycleid to 0. This
makes the REDO routine match original execution, which seems like a
good idea on consistency grounds.

I propose this for the master branch only.
--
Peter Geoghegan

Attachments:

v1-0001-Clear-btpo_cycleid-in-btree_xlog_vacuum.patch (application/octet-stream)
From d50df1e0cc72298ebe66e8a811ff525a6d09d2f9 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 15 Nov 2024 11:26:59 -0500
Subject: [PATCH v1] Clear btpo_cycleid in btree_xlog_vacuum

---
 src/backend/access/nbtree/nbtxlog.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 7e91584f1..901b5f70b 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -634,10 +634,11 @@ btree_xlog_vacuum(XLogReaderState *record)
 			PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 
 		/*
-		 * Mark the page as not containing any LP_DEAD items --- see comments
-		 * in _bt_delitems_vacuum().
+		 * Clear the vacuum cycle ID, and mark the page as not containing any
+		 * LP_DEAD items
 		 */
 		opaque = BTPageGetOpaque(page);
+		opaque->btpo_cycleid = 0;
 		opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 
 		PageSetLSN(page, lsn);
@@ -698,7 +699,10 @@ btree_xlog_delete(XLogReaderState *record)
 		if (xlrec->ndeleted > 0)
 			PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 
-		/* Mark the page as not containing any LP_DEAD items */
+		/*
+		 * Do *not* clear the vacuum cycle ID, but do mark the page as not
+		 * containing any LP_DEAD items
+		 */
 		opaque = BTPageGetOpaque(page);
 		opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
 
-- 
2.45.2

#2 Michael Paquier
michael@paquier.xyz
In reply to: Peter Geoghegan (#1)
Re: nbtree VACUUM's REDO routine doesn't clear page's VACUUM cycle ID

On Fri, Nov 15, 2024 at 11:33:37AM -0500, Peter Geoghegan wrote:

Attached patch teaches btree_xlog_vacuum, nbtree VACUUM's REDO
routine, to reset the target page's opaque->btpo_cycleid to 0. This
makes the REDO routine match original execution, which seems like a
good idea on consistency grounds.

I propose this for the master branch only.

Perhaps this patch should also explain why it's better this way?
This path does not manipulate btpo_level, for one.
--
Michael

#3 Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Peter Geoghegan (#1)
Re: nbtree VACUUM's REDO routine doesn't clear page's VACUUM cycle ID

On 15 Nov 2024, at 21:33, Peter Geoghegan <pg@bowt.ie> wrote:

Attached patch teaches btree_xlog_vacuum, nbtree VACUUM's REDO
routine, to reset the target page's opaque->btpo_cycleid to 0. This
makes the REDO routine match original execution, which seems like a
good idea on consistency grounds.

I propose this for the master branch only.

The change seems correct to me: in any case, the cycle ID must be less than the cycle ID of any future VACUUM after promotion. I cannot say anything about the beauty of resetting or not resetting the field.
I'd suggest renaming the field to something like "btpo_split_vaccycleid". I was aware of index vacuum backtracking, but it took me a while to rebuild context.

Best regards, Andrey Borodin.

#4 Peter Geoghegan
pg@bowt.ie
In reply to: Andrey M. Borodin (#3)
Re: nbtree VACUUM's REDO routine doesn't clear page's VACUUM cycle ID

On Wed, Nov 20, 2024 at 4:41 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

On 15 Nov 2024, at 21:33, Peter Geoghegan <pg@bowt.ie> wrote:
I propose this for the master branch only.

The change seems correct to me: in any case, the cycle ID must be less than the cycle ID of any future VACUUM after promotion.

The cycle IDs set in the page special area during page splits that happen
to run while a VACUUM is also running must use that same VACUUM's cycle ID
(which is stored in shared memory for the currently running VACUUM).
That way VACUUM will know when it must backtrack later on, to
avoid missing index tuples that it is expected to remove.

It doesn't matter whether the cycle ID that VACUUM sees is less than or
greater than its own -- only that it matches its own when a match is
needed to get correct behavior from VACUUM. (Though it's also
possible to get a false positive, in rare cases where we get unlucky
and there's a collision. That might waste cycles within VACUUM, but
it shouldn't lead to truly incorrect behavior.)

I cannot say anything about beauty of resetting or not resetting the field.

The main reason to do this is simple: we should make REDO routines
match original execution, unless there is a very good reason not to do
so -- original execution is "authoritative". There's no good reason
not to do so here.

There is a secondary reason to do things this way, though: resetting
the cycle ID within the REDO routine can save future work from within
btvacuumpage, after a standby gets promoted, when nbtree VACUUM runs
against an affected index for the first time. Note that nbtree VACUUM
goes out of its way to clear an old cycle ID in passing, even when it
has nothing else to do on the page. And so by taking care of that on
the REDO side, right from the start, the newly promoted standby won't
need to dirty so many pages during the first nbtree VACUUM after it is
promoted.

(Actually, it's worse than that -- we're not just dirtying pages these
days. It's quite likely that we'll need to write a WAL record to clear
the cycle ID, since page-level checksums are enabled by default.)

I'd suggest renaming the field into something like "btpo_split_vaccycleid". I was aware of index vacuum backtracking, but it took me a while to build context again.

I'm not inclined to change it now. It's been like this for many years.

Pushed this patch just now. Thanks for the review!

--
Peter Geoghegan

#5 Peter Geoghegan
pg@bowt.ie
In reply to: Michael Paquier (#2)
Re: nbtree VACUUM's REDO routine doesn't clear page's VACUUM cycle ID

On Wed, Nov 20, 2024 at 1:32 AM Michael Paquier <michael@paquier.xyz> wrote:

Perhaps this patch should also explain why this is better this way?

See the commit message, and my remarks to Andrey just now.

This path does not manipulate btpo_level, for one.

Why would it do so? Changing the btree level would corrupt the index.

--
Peter Geoghegan

#6 Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Peter Geoghegan (#4)
Re: nbtree VACUUM's REDO routine doesn't clear page's VACUUM cycle ID

On 24 Dec 2024, at 01:46, Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Nov 20, 2024 at 4:41 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

On 15 Nov 2024, at 21:33, Peter Geoghegan <pg@bowt.ie> wrote:
I propose this for the master branch only.

The change seems correct to me: in any case, the cycle ID must be less than the cycle ID of any future VACUUM after promotion.

The cycle IDs set in the page special area during page splits that happen
to run while a VACUUM is also running must use that same VACUUM's cycle ID
(which is stored in shared memory for the currently running VACUUM).
That way VACUUM will know when it must backtrack later on, to
avoid missing index tuples that it is expected to remove.

It doesn't matter whether the cycle ID that VACUUM sees is less than or
greater than its own -- only that it matches its own when a match is
needed to get correct behavior from VACUUM. (Though it's also
possible to get a false positive, in rare cases where we get unlucky
and there's a collision. That might waste cycles within VACUUM, but
it shouldn't lead to truly incorrect behavior.)

I'm thinking more about it. We always reset btpo_cycleid, even in the REDO of a split.
This "btpo_cycleid = 0;" reset can break two scenarios that we do not currently support, but might support in the future.
The reset is based on the assumption that crash recovery will interrupt VACUUM, which is not true in these cases.

1. A compute-storage separation system: we do not have a filesystem, and when we need to read a page we get it from a storage service that rebuilds pages from WAL (e.g. Aurora and Neon). If we split a page during VACUUM, evict it, and then read it back from the service, we will miss the backtrack to the left page that VACUUM needs...
2. There's a tool for repairing pages with checksum violations -- page repair. AFAIK it can request a page from a standby, and if it does so amidst a VACUUM, VACUUM can get a false negative from the backtracking logic.

Thanks!

Best regards, Andrey Borodin.

#7 Peter Geoghegan
pg@bowt.ie
In reply to: Andrey Borodin (#6)
Re: nbtree VACUUM's REDO routine doesn't clear page's VACUUM cycle ID

On Tue, Nov 18, 2025 at 1:32 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

I'm thinking more about it. We always reset btpo_cycleid, even in the REDO of a split.
This "btpo_cycleid = 0;" reset can break two scenarios that we do not currently support, but might support in the future.

I don't follow.

The reset is based on the assumption that crash recovery will interrupt VACUUM, which is not true in these cases.

It's also based on the idea that only one VACUUM operation can be
running at a time.

1. A compute-storage separation system: we do not have a filesystem, and when we need to read a page we get it from a storage service that rebuilds pages from WAL (e.g. Aurora and Neon). If we split a page during VACUUM, evict it, and then read it back from the service, we will miss the backtrack to the left page that VACUUM needs...

Are you arguing that the xl_btree_split record should include the cycleid?

I see that systems that are built on this architecture do something
along these lines:
https://github.com/neondatabase/postgres/commit/a9b92820c5d14dbff8f59ab65ffdaae92ab9c3c8

However, that seems well out of scope for core Postgres. At least for
the foreseeable future.

--
Peter Geoghegan

#8 Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Peter Geoghegan (#7)
Re: nbtree VACUUM's REDO routine doesn't clear page's VACUUM cycle ID

On 18 Nov 2025, at 23:54, Peter Geoghegan <pg@bowt.ie> wrote:

Are you arguing that the xl_btree_split record should include the cycleid?

Yes.

I see that systems that are built on this architecture do something
along these lines:
https://github.com/neondatabase/postgres/commit/a9b92820c5d14dbff8f59ab65ffdaae92ab9c3c8

Thanks for the link. This solution seems surprisingly complicated.

However, that seems well out of scope for core Postgres. At least for
the foreseeable future.

Yes, I agree. It's just a case where REDO routines do not match original execution, and it seemed interesting to me. I understand that for core Postgres it would be extra overhead in the WAL record purely for beautification.

Thanks!

Best regards, Andrey Borodin.