Orphan page in _bt_split
Hi hacker,
The function _bt_split creates new (right) page to split into.
The comment says:
/*
* Acquire a new right page to split into, now that left page has a new
* high key. From here on, it's not okay to throw an error without
* zeroing rightpage first. This coding rule ensures that we won't
* confuse future VACUUM operations, which might otherwise try to
re-find
* a downlink to a leftover junk page as the page undergoes deletion.
*
* It would be reasonable to start the critical section just after
the new
* rightpage buffer is acquired instead; that would allow us to avoid
* leftover junk pages without bothering to zero rightpage. We do
it this
* way because it avoids an unnecessary PANIC when either origpage
or its
* existing sibling page are corrupt.
*/
So we should zero this page in case of reporting error to "not confuse
vacuum.
It is done for all places where this function reports error before
entering critical section and wal-logging this page.
But before it there is call to _bt_getbuf:
/*
* We have to grab the original right sibling (if any) and update
its prev
* link. We are guaranteed that this is deadlock-free, since we couple
* the locks in the standard order: left to right.
*/
if (!isrightmost)
{
sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE);
_bp_getbuf just reads page in a shared buffer. But ReadBuffer actually
can invoke half of Postgres code (locating buffer, evicting victim,
writing to SMGR,..., So there are a lot of places where error can be
thrown (including even such error as statement timeout).
And in this case current transaction will be aborted without zeroing the
right page.
So vacuum will be confused.
It can be quite easily demonstrated by inserting error here:
/*
* We have to grab the original right sibling (if any) and update
its prev
* link. We are guaranteed that this is deadlock-free, since we couple
* the locks in the standard order: left to right.
*/
if (!isrightmost)
{
sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE);
elog(ERROR, "@@@ Terminated");
And doing something like this:
postgres=# create table t3(pk integer primary key, sk integer, payload
integer);
CREATE TABLE
postgres=# create index on t3(sk);
CREATE INDEX
postgres=# insert into t3 values
(generate_series(1,1000000),random()*1000000,0);
ERROR: @@@ Terminated
postgres=# vacuum t3;
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
log:
2025-09-01 07:46:37.459 EEST [89290] ERROR: @@@ Terminated
2025-09-01 07:46:37.459 EEST [89290] STATEMENT: insert into t3 values
(generate_series(1,1000000),random()*1000000,0);
2025-09-01 07:46:42.391 EEST [89406] LOG: failed to re-find parent key
in index "t3_sk_idx" for deletion target page 4
2025-09-01 07:46:42.391 EEST [89406] CONTEXT: while vacuuming index
"t3_sk_idx" of relation "public.t3"
TRAP: failed Assert("false"), File: "nbtpage.c", Line: 2849, PID: 89406
0 postgres 0x0000000104bf2f64
ExceptionalCondition + 216
1 postgres 0x00000001043ace94
_bt_lock_subtree_parent + 248
2 postgres 0x00000001043aaf98
_bt_mark_page_halfdead + 508
3 postgres 0x00000001043aaaec _bt_pagedel +
1068
4 postgres 0x00000001043b56cc btvacuumpage
+ 2116
5 postgres 0x00000001043b4dd4 btvacuumscan
+ 564
This code is not changed for quite long time so I wonder why nobody
noticed this error before?
Proposed patch is attached.
It creates copy of right page in the same way as for left (original) page.
And updates the page only in critical section.
Attachments:
bt_split.patchtext/plain; charset=UTF-8; name=bt_split.patchDownload
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index be60781fc98..aa792c59e12 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1707,19 +1707,12 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
/*
* Acquire a new right page to split into, now that left page has a new
- * high key. From here on, it's not okay to throw an error without
- * zeroing rightpage first. This coding rule ensures that we won't
- * confuse future VACUUM operations, which might otherwise try to re-find
- * a downlink to a leftover junk page as the page undergoes deletion.
- *
- * It would be reasonable to start the critical section just after the new
- * rightpage buffer is acquired instead; that would allow us to avoid
- * leftover junk pages without bothering to zero rightpage. We do it this
- * way because it avoids an unnecessary PANIC when either origpage or its
- * existing sibling page are corrupt.
+ * high key. To prevent confision of VACUUM with orphan page, we also
+ * first fill in-mempory copy oif this page.
*/
rbuf = _bt_allocbuf(rel, heaprel);
- rightpage = BufferGetPage(rbuf);
+ rightpage = PageGetTempPage(BufferGetPage(rbuf));
+ memset(BufferGetPage(rbuf), 0, BLCKSZ);
rightpagenumber = BufferGetBlockNumber(rbuf);
/* rightpage was initialized by _bt_allocbuf */
ropaque = BTPageGetOpaque(rightpage);
@@ -1768,7 +1761,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
if (PageAddItem(rightpage, (Item) righthighkey, itemsz, afterrightoff,
false, false) == InvalidOffsetNumber)
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add high key to the right sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1816,7 +1808,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
if (!_bt_pgaddtup(leftpage, newitemsz, newitem, afterleftoff,
false))
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add new item to the left sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1829,7 +1820,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
if (!_bt_pgaddtup(rightpage, newitemsz, newitem, afterrightoff,
afterrightoff == minusinfoff))
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add new item to the right sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1843,7 +1833,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
{
if (!_bt_pgaddtup(leftpage, itemsz, dataitem, afterleftoff, false))
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add old item to the left sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1855,7 +1844,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
if (!_bt_pgaddtup(rightpage, itemsz, dataitem, afterrightoff,
afterrightoff == minusinfoff))
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add old item to the right sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1876,7 +1864,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
if (!_bt_pgaddtup(rightpage, newitemsz, newitem, afterrightoff,
afterrightoff == minusinfoff))
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add new item to the right sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1896,7 +1883,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
sopaque = BTPageGetOpaque(spage);
if (sopaque->btpo_prev != origpagenumber)
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("right sibling's left-link doesn't match: "
@@ -1942,6 +1928,10 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
PageRestoreTempPage(leftpage, origpage);
/* leftpage, lopaque must not be used below here */
+ PageRestoreTempPage(rightpage, BufferGetPage(rbuf));
+ rightpage = BufferGetPage(rbuf);
+ ropaque = BTPageGetOpaque(rightpage);
+
MarkBufferDirty(buf);
MarkBufferDirty(rbuf);
On Mon, Sep 01, 2025 at 08:03:18AM +0300, Konstantin Knizhnik wrote:
_bp_getbuf just reads page in a shared buffer. But ReadBuffer actually can
invoke half of Postgres code (locating buffer, evicting victim, writing to
SMGR,..., So there are a lot of places where error can be thrown (including
even such error as statement timeout).
And in this case current transaction will be aborted without zeroing the
right page.
So vacuum will be confused.It can be quite easily demonstrated by inserting error here:
Nice investigation and report, that I assume you have just guessed
from a read of the code and that there could be plenty of errors that
could happen in this code path. It indeed looks like some weak
coding assumption introduced in this code path by 9b42e713761a from
2019, going down to v11.
We could have a SQL regression test for this case, just put a
INJECTION_POINT(), then force an ERROR callback to force an incorrect
state. The test can be made cheap enough.
This code is not changed for quite long time so I wonder why nobody noticed
this error before?
I am ready to believe that errors are just hard to reach in this path.
It creates copy of right page in the same way as for left (original) page.
And updates the page only in critical section.
Paying the cost of an extra allocation for a temporary page to bypass
the problem entirely surely makes the code safer in the long run, so
as any changes made in this area would never trigger the same problem
ever again. We do that already for GIN with critical sections, as one
example.
Peter G., as the committer of 9b42e713761a, an opinion?
--
Michael
On Mon, Sep 1, 2025 at 1:03 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
So we should zero this page in case of reporting error to "not confuse
vacuum.
It is done for all places where this function reports error before
entering critical section and wal-logging this page.
Right.
_bp_getbuf just reads page in a shared buffer. But ReadBuffer actually
can invoke half of Postgres code (locating buffer, evicting victim,
writing to SMGR,..., So there are a lot of places where error can be
thrown (including even such error as statement timeout).
And in this case current transaction will be aborted without zeroing the
right page.
So vacuum will be confused.
...
2025-09-01 07:46:37.459 EEST [89290] ERROR: @@@ Terminated
2025-09-01 07:46:37.459 EEST [89290] STATEMENT: insert into t3 values
(generate_series(1,1000000),random()*1000000,0);
2025-09-01 07:46:42.391 EEST [89406] LOG: failed to re-find parent key
in index "t3_sk_idx" for deletion target page 4
2025-09-01 07:46:42.391 EEST [89406] CONTEXT: while vacuuming index
"t3_sk_idx" of relation "public.t3"
TRAP: failed Assert("false"), File: "nbtpage.c", Line: 2849, PID: 89406
This code is not changed for quite long time so I wonder why nobody
noticed this error before?
The assertion failure + log output that you've shown is from
_bt_lock_subtree_parent. I'm aware that that error appears from time
to time in the field. When we hit this log message in a non-assert
build, VACUUM should still be able to finish successfully.
My assumption up until now has been that the corruption underlying
these "failed to re-find parent" LOG messages is generally due to
breaking changes to collations (page deletion uses an insertion scan
key to "re-find" a downlink to the page undergoing deletion), and
generic corruption. But it now seems possible that some of them were
due to the problem that you've highlighted.
The problem that you've highlighted happens late within _bt_split,
when the contents of the new right sibling page (everything except its
LSN) has already been written to the page. So I'm fairly confident
that any real world problems would show themselves as "failed to
re-find parent" LOG message from VACUUM (there are no downlinks or
sibling pointers that point to the new right page, so only VACUUM will
ever be able to read the page).
Proposed patch is attached.
It creates copy of right page in the same way as for left (original) page.
And updates the page only in critical section.
I remember that when I worked on what became commit 9b42e71376 back in
2019 (which fixed a similar problem caused by the INCLUDE index
patch), Tom suggested that we do things this way defensively (without
being specifically aware of the _bt_getbuf hazard). That does seem
like the best approach.
I'm a little bit worried about the added overhead of allocating an
extra BLCKSZ buffer for this. I wonder if it would make sense to use
BLCKSZ arrays of char for both of the temp pages.
--
Peter Geoghegan
On Mon, Sep 1, 2025 at 1:35 AM Michael Paquier <michael@paquier.xyz> wrote:
Nice investigation and report, that I assume you have just guessed
from a read of the code and that there could be plenty of errors that
could happen in this code path. It indeed looks like some weak
coding assumption introduced in this code path by 9b42e713761a from
2019, going down to v11.
Commit 9b42e713761a really has nothing to do with this. It fixed a
similar issue that slipped in to Postgres 11. At worst, commit
9b42e713761a neglected to fix this other problem in passing.
This hazard has existing since commit 8fa30f906b, from 2010. That's
the commit that introduced the general idea of making _bt_split zero
its rightpage in order to make it safe to throw an ERROR instead of just
PANICing.
We could have a SQL regression test for this case, just put a
INJECTION_POINT(), then force an ERROR callback to force an incorrect
state. The test can be made cheap enough.This code is not changed for quite long time so I wonder why nobody noticed
this error before?I am ready to believe that errors are just hard to reach in this path.
Why?
There's just no reason to think that we'd ever be able to tie back one
of these LOG messages from VACUUM to the problem within _bt_split.
There's too many other forms of corruption that might result in VACUUM
logging this same error (e.g., breaking changes to a glibc collation).
An important case where this weakness will make life worse for users
is a checksum failure against the existing right sibling page -- since
those are not once off, transient errors (unlike, say, OOMs). Once you
have an index page with a bad checksum, there's a decent chance that
the application will attempt to insert onto the page to the immediate
left of that bad page. That'll trigger a split, sooner or later. Which
in turn triggers the problem that Konstantin reported. It's not going
to make the corruption problem markedly worse, but it's still not
great: there's no telling how many times successive inserters will try
and inevitably fail to split the same page, creating a new junk page
each time.
--
Peter Geoghegan
On Mon, Sep 1, 2025 at 3:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
There's just no reason to think that we'd ever be able to tie back one
of these LOG messages from VACUUM to the problem within _bt_split.
There's too many other forms of corruption that might result in VACUUM
logging this same error (e.g., breaking changes to a glibc collation).
Thinking about this some more, I guess it's generally fairly unlikely
that VACUUM would actually even attempt to delete such a page. The
only reason why it happens with Konstantin's test case is because the
whole inserting transaction aborts, leaving behind many garbage tuples
that VACUUM will remove, leading to an empty page. Without that,
VACUUM won't think to even try to delete a left over junk right
sibling page.
An important case where this weakness will make life worse for users
is a checksum failure against the existing right sibling page -- since
those are not once off, transient errors (unlike, say, OOMs). Once you
have an index page with a bad checksum, there's a decent chance that
the application will attempt to insert onto the page to the immediate
left of that bad page. That'll trigger a split, sooner or later.
Also rethinking this aspect: a checksum failure probably *isn't* going
to make much difference. Since that'll also cause bigger problems for
VACUUM than logging one of these "failed to re-find parent key"
messages.
I still think that we should do something about this. Not sure how I
feel about backpatching just yet.
--
Peter Geoghegan
On Mon, Sep 01, 2025 at 03:04:58PM -0400, Peter Geoghegan wrote:
This hazard has existing since commit 8fa30f906b, from 2010. That's
the commit that introduced the general idea of making _bt_split zero
its rightpage in order to make it safe to throw an ERROR instead of just
PANICing.
Thanks, you're right, I have fat-fingered this one. It's much older
than 2019.
--
Michael
On Mon, Sep 01, 2025 at 02:35:56PM -0400, Peter Geoghegan wrote:
I remember that when I worked on what became commit 9b42e71376 back in
2019 (which fixed a similar problem caused by the INCLUDE index
patch), Tom suggested that we do things this way defensively (without
being specifically aware of the _bt_getbuf hazard). That does seem
like the best approach.I'm a little bit worried about the added overhead of allocating an
extra BLCKSZ buffer for this. I wonder if it would make sense to use
BLCKSZ arrays of char for both of the temp pages.
Hmm. Yes, the allocation makes me a bit uneasy. Or it would be
possible to just use a PGAlignedBlock here? That's already used in
some code paths for page manipulations, forcing alignment.
--
Michael
On 02/09/2025 7:42 AM, Michael Paquier wrote:
On Mon, Sep 01, 2025 at 02:35:56PM -0400, Peter Geoghegan wrote:
I remember that when I worked on what became commit 9b42e71376 back in
2019 (which fixed a similar problem caused by the INCLUDE index
patch), Tom suggested that we do things this way defensively (without
being specifically aware of the _bt_getbuf hazard). That does seem
like the best approach.I'm a little bit worried about the added overhead of allocating an
extra BLCKSZ buffer for this. I wonder if it would make sense to use
BLCKSZ arrays of char for both of the temp pages.Hmm. Yes, the allocation makes me a bit uneasy. Or it would be
possible to just use a PGAlignedBlock here? That's already used in
some code paths for page manipulations, forcing alignment.
Do you suggest to replace `PageGetTempPage` with using local buffers?
Or may be change `PageGetTempPage*` to accept parameter with provided
local buffer?
But it is used in ~20 places. Change them all?
It seems to be too invasive change for such small fix, but I can do it.
Certainly copying data can be done in more efficient way, but I do not
thing that palloc here can have any noticeable impact on performance.
Please let me know if I should prepare new page avoiding memory
allocation or you prefer to do it yourself.
On 01/09/2025 10:18 PM, Peter Geoghegan wrote:
On Mon, Sep 1, 2025 at 3:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
There's just no reason to think that we'd ever be able to tie back one
of these LOG messages from VACUUM to the problem within _bt_split.
There's too many other forms of corruption that might result in VACUUM
logging this same error (e.g., breaking changes to a glibc collation).Thinking about this some more, I guess it's generally fairly unlikely
that VACUUM would actually even attempt to delete such a page. The
only reason why it happens with Konstantin's test case is because the
whole inserting transaction aborts, leaving behind many garbage tuples
that VACUUM will remove, leading to an empty page. Without that,
VACUUM won't think to even try to delete a left over junk right
sibling page.
But sooner or later vacuum will be called for this index and will
traverse this page, will not it?
There is not other way to reuse this page without deleting it or I am
missing something?
An important case where this weakness will make life worse for users
is a checksum failure against the existing right sibling page -- since
those are not once off, transient errors (unlike, say, OOMs). Once you
have an index page with a bad checksum, there's a decent chance that
the application will attempt to insert onto the page to the immediate
left of that bad page. That'll trigger a split, sooner or later.Also rethinking this aspect: a checksum failure probably *isn't* going
to make much difference. Since that'll also cause bigger problems for
VACUUM than logging one of these "failed to re-find parent key"
messages.
But vacuum is not just logging this message. It throws error which means
that vacuum for this relation will be performed any more.
On Wed, Sep 03, 2025 at 09:32:41AM +0300, Konstantin Knizhnik wrote:
On 01/09/2025 10:18 PM, Peter Geoghegan wrote:
Also rethinking this aspect: a checksum failure probably *isn't* going
to make much difference. Since that'll also cause bigger problems for
VACUUM than logging one of these "failed to re-find parent key"
messages.But vacuum is not just logging this message. It throws error which means
that vacuum for this relation will be performed any more.
Yeah. For a long-running autovacuum job, getting potentially random
in-flight failures is always annoying, more if these jobs deal with
wraparound.
--
Michael
On Wed, Sep 3, 2025 at 2:32 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
But sooner or later vacuum will be called for this index and will
traverse this page, will not it?
There is not other way to reuse this page without deleting it or I am
missing something?
That's true. But VACUUM won't even attempt to delete it unless it can
also remove all of the index tuples. Which, in general, probably won't
happen (it happened with your test case, but that's probably not
typical).
But vacuum is not just logging this message. It throws error which means
that vacuum for this relation will be performed any more.
What error? You showed an assertion failure, but that won't be hit in
release builds.
--
Peter Geoghegan
On 04/09/2025 3:55 AM, Peter Geoghegan wrote:
On Wed, Sep 3, 2025 at 2:32 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
But sooner or later vacuum will be called for this index and will
traverse this page, will not it?
There is not other way to reuse this page without deleting it or I am
missing something?That's true. But VACUUM won't even attempt to delete it unless it can
also remove all of the index tuples. Which, in general, probably won't
happen (it happened with your test case, but that's probably not
typical).But vacuum is not just logging this message. It throws error which means
that vacuum for this relation will be performed any more.What error? You showed an assertion failure, but that won't be hit in
release builds.
Sorry, I missed it with another "failed to re-find" error ("... for split").
Yes, this error can happen only for Postgres build with enabled casserts.
On Wed, Sep 3, 2025 at 2:25 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote:
Do you suggest to replace `PageGetTempPage` with using local buffers?
Or may be change `PageGetTempPage*` to accept parameter with provided
local buffer?
I think that _bt_split could easily be switched over to using 2 local
PGAlignedBlock variables, removing its use of PageGetTempPage. I don't
think that it is necessary to consider other PageGetTempPage callers.
--
Peter Geoghegan
On 13/09/2025 10:10 PM, Peter Geoghegan wrote:
On Wed, Sep 3, 2025 at 2:25 AM Konstantin Knizhnik<knizhnik@garret.ru> wrote:
Do you suggest to replace `PageGetTempPage` with using local buffers?
Or may be change `PageGetTempPage*` to accept parameter with provided
local buffer?I think that _bt_split could easily be switched over to using 2 local
PGAlignedBlock variables, removing its use of PageGetTempPage. I don't
think that it is necessary to consider other PageGetTempPage callers.
Attached please find updated version of the patch which uses
PGAlignedBlock instead of PageGetTempPage.
Attachments:
v2_bt_split.patchtext/plain; charset=UTF-8; name=v2_bt_split.patchDownload
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index be60781fc98..9a62e807f81 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1473,6 +1473,8 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
Page origpage;
Page leftpage,
rightpage;
+ PGAlignedBlock leftpage_buf,
+ rightpage_buf;
BlockNumber origpagenumber,
rightpagenumber;
BTPageOpaque ropaque,
@@ -1543,8 +1545,8 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
firstrightoff = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
newitem, &newitemonleft);
- /* Allocate temp buffer for leftpage */
- leftpage = PageGetTempPage(origpage);
+ leftpage = leftpage_buf.data;
+ memcpy(leftpage, origpage, BLCKSZ);
_bt_pageinit(leftpage, BufferGetPageSize(buf));
lopaque = BTPageGetOpaque(leftpage);
@@ -1707,19 +1709,13 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
/*
* Acquire a new right page to split into, now that left page has a new
- * high key. From here on, it's not okay to throw an error without
- * zeroing rightpage first. This coding rule ensures that we won't
- * confuse future VACUUM operations, which might otherwise try to re-find
- * a downlink to a leftover junk page as the page undergoes deletion.
- *
- * It would be reasonable to start the critical section just after the new
- * rightpage buffer is acquired instead; that would allow us to avoid
- * leftover junk pages without bothering to zero rightpage. We do it this
- * way because it avoids an unnecessary PANIC when either origpage or its
- * existing sibling page are corrupt.
+ * high key. To prevent confision of VACUUM with orphan page, we also
+ * first fill in-mempory copy oif this page.
*/
rbuf = _bt_allocbuf(rel, heaprel);
- rightpage = BufferGetPage(rbuf);
+ rightpage = rightpage_buf.data;
+ memcpy(rightpage, BufferGetPage(rbuf), BLCKSZ);
+ memset(BufferGetPage(rbuf), 0, BLCKSZ);
rightpagenumber = BufferGetBlockNumber(rbuf);
/* rightpage was initialized by _bt_allocbuf */
ropaque = BTPageGetOpaque(rightpage);
@@ -1768,7 +1764,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
if (PageAddItem(rightpage, (Item) righthighkey, itemsz, afterrightoff,
false, false) == InvalidOffsetNumber)
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add high key to the right sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1816,7 +1811,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
if (!_bt_pgaddtup(leftpage, newitemsz, newitem, afterleftoff,
false))
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add new item to the left sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1829,7 +1823,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
if (!_bt_pgaddtup(rightpage, newitemsz, newitem, afterrightoff,
afterrightoff == minusinfoff))
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add new item to the right sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1843,7 +1836,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
{
if (!_bt_pgaddtup(leftpage, itemsz, dataitem, afterleftoff, false))
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add old item to the left sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1855,7 +1847,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
if (!_bt_pgaddtup(rightpage, itemsz, dataitem, afterrightoff,
afterrightoff == minusinfoff))
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add old item to the right sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1876,7 +1867,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
if (!_bt_pgaddtup(rightpage, newitemsz, newitem, afterrightoff,
afterrightoff == minusinfoff))
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add new item to the right sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1896,7 +1886,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
sopaque = BTPageGetOpaque(spage);
if (sopaque->btpo_prev != origpagenumber)
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("right sibling's left-link doesn't match: "
@@ -1939,9 +1928,13 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
* original. We need to do this before writing the WAL record, so that
* XLogInsert can WAL log an image of the page if necessary.
*/
- PageRestoreTempPage(leftpage, origpage);
+ memcpy(origpage, leftpage, BLCKSZ);
/* leftpage, lopaque must not be used below here */
+ rightpage = BufferGetPage(rbuf);
+ memcpy(rightpage, rightpage_buf.data, BLCKSZ);
+ ropaque = BTPageGetOpaque(rightpage);
+
MarkBufferDirty(buf);
MarkBufferDirty(rbuf);
On Sun, Sep 14, 2025 at 04:40:54PM +0300, Konstantin Knizhnik wrote:
On 13/09/2025 10:10 PM, Peter Geoghegan wrote:
I think that _bt_split could easily be switched over to using 2 local
PGAlignedBlock variables, removing its use of PageGetTempPage. I don't
think that it is necessary to consider other PageGetTempPage callers.
Hmm. I didn't really agree with this last statement, especially if
these code paths don't need to manipulate Page pointers across the
stack. So I have taken this opportunity to double-check all the
existing calls of PageGetTempPage() which is an API old enough to vote
(105409746499?), while 44cac9346479 has introduced PGAlignedBlock much
later, seeing if there are some opportunities here:
- dataSplitPageInternal() and entrySplitPage() expect the contents of
the pages to be returned to the caller. We could use PGAlignedBlock
with pre-prepared pages that don't get allocated, but this requires
the callers of the internal routines to take responsibility for that,
like dataBeginPlaceToPage(). The complexity is not appealing in these
case.
- The same argument applies to the call in ginPlaceToPage(), passing
down the page to fillRoot().
- gistplacetopage(), with a copy of a page filled its special area,
manipulated across other calls.
- An optimization exists for btree_xlog_split() where a temporary page
is not manipulated, but the fact that we are in redo does not really
matter much for the extra allocation, I guess.
At the end, I don't feel excited about switching these cases, where
the gains are not obvious.
Attached please find updated version of the patch which uses PGAlignedBlock
instead of PageGetTempPage.
+ * high key. To prevent confision of VACUUM with orphan page, we also
+ * first fill in-mempory copy oif this page.
Three typos in two lines here (not sure about the best wording that
fits with VACUUM, but I like the suggested simplification):
s/in-mempory/in-memory/
s/confision/confusion/
s/oif/if/
Saying that, this patch sounds like a good idea, making these code
paths a bit more reliable for VACUUM, without paying the cost of an
allocation. At least for HEAD, that's nice to have. Peter, what do
you think?
--
Michael
On Tue, Sep 16, 2025 at 10:26:12AM +0900, Michael Paquier wrote:
Saying that, this patch sounds like a good idea, making these code
paths a bit more reliable for VACUUM, without paying the cost of an
allocation. At least for HEAD, that's nice to have. Peter, what do
you think?
Hearing nothing, I'm planning to have a second look at it and do
something on HEAD. Feel free if you have any comments.
--
Michael
On 18 Sep 2025, at 7:28 AM, Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Sep 16, 2025 at 10:26:12AM +0900, Michael Paquier wrote:
Saying that, this patch sounds like a good idea, making these code
paths a bit more reliable for VACUUM, without paying the cost of an
allocation. At least for HEAD, that's nice to have. Peter, what do
you think?Hearing nothing, I'm planning to have a second look at it and do
something on HEAD. Feel free if you have any comments.
—
Michael
Attached please find rebased version of the patch with fixed mistypings.
Attachments:
v3_bt_split.patchapplication/octet-stream; name=v3_bt_split.patch; x-unix-mode=0644Download
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index be60781fc98..e0dcccc194d 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1473,6 +1473,8 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
Page origpage;
Page leftpage,
rightpage;
+ PGAlignedBlock leftpage_buf,
+ rightpage_buf;
BlockNumber origpagenumber,
rightpagenumber;
BTPageOpaque ropaque,
@@ -1543,8 +1545,8 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
firstrightoff = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
newitem, &newitemonleft);
- /* Allocate temp buffer for leftpage */
- leftpage = PageGetTempPage(origpage);
+ leftpage = leftpage_buf.data;
+ memcpy(leftpage, origpage, BLCKSZ);
_bt_pageinit(leftpage, BufferGetPageSize(buf));
lopaque = BTPageGetOpaque(leftpage);
@@ -1707,19 +1709,13 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
/*
* Acquire a new right page to split into, now that left page has a new
- * high key. From here on, it's not okay to throw an error without
- * zeroing rightpage first. This coding rule ensures that we won't
- * confuse future VACUUM operations, which might otherwise try to re-find
- * a downlink to a leftover junk page as the page undergoes deletion.
- *
- * It would be reasonable to start the critical section just after the new
- * rightpage buffer is acquired instead; that would allow us to avoid
- * leftover junk pages without bothering to zero rightpage. We do it this
- * way because it avoids an unnecessary PANIC when either origpage or its
- * existing sibling page are corrupt.
+ * high key. To prevent confusion of VACUUM with orphan page, we also
+ * first fill in-memory copy of this page.
*/
rbuf = _bt_allocbuf(rel, heaprel);
- rightpage = BufferGetPage(rbuf);
+ rightpage = rightpage_buf.data;
+ memcpy(rightpage, BufferGetPage(rbuf), BLCKSZ);
+ memset(BufferGetPage(rbuf), 0, BLCKSZ);
rightpagenumber = BufferGetBlockNumber(rbuf);
/* rightpage was initialized by _bt_allocbuf */
ropaque = BTPageGetOpaque(rightpage);
@@ -1768,7 +1764,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
if (PageAddItem(rightpage, (Item) righthighkey, itemsz, afterrightoff,
false, false) == InvalidOffsetNumber)
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add high key to the right sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1816,7 +1811,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
if (!_bt_pgaddtup(leftpage, newitemsz, newitem, afterleftoff,
false))
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add new item to the left sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1829,7 +1823,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
if (!_bt_pgaddtup(rightpage, newitemsz, newitem, afterrightoff,
afterrightoff == minusinfoff))
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add new item to the right sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1843,7 +1836,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
{
if (!_bt_pgaddtup(leftpage, itemsz, dataitem, afterleftoff, false))
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add old item to the left sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1855,7 +1847,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
if (!_bt_pgaddtup(rightpage, itemsz, dataitem, afterrightoff,
afterrightoff == minusinfoff))
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add old item to the right sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1876,7 +1867,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
if (!_bt_pgaddtup(rightpage, newitemsz, newitem, afterrightoff,
afterrightoff == minusinfoff))
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add new item to the right sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1896,7 +1886,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
sopaque = BTPageGetOpaque(spage);
if (sopaque->btpo_prev != origpagenumber)
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("right sibling's left-link doesn't match: "
@@ -1939,9 +1928,13 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
* original. We need to do this before writing the WAL record, so that
* XLogInsert can WAL log an image of the page if necessary.
*/
- PageRestoreTempPage(leftpage, origpage);
+ memcpy(origpage, leftpage, BLCKSZ);
/* leftpage, lopaque must not be used below here */
+ rightpage = BufferGetPage(rbuf);
+ memcpy(rightpage, rightpage_buf.data, BLCKSZ);
+ ropaque = BTPageGetOpaque(rightpage);
+
MarkBufferDirty(buf);
MarkBufferDirty(rbuf);
On Mon, Sep 22, 2025 at 07:44:18PM +0300, Константин Книжник wrote:
Attached please find rebased version of the patch with fixed mistypings.
I have looked at v3.
- leftpage = PageGetTempPage(origpage);
+ leftpage = leftpage_buf.data;
+ memcpy(leftpage, origpage, BLCKSZ);
_bt_pageinit(leftpage, BufferGetPageSize(buf));
What's the point of copying the contents of origpage to leftpage?
_bt_pageinit() is going to initialize leftpage (plus a few more things
set like the page LSN, etc.), so the memcpy should not be necessary,
no?
+ rightpage = BufferGetPage(rbuf);
+ memcpy(rightpage, rightpage_buf.data, BLCKSZ);
+ ropaque = BTPageGetOpaque(rightpage);
When we reach this state of the logic, aka at the beginning of the
critical section, the right and left pages are in a correct state,
and your code is OK because we copy the contents of the right page
back into its legitimate place. What is not OK to me is that the
copy of the "temporary" right page back to "rbuf" is not explained. I
think that this deserves a comment, especially the part about ropaque
which is set once by this proposal, then manipulated while in the
critical section.
--
Michael
On 25/09/2025 7:35 AM, Michael Paquier wrote:
On Mon, Sep 22, 2025 at 07:44:18PM +0300, Константин Книжник wrote:
Attached please find rebased version of the patch with fixed mistypings.
I have looked at v3.
- leftpage = PageGetTempPage(origpage); + leftpage = leftpage_buf.data; + memcpy(leftpage, origpage, BLCKSZ); _bt_pageinit(leftpage, BufferGetPageSize(buf));What's the point of copying the contents of origpage to leftpage?
_bt_pageinit() is going to initialize leftpage (plus a few more things
set like the page LSN, etc.), so the memcpy should not be necessary,
no?
Sorry, memcpy is really not needed here.
+ rightpage = BufferGetPage(rbuf); + memcpy(rightpage, rightpage_buf.data, BLCKSZ); + ropaque = BTPageGetOpaque(rightpage);When we reach this state of the logic, aka at the beginning of the
critical section, the right and left pages are in a correct state,
and your code is OK because we copy the contents of the right page
back into its legitimate place. What is not OK to me is that the
copy of the "temporary" right page back to "rbuf" is not explained. I
think that this deserves a comment, especially the part about ropaque
which is set once by this proposal, then manipulated while in the
critical section.
I added the comment, please check that it is clear and complete.
Updated version of the patch is attached.
Attachments:
v4_bt_split.patchtext/plain; charset=UTF-8; name=v4_bt_split.patchDownload
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index be60781fc98..268f02bbdf0 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1473,6 +1473,8 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
Page origpage;
Page leftpage,
rightpage;
+ PGAlignedBlock leftpage_buf,
+ rightpage_buf;
BlockNumber origpagenumber,
rightpagenumber;
BTPageOpaque ropaque,
@@ -1543,8 +1545,7 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
firstrightoff = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
newitem, &newitemonleft);
- /* Allocate temp buffer for leftpage */
- leftpage = PageGetTempPage(origpage);
+ leftpage = leftpage_buf.data;
_bt_pageinit(leftpage, BufferGetPageSize(buf));
lopaque = BTPageGetOpaque(leftpage);
@@ -1707,19 +1708,13 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
/*
* Acquire a new right page to split into, now that left page has a new
- * high key. From here on, it's not okay to throw an error without
- * zeroing rightpage first. This coding rule ensures that we won't
- * confuse future VACUUM operations, which might otherwise try to re-find
- * a downlink to a leftover junk page as the page undergoes deletion.
- *
- * It would be reasonable to start the critical section just after the new
- * rightpage buffer is acquired instead; that would allow us to avoid
- * leftover junk pages without bothering to zero rightpage. We do it this
- * way because it avoids an unnecessary PANIC when either origpage or its
- * existing sibling page are corrupt.
+ * high key. To prevent confusion of VACUUM with orphan page, we also
+ * first fill in-memory copy of this page.
*/
rbuf = _bt_allocbuf(rel, heaprel);
- rightpage = BufferGetPage(rbuf);
+ rightpage = rightpage_buf.data;
+ memcpy(rightpage, BufferGetPage(rbuf), BLCKSZ);
+ memset(BufferGetPage(rbuf), 0, BLCKSZ);
rightpagenumber = BufferGetBlockNumber(rbuf);
/* rightpage was initialized by _bt_allocbuf */
ropaque = BTPageGetOpaque(rightpage);
@@ -1768,7 +1763,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
if (PageAddItem(rightpage, (Item) righthighkey, itemsz, afterrightoff,
false, false) == InvalidOffsetNumber)
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add high key to the right sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1816,7 +1810,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
if (!_bt_pgaddtup(leftpage, newitemsz, newitem, afterleftoff,
false))
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add new item to the left sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1829,7 +1822,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
if (!_bt_pgaddtup(rightpage, newitemsz, newitem, afterrightoff,
afterrightoff == minusinfoff))
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add new item to the right sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1843,7 +1835,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
{
if (!_bt_pgaddtup(leftpage, itemsz, dataitem, afterleftoff, false))
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add old item to the left sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1855,7 +1846,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
if (!_bt_pgaddtup(rightpage, itemsz, dataitem, afterrightoff,
afterrightoff == minusinfoff))
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add old item to the right sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1876,7 +1866,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
if (!_bt_pgaddtup(rightpage, newitemsz, newitem, afterrightoff,
afterrightoff == minusinfoff))
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add new item to the right sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1896,7 +1885,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
sopaque = BTPageGetOpaque(spage);
if (sopaque->btpo_prev != origpagenumber)
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("right sibling's left-link doesn't match: "
@@ -1939,9 +1927,13 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
* original. We need to do this before writing the WAL record, so that
* XLogInsert can WAL log an image of the page if necessary.
*/
- PageRestoreTempPage(leftpage, origpage);
+ memcpy(origpage, leftpage, BLCKSZ);
/* leftpage, lopaque must not be used below here */
+ rightpage = BufferGetPage(rbuf);
+ memcpy(rightpage, rightpage_buf.data, BLCKSZ);
+ ropaque = BTPageGetOpaque(rightpage);
+
MarkBufferDirty(buf);
MarkBufferDirty(rbuf);
On 25/09/2025 7:35 AM, Michael Paquier wrote:
On Mon, Sep 22, 2025 at 07:44:18PM +0300, Константин Книжник wrote:
Attached please find rebased version of the patch with fixed mistypings.
I have looked at v3.
- leftpage = PageGetTempPage(origpage); + leftpage = leftpage_buf.data; + memcpy(leftpage, origpage, BLCKSZ); _bt_pageinit(leftpage, BufferGetPageSize(buf));What's the point of copying the contents of origpage to leftpage?
_bt_pageinit() is going to initialize leftpage (plus a few more things
set like the page LSN, etc.), so the memcpy should not be necessary,
no?+ rightpage = BufferGetPage(rbuf); + memcpy(rightpage, rightpage_buf.data, BLCKSZ); + ropaque = BTPageGetOpaque(rightpage);When we reach this state of the logic, aka at the beginning of the
critical section, the right and left pages are in a correct state,
and your code is OK because we copy the contents of the right page
back into its legitimate place. What is not OK to me is that the
copy of the "temporary" right page back to "rbuf" is not explained. I
think that this deserves a comment, especially the part about ropaque
which is set once by this proposal, then manipulated while in the
critical section.
--
Michael
Sorry, I have attached wrong patch.
Attachments:
v5_bt_split.patchtext/plain; charset=UTF-8; name=v5_bt_split.patchDownload
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index be60781fc98..ea1b61c85ed 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1473,6 +1473,8 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
Page origpage;
Page leftpage,
rightpage;
+ PGAlignedBlock leftpage_buf,
+ rightpage_buf;
BlockNumber origpagenumber,
rightpagenumber;
BTPageOpaque ropaque,
@@ -1543,8 +1545,7 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
firstrightoff = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
newitem, &newitemonleft);
- /* Allocate temp buffer for leftpage */
- leftpage = PageGetTempPage(origpage);
+ leftpage = leftpage_buf.data;
_bt_pageinit(leftpage, BufferGetPageSize(buf));
lopaque = BTPageGetOpaque(leftpage);
@@ -1707,19 +1708,13 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
/*
* Acquire a new right page to split into, now that left page has a new
- * high key. From here on, it's not okay to throw an error without
- * zeroing rightpage first. This coding rule ensures that we won't
- * confuse future VACUUM operations, which might otherwise try to re-find
- * a downlink to a leftover junk page as the page undergoes deletion.
- *
- * It would be reasonable to start the critical section just after the new
- * rightpage buffer is acquired instead; that would allow us to avoid
- * leftover junk pages without bothering to zero rightpage. We do it this
- * way because it avoids an unnecessary PANIC when either origpage or its
- * existing sibling page are corrupt.
+ * high key. To prevent confusion of VACUUM with orphan page, we also
+ * first fill in-memory copy of this page.
*/
rbuf = _bt_allocbuf(rel, heaprel);
- rightpage = BufferGetPage(rbuf);
+ rightpage = rightpage_buf.data;
+ memcpy(rightpage, BufferGetPage(rbuf), BLCKSZ);
+ memset(BufferGetPage(rbuf), 0, BLCKSZ);
rightpagenumber = BufferGetBlockNumber(rbuf);
/* rightpage was initialized by _bt_allocbuf */
ropaque = BTPageGetOpaque(rightpage);
@@ -1768,7 +1763,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
if (PageAddItem(rightpage, (Item) righthighkey, itemsz, afterrightoff,
false, false) == InvalidOffsetNumber)
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add high key to the right sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1816,7 +1810,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
if (!_bt_pgaddtup(leftpage, newitemsz, newitem, afterleftoff,
false))
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add new item to the left sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1829,7 +1822,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
if (!_bt_pgaddtup(rightpage, newitemsz, newitem, afterrightoff,
afterrightoff == minusinfoff))
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add new item to the right sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1843,7 +1835,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
{
if (!_bt_pgaddtup(leftpage, itemsz, dataitem, afterleftoff, false))
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add old item to the left sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1855,7 +1846,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
if (!_bt_pgaddtup(rightpage, itemsz, dataitem, afterrightoff,
afterrightoff == minusinfoff))
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add old item to the right sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1876,7 +1866,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
if (!_bt_pgaddtup(rightpage, newitemsz, newitem, afterrightoff,
afterrightoff == minusinfoff))
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
elog(ERROR, "failed to add new item to the right sibling"
" while splitting block %u of index \"%s\"",
origpagenumber, RelationGetRelationName(rel));
@@ -1896,7 +1885,6 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
sopaque = BTPageGetOpaque(spage);
if (sopaque->btpo_prev != origpagenumber)
{
- memset(rightpage, 0, BufferGetPageSize(rbuf));
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg_internal("right sibling's left-link doesn't match: "
@@ -1939,9 +1927,19 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
* original. We need to do this before writing the WAL record, so that
* XLogInsert can WAL log an image of the page if necessary.
*/
- PageRestoreTempPage(leftpage, origpage);
+ memcpy(origpage, leftpage, BLCKSZ);
/* leftpage, lopaque must not be used below here */
+ /*
+ * Now we are in critical section and it is not needed to maintain temporary
+ * copy of right page in the local memory. We can copy it to the destination buffer.
+ * Unlike leftpage, rightpage and ropaque are still needed to complete update
+ * of the page, so we need to correctly reinitialize them.
+ */
+ rightpage = BufferGetPage(rbuf);
+ memcpy(rightpage, rightpage_buf.data, BLCKSZ);
+ ropaque = BTPageGetOpaque(rightpage);
+
MarkBufferDirty(buf);
MarkBufferDirty(rbuf);
On Thu, Sep 25, 2025 at 09:41:00AM +0300, Konstantin Knizhnik wrote:
Sorry, I have attached wrong patch.
Thanks, I was confused for a couple of minutes.
+ /*
+ * Now we are in critical section and it is not needed to maintain temporary
+ * copy of right page in the local memory. We can copy it to the destination buffer.
+ * Unlike leftpage, rightpage and ropaque are still needed to complete update
+ * of the page, so we need to correctly reinitialize them.
+ */
+ rightpage = BufferGetPage(rbuf);
+ memcpy(rightpage, rightpage_buf.data, BLCKSZ);
+ ropaque = BTPageGetOpaque(rightpage);
Hmm. This looks kind of explicit enough to document the purpose.
The wording could be simplified a bit more. I'll take it from there.
--
Michael
On Thu, Sep 25, 2025 at 03:45:21PM +0900, Michael Paquier wrote:
Hmm. This looks kind of explicit enough to document the purpose.
The wording could be simplified a bit more. I'll take it from there.
Reworded a bit more the comments, and applied on HEAD. There could be
an argument for back-patching, I guess, but the lack of complaints is
also an indicator to not do so, and VACUUM is able to handle the
post-failure cleanup should an ERROR happen in flight when splitting
a page.
--
Michael