Fixes for two separate bugs in nbtree VACUUM's page deletion
Attached are two patches, both of which are fixes for bugs in nbtree
VACUUM page deletion.
The first is a fix for a bug in commit 857f9c36cda. The immediate issue is
that the code that maintains the oldest btpo.xact in the index
accesses the special area of pages without holding a buffer pin. More
fundamentally, the logic for examining pages that are deleted by the
current VACUUM operation for the purposes of maintaining the oldest
btpo.xact needs to be pushed down into the guts of the page deletion
code. This has been described on other threads [1]/messages/by-id/CAH2-Wz=mWPBLZ2cr95cBV=kzPav8OOBtkHTfg+-+AUiozANK0w@mail.gmail.com[2]/messages/by-id/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com -- Peter Geoghegan, but I'm
starting a new one to focus attention on the bugs themselves (any
discussion of Valgrind instrumentation can remain on the other
thread).
The second fix is a spin-off of the first. It fixes a much less
serious issue that is present in all supported versions (it has
probably been around forever). The issue is with undercounting in the
"%u index pages have been deleted" figure reported by VACUUM VERBOSE.
For example, suppose I delete most of the tuples from a table with a
B-Tree index, leading to lots of empty pages that are candidates for
deletion (the specifics really aren't very important). I then run
VACUUM VERBOSE, and can see something like this:
4900 index pages have been deleted, 0 are currently reusable.
If I immediately run another VACUUM VERBOSE, I can see this:
4924 index pages have been deleted, 4924 are currently reusable.
It makes sense that the pages weren't reusable in the first VACUUM,
but were in the second VACUUM. But where did the extra 24 pages come
from? That doesn't make any sense at all. With the second patch
applied, the same test case shows the correct number ("4924 index
pages have been deleted") consistently. See the patch for details of
how the accounting is wrong. (The short version is that we need to
push the accounting down into the guts of page deletion, which is why
the second patch relies on the first patch.)
I would like to backpatch both patches to all branches that have
commit 857f9c36cda -- v11, v12, and master. The second issue isn't
serious, but it seems worth keeping v11+ in sync in this area. Note
that any backpatch theoretically creates an ABI break for callers of
the _bt_pagedel() function. Perhaps I could get around this by using
global state in the back branches or something, but that's ugly as
sin. Besides, I have a very hard time imagining an extension that
feels entitled to call _bt_pagedel(). There are all kinds of ways in
which that's a terrible idea. And it's hard to imagine how that could
ever seem useful. I'd like to hear what other people think about this
ABI issue in particular, though.
[1]: /messages/by-id/CAH2-Wz=mWPBLZ2cr95cBV=kzPav8OOBtkHTfg+-+AUiozANK0w@mail.gmail.com
[2]: /messages/by-id/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com
--
Peter Geoghegan
Attachments:
v1-0002-Fix-undercounting-in-VACUUM-VERBOSE-output.patch (application/octet-stream)
From 97b7c9ca6db58a7635f548f31a50e3d0868168ae Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sun, 26 Apr 2020 17:36:50 -0700
Subject: [PATCH v1 2/2] Fix undercounting in VACUUM VERBOSE output.
The logic for determining how many nbtree pages are deleted in an index
must compensate for the fact that pages deleted by the current VACUUM
might be visited a second time by the outermost btvacuumscan() physical
order scan.  btvacuumpage() avoided double-counting by only counting a
single page as deleted when one or more pages were deleted.
This approach was better than simply accepting the return value of
_bt_pagedel() uncritically, but it was still slightly wrong. It failed
to account for the fact that _bt_pagedel() can legitimately delete pages
that the physical order scan will not visit again. This resulted in
undercounting in VACUUM VERBOSE output: the "%u index pages have been
deleted" figure often failed to include a small number of deleted
internal pages (as well as empty cousin leaf pages that couldn't be
deleted in an earlier _bt_pagedel() call due to the "rightmost page of a
parent" restriction).
Fix the accounting by teaching _bt_pagedel() about its caller's
requirements. It now counts pages that it knows the caller won't visit
again, and doesn't count pages that it knows the caller will visit
again. btvacuumpage() can now completely trust _bt_pagedel()'s return
value.
---
src/include/access/nbtree.h | 4 +--
src/backend/access/nbtree/nbtpage.c | 39 ++++++++++++++++++++++-------
src/backend/access/nbtree/nbtree.c | 33 +++++++++++++-----------
3 files changed, 50 insertions(+), 26 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 7bccd7ecc3..97c96d0a3a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1080,8 +1080,8 @@ extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *deletable, int ndeletable,
Relation heapRel);
-extern int _bt_pagedel(Relation rel, Buffer leafbuf,
- TransactionId *oldestBtpoXact);
+extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf,
+ TransactionId *oldestBtpoXact);
/*
* prototypes for functions in nbtsearch.c
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 55175eb5c7..9c30196fc3 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -38,8 +38,10 @@ static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
static bool _bt_mark_page_halfdead(Relation rel, Buffer leafbuf,
BTStack stack);
static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
+ BlockNumber vacscanblk,
bool *rightsib_empty,
- TransactionId *oldestBtpoXact);
+ TransactionId *oldestBtpoXact,
+ uint32 *ndeleted);
static TransactionId _bt_xid_horizon(Relation rel, Relation heapRel, Page page,
OffsetNumber *deletable, int ndeletable);
static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
@@ -1489,7 +1491,10 @@ _bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
*
* Returns the number of pages successfully deleted (zero if page cannot
* be deleted now; could be more than one if parent or right sibling pages
- * were deleted too).
+ * were deleted too). The final count is calculated in a way that helps our
+ * caller to avoid double-counting. When we know that the outermost
+ * btvacuumscan physical-order loop over the index relation will go on to find
+ * a page that we delete now, we don't count it in return value.
*
* Maintains *oldestBtpoXact for caller for any pages that are deleted here
* (including caller's target page and any other internal pages or right
@@ -1501,10 +1506,11 @@ _bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
* carefully, it's better to run it in a temp context that can be reset
* frequently.
*/
-int
+uint32
_bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
{
- int ndeleted = 0;
+ uint32 ndeleted = 0;
+ BlockNumber vacscanblk;
BlockNumber rightsib;
bool rightsib_empty;
Page page;
@@ -1521,6 +1527,13 @@ _bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
*/
BTStack stack = NULL;
+ /*
+ * Save original leafbuf block number from caller, to avoid
+ * double-counting pages. Only blocks that we delete that are <=
+ * vacscanblk get counted in ndeleted return value.
+ */
+ vacscanblk = BufferGetBlockNumber(leafbuf);
+
for (;;)
{
page = BufferGetPage(leafbuf);
@@ -1710,13 +1723,13 @@ _bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
while (P_ISHALFDEAD(opaque))
{
/* Check for interrupts in _bt_unlink_halfdead_page */
- if (!_bt_unlink_halfdead_page(rel, leafbuf, &rightsib_empty,
- oldestBtpoXact))
+ if (!_bt_unlink_halfdead_page(rel, leafbuf, vacscanblk,
+ &rightsib_empty, oldestBtpoXact,
+ &ndeleted))
{
/* _bt_unlink_halfdead_page failed, released buffer */
return ndeleted;
}
- ndeleted++;
}
Assert(P_ISLEAF(opaque) && P_ISDELETED(opaque));
@@ -1975,8 +1988,9 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
* to avoid having to reacquire a lock we already released).
*/
static bool
-_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty,
- TransactionId *oldestBtpoXact)
+_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber vacscanblk,
+ bool *rightsib_empty, TransactionId *oldestBtpoXact,
+ uint32 *ndeleted)
{
BlockNumber leafblkno = BufferGetBlockNumber(leafbuf);
BlockNumber leafleftsib;
@@ -2372,6 +2386,13 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty,
TransactionIdPrecedes(opaque->btpo.xact, *oldestBtpoXact))
*oldestBtpoXact = opaque->btpo.xact;
+ /*
+ * If btvacuumscan physical-order loop won't visit this page again and
+ * count it then, report it as being deleted here
+ */
+ if (target <= vacscanblk)
+ (*ndeleted)++;
+
/*
* Release the target, if it was not the leaf block. The leaf is always
* kept locked.
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 8ecd632446..43039c6d54 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1322,10 +1322,11 @@ restart:
else
{
/*
- * If the page has been split during this vacuum cycle, it seems
- * worth expending a write to clear btpo_cycleid even if we don't
- * have any deletions to do. (If we do, _bt_delitems_vacuum takes
- * care of this.) This ensures we won't process the page again.
+ * If the leaf page has been split during this vacuum cycle, it
+ * seems worth expending a write to clear btpo_cycleid even if we
+ * don't have any deletions to do. (If we do, _bt_delitems_vacuum
+ * takes care of this.) This ensures we won't process the page
+ * again.
*
* We treat this like a hint-bit update because there's no need to
* WAL-log it.
@@ -1340,11 +1341,13 @@ restart:
}
/*
- * If it's now empty, try to delete; else count the live tuples (live
- * table TIDs in posting lists are counted as separate live tuples).
- * We don't delete when recursing, though, to avoid putting entries
- * into freePages out-of-order (doesn't seem worth any extra code to
- * handle the case).
+ * If the leaf page is now empty, try to delete; else count the live
+ * tuples (live table TIDs in posting lists are counted as separate
+ * live tuples). We don't delete when recursing, though, to avoid
+ * putting entries into freePages out-of-order. Doesn't seem worth
+ * any extra code to handle the case, since it's very unlikely that
+ * we'll be able to delete a leaf page that's the right half of a page
+ * split that occurred after the current VACUUM operation began.
*/
if (minoff > maxoff)
delete_now = (blkno == orig_blkno);
@@ -1357,17 +1360,17 @@ restart:
if (delete_now)
{
MemoryContext oldcontext;
- int ndel;
/* Run pagedel in a temp context to avoid memory leakage */
MemoryContextReset(vstate->pagedelcontext);
oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
- ndel = _bt_pagedel(rel, buf, &vstate->oldestBtpoXact);
-
- /* count only this page, else may double-count parent */
- if (ndel)
- stats->pages_deleted++;
+ /*
+ * We trust the _bt_pagedel return value because it does not include
+ * pages that a future call here from btvacuumscan is expected to
+ * count. There will be no double-counting.
+ */
+ stats->pages_deleted += _bt_pagedel(rel, buf, &vstate->oldestBtpoXact);
MemoryContextSwitchTo(oldcontext);
/* pagedel released buffer, so we shouldn't */
--
2.25.1
v1-0001-Fix-bug-in-nbtree-VACUUM-skip-full-scan-feature.patch (application/octet-stream)
From 9f00e6e57f0c4899570b323949a240d40c6366fb Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Sat, 25 Apr 2020 18:58:37 -0700
Subject: [PATCH v1 1/2] Fix bug in nbtree VACUUM "skip full scan" feature.
Commit 857f9c36cda (which taught nbtree to skip VACUUM's full index scan
during cleanup of nbtree indexes in some situations) made nbtree VACUUM
store the oldest btpo_xact in the nbtree metapage for the index as a
whole. This was based on the oldest observed value across every deleted
index page. However, it failed to handle pages that the current VACUUM
operation deletes correctly.
The most immediate problem with the handling of pages deleted by the
current VACUUM operation was that the special area of the page was
examined without a buffer pin. More fundamentally, the handling failed
to account for the full range of _bt_pagedel() behaviors. For example,
_bt_pagedel() sometimes deletes internal pages in passing, as part of
deleting an entire subtree with btvacuumpage() caller's page as the leaf
level page.
To fix, push down the logic for maintaining the oldest btpo_xact to
_bt_pagedel(). btvacuumpage() is strictly responsible for pages that
were fully deleted by a previous VACUUM operation, while _bt_pagedel()
is strictly responsible for pages that are deleted by the current VACUUM
operation (which includes half-dead pages from a previous interrupted
VACUUM operation).
Discussion: https://postgr.es/m/CAH2-Wz=mWPBLZ2cr95cBV=kzPav8OOBtkHTfg+-+AUiozANK0w@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com
Backpatch: 11-, where the "skip full scan" feature was introduced.
---
src/include/access/nbtree.h | 3 +-
src/backend/access/nbtree/nbtpage.c | 133 ++++++++++++++++++++--------
src/backend/access/nbtree/nbtree.c | 21 ++---
3 files changed, 111 insertions(+), 46 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9a3acd26b7..7bccd7ecc3 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1080,7 +1080,8 @@ extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *deletable, int ndeletable,
Relation heapRel);
-extern int _bt_pagedel(Relation rel, Buffer buf);
+extern int _bt_pagedel(Relation rel, Buffer leafbuf,
+ TransactionId *oldestBtpoXact);
/*
* prototypes for functions in nbtsearch.c
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index f5962f6163..55175eb5c7 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -35,9 +35,11 @@
#include "utils/snapmgr.h"
static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
-static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
+static bool _bt_mark_page_halfdead(Relation rel, Buffer leafbuf,
+ BTStack stack);
static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
- bool *rightsib_empty);
+ bool *rightsib_empty,
+ TransactionId *oldestBtpoXact);
static TransactionId _bt_xid_horizon(Relation rel, Relation heapRel, Page page,
OffsetNumber *deletable, int ndeletable);
static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
@@ -1470,27 +1472,37 @@ _bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
}
/*
- * _bt_pagedel() -- Delete a page from the b-tree, if legal to do so.
+ * _bt_pagedel() -- Delete a leaf page from the b-tree, if legal to do so.
*
- * This action unlinks the page from the b-tree structure, removing all
+ * This action unlinks the leaf page from the b-tree structure, removing all
* pointers leading to it --- but not touching its own left and right links.
* The page cannot be physically reclaimed right away, since other processes
* may currently be trying to follow links leading to the page; they have to
* be allowed to use its right-link to recover. See nbtree/README.
*
* On entry, the target buffer must be pinned and locked (either read or write
- * lock is OK). This lock and pin will be dropped before exiting.
+ * lock is OK). The page must be an empty leaf-page, which may be half-dead
+ * already (a half-dead page should only be passed to us when an earlier
+ * VACUUM operation was interrupted, though). Note in particular that caller
+ * should never pass a buffer containing an existing deleted page here. The
+ * lock and pin on caller's buffer will be dropped before we return.
*
* Returns the number of pages successfully deleted (zero if page cannot
- * be deleted now; could be more than one if parent or sibling pages were
- * deleted too).
+ * be deleted now; could be more than one if parent or right sibling pages
+ * were deleted too).
+ *
+ * Maintains *oldestBtpoXact for caller for any pages that are deleted here
+ * (including caller's target page and any other internal pages or right
+ * sibling pages that we manage to delete). Caller is responsible for
+ * maintaining *oldestBtpoXact in the case of pages that were deleted by a
+ * previous VACUUM.
*
* NOTE: this leaks memory. Rather than trying to clean up everything
* carefully, it's better to run it in a temp context that can be reset
* frequently.
*/
int
-_bt_pagedel(Relation rel, Buffer buf)
+_bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
{
int ndeleted = 0;
BlockNumber rightsib;
@@ -1511,14 +1523,21 @@ _bt_pagedel(Relation rel, Buffer buf)
for (;;)
{
- page = BufferGetPage(buf);
+ page = BufferGetPage(leafbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
/*
* Internal pages are never deleted directly, only as part of deleting
* the whole branch all the way down to leaf level.
+ *
+ * Also check for deleted pages here. Caller never passes us a fully
+ * deleted page. Only VACUUM can delete pages, so there can't have
+ * been a concurrent deletion. Assume that we reached any deleted
+ * page encountered here by following a sibling link, and that the
+ * index is corrupt.
*/
- if (!P_ISLEAF(opaque))
+ Assert(!P_ISDELETED(opaque));
+ if (!P_ISLEAF(opaque) || P_ISDELETED(opaque))
{
/*
* Pre-9.4 page deletion only marked internal pages as half-dead,
@@ -1537,13 +1556,21 @@ _bt_pagedel(Relation rel, Buffer buf)
errmsg("index \"%s\" contains a half-dead internal page",
RelationGetRelationName(rel)),
errhint("This can be caused by an interrupted VACUUM in version 9.3 or older, before upgrade. Please REINDEX it.")));
- _bt_relbuf(rel, buf);
+
+ if (P_ISDELETED(opaque))
+ ereport(LOG,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("index \"%s\" contains a deleted page with a live sibling link",
+ RelationGetRelationName(rel))));
+
+ _bt_relbuf(rel, leafbuf);
return ndeleted;
}
/*
* We can never delete rightmost pages nor root pages. While at it,
- * check that page is not already deleted and is empty.
+ * check that page is empty, since it's possible that the page was
+ * empty a moment ago, but has since had some inserts.
*
* To keep the algorithm simple, we also never delete an incompletely
* split page (they should be rare enough that this doesn't make any
@@ -1558,14 +1585,14 @@ _bt_pagedel(Relation rel, Buffer buf)
* to. On subsequent iterations, we know we stepped right from a page
* that passed these tests, so it's OK.
*/
- if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
+ if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) ||
P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page) ||
P_INCOMPLETE_SPLIT(opaque))
{
/* Should never fail to delete a half-dead page */
Assert(!P_ISHALFDEAD(opaque));
- _bt_relbuf(rel, buf);
+ _bt_relbuf(rel, leafbuf);
return ndeleted;
}
@@ -1603,7 +1630,7 @@ _bt_pagedel(Relation rel, Buffer buf)
* To avoid deadlocks, we'd better drop the leaf page lock
* before going further.
*/
- LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(leafbuf, BUFFER_LOCK_UNLOCK);
/*
* Fetch the left sibling, to check that it's not marked with
@@ -1627,10 +1654,10 @@ _bt_pagedel(Relation rel, Buffer buf)
* incompletely-split page to be split again. So we don't
* need to walk right here.
*/
- if (lopaque->btpo_next == BufferGetBlockNumber(buf) &&
+ if (lopaque->btpo_next == BufferGetBlockNumber(leafbuf) &&
P_INCOMPLETE_SPLIT(lopaque))
{
- ReleaseBuffer(buf);
+ ReleaseBuffer(leafbuf);
_bt_relbuf(rel, lbuf);
return ndeleted;
}
@@ -1646,16 +1673,26 @@ _bt_pagedel(Relation rel, Buffer buf)
_bt_relbuf(rel, lbuf);
/*
- * Re-lock the leaf page, and start over, to re-check that the
- * page can still be deleted.
+ * Re-lock the leaf page, and start over to use our stack
+ * within _bt_mark_page_halfdead(). We must do it that way
+ * because it's possible that page can no longer be deleted.
+ * We need to recheck.
*/
- LockBuffer(buf, BT_WRITE);
+ LockBuffer(leafbuf, BT_WRITE);
continue;
}
- if (!_bt_mark_page_halfdead(rel, buf, stack))
+ /*
+ * See if it's safe to delete the leaf page, and determine how
+ * many parent/internal pages above the leaf level will be
+ * deleted. If it's safe then _bt_mark_page_halfdead will also
+ * perform the first phase of deletion, including marking the leaf
+ * page half-dead.
+ */
+ Assert(P_ISLEAF(opaque) && !P_IGNORE(opaque));
+ if (!_bt_mark_page_halfdead(rel, leafbuf, stack))
{
- _bt_relbuf(rel, buf);
+ _bt_relbuf(rel, leafbuf);
return ndeleted;
}
}
@@ -1663,23 +1700,32 @@ _bt_pagedel(Relation rel, Buffer buf)
/*
* Then unlink it from its siblings. Each call to
* _bt_unlink_halfdead_page unlinks the topmost page from the branch,
- * making it shallower. Iterate until the leaf page is gone.
+ * making it shallower. Iterate until the leaf page is deleted.
+ *
+ * _bt_unlink_halfdead_page should never fail, since we established
+ * that deletion is generally safe in _bt_mark_page_halfdead.
*/
rightsib_empty = false;
+ Assert(P_ISLEAF(opaque) && P_ISHALFDEAD(opaque));
while (P_ISHALFDEAD(opaque))
{
- /* will check for interrupts, once lock is released */
- if (!_bt_unlink_halfdead_page(rel, buf, &rightsib_empty))
+ /* Check for interrupts in _bt_unlink_halfdead_page */
+ if (!_bt_unlink_halfdead_page(rel, leafbuf, &rightsib_empty,
+ oldestBtpoXact))
{
- /* _bt_unlink_halfdead_page already released buffer */
+ /* _bt_unlink_halfdead_page failed, released buffer */
return ndeleted;
}
ndeleted++;
}
+ Assert(P_ISLEAF(opaque) && P_ISDELETED(opaque));
+ Assert(TransactionIdFollowsOrEquals(opaque->btpo.xact,
+ *oldestBtpoXact));
+
rightsib = opaque->btpo_next;
- _bt_relbuf(rel, buf);
+ _bt_relbuf(rel, leafbuf);
/*
* Check here, as calling loops will have locks held, preventing
@@ -1705,7 +1751,7 @@ _bt_pagedel(Relation rel, Buffer buf)
if (!rightsib_empty)
break;
- buf = _bt_getbuf(rel, rightsib, BT_WRITE);
+ leafbuf = _bt_getbuf(rel, rightsib, BT_WRITE);
}
return ndeleted;
@@ -1909,9 +1955,19 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
* of the whole branch, including the leaf page itself, iterate until the
* leaf page is deleted.
*
- * Returns 'false' if the page could not be unlinked (shouldn't happen).
- * If the (current) right sibling of the page is empty, *rightsib_empty is
- * set to true.
+ * Returns 'false' if the page could not be unlinked (shouldn't happen). If
+ * the right sibling of the current target page is empty, *rightsib_empty is
+ * set to true, allowing caller to delete the target's right sibling page in
+ * passing. Note that *rightsib_empty is only actually used by caller when
+ * target page is leafbuf, following last call here for leafbuf/the subtree
+ * containing leafbuf. (We always set *rightsib_empty for caller, just to be
+ * consistent.)
+ *
+ * We maintain *oldestBtpoXact for VACUUM operation as a whole here. This is
+ * necessary because we conservatively assume that there must be a call to
+ * ReadNewTransactionId() for each and every page that is deleted, including
+ * when we delete a series of pages that are all part of the same empty
+ * subtree -- see comments about the underlying assumption below.
*
* Must hold pin and lock on leafbuf at entry (read or write doesn't matter).
* On success exit, we'll be holding pin and write lock. On failure exit,
@@ -1919,7 +1975,8 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
* to avoid having to reacquire a lock we already released).
*/
static bool
-_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
+_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty,
+ TransactionId *oldestBtpoXact)
{
BlockNumber leafblkno = BufferGetBlockNumber(leafbuf);
BlockNumber leafleftsib;
@@ -2057,9 +2114,9 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
lbuf = InvalidBuffer;
/*
- * Next write-lock the target page itself. It should be okay to take just
- * a write lock not a superexclusive lock, since no scans would stop on an
- * empty page.
+ * Next write-lock the target page itself. It's okay to take a write lock
+ * rather than a superexclusive lock, since no scan will stop on an empty
+ * page.
*/
LockBuffer(buf, BT_WRITE);
page = BufferGetPage(buf);
@@ -2204,6 +2261,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
*/
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ Assert(P_ISHALFDEAD(opaque) || !P_ISLEAF(opaque));
opaque->btpo_flags &= ~BTP_HALF_DEAD;
opaque->btpo_flags |= BTP_DELETED;
opaque->btpo.xact = ReadNewTransactionId();
@@ -2309,6 +2367,11 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
_bt_relbuf(rel, lbuf);
_bt_relbuf(rel, rbuf);
+ /* maintain oldestBtpoXact */
+ if (!TransactionIdIsValid(*oldestBtpoXact) ||
+ TransactionIdPrecedes(opaque->btpo.xact, *oldestBtpoXact))
+ *oldestBtpoXact = opaque->btpo.xact;
+
/*
* Release the target, if it was not the leaf block. The leaf is always
* kept locked.
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 36294789f3..8ecd632446 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1137,24 +1137,30 @@ restart:
/* Page is valid, see what to do with it */
if (_bt_page_recyclable(page))
{
- /* Okay to recycle this page */
+ /* Okay to recycle this page (which could be leaf or internal) */
RecordFreeIndexPage(rel, blkno);
vstate->totFreePages++;
stats->pages_deleted++;
}
else if (P_ISDELETED(opaque))
{
- /* Already deleted, but can't recycle yet */
+ /*
+ * Already deleted page (which could be leaf or internal). Can't
+ * recycle yet.
+ */
stats->pages_deleted++;
- /* Update the oldest btpo.xact */
+ /* Maintain the oldest btpo.xact */
if (!TransactionIdIsValid(vstate->oldestBtpoXact) ||
TransactionIdPrecedes(opaque->btpo.xact, vstate->oldestBtpoXact))
vstate->oldestBtpoXact = opaque->btpo.xact;
}
else if (P_ISHALFDEAD(opaque))
{
- /* Half-dead, try to delete */
+ /*
+ * Half-dead leaf page. Try to delete now. Might update
+ * oldestBtpoXact and pages_deleted below.
+ */
delete_now = true;
}
else if (P_ISLEAF(opaque))
@@ -1357,16 +1363,11 @@ restart:
MemoryContextReset(vstate->pagedelcontext);
oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
- ndel = _bt_pagedel(rel, buf);
+ ndel = _bt_pagedel(rel, buf, &vstate->oldestBtpoXact);
/* count only this page, else may double-count parent */
if (ndel)
- {
stats->pages_deleted++;
- if (!TransactionIdIsValid(vstate->oldestBtpoXact) ||
- TransactionIdPrecedes(opaque->btpo.xact, vstate->oldestBtpoXact))
- vstate->oldestBtpoXact = opaque->btpo.xact;
- }
MemoryContextSwitchTo(oldcontext);
/* pagedel released buffer, so we shouldn't */
--
2.25.1
On Mon, Apr 27, 2020 at 11:02 AM Peter Geoghegan <pg@bowt.ie> wrote:
I would like to backpatch both patches to all branches that have
commit 857f9c36cda -- v11, v12, and master. The second issue isn't
serious, but it seems worth keeping v11+ in sync in this area. Note
that any backpatch theoretically creates an ABI break for callers of
the _bt_pagedel() function.
I plan to commit both fixes, while backpatching to v11, v12 and the
master branch on Friday -- barring objections.
I'm almost sure that the ABI thing is a non-issue, but it would be
nice to get that seconded.
--
Peter Geoghegan
On Tue, 28 Apr 2020 at 11:35, Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Apr 27, 2020 at 11:02 AM Peter Geoghegan <pg@bowt.ie> wrote:
I would like to backpatch both patches to all branches that have
commit 857f9c36cda -- v11, v12, and master. The second issue isn't
serious, but it seems worth keeping v11+ in sync in this area. Note
that any backpatch theoretically creates an ABI break for callers of
the _bt_pagedel() function.
I plan to commit both fixes, while backpatching to v11, v12 and the
master branch on Friday -- barring objections.
Thank you for the patches and for starting the new thread.
I agree with both patches.
For the first fix, it seems better to push down the logic to the page
deletion code, as your 0001 patch does. The following change makes the
page deletion code emit a LOG message indicating index corruption when
a deleted page is passed. But we used to simply ignore this case.
@@ -1511,14 +1523,21 @@ _bt_pagedel(Relation rel, Buffer buf)
for (;;)
{
- page = BufferGetPage(buf);
+ page = BufferGetPage(leafbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
/*
* Internal pages are never deleted directly, only as part of deleting
* the whole branch all the way down to leaf level.
+ *
+ * Also check for deleted pages here. Caller never passes us a fully
+ * deleted page. Only VACUUM can delete pages, so there can't have
+ * been a concurrent deletion. Assume that we reached any deleted
+ * page encountered here by following a sibling link, and that the
+ * index is corrupt.
*/
- if (!P_ISLEAF(opaque))
+ Assert(!P_ISDELETED(opaque));
+ if (!P_ISLEAF(opaque) || P_ISDELETED(opaque))
{
/*
* Pre-9.4 page deletion only marked internal pages as half-dead,
@@ -1537,13 +1556,21 @@ _bt_pagedel(Relation rel, Buffer buf)
errmsg("index \"%s\" contains a half-dead
internal page",
RelationGetRelationName(rel)),
errhint("This can be caused by an interrupted
VACUUM in version 9.3 or older, before upgrade. Please REINDEX
it.")));
- _bt_relbuf(rel, buf);
+
+ if (P_ISDELETED(opaque))
+ ereport(LOG,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("index \"%s\" contains a deleted page
with a live sibling link",
+ RelationGetRelationName(rel))));
+
+ _bt_relbuf(rel, leafbuf);
return ndeleted;
}
I agree with this change, but since I'm not sure it directly relates
to this bug, I wonder if it could be a separate patch.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Apr 28, 2020 at 12:21 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
I agree with both patches.
Thanks for the review.
For the first fix it seems better to push down the logic to the page
deletion code as your 0001 patch does so. The following change changes
the page deletion code so that it emits LOG message indicating the
index corruption when a deleted page is passed. But we used to ignore
in this case.
I agree with this change but since I'm not sure this change directly
relates to this bug I wonder if it can be a separate patch.
I am not surprised that you are concerned about this. That was the
hardest decision I had to make when writing the patch.
The central idea in the 0001-* patch (as I say at the end of the
commit message) is that we have a strict rule about where we handle
maintaining the oldest btpo.xact for already-deleted pages, and where
we handle it for pages that become deleted. The rule is this:
btvacuumpage() is strictly responsible for the former category (as
well as totally live pages), while _bt_pagedel() is responsible for
the latter category (this includes pages that started out as half-dead
but become deleted).
In order for this to work, _bt_pagedel() must not ever see a deleted
page that it did not delete itself. If I didn't explicitly take a
strict position on this, then I suppose that I'd have to have similar
handling for maintaining the oldest btpo.xact for existing deleted
pages encountered within _bt_pagedel(). But that doesn't make any
sense, even with corruption, so I took a strict position. Even with
corruption, how could we encounter an *existing* deleted page in
_bt_pagedel() but not encounter the same page within some call to
btvacuumpage() made from btvacuumscan()? In other words, how can we
fail to maintain the oldest btpo.xact for deleted pages even if we
assume that the index has a corrupt page with sibling links that point
to a fully deleted page? It seems impossible, even in this extreme,
contrived scenario.
Then there is the separate question of the new log message about
corruption. It's not too hard to see why _bt_pagedel() cannot ever
access an existing deleted page unless there is corruption:
* _bt_pagedel()'s only caller (the btvacuumpage() caller) will simply
never pass a deleted page -- it handles those directly.
* _bt_pagedel() can only access additional leaf pages to consider
deleting them by following the right sibling link of the
btvacuumpage() caller's leaf page (or possibly some page even further
to the right if it ends up deleting many leaf pages in one go).
Following right links and finding a deleted page can only happen when
there is a concurrent page deletion by VACUUM, since the sibling links
to the deleted page are changed by the second stage of deletion (i.e.
by the atomic action where the page is actually marked deleted,
within _bt_unlink_halfdead_page()).
* There cannot be a concurrent VACUUM, though, since _bt_pagedel()
runs in VACUUM. (Note that we already depend on this for correctness
in _bt_unlink_halfdead_page(), which has a comment that says "a
half-dead page cannot resurrect because there can be only one vacuum
process running at a time".)
The strict position seems justified. It really isn't that strict. I'm
not concerned about the new LOG message concerning corruption being
annoying or wrong.
--
Peter Geoghegan
On Tue, Apr 28, 2020 at 12:21 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
For the first fix it seems better to push down the logic to the page
deletion code, as your 0001 patch does. The following change alters
the page deletion code so that it emits a LOG message indicating
index corruption when a deleted page is passed. But we used to ignore
this case.
Attached is v2 of the patch set, which includes a new patch that I
want to target the master branch with. This patch (v2-0003-*)
refactors btvacuumscan().
This revision also changed the first bugfix patch. I have simplified
some of the details in nbtree.c that were added by commit 857f9c36cda.
Can't we call _bt_update_meta_cleanup_info() at a lower level, as the
patch now does? AFAICT it makes more sense to just call it once, in
btvacuumscan(). Please let me know what you think of those changes.
The big change in the new master-only refactoring patch (v2-0003-*) is
that I now treat a call to btvacuumpage() that has to
"backtrack/recurse" and doesn't find a page that looks like the
expected right half of a page split (a page split that occurred since
the VACUUM operation began) as indicating corruption. This corruption
is logged. I believe that we should never see this happen, for reasons
that are similar to the reasons why _bt_pagedel() never finds a
deleted page when moving right, as I went into in my recent e-mail to
this thread. I think that this is worth tightening up for Postgres 13.
I will hold off on committing v2-0003-*, since I need to nail down the
reasoning for treating this condition as corruption. Plus it's not
urgent. I think that the general direction taken in v2-0003-* is the
right one, in any case. The "recursion" in btvacuumpage() doesn't make
much sense -- it obscures what's really going on IMV. Also, using the
variable name "scanblkno" at every level -- btvacuumscan(),
btvacuumpage(), and _bt_pagedel() -- makes it clear that it's the same
block at all levels of the code. And that the "backtrack/recursion"
case doesn't kill tuples from the "scanblkno" block (and doesn't
attempt to call _bt_pagedel(), which is designed only to handle blocks
passed by btvacuumpage() when they are the current "scanblkno"). It
seems unlikely that the VACUUM VERBOSE pages deleted accounting bug
would have happened if the code was already structured this way.
--
Peter Geoghegan
Attachments:
v2-0001-Fix-bug-in-nbtree-VACUUM-skip-full-scan-feature.patch (application/octet-stream)
From 7bf56c30e5fa0cb2c7b10fb658ced3d9d69159a0 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 28 Apr 2020 19:12:41 -0700
Subject: [PATCH v2 1/3] Fix bug in nbtree VACUUM "skip full scan" feature.
Commit 857f9c36cda (which taught nbtree to skip VACUUM's full index scan
during cleanup of nbtree indexes in some situations) made nbtree VACUUM
store the oldest btpo_xact in the nbtree metapage for the index as a
whole. This was based on the oldest observed value across every deleted
index page. However, it failed to correctly handle pages that the
current VACUUM operation deletes.
The most immediate problem with the handling of pages deleted by the
current VACUUM operation was that the special area of the page was
examined without a buffer pin. More fundamentally, the handling failed
to account for the full range of _bt_pagedel() behaviors. For example,
_bt_pagedel() sometimes deletes internal pages in passing, as part of
deleting an entire subtree with btvacuumpage() caller's page as the leaf
level page.
To fix, push down the logic for maintaining the oldest btpo_xact to
_bt_pagedel(). btvacuumpage() is strictly responsible for pages that
were fully deleted by a previous VACUUM operation, while _bt_pagedel()
is strictly responsible for pages that are deleted by the current VACUUM
operation (which includes half-dead pages from a previous interrupted
VACUUM operation).
Discussion: https://postgr.es/m/CAH2-Wz=mWPBLZ2cr95cBV=kzPav8OOBtkHTfg+-+AUiozANK0w@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com
Backpatch: 11-, where the "skip full scan" feature was introduced.
---
src/include/access/nbtree.h | 3 +-
src/backend/access/nbtree/nbtpage.c | 134 ++++++++++++++++++++--------
src/backend/access/nbtree/nbtree.c | 85 ++++++++++--------
3 files changed, 147 insertions(+), 75 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9a3acd26b7..7bccd7ecc3 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1080,7 +1080,8 @@ extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *deletable, int ndeletable,
Relation heapRel);
-extern int _bt_pagedel(Relation rel, Buffer buf);
+extern int _bt_pagedel(Relation rel, Buffer leafbuf,
+ TransactionId *oldestBtpoXact);
/*
* prototypes for functions in nbtsearch.c
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index f5962f6163..c90a2012d5 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -35,9 +35,11 @@
#include "utils/snapmgr.h"
static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
-static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
+static bool _bt_mark_page_halfdead(Relation rel, Buffer leafbuf,
+ BTStack stack);
static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
- bool *rightsib_empty);
+ bool *rightsib_empty,
+ TransactionId *oldestBtpoXact);
static TransactionId _bt_xid_horizon(Relation rel, Relation heapRel, Page page,
OffsetNumber *deletable, int ndeletable);
static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
@@ -1470,27 +1472,37 @@ _bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
}
/*
- * _bt_pagedel() -- Delete a page from the b-tree, if legal to do so.
+ * _bt_pagedel() -- Delete a leaf page from the b-tree, if legal to do so.
*
- * This action unlinks the page from the b-tree structure, removing all
+ * This action unlinks the leaf page from the b-tree structure, removing all
* pointers leading to it --- but not touching its own left and right links.
* The page cannot be physically reclaimed right away, since other processes
* may currently be trying to follow links leading to the page; they have to
* be allowed to use its right-link to recover. See nbtree/README.
*
* On entry, the target buffer must be pinned and locked (either read or write
- * lock is OK). This lock and pin will be dropped before exiting.
+ * lock is OK). The page must be an empty leaf-page, which may be half-dead
+ * already (a half-dead page should only be passed to us when an earlier
+ * VACUUM operation was interrupted, though). Note in particular that caller
+ * should never pass a buffer containing an existing deleted page here. The
+ * lock and pin on caller's buffer will be dropped before we return.
*
* Returns the number of pages successfully deleted (zero if page cannot
- * be deleted now; could be more than one if parent or sibling pages were
- * deleted too).
+ * be deleted now; could be more than one if parent or right sibling pages
+ * were deleted too).
+ *
+ * Maintains *oldestBtpoXact for caller for any pages that are deleted here
+ * (including caller's target page and any other internal pages or right
+ * sibling pages that we manage to delete). Caller is responsible for
+ * maintaining *oldestBtpoXact in the case of pages that were deleted by a
+ * previous VACUUM.
*
* NOTE: this leaks memory. Rather than trying to clean up everything
* carefully, it's better to run it in a temp context that can be reset
* frequently.
*/
int
-_bt_pagedel(Relation rel, Buffer buf)
+_bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
{
int ndeleted = 0;
BlockNumber rightsib;
@@ -1511,14 +1523,21 @@ _bt_pagedel(Relation rel, Buffer buf)
for (;;)
{
- page = BufferGetPage(buf);
+ page = BufferGetPage(leafbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
/*
* Internal pages are never deleted directly, only as part of deleting
* the whole branch all the way down to leaf level.
+ *
+ * Also check for deleted pages here. Caller never passes us a fully
+ * deleted page. Only VACUUM can delete pages, so there can't have
+ * been a concurrent deletion. Assume that we reached any deleted
+ * page encountered here by following a sibling link, and that the
+ * index is corrupt.
*/
- if (!P_ISLEAF(opaque))
+ Assert(!P_ISDELETED(opaque));
+ if (!P_ISLEAF(opaque) || P_ISDELETED(opaque))
{
/*
* Pre-9.4 page deletion only marked internal pages as half-dead,
@@ -1537,13 +1556,22 @@ _bt_pagedel(Relation rel, Buffer buf)
errmsg("index \"%s\" contains a half-dead internal page",
RelationGetRelationName(rel)),
errhint("This can be caused by an interrupted VACUUM in version 9.3 or older, before upgrade. Please REINDEX it.")));
- _bt_relbuf(rel, buf);
+
+ if (P_ISDELETED(opaque))
+ ereport(LOG,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg_internal("found deleted block %u while following right link in index \"%s\"",
+ BufferGetBlockNumber(leafbuf),
+ RelationGetRelationName(rel))));
+
+ _bt_relbuf(rel, leafbuf);
return ndeleted;
}
/*
* We can never delete rightmost pages nor root pages. While at it,
- * check that page is not already deleted and is empty.
+ * check that page is empty, since it's possible that the page was
+ * empty a moment ago, but has since had some inserts.
*
* To keep the algorithm simple, we also never delete an incompletely
* split page (they should be rare enough that this doesn't make any
@@ -1558,14 +1586,14 @@ _bt_pagedel(Relation rel, Buffer buf)
* to. On subsequent iterations, we know we stepped right from a page
* that passed these tests, so it's OK.
*/
- if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
+ if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) ||
P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page) ||
P_INCOMPLETE_SPLIT(opaque))
{
/* Should never fail to delete a half-dead page */
Assert(!P_ISHALFDEAD(opaque));
- _bt_relbuf(rel, buf);
+ _bt_relbuf(rel, leafbuf);
return ndeleted;
}
@@ -1603,7 +1631,7 @@ _bt_pagedel(Relation rel, Buffer buf)
* To avoid deadlocks, we'd better drop the leaf page lock
* before going further.
*/
- LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(leafbuf, BUFFER_LOCK_UNLOCK);
/*
* Fetch the left sibling, to check that it's not marked with
@@ -1627,10 +1655,10 @@ _bt_pagedel(Relation rel, Buffer buf)
* incompletely-split page to be split again. So we don't
* need to walk right here.
*/
- if (lopaque->btpo_next == BufferGetBlockNumber(buf) &&
+ if (lopaque->btpo_next == BufferGetBlockNumber(leafbuf) &&
P_INCOMPLETE_SPLIT(lopaque))
{
- ReleaseBuffer(buf);
+ ReleaseBuffer(leafbuf);
_bt_relbuf(rel, lbuf);
return ndeleted;
}
@@ -1646,16 +1674,26 @@ _bt_pagedel(Relation rel, Buffer buf)
_bt_relbuf(rel, lbuf);
/*
- * Re-lock the leaf page, and start over, to re-check that the
- * page can still be deleted.
+ * Re-lock the leaf page, and start over to use our stack
+ * within _bt_mark_page_halfdead(). We must do it that way
+ * because it's possible that page can no longer be deleted.
+ * We need to recheck.
*/
- LockBuffer(buf, BT_WRITE);
+ LockBuffer(leafbuf, BT_WRITE);
continue;
}
- if (!_bt_mark_page_halfdead(rel, buf, stack))
+ /*
+ * See if it's safe to delete the leaf page, and determine how
+ * many parent/internal pages above the leaf level will be
+ * deleted. If it's safe then _bt_mark_page_halfdead will also
+ * perform the first phase of deletion, including marking the leaf
+ * page half-dead.
+ */
+ Assert(P_ISLEAF(opaque) && !P_IGNORE(opaque));
+ if (!_bt_mark_page_halfdead(rel, leafbuf, stack))
{
- _bt_relbuf(rel, buf);
+ _bt_relbuf(rel, leafbuf);
return ndeleted;
}
}
@@ -1663,23 +1701,32 @@ _bt_pagedel(Relation rel, Buffer buf)
/*
* Then unlink it from its siblings. Each call to
* _bt_unlink_halfdead_page unlinks the topmost page from the branch,
- * making it shallower. Iterate until the leaf page is gone.
+ * making it shallower. Iterate until the leaf page is deleted.
+ *
+ * _bt_unlink_halfdead_page should never fail, since we established
+ * that deletion is generally safe in _bt_mark_page_halfdead.
*/
rightsib_empty = false;
+ Assert(P_ISLEAF(opaque) && P_ISHALFDEAD(opaque));
while (P_ISHALFDEAD(opaque))
{
- /* will check for interrupts, once lock is released */
- if (!_bt_unlink_halfdead_page(rel, buf, &rightsib_empty))
+ /* Check for interrupts in _bt_unlink_halfdead_page */
+ if (!_bt_unlink_halfdead_page(rel, leafbuf, &rightsib_empty,
+ oldestBtpoXact))
{
- /* _bt_unlink_halfdead_page already released buffer */
+ /* _bt_unlink_halfdead_page failed, released buffer */
return ndeleted;
}
ndeleted++;
}
+ Assert(P_ISLEAF(opaque) && P_ISDELETED(opaque));
+ Assert(TransactionIdFollowsOrEquals(opaque->btpo.xact,
+ *oldestBtpoXact));
+
rightsib = opaque->btpo_next;
- _bt_relbuf(rel, buf);
+ _bt_relbuf(rel, leafbuf);
/*
* Check here, as calling loops will have locks held, preventing
@@ -1705,7 +1752,7 @@ _bt_pagedel(Relation rel, Buffer buf)
if (!rightsib_empty)
break;
- buf = _bt_getbuf(rel, rightsib, BT_WRITE);
+ leafbuf = _bt_getbuf(rel, rightsib, BT_WRITE);
}
return ndeleted;
@@ -1909,9 +1956,19 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
* of the whole branch, including the leaf page itself, iterate until the
* leaf page is deleted.
*
- * Returns 'false' if the page could not be unlinked (shouldn't happen).
- * If the (current) right sibling of the page is empty, *rightsib_empty is
- * set to true.
+ * Returns 'false' if the page could not be unlinked (shouldn't happen). If
+ * the right sibling of the current target page is empty, *rightsib_empty is
+ * set to true, allowing caller to delete the target's right sibling page in
+ * passing. Note that *rightsib_empty is only actually used by caller when
+ * target page is leafbuf, following last call here for leafbuf/the subtree
+ * containing leafbuf. (We always set *rightsib_empty for caller, just to be
+ * consistent.)
+ *
+ * We maintain *oldestBtpoXact for the VACUUM operation as a whole here. This is
+ * necessary because we conservatively assume that there must be a call to
+ * ReadNewTransactionId() for each and every page that is deleted, including
+ * when we delete a series of pages that are all part of the same empty
+ * subtree -- see comments about the underlying assumption below.
*
* Must hold pin and lock on leafbuf at entry (read or write doesn't matter).
* On success exit, we'll be holding pin and write lock. On failure exit,
@@ -1919,7 +1976,8 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
* to avoid having to reacquire a lock we already released).
*/
static bool
-_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
+_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty,
+ TransactionId *oldestBtpoXact)
{
BlockNumber leafblkno = BufferGetBlockNumber(leafbuf);
BlockNumber leafleftsib;
@@ -2057,9 +2115,9 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
lbuf = InvalidBuffer;
/*
- * Next write-lock the target page itself. It should be okay to take just
- * a write lock not a superexclusive lock, since no scans would stop on an
- * empty page.
+ * Next write-lock the target page itself. It's okay to take a write lock
+ * rather than a superexclusive lock, since no scan will stop on an empty
+ * page.
*/
LockBuffer(buf, BT_WRITE);
page = BufferGetPage(buf);
@@ -2204,6 +2262,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
*/
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ Assert(P_ISHALFDEAD(opaque) || !P_ISLEAF(opaque));
opaque->btpo_flags &= ~BTP_HALF_DEAD;
opaque->btpo_flags |= BTP_DELETED;
opaque->btpo.xact = ReadNewTransactionId();
@@ -2309,6 +2368,11 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
_bt_relbuf(rel, lbuf);
_bt_relbuf(rel, rbuf);
+ /* maintain oldestBtpoXact */
+ if (!TransactionIdIsValid(*oldestBtpoXact) ||
+ TransactionIdPrecedes(opaque->btpo.xact, *oldestBtpoXact))
+ *oldestBtpoXact = opaque->btpo.xact;
+
/*
* Release the target, if it was not the leaf block. The leaf is always
* kept locked.
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 36294789f3..65f7fd1945 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -92,7 +92,7 @@ typedef struct BTParallelScanDescData *BTParallelScanDesc;
static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
IndexBulkDeleteCallback callback, void *callback_state,
- BTCycleId cycleid, TransactionId *oldestBtpoXact);
+ BTCycleId cycleid);
static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
BlockNumber orig_blkno);
static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
@@ -787,8 +787,16 @@ _bt_parallel_advance_array_keys(IndexScanDesc scan)
}
/*
- * _bt_vacuum_needs_cleanup() -- Checks if index needs cleanup assuming that
- * btbulkdelete() wasn't called.
+ * _bt_vacuum_needs_cleanup() -- Checks if index needs cleanup
+ *
+ * Called by btvacuumcleanup when btbulkdelete was never called, because no
+ * tuples needed to be deleted.
+ *
+ * When we return false, VACUUM can even skip the cleanup-only call to
+ * btvacuumscan (i.e. there will be no btvacuumscan call for the entire
+ * VACUUM). Otherwise a cleanup-only btvacuumscan call is required, which may
+ * recycle already-deleted pages, and gather accurate statistics for the
+ * index.
*/
static bool
_bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
@@ -815,8 +823,15 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
RecentGlobalXmin))
{
/*
- * If oldest btpo.xact in the deleted pages is older than
- * RecentGlobalXmin, then at least one deleted page can be recycled.
+ * If any oldest btpo.xact from a previously deleted page in the index
+ * is older than RecentGlobalXmin, then at least one deleted page can
+ * be recycled -- don't skip cleanup.
+ *
+ * Note that btvacuumpage currently doesn't make any effort to
+ * recognize when a recycled page is already in the FSM (i.e. put
+ * there by a previous VACUUM). We have to be conservative because
+ * the FSM isn't crash safe. Hopefully recycled pages get reused
+ * before too long.
*/
result = true;
}
@@ -873,20 +888,9 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
/* The ENSURE stuff ensures we clean up shared memory on failure */
PG_ENSURE_ERROR_CLEANUP(_bt_end_vacuum_callback, PointerGetDatum(rel));
{
- TransactionId oldestBtpoXact;
-
cycleid = _bt_start_vacuum(rel);
- btvacuumscan(info, stats, callback, callback_state, cycleid,
- &oldestBtpoXact);
-
- /*
- * Update cleanup-related information in metapage. This information is
- * used only for cleanup but keeping them up to date can avoid
- * unnecessary cleanup even after bulkdelete.
- */
- _bt_update_meta_cleanup_info(info->index, oldestBtpoXact,
- info->num_heap_tuples);
+ btvacuumscan(info, stats, callback, callback_state, cycleid);
}
PG_END_ENSURE_ERROR_CLEANUP(_bt_end_vacuum_callback, PointerGetDatum(rel));
_bt_end_vacuum(rel);
@@ -918,18 +922,12 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
*/
if (stats == NULL)
{
- TransactionId oldestBtpoXact;
-
/* Check if we need a cleanup */
if (!_bt_vacuum_needs_cleanup(info))
return NULL;
stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
- btvacuumscan(info, stats, NULL, NULL, 0, &oldestBtpoXact);
-
- /* Update cleanup-related information in the metapage */
- _bt_update_meta_cleanup_info(info->index, oldestBtpoXact,
- info->num_heap_tuples);
+ btvacuumscan(info, stats, NULL, NULL, 0);
}
/*
@@ -954,7 +952,9 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
* according to the vacuum callback, looking for empty pages that can be
* deleted, and looking for old deleted pages that can be recycled. Both
* btbulkdelete and btvacuumcleanup invoke this (the latter only if no
- * btbulkdelete call occurred).
+ * btbulkdelete call occurred and _bt_vacuum_needs_cleanup returned true).
+ * Note that this is also where the metadata used by _bt_vacuum_needs_cleanup
+ * is maintained.
*
* The caller is responsible for initially allocating/zeroing a stats struct
* and for obtaining a vacuum cycle ID if necessary.
@@ -962,7 +962,7 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
static void
btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
IndexBulkDeleteCallback callback, void *callback_state,
- BTCycleId cycleid, TransactionId *oldestBtpoXact)
+ BTCycleId cycleid)
{
Relation rel = info->index;
BTVacState vstate;
@@ -1046,6 +1046,15 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
MemoryContextDelete(vstate.pagedelcontext);
+ /*
+ * If we deleted any pages (or encountered any existing deleted pages)
+ * then update the oldest btpo.xact for the index as a whole in the index
+ * metapage. We also maintain the number of heap tuples for the last
+ * VACUUM that actually called btvacuumscan().
+ */
+ _bt_update_meta_cleanup_info(rel, vstate.oldestBtpoXact,
+ info->num_heap_tuples);
+
/*
* If we found any recyclable pages (and recorded them in the FSM), then
* forcibly update the upper-level FSM pages to ensure that searchers can
@@ -1064,9 +1073,6 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
/* update statistics */
stats->num_pages = num_pages;
stats->pages_free = vstate.totFreePages;
-
- if (oldestBtpoXact)
- *oldestBtpoXact = vstate.oldestBtpoXact;
}
/*
@@ -1137,24 +1143,30 @@ restart:
/* Page is valid, see what to do with it */
if (_bt_page_recyclable(page))
{
- /* Okay to recycle this page */
+ /* Okay to recycle this page (which could be leaf or internal) */
RecordFreeIndexPage(rel, blkno);
vstate->totFreePages++;
stats->pages_deleted++;
}
else if (P_ISDELETED(opaque))
{
- /* Already deleted, but can't recycle yet */
+ /*
+ * Already deleted page (which could be leaf or internal). Can't
+ * recycle yet.
+ */
stats->pages_deleted++;
- /* Update the oldest btpo.xact */
+ /* Maintain the oldest btpo.xact */
if (!TransactionIdIsValid(vstate->oldestBtpoXact) ||
TransactionIdPrecedes(opaque->btpo.xact, vstate->oldestBtpoXact))
vstate->oldestBtpoXact = opaque->btpo.xact;
}
else if (P_ISHALFDEAD(opaque))
{
- /* Half-dead, try to delete */
+ /*
+ * Half-dead leaf page. Try to delete now. Might update
+ * oldestBtpoXact and pages_deleted below.
+ */
delete_now = true;
}
else if (P_ISLEAF(opaque))
@@ -1357,16 +1369,11 @@ restart:
MemoryContextReset(vstate->pagedelcontext);
oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
- ndel = _bt_pagedel(rel, buf);
+ ndel = _bt_pagedel(rel, buf, &vstate->oldestBtpoXact);
/* count only this page, else may double-count parent */
if (ndel)
- {
stats->pages_deleted++;
- if (!TransactionIdIsValid(vstate->oldestBtpoXact) ||
- TransactionIdPrecedes(opaque->btpo.xact, vstate->oldestBtpoXact))
- vstate->oldestBtpoXact = opaque->btpo.xact;
- }
MemoryContextSwitchTo(oldcontext);
/* pagedel released buffer, so we shouldn't */
--
2.25.1
v2-0003-Refactor-btvacuumpage.patch (application/octet-stream)
From 4f5368e7a5e39269101b5a74b251aab8ba09ff7e Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 28 Apr 2020 19:12:41 -0700
Subject: [PATCH v2 3/3] Refactor btvacuumpage().
btvacuumscan() visits nbtree pages in physical order by calling
btvacuumpage() for each block. It is occasionally necessary for an
individual call to btvacuumpage() to "backtrack" to a block already
visited in the outermost btvacuumscan() loop. This is necessary to
avoid overlooking the right half page from a page split that used a
block that VACUUM already passed over (this happens when the new page
uses a block that came from the FSM).
Remove one of the arguments to btvacuumpage(), and give up on the idea
that it's a recursive function. Advertising btvacuumpage() as a
recursive function seems to make it more complicated for several
reasons. In reality the function always simulated recursion with a
loop, without ever actually calling itself. This wasn't merely a
precaution, as the comments implied. There is no
reliable natural limit on the number of times we can backtrack, so true
recursion was and is totally unworkable.
There are important behavioral differences when "recursing"/backtracking,
mostly related to page deletion. That's what clinches it. We don't
perform page deletion when backtracking due to the extra complexity.
And when we recurse, we're not performing a physical order scan anymore,
so we expect fairly different conditions to hold for the page -- we
backtrack because the page is the right half of a page split that
occurred since the VACUUM operation began.
Recognizing that backtracking is quite different makes it possible to be
more particular about the page that we backtrack to. We now log cases
where backtracking fails to find a valid leaf right sibling page as
corruption.
Note that this commit uses the same variable name in btvacuumscan() and
btvacuumpage() as was used in the commit that fixed the under-counting
issue.
---
src/backend/access/nbtree/nbtinsert.c | 3 +
src/backend/access/nbtree/nbtree.c | 118 ++++++++++++++------------
2 files changed, 69 insertions(+), 52 deletions(-)
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index a70b64d964..99827b457b 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1702,6 +1702,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
* Finish off remaining leftpage special area fields. They cannot be set
* before both origpage (leftpage) and rightpage buffers are acquired and
* locked.
+ *
+ * btpo_cycleid is only used with leaf pages, though we set it here in all
+ * cases just to be consistent.
*/
lopaque->btpo_next = rightpagenumber;
lopaque->btpo_cycleid = _bt_vacuum_cycleid(rel);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b20a592808..699d5a8d42 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -93,8 +93,7 @@ typedef struct BTParallelScanDescData *BTParallelScanDesc;
static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
IndexBulkDeleteCallback callback, void *callback_state,
BTCycleId cycleid);
-static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
- BlockNumber orig_blkno);
+static void btvacuumpage(BTVacState *vstate, BlockNumber scanblkno);
static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
IndexTuple posting,
OffsetNumber updatedoffset,
@@ -967,7 +966,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
Relation rel = info->index;
BTVacState vstate;
BlockNumber num_pages;
- BlockNumber blkno;
+ BlockNumber scanblkno;
bool needLock;
/*
@@ -1017,7 +1016,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
*/
needLock = !RELATION_IS_LOCAL(rel);
- blkno = BTREE_METAPAGE + 1;
+ scanblkno = BTREE_METAPAGE + 1;
for (;;)
{
/* Get the current relation length */
@@ -1032,15 +1031,15 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
num_pages);
/* Quit if we've scanned the whole relation */
- if (blkno >= num_pages)
+ if (scanblkno >= num_pages)
break;
/* Iterate over pages, then loop back to recheck length */
- for (; blkno < num_pages; blkno++)
+ for (; scanblkno < num_pages; scanblkno++)
{
- btvacuumpage(&vstate, blkno, blkno);
+ btvacuumpage(&vstate, scanblkno);
if (info->report_progress)
pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
- blkno);
+ scanblkno);
}
}
@@ -1078,31 +1077,42 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
/*
* btvacuumpage --- VACUUM one page
*
- * This processes a single page for btvacuumscan(). In some cases we
- * must go back and re-examine previously-scanned pages; this routine
- * recurses when necessary to handle that case.
+ * This processes a single page for btvacuumscan() (the page from caller's
+ * scanblkno argument).
*
- * blkno is the page to process. orig_blkno is the highest block number
- * reached by the outer btvacuumscan loop (the same as blkno, unless we
- * are recursing to re-examine a previous page).
+ * In some cases we must backtrack to re-examine and VACUUM pages that
+ * we previously scanned as the scanblkno in an earlier call here. This
+ * happens when a leaf page split occurred after the VACUUM operation
+ * began.
+ *
+ * When we delete the scanblkno page, we may also delete other pages in
+ * passing. This is needed when page deletion must delete the parent of the
+ * scanblkno page. In general, the parent page could be located at any block
+ * in the index (it may have already been scanblkno in a previous call here,
+ or may yet become scanblkno in a future call here). Note that we never
+ * delete a page when we're backtracking (though we may do so when we're about
+ * to backtrack).
*/
static void
-btvacuumpage(BTVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
+btvacuumpage(BTVacState *vstate, BlockNumber scanblkno)
{
IndexVacuumInfo *info = vstate->info;
IndexBulkDeleteResult *stats = vstate->stats;
IndexBulkDeleteCallback callback = vstate->callback;
void *callback_state = vstate->callback_state;
Relation rel = info->index;
- bool delete_now;
- BlockNumber recurse_to;
+ bool attempt_pagedel;
+ BlockNumber blkno;
+ BlockNumber backtrack_to;
Buffer buf;
Page page;
BTPageOpaque opaque = NULL;
-restart:
- delete_now = false;
- recurse_to = P_NONE;
+ blkno = scanblkno;
+
+backtrack:
+ attempt_pagedel = false;
+ backtrack_to = P_NONE;
/* call vacuum_delay_point while not holding any buffer lock */
vacuum_delay_point();
@@ -1124,17 +1134,23 @@ restart:
}
/*
- * If we are recursing, the only case we want to do anything with is a
- * live leaf page having the current vacuum cycle ID. Any other state
- * implies we already saw the page (eg, deleted it as being empty).
+ * If we are backtracking, the only case we want to do anything with is
+ * a live leaf page having the current vacuum cycle ID. Any other state
+ * implies that the index is corrupt.
*/
- if (blkno != orig_blkno)
+ if (blkno != scanblkno)
{
if (_bt_page_recyclable(page) ||
P_IGNORE(opaque) ||
!P_ISLEAF(opaque) ||
opaque->btpo_cycleid != vstate->cycleid)
{
+ Assert(false);
+ ereport(LOG,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg_internal("found invalid block %u while following right link from block %u in index \"%s\"",
+ blkno, scanblkno,
+ RelationGetRelationName(rel))));
_bt_relbuf(rel, buf);
return;
}
@@ -1167,7 +1183,7 @@ restart:
* Half-dead leaf page. Try to delete now. Might update
* oldestBtpoXact and pages_deleted below.
*/
- delete_now = true;
+ attempt_pagedel = true;
}
else if (P_ISLEAF(opaque))
{
@@ -1191,18 +1207,24 @@ restart:
LockBufferForCleanup(buf);
/*
- * Check whether we need to recurse back to earlier pages. What we
+ * Check whether we need to backtrack to earlier pages. What we
* are concerned about is a page split that happened since we started
- * the vacuum scan. If the split moved some tuples to a lower page
- * then we might have missed 'em. If so, set up for tail recursion.
- * (Must do this before possibly clearing btpo_cycleid below!)
+ * the vacuum scan. If the split moved tuples on the right half of
+ * the split (i.e. the tuples that sort high) to a new block that we
+ * already passed over, then we need to backtrack now. Must do this
+ * before possibly clearing btpo_cycleid below!
+ *
+ * Note that we don't need to specifically consider the left half page
+ * of any page splits that have occurred since the start of the VACUUM
+ * scan. The left half is stored in the same block as the original
+ * page, so it can't get overlooked.
*/
if (vstate->cycleid != 0 &&
opaque->btpo_cycleid == vstate->cycleid &&
!(opaque->btpo_flags & BTP_SPLIT_END) &&
!P_RIGHTMOST(opaque) &&
- opaque->btpo_next < orig_blkno)
- recurse_to = opaque->btpo_next;
+ opaque->btpo_next < scanblkno)
+ backtrack_to = opaque->btpo_next;
/*
* When each VACUUM begins, it determines an OldestXmin cutoff value.
@@ -1313,7 +1335,7 @@ restart:
*/
if (ndeletable > 0 || nupdatable > 0)
{
- Assert(nhtidsdead >= Max(ndeletable, 1));
+ Assert(nhtidsdead >= ndeletable + nupdatable);
_bt_delitems_vacuum(rel, buf, deletable, ndeletable, updatable,
nupdatable);
@@ -1328,11 +1350,10 @@ restart:
else
{
/*
- * If the leaf page has been split during this vacuum cycle, it
- * seems worth expending a write to clear btpo_cycleid even if we
- * don't have any deletions to do. (If we do, _bt_delitems_vacuum
- * takes care of this.) This ensures we won't process the page
- * again.
+ * If the leaf page has been split during this vacuum cycle, clear
+ * btpo_cycleid, even though we don't have any deletions to do.
+ * (If we do, _bt_delitems_vacuum takes care of this.) This
+ * ensures we won't process the page again.
*
* We treat this like a hint-bit update because there's no need to
* WAL-log it.
@@ -1349,21 +1370,21 @@ restart:
/*
* If the leaf page is now empty, try to delete; else count the live
* tuples (live table TIDs in posting lists are counted as separate
- * live tuples). We don't delete when recursing, though, to avoid
+ * live tuples). We don't delete when backtracking, though, to avoid
* putting entries into freePages out-of-order. Doesn't seem worth
* any extra code to handle the case, since it's very unlikely that
* we'll be able to delete a leaf page that's the right half of a page
* split that occurred after the current VACUUM operation began.
*/
if (minoff > maxoff)
- delete_now = (blkno == orig_blkno);
+ attempt_pagedel = (blkno == scanblkno);
else
stats->num_index_tuples += nhtidslive;
- Assert(!delete_now || nhtidslive == 0);
+ Assert(!attempt_pagedel || nhtidslive == 0);
}
- if (delete_now)
+ if (attempt_pagedel)
{
MemoryContext oldcontext;
@@ -1376,6 +1397,7 @@ restart:
* any page that a future call here from btvacuumscan is expected to
* count. There will be no double-counting.
*/
+ Assert(blkno == scanblkno);
stats->pages_deleted += _bt_pagedel(rel, buf, &vstate->oldestBtpoXact);
MemoryContextSwitchTo(oldcontext);
@@ -1384,18 +1406,10 @@ restart:
else
_bt_relbuf(rel, buf);
- /*
- * This is really tail recursion, but if the compiler is too stupid to
- * optimize it as such, we'd eat an uncomfortably large amount of stack
- * space per recursion level (due to the arrays used to track details of
- * deletable/updatable items). A failure is improbable since the number
- * of levels isn't likely to be large ... but just in case, let's
- * hand-optimize into a loop.
- */
- if (recurse_to != P_NONE)
+ if (backtrack_to != P_NONE)
{
- blkno = recurse_to;
- goto restart;
+ blkno = backtrack_to;
+ goto backtrack;
}
}
--
2.25.1
Attachment: v2-0002-Fix-undercounting-in-VACUUM-VERBOSE-output.patch
From ec5e4a907a71427540581003d81caa29486b337e Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 28 Apr 2020 19:12:41 -0700
Subject: [PATCH v2 2/3] Fix undercounting in VACUUM VERBOSE output.
The logic for determining how many nbtree pages are deleted in an index
must compensate for the fact that pages deleted by the current VACUUM
might be visited a second time by the outermost btvacuumscan() physical
order scan.  btvacuumpage() avoided double-counting by only counting a
single page as deleted when one or more pages were deleted.
This approach was better than simply accepting the return value of
_bt_pagedel() uncritically, but it was still slightly wrong. It failed
to account for the fact that _bt_pagedel() can legitimately delete pages
that the physical order scan will not visit again. This resulted in
undercounting in VACUUM VERBOSE output: the "%u index pages have been
deleted" figure often failed to include a small number of deleted
internal pages (as well as empty cousin leaf pages that couldn't be
deleted in an earlier _bt_pagedel() call due to the "rightmost page of a
parent" restriction).
Fix the accounting by teaching _bt_pagedel() about its caller's
requirements. It now counts pages that it knows the caller won't visit
again, and doesn't count pages that it knows the caller will visit
again. btvacuumpage() can now completely trust _bt_pagedel()'s return
value.
---
src/include/access/nbtree.h | 4 +--
src/backend/access/nbtree/nbtpage.c | 42 ++++++++++++++++++++++-------
src/backend/access/nbtree/nbtree.c | 33 ++++++++++++-----------
3 files changed, 52 insertions(+), 27 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 7bccd7ecc3..97c96d0a3a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1080,8 +1080,8 @@ extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
extern void _bt_delitems_delete(Relation rel, Buffer buf,
OffsetNumber *deletable, int ndeletable,
Relation heapRel);
-extern int _bt_pagedel(Relation rel, Buffer leafbuf,
- TransactionId *oldestBtpoXact);
+extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf,
+ TransactionId *oldestBtpoXact);
/*
* prototypes for functions in nbtsearch.c
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index c90a2012d5..b65ce4db32 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -38,8 +38,10 @@ static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
static bool _bt_mark_page_halfdead(Relation rel, Buffer leafbuf,
BTStack stack);
static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
+ BlockNumber scanblkno,
bool *rightsib_empty,
- TransactionId *oldestBtpoXact);
+ TransactionId *oldestBtpoXact,
+ uint32 *ndeleted);
static TransactionId _bt_xid_horizon(Relation rel, Relation heapRel, Page page,
OffsetNumber *deletable, int ndeletable);
static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
@@ -1489,7 +1491,10 @@ _bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
*
* Returns the number of pages successfully deleted (zero if page cannot
* be deleted now; could be more than one if parent or right sibling pages
- * were deleted too).
+ * were deleted too). The final count is calculated in a way that helps our
+ * caller to avoid double-counting. When we know that the outermost
+ * btvacuumscan physical-order loop over the index relation will go on to find
+ * a page that we delete now, we don't count it in the return value.
*
* Maintains *oldestBtpoXact for caller for any pages that are deleted here
* (including caller's target page and any other internal pages or right
@@ -1501,15 +1506,22 @@ _bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
* carefully, it's better to run it in a temp context that can be reset
* frequently.
*/
-int
+uint32
_bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
{
- int ndeleted = 0;
+ uint32 ndeleted = 0;
BlockNumber rightsib;
bool rightsib_empty;
Page page;
BTPageOpaque opaque;
+ /*
+ * Save original leafbuf block number from caller. Only deleted blocks
+ * that are <= scanblkno get counted in ndeleted return value. This
+ * approach helps caller to avoid double-counting deleted pages.
+ */
+ BlockNumber scanblkno = BufferGetBlockNumber(leafbuf);
+
/*
* "stack" is a search stack leading (approximately) to the target page.
* It is initially NULL, but when iterating, we keep it to avoid
@@ -1560,8 +1572,9 @@ _bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
if (P_ISDELETED(opaque))
ereport(LOG,
(errcode(ERRCODE_INDEX_CORRUPTED),
- errmsg_internal("found deleted block %u while following right link in index \"%s\"",
+ errmsg_internal("found deleted block %u while following right link from block %u in index \"%s\"",
BufferGetBlockNumber(leafbuf),
+ scanblkno,
RelationGetRelationName(rel))));
_bt_relbuf(rel, leafbuf);
@@ -1711,13 +1724,13 @@ _bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
while (P_ISHALFDEAD(opaque))
{
/* Check for interrupts in _bt_unlink_halfdead_page */
- if (!_bt_unlink_halfdead_page(rel, leafbuf, &rightsib_empty,
- oldestBtpoXact))
+ if (!_bt_unlink_halfdead_page(rel, leafbuf, scanblkno,
+ &rightsib_empty, oldestBtpoXact,
+ &ndeleted))
{
/* _bt_unlink_halfdead_page failed, released buffer */
return ndeleted;
}
- ndeleted++;
}
Assert(P_ISLEAF(opaque) && P_ISDELETED(opaque));
@@ -1976,8 +1989,9 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
* to avoid having to reacquire a lock we already released).
*/
static bool
-_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty,
- TransactionId *oldestBtpoXact)
+_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
+ bool *rightsib_empty, TransactionId *oldestBtpoXact,
+ uint32 *ndeleted)
{
BlockNumber leafblkno = BufferGetBlockNumber(leafbuf);
BlockNumber leafleftsib;
@@ -2373,6 +2387,14 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty,
TransactionIdPrecedes(opaque->btpo.xact, *oldestBtpoXact))
*oldestBtpoXact = opaque->btpo.xact;
+ /*
+ * If btvacuumscan won't revisit this page in a future btvacuumpage call
+ * and count it as deleted then, we count it as deleted by the current
+ * btvacuumpage call.
+ */
+ if (target <= scanblkno)
+ (*ndeleted)++;
+
/*
* Release the target, if it was not the leaf block. The leaf is always
* kept locked.
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 65f7fd1945..b20a592808 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1328,10 +1328,11 @@ restart:
else
{
/*
- * If the page has been split during this vacuum cycle, it seems
- * worth expending a write to clear btpo_cycleid even if we don't
- * have any deletions to do. (If we do, _bt_delitems_vacuum takes
- * care of this.) This ensures we won't process the page again.
+ * If the leaf page has been split during this vacuum cycle, it
+ * seems worth expending a write to clear btpo_cycleid even if we
+ * don't have any deletions to do. (If we do, _bt_delitems_vacuum
+ * takes care of this.) This ensures we won't process the page
+ * again.
*
* We treat this like a hint-bit update because there's no need to
* WAL-log it.
@@ -1346,11 +1347,13 @@ restart:
}
/*
- * If it's now empty, try to delete; else count the live tuples (live
- * table TIDs in posting lists are counted as separate live tuples).
- * We don't delete when recursing, though, to avoid putting entries
- * into freePages out-of-order (doesn't seem worth any extra code to
- * handle the case).
+ * If the leaf page is now empty, try to delete; else count the live
+ * tuples (live table TIDs in posting lists are counted as separate
+ * live tuples). We don't delete when recursing, though, to avoid
+ * putting entries into freePages out-of-order. Doesn't seem worth
+ * any extra code to handle the case, since it's very unlikely that
+ * we'll be able to delete a leaf page that's the right half of a page
+ * split that occurred after the current VACUUM operation began.
*/
if (minoff > maxoff)
delete_now = (blkno == orig_blkno);
@@ -1363,17 +1366,17 @@ restart:
if (delete_now)
{
MemoryContext oldcontext;
- int ndel;
/* Run pagedel in a temp context to avoid memory leakage */
MemoryContextReset(vstate->pagedelcontext);
oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
- ndel = _bt_pagedel(rel, buf, &vstate->oldestBtpoXact);
-
- /* count only this page, else may double-count parent */
- if (ndel)
- stats->pages_deleted++;
+ /*
+ * We trust the _bt_pagedel return value because it does not include
+ * any page that a future call here from btvacuumscan is expected to
+ * count. There will be no double-counting.
+ */
+ stats->pages_deleted += _bt_pagedel(rel, buf, &vstate->oldestBtpoXact);
MemoryContextSwitchTo(oldcontext);
/* pagedel released buffer, so we shouldn't */
--
2.25.1
On Wed, 29 Apr 2020 at 01:17, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Apr 28, 2020 at 12:21 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:

I agree with both patches.
Thanks for the review.
For the first fix it seems better to push down the logic to the page
deletion code as your 0001 patch does so. The following change changes
the page deletion code so that it emits LOG message indicating the
index corruption when a deleted page is passed. But we used to ignore
in this case.

I agree with this change but since I'm not sure this change directly
relates to this bug I wonder if it can be a separate patch.

I am not surprised that you are concerned about this. That was the
hardest decision I had to make when writing the patch.

The central idea in the 0001-* patch (as I say at the end of the
commit message) is that we have a strict rule about where we handle
maintaining the oldest btpo.xact for already-deleted pages, and where
we handle it for pages that become deleted. The rules is this:
btvacuumpage() is strictly responsible for the former category (as
well as totally live pages), while _bt_pagedel() is responsible for
the latter category (this includes pages that started out as half-dead
but become deleted).

In order for this to work, _bt_pagedel() must not ever see a deleted
page that it did not delete itself. If I didn't explicitly take a
strict position on this, then I suppose that I'd have to have similar
handling for maintaining the oldest btpo.xact for existing deleted
pages encountered within _bt_pagedel(). But that doesn't make any
sense, even with corruption, so I took a strict position. Even with
corruption, how could we encounter an *existing* deleted page in
_bt_pagedel() but not encounter the same page within some call to
btvacuumpage() made from btvacuumscan()? In other words, how can we
fail to maintain the oldest btpo.xact for deleted pages even if we
assume that the index has a corrupt page with sibling links that point
to a fully deleted page? It seems impossible, even in this extreme,
contrived scenario.

Then there is the separate question of the new log message about
corruption. It's not too hard to see why _bt_pagedel() cannot ever
access an existing deleted page unless there is corruption:

* _bt_pagedel()'s only caller (the btvacuumpage() caller) will simply
never pass a deleted page -- it handles those directly.

* _bt_pagedel() can only access additional leaf pages to consider
deleting them by following the right sibling link of the
btvacuumpage() caller's leaf page (or possibly some page even further
to the right if it ends up deleting many leaf pages in one go).
Following right links and finding a deleted page can only happen when
there is a concurrent page deletion by VACUUM, since the sibling links
to the deleted page are changed by the second stage of deletion (i.e.
by the atomic action where the page is actually marked deleted,
within _bt_unlink_halfdead_page()).

* There cannot be a concurrent VACUUM, though, since _bt_pagedel()
runs in VACUUM. (Note that we already depend on this for correctness
in _bt_unlink_halfdead_page(), which has a comment that says "a
half-dead page cannot resurrect because there can be only one vacuum
process running at a time".)

The strict position seems justified. It really isn't that strict. I'm
not concerned about the new LOG message concerning corruption being
annoying or wrong.
Thank you for the explanation. This makes sense to me. We've never been
able to regard an existing deleted page detected within _bt_pagedel()
as index corruption. But with the changes required to fix this bug, the
responsibilities change so that _bt_pagedel() must never see a deleted
page. Therefore we can take a stricter position. It's not that we start
warning about it merely because this bug could lead to index corruption.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, 30 Apr 2020 at 07:29, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Apr 28, 2020 at 12:21 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:

For the first fix it seems better to push down the logic to the page
deletion code as your 0001 patch does so. The following change changes
the page deletion code so that it emits LOG message indicating the
index corruption when a deleted page is passed. But we used to ignore
in this case.

Attached is v2 of the patch set, which includes a new patch that I
want to target the master branch with. This patch (v2-0003-*)
refactors btvacuumscan().

This revision also changed the first bugfix patch. I have simplified
some of the details in nbtree.c that were added by commit 857f9c36cda.
Can't we call _bt_update_meta_cleanup_info() at a lower level, like in
the patch? AFAICT it makes more sense to just call it in
btvacuumscan(). Please let me know what you think of those changes.
Yes, I agree.
The big change in the new master-only refactoring patch (v2-0003-*) is
that I now treat a call to btvacuumpage() that has to
"backtrack/recurse" and doesn't find a page that looks like the
expected right half of a page split (a page split that occurred since
the VACUUM operation began) as indicating corruption. This corruption
is logged. I believe that we should never see this happen, for reasons
that are similar to the reasons why _bt_pagedel() never finds a
deleted page when moving right, as I went into in my recent e-mail to
this thread. I think that this is worth tightening up for Postgres 13.

I will hold off on committing v2-0003-*, since I need to nail down the
reasoning for treating this condition as corruption. Plus it's not
urgent. I think that the general direction taken in v2-0003-* is the
right one, in any case. The "recursion" in btvacuumpage() doesn't make
much sense -- it obscures what's really going on IMV. Also, using the
variable name "scanblkno" at every level -- btvacuumscan(),
btvacuumpage(), and _bt_pagedel() -- makes it clear that it's the same
block at all levels of the code. And that the "backtrack/recursion"
case doesn't kill tuples from the "scanblkno" block (and doesn't
attempt to call _bt_pagedel(), which is designed only to handle blocks
passed by btvacuumpage() when they are the current "scanblkno"). It
seems unlikely that the VACUUM VERBOSE pages deleted accounting bug
would have happened if the code was already structured this way.
+1 for refactoring. I often get confused by the fact that btvacuumpage()
takes two block numbers (blkno and orig_blkno) even though they usually
hold the same value. Your change makes this clearer.
For the part treating that case as index corruption, I will need some
time to review it because of my limited knowledge of btree indexes. So
I'll review it later.
Regards,
--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Apr 29, 2020 at 11:38 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
Thank you for the explanation. This makes sense to me.
Pushed both of the fixes.
Thanks!
--
Peter Geoghegan
On Thu, Apr 30, 2020 at 12:20 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
For the part treating that case as index corruption, I will need some
time to review it because of my limited knowledge of btree indexes. So
I'll review it later.
I pushed the refactoring patch today. Thanks for the review.
The final test for corruption that I added to btvacuumscan() is less
aggressive than what you saw in the patch I posted. We only report
corruption when backtracking/recursing if the page is "new", is not a
leaf page, or is half-dead. We don't treat a fully deleted page as
corruption, because there is a case where the same call to
btvacuumscan() may have called _bt_pagedel() already, which may have
deleted the block that we backtrack/recurse to. The "sibling links
cannot point to a deleted page without concurrent deletion, and we
know that can't happen because we are VACUUM" stuff doesn't really
apply -- we remember which block we will recurse to *before* we
actually call _bt_pagedel().
--
Peter Geoghegan