Cleaning up nbtree after logical decoding on standby work
Commit 61b313e4, which prepared index access method code for the
logical decoding on standbys work, made quite a few mechanical
changes. Many routines were revised in order to make sure that heaprel
was available in _bt_getbuf()'s P_NEW (new page allocation) path. But
this went further than it really had to. Many of the changes to nbtree
seem excessive.
We only need a heaprel in those code paths that might end up calling
_bt_getbuf() with blkno = P_NEW. That means some of the callers that
pass access = BT_WRITE, and none of the callers that pass access =
BT_READ. This doesn't have to be haphazard -- there just aren't that
many places that can allocate new nbtree pages. It's just page splits,
and new root page allocations (which are actually a slightly different
kind of page split). The rule doesn't need to be complicated (to be
fair, the current code makes it look more complicated than it really is).
Attached patch completely removes the changes to _bt_getbuf()'s
signature from 61b313e4. This is possible without any loss of
functionality. The patch splits _bt_getbuf() in two: the code that
handles BT_READ/BT_WRITE access to existing pages stays in _bt_getbuf(),
which is now far shorter. Handling of new page allocation is moved to a
new routine I've called _bt_allocbuf(). This is clearer in general, and
it makes the rules explicit: any code that might need to call
_bt_allocbuf() must be prepared for that, by having a heaprel to pass to
it (the slightly complicated case is interrupted page splits).
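A minimal sketch of the resulting calling conventions (simplified, not
lifted verbatim from the patch; assume rel, heaprel, and blkno are
already in scope):

    /* Readers, and writers of existing pages, no longer take a heaprel */
    buf = _bt_getbuf(rel, blkno, BT_READ);

    /* Only page splits and new root allocations need one */
    newbuf = _bt_allocbuf(rel, heaprel);    /* pinned and write-locked */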
It's possible that Bertrand would have done it this way to begin with
were it not for the admittedly pretty bad nbtree convention around
P_NEW. It would be nice to get rid of P_NEW in the near future, too --
I gather that there was discussion of that in the context of recent
work in this area.
--
Peter Geoghegan
Attachments:
v1-0001-nbtree-Remove-use-of-P_NEW.patch (application/octet-stream)
From 1bb5ca06449e321290a84685f224229360600bee Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 23 May 2023 22:08:56 -0700
Subject: [PATCH v1 1/2] nbtree: Remove use of P_NEW.
---
src/include/access/nbtree.h | 22 +-
src/include/utils/tuplesort.h | 4 +-
src/backend/access/heap/heapam_handler.c | 9 +-
src/backend/access/nbtree/nbtinsert.c | 34 ++-
src/backend/access/nbtree/nbtpage.c | 326 +++++++++++----------
src/backend/access/nbtree/nbtree.c | 4 +-
src/backend/access/nbtree/nbtsearch.c | 43 ++-
src/backend/access/nbtree/nbtsort.c | 2 +-
src/backend/access/nbtree/nbtutils.c | 7 +-
src/backend/optimizer/util/plancat.c | 2 +-
src/backend/utils/sort/tuplesortvariants.c | 5 +-
contrib/amcheck/verify_nbtree.c | 15 +-
12 files changed, 247 insertions(+), 226 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d68478609..150054e32 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -293,6 +293,7 @@ BTPageIsRecyclable(Page page, Relation heaprel)
BTPageOpaque opaque;
Assert(!PageIsNew(page));
+ Assert(heaprel != NULL);
/* Recycling okay iff page is deleted and safexid is old enough */
opaque = BTPageGetOpaque(page);
@@ -1194,18 +1195,17 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page origpage,
*/
extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
bool allequalimage);
-extern bool _bt_vacuum_needs_cleanup(Relation rel, Relation heaprel);
-extern void _bt_set_cleanup_info(Relation rel, Relation heaprel,
- BlockNumber num_delpages);
+extern bool _bt_vacuum_needs_cleanup(Relation rel);
+extern void _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages);
extern void _bt_upgrademetapage(Page page);
extern Buffer _bt_getroot(Relation rel, Relation heaprel, int access);
-extern Buffer _bt_gettrueroot(Relation rel, Relation heaprel);
-extern int _bt_getrootheight(Relation rel, Relation heaprel);
-extern void _bt_metaversion(Relation rel, Relation heaprel, bool *heapkeyspace,
+extern Buffer _bt_gettrueroot(Relation rel);
+extern int _bt_getrootheight(Relation rel);
+extern void _bt_metaversion(Relation rel, bool *heapkeyspace,
bool *allequalimage);
extern void _bt_checkpage(Relation rel, Buffer buf);
-extern Buffer _bt_getbuf(Relation rel, Relation heaprel, BlockNumber blkno,
- int access);
+extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
+extern Buffer _bt_allocbuf(Relation rel, Relation heaprel);
extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
BlockNumber blkno, int access);
extern void _bt_relbuf(Relation rel, Buffer buf);
@@ -1237,13 +1237,13 @@ extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
-extern Buffer _bt_get_endpoint(Relation rel, Relation heaprel, uint32 level,
- bool rightmost, Snapshot snapshot);
+extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
+ Snapshot snapshot);
/*
* prototypes for functions in nbtutils.c
*/
-extern BTScanInsert _bt_mkscankey(Relation rel, Relation heaprel, IndexTuple itup);
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
extern void _bt_freestack(BTStack stack);
extern void _bt_preprocess_array_keys(IndexScanDesc scan);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 656a2e989..af057b635 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -399,9 +399,7 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
int workMem, SortCoordinate coordinate,
int sortopt);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
- Relation indexRel,
- Relation heaprel,
- int workMem,
+ Relation indexRel, int workMem,
SortCoordinate coordinate,
int sortopt);
extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 646135cc2..0755be839 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -731,14 +731,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*multi_cutoff);
- /*
- * Set up sorting if wanted. NewHeap is being passed to
- * tuplesort_begin_cluster(), it could have been OldHeap too. It does not
- * really matter, as the goal is to have a heap relation being passed to
- * _bt_log_reuse_page() (which should not be called from this code path).
- */
+ /* Set up sorting if wanted */
if (use_sort)
- tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex, NewHeap,
+ tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex,
maintenance_work_mem,
NULL, TUPLESORT_NONE);
else
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 84bbd78d5..5f8886e8d 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -110,7 +110,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
/* we need an insertion scan key to do our search, so build one */
- itup_key = _bt_mkscankey(rel, heapRel, itup);
+ itup_key = _bt_mkscankey(rel, itup);
if (checkingunique)
{
@@ -1024,13 +1024,15 @@ _bt_findinsertloc(Relation rel,
* indexes.
*/
static void
-_bt_stepright(Relation rel, Relation heaprel, BTInsertState insertstate, BTStack stack)
+_bt_stepright(Relation rel, Relation heaprel, BTInsertState insertstate,
+ BTStack stack)
{
Page page;
BTPageOpaque opaque;
Buffer rbuf;
BlockNumber rblkno;
+ Assert(heaprel != NULL);
page = BufferGetPage(insertstate->buf);
opaque = BTPageGetOpaque(page);
@@ -1257,7 +1259,7 @@ _bt_insertonpg(Relation rel,
Assert(!isleaf);
Assert(BufferIsValid(cbuf));
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -1421,7 +1423,7 @@ _bt_insertonpg(Relation rel,
* call _bt_getrootheight while holding a buffer lock.
*/
if (BlockNumberIsValid(blockcache) &&
- _bt_getrootheight(rel, heaprel) >= BTREE_FASTPATH_MIN_LEVEL)
+ _bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
RelationSetTargetBlock(rel, blockcache);
}
@@ -1715,7 +1717,7 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
* way because it avoids an unnecessary PANIC when either origpage or its
* existing sibling page are corrupt.
*/
- rbuf = _bt_getbuf(rel, heaprel, P_NEW, BT_WRITE);
+ rbuf = _bt_allocbuf(rel, heaprel);
rightpage = BufferGetPage(rbuf);
rightpagenumber = BufferGetBlockNumber(rbuf);
/* rightpage was initialized by _bt_getbuf */
@@ -1888,7 +1890,7 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
*/
if (!isrightmost)
{
- sbuf = _bt_getbuf(rel, heaprel, oopaque->btpo_next, BT_WRITE);
+ sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE);
spage = BufferGetPage(sbuf);
sopaque = BTPageGetOpaque(spage);
if (sopaque->btpo_prev != origpagenumber)
@@ -2102,6 +2104,8 @@ _bt_insert_parent(Relation rel,
bool isroot,
bool isonly)
{
+ Assert(heaprel != NULL);
+
/*
* Here we have to do something Lehman and Yao don't talk about: deal with
* a root split and construction of a new root. If our stack is empty
@@ -2161,8 +2165,7 @@ _bt_insert_parent(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* Find the leftmost page at the next level up */
- pbuf = _bt_get_endpoint(rel, heaprel, opaque->btpo_level + 1, false,
- NULL);
+ pbuf = _bt_get_endpoint(rel, opaque->btpo_level + 1, false, NULL);
/* Set up a phony stack entry pointing there */
stack = &fakestack;
stack->bts_blkno = BufferGetBlockNumber(pbuf);
@@ -2230,6 +2233,9 @@ _bt_insert_parent(Relation rel,
*
* On entry, 'lbuf' must be locked in write-mode. On exit, it is unlocked
* and unpinned.
+ *
+ * Caller must provide a valid heaprel, since finishing a page split will
+ * require allocating a new page when the parent page splits in turn.
*/
void
_bt_finish_split(Relation rel, Relation heaprel, Buffer lbuf, BTStack stack)
@@ -2243,9 +2249,10 @@ _bt_finish_split(Relation rel, Relation heaprel, Buffer lbuf, BTStack stack)
bool wasonly;
Assert(P_INCOMPLETE_SPLIT(lpageop));
+ Assert(heaprel != NULL);
/* Lock right sibling, the one missing the downlink */
- rbuf = _bt_getbuf(rel, heaprel, lpageop->btpo_next, BT_WRITE);
+ rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE);
rpage = BufferGetPage(rbuf);
rpageop = BTPageGetOpaque(rpage);
@@ -2257,7 +2264,7 @@ _bt_finish_split(Relation rel, Relation heaprel, Buffer lbuf, BTStack stack)
BTMetaPageData *metad;
/* acquire lock on the metapage */
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -2323,10 +2330,11 @@ _bt_getstackbuf(Relation rel, Relation heaprel, BTStack stack, BlockNumber child
Page page;
BTPageOpaque opaque;
- buf = _bt_getbuf(rel, heaprel, blkno, BT_WRITE);
+ buf = _bt_getbuf(rel, blkno, BT_WRITE);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
+ Assert(heaprel != NULL);
if (P_INCOMPLETE_SPLIT(opaque))
{
_bt_finish_split(rel, heaprel, buf, stack->bts_parent);
@@ -2459,12 +2467,12 @@ _bt_newroot(Relation rel, Relation heaprel, Buffer lbuf, Buffer rbuf)
lopaque = BTPageGetOpaque(lpage);
/* get a new root page */
- rootbuf = _bt_getbuf(rel, heaprel, P_NEW, BT_WRITE);
+ rootbuf = _bt_allocbuf(rel, heaprel);
rootpage = BufferGetPage(rootbuf);
rootblknum = BufferGetBlockNumber(rootbuf);
/* acquire lock on the metapage */
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 486d1563d..5e15190f3 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -38,10 +38,11 @@
#include "utils/snapmgr.h"
static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
-static void _bt_log_reuse_page(Relation rel, Relation heaprel, BlockNumber blkno,
- FullTransactionId safexid);
-static void _bt_delitems_delete(Relation rel, Relation heaprel, Buffer buf,
+static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
+ FullTransactionId safexid, bool isCatalogRel);
+static void _bt_delitems_delete(Relation rel, Buffer buf,
TransactionId snapshotConflictHorizon,
+ bool isCatalogRel,
OffsetNumber *deletable, int ndeletable,
BTVacuumPosting *updatable, int nupdatable);
static char *_bt_delitems_update(BTVacuumPosting *updatable, int nupdatable,
@@ -177,7 +178,7 @@ _bt_getmeta(Relation rel, Buffer metabuf)
* index tuples needed to be deleted.
*/
bool
-_bt_vacuum_needs_cleanup(Relation rel, Relation heaprel)
+_bt_vacuum_needs_cleanup(Relation rel)
{
Buffer metabuf;
Page metapg;
@@ -190,7 +191,7 @@ _bt_vacuum_needs_cleanup(Relation rel, Relation heaprel)
*
* Note that we deliberately avoid using cached version of metapage here.
*/
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
btm_version = metad->btm_version;
@@ -230,7 +231,7 @@ _bt_vacuum_needs_cleanup(Relation rel, Relation heaprel)
* finalized.
*/
void
-_bt_set_cleanup_info(Relation rel, Relation heaprel, BlockNumber num_delpages)
+_bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
{
Buffer metabuf;
Page metapg;
@@ -254,7 +255,7 @@ _bt_set_cleanup_info(Relation rel, Relation heaprel, BlockNumber num_delpages)
* no longer used as of PostgreSQL 14. We set it to -1.0 on rewrite, just
* to be consistent.
*/
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -326,6 +327,9 @@ _bt_set_cleanup_info(Relation rel, Relation heaprel, BlockNumber num_delpages)
* NOTE that the returned root page will have only a read lock set
* on it even if access = BT_WRITE!
*
+ * If access = BT_WRITE, heaprel must be set; otherwise caller should
+ * just pass NULL. See _bt_allocbuf for an explanation.
+ *
* The returned page is not necessarily the true root --- it could be
* a "fast root" (a page that is alone in its level due to deletions).
* Also, if the root page is split while we are "in flight" to it,
@@ -349,6 +353,9 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
uint32 rootlevel;
BTMetaPageData *metad;
+ Assert(access == BT_READ || access == BT_WRITE);
+ Assert(access == BT_READ || heaprel != NULL);
+
/*
* Try to use previously-cached metapage data to find the root. This
* normally saves one buffer access per index search, which is a very
@@ -369,7 +376,7 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
Assert(rootblkno != P_NONE);
rootlevel = metad->btm_fastlevel;
- rootbuf = _bt_getbuf(rel, heaprel, rootblkno, BT_READ);
+ rootbuf = _bt_getbuf(rel, rootblkno, BT_READ);
rootpage = BufferGetPage(rootbuf);
rootopaque = BTPageGetOpaque(rootpage);
@@ -395,7 +402,7 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
rel->rd_amcache = NULL;
}
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metad = _bt_getmeta(rel, metabuf);
/* if no root page initialized yet, do it */
@@ -436,7 +443,7 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
* the new root page. Since this is the first page in the tree, it's
* a leaf as well as the root.
*/
- rootbuf = _bt_getbuf(rel, heaprel, P_NEW, BT_WRITE);
+ rootbuf = _bt_allocbuf(rel, heaprel);
rootblkno = BufferGetBlockNumber(rootbuf);
rootpage = BufferGetPage(rootbuf);
rootopaque = BTPageGetOpaque(rootpage);
@@ -573,7 +580,7 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
* moving to the root --- that'd deadlock against any concurrent root split.)
*/
Buffer
-_bt_gettrueroot(Relation rel, Relation heaprel)
+_bt_gettrueroot(Relation rel)
{
Buffer metabuf;
Page metapg;
@@ -595,7 +602,7 @@ _bt_gettrueroot(Relation rel, Relation heaprel)
pfree(rel->rd_amcache);
rel->rd_amcache = NULL;
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metapg = BufferGetPage(metabuf);
metaopaque = BTPageGetOpaque(metapg);
metad = BTPageGetMeta(metapg);
@@ -668,7 +675,7 @@ _bt_gettrueroot(Relation rel, Relation heaprel)
* about updating previously cached data.
*/
int
-_bt_getrootheight(Relation rel, Relation heaprel)
+_bt_getrootheight(Relation rel)
{
BTMetaPageData *metad;
@@ -676,7 +683,7 @@ _bt_getrootheight(Relation rel, Relation heaprel)
{
Buffer metabuf;
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metad = _bt_getmeta(rel, metabuf);
/*
@@ -732,7 +739,7 @@ _bt_getrootheight(Relation rel, Relation heaprel)
* pg_upgrade'd from Postgres 12.
*/
void
-_bt_metaversion(Relation rel, Relation heaprel, bool *heapkeyspace, bool *allequalimage)
+_bt_metaversion(Relation rel, bool *heapkeyspace, bool *allequalimage)
{
BTMetaPageData *metad;
@@ -740,7 +747,7 @@ _bt_metaversion(Relation rel, Relation heaprel, bool *heapkeyspace, bool *allequ
{
Buffer metabuf;
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metad = _bt_getmeta(rel, metabuf);
/*
@@ -824,8 +831,8 @@ _bt_checkpage(Relation rel, Buffer buf)
* Log the reuse of a page from the FSM.
*/
static void
-_bt_log_reuse_page(Relation rel, Relation heaprel, BlockNumber blkno,
- FullTransactionId safexid)
+_bt_log_reuse_page(Relation rel, BlockNumber blkno, FullTransactionId safexid,
+ bool isCatalogRel)
{
xl_btree_reuse_page xlrec_reuse;
@@ -836,10 +843,10 @@ _bt_log_reuse_page(Relation rel, Relation heaprel, BlockNumber blkno,
*/
/* XLOG stuff */
- xlrec_reuse.isCatalogRel = RelationIsAccessibleInLogicalDecoding(heaprel);
xlrec_reuse.locator = rel->rd_locator;
xlrec_reuse.block = blkno;
xlrec_reuse.snapshotConflictHorizon = safexid;
+ xlrec_reuse.isCatalogRel = isCatalogRel;
XLogBeginInsert();
XLogRegisterData((char *) &xlrec_reuse, SizeOfBtreeReusePage);
@@ -848,10 +855,7 @@ _bt_log_reuse_page(Relation rel, Relation heaprel, BlockNumber blkno,
}
/*
- * _bt_getbuf() -- Get a buffer by block number for read or write.
- *
- * blkno == P_NEW means to get an unallocated index page. The page
- * will be initialized before returning it.
+ * _bt_getbuf() -- Get an existing block in a buffer, for read or write.
*
* The general rule in nbtree is that it's never okay to access a
* page without holding both a buffer pin and a buffer lock on
@@ -860,136 +864,154 @@ _bt_log_reuse_page(Relation rel, Relation heaprel, BlockNumber blkno,
* When this routine returns, the appropriate lock is set on the
* requested buffer and its reference count has been incremented
* (ie, the buffer is "locked and pinned"). Also, we apply
- * _bt_checkpage to sanity-check the page (except in P_NEW case),
- * and perform Valgrind client requests that help Valgrind detect
- * unsafe page accesses.
+ * _bt_checkpage to sanity-check the page, and perform Valgrind
+ * client requests that help Valgrind detect unsafe page accesses.
*
* Note: raw LockBuffer() calls are disallowed in nbtree; all
* buffer lock requests need to go through wrapper functions such
* as _bt_lockbuf().
*/
Buffer
-_bt_getbuf(Relation rel, Relation heaprel, BlockNumber blkno, int access)
+_bt_getbuf(Relation rel, BlockNumber blkno, int access)
{
Buffer buf;
- if (blkno != P_NEW)
+ Assert(BlockNumberIsValid(blkno));
+
+ /* Read an existing block of the relation */
+ buf = ReadBuffer(rel, blkno);
+ _bt_lockbuf(rel, buf, access);
+ _bt_checkpage(rel, buf);
+
+ return buf;
+}
+
+/*
+ * _bt_allocbuf() -- Allocate a new block, returned locked for write.
+ *
+ * Routine returns an unallocated index page, with buffer pin and write lock
+ * still held. The page will be initialized with the appropriate amount of
+ * special area space for an nbtree page, too.
+ *
+ * Callers are required to pass a valid heaprel. We need heaprel so that we
+ * can handle generating a snapshotConflictHorizon that makes reusing a page
+ * from the FSM safe for queries that may be running on standbys.
+ */
+Buffer
+_bt_allocbuf(Relation rel, Relation heaprel)
+{
+ Buffer buf;
+ BlockNumber blkno;
+ Page page;
+
+ Assert(heaprel != NULL);
+
+ /*
+ * First see if the FSM knows of any free pages.
+ *
+ * We can't trust the FSM's report unreservedly; we have to check that the
+ * page is still free. (For example, an already-free page could have been
+ * re-used between the time the last VACUUM scanned it and the time the
+ * VACUUM made its FSM updates.)
+ *
+ * In fact, it's worse than that: we can't even assume that it's safe to
+ * take a lock on the reported page. If somebody else has a lock on it,
+ * or even worse our own caller does, we could deadlock. (The own-caller
+ * scenario is actually not improbable. Consider an index on a serial or
+ * timestamp column. Nearly all splits will be at the rightmost page, so
+ * it's entirely likely that _bt_split will call us while holding a lock
+ * on the page most recently acquired from FSM. A VACUUM running
+ * concurrently with the previous split could well have placed that page
+ * back in FSM.)
+ *
+ * To get around that, we ask for only a conditional lock on the reported
+ * page. If we fail, then someone else is using the page, and we may
+ * reasonably assume it's not free. (If we happen to be wrong, the worst
+ * consequence is the page will be lost to use till the next VACUUM, which
+ * is no big problem.)
+ */
+ for (;;)
{
- /* Read an existing block of the relation */
+ blkno = GetFreeIndexPage(rel);
+ if (blkno == InvalidBlockNumber)
+ break;
buf = ReadBuffer(rel, blkno);
- _bt_lockbuf(rel, buf, access);
- _bt_checkpage(rel, buf);
- }
- else
- {
- Page page;
-
- Assert(access == BT_WRITE);
-
- /*
- * First see if the FSM knows of any free pages.
- *
- * We can't trust the FSM's report unreservedly; we have to check that
- * the page is still free. (For example, an already-free page could
- * have been re-used between the time the last VACUUM scanned it and
- * the time the VACUUM made its FSM updates.)
- *
- * In fact, it's worse than that: we can't even assume that it's safe
- * to take a lock on the reported page. If somebody else has a lock
- * on it, or even worse our own caller does, we could deadlock. (The
- * own-caller scenario is actually not improbable. Consider an index
- * on a serial or timestamp column. Nearly all splits will be at the
- * rightmost page, so it's entirely likely that _bt_split will call us
- * while holding a lock on the page most recently acquired from FSM. A
- * VACUUM running concurrently with the previous split could well have
- * placed that page back in FSM.)
- *
- * To get around that, we ask for only a conditional lock on the
- * reported page. If we fail, then someone else is using the page,
- * and we may reasonably assume it's not free. (If we happen to be
- * wrong, the worst consequence is the page will be lost to use till
- * the next VACUUM, which is no big problem.)
- */
- for (;;)
+ if (_bt_conditionallockbuf(rel, buf))
{
- blkno = GetFreeIndexPage(rel);
- if (blkno == InvalidBlockNumber)
- break;
- buf = ReadBuffer(rel, blkno);
- if (_bt_conditionallockbuf(rel, buf))
+ page = BufferGetPage(buf);
+
+ /*
+ * It's possible to find an all-zeroes page in an index. For
+ * example, a backend might successfully extend the relation
+ * one page and then crash before it is able to make a WAL
+ * entry for adding the page. If we find a zeroed page then
+ * reclaim it immediately.
+ */
+ if (PageIsNew(page))
{
- page = BufferGetPage(buf);
+ /* Okay to use page. Initialize and return it. */
+ _bt_pageinit(page, BufferGetPageSize(buf));
+ return buf;
+ }
+
+ if (BTPageIsRecyclable(page, heaprel))
+ {
+ FullTransactionId safexid;
+ bool isCatalogRel;
/*
- * It's possible to find an all-zeroes page in an index. For
- * example, a backend might successfully extend the relation
- * one page and then crash before it is able to make a WAL
- * entry for adding the page. If we find a zeroed page then
- * reclaim it immediately.
+ * If we are generating WAL for Hot Standby then create a
+ * WAL record that will allow us to conflict with queries
+ * running on standby, in case they have snapshots older
+ * than safexid value
*/
- if (PageIsNew(page))
+ if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
{
- /* Okay to use page. Initialize and return it. */
- _bt_pageinit(page, BufferGetPageSize(buf));
- return buf;
+ safexid = BTPageGetDeleteXid(page);
+ isCatalogRel = RelationIsAccessibleInLogicalDecoding(heaprel);
+ _bt_log_reuse_page(rel, blkno, safexid, isCatalogRel);
}
- if (BTPageIsRecyclable(page, heaprel))
- {
- /*
- * If we are generating WAL for Hot Standby then create a
- * WAL record that will allow us to conflict with queries
- * running on standby, in case they have snapshots older
- * than safexid value
- */
- if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
- _bt_log_reuse_page(rel, heaprel, blkno,
- BTPageGetDeleteXid(page));
-
- /* Okay to use page. Re-initialize and return it. */
- _bt_pageinit(page, BufferGetPageSize(buf));
- return buf;
- }
- elog(DEBUG2, "FSM returned nonrecyclable page");
- _bt_relbuf(rel, buf);
- }
- else
- {
- elog(DEBUG2, "FSM returned nonlockable page");
- /* couldn't get lock, so just drop pin */
- ReleaseBuffer(buf);
+ /* Okay to use page. Re-initialize and return it. */
+ _bt_pageinit(page, BufferGetPageSize(buf));
+ return buf;
}
+ elog(DEBUG2, "FSM returned nonrecyclable page");
+ _bt_relbuf(rel, buf);
+ }
+ else
+ {
+ elog(DEBUG2, "FSM returned nonlockable page");
+ /* couldn't get lock, so just drop pin */
+ ReleaseBuffer(buf);
}
-
- /*
- * Extend the relation by one page. Need to use RBM_ZERO_AND_LOCK or
- * we risk a race condition against btvacuumscan --- see comments
- * therein. This forces us to repeat the valgrind request that
- * _bt_lockbuf() otherwise would make, as we can't use _bt_lockbuf()
- * without introducing a race.
- */
- buf = ExtendBufferedRel(EB_REL(rel), MAIN_FORKNUM, NULL,
- EB_LOCK_FIRST);
- if (!RelationUsesLocalBuffers(rel))
- VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(buf), BLCKSZ);
-
- /* Initialize the new page before returning it */
- page = BufferGetPage(buf);
- Assert(PageIsNew(page));
- _bt_pageinit(page, BufferGetPageSize(buf));
}
- /* ref count and lock type are correct */
+ /*
+ * Extend the relation by one page. Need to use RBM_ZERO_AND_LOCK or we
+ * risk a race condition against btvacuumscan --- see comments therein.
+ * This forces us to repeat the valgrind request that _bt_lockbuf()
+ * otherwise would make, as we can't use _bt_lockbuf() without introducing
+ * a race.
+ */
+ buf = ExtendBufferedRel(EB_REL(rel), MAIN_FORKNUM, NULL, EB_LOCK_FIRST);
+ if (!RelationUsesLocalBuffers(rel))
+ VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(buf), BLCKSZ);
+
+ /* Initialize the new page before returning it */
+ page = BufferGetPage(buf);
+ Assert(PageIsNew(page));
+ _bt_pageinit(page, BufferGetPageSize(buf));
+
return buf;
}
/*
* _bt_relandgetbuf() -- release a locked buffer and get another one.
*
- * This is equivalent to _bt_relbuf followed by _bt_getbuf, with the
- * exception that blkno may not be P_NEW. Also, if obuf is InvalidBuffer
- * then it reduces to just _bt_getbuf; allowing this case simplifies some
- * callers.
+ * This is equivalent to _bt_relbuf followed by _bt_getbuf. Also, if obuf is
+ * InvalidBuffer then it reduces to just _bt_getbuf; allowing this case
+ * simplifies some callers.
*
* The original motivation for using this was to avoid two entries to the
* bufmgr when one would do. However, now it's mainly just a notational
@@ -1001,7 +1023,7 @@ _bt_relandgetbuf(Relation rel, Buffer obuf, BlockNumber blkno, int access)
{
Buffer buf;
- Assert(blkno != P_NEW);
+ Assert(BlockNumberIsValid(blkno));
if (BufferIsValid(obuf))
_bt_unlockbuf(rel, obuf);
buf = ReleaseAndReadBuffer(obuf, rel, blkno);
@@ -1272,14 +1294,14 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
* (a version that lacks the TIDs that are to be deleted).
*
* This is nearly the same as _bt_delitems_vacuum as far as what it does to
- * the page, but it needs its own snapshotConflictHorizon (caller gets this
- * from tableam). This is used by the REDO routine to generate recovery
+ * the page, but it needs its own snapshotConflictHorizon and isCatalogRel
+ * (from the tableam). This is used by the REDO routine to generate recovery
* conflicts. The other difference is that only _bt_delitems_vacuum will
* clear page's VACUUM cycle ID.
*/
static void
-_bt_delitems_delete(Relation rel, Relation heaprel, Buffer buf,
- TransactionId snapshotConflictHorizon,
+_bt_delitems_delete(Relation rel, Buffer buf,
+ TransactionId snapshotConflictHorizon, bool isCatalogRel,
OffsetNumber *deletable, int ndeletable,
BTVacuumPosting *updatable, int nupdatable)
{
@@ -1343,10 +1365,10 @@ _bt_delitems_delete(Relation rel, Relation heaprel, Buffer buf,
XLogRecPtr recptr;
xl_btree_delete xlrec_delete;
- xlrec_delete.isCatalogRel = RelationIsAccessibleInLogicalDecoding(heaprel);
xlrec_delete.snapshotConflictHorizon = snapshotConflictHorizon;
xlrec_delete.ndeleted = ndeletable;
xlrec_delete.nupdated = nupdatable;
+ xlrec_delete.isCatalogRel = isCatalogRel;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1517,6 +1539,7 @@ _bt_delitems_delete_check(Relation rel, Buffer buf, Relation heapRel,
{
Page page = BufferGetPage(buf);
TransactionId snapshotConflictHorizon;
+ bool isCatalogRel;
OffsetNumber postingidxoffnum = InvalidOffsetNumber;
int ndeletable = 0,
nupdatable = 0;
@@ -1525,6 +1548,7 @@ _bt_delitems_delete_check(Relation rel, Buffer buf, Relation heapRel,
/* Use tableam interface to determine which tuples to delete first */
snapshotConflictHorizon = table_index_delete_tuples(heapRel, delstate);
+ isCatalogRel = RelationIsAccessibleInLogicalDecoding(heapRel);
/* Should not WAL-log snapshotConflictHorizon unless it's required */
if (!XLogStandbyInfoActive())
@@ -1670,8 +1694,8 @@ _bt_delitems_delete_check(Relation rel, Buffer buf, Relation heapRel,
}
/* Physically delete tuples (or TIDs) using deletable (or updatable) */
- _bt_delitems_delete(rel, heapRel, buf, snapshotConflictHorizon, deletable,
- ndeletable, updatable, nupdatable);
+ _bt_delitems_delete(rel, buf, snapshotConflictHorizon, isCatalogRel,
+ deletable, ndeletable, updatable, nupdatable);
/* be tidy */
for (int i = 0; i < nupdatable; i++)
@@ -1692,8 +1716,7 @@ _bt_delitems_delete_check(Relation rel, Buffer buf, Relation heapRel,
* same level must always be locked left to right to avoid deadlocks.
*/
static bool
-_bt_leftsib_splitflag(Relation rel, Relation heaprel, BlockNumber leftsib,
- BlockNumber target)
+_bt_leftsib_splitflag(Relation rel, BlockNumber leftsib, BlockNumber target)
{
Buffer buf;
Page page;
@@ -1704,7 +1727,7 @@ _bt_leftsib_splitflag(Relation rel, Relation heaprel, BlockNumber leftsib,
if (leftsib == P_NONE)
return false;
- buf = _bt_getbuf(rel, heaprel, leftsib, BT_READ);
+ buf = _bt_getbuf(rel, leftsib, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
@@ -1750,7 +1773,7 @@ _bt_leftsib_splitflag(Relation rel, Relation heaprel, BlockNumber leftsib,
* to-be-deleted subtree.)
*/
static bool
-_bt_rightsib_halfdeadflag(Relation rel, Relation heaprel, BlockNumber leafrightsib)
+_bt_rightsib_halfdeadflag(Relation rel, BlockNumber leafrightsib)
{
Buffer buf;
Page page;
@@ -1759,7 +1782,7 @@ _bt_rightsib_halfdeadflag(Relation rel, Relation heaprel, BlockNumber leafrights
Assert(leafrightsib != P_NONE);
- buf = _bt_getbuf(rel, heaprel, leafrightsib, BT_READ);
+ buf = _bt_getbuf(rel, leafrightsib, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
@@ -1948,18 +1971,18 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
* marked with INCOMPLETE_SPLIT flag before proceeding
*/
Assert(leafblkno == scanblkno);
- if (_bt_leftsib_splitflag(rel, vstate->info->heaprel, leftsib, leafblkno))
+ if (_bt_leftsib_splitflag(rel, leftsib, leafblkno))
{
ReleaseBuffer(leafbuf);
return;
}
/* we need an insertion scan key for the search, so build one */
- itup_key = _bt_mkscankey(rel, vstate->info->heaprel, targetkey);
+ itup_key = _bt_mkscankey(rel, targetkey);
/* find the leftmost leaf page with matching pivot/high key */
itup_key->pivotsearch = true;
- stack = _bt_search(rel, vstate->info->heaprel, itup_key,
- &sleafbuf, BT_READ, NULL);
+ stack = _bt_search(rel, NULL, itup_key, &sleafbuf, BT_READ,
+ NULL);
/* won't need a second lock or pin on leafbuf */
_bt_relbuf(rel, sleafbuf);
@@ -1990,7 +2013,8 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
* leafbuf page half-dead.
*/
Assert(P_ISLEAF(opaque) && !P_IGNORE(opaque));
- if (!_bt_mark_page_halfdead(rel, vstate->info->heaprel, leafbuf, stack))
+ if (!_bt_mark_page_halfdead(rel, vstate->info->heaprel, leafbuf,
+ stack))
{
_bt_relbuf(rel, leafbuf);
return;
@@ -2053,7 +2077,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
if (!rightsib_empty)
break;
- leafbuf = _bt_getbuf(rel, vstate->info->heaprel, rightsib, BT_WRITE);
+ leafbuf = _bt_getbuf(rel, rightsib, BT_WRITE);
}
}
@@ -2108,7 +2132,7 @@ _bt_mark_page_halfdead(Relation rel, Relation heaprel, Buffer leafbuf,
* delete the downlink. It would fail the "right sibling of target page
* is also the next child in parent page" cross-check below.
*/
- if (_bt_rightsib_halfdeadflag(rel, heaprel, leafrightsib))
+ if (_bt_rightsib_halfdeadflag(rel, leafrightsib))
{
elog(DEBUG1, "could not delete page %u because its right sibling %u is half-dead",
leafblkno, leafrightsib);
@@ -2352,7 +2376,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
Assert(target != leafblkno);
/* Fetch the block number of the target's left sibling */
- buf = _bt_getbuf(rel, vstate->info->heaprel, target, BT_READ);
+ buf = _bt_getbuf(rel, target, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
leftsib = opaque->btpo_prev;
@@ -2379,7 +2403,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
_bt_lockbuf(rel, leafbuf, BT_WRITE);
if (leftsib != P_NONE)
{
- lbuf = _bt_getbuf(rel, vstate->info->heaprel, leftsib, BT_WRITE);
+ lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
page = BufferGetPage(lbuf);
opaque = BTPageGetOpaque(page);
while (P_ISDELETED(opaque) || opaque->btpo_next != target)
@@ -2427,7 +2451,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
CHECK_FOR_INTERRUPTS();
/* step right one page */
- lbuf = _bt_getbuf(rel, vstate->info->heaprel, leftsib, BT_WRITE);
+ lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
page = BufferGetPage(lbuf);
opaque = BTPageGetOpaque(page);
}
@@ -2491,7 +2515,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
* And next write-lock the (current) right sibling.
*/
rightsib = opaque->btpo_next;
- rbuf = _bt_getbuf(rel, vstate->info->heaprel, rightsib, BT_WRITE);
+ rbuf = _bt_getbuf(rel, rightsib, BT_WRITE);
page = BufferGetPage(rbuf);
opaque = BTPageGetOpaque(page);
@@ -2547,8 +2571,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
if (P_RIGHTMOST(opaque))
{
/* rightsib will be the only one left on the level */
- metabuf = _bt_getbuf(rel, vstate->info->heaprel, BTREE_METAPAGE,
- BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -2906,7 +2929,7 @@ _bt_lock_subtree_parent(Relation rel, Relation heaprel, BlockNumber child,
*
* Note: We deliberately avoid completing incomplete splits here.
*/
- if (_bt_leftsib_splitflag(rel, heaprel, leftsibparent, parent))
+ if (_bt_leftsib_splitflag(rel, leftsibparent, parent))
return false;
/* Recurse to examine child page's grandparent page */
@@ -2976,6 +2999,7 @@ _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate)
Relation heaprel = vstate->info->heaprel;
Assert(stats->pages_newly_deleted >= vstate->npendingpages);
+ Assert(heaprel != NULL);
if (vstate->npendingpages == 0)
{
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1ce5b1519..4553aaee5 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -835,7 +835,7 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
if (stats == NULL)
{
/* Check if VACUUM operation can entirely avoid btvacuumscan() call */
- if (!_bt_vacuum_needs_cleanup(info->index, info->heaprel))
+ if (!_bt_vacuum_needs_cleanup(info->index))
return NULL;
/*
@@ -871,7 +871,7 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
*/
Assert(stats->pages_deleted >= stats->pages_free);
num_delpages = stats->pages_deleted - stats->pages_free;
- _bt_set_cleanup_info(info->index, info->heaprel, num_delpages);
+ _bt_set_cleanup_info(info->index, num_delpages);
/*
* It's quite possible for us to be fooled by concurrent page splits into
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 263f75fce..8eb255473 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -42,8 +42,7 @@ static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
ScanDirection dir);
-static Buffer _bt_walk_left(Relation rel, Relation heaprel, Buffer buf,
- Snapshot snapshot);
+static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
@@ -75,6 +74,10 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
* _bt_search() -- Search the tree for a particular scankey,
* or more precisely for the first leaf page it could be on.
*
+ * rel must always be provided. heaprel must be provided by callers that pass
+ * access = BT_WRITE, since we may need to allocate a new root page for caller
+ * in passing (or when finishing a page split results in a parent page split).
+ *
* The passed scankey is an insertion-type scankey (see nbtree/README),
* but it can omit the rightmost column(s) of the index.
*
@@ -222,8 +225,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
*
* If forupdate is true, we will attempt to finish any incomplete splits
* that we encounter. This is required when locking a target page for an
- * insertion, because we don't allow inserting on a page before the split
- * is completed. 'stack' is only used if forupdate is true.
+ * insertion, because we don't allow inserting on a page before the split is
+ * completed. 'heaprel' and 'stack' are only used if forupdate is true.
*
* On entry, we have the buffer pinned and a lock of the type specified by
* 'access'. If we move right, we release the buffer and lock and acquire
@@ -295,7 +298,7 @@ _bt_moveright(Relation rel,
_bt_relbuf(rel, buf);
/* re-acquire the lock in the right mode, and re-check */
- buf = _bt_getbuf(rel, heaprel, blkno, access);
+ buf = _bt_getbuf(rel, blkno, access);
continue;
}
@@ -862,7 +865,6 @@ bool
_bt_first(IndexScanDesc scan, ScanDirection dir)
{
Relation rel = scan->indexRelation;
- Relation heaprel = scan->heapRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Buffer buf;
BTStack stack;
@@ -1355,7 +1357,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
}
/* Initialize remaining insertion scan key fields */
- _bt_metaversion(rel, heaprel, &inskey.heapkeyspace, &inskey.allequalimage);
+ _bt_metaversion(rel, &inskey.heapkeyspace, &inskey.allequalimage);
inskey.anynullkeys = false; /* unused */
inskey.nextkey = nextkey;
inskey.pivotsearch = false;
@@ -1366,7 +1368,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* Use the manufactured insertion scan key to descend the tree and
* position ourselves on the target leaf page.
*/
- stack = _bt_search(rel, heaprel, &inskey, &buf, BT_READ, scan->xs_snapshot);
+ stack = _bt_search(rel, NULL, &inskey, &buf, BT_READ, scan->xs_snapshot);
/* don't need to keep the stack around... */
_bt_freestack(stack);
@@ -2007,7 +2009,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
/* step right one page */
- so->currPos.buf = _bt_getbuf(rel, scan->heapRelation, blkno, BT_READ);
+ so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
page = BufferGetPage(so->currPos.buf);
TestForOldSnapshot(scan->xs_snapshot, rel, page);
opaque = BTPageGetOpaque(page);
@@ -2081,8 +2083,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
if (BTScanPosIsPinned(so->currPos))
_bt_lockbuf(rel, so->currPos.buf, BT_READ);
else
- so->currPos.buf = _bt_getbuf(rel, scan->heapRelation,
- so->currPos.currPage, BT_READ);
+ so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
for (;;)
{
@@ -2096,8 +2097,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
}
/* Step to next physical page */
- so->currPos.buf = _bt_walk_left(rel, scan->heapRelation,
- so->currPos.buf, scan->xs_snapshot);
+ so->currPos.buf = _bt_walk_left(rel, so->currPos.buf,
+ scan->xs_snapshot);
/* if we're physically at end of index, return failure */
if (so->currPos.buf == InvalidBuffer)
@@ -2144,8 +2145,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
BTScanPosInvalidate(so->currPos);
return false;
}
- so->currPos.buf = _bt_getbuf(rel, scan->heapRelation, blkno,
- BT_READ);
+ so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
}
}
}
@@ -2190,7 +2190,7 @@ _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
* again if it's important.
*/
static Buffer
-_bt_walk_left(Relation rel, Relation heaprel, Buffer buf, Snapshot snapshot)
+_bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot)
{
Page page;
BTPageOpaque opaque;
@@ -2218,7 +2218,7 @@ _bt_walk_left(Relation rel, Relation heaprel, Buffer buf, Snapshot snapshot)
_bt_relbuf(rel, buf);
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
- buf = _bt_getbuf(rel, heaprel, blkno, BT_READ);
+ buf = _bt_getbuf(rel, blkno, BT_READ);
page = BufferGetPage(buf);
TestForOldSnapshot(snapshot, rel, page);
opaque = BTPageGetOpaque(page);
@@ -2309,7 +2309,7 @@ _bt_walk_left(Relation rel, Relation heaprel, Buffer buf, Snapshot snapshot)
* The returned buffer is pinned and read-locked.
*/
Buffer
-_bt_get_endpoint(Relation rel, Relation heaprel, uint32 level, bool rightmost,
+_bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot)
{
Buffer buf;
@@ -2325,9 +2325,9 @@ _bt_get_endpoint(Relation rel, Relation heaprel, uint32 level, bool rightmost,
* smarter about intermediate levels.)
*/
if (level == 0)
- buf = _bt_getroot(rel, heaprel, BT_READ);
+ buf = _bt_getroot(rel, NULL, BT_READ);
else
- buf = _bt_gettrueroot(rel, heaprel);
+ buf = _bt_gettrueroot(rel);
if (!BufferIsValid(buf))
return InvalidBuffer;
@@ -2408,8 +2408,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
* version of _bt_search(). We don't maintain a stack since we know we
* won't need it.
*/
- buf = _bt_get_endpoint(rel, scan->heapRelation, 0,
- ScanDirectionIsBackward(dir), scan->xs_snapshot);
+ buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(dir), scan->xs_snapshot);
if (!BufferIsValid(buf))
{
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 6ad3f3c54..c2665fce4 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -566,7 +566,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
wstate.heap = btspool->heap;
wstate.index = btspool->index;
- wstate.inskey = _bt_mkscankey(wstate.index, btspool->heap, NULL);
+ wstate.inskey = _bt_mkscankey(wstate.index, NULL);
/* _bt_mkscankey() won't set allequalimage without metapage */
wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 05abf3603..7da499c4d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -87,7 +87,7 @@ static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
* field themselves.
*/
BTScanInsert
-_bt_mkscankey(Relation rel, Relation heaprel, IndexTuple itup)
+_bt_mkscankey(Relation rel, IndexTuple itup)
{
BTScanInsert key;
ScanKey skey;
@@ -112,7 +112,7 @@ _bt_mkscankey(Relation rel, Relation heaprel, IndexTuple itup)
key = palloc(offsetof(BTScanInsertData, scankeys) +
sizeof(ScanKeyData) * indnkeyatts);
if (itup)
- _bt_metaversion(rel, heaprel, &key->heapkeyspace, &key->allequalimage);
+ _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
else
{
/* Utility statement callers can set these fields themselves */
@@ -1761,8 +1761,7 @@ _bt_killitems(IndexScanDesc scan)
droppedpin = true;
/* Attempt to re-read the buffer, getting pin and lock. */
- buf = _bt_getbuf(scan->indexRelation, scan->heapRelation,
- so->currPos.currPage, BT_READ);
+ buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage, BT_READ);
page = BufferGetPage(buf);
if (BufferGetLSNAtomic(buf) == so->currPos.lsn)
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 65adf04c4..39932d3c2 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -462,7 +462,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
* For btrees, get tree height while we have the index
* open
*/
- info->tree_height = _bt_getrootheight(indexRelation, relation);
+ info->tree_height = _bt_getrootheight(indexRelation);
}
else
{
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 018810692..eb6cfcfd0 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -207,7 +207,6 @@ tuplesort_begin_heap(TupleDesc tupDesc,
Tuplesortstate *
tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
- Relation heaprel,
int workMem,
SortCoordinate coordinate, int sortopt)
{
@@ -261,7 +260,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
arg->tupDesc = tupDesc; /* assume we need not copy tupDesc */
- indexScanKey = _bt_mkscankey(indexRel, heaprel, NULL);
+ indexScanKey = _bt_mkscankey(indexRel, NULL);
if (arg->indexInfo->ii_Expressions != NULL)
{
@@ -362,7 +361,7 @@ tuplesort_begin_index_btree(Relation heapRel,
arg->enforceUnique = enforceUnique;
arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
- indexScanKey = _bt_mkscankey(indexRel, heapRel, NULL);
+ indexScanKey = _bt_mkscankey(indexRel, NULL);
/* Prepare SortSupport data for each column */
base->sortKeys = (SortSupport) palloc0(base->nKeys *
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 6979aff72..94a975932 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -183,7 +183,6 @@ static inline bool invariant_l_nontarget_offset(BtreeCheckState *state,
OffsetNumber upperbound);
static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
static inline BTScanInsert bt_mkscankey_pivotsearch(Relation rel,
- Relation heaprel,
IndexTuple itup);
static ItemId PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block,
Page page, OffsetNumber offset);
@@ -332,7 +331,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
RelationGetRelationName(indrel))));
/* Extract metadata from metapage, and sanitize it in passing */
- _bt_metaversion(indrel, heaprel, &heapkeyspace, &allequalimage);
+ _bt_metaversion(indrel, &heapkeyspace, &allequalimage);
if (allequalimage && !heapkeyspace)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1259,7 +1258,7 @@ bt_target_page_check(BtreeCheckState *state)
}
/* Build insertion scankey for current page offset */
- skey = bt_mkscankey_pivotsearch(state->rel, state->heaprel, itup);
+ skey = bt_mkscankey_pivotsearch(state->rel, itup);
/*
* Make sure tuple size does not exceed the relevant BTREE_VERSION
@@ -1769,7 +1768,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
* memory remaining allocated.
*/
firstitup = (IndexTuple) PageGetItem(rightpage, rightitem);
- return bt_mkscankey_pivotsearch(state->rel, state->heaprel, firstitup);
+ return bt_mkscankey_pivotsearch(state->rel, firstitup);
}
/*
@@ -2682,7 +2681,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
Buffer lbuf;
bool exists;
- key = _bt_mkscankey(state->rel, state->heaprel, itup);
+ key = _bt_mkscankey(state->rel, itup);
Assert(key->heapkeyspace && key->scantid != NULL);
/*
@@ -2695,7 +2694,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
*/
Assert(state->readonly && state->rootdescend);
exists = false;
- stack = _bt_search(state->rel, state->heaprel, key, &lbuf, BT_READ, NULL);
+ stack = _bt_search(state->rel, NULL, key, &lbuf, BT_READ, NULL);
if (BufferIsValid(lbuf))
{
@@ -3134,11 +3133,11 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
* the scankey is greater.
*/
static inline BTScanInsert
-bt_mkscankey_pivotsearch(Relation rel, Relation heaprel, IndexTuple itup)
+bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
{
BTScanInsert skey;
- skey = _bt_mkscankey(rel, heaprel, itup);
+ skey = _bt_mkscankey(rel, itup);
skey->pivotsearch = true;
return skey;
--
2.40.1
On Thu, May 25, 2023 at 06:50:31PM -0700, Peter Geoghegan wrote:
> It's possible that Bertrand would have done it this way to begin with
> were it not for the admittedly pretty bad nbtree convention around
> P_NEW. It would be nice to get rid of P_NEW in the near future, too --
> I gather that there was discussion of that in the context of recent
> work in this area.
Nice cleanup overall.
+ if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
{
- /* Okay to use page. Initialize and return it. */
- _bt_pageinit(page, BufferGetPageSize(buf));
- return buf;
+ safexid = BTPageGetDeleteXid(page);
+ isCatalogRel = RelationIsAccessibleInLogicalDecoding(heaprel);
+ _bt_log_reuse_page(rel, blkno, safexid, isCatalogRel);
There is only one caller of _bt_log_reuse_page(), so assigning a
boolean rather than the heap relation is a bit strange to me. I think
that calling RelationIsAccessibleInLogicalDecoding() within
_bt_log_reuse_page() where xlrec_reuse is filled with its data is much
more natural, like HEAD. One argument in favor of HEAD is that it is
not possible to pass down a wrong value for isCatalogRel, but your
patch would make that possible.
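To illustrate that last point with a hypothetical caller (not code from
either version): with the HEAD signature the flag is derived inside the
function, while with the v1 signature nothing stops a caller from
supplying a bogus value:

    /* HEAD: isCatalogRel cannot be wrong, it is computed from heaprel */
    _bt_log_reuse_page(rel, heaprel, blkno, safexid);

    /* v1 patch: a careless caller could pass a stale or hard-coded flag */
    _bt_log_reuse_page(rel, blkno, safexid, false);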
--
Michael
On 2023-May-25, Peter Geoghegan wrote:
> Attached patch completely removes the changes to _bt_getbuf()'s
> signature from 61b313e4.
I suppose you're not thinking of applying this to current master, but
instead just leave it for when pg17 opens, right? I mean, clearly it
seems far too invasive to put it in after beta1. On the other hand,
it's painful to know that we're going to have code that exists only on
16 and not any other release, in an area that's likely to have bugs here
and there, so we're going to need to heavily adjust backpatches for 16
especially.
I can't make up my mind about this. What do others think?
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"La primera ley de las demostraciones en vivo es: no trate de usar el sistema.
Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen)
On Fri, May 26, 2023 at 12:46 AM Michael Paquier <michael@paquier.xyz> wrote:
> Nice cleanup overall.
Thanks.
To be clear, when I said "it would be nice to get rid of P_NEW", what
I meant was that it would be nice to go much further than what I've
done in the patch by getting rid of the general idea of P_NEW. So the
handling of P_NEW at the top of ReadBuffer_common() would be removed,
for example. (Note that nbtree doesn't actually rely on that code, even
now; its use of the P_NEW constant merely creates the impression that it
needs that bufmgr.c code.)
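Concretely, the distinction is between the generic bufmgr convention and
the explicit extension call that the patch's _bt_allocbuf() already uses
(a simplified sketch, not a hunk from the patch):

    /* generic P_NEW convention: bufmgr.c chooses the new block */
    buf = ReadBuffer(rel, P_NEW);

    /* what nbtree does in _bt_allocbuf() */
    buf = ExtendBufferedRel(EB_REL(rel), MAIN_FORKNUM, NULL, EB_LOCK_FIRST);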
> + if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
>   {
> -   /* Okay to use page. Initialize and return it. */
> -   _bt_pageinit(page, BufferGetPageSize(buf));
> -   return buf;
> +   safexid = BTPageGetDeleteXid(page);
> +   isCatalogRel = RelationIsAccessibleInLogicalDecoding(heaprel);
> +   _bt_log_reuse_page(rel, blkno, safexid, isCatalogRel);
>
> There is only one caller of _bt_log_reuse_page(), so assigning a
> boolean rather than the heap relation is a bit strange to me. I think
> that calling RelationIsAccessibleInLogicalDecoding() within
> _bt_log_reuse_page() where xlrec_reuse is filled with its data is much
> more natural, like HEAD.
Attached is v2, which deals with this by moving the code from
_bt_log_reuse_page() into _bt_allocbuf() itself -- there is no need
for a separate logging function. This structure seems like a clear
improvement, since such logging is largely the point of having a
separate _bt_allocbuf() function that deals with new page allocations
and requires a valid heapRel in all cases.
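Based on the v1 code, the inlined logging in _bt_allocbuf() presumably
ends up looking something like this (a sketch of the recyclable-page
branch only, not the exact v2 hunk):

    if (BTPageIsRecyclable(page, heapRel))
    {
        /*
         * If we are generating WAL for Hot Standby then create a WAL
         * record that will allow us to conflict with queries running on
         * standby, in case they have snapshots older than the page's
         * safexid value.
         */
        if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
        {
            xl_btree_reuse_page xlrec_reuse;

            xlrec_reuse.locator = rel->rd_locator;
            xlrec_reuse.block = blkno;
            xlrec_reuse.snapshotConflictHorizon = BTPageGetDeleteXid(page);
            xlrec_reuse.isCatalogRel =
                RelationIsAccessibleInLogicalDecoding(heapRel);

            XLogBeginInsert();
            XLogRegisterData((char *) &xlrec_reuse, SizeOfBtreeReusePage);
            XLogInsert(RM_BTREE_ID, XLOG_BTREE_REUSE_PAGE);
        }

        /* Okay to use page.  Re-initialize and return it. */
        _bt_pageinit(page, BufferGetPageSize(buf));
        return buf;
    }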
v2 also renames "heaprel" to "heapRel" in function signatures, for
consistency with older code that always used that convention.
--
Peter Geoghegan
Attachments:
v2-0001-nbtree-Remove-use-of-P_NEW.patch (application/octet-stream)
From 6f5a304570ca3946d39b02652b32ab16b92c2f03 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 23 May 2023 22:08:56 -0700
Subject: [PATCH v2 1/2] nbtree: Remove use of P_NEW.
---
src/include/access/nbtree.h | 38 ++-
src/include/utils/tuplesort.h | 4 +-
src/backend/access/heap/heapam_handler.c | 9 +-
src/backend/access/nbtree/nbtinsert.c | 86 ++---
src/backend/access/nbtree/nbtpage.c | 371 +++++++++++----------
src/backend/access/nbtree/nbtree.c | 4 +-
src/backend/access/nbtree/nbtsearch.c | 59 ++--
src/backend/access/nbtree/nbtsort.c | 2 +-
src/backend/access/nbtree/nbtutils.c | 7 +-
src/backend/optimizer/util/plancat.c | 2 +-
src/backend/utils/sort/tuplesortvariants.c | 5 +-
contrib/amcheck/verify_nbtree.c | 15 +-
12 files changed, 305 insertions(+), 297 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d68478609..b690e2476 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -288,16 +288,19 @@ BTPageGetDeleteXid(Page page)
* well need special handling for new pages anyway.
*/
static inline bool
-BTPageIsRecyclable(Page page, Relation heaprel)
+BTPageIsRecyclable(Page page, Relation heapRel)
{
BTPageOpaque opaque;
Assert(!PageIsNew(page));
+ Assert(heapRel != NULL);
/* Recycling okay iff page is deleted and safexid is old enough */
opaque = BTPageGetOpaque(page);
if (P_ISDELETED(opaque))
{
+ FullTransactionId safexid = BTPageGetDeleteXid(page);
+
/*
* The page was deleted, but when? If it was just deleted, a scan
* might have seen the downlink to it, and will read the page later.
@@ -308,7 +311,7 @@ BTPageIsRecyclable(Page page, Relation heaprel)
* anyone. If not, then no scan that's still in progress could have
* seen its downlink, and we can recycle it.
*/
- return GlobalVisCheckRemovableFullXid(heaprel, BTPageGetDeleteXid(page));
+ return GlobalVisCheckRemovableFullXid(heapRel, safexid);
}
return false;
@@ -1177,9 +1180,9 @@ extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, bool indexUnchanged,
Relation heapRel);
-extern void _bt_finish_split(Relation rel, Relation heaprel, Buffer lbuf,
+extern void _bt_finish_split(Relation rel, Relation heapRel, Buffer lbuf,
BTStack stack);
-extern Buffer _bt_getstackbuf(Relation rel, Relation heaprel, BTStack stack,
+extern Buffer _bt_getstackbuf(Relation rel, Relation heapRel, BTStack stack,
BlockNumber child);
/*
@@ -1194,18 +1197,17 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page origpage,
*/
extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
bool allequalimage);
-extern bool _bt_vacuum_needs_cleanup(Relation rel, Relation heaprel);
-extern void _bt_set_cleanup_info(Relation rel, Relation heaprel,
- BlockNumber num_delpages);
+extern bool _bt_vacuum_needs_cleanup(Relation rel);
+extern void _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages);
extern void _bt_upgrademetapage(Page page);
-extern Buffer _bt_getroot(Relation rel, Relation heaprel, int access);
-extern Buffer _bt_gettrueroot(Relation rel, Relation heaprel);
-extern int _bt_getrootheight(Relation rel, Relation heaprel);
-extern void _bt_metaversion(Relation rel, Relation heaprel, bool *heapkeyspace,
+extern Buffer _bt_getroot(Relation rel, Relation heapRel, int access);
+extern Buffer _bt_gettrueroot(Relation rel);
+extern int _bt_getrootheight(Relation rel);
+extern void _bt_metaversion(Relation rel, bool *heapkeyspace,
bool *allequalimage);
extern void _bt_checkpage(Relation rel, Buffer buf);
-extern Buffer _bt_getbuf(Relation rel, Relation heaprel, BlockNumber blkno,
- int access);
+extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
+extern Buffer _bt_allocbuf(Relation rel, Relation heapRel);
extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
BlockNumber blkno, int access);
extern void _bt_relbuf(Relation rel, Buffer buf);
@@ -1228,22 +1230,22 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
/*
* prototypes for functions in nbtsearch.c
*/
-extern BTStack _bt_search(Relation rel, Relation heaprel, BTScanInsert key,
+extern BTStack _bt_search(Relation rel, Relation heapRel, BTScanInsert key,
Buffer *bufP, int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
+extern Buffer _bt_moveright(Relation rel, Relation heapRel, BTScanInsert key,
Buffer buf, bool forupdate, BTStack stack,
int access, Snapshot snapshot);
extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
-extern Buffer _bt_get_endpoint(Relation rel, Relation heaprel, uint32 level,
- bool rightmost, Snapshot snapshot);
+extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
+ Snapshot snapshot);
/*
* prototypes for functions in nbtutils.c
*/
-extern BTScanInsert _bt_mkscankey(Relation rel, Relation heaprel, IndexTuple itup);
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
extern void _bt_freestack(BTStack stack);
extern void _bt_preprocess_array_keys(IndexScanDesc scan);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 656a2e989..af057b635 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -399,9 +399,7 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
int workMem, SortCoordinate coordinate,
int sortopt);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
- Relation indexRel,
- Relation heaprel,
- int workMem,
+ Relation indexRel, int workMem,
SortCoordinate coordinate,
int sortopt);
extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 646135cc2..0755be839 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -731,14 +731,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*multi_cutoff);
- /*
- * Set up sorting if wanted. NewHeap is being passed to
- * tuplesort_begin_cluster(), it could have been OldHeap too. It does not
- * really matter, as the goal is to have a heap relation being passed to
- * _bt_log_reuse_page() (which should not be called from this code path).
- */
+ /* Set up sorting if wanted */
if (use_sort)
- tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex, NewHeap,
+ tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex,
maintenance_work_mem,
NULL, TUPLESORT_NONE);
else
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 84bbd78d5..4bbeb9742 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -30,7 +30,7 @@
#define BTREE_FASTPATH_MIN_LEVEL 2
-static BTStack _bt_search_insert(Relation rel, Relation heaprel,
+static BTStack _bt_search_insert(Relation rel, Relation heapRel,
BTInsertState insertstate);
static TransactionId _bt_check_unique(Relation rel, BTInsertState insertstate,
Relation heapRel,
@@ -42,9 +42,9 @@ static OffsetNumber _bt_findinsertloc(Relation rel,
bool indexUnchanged,
BTStack stack,
Relation heapRel);
-static void _bt_stepright(Relation rel, Relation heaprel,
+static void _bt_stepright(Relation rel, Relation heapRel,
BTInsertState insertstate, BTStack stack);
-static void _bt_insertonpg(Relation rel, Relation heaprel, BTScanInsert itup_key,
+static void _bt_insertonpg(Relation rel, Relation heapRel, BTScanInsert itup_key,
Buffer buf,
Buffer cbuf,
BTStack stack,
@@ -53,13 +53,13 @@ static void _bt_insertonpg(Relation rel, Relation heaprel, BTScanInsert itup_key
OffsetNumber newitemoff,
int postingoff,
bool split_only_page);
-static Buffer _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key,
+static Buffer _bt_split(Relation rel, Relation heapRel, BTScanInsert itup_key,
Buffer buf, Buffer cbuf, OffsetNumber newitemoff,
Size newitemsz, IndexTuple newitem, IndexTuple orignewitem,
IndexTuple nposting, uint16 postingoff);
-static void _bt_insert_parent(Relation rel, Relation heaprel, Buffer buf,
+static void _bt_insert_parent(Relation rel, Relation heapRel, Buffer buf,
Buffer rbuf, BTStack stack, bool isroot, bool isonly);
-static Buffer _bt_newroot(Relation rel, Relation heaprel, Buffer lbuf, Buffer rbuf);
+static Buffer _bt_newlevel(Relation rel, Relation heapRel, Buffer lbuf, Buffer rbuf);
static inline bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off, bool newfirstdataitem);
static void _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
@@ -110,7 +110,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
/* we need an insertion scan key to do our search, so build one */
- itup_key = _bt_mkscankey(rel, heapRel, itup);
+ itup_key = _bt_mkscankey(rel, itup);
if (checkingunique)
{
@@ -314,7 +314,7 @@ search:
* since each per-backend cache won't stay valid for long.
*/
static BTStack
-_bt_search_insert(Relation rel, Relation heaprel, BTInsertState insertstate)
+_bt_search_insert(Relation rel, Relation heapRel, BTInsertState insertstate)
{
Assert(insertstate->buf == InvalidBuffer);
Assert(!insertstate->bounds_valid);
@@ -377,7 +377,7 @@ _bt_search_insert(Relation rel, Relation heaprel, BTInsertState insertstate)
}
/* Cannot use optimization -- descend tree, return proper descent stack */
- return _bt_search(rel, heaprel, insertstate->itup_key, &insertstate->buf,
+ return _bt_search(rel, heapRel, insertstate->itup_key, &insertstate->buf,
BT_WRITE, NULL);
}
@@ -1024,13 +1024,15 @@ _bt_findinsertloc(Relation rel,
* indexes.
*/
static void
-_bt_stepright(Relation rel, Relation heaprel, BTInsertState insertstate, BTStack stack)
+_bt_stepright(Relation rel, Relation heapRel, BTInsertState insertstate,
+ BTStack stack)
{
Page page;
BTPageOpaque opaque;
Buffer rbuf;
BlockNumber rblkno;
+ Assert(heapRel != NULL);
page = BufferGetPage(insertstate->buf);
opaque = BTPageGetOpaque(page);
@@ -1050,7 +1052,7 @@ _bt_stepright(Relation rel, Relation heaprel, BTInsertState insertstate, BTStack
*/
if (P_INCOMPLETE_SPLIT(opaque))
{
- _bt_finish_split(rel, heaprel, rbuf, stack);
+ _bt_finish_split(rel, heapRel, rbuf, stack);
rbuf = InvalidBuffer;
continue;
}
@@ -1101,7 +1103,7 @@ _bt_stepright(Relation rel, Relation heaprel, BTInsertState insertstate, BTStack
*/
static void
_bt_insertonpg(Relation rel,
- Relation heaprel,
+ Relation heapRel,
BTScanInsert itup_key,
Buffer buf,
Buffer cbuf,
@@ -1145,7 +1147,7 @@ _bt_insertonpg(Relation rel,
/*
* Every internal page should have exactly one negative infinity item at
- * all times. Only _bt_split() and _bt_newroot() should add items that
+ * all times. Only _bt_split() and _bt_newlevel() should add items that
* become negative infinity items through truncation, since they're the
* only routines that allocate new internal pages.
*/
@@ -1212,7 +1214,7 @@ _bt_insertonpg(Relation rel,
Assert(!split_only_page);
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, heaprel, itup_key, buf, cbuf, newitemoff, itemsz,
+ rbuf = _bt_split(rel, heapRel, itup_key, buf, cbuf, newitemoff, itemsz,
itup, origitup, nposting, postingoff);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
@@ -1236,7 +1238,7 @@ _bt_insertonpg(Relation rel,
* page.
*----------
*/
- _bt_insert_parent(rel, heaprel, buf, rbuf, stack, isroot, isonly);
+ _bt_insert_parent(rel, heapRel, buf, rbuf, stack, isroot, isonly);
}
else
{
@@ -1250,14 +1252,14 @@ _bt_insertonpg(Relation rel,
* only one on its tree level, but was not the root, it may have been
* the "fast root". We need to ensure that the fast root link points
* at or above the current page. We can safely acquire a lock on the
- * metapage here --- see comments for _bt_newroot().
+ * metapage here --- see comments for _bt_newlevel().
*/
if (unlikely(split_only_page))
{
Assert(!isleaf);
Assert(BufferIsValid(cbuf));
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -1421,7 +1423,7 @@ _bt_insertonpg(Relation rel,
* call _bt_getrootheight while holding a buffer lock.
*/
if (BlockNumberIsValid(blockcache) &&
- _bt_getrootheight(rel, heaprel) >= BTREE_FASTPATH_MIN_LEVEL)
+ _bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
RelationSetTargetBlock(rel, blockcache);
}
@@ -1462,7 +1464,7 @@ _bt_insertonpg(Relation rel,
* The pin and lock on buf are maintained.
*/
static Buffer
-_bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
+_bt_split(Relation rel, Relation heapRel, BTScanInsert itup_key, Buffer buf,
Buffer cbuf, OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
IndexTuple orignewitem, IndexTuple nposting, uint16 postingoff)
{
@@ -1715,7 +1717,7 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
* way because it avoids an unnecessary PANIC when either origpage or its
* existing sibling page are corrupt.
*/
- rbuf = _bt_getbuf(rel, heaprel, P_NEW, BT_WRITE);
+ rbuf = _bt_allocbuf(rel, heapRel);
rightpage = BufferGetPage(rbuf);
rightpagenumber = BufferGetBlockNumber(rbuf);
/* rightpage was initialized by _bt_getbuf */
@@ -1888,7 +1890,7 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
*/
if (!isrightmost)
{
- sbuf = _bt_getbuf(rel, heaprel, oopaque->btpo_next, BT_WRITE);
+ sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE);
spage = BufferGetPage(sbuf);
sopaque = BTPageGetOpaque(spage);
if (sopaque->btpo_prev != origpagenumber)
@@ -2095,13 +2097,15 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
*/
static void
_bt_insert_parent(Relation rel,
- Relation heaprel,
+ Relation heapRel,
Buffer buf,
Buffer rbuf,
BTStack stack,
bool isroot,
bool isonly)
{
+ Assert(heapRel != NULL);
+
/*
* Here we have to do something Lehman and Yao don't talk about: deal with
* a root split and construction of a new root. If our stack is empty
@@ -2121,8 +2125,8 @@ _bt_insert_parent(Relation rel,
Assert(stack == NULL);
Assert(isonly);
- /* create a new root node and update the metapage */
- rootbuf = _bt_newroot(rel, heaprel, buf, rbuf);
+ /* create a new root node one level up and update the metapage */
+ rootbuf = _bt_newlevel(rel, heapRel, buf, rbuf);
/* release the split buffers */
_bt_relbuf(rel, rootbuf);
_bt_relbuf(rel, rbuf);
@@ -2161,8 +2165,7 @@ _bt_insert_parent(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* Find the leftmost page at the next level up */
- pbuf = _bt_get_endpoint(rel, heaprel, opaque->btpo_level + 1, false,
- NULL);
+ pbuf = _bt_get_endpoint(rel, opaque->btpo_level + 1, false, NULL);
/* Set up a phony stack entry pointing there */
stack = &fakestack;
stack->bts_blkno = BufferGetBlockNumber(pbuf);
@@ -2188,7 +2191,7 @@ _bt_insert_parent(Relation rel,
* new downlink will be inserted at the correct offset. Even buf's
* parent may have changed.
*/
- pbuf = _bt_getstackbuf(rel, heaprel, stack, bknum);
+ pbuf = _bt_getstackbuf(rel, heapRel, stack, bknum);
/*
* Unlock the right child. The left child will be unlocked in
@@ -2212,7 +2215,7 @@ _bt_insert_parent(Relation rel,
RelationGetRelationName(rel), bknum, rbknum)));
/* Recursively insert into the parent */
- _bt_insertonpg(rel, heaprel, NULL, pbuf, buf, stack->bts_parent,
+ _bt_insertonpg(rel, heapRel, NULL, pbuf, buf, stack->bts_parent,
new_item, MAXALIGN(IndexTupleSize(new_item)),
stack->bts_offset + 1, 0, isonly);
@@ -2230,9 +2233,12 @@ _bt_insert_parent(Relation rel,
*
* On entry, 'lbuf' must be locked in write-mode. On exit, it is unlocked
* and unpinned.
+ *
+ * Caller must provide a valid heapRel, since finishing a page split will
+ * require allocating a new page when the parent page splits in turn.
*/
void
-_bt_finish_split(Relation rel, Relation heaprel, Buffer lbuf, BTStack stack)
+_bt_finish_split(Relation rel, Relation heapRel, Buffer lbuf, BTStack stack)
{
Page lpage = BufferGetPage(lbuf);
BTPageOpaque lpageop = BTPageGetOpaque(lpage);
@@ -2243,9 +2249,10 @@ _bt_finish_split(Relation rel, Relation heaprel, Buffer lbuf, BTStack stack)
bool wasonly;
Assert(P_INCOMPLETE_SPLIT(lpageop));
+ Assert(heapRel != NULL);
/* Lock right sibling, the one missing the downlink */
- rbuf = _bt_getbuf(rel, heaprel, lpageop->btpo_next, BT_WRITE);
+ rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE);
rpage = BufferGetPage(rbuf);
rpageop = BTPageGetOpaque(rpage);
@@ -2257,7 +2264,7 @@ _bt_finish_split(Relation rel, Relation heaprel, Buffer lbuf, BTStack stack)
BTMetaPageData *metad;
/* acquire lock on the metapage */
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -2274,7 +2281,7 @@ _bt_finish_split(Relation rel, Relation heaprel, Buffer lbuf, BTStack stack)
elog(DEBUG1, "finishing incomplete split of %u/%u",
BufferGetBlockNumber(lbuf), BufferGetBlockNumber(rbuf));
- _bt_insert_parent(rel, heaprel, lbuf, rbuf, stack, wasroot, wasonly);
+ _bt_insert_parent(rel, heapRel, lbuf, rbuf, stack, wasroot, wasonly);
}
/*
@@ -2309,7 +2316,7 @@ _bt_finish_split(Relation rel, Relation heaprel, Buffer lbuf, BTStack stack)
* offset number bts_offset + 1.
*/
Buffer
-_bt_getstackbuf(Relation rel, Relation heaprel, BTStack stack, BlockNumber child)
+_bt_getstackbuf(Relation rel, Relation heapRel, BTStack stack, BlockNumber child)
{
BlockNumber blkno;
OffsetNumber start;
@@ -2323,13 +2330,14 @@ _bt_getstackbuf(Relation rel, Relation heaprel, BTStack stack, BlockNumber child
Page page;
BTPageOpaque opaque;
- buf = _bt_getbuf(rel, heaprel, blkno, BT_WRITE);
+ buf = _bt_getbuf(rel, blkno, BT_WRITE);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
+ Assert(heapRel != NULL);
if (P_INCOMPLETE_SPLIT(opaque))
{
- _bt_finish_split(rel, heaprel, buf, stack->bts_parent);
+ _bt_finish_split(rel, heapRel, buf, stack->bts_parent);
continue;
}
@@ -2415,7 +2423,7 @@ _bt_getstackbuf(Relation rel, Relation heaprel, BTStack stack, BlockNumber child
}
/*
- * _bt_newroot() -- Create a new root page for the index.
+ * _bt_newlevel() -- Create a new level above former root page.
*
* We've just split the old root page and need to create a new one.
* In order to do this, we add a new root page to the file, then lock
@@ -2433,7 +2441,7 @@ _bt_getstackbuf(Relation rel, Relation heaprel, BTStack stack, BlockNumber child
* lbuf, rbuf & rootbuf.
*/
static Buffer
-_bt_newroot(Relation rel, Relation heaprel, Buffer lbuf, Buffer rbuf)
+_bt_newlevel(Relation rel, Relation heapRel, Buffer lbuf, Buffer rbuf)
{
Buffer rootbuf;
Page lpage,
@@ -2459,12 +2467,12 @@ _bt_newroot(Relation rel, Relation heaprel, Buffer lbuf, Buffer rbuf)
lopaque = BTPageGetOpaque(lpage);
/* get a new root page */
- rootbuf = _bt_getbuf(rel, heaprel, P_NEW, BT_WRITE);
+ rootbuf = _bt_allocbuf(rel, heapRel);
rootpage = BufferGetPage(rootbuf);
rootblknum = BufferGetBlockNumber(rootbuf);
/* acquire lock on the metapage */
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 486d1563d..d95d444af 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -38,22 +38,21 @@
#include "utils/snapmgr.h"
static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
-static void _bt_log_reuse_page(Relation rel, Relation heaprel, BlockNumber blkno,
- FullTransactionId safexid);
-static void _bt_delitems_delete(Relation rel, Relation heaprel, Buffer buf,
+static void _bt_delitems_delete(Relation rel, Buffer buf,
TransactionId snapshotConflictHorizon,
+ bool isCatalogRel,
OffsetNumber *deletable, int ndeletable,
BTVacuumPosting *updatable, int nupdatable);
static char *_bt_delitems_update(BTVacuumPosting *updatable, int nupdatable,
OffsetNumber *updatedoffsets,
Size *updatedbuflen, bool needswal);
-static bool _bt_mark_page_halfdead(Relation rel, Relation heaprel,
+static bool _bt_mark_page_halfdead(Relation rel, Relation heapRel,
Buffer leafbuf, BTStack stack);
static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
BlockNumber scanblkno,
bool *rightsib_empty,
BTVacState *vstate);
-static bool _bt_lock_subtree_parent(Relation rel, Relation heaprel,
+static bool _bt_lock_subtree_parent(Relation rel, Relation heapRel,
BlockNumber child, BTStack stack,
Buffer *subtreeparent, OffsetNumber *poffset,
BlockNumber *topparent,
@@ -177,7 +176,7 @@ _bt_getmeta(Relation rel, Buffer metabuf)
* index tuples needed to be deleted.
*/
bool
-_bt_vacuum_needs_cleanup(Relation rel, Relation heaprel)
+_bt_vacuum_needs_cleanup(Relation rel)
{
Buffer metabuf;
Page metapg;
@@ -190,7 +189,7 @@ _bt_vacuum_needs_cleanup(Relation rel, Relation heaprel)
*
* Note that we deliberately avoid using cached version of metapage here.
*/
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
btm_version = metad->btm_version;
@@ -230,7 +229,7 @@ _bt_vacuum_needs_cleanup(Relation rel, Relation heaprel)
* finalized.
*/
void
-_bt_set_cleanup_info(Relation rel, Relation heaprel, BlockNumber num_delpages)
+_bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
{
Buffer metabuf;
Page metapg;
@@ -254,7 +253,7 @@ _bt_set_cleanup_info(Relation rel, Relation heaprel, BlockNumber num_delpages)
* no longer used as of PostgreSQL 14. We set it to -1.0 on rewrite, just
* to be consistent.
*/
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -326,6 +325,9 @@ _bt_set_cleanup_info(Relation rel, Relation heaprel, BlockNumber num_delpages)
* NOTE that the returned root page will have only a read lock set
* on it even if access = BT_WRITE!
*
+ * If access = BT_WRITE, heapRel must be set; otherwise caller should
+ * just pass NULL. See _bt_allocbuf for an explanation.
+ *
* The returned page is not necessarily the true root --- it could be
* a "fast root" (a page that is alone in its level due to deletions).
* Also, if the root page is split while we are "in flight" to it,
@@ -339,7 +341,7 @@ _bt_set_cleanup_info(Relation rel, Relation heaprel, BlockNumber num_delpages)
* The metadata page is not locked or pinned on exit.
*/
Buffer
-_bt_getroot(Relation rel, Relation heaprel, int access)
+_bt_getroot(Relation rel, Relation heapRel, int access)
{
Buffer metabuf;
Buffer rootbuf;
@@ -369,7 +371,7 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
Assert(rootblkno != P_NONE);
rootlevel = metad->btm_fastlevel;
- rootbuf = _bt_getbuf(rel, heaprel, rootblkno, BT_READ);
+ rootbuf = _bt_getbuf(rel, rootblkno, BT_READ);
rootpage = BufferGetPage(rootbuf);
rootopaque = BTPageGetOpaque(rootpage);
@@ -395,7 +397,7 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
rel->rd_amcache = NULL;
}
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metad = _bt_getmeta(rel, metabuf);
/* if no root page initialized yet, do it */
@@ -428,7 +430,7 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
* to optimize this case.)
*/
_bt_relbuf(rel, metabuf);
- return _bt_getroot(rel, heaprel, access);
+ return _bt_getroot(rel, heapRel, access);
}
/*
@@ -436,7 +438,7 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
* the new root page. Since this is the first page in the tree, it's
* a leaf as well as the root.
*/
- rootbuf = _bt_getbuf(rel, heaprel, P_NEW, BT_WRITE);
+ rootbuf = _bt_allocbuf(rel, heapRel);
rootblkno = BufferGetBlockNumber(rootbuf);
rootpage = BufferGetPage(rootbuf);
rootopaque = BTPageGetOpaque(rootpage);
@@ -573,7 +575,7 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
* moving to the root --- that'd deadlock against any concurrent root split.)
*/
Buffer
-_bt_gettrueroot(Relation rel, Relation heaprel)
+_bt_gettrueroot(Relation rel)
{
Buffer metabuf;
Page metapg;
@@ -595,7 +597,7 @@ _bt_gettrueroot(Relation rel, Relation heaprel)
pfree(rel->rd_amcache);
rel->rd_amcache = NULL;
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metapg = BufferGetPage(metabuf);
metaopaque = BTPageGetOpaque(metapg);
metad = BTPageGetMeta(metapg);
@@ -668,7 +670,7 @@ _bt_gettrueroot(Relation rel, Relation heaprel)
* about updating previously cached data.
*/
int
-_bt_getrootheight(Relation rel, Relation heaprel)
+_bt_getrootheight(Relation rel)
{
BTMetaPageData *metad;
@@ -676,7 +678,7 @@ _bt_getrootheight(Relation rel, Relation heaprel)
{
Buffer metabuf;
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metad = _bt_getmeta(rel, metabuf);
/*
@@ -732,7 +734,7 @@ _bt_getrootheight(Relation rel, Relation heaprel)
* pg_upgrade'd from Postgres 12.
*/
void
-_bt_metaversion(Relation rel, Relation heaprel, bool *heapkeyspace, bool *allequalimage)
+_bt_metaversion(Relation rel, bool *heapkeyspace, bool *allequalimage)
{
BTMetaPageData *metad;
@@ -740,7 +742,7 @@ _bt_metaversion(Relation rel, Relation heaprel, bool *heapkeyspace, bool *allequ
{
Buffer metabuf;
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metad = _bt_getmeta(rel, metabuf);
/*
@@ -821,37 +823,7 @@ _bt_checkpage(Relation rel, Buffer buf)
}
/*
- * Log the reuse of a page from the FSM.
- */
-static void
-_bt_log_reuse_page(Relation rel, Relation heaprel, BlockNumber blkno,
- FullTransactionId safexid)
-{
- xl_btree_reuse_page xlrec_reuse;
-
- /*
- * Note that we don't register the buffer with the record, because this
- * operation doesn't modify the page. This record only exists to provide a
- * conflict point for Hot Standby.
- */
-
- /* XLOG stuff */
- xlrec_reuse.isCatalogRel = RelationIsAccessibleInLogicalDecoding(heaprel);
- xlrec_reuse.locator = rel->rd_locator;
- xlrec_reuse.block = blkno;
- xlrec_reuse.snapshotConflictHorizon = safexid;
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec_reuse, SizeOfBtreeReusePage);
-
- XLogInsert(RM_BTREE_ID, XLOG_BTREE_REUSE_PAGE);
-}
-
-/*
- * _bt_getbuf() -- Get a buffer by block number for read or write.
- *
- * blkno == P_NEW means to get an unallocated index page. The page
- * will be initialized before returning it.
+ * _bt_getbuf() -- Get an existing block in a buffer, for read or write.
*
* The general rule in nbtree is that it's never okay to access a
* page without holding both a buffer pin and a buffer lock on
@@ -860,136 +832,167 @@ _bt_log_reuse_page(Relation rel, Relation heaprel, BlockNumber blkno,
* When this routine returns, the appropriate lock is set on the
* requested buffer and its reference count has been incremented
* (ie, the buffer is "locked and pinned"). Also, we apply
- * _bt_checkpage to sanity-check the page (except in P_NEW case),
- * and perform Valgrind client requests that help Valgrind detect
- * unsafe page accesses.
+ * _bt_checkpage to sanity-check the page, and perform Valgrind
+ * client requests that help Valgrind detect unsafe page accesses.
*
* Note: raw LockBuffer() calls are disallowed in nbtree; all
* buffer lock requests need to go through wrapper functions such
* as _bt_lockbuf().
*/
Buffer
-_bt_getbuf(Relation rel, Relation heaprel, BlockNumber blkno, int access)
+_bt_getbuf(Relation rel, BlockNumber blkno, int access)
{
Buffer buf;
- if (blkno != P_NEW)
+ Assert(BlockNumberIsValid(blkno));
+
+ /* Read an existing block of the relation */
+ buf = ReadBuffer(rel, blkno);
+ _bt_lockbuf(rel, buf, access);
+ _bt_checkpage(rel, buf);
+
+ return buf;
+}
+
+/*
+ * _bt_allocbuf() -- Allocate a new block/page.
+ *
+ * Routine returns a pinned and write-locked buffer that contains a newly
+ * allocated index page. The page will be initialized with the appropriate
+ * amount of special area space for an nbtree page, too.
+ *
+ * Callers are required to pass a valid heapRel. We need heapRel so that we
+ * can handle generating a snapshotConflictHorizon that makes reusing a page
+ * from the FSM safe for queries that may be running on standbys.
+ */
+Buffer
+_bt_allocbuf(Relation rel, Relation heapRel)
+{
+ Buffer buf;
+ BlockNumber blkno;
+ Page page;
+
+ Assert(heapRel != NULL);
+
+ /*
+ * First see if the FSM knows of any free pages.
+ *
+ * We can't trust the FSM's report unreservedly; we have to check that the
+ * page is still free. (For example, an already-free page could have been
+ * re-used between the time the last VACUUM scanned it and the time the
+ * VACUUM made its FSM updates.)
+ *
+ * In fact, it's worse than that: we can't even assume that it's safe to
+ * take a lock on the reported page. If somebody else has a lock on it,
+ * or even worse our own caller does, we could deadlock. (The own-caller
+ * scenario is actually not improbable. Consider an index on a serial or
+ * timestamp column. Nearly all splits will be at the rightmost page, so
+ * it's entirely likely that _bt_split will call us while holding a lock
+ * on the page most recently acquired from FSM. A VACUUM running
+ * concurrently with the previous split could well have placed that page
+ * back in FSM.)
+ *
+ * To get around that, we ask for only a conditional lock on the reported
+ * page. If we fail, then someone else is using the page, and we may
+ * reasonably assume it's not free. (If we happen to be wrong, the worst
+ * consequence is the page will be lost to use till the next VACUUM, which
+ * is no big problem.)
+ */
+ for (;;)
{
- /* Read an existing block of the relation */
+ blkno = GetFreeIndexPage(rel);
+ if (blkno == InvalidBlockNumber)
+ break;
buf = ReadBuffer(rel, blkno);
- _bt_lockbuf(rel, buf, access);
- _bt_checkpage(rel, buf);
- }
- else
- {
- Page page;
-
- Assert(access == BT_WRITE);
-
- /*
- * First see if the FSM knows of any free pages.
- *
- * We can't trust the FSM's report unreservedly; we have to check that
- * the page is still free. (For example, an already-free page could
- * have been re-used between the time the last VACUUM scanned it and
- * the time the VACUUM made its FSM updates.)
- *
- * In fact, it's worse than that: we can't even assume that it's safe
- * to take a lock on the reported page. If somebody else has a lock
- * on it, or even worse our own caller does, we could deadlock. (The
- * own-caller scenario is actually not improbable. Consider an index
- * on a serial or timestamp column. Nearly all splits will be at the
- * rightmost page, so it's entirely likely that _bt_split will call us
- * while holding a lock on the page most recently acquired from FSM. A
- * VACUUM running concurrently with the previous split could well have
- * placed that page back in FSM.)
- *
- * To get around that, we ask for only a conditional lock on the
- * reported page. If we fail, then someone else is using the page,
- * and we may reasonably assume it's not free. (If we happen to be
- * wrong, the worst consequence is the page will be lost to use till
- * the next VACUUM, which is no big problem.)
- */
- for (;;)
+ if (_bt_conditionallockbuf(rel, buf))
{
- blkno = GetFreeIndexPage(rel);
- if (blkno == InvalidBlockNumber)
- break;
- buf = ReadBuffer(rel, blkno);
- if (_bt_conditionallockbuf(rel, buf))
- {
- page = BufferGetPage(buf);
+ page = BufferGetPage(buf);
+ /*
+ * It's possible to find an all-zeroes page in an index. For
+ * example, a backend might successfully extend the relation one
+ * page and then crash before it is able to make a WAL entry for
+ * adding the page. If we find a zeroed page then reclaim it
+ * immediately.
+ */
+ if (PageIsNew(page))
+ {
+ /* Okay to use page. Initialize and return it. */
+ _bt_pageinit(page, BufferGetPageSize(buf));
+ return buf;
+ }
+
+ if (BTPageIsRecyclable(page, heapRel))
+ {
/*
- * It's possible to find an all-zeroes page in an index. For
- * example, a backend might successfully extend the relation
- * one page and then crash before it is able to make a WAL
- * entry for adding the page. If we find a zeroed page then
- * reclaim it immediately.
+ * If we are generating WAL for Hot Standby then create a WAL
+ * record that will allow us to conflict with queries running
+ * on standby, in case they have snapshots older than safexid
+ * value
*/
- if (PageIsNew(page))
+ if (RelationNeedsWAL(rel) && XLogStandbyInfoActive())
{
- /* Okay to use page. Initialize and return it. */
- _bt_pageinit(page, BufferGetPageSize(buf));
- return buf;
- }
+ xl_btree_reuse_page xlrec_reuse;
- if (BTPageIsRecyclable(page, heaprel))
- {
/*
- * If we are generating WAL for Hot Standby then create a
- * WAL record that will allow us to conflict with queries
- * running on standby, in case they have snapshots older
- * than safexid value
+ * Note that we don't register the buffer with the record,
+ * because this operation doesn't modify the page (that
+ * already happened, back when VACUUM deleted the page).
+ * This record only exists to provide a conflict point for
+ * Hot Standby. See record REDO routine comments.
*/
- if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
- _bt_log_reuse_page(rel, heaprel, blkno,
- BTPageGetDeleteXid(page));
+ xlrec_reuse.locator = rel->rd_locator;
+ xlrec_reuse.block = blkno;
+ xlrec_reuse.snapshotConflictHorizon = BTPageGetDeleteXid(page);
+ xlrec_reuse.isCatalogRel =
+ RelationIsAccessibleInLogicalDecoding(heapRel);
- /* Okay to use page. Re-initialize and return it. */
- _bt_pageinit(page, BufferGetPageSize(buf));
- return buf;
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec_reuse, SizeOfBtreeReusePage);
+
+ XLogInsert(RM_BTREE_ID, XLOG_BTREE_REUSE_PAGE);
}
- elog(DEBUG2, "FSM returned nonrecyclable page");
- _bt_relbuf(rel, buf);
- }
- else
- {
- elog(DEBUG2, "FSM returned nonlockable page");
- /* couldn't get lock, so just drop pin */
- ReleaseBuffer(buf);
+
+ /* Okay to use page. Re-initialize and return it. */
+ _bt_pageinit(page, BufferGetPageSize(buf));
+ return buf;
}
+ elog(DEBUG2, "FSM returned nonrecyclable page");
+ _bt_relbuf(rel, buf);
+ }
+ else
+ {
+ elog(DEBUG2, "FSM returned nonlockable page");
+ /* couldn't get lock, so just drop pin */
+ ReleaseBuffer(buf);
}
-
- /*
- * Extend the relation by one page. Need to use RBM_ZERO_AND_LOCK or
- * we risk a race condition against btvacuumscan --- see comments
- * therein. This forces us to repeat the valgrind request that
- * _bt_lockbuf() otherwise would make, as we can't use _bt_lockbuf()
- * without introducing a race.
- */
- buf = ExtendBufferedRel(EB_REL(rel), MAIN_FORKNUM, NULL,
- EB_LOCK_FIRST);
- if (!RelationUsesLocalBuffers(rel))
- VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(buf), BLCKSZ);
-
- /* Initialize the new page before returning it */
- page = BufferGetPage(buf);
- Assert(PageIsNew(page));
- _bt_pageinit(page, BufferGetPageSize(buf));
}
- /* ref count and lock type are correct */
+ /*
+ * Extend the relation by one page. Need to use RBM_ZERO_AND_LOCK or we
+ * risk a race condition against btvacuumscan --- see comments therein.
+ * This forces us to repeat the valgrind request that _bt_lockbuf()
+ * otherwise would make, as we can't use _bt_lockbuf() without introducing
+ * a race.
+ */
+ buf = ExtendBufferedRel(EB_REL(rel), MAIN_FORKNUM, NULL, EB_LOCK_FIRST);
+ if (!RelationUsesLocalBuffers(rel))
+ VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(buf), BLCKSZ);
+
+ /* Initialize the new page before returning it */
+ page = BufferGetPage(buf);
+ Assert(PageIsNew(page));
+ _bt_pageinit(page, BufferGetPageSize(buf));
+
return buf;
}
/*
* _bt_relandgetbuf() -- release a locked buffer and get another one.
*
- * This is equivalent to _bt_relbuf followed by _bt_getbuf, with the
- * exception that blkno may not be P_NEW. Also, if obuf is InvalidBuffer
- * then it reduces to just _bt_getbuf; allowing this case simplifies some
- * callers.
+ * This is equivalent to _bt_relbuf followed by _bt_getbuf. Also, if obuf is
+ * InvalidBuffer then it reduces to just _bt_getbuf; allowing this case
+ * simplifies some callers.
*
* The original motivation for using this was to avoid two entries to the
* bufmgr when one would do. However, now it's mainly just a notational
@@ -1001,7 +1004,7 @@ _bt_relandgetbuf(Relation rel, Buffer obuf, BlockNumber blkno, int access)
{
Buffer buf;
- Assert(blkno != P_NEW);
+ Assert(BlockNumberIsValid(blkno));
if (BufferIsValid(obuf))
_bt_unlockbuf(rel, obuf);
buf = ReleaseAndReadBuffer(obuf, rel, blkno);
@@ -1272,14 +1275,14 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
* (a version that lacks the TIDs that are to be deleted).
*
* This is nearly the same as _bt_delitems_vacuum as far as what it does to
- * the page, but it needs its own snapshotConflictHorizon (caller gets this
- * from tableam). This is used by the REDO routine to generate recovery
+ * the page, but it needs its own snapshotConflictHorizon and isCatalogRel
+ * (from the tableam). This is used by the REDO routine to generate recovery
* conflicts. The other difference is that only _bt_delitems_vacuum will
* clear page's VACUUM cycle ID.
*/
static void
-_bt_delitems_delete(Relation rel, Relation heaprel, Buffer buf,
- TransactionId snapshotConflictHorizon,
+_bt_delitems_delete(Relation rel, Buffer buf,
+ TransactionId snapshotConflictHorizon, bool isCatalogRel,
OffsetNumber *deletable, int ndeletable,
BTVacuumPosting *updatable, int nupdatable)
{
@@ -1343,10 +1346,10 @@ _bt_delitems_delete(Relation rel, Relation heaprel, Buffer buf,
XLogRecPtr recptr;
xl_btree_delete xlrec_delete;
- xlrec_delete.isCatalogRel = RelationIsAccessibleInLogicalDecoding(heaprel);
xlrec_delete.snapshotConflictHorizon = snapshotConflictHorizon;
xlrec_delete.ndeleted = ndeletable;
xlrec_delete.nupdated = nupdatable;
+ xlrec_delete.isCatalogRel = isCatalogRel;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1517,6 +1520,7 @@ _bt_delitems_delete_check(Relation rel, Buffer buf, Relation heapRel,
{
Page page = BufferGetPage(buf);
TransactionId snapshotConflictHorizon;
+ bool isCatalogRel;
OffsetNumber postingidxoffnum = InvalidOffsetNumber;
int ndeletable = 0,
nupdatable = 0;
@@ -1525,6 +1529,7 @@ _bt_delitems_delete_check(Relation rel, Buffer buf, Relation heapRel,
/* Use tableam interface to determine which tuples to delete first */
snapshotConflictHorizon = table_index_delete_tuples(heapRel, delstate);
+ isCatalogRel = RelationIsAccessibleInLogicalDecoding(heapRel);
/* Should not WAL-log snapshotConflictHorizon unless it's required */
if (!XLogStandbyInfoActive())
@@ -1670,8 +1675,8 @@ _bt_delitems_delete_check(Relation rel, Buffer buf, Relation heapRel,
}
/* Physically delete tuples (or TIDs) using deletable (or updatable) */
- _bt_delitems_delete(rel, heapRel, buf, snapshotConflictHorizon, deletable,
- ndeletable, updatable, nupdatable);
+ _bt_delitems_delete(rel, buf, snapshotConflictHorizon, isCatalogRel,
+ deletable, ndeletable, updatable, nupdatable);
/* be tidy */
for (int i = 0; i < nupdatable; i++)
@@ -1692,8 +1697,7 @@ _bt_delitems_delete_check(Relation rel, Buffer buf, Relation heapRel,
* same level must always be locked left to right to avoid deadlocks.
*/
static bool
-_bt_leftsib_splitflag(Relation rel, Relation heaprel, BlockNumber leftsib,
- BlockNumber target)
+_bt_leftsib_splitflag(Relation rel, BlockNumber leftsib, BlockNumber target)
{
Buffer buf;
Page page;
@@ -1704,7 +1708,7 @@ _bt_leftsib_splitflag(Relation rel, Relation heaprel, BlockNumber leftsib,
if (leftsib == P_NONE)
return false;
- buf = _bt_getbuf(rel, heaprel, leftsib, BT_READ);
+ buf = _bt_getbuf(rel, leftsib, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
@@ -1750,7 +1754,7 @@ _bt_leftsib_splitflag(Relation rel, Relation heaprel, BlockNumber leftsib,
* to-be-deleted subtree.)
*/
static bool
-_bt_rightsib_halfdeadflag(Relation rel, Relation heaprel, BlockNumber leafrightsib)
+_bt_rightsib_halfdeadflag(Relation rel, BlockNumber leafrightsib)
{
Buffer buf;
Page page;
@@ -1759,7 +1763,7 @@ _bt_rightsib_halfdeadflag(Relation rel, Relation heaprel, BlockNumber leafrights
Assert(leafrightsib != P_NONE);
- buf = _bt_getbuf(rel, heaprel, leafrightsib, BT_READ);
+ buf = _bt_getbuf(rel, leafrightsib, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
@@ -1948,18 +1952,18 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
* marked with INCOMPLETE_SPLIT flag before proceeding
*/
Assert(leafblkno == scanblkno);
- if (_bt_leftsib_splitflag(rel, vstate->info->heaprel, leftsib, leafblkno))
+ if (_bt_leftsib_splitflag(rel, leftsib, leafblkno))
{
ReleaseBuffer(leafbuf);
return;
}
/* we need an insertion scan key for the search, so build one */
- itup_key = _bt_mkscankey(rel, vstate->info->heaprel, targetkey);
+ itup_key = _bt_mkscankey(rel, targetkey);
/* find the leftmost leaf page with matching pivot/high key */
itup_key->pivotsearch = true;
- stack = _bt_search(rel, vstate->info->heaprel, itup_key,
- &sleafbuf, BT_READ, NULL);
+ stack = _bt_search(rel, NULL, itup_key, &sleafbuf, BT_READ,
+ NULL);
/* won't need a second lock or pin on leafbuf */
_bt_relbuf(rel, sleafbuf);
@@ -1990,7 +1994,8 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
* leafbuf page half-dead.
*/
Assert(P_ISLEAF(opaque) && !P_IGNORE(opaque));
- if (!_bt_mark_page_halfdead(rel, vstate->info->heaprel, leafbuf, stack))
+ if (!_bt_mark_page_halfdead(rel, vstate->info->heaprel, leafbuf,
+ stack))
{
_bt_relbuf(rel, leafbuf);
return;
@@ -2053,7 +2058,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
if (!rightsib_empty)
break;
- leafbuf = _bt_getbuf(rel, vstate->info->heaprel, rightsib, BT_WRITE);
+ leafbuf = _bt_getbuf(rel, rightsib, BT_WRITE);
}
}
@@ -2072,7 +2077,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
* successfully.
*/
static bool
-_bt_mark_page_halfdead(Relation rel, Relation heaprel, Buffer leafbuf,
+_bt_mark_page_halfdead(Relation rel, Relation heapRel, Buffer leafbuf,
BTStack stack)
{
BlockNumber leafblkno;
@@ -2108,7 +2113,7 @@ _bt_mark_page_halfdead(Relation rel, Relation heaprel, Buffer leafbuf,
* delete the downlink. It would fail the "right sibling of target page
* is also the next child in parent page" cross-check below.
*/
- if (_bt_rightsib_halfdeadflag(rel, heaprel, leafrightsib))
+ if (_bt_rightsib_halfdeadflag(rel, leafrightsib))
{
elog(DEBUG1, "could not delete page %u because its right sibling %u is half-dead",
leafblkno, leafrightsib);
@@ -2132,7 +2137,7 @@ _bt_mark_page_halfdead(Relation rel, Relation heaprel, Buffer leafbuf,
*/
topparent = leafblkno;
topparentrightsib = leafrightsib;
- if (!_bt_lock_subtree_parent(rel, heaprel, leafblkno, stack,
+ if (!_bt_lock_subtree_parent(rel, heapRel, leafblkno, stack,
&subtreeparent, &poffset,
&topparent, &topparentrightsib))
return false;
@@ -2352,7 +2357,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
Assert(target != leafblkno);
/* Fetch the block number of the target's left sibling */
- buf = _bt_getbuf(rel, vstate->info->heaprel, target, BT_READ);
+ buf = _bt_getbuf(rel, target, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
leftsib = opaque->btpo_prev;
@@ -2379,7 +2384,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
_bt_lockbuf(rel, leafbuf, BT_WRITE);
if (leftsib != P_NONE)
{
- lbuf = _bt_getbuf(rel, vstate->info->heaprel, leftsib, BT_WRITE);
+ lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
page = BufferGetPage(lbuf);
opaque = BTPageGetOpaque(page);
while (P_ISDELETED(opaque) || opaque->btpo_next != target)
@@ -2427,7 +2432,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
CHECK_FOR_INTERRUPTS();
/* step right one page */
- lbuf = _bt_getbuf(rel, vstate->info->heaprel, leftsib, BT_WRITE);
+ lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
page = BufferGetPage(lbuf);
opaque = BTPageGetOpaque(page);
}
@@ -2491,7 +2496,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
* And next write-lock the (current) right sibling.
*/
rightsib = opaque->btpo_next;
- rbuf = _bt_getbuf(rel, vstate->info->heaprel, rightsib, BT_WRITE);
+ rbuf = _bt_getbuf(rel, rightsib, BT_WRITE);
page = BufferGetPage(rbuf);
opaque = BTPageGetOpaque(page);
@@ -2538,7 +2543,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
* of doing so are slim, and the locking considerations daunting.)
*
* We can safely acquire a lock on the metapage here --- see comments for
- * _bt_newroot().
+ * _bt_newlevel().
*/
if (leftsib == P_NONE && rightsib_is_rightmost)
{
@@ -2547,8 +2552,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
if (P_RIGHTMOST(opaque))
{
/* rightsib will be the only one left on the level */
- metabuf = _bt_getbuf(rel, vstate->info->heaprel, BTREE_METAPAGE,
- BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -2789,7 +2793,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
* parent block in the leafbuf page using BTreeTupleSetTopParent()).
*/
static bool
-_bt_lock_subtree_parent(Relation rel, Relation heaprel, BlockNumber child,
+_bt_lock_subtree_parent(Relation rel, Relation heapRel, BlockNumber child,
BTStack stack, Buffer *subtreeparent,
OffsetNumber *poffset, BlockNumber *topparent,
BlockNumber *topparentrightsib)
@@ -2806,7 +2810,7 @@ _bt_lock_subtree_parent(Relation rel, Relation heaprel, BlockNumber child,
* Locate the pivot tuple whose downlink points to "child". Write lock
* the parent page itself.
*/
- pbuf = _bt_getstackbuf(rel, heaprel, stack, child);
+ pbuf = _bt_getstackbuf(rel, heapRel, stack, child);
if (pbuf == InvalidBuffer)
{
/*
@@ -2906,11 +2910,11 @@ _bt_lock_subtree_parent(Relation rel, Relation heaprel, BlockNumber child,
*
* Note: We deliberately avoid completing incomplete splits here.
*/
- if (_bt_leftsib_splitflag(rel, heaprel, leftsibparent, parent))
+ if (_bt_leftsib_splitflag(rel, leftsibparent, parent))
return false;
/* Recurse to examine child page's grandparent page */
- return _bt_lock_subtree_parent(rel, heaprel, parent, stack->bts_parent,
+ return _bt_lock_subtree_parent(rel, heapRel, parent, stack->bts_parent,
subtreeparent, poffset,
topparent, topparentrightsib);
}
@@ -2976,6 +2980,7 @@ _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate)
Relation heaprel = vstate->info->heaprel;
Assert(stats->pages_newly_deleted >= vstate->npendingpages);
+ Assert(heaprel != NULL);
if (vstate->npendingpages == 0)
{
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1ce5b1519..4553aaee5 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -835,7 +835,7 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
if (stats == NULL)
{
/* Check if VACUUM operation can entirely avoid btvacuumscan() call */
- if (!_bt_vacuum_needs_cleanup(info->index, info->heaprel))
+ if (!_bt_vacuum_needs_cleanup(info->index))
return NULL;
/*
@@ -871,7 +871,7 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
*/
Assert(stats->pages_deleted >= stats->pages_free);
num_delpages = stats->pages_deleted - stats->pages_free;
- _bt_set_cleanup_info(info->index, info->heaprel, num_delpages);
+ _bt_set_cleanup_info(info->index, num_delpages);
/*
* It's quite possible for us to be fooled by concurrent page splits into
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 263f75fce..fb44ea70b 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -42,8 +42,7 @@ static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
ScanDirection dir);
-static Buffer _bt_walk_left(Relation rel, Relation heaprel, Buffer buf,
- Snapshot snapshot);
+static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
@@ -75,6 +74,10 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
* _bt_search() -- Search the tree for a particular scankey,
* or more precisely for the first leaf page it could be on.
*
+ * rel must always be provided. heapRel must be provided by callers that pass
+ * access = BT_WRITE, since we may need to allocate a new root page for caller
+ * in passing (or when finishing a page split results in a parent page split).
+ *
* The passed scankey is an insertion-type scankey (see nbtree/README),
* but it can omit the rightmost column(s) of the index.
*
@@ -94,14 +97,18 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
* during the search will be finished.
*/
BTStack
-_bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
+_bt_search(Relation rel, Relation heapRel, BTScanInsert key, Buffer *bufP,
int access, Snapshot snapshot)
{
BTStack stack_in = NULL;
int page_access = BT_READ;
+ /* heapRel must be set whenever _bt_allocbuf is reachable */
+ Assert(access == BT_READ || access == BT_WRITE);
+ Assert(access == BT_READ || heapRel != NULL);
+
/* Get the root page to start with */
- *bufP = _bt_getroot(rel, heaprel, access);
+ *bufP = _bt_getroot(rel, heapRel, access);
/* If index is empty and access = BT_READ, no root page is created. */
if (!BufferIsValid(*bufP))
@@ -130,7 +137,7 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
* also taken care of in _bt_getstackbuf). But this is a good
* opportunity to finish splits of internal pages too.
*/
- *bufP = _bt_moveright(rel, heaprel, key, *bufP, (access == BT_WRITE),
+ *bufP = _bt_moveright(rel, heapRel, key, *bufP, (access == BT_WRITE),
stack_in, page_access, snapshot);
/* if this is a leaf page, we're done */
@@ -191,7 +198,7 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
* but before we acquired a write lock. If it has, we may need to
* move right to its new sibling. Do that.
*/
- *bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE,
+ *bufP = _bt_moveright(rel, heapRel, key, *bufP, true, stack_in, BT_WRITE,
snapshot);
}
@@ -222,8 +229,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
*
* If forupdate is true, we will attempt to finish any incomplete splits
* that we encounter. This is required when locking a target page for an
- * insertion, because we don't allow inserting on a page before the split
- * is completed. 'stack' is only used if forupdate is true.
+ * insertion, because we don't allow inserting on a page before the split is
+ * completed. 'heapRel' and 'stack' are only used if forupdate is true.
*
* On entry, we have the buffer pinned and a lock of the type specified by
* 'access'. If we move right, we release the buffer and lock and acquire
@@ -235,7 +242,7 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
*/
Buffer
_bt_moveright(Relation rel,
- Relation heaprel,
+ Relation heapRel,
BTScanInsert key,
Buffer buf,
bool forupdate,
@@ -290,12 +297,12 @@ _bt_moveright(Relation rel,
}
if (P_INCOMPLETE_SPLIT(opaque))
- _bt_finish_split(rel, heaprel, buf, stack);
+ _bt_finish_split(rel, heapRel, buf, stack);
else
_bt_relbuf(rel, buf);
/* re-acquire the lock in the right mode, and re-check */
- buf = _bt_getbuf(rel, heaprel, blkno, access);
+ buf = _bt_getbuf(rel, blkno, access);
continue;
}
@@ -862,7 +869,6 @@ bool
_bt_first(IndexScanDesc scan, ScanDirection dir)
{
Relation rel = scan->indexRelation;
- Relation heaprel = scan->heapRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Buffer buf;
BTStack stack;
@@ -1355,7 +1361,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
}
/* Initialize remaining insertion scan key fields */
- _bt_metaversion(rel, heaprel, &inskey.heapkeyspace, &inskey.allequalimage);
+ _bt_metaversion(rel, &inskey.heapkeyspace, &inskey.allequalimage);
inskey.anynullkeys = false; /* unused */
inskey.nextkey = nextkey;
inskey.pivotsearch = false;
@@ -1366,7 +1372,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* Use the manufactured insertion scan key to descend the tree and
* position ourselves on the target leaf page.
*/
- stack = _bt_search(rel, heaprel, &inskey, &buf, BT_READ, scan->xs_snapshot);
+ stack = _bt_search(rel, NULL, &inskey, &buf, BT_READ, scan->xs_snapshot);
/* don't need to keep the stack around... */
_bt_freestack(stack);
@@ -2007,7 +2013,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
/* step right one page */
- so->currPos.buf = _bt_getbuf(rel, scan->heapRelation, blkno, BT_READ);
+ so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
page = BufferGetPage(so->currPos.buf);
TestForOldSnapshot(scan->xs_snapshot, rel, page);
opaque = BTPageGetOpaque(page);
@@ -2081,8 +2087,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
if (BTScanPosIsPinned(so->currPos))
_bt_lockbuf(rel, so->currPos.buf, BT_READ);
else
- so->currPos.buf = _bt_getbuf(rel, scan->heapRelation,
- so->currPos.currPage, BT_READ);
+ so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
for (;;)
{
@@ -2096,8 +2101,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
}
/* Step to next physical page */
- so->currPos.buf = _bt_walk_left(rel, scan->heapRelation,
- so->currPos.buf, scan->xs_snapshot);
+ so->currPos.buf = _bt_walk_left(rel, so->currPos.buf,
+ scan->xs_snapshot);
/* if we're physically at end of index, return failure */
if (so->currPos.buf == InvalidBuffer)
@@ -2144,8 +2149,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
BTScanPosInvalidate(so->currPos);
return false;
}
- so->currPos.buf = _bt_getbuf(rel, scan->heapRelation, blkno,
- BT_READ);
+ so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
}
}
}
@@ -2190,7 +2194,7 @@ _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
* again if it's important.
*/
static Buffer
-_bt_walk_left(Relation rel, Relation heaprel, Buffer buf, Snapshot snapshot)
+_bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot)
{
Page page;
BTPageOpaque opaque;
@@ -2218,7 +2222,7 @@ _bt_walk_left(Relation rel, Relation heaprel, Buffer buf, Snapshot snapshot)
_bt_relbuf(rel, buf);
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
- buf = _bt_getbuf(rel, heaprel, blkno, BT_READ);
+ buf = _bt_getbuf(rel, blkno, BT_READ);
page = BufferGetPage(buf);
TestForOldSnapshot(snapshot, rel, page);
opaque = BTPageGetOpaque(page);
@@ -2309,7 +2313,7 @@ _bt_walk_left(Relation rel, Relation heaprel, Buffer buf, Snapshot snapshot)
* The returned buffer is pinned and read-locked.
*/
Buffer
-_bt_get_endpoint(Relation rel, Relation heaprel, uint32 level, bool rightmost,
+_bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot)
{
Buffer buf;
@@ -2325,9 +2329,9 @@ _bt_get_endpoint(Relation rel, Relation heaprel, uint32 level, bool rightmost,
* smarter about intermediate levels.)
*/
if (level == 0)
- buf = _bt_getroot(rel, heaprel, BT_READ);
+ buf = _bt_getroot(rel, NULL, BT_READ);
else
- buf = _bt_gettrueroot(rel, heaprel);
+ buf = _bt_gettrueroot(rel);
if (!BufferIsValid(buf))
return InvalidBuffer;
@@ -2408,8 +2412,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
* version of _bt_search(). We don't maintain a stack since we know we
* won't need it.
*/
- buf = _bt_get_endpoint(rel, scan->heapRelation, 0,
- ScanDirectionIsBackward(dir), scan->xs_snapshot);
+ buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(dir), scan->xs_snapshot);
if (!BufferIsValid(buf))
{
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 6ad3f3c54..c2665fce4 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -566,7 +566,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
wstate.heap = btspool->heap;
wstate.index = btspool->index;
- wstate.inskey = _bt_mkscankey(wstate.index, btspool->heap, NULL);
+ wstate.inskey = _bt_mkscankey(wstate.index, NULL);
/* _bt_mkscankey() won't set allequalimage without metapage */
wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 05abf3603..7da499c4d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -87,7 +87,7 @@ static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
* field themselves.
*/
BTScanInsert
-_bt_mkscankey(Relation rel, Relation heaprel, IndexTuple itup)
+_bt_mkscankey(Relation rel, IndexTuple itup)
{
BTScanInsert key;
ScanKey skey;
@@ -112,7 +112,7 @@ _bt_mkscankey(Relation rel, Relation heaprel, IndexTuple itup)
key = palloc(offsetof(BTScanInsertData, scankeys) +
sizeof(ScanKeyData) * indnkeyatts);
if (itup)
- _bt_metaversion(rel, heaprel, &key->heapkeyspace, &key->allequalimage);
+ _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
else
{
/* Utility statement callers can set these fields themselves */
@@ -1761,8 +1761,7 @@ _bt_killitems(IndexScanDesc scan)
droppedpin = true;
/* Attempt to re-read the buffer, getting pin and lock. */
- buf = _bt_getbuf(scan->indexRelation, scan->heapRelation,
- so->currPos.currPage, BT_READ);
+ buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage, BT_READ);
page = BufferGetPage(buf);
if (BufferGetLSNAtomic(buf) == so->currPos.lsn)
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 65adf04c4..39932d3c2 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -462,7 +462,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
* For btrees, get tree height while we have the index
* open
*/
- info->tree_height = _bt_getrootheight(indexRelation, relation);
+ info->tree_height = _bt_getrootheight(indexRelation);
}
else
{
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 018810692..eb6cfcfd0 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -207,7 +207,6 @@ tuplesort_begin_heap(TupleDesc tupDesc,
Tuplesortstate *
tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
- Relation heaprel,
int workMem,
SortCoordinate coordinate, int sortopt)
{
@@ -261,7 +260,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
arg->tupDesc = tupDesc; /* assume we need not copy tupDesc */
- indexScanKey = _bt_mkscankey(indexRel, heaprel, NULL);
+ indexScanKey = _bt_mkscankey(indexRel, NULL);
if (arg->indexInfo->ii_Expressions != NULL)
{
@@ -362,7 +361,7 @@ tuplesort_begin_index_btree(Relation heapRel,
arg->enforceUnique = enforceUnique;
arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
- indexScanKey = _bt_mkscankey(indexRel, heapRel, NULL);
+ indexScanKey = _bt_mkscankey(indexRel, NULL);
/* Prepare SortSupport data for each column */
base->sortKeys = (SortSupport) palloc0(base->nKeys *
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 6979aff72..94a975932 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -183,7 +183,6 @@ static inline bool invariant_l_nontarget_offset(BtreeCheckState *state,
OffsetNumber upperbound);
static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
static inline BTScanInsert bt_mkscankey_pivotsearch(Relation rel,
- Relation heaprel,
IndexTuple itup);
static ItemId PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block,
Page page, OffsetNumber offset);
@@ -332,7 +331,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
RelationGetRelationName(indrel))));
/* Extract metadata from metapage, and sanitize it in passing */
- _bt_metaversion(indrel, heaprel, &heapkeyspace, &allequalimage);
+ _bt_metaversion(indrel, &heapkeyspace, &allequalimage);
if (allequalimage && !heapkeyspace)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1259,7 +1258,7 @@ bt_target_page_check(BtreeCheckState *state)
}
/* Build insertion scankey for current page offset */
- skey = bt_mkscankey_pivotsearch(state->rel, state->heaprel, itup);
+ skey = bt_mkscankey_pivotsearch(state->rel, itup);
/*
* Make sure tuple size does not exceed the relevant BTREE_VERSION
@@ -1769,7 +1768,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
* memory remaining allocated.
*/
firstitup = (IndexTuple) PageGetItem(rightpage, rightitem);
- return bt_mkscankey_pivotsearch(state->rel, state->heaprel, firstitup);
+ return bt_mkscankey_pivotsearch(state->rel, firstitup);
}
/*
@@ -2682,7 +2681,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
Buffer lbuf;
bool exists;
- key = _bt_mkscankey(state->rel, state->heaprel, itup);
+ key = _bt_mkscankey(state->rel, itup);
Assert(key->heapkeyspace && key->scantid != NULL);
/*
@@ -2695,7 +2694,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
*/
Assert(state->readonly && state->rootdescend);
exists = false;
- stack = _bt_search(state->rel, state->heaprel, key, &lbuf, BT_READ, NULL);
+ stack = _bt_search(state->rel, NULL, key, &lbuf, BT_READ, NULL);
if (BufferIsValid(lbuf))
{
@@ -3134,11 +3133,11 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
* the scankey is greater.
*/
static inline BTScanInsert
-bt_mkscankey_pivotsearch(Relation rel, Relation heaprel, IndexTuple itup)
+bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
{
BTScanInsert skey;
- skey = _bt_mkscankey(rel, heaprel, itup);
+ skey = _bt_mkscankey(rel, itup);
skey->pivotsearch = true;
return skey;
--
2.40.1
On Fri, May 26, 2023 at 1:56 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
I suppose you're not thinking of applying this to current master, but
instead just leave it for when pg17 opens, right? I mean, clearly it
seems far too invasive to put it in after beta1.
I was planning on targeting 16 with this. Although I only posted a
patch recently, I complained about the problems in this area shortly
after the code first went in. It's fairly obvious to me that the
changes made to nbtree went quite a bit further than they needed to.
Admittedly that's partly because I'm an expert on this particular
code.
On the other hand,
it's painful to know that we're going to have code that exists only on
16 and not any other release, in an area that's likely to have bugs here
and there, so we're going to need to heavily adjust backpatches for 16
especially.
Right -- it's important to keep things reasonably consistent to ease
backpatching. Though I doubt that changes to nbtree itself will turn
out to be buggy -- with or without my patch. The changes to nbtree
were all pretty mechanical. A little too mechanical, in fact.
As I said already, there just aren't that many ways that new nbtree
pages can come into existence -- it's naturally limited to page splits
(including root page splits), and the case where we need to add a new
root page that's also a leaf page at the point that the first ever
tuple is inserted into the index (before that we just have a metapage)
-- so I only have three _bt_allocbuf() callers to worry about. It's
completely self-evident (even to people that know little about nbtree)
that the only type of page access that could possibly need a heapRel
in the first place is P_NEW (i.e., a new page allocation). Once you
know all that, this situation begins to look much more
straightforward.
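To make that concrete, the three call sites end up looking roughly like this
(just a sketch -- see the attached patch for the real context; _bt_newlevel()
is the name the later patch revision gives to the old _bt_newroot()):

    /* page split: allocate the new right sibling page (in _bt_split) */
    rbuf = _bt_allocbuf(rel, heapRel);

    /* root page split: allocate the new root one level up (in _bt_newlevel) */
    rootbuf = _bt_allocbuf(rel, heapRel);

    /* very first root page, which is also a leaf page (in _bt_getroot) */
    rootbuf = _bt_allocbuf(rel, heapRel);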
Now, to be fair to Bertrand, it *looks* more complicated than it
really is, in large part due to the obscure case where VACUUM finishes
an interrupted page split (during page deletion), which itself goes on
to cause a page split one level up. So it's possible (barely) that
VACUUM will enlarge an index by one page, which requires a heapRel,
just like any other place where an index is enlarged by a page split
(I documented all this in commit 35bc0ec7).
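In code terms the chain looks roughly like this (simplified; all of these
functions appear in the attached patch):

    /*
     * VACUUM's page deletion finds an incomplete split and must finish it:
     *
     * _bt_finish_split(rel, heapRel, lbuf, stack)
     *   -> _bt_insert_parent(rel, heapRel, ...)   (insert the missing downlink)
     *     -> _bt_insertonpg(rel, heapRel, ...)
     *       -> _bt_split(rel, heapRel, ...)       (parent page splits in turn)
     *         -> _bt_allocbuf(rel, heapRel)       (index grows by one page)
     */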
I've added several defensive assertions that make it hard to get the
details wrong. These will catch the issue much earlier than the main
"heapRel != NULL" assertion in _bt_allocbuf(). So, the rules are
reasonably straightforward and enforceable.
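For example (taken from the attached patch):

    /* _bt_allocbuf(), the one and only choke point for new page allocations */
    Assert(heapRel != NULL);

    /* _bt_getroot(), where only BT_WRITE callers can end up allocating */
    Assert(access == BT_READ || heapRel != NULL);

    /* _bt_insert_parent(), _bt_finish_split(), _bt_stepright(), ... */
    Assert(heapRel != NULL);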
--
Peter Geoghegan
On Fri, May 26, 2023 at 10:28 AM Peter Geoghegan <pg@bowt.ie> wrote:
I've added several defensive assertions that make it hard to get the
details wrong. These will catch the issue much earlier than the main
"heapRel != NULL" assertion in _bt_allocbuf(). So, the rules are
reasonably straightforward and enforceable.
Though it's not an issue new to 16, or a problem that anybody is
obligated to deal with on this thread, I wonder: Why is it okay that
we're using "rel" (the index rel) for our TestForOldSnapshot() calls
in nbtree, rather than using heapRel? Why wouldn't the rules be the
same as they are for the new code paths needed for logical decoding on
standbys? (Namely, the "use heapRel, not rel" rule.)
More concretely, I'm pretty sure that RelationIsUsedAsCatalogTable()
(which TestForOldSnapshot() relies on) gives an answer that would
change if we decided to pass heapRel to the main TestForOldSnapshot()
call within _bt_moveright(), instead of doing what we actually do,
which is to just pass it the index rel. I suppose that that
interaction might have been overlooked when bugfix commit bf9a60ee33
first added RelationIsUsedAsCatalogTable() -- since that happened a
couple of months after the initial "snapshot too old" commit went in,
a fix that happened under time pressure.
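To illustrate (this is only a sketch of the status quo versus the hypothetical
alternative, not a proposal; TestForOldSnapshot() takes a snapshot, a relation,
and a page):

    /* what _bt_moveright() does today: the index rel decides */
    TestForOldSnapshot(snapshot, rel, page);

    /*
     * hypothetical: follow the "use heapRel, not rel" rule instead --
     * RelationIsUsedAsCatalogTable() could then give a different answer
     */
    TestForOldSnapshot(snapshot, heapRel, page);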
More generally, the high level rules/invariants that govern when
TestForOldSnapshot() should be called (and with what rel/snapshot)
feel less than worked out. I find it suspicious that there isn't any
attempt to relate TestForOldSnapshot() behaviors to the conceptually
similar PredicateLockPage() behavior. We don't need predicate locks on
internal pages, but TestForOldSnapshot() *does* get called for
internal pages. Many PredicateLockPage() calls happen very close to
TestForOldSnapshot() calls, each of which uses the same snapshot -- not
addressing that seems like a glaring omission to me.
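A sketch of the kind of adjacency I have in mind (from memory, so treat the
exact call site with suspicion):

    /* both calls use the same snapshot, against the same index page */
    PredicateLockPage(rel, BufferGetBlockNumber(buf), scan->xs_snapshot);
    TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(buf));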
Basically it seems like there should be one standard set of rules for
all this stuff. Though it's not the fault of Bertrand or Andres, all
that we have now is two poorly documented sets of rules that partially
overlap. This has long bothered me.
--
Peter Geoghegan
Hi,
On 2023-05-25 18:50:31 -0700, Peter Geoghegan wrote:
Commit 61b313e4, which prepared index access method code for the
logical decoding on standbys work, made quite a few mechanical
changes. Many routines were revised in order to make sure that heaprel
was available in _bt_getbuf()'s P_NEW (new page allocation) path. But
this went further than it really had to. Many of the changes to nbtree
seem excessive.
We only need a heaprel in those code paths that might end up calling
_bt_getbuf() with blkno = P_NEW. This includes most callers that pass
access = BT_WRITE, and all callers that pass access = BT_READ. This
doesn't have to be haphazard -- there just aren't that many places
that can allocate new nbtree pages.
What do we gain by not passing down the heap relation to those places?
If you're concerned about the efficiency of passing down the parameters,
I doubt it will make a meaningful difference, because the parameter can
just stay in the register to be passed down further.
Note that I do agree with some aspects of the change for other reasons,
see below...
It's just page splits, and new
root page allocations (which are actually a slightly different kind of
page split). The rule doesn't need to be complicated (to be fair it
looks more complicated than it really is).
Attached patch completely removes the changes to _bt_getbuf()'s
signature from 61b313e4. This is possible without any loss of
functionality. The patch splits _bt_getbuf () in two: the code that
handles BT_READ/BT_WRITE stays in _bt_getbuf(), which is now far
shorter. Handling of new page allocation is moved to a new routine
I've called _bt_alloc(). This is clearer in general, and makes it
clear what the rules really are. Any code that might need to call
_bt_alloc() must be prepared for that, by having a heaprel to pass to
it (the slightly complicated case is interrupted page splits).
I think it's a very good idea to split the "new page" case off
_bt_getbuf(). We probably should evolve the design in the area, and
that will be easier with such a change.
Greetings,
Andres Freund
On Fri, May 26, 2023 at 2:49 PM Andres Freund <andres@anarazel.de> wrote:
What do we gain by not passing down the heap relation to those places?
Just clearer code, free from noisy changes. Easier when backpatching, too.
If you're concerned about the efficiency of passing down the parameters,
I doubt it will make a meaningful difference, because the parameter can
just stay in the register to be passed down further.
I'm not concerned about the efficiency of passing down heapRel in so
many places.
As things stand, there is no suggestion that any _bt_getbuf() call is
exempt from the requirement to pass down a heaprel -- every caller
(internal and external) goes to the trouble of making sure that they
comply with the apparent requirement to supply a heapRel. In some
cases callers do so just to be able to read the metapage. Even code as
far removed from nbtree as heapam_relation_copy_for_cluster() will now
go to the trouble of passing its own heap rel, just to perform a
CLUSTER-based tuplesort. The relevant tuplesort call site even has
comments that try to justify this approach, with a reference to
_bt_log_reuse_page(). So heapam_handler.c now references a static
helper function private to nbtpage.c -- an obvious modularity
violation.
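Spelling it out, the call site goes from the first form to the second with the
attached patch (excerpt only):

    /* before: NewHeap passed solely so _bt_log_reuse_page() has a heap rel */
    tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex, NewHeap,
                                        maintenance_work_mem,
                                        NULL, TUPLESORT_NONE);

    /* after: no heap rel needed just to sort during CLUSTER */
    tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex,
                                        maintenance_work_mem,
                                        NULL, TUPLESORT_NONE);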
It's not even the modularity violation itself that bothers me. It's
just 100% unnecessary for heapam_relation_copy_for_cluster() to do any
of this, because there simply isn't going to be a call to
_bt_log_reuse_page() during its cluster tuplesort, no matter what.
This has nothing to do with any underlying implementation detail from
nbtree, or from any other index AM.
--
Peter Geoghegan
On Fri, May 26, 2023 at 10:56:53AM +0200, Alvaro Herrera wrote:
I suppose you're not thinking of applying this to current master, but
instead just leave it for when pg17 opens, right? I mean, clearly it
seems far too invasive to put it in after beta1. On the other hand,
it's painful to know that we're going to have code that exists only on
16 and not any other release, in an area that's likely to have bugs here
and there, so we're going to need to heavily adjust backpatches for 16
especially.
I can't make up my mind about this. What do others think?
When I looked at the patch yesterday, my impression was that this
would be material for v17 as it is refactoring work, not v16.
--
Michael
On Fri, May 26, 2023 at 4:05 PM Michael Paquier <michael@paquier.xyz> wrote:
On Fri, May 26, 2023 at 10:56:53AM +0200, Alvaro Herrera wrote:
I can't make up my mind about this. What do others think?
When I looked at the patch yesterday, my impression was that this
would be material for v17 as it is refactoring work, not v16.
I'd have thought the subject line "Cleaning up nbtree after logical
decoding on standby work" made it quite clear that this patch was
targeting 16.
It's not refactoring work -- not really. The whole idea of outright
removing the use of P_NEW in nbtree was where I landed with this after
a couple of hours of work. In fact I almost posted a version without
that, though that was worse in every way than my final approach.
I first voiced concerns about this whole area way back on April 4,
which is only 3 days after commit 61b313e4 went in:
/messages/by-id/CAH2-Wz=jGryxWm74G1khSt0zNPUNhezYJnvSjNo2t3Jswtb8ww@mail.gmail.com
--
Peter Geoghegan
On Fri, May 26, 2023 at 04:48:37PM -0700, Peter Geoghegan wrote:
I'd have thought the subject line "Cleaning up nbtree after logical
decoding on standby work" made it quite clear that this patch was
targeting 16.
Hmm, okay. I was understanding that as something for v17, honestly.
It's not refactoring work -- not really. The whole idea of outright
removing the use of P_NEW in nbtree was where I landed with this after
a couple of hours of work. In fact I almost posted a version without
that, though that was worse in every way than my final approach.
I first voiced concerns about this whole area way back on April 4,
which is only 3 days after commit 61b313e4 went in:
/messages/by-id/CAH2-Wz=jGryxWm74G1khSt0zNPUNhezYJnvSjNo2t3Jswtb8ww@mail.gmail.com
Sure. My take is that if this patch were to be sent at the beginning
of April, it could have been considered in v16. However, deciding
such a matter at the end of May after beta1 has been released is a
different call. You may want to make sure that the RMT is OK with
that, at the end.
--
Michael
On 29/05/2023 03:31, Michael Paquier wrote:
On Fri, May 26, 2023 at 04:48:37PM -0700, Peter Geoghegan wrote:
I'd have thought the subject line "Cleaning up nbtree after logical
decoding on standby work" made it quite clear that this patch was
targeting 16.
Hmm, okay. I was understanding that as something for v17, honestly.
IMO this makes sense for v16. These new arguments were introduced in
v16, so if we have second thoughts, now is the right time to change
them, before v16 is released. It will reduce the backpatching effort in
the future; if we apply this in v17, then v16 will be the odd one out.
For the patch itself:
@@ -75,6 +74,10 @@
  * _bt_search() -- Search the tree for a particular scankey,
  *		or more precisely for the first leaf page it could be on.
  *
+ * rel must always be provided. heaprel must be provided by callers that pass
+ * access = BT_WRITE, since we may need to allocate a new root page for caller
+ * in passing (or when finishing a page split results in a parent page split).
+ *
  * The passed scankey is an insertion-type scankey (see nbtree/README),
  * but it can omit the rightmost column(s) of the index.
  *
Maybe add an assertion for that in _bt_search(), too. I know you added
one in _bt_getroot(), and _bt_search() calls that as the very first
thing. But I think it would be useful as documentation in _bt_search(), too.
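Something like this at the top of _bt_search(), mirroring the assertion in
_bt_getroot() (parameter name as in your patch version):

    Assert(access == BT_READ || heaprel != NULL);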
Maybe it would be more straightforward to always require heapRel in
_bt_search() and _bt_getroot(), regardless of whether it's BT_READ or
BT_WRITE, even if the functions don't use it with BT_READ. It would be
less mental effort in the callers to just always pass in 'heapRel'.
Overall, +1 on this patch, and +1 for committing it to v16.
--
Heikki Linnakangas
Neon (https://neon.tech)
Hi,
On 5/26/23 7:28 PM, Peter Geoghegan wrote:
On Fri, May 26, 2023 at 1:56 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
I suppose you're not thinking of applying this to current master, but
instead just leave it for when pg17 opens, right? I mean, clearly it
seems far too invasive to put it in after beta1.
I was planning on targeting 16 with this. Although I only posted a
patch recently, I complained about the problems in this area shortly
after the code first went in. It's fairly obvious to me that the
changes made to nbtree went quite a bit further than they needed to.
Thanks Peter for the call out and the follow up on this!
As you already mentioned in this thread, all the changes I've done in
61b313e47e were purely "mechanical" as the main goal was to move forward the
logical decoding on standby thread and..
Admittedly that's partly because I'm an expert on this particular
code.
it was not obvious to me (as I'm not an expert in this area like you are) that
many of those changes were "excessive".
Now, to be fair to Bertrand, it *looks* more complicated than it
really is, in large part due to the obscure case where VACUUM finishes
an interrupted page split (during page deletion), which itself goes on
to cause a page split one level up.
Thanks ;-)
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Sun, May 28, 2023 at 7:49 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
IMO this makes sense for v16. These new arguments were introduced in
v16, so if we have second thoughts, now is the right time to change
them, before v16 is released. It will reduce the backpatching effort in
the future; if we apply this in v17, then v16 will be the odd one out.
My current plan is to commit this later in the week, unless there are
further objections. Wednesday or possibly Thursday.
Maybe add an assertion for that in _bt_search(), too. I know you added
one in _bt_getroot(), and _bt_search() calls that as the very first
thing. But I think it would be useful as documentation in _bt_search(), too.
Attached revision v3 does it that way.
Maybe it would be more straightforward to always require heapRel in
_bt_search() and _bt_getroot(), regardless of whether it's BT_READ or
BT_WRITE, even if the functions don't use it with BT_READ. It would be
less mental effort in the callers to just always pass in 'heapRel'.
Perhaps, but it would also necessitate keeping heapRel in
_bt_get_endpoint()'s signature. That would mean that
_bt_get_endpoint() would needlessly pass its own superfluous heapRel
arg to _bt_search(), while presumably never passing the same heapRel
to _bt_gettrueroot() (not to be confused with _bt_getroot) in the
"level == 0" case. These inconsistencies seem kind of jarring.
Besides, every _bt_search() caller must already understand that
_bt_search does non-obvious extra work for BT_WRITE callers -- that's
nothing new. The requirement that BT_WRITE callers pass a heapRel
exists precisely because code that is used only during BT_WRITE calls
might ultimately reach _bt_allocbuf() indirectly. The "no heapRel
required in BT_READ case" seems directly relevant to callers --
avoiding _bt_allocbuf() during _bt_search() calls during Hot Standby
(or during VACUUM) is a basic requirement that callers more or less
ask for and expect already. (Bear in mind that the new rule going
forward is that _bt_allocbuf() is the one and only choke point where
new pages/buffers can be allocated by nbtree, and the only possible
source of recovery conflicts during REDO besides opportunistic
deletion record conflicts -- so it really isn't strange for _bt_search
callers to be thinking about whether _bt_allocbuf is safe to call.)
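To make the convention concrete, compare the two archetypal _bt_search()
callers from the patch -- amcheck's read-only descent and _bt_search_insert():

    /* read-only caller: can't reach _bt_allocbuf(), so no heapRel is needed */
    stack = _bt_search(state->rel, NULL, key, &lbuf, BT_READ, NULL);

    /* write caller: page splits are possible, so a heapRel must be supplied */
    return _bt_search(rel, heapRel, insertstate->itup_key, &insertstate->buf,
                      BT_WRITE, NULL);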
--
Peter Geoghegan
Attachments:
v3-0001-nbtree-Allocate-new-pages-in-separate-function.patch (application/octet-stream)
From 9c2ee8dcdf5cf4c44803d25c631f211b909f992f Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 23 May 2023 22:08:56 -0700
Subject: [PATCH v3 1/2] nbtree: Allocate new pages in separate function.
Split nbtree's _bt_getbuf function in two: code that read locks or write
locks existing pages is left in _bt_getbuf, while code that deals with
allocating new pages is moved to a new, dedicated function:
_bt_allocbuf.
This division of labor cleanly separates code that might need to
allocate a new page (usually as part of a page split) from all other
nbtree code. This is clearer overall. Read-only callers to functions
like _bt_search opt out of finishing off any incomplete page splits that
are encountered in passing. Other callers in write paths (that can
never run during Hot Standby) will finish off incomplete page splits,
even in pages that they wouldn't usually have to write lock.
The immediate reason to do this refactoring now is to revert certain
mechanical changes that were required so that _bt_getbuf callers would
provide a valid heapRel in all cases. Commit 61b313e4, which prepared
nbtree for logical decoding on standbys, did this for all _bt_getbuf
callers, even though it was only necessary for P_NEW _bt_getbuf callers.
We could keep the nbtree P_NEW idiom while just passing NULL as our
heapRel instead, but completely replacing it with _bt_allocbuf turned
out to be significantly cleaner. This approach actually makes things
_more_ consistent between HEAD and back branches that still use P_NEW.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CAH2-Wz=8Z9qY58bjm_7TAHgtW6RzZ5Ke62q5emdCEy9BAzwhmg@mail.gmail.com
---
src/include/access/nbtree.h | 38 ++-
src/include/utils/tuplesort.h | 4 +-
src/backend/access/heap/heapam_handler.c | 9 +-
src/backend/access/nbtree/nbtinsert.c | 88 ++---
src/backend/access/nbtree/nbtpage.c | 378 +++++++++++----------
src/backend/access/nbtree/nbtree.c | 4 +-
src/backend/access/nbtree/nbtsearch.c | 62 ++--
src/backend/access/nbtree/nbtsort.c | 2 +-
src/backend/access/nbtree/nbtutils.c | 7 +-
src/backend/optimizer/util/plancat.c | 2 +-
src/backend/utils/sort/tuplesortvariants.c | 5 +-
contrib/amcheck/verify_nbtree.c | 15 +-
12 files changed, 317 insertions(+), 297 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d68478609..b690e2476 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -288,16 +288,19 @@ BTPageGetDeleteXid(Page page)
* well need special handling for new pages anyway.
*/
static inline bool
-BTPageIsRecyclable(Page page, Relation heaprel)
+BTPageIsRecyclable(Page page, Relation heapRel)
{
BTPageOpaque opaque;
Assert(!PageIsNew(page));
+ Assert(heapRel != NULL);
/* Recycling okay iff page is deleted and safexid is old enough */
opaque = BTPageGetOpaque(page);
if (P_ISDELETED(opaque))
{
+ FullTransactionId safexid = BTPageGetDeleteXid(page);
+
/*
* The page was deleted, but when? If it was just deleted, a scan
* might have seen the downlink to it, and will read the page later.
@@ -308,7 +311,7 @@ BTPageIsRecyclable(Page page, Relation heaprel)
* anyone. If not, then no scan that's still in progress could have
* seen its downlink, and we can recycle it.
*/
- return GlobalVisCheckRemovableFullXid(heaprel, BTPageGetDeleteXid(page));
+ return GlobalVisCheckRemovableFullXid(heapRel, safexid);
}
return false;
@@ -1177,9 +1180,9 @@ extern IndexTuple _bt_swap_posting(IndexTuple newitem, IndexTuple oposting,
extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, bool indexUnchanged,
Relation heapRel);
-extern void _bt_finish_split(Relation rel, Relation heaprel, Buffer lbuf,
+extern void _bt_finish_split(Relation rel, Relation heapRel, Buffer lbuf,
BTStack stack);
-extern Buffer _bt_getstackbuf(Relation rel, Relation heaprel, BTStack stack,
+extern Buffer _bt_getstackbuf(Relation rel, Relation heapRel, BTStack stack,
BlockNumber child);
/*
@@ -1194,18 +1197,17 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page origpage,
*/
extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
bool allequalimage);
-extern bool _bt_vacuum_needs_cleanup(Relation rel, Relation heaprel);
-extern void _bt_set_cleanup_info(Relation rel, Relation heaprel,
- BlockNumber num_delpages);
+extern bool _bt_vacuum_needs_cleanup(Relation rel);
+extern void _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages);
extern void _bt_upgrademetapage(Page page);
-extern Buffer _bt_getroot(Relation rel, Relation heaprel, int access);
-extern Buffer _bt_gettrueroot(Relation rel, Relation heaprel);
-extern int _bt_getrootheight(Relation rel, Relation heaprel);
-extern void _bt_metaversion(Relation rel, Relation heaprel, bool *heapkeyspace,
+extern Buffer _bt_getroot(Relation rel, Relation heapRel, int access);
+extern Buffer _bt_gettrueroot(Relation rel);
+extern int _bt_getrootheight(Relation rel);
+extern void _bt_metaversion(Relation rel, bool *heapkeyspace,
bool *allequalimage);
extern void _bt_checkpage(Relation rel, Buffer buf);
-extern Buffer _bt_getbuf(Relation rel, Relation heaprel, BlockNumber blkno,
- int access);
+extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
+extern Buffer _bt_allocbuf(Relation rel, Relation heapRel);
extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
BlockNumber blkno, int access);
extern void _bt_relbuf(Relation rel, Buffer buf);
@@ -1228,22 +1230,22 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
/*
* prototypes for functions in nbtsearch.c
*/
-extern BTStack _bt_search(Relation rel, Relation heaprel, BTScanInsert key,
+extern BTStack _bt_search(Relation rel, Relation heapRel, BTScanInsert key,
Buffer *bufP, int access, Snapshot snapshot);
-extern Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
+extern Buffer _bt_moveright(Relation rel, Relation heapRel, BTScanInsert key,
Buffer buf, bool forupdate, BTStack stack,
int access, Snapshot snapshot);
extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
-extern Buffer _bt_get_endpoint(Relation rel, Relation heaprel, uint32 level,
- bool rightmost, Snapshot snapshot);
+extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
+ Snapshot snapshot);
/*
* prototypes for functions in nbtutils.c
*/
-extern BTScanInsert _bt_mkscankey(Relation rel, Relation heaprel, IndexTuple itup);
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
extern void _bt_freestack(BTStack stack);
extern void _bt_preprocess_array_keys(IndexScanDesc scan);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 656a2e989..af057b635 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -399,9 +399,7 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
int workMem, SortCoordinate coordinate,
int sortopt);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
- Relation indexRel,
- Relation heaprel,
- int workMem,
+ Relation indexRel, int workMem,
SortCoordinate coordinate,
int sortopt);
extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 646135cc2..0755be839 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -731,14 +731,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*multi_cutoff);
- /*
- * Set up sorting if wanted. NewHeap is being passed to
- * tuplesort_begin_cluster(), it could have been OldHeap too. It does not
- * really matter, as the goal is to have a heap relation being passed to
- * _bt_log_reuse_page() (which should not be called from this code path).
- */
+ /* Set up sorting if wanted */
if (use_sort)
- tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex, NewHeap,
+ tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex,
maintenance_work_mem,
NULL, TUPLESORT_NONE);
else
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 84bbd78d5..0a0ba875b 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -30,7 +30,7 @@
#define BTREE_FASTPATH_MIN_LEVEL 2
-static BTStack _bt_search_insert(Relation rel, Relation heaprel,
+static BTStack _bt_search_insert(Relation rel, Relation heapRel,
BTInsertState insertstate);
static TransactionId _bt_check_unique(Relation rel, BTInsertState insertstate,
Relation heapRel,
@@ -42,9 +42,11 @@ static OffsetNumber _bt_findinsertloc(Relation rel,
bool indexUnchanged,
BTStack stack,
Relation heapRel);
-static void _bt_stepright(Relation rel, Relation heaprel,
+static void _bt_stepright(Relation rel, Relation heapRel,
BTInsertState insertstate, BTStack stack);
-static void _bt_insertonpg(Relation rel, Relation heaprel, BTScanInsert itup_key,
+static void _bt_insertonpg(Relation rel,
+ Relation heapRel,
+ BTScanInsert itup_key,
Buffer buf,
Buffer cbuf,
BTStack stack,
@@ -53,13 +55,13 @@ static void _bt_insertonpg(Relation rel, Relation heaprel, BTScanInsert itup_key
OffsetNumber newitemoff,
int postingoff,
bool split_only_page);
-static Buffer _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key,
+static Buffer _bt_split(Relation rel, Relation heapRel, BTScanInsert itup_key,
Buffer buf, Buffer cbuf, OffsetNumber newitemoff,
Size newitemsz, IndexTuple newitem, IndexTuple orignewitem,
IndexTuple nposting, uint16 postingoff);
-static void _bt_insert_parent(Relation rel, Relation heaprel, Buffer buf,
+static void _bt_insert_parent(Relation rel, Relation heapRel, Buffer buf,
Buffer rbuf, BTStack stack, bool isroot, bool isonly);
-static Buffer _bt_newroot(Relation rel, Relation heaprel, Buffer lbuf, Buffer rbuf);
+static Buffer _bt_newlevel(Relation rel, Relation heapRel, Buffer lbuf, Buffer rbuf);
static inline bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off, bool newfirstdataitem);
static void _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
@@ -110,7 +112,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
/* we need an insertion scan key to do our search, so build one */
- itup_key = _bt_mkscankey(rel, heapRel, itup);
+ itup_key = _bt_mkscankey(rel, itup);
if (checkingunique)
{
@@ -314,7 +316,7 @@ search:
* since each per-backend cache won't stay valid for long.
*/
static BTStack
-_bt_search_insert(Relation rel, Relation heaprel, BTInsertState insertstate)
+_bt_search_insert(Relation rel, Relation heapRel, BTInsertState insertstate)
{
Assert(insertstate->buf == InvalidBuffer);
Assert(!insertstate->bounds_valid);
@@ -377,7 +379,7 @@ _bt_search_insert(Relation rel, Relation heaprel, BTInsertState insertstate)
}
/* Cannot use optimization -- descend tree, return proper descent stack */
- return _bt_search(rel, heaprel, insertstate->itup_key, &insertstate->buf,
+ return _bt_search(rel, heapRel, insertstate->itup_key, &insertstate->buf,
BT_WRITE, NULL);
}
@@ -1024,13 +1026,15 @@ _bt_findinsertloc(Relation rel,
* indexes.
*/
static void
-_bt_stepright(Relation rel, Relation heaprel, BTInsertState insertstate, BTStack stack)
+_bt_stepright(Relation rel, Relation heapRel, BTInsertState insertstate,
+ BTStack stack)
{
Page page;
BTPageOpaque opaque;
Buffer rbuf;
BlockNumber rblkno;
+ Assert(heapRel != NULL);
page = BufferGetPage(insertstate->buf);
opaque = BTPageGetOpaque(page);
@@ -1050,7 +1054,7 @@ _bt_stepright(Relation rel, Relation heaprel, BTInsertState insertstate, BTStack
*/
if (P_INCOMPLETE_SPLIT(opaque))
{
- _bt_finish_split(rel, heaprel, rbuf, stack);
+ _bt_finish_split(rel, heapRel, rbuf, stack);
rbuf = InvalidBuffer;
continue;
}
@@ -1101,7 +1105,7 @@ _bt_stepright(Relation rel, Relation heaprel, BTInsertState insertstate, BTStack
*/
static void
_bt_insertonpg(Relation rel,
- Relation heaprel,
+ Relation heapRel,
BTScanInsert itup_key,
Buffer buf,
Buffer cbuf,
@@ -1145,7 +1149,7 @@ _bt_insertonpg(Relation rel,
/*
* Every internal page should have exactly one negative infinity item at
- * all times. Only _bt_split() and _bt_newroot() should add items that
+ * all times. Only _bt_split() and _bt_newlevel() should add items that
* become negative infinity items through truncation, since they're the
* only routines that allocate new internal pages.
*/
@@ -1212,7 +1216,7 @@ _bt_insertonpg(Relation rel,
Assert(!split_only_page);
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, heaprel, itup_key, buf, cbuf, newitemoff, itemsz,
+ rbuf = _bt_split(rel, heapRel, itup_key, buf, cbuf, newitemoff, itemsz,
itup, origitup, nposting, postingoff);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
@@ -1236,7 +1240,7 @@ _bt_insertonpg(Relation rel,
* page.
*----------
*/
- _bt_insert_parent(rel, heaprel, buf, rbuf, stack, isroot, isonly);
+ _bt_insert_parent(rel, heapRel, buf, rbuf, stack, isroot, isonly);
}
else
{
@@ -1250,14 +1254,14 @@ _bt_insertonpg(Relation rel,
* only one on its tree level, but was not the root, it may have been
* the "fast root". We need to ensure that the fast root link points
* at or above the current page. We can safely acquire a lock on the
- * metapage here --- see comments for _bt_newroot().
+ * metapage here --- see comments for _bt_newlevel().
*/
if (unlikely(split_only_page))
{
Assert(!isleaf);
Assert(BufferIsValid(cbuf));
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -1421,7 +1425,7 @@ _bt_insertonpg(Relation rel,
* call _bt_getrootheight while holding a buffer lock.
*/
if (BlockNumberIsValid(blockcache) &&
- _bt_getrootheight(rel, heaprel) >= BTREE_FASTPATH_MIN_LEVEL)
+ _bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
RelationSetTargetBlock(rel, blockcache);
}
@@ -1462,7 +1466,7 @@ _bt_insertonpg(Relation rel,
* The pin and lock on buf are maintained.
*/
static Buffer
-_bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
+_bt_split(Relation rel, Relation heapRel, BTScanInsert itup_key, Buffer buf,
Buffer cbuf, OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
IndexTuple orignewitem, IndexTuple nposting, uint16 postingoff)
{
@@ -1715,7 +1719,7 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
* way because it avoids an unnecessary PANIC when either origpage or its
* existing sibling page are corrupt.
*/
- rbuf = _bt_getbuf(rel, heaprel, P_NEW, BT_WRITE);
+ rbuf = _bt_allocbuf(rel, heapRel);
rightpage = BufferGetPage(rbuf);
rightpagenumber = BufferGetBlockNumber(rbuf);
/* rightpage was initialized by _bt_getbuf */
@@ -1888,7 +1892,7 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
*/
if (!isrightmost)
{
- sbuf = _bt_getbuf(rel, heaprel, oopaque->btpo_next, BT_WRITE);
+ sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE);
spage = BufferGetPage(sbuf);
sopaque = BTPageGetOpaque(spage);
if (sopaque->btpo_prev != origpagenumber)
@@ -2095,13 +2099,15 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
*/
static void
_bt_insert_parent(Relation rel,
- Relation heaprel,
+ Relation heapRel,
Buffer buf,
Buffer rbuf,
BTStack stack,
bool isroot,
bool isonly)
{
+ Assert(heapRel != NULL);
+
/*
* Here we have to do something Lehman and Yao don't talk about: deal with
* a root split and construction of a new root. If our stack is empty
@@ -2121,8 +2127,8 @@ _bt_insert_parent(Relation rel,
Assert(stack == NULL);
Assert(isonly);
- /* create a new root node and update the metapage */
- rootbuf = _bt_newroot(rel, heaprel, buf, rbuf);
+ /* create a new root node one level up and update the metapage */
+ rootbuf = _bt_newlevel(rel, heapRel, buf, rbuf);
/* release the split buffers */
_bt_relbuf(rel, rootbuf);
_bt_relbuf(rel, rbuf);
@@ -2161,8 +2167,7 @@ _bt_insert_parent(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* Find the leftmost page at the next level up */
- pbuf = _bt_get_endpoint(rel, heaprel, opaque->btpo_level + 1, false,
- NULL);
+ pbuf = _bt_get_endpoint(rel, opaque->btpo_level + 1, false, NULL);
/* Set up a phony stack entry pointing there */
stack = &fakestack;
stack->bts_blkno = BufferGetBlockNumber(pbuf);
@@ -2188,7 +2193,7 @@ _bt_insert_parent(Relation rel,
* new downlink will be inserted at the correct offset. Even buf's
* parent may have changed.
*/
- pbuf = _bt_getstackbuf(rel, heaprel, stack, bknum);
+ pbuf = _bt_getstackbuf(rel, heapRel, stack, bknum);
/*
* Unlock the right child. The left child will be unlocked in
@@ -2212,7 +2217,7 @@ _bt_insert_parent(Relation rel,
RelationGetRelationName(rel), bknum, rbknum)));
/* Recursively insert into the parent */
- _bt_insertonpg(rel, heaprel, NULL, pbuf, buf, stack->bts_parent,
+ _bt_insertonpg(rel, heapRel, NULL, pbuf, buf, stack->bts_parent,
new_item, MAXALIGN(IndexTupleSize(new_item)),
stack->bts_offset + 1, 0, isonly);
@@ -2230,9 +2235,12 @@ _bt_insert_parent(Relation rel,
*
* On entry, 'lbuf' must be locked in write-mode. On exit, it is unlocked
* and unpinned.
+ *
+ * Caller must provide a valid heapRel, since finishing a page split will
+ * require allocating a new page when the parent page splits in turn.
*/
void
-_bt_finish_split(Relation rel, Relation heaprel, Buffer lbuf, BTStack stack)
+_bt_finish_split(Relation rel, Relation heapRel, Buffer lbuf, BTStack stack)
{
Page lpage = BufferGetPage(lbuf);
BTPageOpaque lpageop = BTPageGetOpaque(lpage);
@@ -2243,9 +2251,10 @@ _bt_finish_split(Relation rel, Relation heaprel, Buffer lbuf, BTStack stack)
bool wasonly;
Assert(P_INCOMPLETE_SPLIT(lpageop));
+ Assert(heapRel != NULL);
/* Lock right sibling, the one missing the downlink */
- rbuf = _bt_getbuf(rel, heaprel, lpageop->btpo_next, BT_WRITE);
+ rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE);
rpage = BufferGetPage(rbuf);
rpageop = BTPageGetOpaque(rpage);
@@ -2257,7 +2266,7 @@ _bt_finish_split(Relation rel, Relation heaprel, Buffer lbuf, BTStack stack)
BTMetaPageData *metad;
/* acquire lock on the metapage */
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -2274,7 +2283,7 @@ _bt_finish_split(Relation rel, Relation heaprel, Buffer lbuf, BTStack stack)
elog(DEBUG1, "finishing incomplete split of %u/%u",
BufferGetBlockNumber(lbuf), BufferGetBlockNumber(rbuf));
- _bt_insert_parent(rel, heaprel, lbuf, rbuf, stack, wasroot, wasonly);
+ _bt_insert_parent(rel, heapRel, lbuf, rbuf, stack, wasroot, wasonly);
}
/*
@@ -2309,7 +2318,7 @@ _bt_finish_split(Relation rel, Relation heaprel, Buffer lbuf, BTStack stack)
* offset number bts_offset + 1.
*/
Buffer
-_bt_getstackbuf(Relation rel, Relation heaprel, BTStack stack, BlockNumber child)
+_bt_getstackbuf(Relation rel, Relation heapRel, BTStack stack, BlockNumber child)
{
BlockNumber blkno;
OffsetNumber start;
@@ -2323,13 +2332,14 @@ _bt_getstackbuf(Relation rel, Relation heaprel, BTStack stack, BlockNumber child
Page page;
BTPageOpaque opaque;
- buf = _bt_getbuf(rel, heaprel, blkno, BT_WRITE);
+ buf = _bt_getbuf(rel, blkno, BT_WRITE);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
+ Assert(heapRel != NULL);
if (P_INCOMPLETE_SPLIT(opaque))
{
- _bt_finish_split(rel, heaprel, buf, stack->bts_parent);
+ _bt_finish_split(rel, heapRel, buf, stack->bts_parent);
continue;
}
@@ -2415,7 +2425,7 @@ _bt_getstackbuf(Relation rel, Relation heaprel, BTStack stack, BlockNumber child
}
/*
- * _bt_newroot() -- Create a new root page for the index.
+ * _bt_newlevel() -- Create a new level above former root page.
*
* We've just split the old root page and need to create a new one.
* In order to do this, we add a new root page to the file, then lock
@@ -2433,7 +2443,7 @@ _bt_getstackbuf(Relation rel, Relation heaprel, BTStack stack, BlockNumber child
* lbuf, rbuf & rootbuf.
*/
static Buffer
-_bt_newroot(Relation rel, Relation heaprel, Buffer lbuf, Buffer rbuf)
+_bt_newlevel(Relation rel, Relation heapRel, Buffer lbuf, Buffer rbuf)
{
Buffer rootbuf;
Page lpage,
@@ -2459,12 +2469,12 @@ _bt_newroot(Relation rel, Relation heaprel, Buffer lbuf, Buffer rbuf)
lopaque = BTPageGetOpaque(lpage);
/* get a new root page */
- rootbuf = _bt_getbuf(rel, heaprel, P_NEW, BT_WRITE);
+ rootbuf = _bt_allocbuf(rel, heapRel);
rootpage = BufferGetPage(rootbuf);
rootblknum = BufferGetBlockNumber(rootbuf);
/* acquire lock on the metapage */
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 486d1563d..e9fb5e8f5 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -38,22 +38,21 @@
#include "utils/snapmgr.h"
static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
-static void _bt_log_reuse_page(Relation rel, Relation heaprel, BlockNumber blkno,
- FullTransactionId safexid);
-static void _bt_delitems_delete(Relation rel, Relation heaprel, Buffer buf,
+static void _bt_delitems_delete(Relation rel, Buffer buf,
TransactionId snapshotConflictHorizon,
+ bool isCatalogRel,
OffsetNumber *deletable, int ndeletable,
BTVacuumPosting *updatable, int nupdatable);
static char *_bt_delitems_update(BTVacuumPosting *updatable, int nupdatable,
OffsetNumber *updatedoffsets,
Size *updatedbuflen, bool needswal);
-static bool _bt_mark_page_halfdead(Relation rel, Relation heaprel,
+static bool _bt_mark_page_halfdead(Relation rel, Relation heapRel,
Buffer leafbuf, BTStack stack);
static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
BlockNumber scanblkno,
bool *rightsib_empty,
BTVacState *vstate);
-static bool _bt_lock_subtree_parent(Relation rel, Relation heaprel,
+static bool _bt_lock_subtree_parent(Relation rel, Relation heapRel,
BlockNumber child, BTStack stack,
Buffer *subtreeparent, OffsetNumber *poffset,
BlockNumber *topparent,
@@ -177,7 +176,7 @@ _bt_getmeta(Relation rel, Buffer metabuf)
* index tuples needed to be deleted.
*/
bool
-_bt_vacuum_needs_cleanup(Relation rel, Relation heaprel)
+_bt_vacuum_needs_cleanup(Relation rel)
{
Buffer metabuf;
Page metapg;
@@ -190,7 +189,7 @@ _bt_vacuum_needs_cleanup(Relation rel, Relation heaprel)
*
* Note that we deliberately avoid using cached version of metapage here.
*/
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
btm_version = metad->btm_version;
@@ -230,7 +229,7 @@ _bt_vacuum_needs_cleanup(Relation rel, Relation heaprel)
* finalized.
*/
void
-_bt_set_cleanup_info(Relation rel, Relation heaprel, BlockNumber num_delpages)
+_bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
{
Buffer metabuf;
Page metapg;
@@ -254,7 +253,7 @@ _bt_set_cleanup_info(Relation rel, Relation heaprel, BlockNumber num_delpages)
* no longer used as of PostgreSQL 14. We set it to -1.0 on rewrite, just
* to be consistent.
*/
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -326,6 +325,9 @@ _bt_set_cleanup_info(Relation rel, Relation heaprel, BlockNumber num_delpages)
* NOTE that the returned root page will have only a read lock set
* on it even if access = BT_WRITE!
*
+ * If access = BT_WRITE, heapRel must be set; otherwise caller should
+ * just pass NULL. See _bt_allocbuf for an explanation.
+ *
* The returned page is not necessarily the true root --- it could be
* a "fast root" (a page that is alone in its level due to deletions).
* Also, if the root page is split while we are "in flight" to it,
@@ -339,7 +341,7 @@ _bt_set_cleanup_info(Relation rel, Relation heaprel, BlockNumber num_delpages)
* The metadata page is not locked or pinned on exit.
*/
Buffer
-_bt_getroot(Relation rel, Relation heaprel, int access)
+_bt_getroot(Relation rel, Relation heapRel, int access)
{
Buffer metabuf;
Buffer rootbuf;
@@ -349,6 +351,8 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
uint32 rootlevel;
BTMetaPageData *metad;
+ Assert(access == BT_READ || heapRel != NULL);
+
/*
* Try to use previously-cached metapage data to find the root. This
* normally saves one buffer access per index search, which is a very
@@ -369,7 +373,7 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
Assert(rootblkno != P_NONE);
rootlevel = metad->btm_fastlevel;
- rootbuf = _bt_getbuf(rel, heaprel, rootblkno, BT_READ);
+ rootbuf = _bt_getbuf(rel, rootblkno, BT_READ);
rootpage = BufferGetPage(rootbuf);
rootopaque = BTPageGetOpaque(rootpage);
@@ -395,7 +399,7 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
rel->rd_amcache = NULL;
}
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metad = _bt_getmeta(rel, metabuf);
/* if no root page initialized yet, do it */
@@ -428,7 +432,7 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
* to optimize this case.)
*/
_bt_relbuf(rel, metabuf);
- return _bt_getroot(rel, heaprel, access);
+ return _bt_getroot(rel, heapRel, access);
}
/*
@@ -436,7 +440,7 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
* the new root page. Since this is the first page in the tree, it's
* a leaf as well as the root.
*/
- rootbuf = _bt_getbuf(rel, heaprel, P_NEW, BT_WRITE);
+ rootbuf = _bt_allocbuf(rel, heapRel);
rootblkno = BufferGetBlockNumber(rootbuf);
rootpage = BufferGetPage(rootbuf);
rootopaque = BTPageGetOpaque(rootpage);
@@ -573,7 +577,7 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
* moving to the root --- that'd deadlock against any concurrent root split.)
*/
Buffer
-_bt_gettrueroot(Relation rel, Relation heaprel)
+_bt_gettrueroot(Relation rel)
{
Buffer metabuf;
Page metapg;
@@ -595,7 +599,7 @@ _bt_gettrueroot(Relation rel, Relation heaprel)
pfree(rel->rd_amcache);
rel->rd_amcache = NULL;
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metapg = BufferGetPage(metabuf);
metaopaque = BTPageGetOpaque(metapg);
metad = BTPageGetMeta(metapg);
@@ -668,7 +672,7 @@ _bt_gettrueroot(Relation rel, Relation heaprel)
* about updating previously cached data.
*/
int
-_bt_getrootheight(Relation rel, Relation heaprel)
+_bt_getrootheight(Relation rel)
{
BTMetaPageData *metad;
@@ -676,7 +680,7 @@ _bt_getrootheight(Relation rel, Relation heaprel)
{
Buffer metabuf;
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metad = _bt_getmeta(rel, metabuf);
/*
@@ -732,7 +736,7 @@ _bt_getrootheight(Relation rel, Relation heaprel)
* pg_upgrade'd from Postgres 12.
*/
void
-_bt_metaversion(Relation rel, Relation heaprel, bool *heapkeyspace, bool *allequalimage)
+_bt_metaversion(Relation rel, bool *heapkeyspace, bool *allequalimage)
{
BTMetaPageData *metad;
@@ -740,7 +744,7 @@ _bt_metaversion(Relation rel, Relation heaprel, bool *heapkeyspace, bool *allequ
{
Buffer metabuf;
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metad = _bt_getmeta(rel, metabuf);
/*
@@ -821,37 +825,7 @@ _bt_checkpage(Relation rel, Buffer buf)
}
/*
- * Log the reuse of a page from the FSM.
- */
-static void
-_bt_log_reuse_page(Relation rel, Relation heaprel, BlockNumber blkno,
- FullTransactionId safexid)
-{
- xl_btree_reuse_page xlrec_reuse;
-
- /*
- * Note that we don't register the buffer with the record, because this
- * operation doesn't modify the page. This record only exists to provide a
- * conflict point for Hot Standby.
- */
-
- /* XLOG stuff */
- xlrec_reuse.isCatalogRel = RelationIsAccessibleInLogicalDecoding(heaprel);
- xlrec_reuse.locator = rel->rd_locator;
- xlrec_reuse.block = blkno;
- xlrec_reuse.snapshotConflictHorizon = safexid;
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec_reuse, SizeOfBtreeReusePage);
-
- XLogInsert(RM_BTREE_ID, XLOG_BTREE_REUSE_PAGE);
-}
-
-/*
- * _bt_getbuf() -- Get a buffer by block number for read or write.
- *
- * blkno == P_NEW means to get an unallocated index page. The page
- * will be initialized before returning it.
+ * _bt_getbuf() -- Get an existing block in a buffer, for read or write.
*
* The general rule in nbtree is that it's never okay to access a
* page without holding both a buffer pin and a buffer lock on
@@ -860,136 +834,167 @@ _bt_log_reuse_page(Relation rel, Relation heaprel, BlockNumber blkno,
* When this routine returns, the appropriate lock is set on the
* requested buffer and its reference count has been incremented
* (ie, the buffer is "locked and pinned"). Also, we apply
- * _bt_checkpage to sanity-check the page (except in P_NEW case),
- * and perform Valgrind client requests that help Valgrind detect
- * unsafe page accesses.
+ * _bt_checkpage to sanity-check the page, and perform Valgrind
+ * client requests that help Valgrind detect unsafe page accesses.
*
* Note: raw LockBuffer() calls are disallowed in nbtree; all
* buffer lock requests need to go through wrapper functions such
* as _bt_lockbuf().
*/
Buffer
-_bt_getbuf(Relation rel, Relation heaprel, BlockNumber blkno, int access)
+_bt_getbuf(Relation rel, BlockNumber blkno, int access)
{
Buffer buf;
- if (blkno != P_NEW)
+ Assert(BlockNumberIsValid(blkno));
+
+ /* Read an existing block of the relation */
+ buf = ReadBuffer(rel, blkno);
+ _bt_lockbuf(rel, buf, access);
+ _bt_checkpage(rel, buf);
+
+ return buf;
+}
+
+/*
+ * _bt_allocbuf() -- Allocate a new block/page.
+ *
+ * Routine returns a pinned and write-locked buffer that contains a newly
+ * allocated index page. The page will be initialized with the appropriate
+ * amount of special area space for an nbtree page, too.
+ *
+ * Callers are required to pass a valid heapRel. We need heapRel so that we
+ * can handle generating a snapshotConflictHorizon that makes reusing a page
+ * from the FSM safe for queries that may be running on standbys.
+ */
+Buffer
+_bt_allocbuf(Relation rel, Relation heapRel)
+{
+ Buffer buf;
+ BlockNumber blkno;
+ Page page;
+
+ Assert(heapRel != NULL);
+
+ /*
+ * First see if the FSM knows of any free pages.
+ *
+ * We can't trust the FSM's report unreservedly; we have to check that the
+ * page is still free. (For example, an already-free page could have been
+ * re-used between the time the last VACUUM scanned it and the time the
+ * VACUUM made its FSM updates.)
+ *
+ * In fact, it's worse than that: we can't even assume that it's safe to
+ * take a lock on the reported page. If somebody else has a lock on it,
+ * or even worse our own caller does, we could deadlock. (The own-caller
+ * scenario is actually not improbable. Consider an index on a serial or
+ * timestamp column. Nearly all splits will be at the rightmost page, so
+ * it's entirely likely that _bt_split will call us while holding a lock
+ * on the page most recently acquired from FSM. A VACUUM running
+ * concurrently with the previous split could well have placed that page
+ * back in FSM.)
+ *
+ * To get around that, we ask for only a conditional lock on the reported
+ * page. If we fail, then someone else is using the page, and we may
+ * reasonably assume it's not free. (If we happen to be wrong, the worst
+ * consequence is the page will be lost to use till the next VACUUM, which
+ * is no big problem.)
+ */
+ for (;;)
{
- /* Read an existing block of the relation */
+ blkno = GetFreeIndexPage(rel);
+ if (blkno == InvalidBlockNumber)
+ break;
buf = ReadBuffer(rel, blkno);
- _bt_lockbuf(rel, buf, access);
- _bt_checkpage(rel, buf);
- }
- else
- {
- Page page;
-
- Assert(access == BT_WRITE);
-
- /*
- * First see if the FSM knows of any free pages.
- *
- * We can't trust the FSM's report unreservedly; we have to check that
- * the page is still free. (For example, an already-free page could
- * have been re-used between the time the last VACUUM scanned it and
- * the time the VACUUM made its FSM updates.)
- *
- * In fact, it's worse than that: we can't even assume that it's safe
- * to take a lock on the reported page. If somebody else has a lock
- * on it, or even worse our own caller does, we could deadlock. (The
- * own-caller scenario is actually not improbable. Consider an index
- * on a serial or timestamp column. Nearly all splits will be at the
- * rightmost page, so it's entirely likely that _bt_split will call us
- * while holding a lock on the page most recently acquired from FSM. A
- * VACUUM running concurrently with the previous split could well have
- * placed that page back in FSM.)
- *
- * To get around that, we ask for only a conditional lock on the
- * reported page. If we fail, then someone else is using the page,
- * and we may reasonably assume it's not free. (If we happen to be
- * wrong, the worst consequence is the page will be lost to use till
- * the next VACUUM, which is no big problem.)
- */
- for (;;)
+ if (_bt_conditionallockbuf(rel, buf))
{
- blkno = GetFreeIndexPage(rel);
- if (blkno == InvalidBlockNumber)
- break;
- buf = ReadBuffer(rel, blkno);
- if (_bt_conditionallockbuf(rel, buf))
- {
- page = BufferGetPage(buf);
+ page = BufferGetPage(buf);
+ /*
+ * It's possible to find an all-zeroes page in an index. For
+ * example, a backend might successfully extend the relation one
+ * page and then crash before it is able to make a WAL entry for
+ * adding the page. If we find a zeroed page then reclaim it
+ * immediately.
+ */
+ if (PageIsNew(page))
+ {
+ /* Okay to use page. Initialize and return it. */
+ _bt_pageinit(page, BufferGetPageSize(buf));
+ return buf;
+ }
+
+ if (BTPageIsRecyclable(page, heapRel))
+ {
/*
- * It's possible to find an all-zeroes page in an index. For
- * example, a backend might successfully extend the relation
- * one page and then crash before it is able to make a WAL
- * entry for adding the page. If we find a zeroed page then
- * reclaim it immediately.
+ * If we are generating WAL for Hot Standby then create a WAL
+ * record that will allow us to conflict with queries running
+ * on standby, in case they have snapshots older than safexid
+ * value
*/
- if (PageIsNew(page))
+ if (RelationNeedsWAL(rel) && XLogStandbyInfoActive())
{
- /* Okay to use page. Initialize and return it. */
- _bt_pageinit(page, BufferGetPageSize(buf));
- return buf;
- }
+ xl_btree_reuse_page xlrec_reuse;
- if (BTPageIsRecyclable(page, heaprel))
- {
/*
- * If we are generating WAL for Hot Standby then create a
- * WAL record that will allow us to conflict with queries
- * running on standby, in case they have snapshots older
- * than safexid value
+ * Note that we don't register the buffer with the record,
+ * because this operation doesn't modify the page (that
+ * already happened, back when VACUUM deleted the page).
+ * This record only exists to provide a conflict point for
+ * Hot Standby. See record REDO routine comments.
*/
- if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
- _bt_log_reuse_page(rel, heaprel, blkno,
- BTPageGetDeleteXid(page));
+ xlrec_reuse.locator = rel->rd_locator;
+ xlrec_reuse.block = blkno;
+ xlrec_reuse.snapshotConflictHorizon = BTPageGetDeleteXid(page);
+ xlrec_reuse.isCatalogRel =
+ RelationIsAccessibleInLogicalDecoding(heapRel);
- /* Okay to use page. Re-initialize and return it. */
- _bt_pageinit(page, BufferGetPageSize(buf));
- return buf;
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec_reuse, SizeOfBtreeReusePage);
+
+ XLogInsert(RM_BTREE_ID, XLOG_BTREE_REUSE_PAGE);
}
- elog(DEBUG2, "FSM returned nonrecyclable page");
- _bt_relbuf(rel, buf);
- }
- else
- {
- elog(DEBUG2, "FSM returned nonlockable page");
- /* couldn't get lock, so just drop pin */
- ReleaseBuffer(buf);
+
+ /* Okay to use page. Re-initialize and return it. */
+ _bt_pageinit(page, BufferGetPageSize(buf));
+ return buf;
}
+ elog(DEBUG2, "FSM returned nonrecyclable page");
+ _bt_relbuf(rel, buf);
+ }
+ else
+ {
+ elog(DEBUG2, "FSM returned nonlockable page");
+ /* couldn't get lock, so just drop pin */
+ ReleaseBuffer(buf);
}
-
- /*
- * Extend the relation by one page. Need to use RBM_ZERO_AND_LOCK or
- * we risk a race condition against btvacuumscan --- see comments
- * therein. This forces us to repeat the valgrind request that
- * _bt_lockbuf() otherwise would make, as we can't use _bt_lockbuf()
- * without introducing a race.
- */
- buf = ExtendBufferedRel(EB_REL(rel), MAIN_FORKNUM, NULL,
- EB_LOCK_FIRST);
- if (!RelationUsesLocalBuffers(rel))
- VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(buf), BLCKSZ);
-
- /* Initialize the new page before returning it */
- page = BufferGetPage(buf);
- Assert(PageIsNew(page));
- _bt_pageinit(page, BufferGetPageSize(buf));
}
- /* ref count and lock type are correct */
+ /*
+ * Extend the relation by one page. Need to use RBM_ZERO_AND_LOCK or we
+ * risk a race condition against btvacuumscan --- see comments therein.
+ * This forces us to repeat the valgrind request that _bt_lockbuf()
+ * otherwise would make, as we can't use _bt_lockbuf() without introducing
+ * a race.
+ */
+ buf = ExtendBufferedRel(EB_REL(rel), MAIN_FORKNUM, NULL, EB_LOCK_FIRST);
+ if (!RelationUsesLocalBuffers(rel))
+ VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(buf), BLCKSZ);
+
+ /* Initialize the new page before returning it */
+ page = BufferGetPage(buf);
+ Assert(PageIsNew(page));
+ _bt_pageinit(page, BufferGetPageSize(buf));
+
return buf;
}
/*
* _bt_relandgetbuf() -- release a locked buffer and get another one.
*
- * This is equivalent to _bt_relbuf followed by _bt_getbuf, with the
- * exception that blkno may not be P_NEW. Also, if obuf is InvalidBuffer
- * then it reduces to just _bt_getbuf; allowing this case simplifies some
- * callers.
+ * This is equivalent to _bt_relbuf followed by _bt_getbuf. Also, if obuf is
+ * InvalidBuffer then it reduces to just _bt_getbuf; allowing this case
+ * simplifies some callers.
*
* The original motivation for using this was to avoid two entries to the
* bufmgr when one would do. However, now it's mainly just a notational
@@ -1001,7 +1006,7 @@ _bt_relandgetbuf(Relation rel, Buffer obuf, BlockNumber blkno, int access)
{
Buffer buf;
- Assert(blkno != P_NEW);
+ Assert(BlockNumberIsValid(blkno));
if (BufferIsValid(obuf))
_bt_unlockbuf(rel, obuf);
buf = ReleaseAndReadBuffer(obuf, rel, blkno);
@@ -1272,14 +1277,14 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
* (a version that lacks the TIDs that are to be deleted).
*
* This is nearly the same as _bt_delitems_vacuum as far as what it does to
- * the page, but it needs its own snapshotConflictHorizon (caller gets this
- * from tableam). This is used by the REDO routine to generate recovery
+ * the page, but it needs its own snapshotConflictHorizon and isCatalogRel
+ * (from the tableam). This is used by the REDO routine to generate recovery
* conflicts. The other difference is that only _bt_delitems_vacuum will
* clear page's VACUUM cycle ID.
*/
static void
-_bt_delitems_delete(Relation rel, Relation heaprel, Buffer buf,
- TransactionId snapshotConflictHorizon,
+_bt_delitems_delete(Relation rel, Buffer buf,
+ TransactionId snapshotConflictHorizon, bool isCatalogRel,
OffsetNumber *deletable, int ndeletable,
BTVacuumPosting *updatable, int nupdatable)
{
@@ -1343,10 +1348,10 @@ _bt_delitems_delete(Relation rel, Relation heaprel, Buffer buf,
XLogRecPtr recptr;
xl_btree_delete xlrec_delete;
- xlrec_delete.isCatalogRel = RelationIsAccessibleInLogicalDecoding(heaprel);
xlrec_delete.snapshotConflictHorizon = snapshotConflictHorizon;
xlrec_delete.ndeleted = ndeletable;
xlrec_delete.nupdated = nupdatable;
+ xlrec_delete.isCatalogRel = isCatalogRel;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1517,6 +1522,7 @@ _bt_delitems_delete_check(Relation rel, Buffer buf, Relation heapRel,
{
Page page = BufferGetPage(buf);
TransactionId snapshotConflictHorizon;
+ bool isCatalogRel;
OffsetNumber postingidxoffnum = InvalidOffsetNumber;
int ndeletable = 0,
nupdatable = 0;
@@ -1525,6 +1531,7 @@ _bt_delitems_delete_check(Relation rel, Buffer buf, Relation heapRel,
/* Use tableam interface to determine which tuples to delete first */
snapshotConflictHorizon = table_index_delete_tuples(heapRel, delstate);
+ isCatalogRel = RelationIsAccessibleInLogicalDecoding(heapRel);
/* Should not WAL-log snapshotConflictHorizon unless it's required */
if (!XLogStandbyInfoActive())
@@ -1670,8 +1677,8 @@ _bt_delitems_delete_check(Relation rel, Buffer buf, Relation heapRel,
}
/* Physically delete tuples (or TIDs) using deletable (or updatable) */
- _bt_delitems_delete(rel, heapRel, buf, snapshotConflictHorizon, deletable,
- ndeletable, updatable, nupdatable);
+ _bt_delitems_delete(rel, buf, snapshotConflictHorizon, isCatalogRel,
+ deletable, ndeletable, updatable, nupdatable);
/* be tidy */
for (int i = 0; i < nupdatable; i++)
@@ -1692,8 +1699,7 @@ _bt_delitems_delete_check(Relation rel, Buffer buf, Relation heapRel,
* same level must always be locked left to right to avoid deadlocks.
*/
static bool
-_bt_leftsib_splitflag(Relation rel, Relation heaprel, BlockNumber leftsib,
- BlockNumber target)
+_bt_leftsib_splitflag(Relation rel, BlockNumber leftsib, BlockNumber target)
{
Buffer buf;
Page page;
@@ -1704,7 +1710,7 @@ _bt_leftsib_splitflag(Relation rel, Relation heaprel, BlockNumber leftsib,
if (leftsib == P_NONE)
return false;
- buf = _bt_getbuf(rel, heaprel, leftsib, BT_READ);
+ buf = _bt_getbuf(rel, leftsib, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
@@ -1750,7 +1756,7 @@ _bt_leftsib_splitflag(Relation rel, Relation heaprel, BlockNumber leftsib,
* to-be-deleted subtree.)
*/
static bool
-_bt_rightsib_halfdeadflag(Relation rel, Relation heaprel, BlockNumber leafrightsib)
+_bt_rightsib_halfdeadflag(Relation rel, BlockNumber leafrightsib)
{
Buffer buf;
Page page;
@@ -1759,7 +1765,7 @@ _bt_rightsib_halfdeadflag(Relation rel, Relation heaprel, BlockNumber leafrights
Assert(leafrightsib != P_NONE);
- buf = _bt_getbuf(rel, heaprel, leafrightsib, BT_READ);
+ buf = _bt_getbuf(rel, leafrightsib, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
@@ -1948,18 +1954,18 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
* marked with INCOMPLETE_SPLIT flag before proceeding
*/
Assert(leafblkno == scanblkno);
- if (_bt_leftsib_splitflag(rel, vstate->info->heaprel, leftsib, leafblkno))
+ if (_bt_leftsib_splitflag(rel, leftsib, leafblkno))
{
ReleaseBuffer(leafbuf);
return;
}
/* we need an insertion scan key for the search, so build one */
- itup_key = _bt_mkscankey(rel, vstate->info->heaprel, targetkey);
+ itup_key = _bt_mkscankey(rel, targetkey);
/* find the leftmost leaf page with matching pivot/high key */
itup_key->pivotsearch = true;
- stack = _bt_search(rel, vstate->info->heaprel, itup_key,
- &sleafbuf, BT_READ, NULL);
+ stack = _bt_search(rel, NULL, itup_key, &sleafbuf, BT_READ,
+ NULL);
/* won't need a second lock or pin on leafbuf */
_bt_relbuf(rel, sleafbuf);
@@ -1990,7 +1996,8 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
* leafbuf page half-dead.
*/
Assert(P_ISLEAF(opaque) && !P_IGNORE(opaque));
- if (!_bt_mark_page_halfdead(rel, vstate->info->heaprel, leafbuf, stack))
+ if (!_bt_mark_page_halfdead(rel, vstate->info->heaprel, leafbuf,
+ stack))
{
_bt_relbuf(rel, leafbuf);
return;
@@ -2053,7 +2060,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
if (!rightsib_empty)
break;
- leafbuf = _bt_getbuf(rel, vstate->info->heaprel, rightsib, BT_WRITE);
+ leafbuf = _bt_getbuf(rel, rightsib, BT_WRITE);
}
}
@@ -2066,13 +2073,17 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
* but may include additional internal pages (at most one per level of the
* tree below the root).
*
+ * Caller must pass a valid heapRel, since it's just about possible that our
+ * call to _bt_lock_subtree_parent will need to allocate a new index page to
+ * complete a page split. Every call to _bt_allocbuf needs to pass a heapRel.
+ *
* Returns 'false' if leafbuf is unsafe to delete, usually because leafbuf is
* the rightmost child of its parent (and parent has more than one downlink).
* Returns 'true' when the first stage of page deletion completed
* successfully.
*/
static bool
-_bt_mark_page_halfdead(Relation rel, Relation heaprel, Buffer leafbuf,
+_bt_mark_page_halfdead(Relation rel, Relation heapRel, Buffer leafbuf,
BTStack stack)
{
BlockNumber leafblkno;
@@ -2094,6 +2105,7 @@ _bt_mark_page_halfdead(Relation rel, Relation heaprel, Buffer leafbuf,
Assert(!P_RIGHTMOST(opaque) && !P_ISROOT(opaque) &&
P_ISLEAF(opaque) && !P_IGNORE(opaque) &&
P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page));
+ Assert(heapRel != NULL);
/*
* Save info about the leaf page.
@@ -2108,7 +2120,7 @@ _bt_mark_page_halfdead(Relation rel, Relation heaprel, Buffer leafbuf,
* delete the downlink. It would fail the "right sibling of target page
* is also the next child in parent page" cross-check below.
*/
- if (_bt_rightsib_halfdeadflag(rel, heaprel, leafrightsib))
+ if (_bt_rightsib_halfdeadflag(rel, leafrightsib))
{
elog(DEBUG1, "could not delete page %u because its right sibling %u is half-dead",
leafblkno, leafrightsib);
@@ -2132,7 +2144,7 @@ _bt_mark_page_halfdead(Relation rel, Relation heaprel, Buffer leafbuf,
*/
topparent = leafblkno;
topparentrightsib = leafrightsib;
- if (!_bt_lock_subtree_parent(rel, heaprel, leafblkno, stack,
+ if (!_bt_lock_subtree_parent(rel, heapRel, leafblkno, stack,
&subtreeparent, &poffset,
&topparent, &topparentrightsib))
return false;
@@ -2352,7 +2364,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
Assert(target != leafblkno);
/* Fetch the block number of the target's left sibling */
- buf = _bt_getbuf(rel, vstate->info->heaprel, target, BT_READ);
+ buf = _bt_getbuf(rel, target, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
leftsib = opaque->btpo_prev;
@@ -2379,7 +2391,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
_bt_lockbuf(rel, leafbuf, BT_WRITE);
if (leftsib != P_NONE)
{
- lbuf = _bt_getbuf(rel, vstate->info->heaprel, leftsib, BT_WRITE);
+ lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
page = BufferGetPage(lbuf);
opaque = BTPageGetOpaque(page);
while (P_ISDELETED(opaque) || opaque->btpo_next != target)
@@ -2427,7 +2439,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
CHECK_FOR_INTERRUPTS();
/* step right one page */
- lbuf = _bt_getbuf(rel, vstate->info->heaprel, leftsib, BT_WRITE);
+ lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
page = BufferGetPage(lbuf);
opaque = BTPageGetOpaque(page);
}
@@ -2491,7 +2503,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
* And next write-lock the (current) right sibling.
*/
rightsib = opaque->btpo_next;
- rbuf = _bt_getbuf(rel, vstate->info->heaprel, rightsib, BT_WRITE);
+ rbuf = _bt_getbuf(rel, rightsib, BT_WRITE);
page = BufferGetPage(rbuf);
opaque = BTPageGetOpaque(page);
@@ -2538,7 +2550,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
* of doing so are slim, and the locking considerations daunting.)
*
* We can safely acquire a lock on the metapage here --- see comments for
- * _bt_newroot().
+ * _bt_newlevel().
*/
if (leftsib == P_NONE && rightsib_is_rightmost)
{
@@ -2547,8 +2559,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
if (P_RIGHTMOST(opaque))
{
/* rightsib will be the only one left on the level */
- metabuf = _bt_getbuf(rel, vstate->info->heaprel, BTREE_METAPAGE,
- BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -2789,7 +2800,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
* parent block in the leafbuf page using BTreeTupleSetTopParent()).
*/
static bool
-_bt_lock_subtree_parent(Relation rel, Relation heaprel, BlockNumber child,
+_bt_lock_subtree_parent(Relation rel, Relation heapRel, BlockNumber child,
BTStack stack, Buffer *subtreeparent,
OffsetNumber *poffset, BlockNumber *topparent,
BlockNumber *topparentrightsib)
@@ -2806,7 +2817,7 @@ _bt_lock_subtree_parent(Relation rel, Relation heaprel, BlockNumber child,
* Locate the pivot tuple whose downlink points to "child". Write lock
* the parent page itself.
*/
- pbuf = _bt_getstackbuf(rel, heaprel, stack, child);
+ pbuf = _bt_getstackbuf(rel, heapRel, stack, child);
if (pbuf == InvalidBuffer)
{
/*
@@ -2906,11 +2917,11 @@ _bt_lock_subtree_parent(Relation rel, Relation heaprel, BlockNumber child,
*
* Note: We deliberately avoid completing incomplete splits here.
*/
- if (_bt_leftsib_splitflag(rel, heaprel, leftsibparent, parent))
+ if (_bt_leftsib_splitflag(rel, leftsibparent, parent))
return false;
/* Recurse to examine child page's grandparent page */
- return _bt_lock_subtree_parent(rel, heaprel, parent, stack->bts_parent,
+ return _bt_lock_subtree_parent(rel, heapRel, parent, stack->bts_parent,
subtreeparent, poffset,
topparent, topparentrightsib);
}
@@ -2976,6 +2987,7 @@ _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate)
Relation heaprel = vstate->info->heaprel;
Assert(stats->pages_newly_deleted >= vstate->npendingpages);
+ Assert(heaprel != NULL);
if (vstate->npendingpages == 0)
{
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1ce5b1519..4553aaee5 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -835,7 +835,7 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
if (stats == NULL)
{
/* Check if VACUUM operation can entirely avoid btvacuumscan() call */
- if (!_bt_vacuum_needs_cleanup(info->index, info->heaprel))
+ if (!_bt_vacuum_needs_cleanup(info->index))
return NULL;
/*
@@ -871,7 +871,7 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
*/
Assert(stats->pages_deleted >= stats->pages_free);
num_delpages = stats->pages_deleted - stats->pages_free;
- _bt_set_cleanup_info(info->index, info->heaprel, num_delpages);
+ _bt_set_cleanup_info(info->index, num_delpages);
/*
* It's quite possible for us to be fooled by concurrent page splits into
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 263f75fce..0473dbd6e 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -42,8 +42,7 @@ static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
ScanDirection dir);
-static Buffer _bt_walk_left(Relation rel, Relation heaprel, Buffer buf,
- Snapshot snapshot);
+static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
@@ -92,16 +91,25 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
* When access = BT_READ, an empty index will result in *bufP being set to
* InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
* during the search will be finished.
+ *
+ * heapRel must be provided by callers that pass access = BT_WRITE, since we
+ * might need to allocate a new root page for caller -- see _bt_allocbuf.
+ * Similarly, access = BT_WRITE calls might need to split an internal page as
+ * a follow-on step when dealing with an incomplete split in its child page.
*/
BTStack
-_bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
+_bt_search(Relation rel, Relation heapRel, BTScanInsert key, Buffer *bufP,
int access, Snapshot snapshot)
{
BTStack stack_in = NULL;
int page_access = BT_READ;
+ /* heapRel must be set whenever _bt_allocbuf is reachable */
+ Assert(access == BT_READ || access == BT_WRITE);
+ Assert(access == BT_READ || heapRel != NULL);
+
/* Get the root page to start with */
- *bufP = _bt_getroot(rel, heaprel, access);
+ *bufP = _bt_getroot(rel, heapRel, access);
/* If index is empty and access = BT_READ, no root page is created. */
if (!BufferIsValid(*bufP))
@@ -130,7 +138,7 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
* also taken care of in _bt_getstackbuf). But this is a good
* opportunity to finish splits of internal pages too.
*/
- *bufP = _bt_moveright(rel, heaprel, key, *bufP, (access == BT_WRITE),
+ *bufP = _bt_moveright(rel, heapRel, key, *bufP, (access == BT_WRITE),
stack_in, page_access, snapshot);
/* if this is a leaf page, we're done */
@@ -191,7 +199,7 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
* but before we acquired a write lock. If it has, we may need to
* move right to its new sibling. Do that.
*/
- *bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE,
+ *bufP = _bt_moveright(rel, heapRel, key, *bufP, true, stack_in, BT_WRITE,
snapshot);
}
@@ -222,8 +230,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
*
* If forupdate is true, we will attempt to finish any incomplete splits
* that we encounter. This is required when locking a target page for an
- * insertion, because we don't allow inserting on a page before the split
- * is completed. 'stack' is only used if forupdate is true.
+ * insertion, because we don't allow inserting on a page before the split is
+ * completed. 'heapRel' and 'stack' are only used if forupdate is true.
*
* On entry, we have the buffer pinned and a lock of the type specified by
* 'access'. If we move right, we release the buffer and lock and acquire
@@ -235,7 +243,7 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
*/
Buffer
_bt_moveright(Relation rel,
- Relation heaprel,
+ Relation heapRel,
BTScanInsert key,
Buffer buf,
bool forupdate,
@@ -247,6 +255,8 @@ _bt_moveright(Relation rel,
BTPageOpaque opaque;
int32 cmpval;
+ Assert(!forupdate || heapRel != NULL);
+
/*
* When nextkey = false (normal case): if the scan key that brought us to
* this page is > the high key stored on the page, then the page has split
@@ -290,12 +300,12 @@ _bt_moveright(Relation rel,
}
if (P_INCOMPLETE_SPLIT(opaque))
- _bt_finish_split(rel, heaprel, buf, stack);
+ _bt_finish_split(rel, heapRel, buf, stack);
else
_bt_relbuf(rel, buf);
/* re-acquire the lock in the right mode, and re-check */
- buf = _bt_getbuf(rel, heaprel, blkno, access);
+ buf = _bt_getbuf(rel, blkno, access);
continue;
}
@@ -862,7 +872,6 @@ bool
_bt_first(IndexScanDesc scan, ScanDirection dir)
{
Relation rel = scan->indexRelation;
- Relation heaprel = scan->heapRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Buffer buf;
BTStack stack;
@@ -1355,7 +1364,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
}
/* Initialize remaining insertion scan key fields */
- _bt_metaversion(rel, heaprel, &inskey.heapkeyspace, &inskey.allequalimage);
+ _bt_metaversion(rel, &inskey.heapkeyspace, &inskey.allequalimage);
inskey.anynullkeys = false; /* unused */
inskey.nextkey = nextkey;
inskey.pivotsearch = false;
@@ -1366,7 +1375,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* Use the manufactured insertion scan key to descend the tree and
* position ourselves on the target leaf page.
*/
- stack = _bt_search(rel, heaprel, &inskey, &buf, BT_READ, scan->xs_snapshot);
+ stack = _bt_search(rel, NULL, &inskey, &buf, BT_READ, scan->xs_snapshot);
/* don't need to keep the stack around... */
_bt_freestack(stack);
@@ -2007,7 +2016,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
/* step right one page */
- so->currPos.buf = _bt_getbuf(rel, scan->heapRelation, blkno, BT_READ);
+ so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
page = BufferGetPage(so->currPos.buf);
TestForOldSnapshot(scan->xs_snapshot, rel, page);
opaque = BTPageGetOpaque(page);
@@ -2081,8 +2090,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
if (BTScanPosIsPinned(so->currPos))
_bt_lockbuf(rel, so->currPos.buf, BT_READ);
else
- so->currPos.buf = _bt_getbuf(rel, scan->heapRelation,
- so->currPos.currPage, BT_READ);
+ so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
for (;;)
{
@@ -2096,8 +2104,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
}
/* Step to next physical page */
- so->currPos.buf = _bt_walk_left(rel, scan->heapRelation,
- so->currPos.buf, scan->xs_snapshot);
+ so->currPos.buf = _bt_walk_left(rel, so->currPos.buf,
+ scan->xs_snapshot);
/* if we're physically at end of index, return failure */
if (so->currPos.buf == InvalidBuffer)
@@ -2144,8 +2152,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
BTScanPosInvalidate(so->currPos);
return false;
}
- so->currPos.buf = _bt_getbuf(rel, scan->heapRelation, blkno,
- BT_READ);
+ so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
}
}
}
@@ -2190,7 +2197,7 @@ _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
* again if it's important.
*/
static Buffer
-_bt_walk_left(Relation rel, Relation heaprel, Buffer buf, Snapshot snapshot)
+_bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot)
{
Page page;
BTPageOpaque opaque;
@@ -2218,7 +2225,7 @@ _bt_walk_left(Relation rel, Relation heaprel, Buffer buf, Snapshot snapshot)
_bt_relbuf(rel, buf);
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
- buf = _bt_getbuf(rel, heaprel, blkno, BT_READ);
+ buf = _bt_getbuf(rel, blkno, BT_READ);
page = BufferGetPage(buf);
TestForOldSnapshot(snapshot, rel, page);
opaque = BTPageGetOpaque(page);
@@ -2309,7 +2316,7 @@ _bt_walk_left(Relation rel, Relation heaprel, Buffer buf, Snapshot snapshot)
* The returned buffer is pinned and read-locked.
*/
Buffer
-_bt_get_endpoint(Relation rel, Relation heaprel, uint32 level, bool rightmost,
+_bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot)
{
Buffer buf;
@@ -2325,9 +2332,9 @@ _bt_get_endpoint(Relation rel, Relation heaprel, uint32 level, bool rightmost,
* smarter about intermediate levels.)
*/
if (level == 0)
- buf = _bt_getroot(rel, heaprel, BT_READ);
+ buf = _bt_getroot(rel, NULL, BT_READ);
else
- buf = _bt_gettrueroot(rel, heaprel);
+ buf = _bt_gettrueroot(rel);
if (!BufferIsValid(buf))
return InvalidBuffer;
@@ -2408,8 +2415,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
* version of _bt_search(). We don't maintain a stack since we know we
* won't need it.
*/
- buf = _bt_get_endpoint(rel, scan->heapRelation, 0,
- ScanDirectionIsBackward(dir), scan->xs_snapshot);
+ buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(dir), scan->xs_snapshot);
if (!BufferIsValid(buf))
{
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 6ad3f3c54..c2665fce4 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -566,7 +566,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
wstate.heap = btspool->heap;
wstate.index = btspool->index;
- wstate.inskey = _bt_mkscankey(wstate.index, btspool->heap, NULL);
+ wstate.inskey = _bt_mkscankey(wstate.index, NULL);
/* _bt_mkscankey() won't set allequalimage without metapage */
wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 05abf3603..7da499c4d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -87,7 +87,7 @@ static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
* field themselves.
*/
BTScanInsert
-_bt_mkscankey(Relation rel, Relation heaprel, IndexTuple itup)
+_bt_mkscankey(Relation rel, IndexTuple itup)
{
BTScanInsert key;
ScanKey skey;
@@ -112,7 +112,7 @@ _bt_mkscankey(Relation rel, Relation heaprel, IndexTuple itup)
key = palloc(offsetof(BTScanInsertData, scankeys) +
sizeof(ScanKeyData) * indnkeyatts);
if (itup)
- _bt_metaversion(rel, heaprel, &key->heapkeyspace, &key->allequalimage);
+ _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
else
{
/* Utility statement callers can set these fields themselves */
@@ -1761,8 +1761,7 @@ _bt_killitems(IndexScanDesc scan)
droppedpin = true;
/* Attempt to re-read the buffer, getting pin and lock. */
- buf = _bt_getbuf(scan->indexRelation, scan->heapRelation,
- so->currPos.currPage, BT_READ);
+ buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage, BT_READ);
page = BufferGetPage(buf);
if (BufferGetLSNAtomic(buf) == so->currPos.lsn)
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 65adf04c4..39932d3c2 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -462,7 +462,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
* For btrees, get tree height while we have the index
* open
*/
- info->tree_height = _bt_getrootheight(indexRelation, relation);
+ info->tree_height = _bt_getrootheight(indexRelation);
}
else
{
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 018810692..eb6cfcfd0 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -207,7 +207,6 @@ tuplesort_begin_heap(TupleDesc tupDesc,
Tuplesortstate *
tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
- Relation heaprel,
int workMem,
SortCoordinate coordinate, int sortopt)
{
@@ -261,7 +260,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
arg->tupDesc = tupDesc; /* assume we need not copy tupDesc */
- indexScanKey = _bt_mkscankey(indexRel, heaprel, NULL);
+ indexScanKey = _bt_mkscankey(indexRel, NULL);
if (arg->indexInfo->ii_Expressions != NULL)
{
@@ -362,7 +361,7 @@ tuplesort_begin_index_btree(Relation heapRel,
arg->enforceUnique = enforceUnique;
arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
- indexScanKey = _bt_mkscankey(indexRel, heapRel, NULL);
+ indexScanKey = _bt_mkscankey(indexRel, NULL);
/* Prepare SortSupport data for each column */
base->sortKeys = (SortSupport) palloc0(base->nKeys *
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 6979aff72..94a975932 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -183,7 +183,6 @@ static inline bool invariant_l_nontarget_offset(BtreeCheckState *state,
OffsetNumber upperbound);
static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
static inline BTScanInsert bt_mkscankey_pivotsearch(Relation rel,
- Relation heaprel,
IndexTuple itup);
static ItemId PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block,
Page page, OffsetNumber offset);
@@ -332,7 +331,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
RelationGetRelationName(indrel))));
/* Extract metadata from metapage, and sanitize it in passing */
- _bt_metaversion(indrel, heaprel, &heapkeyspace, &allequalimage);
+ _bt_metaversion(indrel, &heapkeyspace, &allequalimage);
if (allequalimage && !heapkeyspace)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1259,7 +1258,7 @@ bt_target_page_check(BtreeCheckState *state)
}
/* Build insertion scankey for current page offset */
- skey = bt_mkscankey_pivotsearch(state->rel, state->heaprel, itup);
+ skey = bt_mkscankey_pivotsearch(state->rel, itup);
/*
* Make sure tuple size does not exceed the relevant BTREE_VERSION
@@ -1769,7 +1768,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
* memory remaining allocated.
*/
firstitup = (IndexTuple) PageGetItem(rightpage, rightitem);
- return bt_mkscankey_pivotsearch(state->rel, state->heaprel, firstitup);
+ return bt_mkscankey_pivotsearch(state->rel, firstitup);
}
/*
@@ -2682,7 +2681,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
Buffer lbuf;
bool exists;
- key = _bt_mkscankey(state->rel, state->heaprel, itup);
+ key = _bt_mkscankey(state->rel, itup);
Assert(key->heapkeyspace && key->scantid != NULL);
/*
@@ -2695,7 +2694,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
*/
Assert(state->readonly && state->rootdescend);
exists = false;
- stack = _bt_search(state->rel, state->heaprel, key, &lbuf, BT_READ, NULL);
+ stack = _bt_search(state->rel, NULL, key, &lbuf, BT_READ, NULL);
if (BufferIsValid(lbuf))
{
@@ -3134,11 +3133,11 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
* the scankey is greater.
*/
static inline BTScanInsert
-bt_mkscankey_pivotsearch(Relation rel, Relation heaprel, IndexTuple itup)
+bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
{
BTScanInsert skey;
- skey = _bt_mkscankey(rel, heaprel, itup);
+ skey = _bt_mkscankey(rel, itup);
skey->pivotsearch = true;
return skey;
--
2.40.1
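To make the new division of labor concrete, here is a minimal illustrative sketch (not part of the patch itself; the helper name is made up, and all of the real split logic is elided). It relies only on the signatures shown in the diff above: _bt_getbuf(rel, blkno, access) for existing pages, and _bt_allocbuf(rel, heaprel) for new page allocation.

#include "postgres.h"

#include "access/nbtree.h"

/*
 * Illustrative only: reading an existing page no longer requires a heap
 * relation, while allocating a brand-new page (as during a page split or
 * new root creation) does, because _bt_allocbuf() may need to emit an
 * XLOG_BTREE_REUSE_PAGE record whose isCatalogRel field is derived from
 * the heap relation.
 */
static void
sketch_page_access(Relation rel, Relation heaprel, BlockNumber blkno)
{
	Buffer		existingbuf;
	Buffer		newbuf;

	/* Existing block: lock it for writing, no heaprel argument needed */
	existingbuf = _bt_getbuf(rel, blkno, BT_WRITE);

	/* New block: only split/new-root paths reach here, heaprel required */
	newbuf = _bt_allocbuf(rel, heaprel);

	/* ... a real split would move tuples and WAL-log the change here ... */

	_bt_relbuf(rel, newbuf);
	_bt_relbuf(rel, existingbuf);
}

Every other _bt_getbuf caller (all BT_READ callers, plus BT_WRITE callers that cannot reach a page split) simply drops the heap relation argument.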
On 2023-Jun-05, Peter Geoghegan wrote:
My current plan is to commit this later in the week, unless there are
further objections. Wednesday or possibly Thursday.
I've added this as an open item for 16, with Peter and Andres as owners.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Las cosas son buenas o malas segun las hace nuestra opinión" (Lisias)
Hi,
On 2023-06-05 12:04:29 -0700, Peter Geoghegan wrote:
On Sun, May 28, 2023 at 7:49 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
IMO this makes sense for v16. These new arguments were introduced in
v16, so if we have second thoughts, now is the right time to change
them, before v16 is released. It will reduce the backpatching effort in
the future; if we apply this in v17, then v16 will be the odd one out.
My current plan is to commit this later in the week, unless there are
further objections. Wednesday or possibly Thursday.
-1. For me separating the P_NEW path makes a lot of sense, but isn't 16
material. I don't agree that it's a problem to have heaprel as a parameter in
a bunch of places that don't strictly need it today.
I don't really understand why the patch does s/heaprel/heapRel/. Most of these
functions aren't actually using camelcase parameters? This part of the change
just blows up the size, making it harder to review.
12 files changed, 317 insertions(+), 297 deletions(-)
...
Greetings,
Andres Freund
On Wed, Jun 7, 2023 at 5:12 PM Andres Freund <andres@anarazel.de> wrote:
-1. For me separating the P_NEW path makes a lot of sense, but isn't 16
material. I don't agree that it's a problem to have heaprel as a parameter in
a bunch of places that don't strictly need it today.
As I've said, this is primarily about keeping all of the branches
consistent. I agree that there is no particular known consequence to
not doing this, and have said as much several times.
I don't really understand why the patch does s/heaprel/heapRel/.
That has been the style used within nbtree for many years now.
Most of these
functions aren't actually using camelcase parameters? This part of the change
just blows up the size, making it harder to review.
I wonder what made it impossibly hard to review the first time around.
The nbtree aspects of this work were pretty much written on
auto-pilot. I had no intention of making a fuss about it, but then I
never expected this push back.
--
Peter Geoghegan
On 2023-Jun-07, Peter Geoghegan wrote:
On Wed, Jun 7, 2023 at 5:12 PM Andres Freund <andres@anarazel.de> wrote:
I don't really understand why the patch does s/heaprel/heapRel/.
That has been the style used within nbtree for many years now.
IMO this kind of change definitely does not have a place in a post-beta1
restructuring patch. We rarely indulge in case-fixing exercises at any
other time, and I don't see any good argument why post-beta1 is a better
time for it. I suggest that you should strive to keep the patch as
small as possible.
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Cómo ponemos nuestros dedos en la arcilla del otro. Eso es la amistad; jugar
al alfarero y ver qué formas se pueden sacar del otro" (C. Halloway en
La Feria de las Tinieblas, R. Bradbury)
On Thu, Jun 8, 2023 at 7:22 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
IMO this kind of change definitely does not have place in a post-beta1
restructuring patch. We rarely indulge in case-fixing exercises at any
other time, and I don't see any good argument why post-beta1 is a better
time for it.
There is a glaring inconsistency. Now about half of the relevant
functions in nbtree.h use "heaprel", while the other half use
"heapRel". Obviously that's not the end of the world, but it's
annoying. It's exactly the kind of case-fixing exercise that does tend
to happen.
I'm not going to argue this point any further, though. I will make
this change at a later date. That will introduce an inconsistency
between branches, of course, but apparently there isn't any
alternative.
I suggest that you should strive to keep the patch as
small as possible.
Attached is v4, which goes back to using "heaprel" in new-to-16 code.
As a result, it is slightly smaller than v3.
My new plan is to commit this tomorrow, since the clear consensus is
that we should go ahead with this for 16.
--
Peter Geoghegan
Attachments:
v4-0001-nbtree-Allocate-new-pages-in-separate-function.patch (application/octet-stream)
From 7d15245de8f0c3b3079f9384c8098c5ae8b6c77d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 23 May 2023 22:08:56 -0700
Subject: [PATCH v4] nbtree: Allocate new pages in separate function.
Split nbtree's _bt_getbuf function in two: code that read locks or write
locks existing pages is left in _bt_getbuf, while code that deals with
allocating new pages is moved to a new, dedicated function:
_bt_allocbuf.
This division of labor cleanly separates code that might need to
allocate a new page (usually as part of a page split) from all other
nbtree code. This is clearer overall. Read-only callers to functions
like _bt_search opt out of finishing off any incomplete page splits that
are encountered in passing. Other callers in write paths (that can
never run during Hot Standby) will finish off incomplete page splits,
even in pages that they wouldn't usually have to write lock.
The immediate reason to do this refactoring now is to revert certain
mechanical changes that were required so that _bt_getbuf callers would
provide a valid heapRel in all cases. Commit 61b313e4, which prepared
nbtree for logical decoding on standbys, did this for all _bt_getbuf
callers, even though it was only necessary for P_NEW _bt_getbuf callers.
We could keep the nbtree P_NEW idiom while just passing NULL as our
heapRel instead, but completely replacing it with _bt_allocbuf turned
out to be significantly cleaner. This approach actually makes things
_more_ consistent between HEAD and back branches that still use P_NEW.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/CAH2-Wz=8Z9qY58bjm_7TAHgtW6RzZ5Ke62q5emdCEy9BAzwhmg@mail.gmail.com
---
src/include/access/nbtree.h | 26 +-
src/include/utils/tuplesort.h | 4 +-
src/backend/access/heap/heapam_handler.c | 9 +-
src/backend/access/nbtree/nbtinsert.c | 48 +--
src/backend/access/nbtree/nbtpage.c | 360 +++++++++++----------
src/backend/access/nbtree/nbtree.c | 4 +-
src/backend/access/nbtree/nbtsearch.c | 50 +--
src/backend/access/nbtree/nbtsort.c | 2 +-
src/backend/access/nbtree/nbtutils.c | 7 +-
src/backend/optimizer/util/plancat.c | 2 +-
src/backend/utils/sort/tuplesortvariants.c | 5 +-
contrib/amcheck/verify_nbtree.c | 15 +-
12 files changed, 275 insertions(+), 257 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d68478609..8891fa797 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -293,11 +293,14 @@ BTPageIsRecyclable(Page page, Relation heaprel)
BTPageOpaque opaque;
Assert(!PageIsNew(page));
+ Assert(heaprel != NULL);
/* Recycling okay iff page is deleted and safexid is old enough */
opaque = BTPageGetOpaque(page);
if (P_ISDELETED(opaque))
{
+ FullTransactionId safexid = BTPageGetDeleteXid(page);
+
/*
* The page was deleted, but when? If it was just deleted, a scan
* might have seen the downlink to it, and will read the page later.
@@ -308,7 +311,7 @@ BTPageIsRecyclable(Page page, Relation heaprel)
* anyone. If not, then no scan that's still in progress could have
* seen its downlink, and we can recycle it.
*/
- return GlobalVisCheckRemovableFullXid(heaprel, BTPageGetDeleteXid(page));
+ return GlobalVisCheckRemovableFullXid(heaprel, safexid);
}
return false;
@@ -1194,18 +1197,17 @@ extern OffsetNumber _bt_findsplitloc(Relation rel, Page origpage,
*/
extern void _bt_initmetapage(Page page, BlockNumber rootbknum, uint32 level,
bool allequalimage);
-extern bool _bt_vacuum_needs_cleanup(Relation rel, Relation heaprel);
-extern void _bt_set_cleanup_info(Relation rel, Relation heaprel,
- BlockNumber num_delpages);
+extern bool _bt_vacuum_needs_cleanup(Relation rel);
+extern void _bt_set_cleanup_info(Relation rel, BlockNumber num_delpages);
extern void _bt_upgrademetapage(Page page);
extern Buffer _bt_getroot(Relation rel, Relation heaprel, int access);
-extern Buffer _bt_gettrueroot(Relation rel, Relation heaprel);
-extern int _bt_getrootheight(Relation rel, Relation heaprel);
-extern void _bt_metaversion(Relation rel, Relation heaprel, bool *heapkeyspace,
+extern Buffer _bt_gettrueroot(Relation rel);
+extern int _bt_getrootheight(Relation rel);
+extern void _bt_metaversion(Relation rel, bool *heapkeyspace,
bool *allequalimage);
extern void _bt_checkpage(Relation rel, Buffer buf);
-extern Buffer _bt_getbuf(Relation rel, Relation heaprel, BlockNumber blkno,
- int access);
+extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
+extern Buffer _bt_allocbuf(Relation rel, Relation heaprel);
extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
BlockNumber blkno, int access);
extern void _bt_relbuf(Relation rel, Buffer buf);
@@ -1237,13 +1239,13 @@ extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
-extern Buffer _bt_get_endpoint(Relation rel, Relation heaprel, uint32 level,
- bool rightmost, Snapshot snapshot);
+extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
+ Snapshot snapshot);
/*
* prototypes for functions in nbtutils.c
*/
-extern BTScanInsert _bt_mkscankey(Relation rel, Relation heaprel, IndexTuple itup);
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
extern void _bt_freestack(BTStack stack);
extern void _bt_preprocess_array_keys(IndexScanDesc scan);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 656a2e989..af057b635 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -399,9 +399,7 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
int workMem, SortCoordinate coordinate,
int sortopt);
extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
- Relation indexRel,
- Relation heaprel,
- int workMem,
+ Relation indexRel, int workMem,
SortCoordinate coordinate,
int sortopt);
extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 646135cc2..0755be839 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -731,14 +731,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
*multi_cutoff);
- /*
- * Set up sorting if wanted. NewHeap is being passed to
- * tuplesort_begin_cluster(), it could have been OldHeap too. It does not
- * really matter, as the goal is to have a heap relation being passed to
- * _bt_log_reuse_page() (which should not be called from this code path).
- */
+ /* Set up sorting if wanted */
if (use_sort)
- tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex, NewHeap,
+ tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex,
maintenance_work_mem,
NULL, TUPLESORT_NONE);
else
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 84bbd78d5..bbebeef4a 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -59,7 +59,7 @@ static Buffer _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key,
IndexTuple nposting, uint16 postingoff);
static void _bt_insert_parent(Relation rel, Relation heaprel, Buffer buf,
Buffer rbuf, BTStack stack, bool isroot, bool isonly);
-static Buffer _bt_newroot(Relation rel, Relation heaprel, Buffer lbuf, Buffer rbuf);
+static Buffer _bt_newlevel(Relation rel, Relation heaprel, Buffer lbuf, Buffer rbuf);
static inline bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
OffsetNumber itup_off, bool newfirstdataitem);
static void _bt_delete_or_dedup_one_page(Relation rel, Relation heapRel,
@@ -110,7 +110,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
/* we need an insertion scan key to do our search, so build one */
- itup_key = _bt_mkscankey(rel, heapRel, itup);
+ itup_key = _bt_mkscankey(rel, itup);
if (checkingunique)
{
@@ -1024,13 +1024,15 @@ _bt_findinsertloc(Relation rel,
* indexes.
*/
static void
-_bt_stepright(Relation rel, Relation heaprel, BTInsertState insertstate, BTStack stack)
+_bt_stepright(Relation rel, Relation heaprel, BTInsertState insertstate,
+ BTStack stack)
{
Page page;
BTPageOpaque opaque;
Buffer rbuf;
BlockNumber rblkno;
+ Assert(heaprel != NULL);
page = BufferGetPage(insertstate->buf);
opaque = BTPageGetOpaque(page);
@@ -1145,7 +1147,7 @@ _bt_insertonpg(Relation rel,
/*
* Every internal page should have exactly one negative infinity item at
- * all times. Only _bt_split() and _bt_newroot() should add items that
+ * all times. Only _bt_split() and _bt_newlevel() should add items that
* become negative infinity items through truncation, since they're the
* only routines that allocate new internal pages.
*/
@@ -1250,14 +1252,14 @@ _bt_insertonpg(Relation rel,
* only one on its tree level, but was not the root, it may have been
* the "fast root". We need to ensure that the fast root link points
* at or above the current page. We can safely acquire a lock on the
- * metapage here --- see comments for _bt_newroot().
+ * metapage here --- see comments for _bt_newlevel().
*/
if (unlikely(split_only_page))
{
Assert(!isleaf);
Assert(BufferIsValid(cbuf));
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -1421,7 +1423,7 @@ _bt_insertonpg(Relation rel,
* call _bt_getrootheight while holding a buffer lock.
*/
if (BlockNumberIsValid(blockcache) &&
- _bt_getrootheight(rel, heaprel) >= BTREE_FASTPATH_MIN_LEVEL)
+ _bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
RelationSetTargetBlock(rel, blockcache);
}
@@ -1715,7 +1717,7 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
* way because it avoids an unnecessary PANIC when either origpage or its
* existing sibling page are corrupt.
*/
- rbuf = _bt_getbuf(rel, heaprel, P_NEW, BT_WRITE);
+ rbuf = _bt_allocbuf(rel, heaprel);
rightpage = BufferGetPage(rbuf);
rightpagenumber = BufferGetBlockNumber(rbuf);
/* rightpage was initialized by _bt_getbuf */
@@ -1888,7 +1890,7 @@ _bt_split(Relation rel, Relation heaprel, BTScanInsert itup_key, Buffer buf,
*/
if (!isrightmost)
{
- sbuf = _bt_getbuf(rel, heaprel, oopaque->btpo_next, BT_WRITE);
+ sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE);
spage = BufferGetPage(sbuf);
sopaque = BTPageGetOpaque(spage);
if (sopaque->btpo_prev != origpagenumber)
@@ -2102,6 +2104,8 @@ _bt_insert_parent(Relation rel,
bool isroot,
bool isonly)
{
+ Assert(heaprel != NULL);
+
/*
* Here we have to do something Lehman and Yao don't talk about: deal with
* a root split and construction of a new root. If our stack is empty
@@ -2121,8 +2125,8 @@ _bt_insert_parent(Relation rel,
Assert(stack == NULL);
Assert(isonly);
- /* create a new root node and update the metapage */
- rootbuf = _bt_newroot(rel, heaprel, buf, rbuf);
+ /* create a new root node one level up and update the metapage */
+ rootbuf = _bt_newlevel(rel, heaprel, buf, rbuf);
/* release the split buffers */
_bt_relbuf(rel, rootbuf);
_bt_relbuf(rel, rbuf);
@@ -2161,8 +2165,7 @@ _bt_insert_parent(Relation rel,
BlockNumberIsValid(RelationGetTargetBlock(rel))));
/* Find the leftmost page at the next level up */
- pbuf = _bt_get_endpoint(rel, heaprel, opaque->btpo_level + 1, false,
- NULL);
+ pbuf = _bt_get_endpoint(rel, opaque->btpo_level + 1, false, NULL);
/* Set up a phony stack entry pointing there */
stack = &fakestack;
stack->bts_blkno = BufferGetBlockNumber(pbuf);
@@ -2230,6 +2233,9 @@ _bt_insert_parent(Relation rel,
*
* On entry, 'lbuf' must be locked in write-mode. On exit, it is unlocked
* and unpinned.
+ *
+ * Caller must provide a valid heaprel, since finishing a page split will
+ * require allocating a new page when the parent page splits in turn.
*/
void
_bt_finish_split(Relation rel, Relation heaprel, Buffer lbuf, BTStack stack)
@@ -2243,9 +2249,10 @@ _bt_finish_split(Relation rel, Relation heaprel, Buffer lbuf, BTStack stack)
bool wasonly;
Assert(P_INCOMPLETE_SPLIT(lpageop));
+ Assert(heaprel != NULL);
/* Lock right sibling, the one missing the downlink */
- rbuf = _bt_getbuf(rel, heaprel, lpageop->btpo_next, BT_WRITE);
+ rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE);
rpage = BufferGetPage(rbuf);
rpageop = BTPageGetOpaque(rpage);
@@ -2257,7 +2264,7 @@ _bt_finish_split(Relation rel, Relation heaprel, Buffer lbuf, BTStack stack)
BTMetaPageData *metad;
/* acquire lock on the metapage */
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -2323,10 +2330,11 @@ _bt_getstackbuf(Relation rel, Relation heaprel, BTStack stack, BlockNumber child
Page page;
BTPageOpaque opaque;
- buf = _bt_getbuf(rel, heaprel, blkno, BT_WRITE);
+ buf = _bt_getbuf(rel, blkno, BT_WRITE);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
+ Assert(heaprel != NULL);
if (P_INCOMPLETE_SPLIT(opaque))
{
_bt_finish_split(rel, heaprel, buf, stack->bts_parent);
@@ -2415,7 +2423,7 @@ _bt_getstackbuf(Relation rel, Relation heaprel, BTStack stack, BlockNumber child
}
/*
- * _bt_newroot() -- Create a new root page for the index.
+ * _bt_newlevel() -- Create a new level above former root page.
*
* We've just split the old root page and need to create a new one.
* In order to do this, we add a new root page to the file, then lock
@@ -2433,7 +2441,7 @@ _bt_getstackbuf(Relation rel, Relation heaprel, BTStack stack, BlockNumber child
* lbuf, rbuf & rootbuf.
*/
static Buffer
-_bt_newroot(Relation rel, Relation heaprel, Buffer lbuf, Buffer rbuf)
+_bt_newlevel(Relation rel, Relation heaprel, Buffer lbuf, Buffer rbuf)
{
Buffer rootbuf;
Page lpage,
@@ -2459,12 +2467,12 @@ _bt_newroot(Relation rel, Relation heaprel, Buffer lbuf, Buffer rbuf)
lopaque = BTPageGetOpaque(lpage);
/* get a new root page */
- rootbuf = _bt_getbuf(rel, heaprel, P_NEW, BT_WRITE);
+ rootbuf = _bt_allocbuf(rel, heaprel);
rootpage = BufferGetPage(rootbuf);
rootblknum = BufferGetBlockNumber(rootbuf);
/* acquire lock on the metapage */
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 486d1563d..acd11197a 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -38,10 +38,9 @@
#include "utils/snapmgr.h"
static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
-static void _bt_log_reuse_page(Relation rel, Relation heaprel, BlockNumber blkno,
- FullTransactionId safexid);
-static void _bt_delitems_delete(Relation rel, Relation heaprel, Buffer buf,
+static void _bt_delitems_delete(Relation rel, Buffer buf,
TransactionId snapshotConflictHorizon,
+ bool isCatalogRel,
OffsetNumber *deletable, int ndeletable,
BTVacuumPosting *updatable, int nupdatable);
static char *_bt_delitems_update(BTVacuumPosting *updatable, int nupdatable,
@@ -177,7 +176,7 @@ _bt_getmeta(Relation rel, Buffer metabuf)
* index tuples needed to be deleted.
*/
bool
-_bt_vacuum_needs_cleanup(Relation rel, Relation heaprel)
+_bt_vacuum_needs_cleanup(Relation rel)
{
Buffer metabuf;
Page metapg;
@@ -190,7 +189,7 @@ _bt_vacuum_needs_cleanup(Relation rel, Relation heaprel)
*
* Note that we deliberately avoid using cached version of metapage here.
*/
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
btm_version = metad->btm_version;
@@ -230,7 +229,7 @@ _bt_vacuum_needs_cleanup(Relation rel, Relation heaprel)
* finalized.
*/
void
-_bt_set_cleanup_info(Relation rel, Relation heaprel, BlockNumber num_delpages)
+_bt_set_cleanup_info(Relation rel, BlockNumber num_delpages)
{
Buffer metabuf;
Page metapg;
@@ -254,7 +253,7 @@ _bt_set_cleanup_info(Relation rel, Relation heaprel, BlockNumber num_delpages)
* no longer used as of PostgreSQL 14. We set it to -1.0 on rewrite, just
* to be consistent.
*/
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -326,6 +325,9 @@ _bt_set_cleanup_info(Relation rel, Relation heaprel, BlockNumber num_delpages)
* NOTE that the returned root page will have only a read lock set
* on it even if access = BT_WRITE!
*
+ * If access = BT_WRITE, heaprel must be set; otherwise caller should
+ * just pass NULL. See _bt_allocbuf for an explanation.
+ *
* The returned page is not necessarily the true root --- it could be
* a "fast root" (a page that is alone in its level due to deletions).
* Also, if the root page is split while we are "in flight" to it,
@@ -349,6 +351,8 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
uint32 rootlevel;
BTMetaPageData *metad;
+ Assert(access == BT_READ || heaprel != NULL);
+
/*
* Try to use previously-cached metapage data to find the root. This
* normally saves one buffer access per index search, which is a very
@@ -369,7 +373,7 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
Assert(rootblkno != P_NONE);
rootlevel = metad->btm_fastlevel;
- rootbuf = _bt_getbuf(rel, heaprel, rootblkno, BT_READ);
+ rootbuf = _bt_getbuf(rel, rootblkno, BT_READ);
rootpage = BufferGetPage(rootbuf);
rootopaque = BTPageGetOpaque(rootpage);
@@ -395,7 +399,7 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
rel->rd_amcache = NULL;
}
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metad = _bt_getmeta(rel, metabuf);
/* if no root page initialized yet, do it */
@@ -436,7 +440,7 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
* the new root page. Since this is the first page in the tree, it's
* a leaf as well as the root.
*/
- rootbuf = _bt_getbuf(rel, heaprel, P_NEW, BT_WRITE);
+ rootbuf = _bt_allocbuf(rel, heaprel);
rootblkno = BufferGetBlockNumber(rootbuf);
rootpage = BufferGetPage(rootbuf);
rootopaque = BTPageGetOpaque(rootpage);
@@ -573,7 +577,7 @@ _bt_getroot(Relation rel, Relation heaprel, int access)
* moving to the root --- that'd deadlock against any concurrent root split.)
*/
Buffer
-_bt_gettrueroot(Relation rel, Relation heaprel)
+_bt_gettrueroot(Relation rel)
{
Buffer metabuf;
Page metapg;
@@ -595,7 +599,7 @@ _bt_gettrueroot(Relation rel, Relation heaprel)
pfree(rel->rd_amcache);
rel->rd_amcache = NULL;
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metapg = BufferGetPage(metabuf);
metaopaque = BTPageGetOpaque(metapg);
metad = BTPageGetMeta(metapg);
@@ -668,7 +672,7 @@ _bt_gettrueroot(Relation rel, Relation heaprel)
* about updating previously cached data.
*/
int
-_bt_getrootheight(Relation rel, Relation heaprel)
+_bt_getrootheight(Relation rel)
{
BTMetaPageData *metad;
@@ -676,7 +680,7 @@ _bt_getrootheight(Relation rel, Relation heaprel)
{
Buffer metabuf;
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metad = _bt_getmeta(rel, metabuf);
/*
@@ -732,7 +736,7 @@ _bt_getrootheight(Relation rel, Relation heaprel)
* pg_upgrade'd from Postgres 12.
*/
void
-_bt_metaversion(Relation rel, Relation heaprel, bool *heapkeyspace, bool *allequalimage)
+_bt_metaversion(Relation rel, bool *heapkeyspace, bool *allequalimage)
{
BTMetaPageData *metad;
@@ -740,7 +744,7 @@ _bt_metaversion(Relation rel, Relation heaprel, bool *heapkeyspace, bool *allequ
{
Buffer metabuf;
- metabuf = _bt_getbuf(rel, heaprel, BTREE_METAPAGE, BT_READ);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
metad = _bt_getmeta(rel, metabuf);
/*
@@ -821,37 +825,7 @@ _bt_checkpage(Relation rel, Buffer buf)
}
/*
- * Log the reuse of a page from the FSM.
- */
-static void
-_bt_log_reuse_page(Relation rel, Relation heaprel, BlockNumber blkno,
- FullTransactionId safexid)
-{
- xl_btree_reuse_page xlrec_reuse;
-
- /*
- * Note that we don't register the buffer with the record, because this
- * operation doesn't modify the page. This record only exists to provide a
- * conflict point for Hot Standby.
- */
-
- /* XLOG stuff */
- xlrec_reuse.isCatalogRel = RelationIsAccessibleInLogicalDecoding(heaprel);
- xlrec_reuse.locator = rel->rd_locator;
- xlrec_reuse.block = blkno;
- xlrec_reuse.snapshotConflictHorizon = safexid;
-
- XLogBeginInsert();
- XLogRegisterData((char *) &xlrec_reuse, SizeOfBtreeReusePage);
-
- XLogInsert(RM_BTREE_ID, XLOG_BTREE_REUSE_PAGE);
-}
-
-/*
- * _bt_getbuf() -- Get a buffer by block number for read or write.
- *
- * blkno == P_NEW means to get an unallocated index page. The page
- * will be initialized before returning it.
+ * _bt_getbuf() -- Get an existing block in a buffer, for read or write.
*
* The general rule in nbtree is that it's never okay to access a
* page without holding both a buffer pin and a buffer lock on
@@ -860,136 +834,167 @@ _bt_log_reuse_page(Relation rel, Relation heaprel, BlockNumber blkno,
* When this routine returns, the appropriate lock is set on the
* requested buffer and its reference count has been incremented
* (ie, the buffer is "locked and pinned"). Also, we apply
- * _bt_checkpage to sanity-check the page (except in P_NEW case),
- * and perform Valgrind client requests that help Valgrind detect
- * unsafe page accesses.
+ * _bt_checkpage to sanity-check the page, and perform Valgrind
+ * client requests that help Valgrind detect unsafe page accesses.
*
* Note: raw LockBuffer() calls are disallowed in nbtree; all
* buffer lock requests need to go through wrapper functions such
* as _bt_lockbuf().
*/
Buffer
-_bt_getbuf(Relation rel, Relation heaprel, BlockNumber blkno, int access)
+_bt_getbuf(Relation rel, BlockNumber blkno, int access)
{
Buffer buf;
- if (blkno != P_NEW)
+ Assert(BlockNumberIsValid(blkno));
+
+ /* Read an existing block of the relation */
+ buf = ReadBuffer(rel, blkno);
+ _bt_lockbuf(rel, buf, access);
+ _bt_checkpage(rel, buf);
+
+ return buf;
+}
+
+/*
+ * _bt_allocbuf() -- Allocate a new block/page.
+ *
+ * Routine returns a pinned and write-locked buffer that contains a newly
+ * allocated index page. The page will be initialized with the appropriate
+ * amount of special area space for an nbtree page, too.
+ *
+ * Callers are required to pass a valid heaprel. We need heaprel so that we
+ * can handle generating a snapshotConflictHorizon that makes reusing a page
+ * from the FSM safe for queries that may be running on standbys.
+ */
+Buffer
+_bt_allocbuf(Relation rel, Relation heaprel)
+{
+ Buffer buf;
+ BlockNumber blkno;
+ Page page;
+
+ Assert(heaprel != NULL);
+
+ /*
+ * First see if the FSM knows of any free pages.
+ *
+ * We can't trust the FSM's report unreservedly; we have to check that the
+ * page is still free. (For example, an already-free page could have been
+ * re-used between the time the last VACUUM scanned it and the time the
+ * VACUUM made its FSM updates.)
+ *
+ * In fact, it's worse than that: we can't even assume that it's safe to
+ * take a lock on the reported page. If somebody else has a lock on it,
+ * or even worse our own caller does, we could deadlock. (The own-caller
+ * scenario is actually not improbable. Consider an index on a serial or
+ * timestamp column. Nearly all splits will be at the rightmost page, so
+ * it's entirely likely that _bt_split will call us while holding a lock
+ * on the page most recently acquired from FSM. A VACUUM running
+ * concurrently with the previous split could well have placed that page
+ * back in FSM.)
+ *
+ * To get around that, we ask for only a conditional lock on the reported
+ * page. If we fail, then someone else is using the page, and we may
+ * reasonably assume it's not free. (If we happen to be wrong, the worst
+ * consequence is the page will be lost to use till the next VACUUM, which
+ * is no big problem.)
+ */
+ for (;;)
{
- /* Read an existing block of the relation */
+ blkno = GetFreeIndexPage(rel);
+ if (blkno == InvalidBlockNumber)
+ break;
buf = ReadBuffer(rel, blkno);
- _bt_lockbuf(rel, buf, access);
- _bt_checkpage(rel, buf);
- }
- else
- {
- Page page;
-
- Assert(access == BT_WRITE);
-
- /*
- * First see if the FSM knows of any free pages.
- *
- * We can't trust the FSM's report unreservedly; we have to check that
- * the page is still free. (For example, an already-free page could
- * have been re-used between the time the last VACUUM scanned it and
- * the time the VACUUM made its FSM updates.)
- *
- * In fact, it's worse than that: we can't even assume that it's safe
- * to take a lock on the reported page. If somebody else has a lock
- * on it, or even worse our own caller does, we could deadlock. (The
- * own-caller scenario is actually not improbable. Consider an index
- * on a serial or timestamp column. Nearly all splits will be at the
- * rightmost page, so it's entirely likely that _bt_split will call us
- * while holding a lock on the page most recently acquired from FSM. A
- * VACUUM running concurrently with the previous split could well have
- * placed that page back in FSM.)
- *
- * To get around that, we ask for only a conditional lock on the
- * reported page. If we fail, then someone else is using the page,
- * and we may reasonably assume it's not free. (If we happen to be
- * wrong, the worst consequence is the page will be lost to use till
- * the next VACUUM, which is no big problem.)
- */
- for (;;)
+ if (_bt_conditionallockbuf(rel, buf))
{
- blkno = GetFreeIndexPage(rel);
- if (blkno == InvalidBlockNumber)
- break;
- buf = ReadBuffer(rel, blkno);
- if (_bt_conditionallockbuf(rel, buf))
- {
- page = BufferGetPage(buf);
+ page = BufferGetPage(buf);
+ /*
+ * It's possible to find an all-zeroes page in an index. For
+ * example, a backend might successfully extend the relation one
+ * page and then crash before it is able to make a WAL entry for
+ * adding the page. If we find a zeroed page then reclaim it
+ * immediately.
+ */
+ if (PageIsNew(page))
+ {
+ /* Okay to use page. Initialize and return it. */
+ _bt_pageinit(page, BufferGetPageSize(buf));
+ return buf;
+ }
+
+ if (BTPageIsRecyclable(page, heaprel))
+ {
/*
- * It's possible to find an all-zeroes page in an index. For
- * example, a backend might successfully extend the relation
- * one page and then crash before it is able to make a WAL
- * entry for adding the page. If we find a zeroed page then
- * reclaim it immediately.
+ * If we are generating WAL for Hot Standby then create a WAL
+ * record that will allow us to conflict with queries running
+ * on standby, in case they have snapshots older than safexid
+ * value
*/
- if (PageIsNew(page))
+ if (RelationNeedsWAL(rel) && XLogStandbyInfoActive())
{
- /* Okay to use page. Initialize and return it. */
- _bt_pageinit(page, BufferGetPageSize(buf));
- return buf;
- }
+ xl_btree_reuse_page xlrec_reuse;
- if (BTPageIsRecyclable(page, heaprel))
- {
/*
- * If we are generating WAL for Hot Standby then create a
- * WAL record that will allow us to conflict with queries
- * running on standby, in case they have snapshots older
- * than safexid value
+ * Note that we don't register the buffer with the record,
+ * because this operation doesn't modify the page (that
+ * already happened, back when VACUUM deleted the page).
+ * This record only exists to provide a conflict point for
+ * Hot Standby. See record REDO routine comments.
*/
- if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
- _bt_log_reuse_page(rel, heaprel, blkno,
- BTPageGetDeleteXid(page));
+ xlrec_reuse.locator = rel->rd_locator;
+ xlrec_reuse.block = blkno;
+ xlrec_reuse.snapshotConflictHorizon = BTPageGetDeleteXid(page);
+ xlrec_reuse.isCatalogRel =
+ RelationIsAccessibleInLogicalDecoding(heaprel);
- /* Okay to use page. Re-initialize and return it. */
- _bt_pageinit(page, BufferGetPageSize(buf));
- return buf;
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec_reuse, SizeOfBtreeReusePage);
+
+ XLogInsert(RM_BTREE_ID, XLOG_BTREE_REUSE_PAGE);
}
- elog(DEBUG2, "FSM returned nonrecyclable page");
- _bt_relbuf(rel, buf);
- }
- else
- {
- elog(DEBUG2, "FSM returned nonlockable page");
- /* couldn't get lock, so just drop pin */
- ReleaseBuffer(buf);
+
+ /* Okay to use page. Re-initialize and return it. */
+ _bt_pageinit(page, BufferGetPageSize(buf));
+ return buf;
}
+ elog(DEBUG2, "FSM returned nonrecyclable page");
+ _bt_relbuf(rel, buf);
+ }
+ else
+ {
+ elog(DEBUG2, "FSM returned nonlockable page");
+ /* couldn't get lock, so just drop pin */
+ ReleaseBuffer(buf);
}
-
- /*
- * Extend the relation by one page. Need to use RBM_ZERO_AND_LOCK or
- * we risk a race condition against btvacuumscan --- see comments
- * therein. This forces us to repeat the valgrind request that
- * _bt_lockbuf() otherwise would make, as we can't use _bt_lockbuf()
- * without introducing a race.
- */
- buf = ExtendBufferedRel(EB_REL(rel), MAIN_FORKNUM, NULL,
- EB_LOCK_FIRST);
- if (!RelationUsesLocalBuffers(rel))
- VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(buf), BLCKSZ);
-
- /* Initialize the new page before returning it */
- page = BufferGetPage(buf);
- Assert(PageIsNew(page));
- _bt_pageinit(page, BufferGetPageSize(buf));
}
- /* ref count and lock type are correct */
+ /*
+ * Extend the relation by one page. Need to use RBM_ZERO_AND_LOCK or we
+ * risk a race condition against btvacuumscan --- see comments therein.
+ * This forces us to repeat the valgrind request that _bt_lockbuf()
+ * otherwise would make, as we can't use _bt_lockbuf() without introducing
+ * a race.
+ */
+ buf = ExtendBufferedRel(EB_REL(rel), MAIN_FORKNUM, NULL, EB_LOCK_FIRST);
+ if (!RelationUsesLocalBuffers(rel))
+ VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(buf), BLCKSZ);
+
+ /* Initialize the new page before returning it */
+ page = BufferGetPage(buf);
+ Assert(PageIsNew(page));
+ _bt_pageinit(page, BufferGetPageSize(buf));
+
return buf;
}
/*
* _bt_relandgetbuf() -- release a locked buffer and get another one.
*
- * This is equivalent to _bt_relbuf followed by _bt_getbuf, with the
- * exception that blkno may not be P_NEW. Also, if obuf is InvalidBuffer
- * then it reduces to just _bt_getbuf; allowing this case simplifies some
- * callers.
+ * This is equivalent to _bt_relbuf followed by _bt_getbuf. Also, if obuf is
+ * InvalidBuffer then it reduces to just _bt_getbuf; allowing this case
+ * simplifies some callers.
*
* The original motivation for using this was to avoid two entries to the
* bufmgr when one would do. However, now it's mainly just a notational
@@ -1001,7 +1006,7 @@ _bt_relandgetbuf(Relation rel, Buffer obuf, BlockNumber blkno, int access)
{
Buffer buf;
- Assert(blkno != P_NEW);
+ Assert(BlockNumberIsValid(blkno));
if (BufferIsValid(obuf))
_bt_unlockbuf(rel, obuf);
buf = ReleaseAndReadBuffer(obuf, rel, blkno);
@@ -1272,14 +1277,14 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
* (a version that lacks the TIDs that are to be deleted).
*
* This is nearly the same as _bt_delitems_vacuum as far as what it does to
- * the page, but it needs its own snapshotConflictHorizon (caller gets this
- * from tableam). This is used by the REDO routine to generate recovery
+ * the page, but it needs its own snapshotConflictHorizon and isCatalogRel
+ * (from the tableam). This is used by the REDO routine to generate recovery
* conflicts. The other difference is that only _bt_delitems_vacuum will
* clear page's VACUUM cycle ID.
*/
static void
-_bt_delitems_delete(Relation rel, Relation heaprel, Buffer buf,
- TransactionId snapshotConflictHorizon,
+_bt_delitems_delete(Relation rel, Buffer buf,
+ TransactionId snapshotConflictHorizon, bool isCatalogRel,
OffsetNumber *deletable, int ndeletable,
BTVacuumPosting *updatable, int nupdatable)
{
@@ -1343,10 +1348,10 @@ _bt_delitems_delete(Relation rel, Relation heaprel, Buffer buf,
XLogRecPtr recptr;
xl_btree_delete xlrec_delete;
- xlrec_delete.isCatalogRel = RelationIsAccessibleInLogicalDecoding(heaprel);
xlrec_delete.snapshotConflictHorizon = snapshotConflictHorizon;
xlrec_delete.ndeleted = ndeletable;
xlrec_delete.nupdated = nupdatable;
+ xlrec_delete.isCatalogRel = isCatalogRel;
XLogBeginInsert();
XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1517,6 +1522,7 @@ _bt_delitems_delete_check(Relation rel, Buffer buf, Relation heapRel,
{
Page page = BufferGetPage(buf);
TransactionId snapshotConflictHorizon;
+ bool isCatalogRel;
OffsetNumber postingidxoffnum = InvalidOffsetNumber;
int ndeletable = 0,
nupdatable = 0;
@@ -1525,6 +1531,7 @@ _bt_delitems_delete_check(Relation rel, Buffer buf, Relation heapRel,
/* Use tableam interface to determine which tuples to delete first */
snapshotConflictHorizon = table_index_delete_tuples(heapRel, delstate);
+ isCatalogRel = RelationIsAccessibleInLogicalDecoding(heapRel);
/* Should not WAL-log snapshotConflictHorizon unless it's required */
if (!XLogStandbyInfoActive())
@@ -1670,8 +1677,8 @@ _bt_delitems_delete_check(Relation rel, Buffer buf, Relation heapRel,
}
/* Physically delete tuples (or TIDs) using deletable (or updatable) */
- _bt_delitems_delete(rel, heapRel, buf, snapshotConflictHorizon, deletable,
- ndeletable, updatable, nupdatable);
+ _bt_delitems_delete(rel, buf, snapshotConflictHorizon, isCatalogRel,
+ deletable, ndeletable, updatable, nupdatable);
/* be tidy */
for (int i = 0; i < nupdatable; i++)
@@ -1692,8 +1699,7 @@ _bt_delitems_delete_check(Relation rel, Buffer buf, Relation heapRel,
* same level must always be locked left to right to avoid deadlocks.
*/
static bool
-_bt_leftsib_splitflag(Relation rel, Relation heaprel, BlockNumber leftsib,
- BlockNumber target)
+_bt_leftsib_splitflag(Relation rel, BlockNumber leftsib, BlockNumber target)
{
Buffer buf;
Page page;
@@ -1704,7 +1710,7 @@ _bt_leftsib_splitflag(Relation rel, Relation heaprel, BlockNumber leftsib,
if (leftsib == P_NONE)
return false;
- buf = _bt_getbuf(rel, heaprel, leftsib, BT_READ);
+ buf = _bt_getbuf(rel, leftsib, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
@@ -1750,7 +1756,7 @@ _bt_leftsib_splitflag(Relation rel, Relation heaprel, BlockNumber leftsib,
* to-be-deleted subtree.)
*/
static bool
-_bt_rightsib_halfdeadflag(Relation rel, Relation heaprel, BlockNumber leafrightsib)
+_bt_rightsib_halfdeadflag(Relation rel, BlockNumber leafrightsib)
{
Buffer buf;
Page page;
@@ -1759,7 +1765,7 @@ _bt_rightsib_halfdeadflag(Relation rel, Relation heaprel, BlockNumber leafrights
Assert(leafrightsib != P_NONE);
- buf = _bt_getbuf(rel, heaprel, leafrightsib, BT_READ);
+ buf = _bt_getbuf(rel, leafrightsib, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
@@ -1948,18 +1954,18 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
* marked with INCOMPLETE_SPLIT flag before proceeding
*/
Assert(leafblkno == scanblkno);
- if (_bt_leftsib_splitflag(rel, vstate->info->heaprel, leftsib, leafblkno))
+ if (_bt_leftsib_splitflag(rel, leftsib, leafblkno))
{
ReleaseBuffer(leafbuf);
return;
}
/* we need an insertion scan key for the search, so build one */
- itup_key = _bt_mkscankey(rel, vstate->info->heaprel, targetkey);
+ itup_key = _bt_mkscankey(rel, targetkey);
/* find the leftmost leaf page with matching pivot/high key */
itup_key->pivotsearch = true;
- stack = _bt_search(rel, vstate->info->heaprel, itup_key,
- &sleafbuf, BT_READ, NULL);
+ stack = _bt_search(rel, NULL, itup_key, &sleafbuf, BT_READ,
+ NULL);
/* won't need a second lock or pin on leafbuf */
_bt_relbuf(rel, sleafbuf);
@@ -1990,7 +1996,8 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
* leafbuf page half-dead.
*/
Assert(P_ISLEAF(opaque) && !P_IGNORE(opaque));
- if (!_bt_mark_page_halfdead(rel, vstate->info->heaprel, leafbuf, stack))
+ if (!_bt_mark_page_halfdead(rel, vstate->info->heaprel, leafbuf,
+ stack))
{
_bt_relbuf(rel, leafbuf);
return;
@@ -2053,7 +2060,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
if (!rightsib_empty)
break;
- leafbuf = _bt_getbuf(rel, vstate->info->heaprel, rightsib, BT_WRITE);
+ leafbuf = _bt_getbuf(rel, rightsib, BT_WRITE);
}
}
@@ -2066,6 +2073,10 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
* but may include additional internal pages (at most one per level of the
* tree below the root).
*
+ * Caller must pass a valid heaprel, since it's just about possible that our
+ * call to _bt_lock_subtree_parent will need to allocate a new index page to
+ * complete a page split. Every call to _bt_allocbuf needs to pass a heaprel.
+ *
* Returns 'false' if leafbuf is unsafe to delete, usually because leafbuf is
* the rightmost child of its parent (and parent has more than one downlink).
* Returns 'true' when the first stage of page deletion completed
@@ -2094,6 +2105,7 @@ _bt_mark_page_halfdead(Relation rel, Relation heaprel, Buffer leafbuf,
Assert(!P_RIGHTMOST(opaque) && !P_ISROOT(opaque) &&
P_ISLEAF(opaque) && !P_IGNORE(opaque) &&
P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page));
+ Assert(heaprel != NULL);
/*
* Save info about the leaf page.
@@ -2108,7 +2120,7 @@ _bt_mark_page_halfdead(Relation rel, Relation heaprel, Buffer leafbuf,
* delete the downlink. It would fail the "right sibling of target page
* is also the next child in parent page" cross-check below.
*/
- if (_bt_rightsib_halfdeadflag(rel, heaprel, leafrightsib))
+ if (_bt_rightsib_halfdeadflag(rel, leafrightsib))
{
elog(DEBUG1, "could not delete page %u because its right sibling %u is half-dead",
leafblkno, leafrightsib);
@@ -2352,7 +2364,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
Assert(target != leafblkno);
/* Fetch the block number of the target's left sibling */
- buf = _bt_getbuf(rel, vstate->info->heaprel, target, BT_READ);
+ buf = _bt_getbuf(rel, target, BT_READ);
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
leftsib = opaque->btpo_prev;
@@ -2379,7 +2391,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
_bt_lockbuf(rel, leafbuf, BT_WRITE);
if (leftsib != P_NONE)
{
- lbuf = _bt_getbuf(rel, vstate->info->heaprel, leftsib, BT_WRITE);
+ lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
page = BufferGetPage(lbuf);
opaque = BTPageGetOpaque(page);
while (P_ISDELETED(opaque) || opaque->btpo_next != target)
@@ -2427,7 +2439,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
CHECK_FOR_INTERRUPTS();
/* step right one page */
- lbuf = _bt_getbuf(rel, vstate->info->heaprel, leftsib, BT_WRITE);
+ lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
page = BufferGetPage(lbuf);
opaque = BTPageGetOpaque(page);
}
@@ -2491,7 +2503,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
* And next write-lock the (current) right sibling.
*/
rightsib = opaque->btpo_next;
- rbuf = _bt_getbuf(rel, vstate->info->heaprel, rightsib, BT_WRITE);
+ rbuf = _bt_getbuf(rel, rightsib, BT_WRITE);
page = BufferGetPage(rbuf);
opaque = BTPageGetOpaque(page);
@@ -2538,7 +2550,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
* of doing so are slim, and the locking considerations daunting.)
*
* We can safely acquire a lock on the metapage here --- see comments for
- * _bt_newroot().
+ * _bt_newlevel().
*/
if (leftsib == P_NONE && rightsib_is_rightmost)
{
@@ -2547,8 +2559,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
if (P_RIGHTMOST(opaque))
{
/* rightsib will be the only one left on the level */
- metabuf = _bt_getbuf(rel, vstate->info->heaprel, BTREE_METAPAGE,
- BT_WRITE);
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
metapg = BufferGetPage(metabuf);
metad = BTPageGetMeta(metapg);
@@ -2906,7 +2917,7 @@ _bt_lock_subtree_parent(Relation rel, Relation heaprel, BlockNumber child,
*
* Note: We deliberately avoid completing incomplete splits here.
*/
- if (_bt_leftsib_splitflag(rel, heaprel, leftsibparent, parent))
+ if (_bt_leftsib_splitflag(rel, leftsibparent, parent))
return false;
/* Recurse to examine child page's grandparent page */
@@ -2976,6 +2987,7 @@ _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate)
Relation heaprel = vstate->info->heaprel;
Assert(stats->pages_newly_deleted >= vstate->npendingpages);
+ Assert(heaprel != NULL);
if (vstate->npendingpages == 0)
{
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1ce5b1519..4553aaee5 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -835,7 +835,7 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
if (stats == NULL)
{
/* Check if VACUUM operation can entirely avoid btvacuumscan() call */
- if (!_bt_vacuum_needs_cleanup(info->index, info->heaprel))
+ if (!_bt_vacuum_needs_cleanup(info->index))
return NULL;
/*
@@ -871,7 +871,7 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
*/
Assert(stats->pages_deleted >= stats->pages_free);
num_delpages = stats->pages_deleted - stats->pages_free;
- _bt_set_cleanup_info(info->index, info->heaprel, num_delpages);
+ _bt_set_cleanup_info(info->index, num_delpages);
/*
* It's quite possible for us to be fooled by concurrent page splits into
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 263f75fce..43d505be2 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -42,8 +42,7 @@ static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
ScanDirection dir);
-static Buffer _bt_walk_left(Relation rel, Relation heaprel, Buffer buf,
- Snapshot snapshot);
+static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
@@ -92,6 +91,11 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
* When access = BT_READ, an empty index will result in *bufP being set to
* InvalidBuffer. Also, in BT_WRITE mode, any incomplete splits encountered
* during the search will be finished.
+ *
+ * heaprel must be provided by callers that pass access = BT_WRITE, since we
+ * might need to allocate a new root page for caller -- see _bt_allocbuf.
+ * Similarly, access = BT_WRITE calls might need to split an internal page as
+ * a follow-on step when dealing with an incomplete split in its child page.
*/
BTStack
_bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
@@ -100,6 +104,10 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
BTStack stack_in = NULL;
int page_access = BT_READ;
+ /* heaprel must be set whenever _bt_allocbuf is reachable */
+ Assert(access == BT_READ || access == BT_WRITE);
+ Assert(access == BT_READ || heaprel != NULL);
+
/* Get the root page to start with */
*bufP = _bt_getroot(rel, heaprel, access);
@@ -222,8 +230,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
*
* If forupdate is true, we will attempt to finish any incomplete splits
* that we encounter. This is required when locking a target page for an
- * insertion, because we don't allow inserting on a page before the split
- * is completed. 'stack' is only used if forupdate is true.
+ * insertion, because we don't allow inserting on a page before the split is
+ * completed. 'heaprel' and 'stack' are only used if forupdate is true.
*
* On entry, we have the buffer pinned and a lock of the type specified by
* 'access'. If we move right, we release the buffer and lock and acquire
@@ -247,6 +255,8 @@ _bt_moveright(Relation rel,
BTPageOpaque opaque;
int32 cmpval;
+ Assert(!forupdate || heaprel != NULL);
+
/*
* When nextkey = false (normal case): if the scan key that brought us to
* this page is > the high key stored on the page, then the page has split
@@ -295,7 +305,7 @@ _bt_moveright(Relation rel,
_bt_relbuf(rel, buf);
/* re-acquire the lock in the right mode, and re-check */
- buf = _bt_getbuf(rel, heaprel, blkno, access);
+ buf = _bt_getbuf(rel, blkno, access);
continue;
}
@@ -862,7 +872,6 @@ bool
_bt_first(IndexScanDesc scan, ScanDirection dir)
{
Relation rel = scan->indexRelation;
- Relation heaprel = scan->heapRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Buffer buf;
BTStack stack;
@@ -1355,7 +1364,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
}
/* Initialize remaining insertion scan key fields */
- _bt_metaversion(rel, heaprel, &inskey.heapkeyspace, &inskey.allequalimage);
+ _bt_metaversion(rel, &inskey.heapkeyspace, &inskey.allequalimage);
inskey.anynullkeys = false; /* unused */
inskey.nextkey = nextkey;
inskey.pivotsearch = false;
@@ -1366,7 +1375,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* Use the manufactured insertion scan key to descend the tree and
* position ourselves on the target leaf page.
*/
- stack = _bt_search(rel, heaprel, &inskey, &buf, BT_READ, scan->xs_snapshot);
+ stack = _bt_search(rel, NULL, &inskey, &buf, BT_READ, scan->xs_snapshot);
/* don't need to keep the stack around... */
_bt_freestack(stack);
@@ -2007,7 +2016,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
/* step right one page */
- so->currPos.buf = _bt_getbuf(rel, scan->heapRelation, blkno, BT_READ);
+ so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
page = BufferGetPage(so->currPos.buf);
TestForOldSnapshot(scan->xs_snapshot, rel, page);
opaque = BTPageGetOpaque(page);
@@ -2081,8 +2090,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
if (BTScanPosIsPinned(so->currPos))
_bt_lockbuf(rel, so->currPos.buf, BT_READ);
else
- so->currPos.buf = _bt_getbuf(rel, scan->heapRelation,
- so->currPos.currPage, BT_READ);
+ so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
for (;;)
{
@@ -2096,8 +2104,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
}
/* Step to next physical page */
- so->currPos.buf = _bt_walk_left(rel, scan->heapRelation,
- so->currPos.buf, scan->xs_snapshot);
+ so->currPos.buf = _bt_walk_left(rel, so->currPos.buf,
+ scan->xs_snapshot);
/* if we're physically at end of index, return failure */
if (so->currPos.buf == InvalidBuffer)
@@ -2144,8 +2152,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
BTScanPosInvalidate(so->currPos);
return false;
}
- so->currPos.buf = _bt_getbuf(rel, scan->heapRelation, blkno,
- BT_READ);
+ so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
}
}
}
@@ -2190,7 +2197,7 @@ _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
* again if it's important.
*/
static Buffer
-_bt_walk_left(Relation rel, Relation heaprel, Buffer buf, Snapshot snapshot)
+_bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot)
{
Page page;
BTPageOpaque opaque;
@@ -2218,7 +2225,7 @@ _bt_walk_left(Relation rel, Relation heaprel, Buffer buf, Snapshot snapshot)
_bt_relbuf(rel, buf);
/* check for interrupts while we're not holding any buffer lock */
CHECK_FOR_INTERRUPTS();
- buf = _bt_getbuf(rel, heaprel, blkno, BT_READ);
+ buf = _bt_getbuf(rel, blkno, BT_READ);
page = BufferGetPage(buf);
TestForOldSnapshot(snapshot, rel, page);
opaque = BTPageGetOpaque(page);
@@ -2309,7 +2316,7 @@ _bt_walk_left(Relation rel, Relation heaprel, Buffer buf, Snapshot snapshot)
* The returned buffer is pinned and read-locked.
*/
Buffer
-_bt_get_endpoint(Relation rel, Relation heaprel, uint32 level, bool rightmost,
+_bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot)
{
Buffer buf;
@@ -2325,9 +2332,9 @@ _bt_get_endpoint(Relation rel, Relation heaprel, uint32 level, bool rightmost,
* smarter about intermediate levels.)
*/
if (level == 0)
- buf = _bt_getroot(rel, heaprel, BT_READ);
+ buf = _bt_getroot(rel, NULL, BT_READ);
else
- buf = _bt_gettrueroot(rel, heaprel);
+ buf = _bt_gettrueroot(rel);
if (!BufferIsValid(buf))
return InvalidBuffer;
@@ -2408,8 +2415,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
* version of _bt_search(). We don't maintain a stack since we know we
* won't need it.
*/
- buf = _bt_get_endpoint(rel, scan->heapRelation, 0,
- ScanDirectionIsBackward(dir), scan->xs_snapshot);
+ buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(dir), scan->xs_snapshot);
if (!BufferIsValid(buf))
{
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 6ad3f3c54..c2665fce4 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -566,7 +566,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
wstate.heap = btspool->heap;
wstate.index = btspool->index;
- wstate.inskey = _bt_mkscankey(wstate.index, btspool->heap, NULL);
+ wstate.inskey = _bt_mkscankey(wstate.index, NULL);
/* _bt_mkscankey() won't set allequalimage without metapage */
wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 05abf3603..7da499c4d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -87,7 +87,7 @@ static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
* field themselves.
*/
BTScanInsert
-_bt_mkscankey(Relation rel, Relation heaprel, IndexTuple itup)
+_bt_mkscankey(Relation rel, IndexTuple itup)
{
BTScanInsert key;
ScanKey skey;
@@ -112,7 +112,7 @@ _bt_mkscankey(Relation rel, Relation heaprel, IndexTuple itup)
key = palloc(offsetof(BTScanInsertData, scankeys) +
sizeof(ScanKeyData) * indnkeyatts);
if (itup)
- _bt_metaversion(rel, heaprel, &key->heapkeyspace, &key->allequalimage);
+ _bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
else
{
/* Utility statement callers can set these fields themselves */
@@ -1761,8 +1761,7 @@ _bt_killitems(IndexScanDesc scan)
droppedpin = true;
/* Attempt to re-read the buffer, getting pin and lock. */
- buf = _bt_getbuf(scan->indexRelation, scan->heapRelation,
- so->currPos.currPage, BT_READ);
+ buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage, BT_READ);
page = BufferGetPage(buf);
if (BufferGetLSNAtomic(buf) == so->currPos.lsn)
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 65adf04c4..39932d3c2 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -462,7 +462,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
* For btrees, get tree height while we have the index
* open
*/
- info->tree_height = _bt_getrootheight(indexRelation, relation);
+ info->tree_height = _bt_getrootheight(indexRelation);
}
else
{
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 018810692..eb6cfcfd0 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -207,7 +207,6 @@ tuplesort_begin_heap(TupleDesc tupDesc,
Tuplesortstate *
tuplesort_begin_cluster(TupleDesc tupDesc,
Relation indexRel,
- Relation heaprel,
int workMem,
SortCoordinate coordinate, int sortopt)
{
@@ -261,7 +260,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
arg->tupDesc = tupDesc; /* assume we need not copy tupDesc */
- indexScanKey = _bt_mkscankey(indexRel, heaprel, NULL);
+ indexScanKey = _bt_mkscankey(indexRel, NULL);
if (arg->indexInfo->ii_Expressions != NULL)
{
@@ -362,7 +361,7 @@ tuplesort_begin_index_btree(Relation heapRel,
arg->enforceUnique = enforceUnique;
arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
- indexScanKey = _bt_mkscankey(indexRel, heapRel, NULL);
+ indexScanKey = _bt_mkscankey(indexRel, NULL);
/* Prepare SortSupport data for each column */
base->sortKeys = (SortSupport) palloc0(base->nKeys *
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 6979aff72..94a975932 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -183,7 +183,6 @@ static inline bool invariant_l_nontarget_offset(BtreeCheckState *state,
OffsetNumber upperbound);
static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
static inline BTScanInsert bt_mkscankey_pivotsearch(Relation rel,
- Relation heaprel,
IndexTuple itup);
static ItemId PageGetItemIdCareful(BtreeCheckState *state, BlockNumber block,
Page page, OffsetNumber offset);
@@ -332,7 +331,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
RelationGetRelationName(indrel))));
/* Extract metadata from metapage, and sanitize it in passing */
- _bt_metaversion(indrel, heaprel, &heapkeyspace, &allequalimage);
+ _bt_metaversion(indrel, &heapkeyspace, &allequalimage);
if (allequalimage && !heapkeyspace)
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1259,7 +1258,7 @@ bt_target_page_check(BtreeCheckState *state)
}
/* Build insertion scankey for current page offset */
- skey = bt_mkscankey_pivotsearch(state->rel, state->heaprel, itup);
+ skey = bt_mkscankey_pivotsearch(state->rel, itup);
/*
* Make sure tuple size does not exceed the relevant BTREE_VERSION
@@ -1769,7 +1768,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
* memory remaining allocated.
*/
firstitup = (IndexTuple) PageGetItem(rightpage, rightitem);
- return bt_mkscankey_pivotsearch(state->rel, state->heaprel, firstitup);
+ return bt_mkscankey_pivotsearch(state->rel, firstitup);
}
/*
@@ -2682,7 +2681,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
Buffer lbuf;
bool exists;
- key = _bt_mkscankey(state->rel, state->heaprel, itup);
+ key = _bt_mkscankey(state->rel, itup);
Assert(key->heapkeyspace && key->scantid != NULL);
/*
@@ -2695,7 +2694,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
*/
Assert(state->readonly && state->rootdescend);
exists = false;
- stack = _bt_search(state->rel, state->heaprel, key, &lbuf, BT_READ, NULL);
+ stack = _bt_search(state->rel, NULL, key, &lbuf, BT_READ, NULL);
if (BufferIsValid(lbuf))
{
@@ -3134,11 +3133,11 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
* the scankey is greater.
*/
static inline BTScanInsert
-bt_mkscankey_pivotsearch(Relation rel, Relation heaprel, IndexTuple itup)
+bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
{
BTScanInsert skey;
- skey = _bt_mkscankey(rel, heaprel, itup);
+ skey = _bt_mkscankey(rel, itup);
skey->pivotsearch = true;
return skey;
--
2.40.1
Hi,
On 2023-06-08 08:50:31 -0700, Peter Geoghegan wrote:
On Thu, Jun 8, 2023 at 7:22 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
IMO this kind of change definitely does not have a place in a post-beta1
restructuring patch. We rarely indulge in case-fixing exercises at any
other time, and I don't see any good argument why post-beta1 is a better
time for it.
There is a glaring inconsistency. Now about half of the relevant
functions in nbtree.h use "heaprel", while the other half use
"heapRel". Obviously that's not the end of the world, but it's
annoying. It's exactly the kind of case-fixing exercise that does tend
to happen.
From what I can tell it's largely consistent with other parameters of the
respective function. E.g. btinsert(), _bt_doinsert() use camelCase for most
parameters, so heapRel fits in. There are a few cases where it's not obvious
what the pattern is intended to be :/.
My new plan is to commit this tomorrow, since the clear consensus is
that we should go ahead with this for 16.
I'm not sure there is that consensus (for me half the changes shouldn't be
done, the rest should be in 17), but in the end it doesn't matter that much.
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -399,9 +399,7 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
                                             int workMem, SortCoordinate coordinate,
                                             int sortopt);
 extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
-                                               Relation indexRel,
-                                               Relation heaprel,
-                                               int workMem,
+                                               Relation indexRel, int workMem,
                                                SortCoordinate coordinate, int sortopt);
I think we should continue to provide the table here, even if we don't need it
today.
Greetings,
Andres Freund
On Fri, Jun 9, 2023 at 12:03 PM Andres Freund <andres@anarazel.de> wrote:
From what I can tell it's largely consistent with other parameters of the
respective function. E.g. btinsert(), _bt_doinsert() use camelCase for most
parameters, so heapRel fits in. There are a few cases where it's not obvious
what the pattern is intended to be :/.
It's not 100% clear what the underlying principle is, but we mix
camelCase and underscore styles all the time, so that's always kinda
true.
My new plan is to commit this tomorrow, since the clear consensus is
that we should go ahead with this for 16.
I'm not sure there is that consensus (for me half the changes shouldn't be
done, the rest should be in 17), but in the end it doesn't matter that much.
Really? What parts are you opposed to in principle? I don't see why
you wouldn't support everything or nothing for 17 (questions of style
aside). I don't see what's ambiguous about what we should do here,
barring the 16-or-17 question.
It's not like nbtree ever really "used P_NEW". It doesn't actually
depend on any of the P_NEW handling inside bufmgr.c. It looks a little
like it might, but that's just an accident.
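To make the new division of labor concrete, here's a minimal caller-side
sketch of my own (not part of the attached patch; split_new_right() is a
made-up name), relying only on the _bt_getbuf(), _bt_allocbuf() and
_bt_relbuf() signatures shown in the patch:

/*
 * Illustration only: a page split reads its existing page with
 * _bt_getbuf(), and gets its brand new right sibling from
 * _bt_allocbuf() -- the one remaining path that needs a heaprel.
 */
static Buffer
split_new_right(Relation rel, Relation heaprel, BlockNumber origblkno)
{
	Buffer		origbuf;
	Buffer		rightbuf;

	/* existing block: no heaprel, and no bufmgr P_NEW involved */
	origbuf = _bt_getbuf(rel, origblkno, BT_WRITE);

	/* new block: recycled via the FSM, or the relation is extended by one page */
	rightbuf = _bt_allocbuf(rel, heaprel);

	/* ... move tuples across, fix sibling links, WAL-log the split ... */

	_bt_relbuf(rel, origbuf);
	return rightbuf;
}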
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -399,9 +399,7 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
                                             int workMem, SortCoordinate coordinate,
                                             int sortopt);
 extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
-                                               Relation indexRel,
-                                               Relation heaprel,
-                                               int workMem,
+                                               Relation indexRel, int workMem,
                                                SortCoordinate coordinate, int sortopt);
I think we should continue to provide the table here, even if we don't need it
today.
I don't see why, but okay. I'll do it that way.
--
Peter Geoghegan
Hi,
On 2023-06-09 12:23:36 -0700, Peter Geoghegan wrote:
On Fri, Jun 9, 2023 at 12:03 PM Andres Freund <andres@anarazel.de> wrote:
My new plan is to commit this tomorrow, since the clear consensus is
that we should go ahead with this for 16.
I'm not sure there is that consensus (for me half the changes shouldn't be
done, the rest should be in 17), but in the end it doesn't matter that much.
Really? What parts are you opposed to in principle? I don't see why
you wouldn't support everything or nothing for 17 (questions of style
aside). I don't see what's ambiguous about what we should do here,
barring the 16-or-17 question.
I don't think minimizing heaprel being passed around is a worthwhile goal, the
contrary actually: It just makes it painful to use heaprel anywhere, because
it causes precisely these cascading changes of adding/removing the parameter
to a bunch of functions. If anything we should do the opposite.
It's not like nbtree ever really "used P_NEW". It doesn't actually
depend on any of the P_NEW handling inside bufmgr.c. It looks a little
like it might, but that's just an accident.
That part I am entirely on board with, as mentioned earlier. It doesn't seem
like 16 material though.
Greetings,
Andres Freund
On Fri, Jun 9, 2023 at 9:40 PM Andres Freund <andres@anarazel.de> wrote:
I don't think minimizing heaprel being passed around is a worthwhile goal, the
contrary actually: It just makes it painful to use heaprel anywhere, because
it causes precisely these cascading changes of adding/removing the parameter
to a bunch of functions. If anything we should do the opposite.
It's not like nbtree ever really "used P_NEW". It doesn't actually
depend on any of the P_NEW handling inside bufmgr.c. It looks a little
like it might, but that's just an accident.
That part I am entirely on board with, as mentioned earlier. It doesn't seem
like 16 material though.
Obviously you shouldn't need a heaprel to lock a page. As it happened
GiST already worked without this sort of P_NEW idiom, which is why
commit 61b313e4 hardly made any changes at all to GiST, even though
the relevant parts of GiST are heavily based on nbtree. Did you just
forget to plaster similar heaprel arguments all over GiST and SP-GiST?
I'm really disappointed that you're still pushing back here, even
after I got a +1 on backpatching from Heikki. This should have been
straightforward.
--
Peter Geoghegan
On Fri, Jun 9, 2023 at 12:23 PM Peter Geoghegan <pg@bowt.ie> wrote:
I'm not sure there is that consensus (for me half the changes shouldn't be
done, the rest should be in 17), but in the end it doesn't matter that much.
I pushed this just now. I have also closed out the open item.
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -399,9 +399,7 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
                                             int workMem, SortCoordinate coordinate,
                                             int sortopt);
 extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
-                                               Relation indexRel,
-                                               Relation heaprel,
-                                               int workMem,
+                                               Relation indexRel, int workMem,
                                                SortCoordinate coordinate, int sortopt);
I think we should continue to provide the table here, even if we don't need it
today.
I don't see why, but okay. I'll do it that way.
I didn't end up doing that in the version that I pushed (heaprel was
removed from tuplesort_begin_cluster in the final version after all),
since I couldn't justify the use of NewHeap over OldHeap at the call
site in heapam_handler.c. If you're interested in following up with
this yourself, I have no objections.
--
Peter Geoghegan