Failure while inserting parent tuple to B-tree is not fun
Splitting a B-tree page is a two-stage process: First, the page is
split, and then a downlink for the new right page is inserted into the
parent (which might recurse to split the parent page, too). What happens
if inserting the downlink fails for some reason? I tried that out, and
it turns out that it's not nice.
I used this to cause a failure:
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1669,6 +1669,8 @@ _bt_insert_parent(Relation rel,
 		_bt_relbuf(rel, pbuf);
 	}
+	elog(ERROR, "fail!");
+
 	/* get high key from left page == lowest key on new right page */
 	ritem = (IndexTuple) PageGetItem(page,
 									 PageGetItemId(page, P_HIKEY));
postgres=# create table foo (i int4 primary key);
CREATE TABLE
postgres=# insert into foo select generate_series(1, 10000);
ERROR: fail!
That's not surprising. But after I removed that elog and restarted the
server, I still couldn't insert. The index is permanently broken:
postgres=# insert into foo select generate_series(1, 10000);
ERROR: failed to re-find parent key in index "foo_pkey" for split pages 4/5
In real life, you would get a failure like this e.g. if you run out of
memory or disk space while inserting the downlink into the parent.
Although rare in practice, it's no fun when it happens.
I fixed the same problem in GiST a few years back, by making it
tolerate missing downlinks, and inserting them lazily. The B-tree code
tolerates them already on scans, but gets confused on insertion, as seen
above. I propose that we use the same approach I used with GiST, and add
a flag to the page header to indicate "the downlink hasn't been inserted
yet". When insertion (or vacuum) bumps into a flagged page, it can
finish the incomplete action by inserting the downlink.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Oct 22, 2013 at 9:55 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> I propose that we use the same approach I used with GiST, and add a flag to
> the page header to indicate "the downlink hasn't been inserted yet". When
> insertion (or vacuum) bumps into a flagged page, it can finish the
> incomplete action by inserting the downlink.
Sounds very reasonable, but I'm interested in how you intend to
structure things, given this sounds like what could loosely be called
btree private state. I may also need to use a flag bit for something
that is of interest to the btree code alone.
--
Peter Geoghegan
On 22.10.2013 20:27, Peter Geoghegan wrote:
> On Tue, Oct 22, 2013 at 9:55 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> I propose that we use the same approach I used with GiST, and add a flag to
>> the page header to indicate "the downlink hasn't been inserted yet". When
>> insertion (or vacuum) bumps into a flagged page, it can finish the
>> incomplete action by inserting the downlink.
>
> Sounds very reasonable, but I'm interested in how you intend to
> structure things, given this sounds like what could loosely be called
> btree private state. I may also need to use a flag bit for something
> that is of interest to the btree code alone.
I may be missing something, but there are already plenty of b-tree
specific flags. See BTP_* in nbtree.h. I'll just add another to that list.
- Heikki
On Tue, Oct 22, 2013 at 10:30 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> I may be missing something, but there are already plenty of b-tree specific
> flags. See BTP_* in nbtree.h. I'll just add another to that list.
Based on your remarks, I thought that you were intent on directly
using page level bits (pd_flags), rather than the private state flag
bits. Blame it on the lack of coffee, I suppose.
--
Peter Geoghegan
On 2013-10-22 19:55:09 +0300, Heikki Linnakangas wrote:
> Splitting a B-tree page is a two-stage process: First, the page is split,
> and then a downlink for the new right page is inserted into the parent
> (which might recurse to split the parent page, too). What happens if
> inserting the downlink fails for some reason? I tried that out, and it
> turns out that it's not nice.
>
> I used this to cause a failure:
>
> --- a/src/backend/access/nbtree/nbtinsert.c
> +++ b/src/backend/access/nbtree/nbtinsert.c
> @@ -1669,6 +1669,8 @@ _bt_insert_parent(Relation rel,
>  		_bt_relbuf(rel, pbuf);
>  	}
> +	elog(ERROR, "fail!");
> +
>  	/* get high key from left page == lowest key on new right page */
>  	ritem = (IndexTuple) PageGetItem(page,
>  									 PageGetItemId(page, P_HIKEY));
>
> postgres=# create table foo (i int4 primary key);
> CREATE TABLE
> postgres=# insert into foo select generate_series(1, 10000);
> ERROR: fail!
>
> That's not surprising. But when I removed that elog again and restarted the
> server, I still can't insert. The index is permanently broken:
>
> postgres=# insert into foo select generate_series(1, 10000);
> ERROR: failed to re-find parent key in index "foo_pkey" for split pages 4/5
>
> In real life, you would get a failure like this e.g. if you run out of
> memory or disk space while inserting the downlink to the parent. Although
> rare in practice, it's no fun if it happens.
Why doesn't the incomplete split mechanism prevent this? Because we do
not delay checkpoints on the primary, and a checkpoint happened just
before your elog(ERROR) above?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 22.10.2013 21:25, Andres Freund wrote:
> On 2013-10-22 19:55:09 +0300, Heikki Linnakangas wrote:
>> Splitting a B-tree page is a two-stage process: First, the page is split,
>> and then a downlink for the new right page is inserted into the parent
>> (which might recurse to split the parent page, too). What happens if
>> inserting the downlink fails for some reason? I tried that out, and it
>> turns out that it's not nice.
>>
>> [...]
>>
>> In real life, you would get a failure like this e.g. if you run out of
>> memory or disk space while inserting the downlink to the parent. Although
>> rare in practice, it's no fun if it happens.
>
> Why doesn't the incomplete split mechanism prevent this? Because we do
> not delay checkpoints on the primary and a checkpoint happened just
> before your elog(ERROR) above?
Because there's no recovery involved. The failure I injected (or an
out-of-memory or out-of-disk-space in the real world) doesn't cause a
PANIC, just an ERROR that rolls back the current transaction, nothing more.
We could put a critical section around the whole recursion that inserts
the downlinks, so that you would get a PANIC and the incomplete split
mechanism would fix it at recovery. But that would hardly be an improvement.
- Heikki
On 2013-10-22 21:29:13 +0300, Heikki Linnakangas wrote:
> On 22.10.2013 21:25, Andres Freund wrote:
>> On 2013-10-22 19:55:09 +0300, Heikki Linnakangas wrote:
>>> Splitting a B-tree page is a two-stage process: First, the page is split,
>>> and then a downlink for the new right page is inserted into the parent
>>> (which might recurse to split the parent page, too). What happens if
>>> inserting the downlink fails for some reason? I tried that out, and it
>>> turns out that it's not nice.
>>>
>>> [...]
>>
>> Why doesn't the incomplete split mechanism prevent this? Because we do
>> not delay checkpoints on the primary and a checkpoint happened just
>> before your elog(ERROR) above?
>
> Because there's no recovery involved. The failure I injected (or an
> out-of-memory or out-of-disk-space in the real world) doesn't cause a PANIC,
> just an ERROR that rolls back the current transaction, nothing more.
>
> We could put a critical section around the whole recursion that inserts the
> downlinks, so that you would get a PANIC and the incomplete split mechanism
> would fix it at recovery. But that would hardly be an improvement.
You were talking about restarting the server, that's why I assumed
recovery had been involved... But you just were talking about removing
the elog() again.
For me this clearly *has* to be in a critical section with the current
code. I had always assumed all multi-part actions would be.
Do you foresee the fix with ignoring missing downlinks to be
back-patchable? FWIW, I think I might have seen real-world cases of
this.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-10-22 21:29:13 +0300, Heikki Linnakangas wrote:
>> We could put a critical section around the whole recursion that inserts the
>> downlinks, so that you would get a PANIC and the incomplete split mechanism
>> would fix it at recovery. But that would hardly be an improvement.
>
> For me this clearly *has* to be in a critical section with the current
> code. I had always assumed all multi-part actions would be.
No, that's hardly a good idea. As Heikki says, that would amount to
converting an entirely foreseeable situation into a PANIC.
Note also that the problem might be persistent, eg if you're out of disk
space. In that case, you'd just get the PANIC again at recovery, so now
not only have you crashed all your sessions but you have a database that
won't start up.
I wonder whether Heikki's approach could be used to remove the need for
the incomplete-split-fixup code altogether, thus eliminating a class of
recovery failure possibilities.
regards, tom lane
On 22.10.2013 22:24, Tom Lane wrote:
> I wonder whether Heikki's approach could be used to remove the need for
> the incomplete-split-fixup code altogether, thus eliminating a class of
> recovery failure possibilities.
Yes. I intend to do that, too.
- Heikki
On 2013-10-22 15:24:40 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> On 2013-10-22 21:29:13 +0300, Heikki Linnakangas wrote:
>>> We could put a critical section around the whole recursion that inserts the
>>> downlinks, so that you would get a PANIC and the incomplete split mechanism
>>> would fix it at recovery. But that would hardly be an improvement.
>>
>> For me this clearly *has* to be in a critical section with the current
>> code. I had always assumed all multi-part actions would be.
>
> No, that's hardly a good idea. As Heikki says, that would amount to
> converting an entirely foreseeable situation into a PANIC.
But IIUC this can currently lead to an index giving wrong answers, not
"just" failing at further inserts? I think if we half-split (without
inserting the downlink) a page several times, via different child pages,
we might "succeed" in splitting, but we'll have corrupted left/right
links. With complete splits such things should be prevented by the
page-level locks we hold, but that's not the case anymore.
If so, that could, especially in combination with unique indexes, lead to
quite nasty data corruption.
> I wonder whether Heikki's approach could be used to remove the need for
> the incomplete-split-fixup code altogether, thus eliminating a class of
> recovery failure possibilities.
That'd be better...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-10-22 15:24:40 -0400, Tom Lane wrote:
>> No, that's hardly a good idea. As Heikki says, that would amount to
>> converting an entirely foreseeable situation into a PANIC.
>
> But IIUC this can currently lead to an index giving wrong answers, not
> "just" fail at further inserts?
No. It's exactly the same situation as when the insert is still in
progress, ie searches have to move right when following the original
downlink. Go read the nbtree README.
regards, tom lane
On 2013-10-22 16:38:05 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> On 2013-10-22 15:24:40 -0400, Tom Lane wrote:
>>> No, that's hardly a good idea. As Heikki says, that would amount to
>>> converting an entirely foreseeable situation into a PANIC.
>>
>> But IIUC this can currently lead to an index giving wrong answers, not
>> "just" fail at further inserts?
>
> No. It's exactly the same situation as when the insert is still in
> progress, ie searches have to move right when following the original
> downlink. Go read the nbtree README.
If the parent insert is still in progress we have a write lock on the
left and right page preventing such issues, right? At least that's what
the nbtree README says...
That doesn't hold true if we split a page but fail before inserting the
downlink into the parent, since the transaction rollback will obviously
have released the locks.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 22.10.2013 19:55, Heikki Linnakangas wrote:
> I fixed the same problem in GiST a few years back, by making it
> tolerate missing downlinks, and inserting them lazily. The B-tree code
> tolerates them already on scans, but gets confused on insertion, as seen
> above. I propose that we use the same approach I used with GiST, and add
> a flag to the page header to indicate "the downlink hasn't been inserted
> yet". When insertion (or vacuum) bumps into a flagged page, it can
> finish the incomplete action by inserting the downlink.
This is what I came up with.
One thing I'm not totally happy about is the way page deletions of
incompletely split pages are handled. Basically, it just bails out and
refuses to delete a page that is part of an incomplete split. That's
probably OK in practice, as incomplete splits should be very rare
anyway, but it's a bit dissatisfying to not handle the case because at
first glance it seems like it should be even simpler than usual to
delete a page that has no downlink. Nevertheless, I decided to just skip
that for now.
After this patch, deleting the only child of a parent and the parent
itself is still a multi-WAL-record operation that needs to be tracked
during recovery, and completed at the end of recovery. I'd like to
eliminate that too, but that's another patch.
- Heikki
Attachment: btree-incomplete-split-1.patch (text/x-diff)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 40f09e3..29d8bd1 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -384,12 +384,41 @@ an additional insertion above that, etc).
For a root split, the followon WAL entry is a "new root" entry rather than
an "insertion" entry, but details are otherwise much the same.
-Because insertion involves multiple atomic actions, the WAL replay logic
-has to detect the case where a page split isn't followed by a matching
-insertion on the parent level, and then do that insertion on its own (and
-recursively for any subsequent parent insertion, of course). This is
-feasible because the WAL entry for the split contains enough info to know
-what must be inserted in the parent level.
+Because insertion involves multiple atomic actions, it's possible that the
+system crashes between splitting a page and inserting the downlink for the
+new half to the parent. After recovery, the downlink for the new page will
+be missing. The search algorithm works correctly, as the page will be found
+by following the right-link from its left sibling, although if a lot of
+downlinks in the tree are missing, performance will suffer. A more serious
+consequence is that if the page without a downlink gets split again, the
+insertion algorithm will fail to find the location in the parent level to
+insert the downlink.
+
+Our approach is to create any missing downlinks along the way, when
+searching the tree for a new insertion. It could be done during searches,
+too, but it seems best not to put any extra writes in what would otherwise
+be a read-only operation (it would not be possible in hot standby mode
+anyway). To identify missing downlinks, when a page is split, the left page
+is flagged to indicate that the split is not yet complete (INCOMPLETE_SPLIT).
+When the downlink is inserted to the parent, the flag is atomically cleared.
+The child page is kept locked until the insertion in the parent is finished
+and the flag in the child cleared, but can be released immediately after
+that, before recursing up the tree, if the parent also needs to be split.
+That ensures that incompletely split pages should not be seen under normal
+circumstances; only when a transaction fails to insert the parent for some
+reason.
+
+We flag the left page, even though it's the right page that's missing the
+downlink, because it's more convenient to know already when following the
+right-link from the left page to the right page that it will need to have
+its downlink inserted to the parent.
+
+We used to keep track of incomplete splits during recovery and finish them
+immediately at end of recovery, instead of doing it lazily at the next
+insertion. However, that made the recovery much more complicated, and only
+fixed the problem when crash recovery was performed. An incomplete split can
+also occur if an otherwise recoverable error, like out-of-memory or
+out-of-disk-space, happens while inserting the downlink to the parent.
When splitting a non-root page that is alone on its level, the required
metapage update (of the "fast root" link) is performed and logged as part
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index a452fea..a810015 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -59,14 +59,16 @@ static void _bt_findinsertloc(Relation rel,
ScanKey scankey,
IndexTuple newtup,
Relation heapRel);
-static void _bt_insertonpg(Relation rel, Buffer buf,
+static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
bool split_only_page);
-static Buffer _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
- OffsetNumber newitemoff, Size newitemsz,
+static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
+ OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
IndexTuple newitem, bool newitemonleft);
+static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
+ BTStack stack, bool is_root, bool is_only);
static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
OffsetNumber newitemoff,
Size newitemsz,
@@ -130,7 +132,8 @@ top:
* move right in the tree. See Lehman and Yao for an excruciatingly
* precise description.
*/
- buf = _bt_moveright(rel, buf, natts, itup_scankey, false, BT_WRITE);
+ buf = _bt_moveright(rel, buf, natts, itup_scankey, false,
+ true, stack, BT_WRITE);
/*
* If we're not allowing duplicates, make sure the key isn't already in
@@ -184,7 +187,7 @@ top:
CheckForSerializableConflictIn(rel, NULL, buf);
/* do the insertion */
_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup, heapRel);
- _bt_insertonpg(rel, buf, stack, itup, offset, false);
+ _bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
}
else
{
@@ -664,6 +667,10 @@ _bt_findinsertloc(Relation rel,
* insertion, and the buffer must be pinned and write-locked. On return,
* we will have dropped both the pin and the lock on the buffer.
*
+ * When inserting to a non-leaf page, 'cbuf' is the left-sibling of the
+ * page we're inserting the downlink for. This function will clear the
+ * INCOMPLETE_SPLIT flag on it, and release the buffer.
+ *
* The locking interactions in this code are critical. You should
* grok Lehman and Yao's paper before making any changes. In addition,
* you need to understand how we disambiguate duplicate keys in this
@@ -677,6 +684,7 @@ _bt_findinsertloc(Relation rel,
static void
_bt_insertonpg(Relation rel,
Buffer buf,
+ Buffer cbuf,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
@@ -690,6 +698,11 @@ _bt_insertonpg(Relation rel,
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+ /* the caller should've finished any incomplete splits already */
+ if (P_INCOMPLETE_SPLIT(lpageop))
+ elog(ERROR, "cannot insert to incompletely-split page %u",
+ BufferGetBlockNumber(buf));
+
itemsz = IndexTupleDSize(*itup);
itemsz = MAXALIGN(itemsz); /* be safe, PageAddItem will do this but we
* need to be consistent */
@@ -714,7 +727,7 @@ _bt_insertonpg(Relation rel,
&newitemonleft);
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, buf, firstright,
+ rbuf = _bt_split(rel, buf, cbuf, firstright,
newitemoff, itemsz, itup, newitemonleft);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
@@ -788,15 +801,25 @@ _bt_insertonpg(Relation rel,
MarkBufferDirty(metabuf);
}
+ /* clear INCOMPLETE_SPLIT flag on child if this finishes a split */
+ if (!P_ISLEAF(lpageop))
+ {
+ Page cpage = BufferGetPage(cbuf);
+ BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
+ Assert(P_INCOMPLETE_SPLIT(cpageop));
+ cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+ MarkBufferDirty(cbuf);
+ }
+
/* XLOG stuff */
if (RelationNeedsWAL(rel))
{
xl_btree_insert xlrec;
- BlockNumber xldownlink;
+ BlockNumber xlleftchild;
xl_btree_metadata xlmeta;
uint8 xlinfo;
XLogRecPtr recptr;
- XLogRecData rdata[4];
+ XLogRecData rdata[5];
XLogRecData *nextrdata;
IndexTupleData trunctuple;
@@ -812,12 +835,15 @@ _bt_insertonpg(Relation rel,
xlinfo = XLOG_BTREE_INSERT_LEAF;
else
{
- xldownlink = ItemPointerGetBlockNumber(&(itup->t_tid));
- Assert(ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
-
- nextrdata->data = (char *) &xldownlink;
+ /*
+ * Include the block number of the left child, whose
+ * INCOMPLETE_SPLIT flag is cleared.
+ */
+ xlleftchild = BufferGetBlockNumber(cbuf);
+ nextrdata->data = (char *) &xlleftchild;
nextrdata->len = sizeof(BlockNumber);
- nextrdata->buffer = InvalidBuffer;
+ nextrdata->buffer = cbuf;
+ nextrdata->buffer_std = true;
nextrdata->next = nextrdata + 1;
nextrdata++;
@@ -870,6 +896,8 @@ _bt_insertonpg(Relation rel,
END_CRIT_SECTION();
/* release buffers; send out relcache inval if metapage changed */
+ if (!P_ISLEAF(lpageop))
+ _bt_relbuf(rel, cbuf);
if (BufferIsValid(metabuf))
{
if (!InRecovery)
@@ -889,11 +917,15 @@ _bt_insertonpg(Relation rel,
* new right page. newitemoff etc. tell us about the new item that
* must be inserted along with the data from the old page.
*
+ * When splitting a non-leaf page, 'cbuf' is the left-sibling of the
+ * page we're inserting the downlink for. This function will clear the
+ * INCOMPLETE_SPLIT flag on it, and release the buffer.
+ *
* Returns the new right sibling of buf, pinned and write-locked.
* The pin and lock on buf are maintained.
*/
static Buffer
-_bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
+_bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
bool newitemonleft)
{
@@ -961,6 +993,8 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
lopaque->btpo_flags = oopaque->btpo_flags;
lopaque->btpo_flags &= ~(BTP_ROOT | BTP_SPLIT_END | BTP_HAS_GARBAGE);
ropaque->btpo_flags = lopaque->btpo_flags;
+ /* set flag in left page indicating that the right page has no downlink */
+ lopaque->btpo_flags |= BTP_INCOMPLETE_SPLIT;
lopaque->btpo_prev = oopaque->btpo_prev;
lopaque->btpo_next = rightpagenumber;
ropaque->btpo_prev = origpagenumber;
@@ -1184,6 +1218,18 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
MarkBufferDirty(sbuf);
}
+ /*
+ * Clear INCOMPLETE_SPLIT flag on child if inserting the new item finishes
+ * a split.
+ */
+ if (ropaque->btpo.level > 0)
+ {
+ Page cpage = BufferGetPage(cbuf);
+ BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
+ cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+ MarkBufferDirty(cbuf);
+ }
+
/* XLOG stuff */
if (RelationNeedsWAL(rel))
{
@@ -1206,34 +1252,6 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
lastrdata = &rdata[0];
- if (ropaque->btpo.level > 0)
- {
- /* Log downlink on non-leaf pages */
- lastrdata->next = lastrdata + 1;
- lastrdata++;
-
- lastrdata->data = (char *) &newitem->t_tid.ip_blkid;
- lastrdata->len = sizeof(BlockIdData);
- lastrdata->buffer = InvalidBuffer;
-
- /*
- * We must also log the left page's high key, because the right
- * page's leftmost key is suppressed on non-leaf levels. Show it
- * as belonging to the left page buffer, so that it is not stored
- * if XLogInsert decides it needs a full-page image of the left
- * page.
- */
- lastrdata->next = lastrdata + 1;
- lastrdata++;
-
- itemid = PageGetItemId(origpage, P_HIKEY);
- item = (IndexTuple) PageGetItem(origpage, itemid);
- lastrdata->data = (char *) item;
- lastrdata->len = MAXALIGN(IndexTupleSize(item));
- lastrdata->buffer = buf; /* backup block 1 */
- lastrdata->buffer_std = true;
- }
-
/*
* Log the new item and its offset, if it was inserted on the left
* page. (If it was put on the right page, we don't need to explicitly
@@ -1260,17 +1278,40 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
lastrdata->buffer = buf; /* backup block 1 */
lastrdata->buffer_std = true;
}
- else if (ropaque->btpo.level == 0)
+
+ /* Log left page */
+ if (ropaque->btpo.level > 0)
{
+ lastrdata->next = lastrdata + 1;
+ lastrdata++;
+
/*
- * Although we don't need to WAL-log the new item, we still need
- * XLogInsert to consider storing a full-page image of the left
- * page, so make an empty entry referencing that buffer. This also
- * ensures that the left page is always backup block 1.
+ * We must also log the left page's high key, because the right
+ * page's leftmost key is suppressed on non-leaf levels. Show it
+ * as belonging to the left page buffer, so that it is not stored
+ * if XLogInsert decides it needs a full-page image of the left
+ * page.
*/
+ itemid = PageGetItemId(origpage, P_HIKEY);
+ item = (IndexTuple) PageGetItem(origpage, itemid);
+ lastrdata->data = (char *) item;
+ lastrdata->len = MAXALIGN(IndexTupleSize(item));
+ lastrdata->buffer = buf; /* backup block 1 */
+ lastrdata->buffer_std = true;
+ }
+
+ if (ropaque->btpo.level == 0 && !newitemonleft)
+ {
lastrdata->next = lastrdata + 1;
lastrdata++;
+ /*
+			 * Although we don't need to WAL-log anything on the left
+			 * page, we still need XLogInsert to consider storing a
+ * full-page image of the left page, so make an empty entry
+ * referencing that buffer. This also ensures that the left page
+ * is always backup block 1.
+ */
lastrdata->data = NULL;
lastrdata->len = 0;
lastrdata->buffer = buf; /* backup block 1 */
@@ -1278,6 +1319,22 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
}
/*
+ * Log block number of left child, whose INCOMPLETE_SPLIT flag this
+ * insertion clears.
+ */
+ if (ropaque->btpo.level > 0)
+ {
+ BlockNumber cblkno = BufferGetBlockNumber(cbuf);
+ lastrdata->next = lastrdata + 1;
+ lastrdata++;
+
+ lastrdata->data = (char *) &cblkno;
+ lastrdata->len = sizeof(BlockNumber);
+ lastrdata->buffer = cbuf; /* backup block 2 */
+ lastrdata->buffer_std = true;
+ }
+
+ /*
* Log the contents of the right page in the format understood by
* _bt_restore_page(). We set lastrdata->buffer to InvalidBuffer,
* because we're going to recreate the whole page anyway, so it should
@@ -1306,7 +1363,7 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
lastrdata->data = NULL;
lastrdata->len = 0;
- lastrdata->buffer = sbuf; /* backup block 2 */
+ lastrdata->buffer = sbuf; /* bkp block 2 (leaf) or 3 (non-leaf) */
lastrdata->buffer_std = true;
}
@@ -1333,6 +1390,10 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
if (!P_RIGHTMOST(ropaque))
_bt_relbuf(rel, sbuf);
+ /* release the child */
+ if (ropaque->btpo.level > 0)
+ _bt_relbuf(rel, cbuf);
+
/* split's done */
return rbuf;
}
@@ -1603,10 +1664,8 @@ _bt_checksplitloc(FindSplitData *state,
* have to be efficient (concurrent ROOT split, WAL recovery)
* is_root - we split the true root
* is_only - we split a page alone on its level (might have been fast root)
- *
- * This is exported so it can be called by nbtxlog.c.
*/
-void
+static void
_bt_insert_parent(Relation rel,
Buffer buf,
Buffer rbuf,
@@ -1685,12 +1744,13 @@ _bt_insert_parent(Relation rel,
* 05/27/97
*/
ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
-
pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
- /* Now we can unlock the children */
+ /*
+ * Now we can unlock the right child. The left child will be unlocked
+ * by _bt_insertonpg().
+ */
_bt_relbuf(rel, rbuf);
- _bt_relbuf(rel, buf);
/* Check for error only after writing children */
if (pbuf == InvalidBuffer)
@@ -1698,7 +1758,7 @@ _bt_insert_parent(Relation rel,
RelationGetRelationName(rel), bknum, rbknum);
/* Recursively update the parent */
- _bt_insertonpg(rel, pbuf, stack->bts_parent,
+ _bt_insertonpg(rel, pbuf, buf, stack->bts_parent,
new_item, stack->bts_offset + 1,
is_only);
@@ -1708,6 +1768,56 @@ _bt_insert_parent(Relation rel,
}
/*
+ * _bt_finish_split() -- Finish an incomplete split
+ *
+ * 'lbuf' is the left buffer whose right sibling is missing the downlink.
+ * On entry, we hold write-lock on it, and the lock is released on exit.
+ */
+void
+_bt_finish_split(Relation rel, Buffer lbuf, BTStack stack)
+{
+ Page lpage = BufferGetPage(lbuf);
+ BTPageOpaque lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
+ Buffer rbuf;
+ Page rpage;
+ BTPageOpaque rpageop;
+ bool was_root;
+ bool was_only;
+
+ /* Lock the right sibling, the one missing the downlink */
+ rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE);
+ rpage = BufferGetPage(rbuf);
+ rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
+
+ /* Could this be a root split? */
+ if (!stack)
+ {
+ Buffer metabuf;
+ Page metapg;
+ BTMetaPageData *metad;
+
+ /* acquire lock on the metapage */
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metapg = BufferGetPage(metabuf);
+ metad = BTPageGetMeta(metapg);
+
+ was_root = (metad->btm_root == BufferGetBlockNumber(lbuf));
+
+ _bt_relbuf(rel, metabuf);
+ }
+ else
+ was_root = false;
+
+ /* Was this the only page on the level before split? */
+ was_only = (P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop));
+
+ elog(DEBUG1, "finishing incomplete split of %u/%u",
+ BufferGetBlockNumber(lbuf), BufferGetBlockNumber(rbuf));
+
+ _bt_insert_parent(rel, lbuf, rbuf, stack, was_root, was_only);
+}
+
+/*
* _bt_getstackbuf() -- Walk back up the tree one step, and find the item
* we last looked at in the parent.
*
@@ -1843,6 +1953,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
rbkno;
BlockNumber rootblknum;
BTPageOpaque rootopaque;
+ BTPageOpaque lopaque;
ItemId itemid;
IndexTuple item;
Size itemsz;
@@ -1854,6 +1965,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
lbkno = BufferGetBlockNumber(lbuf);
rbkno = BufferGetBlockNumber(rbuf);
lpage = BufferGetPage(lbuf);
+ lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
/* get a new root page */
rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
@@ -1927,6 +2039,11 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
BufferGetBlockNumber(lbuf), RelationGetRelationName(rel));
pfree(new_item);
+ /* Clear the incomplete-split flag in the left child */
+ Assert(P_INCOMPLETE_SPLIT(lopaque));
+ lopaque->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+ MarkBufferDirty(lbuf);
+
MarkBufferDirty(rootbuf);
MarkBufferDirty(metabuf);
@@ -1935,7 +2052,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
{
xl_btree_newroot xlrec;
XLogRecPtr recptr;
- XLogRecData rdata[2];
+ XLogRecData rdata[3];
xlrec.node = rel->rd_node;
xlrec.rootblk = rootblknum;
@@ -1954,7 +2071,13 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
rdata[1].len = ((PageHeader) rootpage)->pd_special -
((PageHeader) rootpage)->pd_upper;
rdata[1].buffer = InvalidBuffer;
- rdata[1].next = NULL;
+ rdata[1].next = &(rdata[2]);
+
+ /* Make a full-page image of the left child if needed */
+ rdata[2].data = NULL;
+ rdata[2].len = 0;
+ rdata[2].buffer = lbuf;
+ rdata[2].next = NULL;
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_NEWROOT, rdata);
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index f407753..cbbfc1d 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -992,8 +992,15 @@ _bt_parent_deletion_safe(Relation rel, BlockNumber target, BTStack stack)
ItemPointerSet(&(stack->bts_btentry.t_tid), target, P_HIKEY);
pbuf = _bt_getstackbuf(rel, stack, BT_READ);
if (pbuf == InvalidBuffer)
- elog(ERROR, "failed to re-find parent key in index \"%s\" for deletion target page %u",
+ {
+ /*
+ * If the target page's left sibling was incompletely split, it
+ * has no downlink in the parent. Bail out.
+ */
+ elog(DEBUG1, "failed to re-find parent key in index \"%s\" for deletion target page %u",
RelationGetRelationName(rel), target);
+ return false;
+ }
parent = stack->bts_blkno;
poffset = stack->bts_offset;
@@ -1094,11 +1101,16 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
/*
* We can never delete rightmost pages nor root pages. While at it, check
* that page is not already deleted and is empty.
+ *
+ * Also don't delete a page that was incompletely split. We could handle
+ * that case, but it would complicate the logic, and incomplete splits are
+ * rare.
*/
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
- P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page))
+ P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page) ||
+ P_INCOMPLETE_SPLIT(opaque))
{
/* Should never fail to delete a half-dead page */
Assert(!P_ISHALFDEAD(opaque));
@@ -1223,6 +1235,19 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
page = BufferGetPage(lbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
}
+
+ /*
+ * If the left sibling was incompletely split, and therefore the target
+ * page has no downlink in the parent, give up. We could handle this
+ * case - in fact deleting a page that has no downlink would probably
+ * be simpler than the general case, as the parent doesn't need to be
+ * updated. But incomplete splits are rare.
+ */
+ if (P_INCOMPLETE_SPLIT(opaque))
+ {
+ _bt_relbuf(rel, lbuf);
+ return 0;
+ }
}
else
lbuf = InvalidBuffer;
@@ -1242,7 +1267,8 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
* didn't have it locked; the others are just for paranoia's sake.
*/
if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
- P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page))
+ P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page) ||
+ P_INCOMPLETE_SPLIT(opaque))
{
_bt_relbuf(rel, buf);
if (BufferIsValid(lbuf))
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index ac98589..a85b5ee 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -51,7 +51,9 @@ static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
* NOTE that the returned buffer is read-locked regardless of the access
* parameter. However, access = BT_WRITE will allow an empty root page
* to be created and returned. When access = BT_READ, an empty index
- * will result in *bufP being set to InvalidBuffer.
+ * will result in *bufP being set to InvalidBuffer. Also, in BT_WRITE mode,
+ * any incomplete splits encountered during the search will be finished by
+ * inserting the missing downlinks.
*/
BTStack
_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
@@ -82,8 +84,13 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
* Race -- the page we just grabbed may have split since we read its
* pointer in the parent (or metapage). If it has, we may need to
* move right to its new sibling. Do that.
+ *
+ * In write-mode, allow _bt_moveright to finish any incomplete splits
+ * along the way.
*/
- *bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey, BT_READ);
+ *bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
+ (access == BT_WRITE), stack_in,
+ BT_READ);
/* if this is a leaf page, we're done */
page = BufferGetPage(*bufP);
@@ -148,6 +155,11 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
* item >= scankey. When nextkey is true, we are looking for the first
* item strictly greater than scankey.
*
+ * If finishsplits is true, we will attempt to finish any incompletely split
+ * pages that we encounter. This is required when searching for a target page
+ * for an insertion, because the split algorithm will fail if it cannot find
+ * the downlink of a page. 'stack' is only used if finishsplits is true.
+ *
* On entry, we have the buffer pinned and a lock of the type specified by
* 'access'. If we move right, we release the buffer and lock and acquire
* the same on the right sibling. Return value is the buffer we stop at.
@@ -158,11 +170,14 @@ _bt_moveright(Relation rel,
int keysz,
ScanKey scankey,
bool nextkey,
+ bool finishsplits,
+ BTStack stack,
int access)
{
Page page;
BTPageOpaque opaque;
int32 cmpval;
+ int curaccess = access;
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -186,12 +201,41 @@ _bt_moveright(Relation rel,
while (!P_RIGHTMOST(opaque) &&
(P_IGNORE(opaque) ||
- _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval))
+ _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval ||
+ (finishsplits && P_INCOMPLETE_SPLIT(opaque))))
{
- /* step right one page */
- BlockNumber rblkno = opaque->btpo_next;
+ BlockNumber blkno;
+ int needaccess = access;
+
+ /*
+ * Finish any incomplete splits we encounter along the way.
+ */
+ if (finishsplits && P_INCOMPLETE_SPLIT(opaque))
+ {
+ blkno = BufferGetBlockNumber(buf);
+ if (curaccess == BT_READ)
+ {
+ /* re-acquire in write-mode, and recheck */
+ needaccess = BT_WRITE;
+ }
+ else
+ {
+ _bt_finish_split(rel, buf, stack);
+ /*
+ * the buffer was released by _bt_finish_split, we will
+ * re-acquire it in the right mode below.
+ */
+ buf = InvalidBuffer;
+ }
+ }
+ else
+ {
+ /* step right one page */
+ blkno = opaque->btpo_next;
+ }
- buf = _bt_relandgetbuf(rel, buf, rblkno, access);
+ buf = _bt_relandgetbuf(rel, buf, blkno, needaccess);
+ curaccess = needaccess;
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index cb5867e..c934a5e 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -21,77 +21,29 @@
#include "miscadmin.h"
/*
- * We must keep track of expected insertions due to page splits, and apply
- * them manually if they are not seen in the WAL log during replay. This
- * makes it safe for page insertion to be a multiple-WAL-action process.
- *
- * Similarly, deletion of an only child page and deletion of its parent page
+ * Deletion of an only child page and deletion of its parent page
* form multiple WAL log entries, and we have to be prepared to follow through
- * with the deletion if the log ends between.
+ * with the deletion if the log ends between. Keep track of those actions.
*
* The data structure is a simple linked list --- this should be good enough,
- * since we don't expect a page split or multi deletion to remain incomplete
+ * since we don't expect a multi deletion to remain incomplete
* for long. In any case we need to respect the order of operations.
*/
typedef struct bt_incomplete_action
{
RelFileNode node; /* the index */
- bool is_split; /* T = pending split, F = pending delete */
- /* these fields are for a split: */
- bool is_root; /* we split the root */
- BlockNumber leftblk; /* left half of split */
- BlockNumber rightblk; /* right half of split */
/* these fields are for a delete: */
BlockNumber delblk; /* parent block to be deleted */
} bt_incomplete_action;
static List *incomplete_actions;
-
-static void
-log_incomplete_split(RelFileNode node, BlockNumber leftblk,
- BlockNumber rightblk, bool is_root)
-{
- bt_incomplete_action *action = palloc(sizeof(bt_incomplete_action));
-
- action->node = node;
- action->is_split = true;
- action->is_root = is_root;
- action->leftblk = leftblk;
- action->rightblk = rightblk;
- incomplete_actions = lappend(incomplete_actions, action);
-}
-
-static void
-forget_matching_split(RelFileNode node, BlockNumber downlink, bool is_root)
-{
- ListCell *l;
-
- foreach(l, incomplete_actions)
- {
- bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
-
- if (RelFileNodeEquals(node, action->node) &&
- action->is_split &&
- downlink == action->rightblk)
- {
- if (is_root != action->is_root)
- elog(LOG, "forget_matching_split: fishy is_root data (expected %d, got %d)",
- action->is_root, is_root);
- incomplete_actions = list_delete_ptr(incomplete_actions, action);
- pfree(action);
- break; /* need not look further */
- }
- }
-}
-
static void
log_incomplete_deletion(RelFileNode node, BlockNumber delblk)
{
bt_incomplete_action *action = palloc(sizeof(bt_incomplete_action));
action->node = node;
- action->is_split = false;
action->delblk = delblk;
incomplete_actions = lappend(incomplete_actions, action);
}
@@ -106,7 +58,6 @@ forget_matching_deletion(RelFileNode node, BlockNumber delblk)
bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
if (RelFileNodeEquals(node, action->node) &&
- !action->is_split &&
delblk == action->delblk)
{
incomplete_actions = list_delete_ptr(incomplete_actions, action);
@@ -190,23 +141,60 @@ _bt_restore_meta(RelFileNode rnode, XLogRecPtr lsn,
UnlockReleaseBuffer(metabuf);
}
+/*
+ * _bt_clear_incomplete_split -- clear INCOMPLETE_SPLIT flag on a page
+ *
+ * This is a common subroutine of the redo functions of all the WAL record
+ * types that can insert a downlink: insert, split, and newroot.
+ *
+ * Returns a (locked) buffer containing the child page, or InvalidBuffer if
+ * the page was not found (because it was truncated later).
+ */
+static Buffer
+_bt_clear_incomplete_split(XLogRecPtr lsn, XLogRecord *record,
+ RelFileNode rnode, BlockNumber cblock)
+{
+ Buffer cbuf;
+
+ cbuf = XLogReadBuffer(rnode, cblock, false);
+ if (BufferIsValid(cbuf))
+ {
+ Page page = (Page) BufferGetPage(cbuf);
+
+ if (lsn > PageGetLSN(page))
+ {
+ BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+ Assert((pageop->btpo_flags & BTP_INCOMPLETE_SPLIT) != 0);
+ pageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(cbuf);
+ }
+ }
+
+ return cbuf;
+}
+
static void
btree_xlog_insert(bool isleaf, bool ismeta,
XLogRecPtr lsn, XLogRecord *record)
{
xl_btree_insert *xlrec = (xl_btree_insert *) XLogRecGetData(record);
Buffer buffer;
+ Buffer cbuffer = InvalidBuffer;
Page page;
char *datapos;
int datalen;
xl_btree_metadata md;
- BlockNumber downlink = 0;
+ BlockNumber finishes_split = 0;
+ int blk_index;
datapos = (char *) xlrec + SizeOfBtreeInsert;
datalen = record->xl_len - SizeOfBtreeInsert;
- if (!isleaf)
+ if (!isleaf && (record->xl_info & XLR_BKP_BLOCK(0)) == 0)
{
- memcpy(&downlink, datapos, sizeof(BlockNumber));
+ memcpy(&finishes_split, datapos, sizeof(BlockNumber));
+ Assert(finishes_split != 0);
datapos += sizeof(BlockNumber);
datalen -= sizeof(BlockNumber);
}
@@ -217,8 +205,18 @@ btree_xlog_insert(bool isleaf, bool ismeta,
datalen -= sizeof(xl_btree_metadata);
}
- if (record->xl_info & XLR_BKP_BLOCK(0))
- (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ if (!isleaf)
+ {
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ cbuffer = RestoreBackupBlock(lsn, record, 0, false, true);
+ else
+ cbuffer = _bt_clear_incomplete_split(lsn, record, xlrec->target.node, finishes_split);
+ }
+
+ /* the backup block index containing the main page is 0 or 1 */
+ blk_index = isleaf ? 0 : 1;
+ if (record->xl_info & XLR_BKP_BLOCK(blk_index))
+ (void) RestoreBackupBlock(lsn, record, blk_index, false, false);
else
{
buffer = XLogReadBuffer(xlrec->target.node,
@@ -228,11 +226,7 @@ btree_xlog_insert(bool isleaf, bool ismeta,
{
page = (Page) BufferGetPage(buffer);
- if (lsn <= PageGetLSN(page))
- {
- UnlockReleaseBuffer(buffer);
- }
- else
+ if (lsn > PageGetLSN(page))
{
if (PageAddItem(page, (Item) datapos, datalen,
ItemPointerGetOffsetNumber(&(xlrec->target.tid)),
@@ -241,11 +235,14 @@ btree_xlog_insert(bool isleaf, bool ismeta,
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
- UnlockReleaseBuffer(buffer);
}
+ UnlockReleaseBuffer(buffer);
}
}
+ if (BufferIsValid(cbuffer))
+ UnlockReleaseBuffer(cbuffer);
+
/*
* Note: in normal operation, we'd update the metapage while still holding
* lock on the page we inserted into. But during replay it's not
@@ -257,10 +254,6 @@ btree_xlog_insert(bool isleaf, bool ismeta,
_bt_restore_meta(xlrec->target.node, lsn,
md.root, md.level,
md.fastroot, md.fastlevel);
-
- /* Forget any split this insertion completes */
- if (!isleaf)
- forget_matching_split(xlrec->target.node, downlink, false);
}
static void
@@ -268,7 +261,10 @@ btree_xlog_split(bool onleft, bool isroot,
XLogRecPtr lsn, XLogRecord *record)
{
xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
+ bool isleaf = xlrec->level == 0;
+ Buffer lbuf;
Buffer rbuf;
+ Buffer cbuf = InvalidBuffer;
Page rpage;
BTPageOpaque ropaque;
char *datapos;
@@ -278,42 +274,19 @@ btree_xlog_split(bool onleft, bool isroot,
Size newitemsz = 0;
Item left_hikey = NULL;
Size left_hikeysz = 0;
+ BlockNumber finishes_split = InvalidBlockNumber;
datapos = (char *) xlrec + SizeOfBtreeSplit;
datalen = record->xl_len - SizeOfBtreeSplit;
- /* Forget any split this insertion completes */
- if (xlrec->level > 0)
- {
- /* we assume SizeOfBtreeSplit is at least 16-bit aligned */
- BlockNumber downlink = BlockIdGetBlockNumber((BlockId) datapos);
-
- datapos += sizeof(BlockIdData);
- datalen -= sizeof(BlockIdData);
-
- forget_matching_split(xlrec->node, downlink, false);
-
- /* Extract left hikey and its size (still assuming 16-bit alignment) */
- if (!(record->xl_info & XLR_BKP_BLOCK(0)))
- {
- /* We assume 16-bit alignment is enough for IndexTupleSize */
- left_hikey = (Item) datapos;
- left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
-
- datapos += left_hikeysz;
- datalen -= left_hikeysz;
- }
- }
-
- /* Extract newitem and newitemoff, if present */
+ /* Extract newitemoff and newitem, if present */
if (onleft)
{
- /* Extract the offset (still assuming 16-bit alignment) */
+ /* we assume SizeOfBtreeSplit is at least 16-bit aligned */
memcpy(&newitemoff, datapos, sizeof(OffsetNumber));
datapos += sizeof(OffsetNumber);
datalen -= sizeof(OffsetNumber);
}
-
if (onleft && !(record->xl_info & XLR_BKP_BLOCK(0)))
{
/*
@@ -327,6 +300,37 @@ btree_xlog_split(bool onleft, bool isroot,
datalen -= newitemsz;
}
+ /* Extract left hikey and its size (still assuming 16-bit alignment) */
+ if (!isleaf && !(record->xl_info & XLR_BKP_BLOCK(0)))
+ {
+ left_hikey = (Item) datapos;
+ left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
+ datapos += left_hikeysz;
+ datalen -= left_hikeysz;
+ }
+ /*
+ * If this insertion finishes an incomplete split, get the block number
+ * of the child.
+ */
+ if (!isleaf && !(record->xl_info & XLR_BKP_BLOCK(1)))
+ {
+ memcpy(&finishes_split, datapos, sizeof(BlockNumber));
+ datapos += sizeof(BlockNumber);
+ datalen -= sizeof(BlockNumber);
+ }
+
+ /*
+ * Clear the incomplete split flag on the left sibling of the child page
+ * this is a downlink for.
+ */
+ if (!isleaf)
+ {
+ if (record->xl_info & XLR_BKP_BLOCK(2))
+ cbuf = RestoreBackupBlock(lsn, record, 2, false, true);
+ else
+ cbuf = _bt_clear_incomplete_split(lsn, record, xlrec->node, finishes_split);
+ }
+
/* Reconstruct right (new) sibling page from scratch */
rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true);
Assert(BufferIsValid(rbuf));
@@ -338,7 +342,7 @@ btree_xlog_split(bool onleft, bool isroot,
ropaque->btpo_prev = xlrec->leftsib;
ropaque->btpo_next = xlrec->rnext;
ropaque->btpo.level = xlrec->level;
- ropaque->btpo_flags = (xlrec->level == 0) ? BTP_LEAF : 0;
+ ropaque->btpo_flags = isleaf ? BTP_LEAF : 0;
ropaque->btpo_cycleid = 0;
_bt_restore_page(rpage, datapos, datalen);
@@ -347,7 +351,7 @@ btree_xlog_split(bool onleft, bool isroot,
* On leaf level, the high key of the left page is equal to the first key
* on the right page.
*/
- if (xlrec->level == 0)
+ if (isleaf)
{
ItemId hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
@@ -362,10 +366,10 @@ btree_xlog_split(bool onleft, bool isroot,
/* Now reconstruct left (original) sibling page */
if (record->xl_info & XLR_BKP_BLOCK(0))
- (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ lbuf = RestoreBackupBlock(lsn, record, 0, false, true);
else
{
- Buffer lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);
+ lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);
if (BufferIsValid(lbuf))
{
@@ -422,19 +426,22 @@ btree_xlog_split(bool onleft, bool isroot,
elog(PANIC, "failed to add high key to left page after split");
/* Fix opaque fields */
- lopaque->btpo_flags = (xlrec->level == 0) ? BTP_LEAF : 0;
+ lopaque->btpo_flags = isleaf ? BTP_LEAF : 0;
+ lopaque->btpo_flags |= BTP_INCOMPLETE_SPLIT;
lopaque->btpo_next = xlrec->rightsib;
lopaque->btpo_cycleid = 0;
PageSetLSN(lpage, lsn);
MarkBufferDirty(lbuf);
}
-
- UnlockReleaseBuffer(lbuf);
}
}
- /* We no longer need the right buffer */
+ /* We no longer need the buffers */
+ if (BufferIsValid(cbuf))
+ UnlockReleaseBuffer(cbuf);
+ if (BufferIsValid(lbuf))
+ UnlockReleaseBuffer(lbuf);
UnlockReleaseBuffer(rbuf);
/*
@@ -445,32 +452,39 @@ btree_xlog_split(bool onleft, bool isroot,
* replay, because no other index update can be in progress, and readers
* will cope properly when following an obsolete left-link.
*/
- if (record->xl_info & XLR_BKP_BLOCK(1))
- (void) RestoreBackupBlock(lsn, record, 1, false, false);
- else if (xlrec->rnext != P_NONE)
+ if (xlrec->rnext != P_NONE)
{
- Buffer buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);
+ /*
+ * the backup block containing the right sibling is 2 or 3, depending on
+ * whether this was a leaf or internal page.
+ */
+ int rnext_index = isleaf ? 2 : 3;
- if (BufferIsValid(buffer))
+ if (record->xl_info & XLR_BKP_BLOCK(rnext_index))
+ (void) RestoreBackupBlock(lsn, record, rnext_index, false, false);
+ else
{
- Page page = (Page) BufferGetPage(buffer);
+ Buffer buffer;
- if (lsn > PageGetLSN(page))
+ buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);
+
+ if (BufferIsValid(buffer))
{
- BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+ Page page = (Page) BufferGetPage(buffer);
- pageop->btpo_prev = xlrec->rightsib;
+ if (lsn > PageGetLSN(page))
+ {
+ BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
- PageSetLSN(page, lsn);
- MarkBufferDirty(buffer);
+ pageop->btpo_prev = xlrec->rightsib;
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ UnlockReleaseBuffer(buffer);
}
- UnlockReleaseBuffer(buffer);
}
}
-
- /* The job ain't done till the parent link is inserted... */
- log_incomplete_split(xlrec->node,
- xlrec->leftsib, xlrec->rightsib, isroot);
}
static void
@@ -946,12 +960,11 @@ btree_xlog_newroot(XLogRecPtr lsn, XLogRecord *record)
{
xl_btree_newroot *xlrec = (xl_btree_newroot *) XLogRecGetData(record);
Buffer buffer;
+ Buffer cbuffer = InvalidBuffer;
Page page;
BTPageOpaque pageop;
- BlockNumber downlink = 0;
-
- /* Backup blocks are not used in newroot records */
- Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+ BlockNumber cblkno;
+ IndexTuple itup;
buffer = XLogReadBuffer(xlrec->node, xlrec->rootblk, true);
Assert(BufferIsValid(buffer));
@@ -969,28 +982,31 @@ btree_xlog_newroot(XLogRecPtr lsn, XLogRecord *record)
if (record->xl_len > SizeOfBtreeNewroot)
{
- IndexTuple itup;
-
_bt_restore_page(page,
(char *) xlrec + SizeOfBtreeNewroot,
record->xl_len - SizeOfBtreeNewroot);
- /* extract downlink to the right-hand split page */
- itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_FIRSTKEY));
- downlink = ItemPointerGetBlockNumber(&(itup->t_tid));
+ /* extract block number of the left-hand split page */
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
+ cblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
Assert(ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+
+ /* Clear the incomplete-split flag in left child */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ cbuffer = RestoreBackupBlock(lsn, record, 0, false, true);
+ else
+ cbuffer = _bt_clear_incomplete_split(lsn, record,
+ xlrec->node, cblkno);
}
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
UnlockReleaseBuffer(buffer);
+ if (BufferIsValid(cbuffer))
+ UnlockReleaseBuffer(cbuffer);
_bt_restore_meta(xlrec->node, lsn,
xlrec->rootblk, xlrec->level,
xlrec->rootblk, xlrec->level);
-
- /* Check to see if this satisfies any incomplete insertions */
- if (record->xl_len > SizeOfBtreeNewroot)
- forget_matching_split(xlrec->node, downlink, true);
}
static void
@@ -1084,55 +1100,19 @@ btree_xlog_cleanup(void)
{
bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
- if (action->is_split)
+ /* finish an incomplete deletion (of a half-dead page) */
+ Buffer buf;
+
+ buf = XLogReadBuffer(action->node, action->delblk, false);
+ if (BufferIsValid(buf))
{
- /* finish an incomplete split */
- Buffer lbuf,
- rbuf;
- Page lpage,
- rpage;
- BTPageOpaque lpageop,
- rpageop;
- bool is_only;
Relation reln;
- lbuf = XLogReadBuffer(action->node, action->leftblk, false);
- /* failure is impossible because we wrote this page earlier */
- if (!BufferIsValid(lbuf))
- elog(PANIC, "btree_xlog_cleanup: left block unfound");
- lpage = (Page) BufferGetPage(lbuf);
- lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
- rbuf = XLogReadBuffer(action->node, action->rightblk, false);
- /* failure is impossible because we wrote this page earlier */
- if (!BufferIsValid(rbuf))
- elog(PANIC, "btree_xlog_cleanup: right block unfound");
- rpage = (Page) BufferGetPage(rbuf);
- rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
-
- /* if the pages are all of their level, it's a only-page split */
- is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop);
-
reln = CreateFakeRelcacheEntry(action->node);
- _bt_insert_parent(reln, lbuf, rbuf, NULL,
- action->is_root, is_only);
+ if (_bt_pagedel(reln, buf, NULL) == 0)
+ elog(PANIC, "btree_xlog_cleanup: _bt_pagedel failed");
FreeFakeRelcacheEntry(reln);
}
- else
- {
- /* finish an incomplete deletion (of a half-dead page) */
- Buffer buf;
-
- buf = XLogReadBuffer(action->node, action->delblk, false);
- if (BufferIsValid(buf))
- {
- Relation reln;
-
- reln = CreateFakeRelcacheEntry(action->node);
- if (_bt_pagedel(reln, buf, NULL) == 0)
- elog(PANIC, "btree_xlog_cleanup: _bt_pagedel failed");
- FreeFakeRelcacheEntry(reln);
- }
- }
}
incomplete_actions = NIL;
}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 3997f94..803073f 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -73,6 +73,7 @@ typedef BTPageOpaqueData *BTPageOpaque;
#define BTP_HALF_DEAD (1 << 4) /* empty, but still in tree */
#define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples */
+#define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */
/*
* The max allowed value of a cycle ID is a bit less than 64K. This is
@@ -178,6 +179,7 @@ typedef struct BTMetaPageData
#define P_ISHALFDEAD(opaque) ((opaque)->btpo_flags & BTP_HALF_DEAD)
#define P_IGNORE(opaque) ((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD))
#define P_HAS_GARBAGE(opaque) ((opaque)->btpo_flags & BTP_HAS_GARBAGE)
+#define P_INCOMPLETE_SPLIT(opaque) ((opaque)->btpo_flags & BTP_INCOMPLETE_SPLIT)
/*
* Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
@@ -254,7 +256,7 @@ typedef struct xl_btree_metadata
typedef struct xl_btree_insert
{
xl_btreetid target; /* inserted tuple id */
- /* BlockNumber downlink field FOLLOWS IF NOT XLOG_BTREE_INSERT_LEAF */
+ /* BlockNumber finishes_split field FOLLOWS IF NOT XLOG_BTREE_INSERT_LEAF */
/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_INSERT_META */
/* INDEX TUPLE FOLLOWS AT END OF STRUCT */
} xl_btree_insert;
@@ -287,19 +289,18 @@ typedef struct xl_btree_split
OffsetNumber firstright; /* first item moved to right page */
/*
- * If level > 0, BlockIdData downlink follows. (We use BlockIdData rather
- * than BlockNumber for alignment reasons: SizeOfBtreeSplit is only 16-bit
- * aligned.)
+ * In the _L variants, next are OffsetNumber newitemoff and the new item.
+ * (In the _R variants, the new item is one of the right page's tuples.)
+ * The new item, but not newitemoff, is suppressed if XLogInsert chooses
+ * to store the left page's whole page image.
*
* If level > 0, an IndexTuple representing the HIKEY of the left page
* follows. We don't need this on leaf pages, because it's the same as
* the leftmost key in the new right page. Also, it's suppressed if
* XLogInsert chooses to store the left page's whole page image.
*
- * In the _L variants, next are OffsetNumber newitemoff and the new item.
- * (In the _R variants, the new item is one of the right page's tuples.)
- * The new item, but not newitemoff, is suppressed if XLogInsert chooses
- * to store the left page's whole page image.
+ * If level > 0, BlockNumber of the page whose incomplete-split flag
+ * this insertion clears. (not aligned)
*
* Last are the right page's tuples in the form used by _bt_restore_page.
*/
@@ -617,8 +618,7 @@ extern Datum btoptions(PG_FUNCTION_ARGS);
extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, Relation heapRel);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
-extern void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
- BTStack stack, bool is_root, bool is_only);
+extern void _bt_finish_split(Relation rel, Buffer lbuf, BTStack stack);
/*
* prototypes for functions in nbtpage.c
@@ -648,7 +648,8 @@ extern BTStack _bt_search(Relation rel,
int keysz, ScanKey scankey, bool nextkey,
Buffer *bufP, int access);
extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
- ScanKey scankey, bool nextkey, int access);
+ ScanKey scankey, bool nextkey,
+ bool finishsplits, BTStack stack, int access);
extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
ScanKey scankey, bool nextkey);
extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
On 25.10.2013 22:13, Heikki Linnakangas wrote:
On 22.10.2013 19:55, Heikki Linnakangas wrote:
I fixed the same problem in GiST a few years back, by making it
tolerate missing downlinks, and inserting them lazily. The B-tree code
tolerates them already on scans, but gets confused on insertion, as seen
above. I propose that we use the same approach I used with GiST, and add
a flag to the page header to indicate "the downlink hasn't been inserted
yet". When insertion (or vacuum) bumps into a flagged page, it can
finish the incomplete action by inserting the downlink.

This is what I came up with.
One thing I'm not totally happy about is the way page deletions of
incompletely split pages are handled. Basically, it just bails out and
refuses to delete a page that is part of an incomplete split. That's
probably OK in practice, as incomplete splits should be very rare
anyway, but it's a bit dissatisfying to not handle the case because at
first glance it seems like it should be even simpler than usual to
delete a page that has no downlink. Nevertheless, I decided to just skip
that for now.

After this patch, deleting the only child of a parent and the parent
itself is still a multi-WAL-record operation that needs to be tracked
during recovery, and completed at the end of recovery. I'd like to
eliminate that too, but that's another patch.
Here's a new version of this, which uses a similar technique to handle
page deletions, eliminating the "incomplete action" tracking code
altogether (from btree). When an internal page is marked as half-dead,
its right sibling is atomically marked with a
"left-sibling-is-half-dead" flag. Whenever an insertion encounters a
page with that flag set, it will finish the deletion of the left sibling
before proceeding with the insertion.
This needs a lot more testing, but I wanted to get this out for review,
in case someone sees a fundamental problem with this.
- Heikki
Attachments:
btree-incomplete-actions-2.patch (text/x-diff)
*** a/src/backend/access/nbtree/README
--- b/src/backend/access/nbtree/README
***************
*** 168,173 **** parent item does still exist and can't have been deleted. Also, because
--- 168,203 ----
we are matching downlink page numbers and not data keys, we don't have any
problem with possibly misidentifying the parent item.
+ VACUUM needs to do a linear scan of an index to search for deleted pages
+ that can be reclaimed because they are older than all open transactions.
+ For efficiency's sake, we'd like to use the same linear scan to search for
+ deletable tuples. Before Postgres 8.2, btbulkdelete scanned the leaf pages
+ in index order, but it is possible to visit them in physical order instead.
+ The tricky part of this is to avoid missing any deletable tuples in the
+ presence of concurrent page splits: a page split could easily move some
+ tuples from a page not yet passed over by the sequential scan to a
+ lower-numbered page already passed over. (This wasn't a concern for the
+ index-order scan, because splits always split right.) To implement this,
+ we provide a "vacuum cycle ID" mechanism that makes it possible to
+ determine whether a page has been split since the current btbulkdelete
+ cycle started. If btbulkdelete finds a page that has been split since
+ it started, and has a right-link pointing to a lower page number, then
+ it temporarily suspends its sequential scan and visits that page instead.
+ It must continue to follow right-links and vacuum dead tuples until
+ reaching a page that either hasn't been split since btbulkdelete started,
+ or is above the location of the outer sequential scan. Then it can resume
+ the sequential scan. This ensures that all tuples are visited. It may be
+ that some tuples are visited twice, but that has no worse effect than an
+ inaccurate index tuple count (and we can't guarantee an accurate count
+ anyway in the face of concurrent activity). Note that this still works
+ if the has-been-recently-split test has a small probability of false
+ positives, so long as it never gives a false negative. This makes it
+ possible to implement the test with a small counter value stored on each
+ index page.
+
+ Page Deletion Algorithm
+ -----------------------
+
We consider deleting an entire page from the btree only when it's become
completely empty of items. (Merging partly-full pages would allow better
space reuse, but it seems impractical to move existing data items left or
***************
*** 216,229 **** The notion of a half-dead page means that the key space relationship between
the half-dead page's level and its parent's level may be a little out of
whack: key space that appears to belong to the half-dead page's parent on the
parent level may really belong to its right sibling. To prevent any possible
! problems, we hold lock on the deleted child page until we have finished
! deleting any now-half-dead parent page(s). This prevents any insertions into
! the transferred keyspace until the operation is complete. The reason for
! doing this is that a sufficiently large number of insertions into the
! transferred keyspace, resulting in multiple page splits, could propagate keys
! from that keyspace into the parent level, resulting in transiently
! out-of-order keys in that level. It is thought that that wouldn't cause any
! serious problem, but it seems too risky to allow.
A deleted page cannot be reclaimed immediately, since there may be other
processes waiting to reference it (ie, search processes that just left the
--- 246,262 ----
the half-dead page's level and its parent's level may be a little out of
whack: key space that appears to belong to the half-dead page's parent on the
parent level may really belong to its right sibling. To prevent any possible
! problems, we mark the half-dead page's right sibling, which now owns the
! keyspace of the deleted page, with a flag indicating that the left sibling
! is half-dead. Any insertion into that page will force the half-dead page to be
! deleted first, similarly to how we require any insertion into an
! incompletely-split page to finish the split first.
!
! The reason for doing this is that a sufficiently large number of insertions
! into the transferred keyspace, resulting in multiple page splits, could
! propagate keys from that keyspace into the parent level, resulting in
! transiently out-of-order keys in that level. It is thought that that wouldn't
! cause any serious problem, but it seems too risky to allow.
A deleted page cannot be reclaimed immediately, since there may be other
processes waiting to reference it (ie, search processes that just left the
***************
*** 298,330 **** as part of the atomic update for the delete (either way, the metapage has
to be the last page locked in the update to avoid deadlock risks). This
avoids race conditions if two such operations are executing concurrently.
- VACUUM needs to do a linear scan of an index to search for deleted pages
- that can be reclaimed because they are older than all open transactions.
- For efficiency's sake, we'd like to use the same linear scan to search for
- deletable tuples. Before Postgres 8.2, btbulkdelete scanned the leaf pages
- in index order, but it is possible to visit them in physical order instead.
- The tricky part of this is to avoid missing any deletable tuples in the
- presence of concurrent page splits: a page split could easily move some
- tuples from a page not yet passed over by the sequential scan to a
- lower-numbered page already passed over. (This wasn't a concern for the
- index-order scan, because splits always split right.) To implement this,
- we provide a "vacuum cycle ID" mechanism that makes it possible to
- determine whether a page has been split since the current btbulkdelete
- cycle started. If btbulkdelete finds a page that has been split since
- it started, and has a right-link pointing to a lower page number, then
- it temporarily suspends its sequential scan and visits that page instead.
- It must continue to follow right-links and vacuum dead tuples until
- reaching a page that either hasn't been split since btbulkdelete started,
- or is above the location of the outer sequential scan. Then it can resume
- the sequential scan. This ensures that all tuples are visited. It may be
- that some tuples are visited twice, but that has no worse effect than an
- inaccurate index tuple count (and we can't guarantee an accurate count
- anyway in the face of concurrent activity). Note that this still works
- if the has-been-recently-split test has a small probability of false
- positives, so long as it never gives a false negative. This makes it
- possible to implement the test with a small counter value stored on each
- index page.
-
On-the-Fly Deletion Of Index Tuples
-----------------------------------
--- 331,336 ----
***************
*** 384,395 **** an additional insertion above that, etc).
For a root split, the followon WAL entry is a "new root" entry rather than
an "insertion" entry, but details are otherwise much the same.
! Because insertion involves multiple atomic actions, the WAL replay logic
! has to detect the case where a page split isn't followed by a matching
! insertion on the parent level, and then do that insertion on its own (and
! recursively for any subsequent parent insertion, of course). This is
! feasible because the WAL entry for the split contains enough info to know
! what must be inserted in the parent level.
When splitting a non-root page that is alone on its level, the required
metapage update (of the "fast root" link) is performed and logged as part
--- 390,430 ----
For a root split, the followon WAL entry is a "new root" entry rather than
an "insertion" entry, but details are otherwise much the same.
! Because splitting involves multiple atomic actions, it's possible that the
! system crashes between splitting a page and inserting the downlink for the
! new half to the parent. After recovery, the downlink for the new page will
! be missing. The search algorithm works correctly, as the page will be found
! by following the right-link from its left sibling, although if a lot of
! downlinks in the tree are missing, performance will suffer. A more serious
! consequence is that if the page without a downlink gets split again, the
! insertion algorithm will fail to find the location in the parent level to
! insert the downlink.
!
! Our approach is to create any missing downlinks on-the-fly, when
! descending the tree for a new insertion. It could be done during searches,
! too, but it seems best not to put any extra updates in what would otherwise
! be a read-only operation (updating is not possible in hot standby mode
! anyway). To identify missing downlinks, when a page is split, the left page
! is flagged to indicate that the split is not yet complete (INCOMPLETE_SPLIT).
! When the downlink is inserted to the parent, the flag is cleared atomically
! with the insertion. The child page is kept locked until the insertion in the
! parent is finished and the flag in the child cleared, but can be released
! immediately after that, before recursing up the tree, if the parent also
! needs to be split. This ensures that incompletely-split pages are not
! seen under normal circumstances; only when insertion to the parent fails
! for some reason.
!
! We flag the left page, even though it's the right page that's missing the
! downlink, because it's more convenient to know already, when following the
! right-link from the left page to the right page, that the right page will
! need to have its downlink inserted to the parent.
!
! We used to keep track of incomplete splits during recovery and finish them
! immediately at end of recovery, instead of doing it lazily at the next
! insertion. However, that made the recovery much more complicated, and only
! fixed the problem when crash recovery was performed. An incomplete split can
! also occur if an otherwise recoverable error, like out-of-memory or
! out-of-disk-space, happens while inserting the downlink to the parent.
When splitting a non-root page that is alone on its level, the required
metapage update (of the "fast root" link) is performed and logged as part
*** a/src/backend/access/nbtree/nbtinsert.c
--- b/src/backend/access/nbtree/nbtinsert.c
***************
*** 58,72 **** static void _bt_findinsertloc(Relation rel,
int keysz,
ScanKey scankey,
IndexTuple newtup,
Relation heapRel);
! static void _bt_insertonpg(Relation rel, Buffer buf,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
bool split_only_page);
! static Buffer _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
! OffsetNumber newitemoff, Size newitemsz,
IndexTuple newitem, bool newitemonleft);
static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
OffsetNumber newitemoff,
Size newitemsz,
--- 58,76 ----
int keysz,
ScanKey scankey,
IndexTuple newtup,
+ BTStack stack,
Relation heapRel);
! static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
bool split_only_page);
! static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
! OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
IndexTuple newitem, bool newitemonleft);
+ static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
+ BTStack stack, bool is_root, bool is_only);
+ static void _bt_finish_split(Relation rel, Buffer lbuf, BTStack stack);
static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
OffsetNumber newitemoff,
Size newitemsz,
***************
*** 130,136 **** top:
* move right in the tree. See Lehman and Yao for an excruciatingly
* precise description.
*/
! buf = _bt_moveright(rel, buf, natts, itup_scankey, false, BT_WRITE);
/*
* If we're not allowing duplicates, make sure the key isn't already in
--- 134,141 ----
* move right in the tree. See Lehman and Yao for an excruciatingly
* precise description.
*/
! buf = _bt_moveright(rel, buf, natts, itup_scankey, false,
! true, stack, BT_WRITE);
/*
* If we're not allowing duplicates, make sure the key isn't already in
***************
*** 183,190 **** top:
*/
CheckForSerializableConflictIn(rel, NULL, buf);
/* do the insertion */
! _bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup, heapRel);
! _bt_insertonpg(rel, buf, stack, itup, offset, false);
}
else
{
--- 188,196 ----
*/
CheckForSerializableConflictIn(rel, NULL, buf);
/* do the insertion */
! _bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup,
! stack, heapRel);
! _bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
}
else
{
***************
*** 508,517 **** _bt_findinsertloc(Relation rel,
int keysz,
ScanKey scankey,
IndexTuple newtup,
Relation heapRel)
{
Buffer buf = *bufptr;
! Page page = BufferGetPage(buf);
Size itemsz;
BTPageOpaque lpageop;
bool movedright,
--- 514,524 ----
int keysz,
ScanKey scankey,
IndexTuple newtup,
+ BTStack stack,
Relation heapRel)
{
Buffer buf = *bufptr;
! Page page;
Size itemsz;
BTPageOpaque lpageop;
bool movedright,
***************
*** 519,524 **** _bt_findinsertloc(Relation rel,
--- 526,532 ----
OffsetNumber newitemoff;
OffsetNumber firstlegaloff = *offsetptr;
+ page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
itemsz = IndexTupleDSize(*newtup);
***************
*** 570,575 **** _bt_findinsertloc(Relation rel,
--- 578,584 ----
while (PageGetFreeSpace(page) < itemsz)
{
Buffer rbuf;
+ BlockNumber rblkno;
/*
* before considering moving right, see if we can obtain enough space
***************
*** 607,624 **** _bt_findinsertloc(Relation rel,
*/
rbuf = InvalidBuffer;
for (;;)
{
- BlockNumber rblkno = lpageop->btpo_next;
-
rbuf = _bt_relandgetbuf(rel, rbuf, rblkno, BT_WRITE);
page = BufferGetPage(rbuf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
if (!P_IGNORE(lpageop))
break;
if (P_RIGHTMOST(lpageop))
elog(ERROR, "fell off the end of index \"%s\"",
RelationGetRelationName(rel));
}
_bt_relbuf(rel, buf);
buf = rbuf;
--- 616,648 ----
*/
rbuf = InvalidBuffer;
+ rblkno = lpageop->btpo_next;
for (;;)
{
rbuf = _bt_relandgetbuf(rel, rbuf, rblkno, BT_WRITE);
page = BufferGetPage(rbuf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ /*
+ * If this page was incompletely split, finish the split now. We
+ * do this while holding a lock on the left sibling, which is not
+ * good because finishing the split could be a fairly lengthy
+ * operation. But this should happen very seldom.
+ */
+ if (P_NEEDS_FIXUP(lpageop))
+ {
+ _bt_fixup(rel, rbuf, stack);
+ rbuf = InvalidBuffer;
+ continue;
+ }
+
if (!P_IGNORE(lpageop))
break;
if (P_RIGHTMOST(lpageop))
elog(ERROR, "fell off the end of index \"%s\"",
RelationGetRelationName(rel));
+
+ rblkno = lpageop->btpo_next;
}
_bt_relbuf(rel, buf);
buf = rbuf;
***************
*** 664,669 **** _bt_findinsertloc(Relation rel,
--- 688,697 ----
* insertion, and the buffer must be pinned and write-locked. On return,
* we will have dropped both the pin and the lock on the buffer.
*
+ * When inserting to a non-leaf page, 'cbuf' is the left-sibling of the
+ * page we're inserting the downlink for. This function will clear the
+ * INCOMPLETE_SPLIT flag on it, and release the buffer.
+ *
* The locking interactions in this code are critical. You should
* grok Lehman and Yao's paper before making any changes. In addition,
* you need to understand how we disambiguate duplicate keys in this
***************
*** 677,682 **** _bt_findinsertloc(Relation rel,
--- 705,711 ----
static void
_bt_insertonpg(Relation rel,
Buffer buf,
+ Buffer cbuf,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
***************
*** 690,695 **** _bt_insertonpg(Relation rel,
--- 719,735 ----
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+ /*
+ * The caller should've finished any incomplete splits and page deletions
+ * already
+ */
+ if (P_INCOMPLETE_SPLIT(lpageop))
+ elog(ERROR, "cannot insert to incompletely-split page %u",
+ BufferGetBlockNumber(buf));
+ if (P_LEFT_HALF_DEAD(lpageop))
+ elog(ERROR, "cannot insert to page %u with half-dead left sibling",
+ BufferGetBlockNumber(buf));
+
itemsz = IndexTupleDSize(*itup);
itemsz = MAXALIGN(itemsz); /* be safe, PageAddItem will do this but we
* need to be consistent */
***************
*** 714,720 **** _bt_insertonpg(Relation rel,
&newitemonleft);
/* split the buffer into left and right halves */
! rbuf = _bt_split(rel, buf, firstright,
newitemoff, itemsz, itup, newitemonleft);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
--- 754,760 ----
&newitemonleft);
/* split the buffer into left and right halves */
! rbuf = _bt_split(rel, buf, cbuf, firstright,
newitemoff, itemsz, itup, newitemonleft);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
***************
*** 788,798 **** _bt_insertonpg(Relation rel,
MarkBufferDirty(metabuf);
}
/* XLOG stuff */
if (RelationNeedsWAL(rel))
{
xl_btree_insert xlrec;
! BlockNumber xldownlink;
xl_btree_metadata xlmeta;
uint8 xlinfo;
XLogRecPtr recptr;
--- 828,848 ----
MarkBufferDirty(metabuf);
}
+ /* clear INCOMPLETE_SPLIT flag on child if this finishes a split */
+ if (!P_ISLEAF(lpageop))
+ {
+ Page cpage = BufferGetPage(cbuf);
+ BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
+ Assert(P_INCOMPLETE_SPLIT(cpageop));
+ cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+ MarkBufferDirty(cbuf);
+ }
+
/* XLOG stuff */
if (RelationNeedsWAL(rel))
{
xl_btree_insert xlrec;
! BlockNumber xlleftchild;
xl_btree_metadata xlmeta;
uint8 xlinfo;
XLogRecPtr recptr;
***************
*** 812,823 **** _bt_insertonpg(Relation rel,
xlinfo = XLOG_BTREE_INSERT_LEAF;
else
{
! xldownlink = ItemPointerGetBlockNumber(&(itup->t_tid));
! Assert(ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
!
! nextrdata->data = (char *) &xldownlink;
nextrdata->len = sizeof(BlockNumber);
! nextrdata->buffer = InvalidBuffer;
nextrdata->next = nextrdata + 1;
nextrdata++;
--- 862,876 ----
xlinfo = XLOG_BTREE_INSERT_LEAF;
else
{
! /*
! * Include the block number of the left child, whose
! * INCOMPLETE_SPLIT flag is cleared.
! */
! xlleftchild = BufferGetBlockNumber(cbuf);
! nextrdata->data = (char *) &xlleftchild;
nextrdata->len = sizeof(BlockNumber);
! nextrdata->buffer = cbuf;
! nextrdata->buffer_std = true;
nextrdata->next = nextrdata + 1;
nextrdata++;
***************
*** 870,875 **** _bt_insertonpg(Relation rel,
--- 923,930 ----
END_CRIT_SECTION();
/* release buffers; send out relcache inval if metapage changed */
+ if (!P_ISLEAF(lpageop))
+ _bt_relbuf(rel, cbuf);
if (BufferIsValid(metabuf))
{
if (!InRecovery)
***************
*** 889,899 **** _bt_insertonpg(Relation rel,
* new right page. newitemoff etc. tell us about the new item that
* must be inserted along with the data from the old page.
*
* Returns the new right sibling of buf, pinned and write-locked.
* The pin and lock on buf are maintained.
*/
static Buffer
! _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
bool newitemonleft)
{
--- 944,958 ----
* new right page. newitemoff etc. tell us about the new item that
* must be inserted along with the data from the old page.
*
+ * When splitting a non-leaf page, 'cbuf' is the left-sibling of the
+ * page we're inserting the downlink for. This function will clear the
+ * INCOMPLETE_SPLIT flag on it, and release the buffer.
+ *
* Returns the new right sibling of buf, pinned and write-locked.
* The pin and lock on buf are maintained.
*/
static Buffer
! _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
bool newitemonleft)
{
***************
*** 961,966 **** _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
--- 1020,1027 ----
lopaque->btpo_flags = oopaque->btpo_flags;
lopaque->btpo_flags &= ~(BTP_ROOT | BTP_SPLIT_END | BTP_HAS_GARBAGE);
ropaque->btpo_flags = lopaque->btpo_flags;
+ /* set flag in left page indicating that the right page has no downlink */
+ lopaque->btpo_flags |= BTP_INCOMPLETE_SPLIT;
lopaque->btpo_prev = oopaque->btpo_prev;
lopaque->btpo_next = rightpagenumber;
ropaque->btpo_prev = origpagenumber;
***************
*** 1184,1189 **** _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
--- 1245,1262 ----
MarkBufferDirty(sbuf);
}
+ /*
+ * Clear INCOMPLETE_SPLIT flag on child if inserting the new item finishes
+ * a split.
+ */
+ if (ropaque->btpo.level > 0)
+ {
+ Page cpage = BufferGetPage(cbuf);
+ BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
+ cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+ MarkBufferDirty(cbuf);
+ }
+
/* XLOG stuff */
if (RelationNeedsWAL(rel))
{
***************
*** 1206,1239 **** _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
lastrdata = &rdata[0];
- if (ropaque->btpo.level > 0)
- {
- /* Log downlink on non-leaf pages */
- lastrdata->next = lastrdata + 1;
- lastrdata++;
-
- lastrdata->data = (char *) &newitem->t_tid.ip_blkid;
- lastrdata->len = sizeof(BlockIdData);
- lastrdata->buffer = InvalidBuffer;
-
- /*
- * We must also log the left page's high key, because the right
- * page's leftmost key is suppressed on non-leaf levels. Show it
- * as belonging to the left page buffer, so that it is not stored
- * if XLogInsert decides it needs a full-page image of the left
- * page.
- */
- lastrdata->next = lastrdata + 1;
- lastrdata++;
-
- itemid = PageGetItemId(origpage, P_HIKEY);
- item = (IndexTuple) PageGetItem(origpage, itemid);
- lastrdata->data = (char *) item;
- lastrdata->len = MAXALIGN(IndexTupleSize(item));
- lastrdata->buffer = buf; /* backup block 1 */
- lastrdata->buffer_std = true;
- }
-
/*
* Log the new item and its offset, if it was inserted on the left
* page. (If it was put on the right page, we don't need to explicitly
--- 1279,1284 ----
***************
*** 1260,1276 **** _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
lastrdata->buffer = buf; /* backup block 1 */
lastrdata->buffer_std = true;
}
! else if (ropaque->btpo.level == 0)
{
/*
! * Although we don't need to WAL-log the new item, we still need
! * XLogInsert to consider storing a full-page image of the left
! * page, so make an empty entry referencing that buffer. This also
! * ensures that the left page is always backup block 1.
*/
lastrdata->next = lastrdata + 1;
lastrdata++;
lastrdata->data = NULL;
lastrdata->len = 0;
lastrdata->buffer = buf; /* backup block 1 */
--- 1305,1344 ----
lastrdata->buffer = buf; /* backup block 1 */
lastrdata->buffer_std = true;
}
!
! /* Log left page */
! if (ropaque->btpo.level > 0)
{
+ lastrdata->next = lastrdata + 1;
+ lastrdata++;
+
/*
! * We must also log the left page's high key, because the right
! * page's leftmost key is suppressed on non-leaf levels. Show it
! * as belonging to the left page buffer, so that it is not stored
! * if XLogInsert decides it needs a full-page image of the left
! * page.
*/
+ itemid = PageGetItemId(origpage, P_HIKEY);
+ item = (IndexTuple) PageGetItem(origpage, itemid);
+ lastrdata->data = (char *) item;
+ lastrdata->len = MAXALIGN(IndexTupleSize(item));
+ lastrdata->buffer = buf; /* backup block 1 */
+ lastrdata->buffer_std = true;
+ }
+
+ if (ropaque->btpo.level == 0 && !newitemonleft)
+ {
lastrdata->next = lastrdata + 1;
lastrdata++;
+ /*
+ * Although we don't need to WAL-log anything on the left page
+ * in this case, we still need XLogInsert to consider storing a
+ * full-page image of the left page, so make an empty entry
+ * referencing that buffer. This also ensures that the left page
+ * is always backup block 1.
+ */
lastrdata->data = NULL;
lastrdata->len = 0;
lastrdata->buffer = buf; /* backup block 1 */
***************
*** 1278,1283 **** _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
--- 1346,1367 ----
}
/*
+ * Log block number of left child, whose INCOMPLETE_SPLIT flag this
+ * insertion clears.
+ */
+ if (ropaque->btpo.level > 0)
+ {
+ BlockNumber cblkno = BufferGetBlockNumber(cbuf);
+ lastrdata->next = lastrdata + 1;
+ lastrdata++;
+
+ lastrdata->data = (char *) &cblkno;
+ lastrdata->len = sizeof(BlockNumber);
+ lastrdata->buffer = cbuf; /* backup block 2 */
+ lastrdata->buffer_std = true;
+ }
+
+ /*
* Log the contents of the right page in the format understood by
* _bt_restore_page(). We set lastrdata->buffer to InvalidBuffer,
* because we're going to recreate the whole page anyway, so it should
***************
*** 1306,1312 **** _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
lastrdata->data = NULL;
lastrdata->len = 0;
! lastrdata->buffer = sbuf; /* backup block 2 */
lastrdata->buffer_std = true;
}
--- 1390,1396 ----
lastrdata->data = NULL;
lastrdata->len = 0;
! lastrdata->buffer = sbuf; /* bkp block 2 (leaf) or 3 (non-leaf) */
lastrdata->buffer_std = true;
}
***************
*** 1333,1338 **** _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
--- 1417,1426 ----
if (!P_RIGHTMOST(ropaque))
_bt_relbuf(rel, sbuf);
+ /* release the child */
+ if (ropaque->btpo.level > 0)
+ _bt_relbuf(rel, cbuf);
+
/* split's done */
return rbuf;
}
***************
*** 1603,1612 **** _bt_checksplitloc(FindSplitData *state,
* have to be efficient (concurrent ROOT split, WAL recovery)
* is_root - we split the true root
* is_only - we split a page alone on its level (might have been fast root)
- *
- * This is exported so it can be called by nbtxlog.c.
*/
! void
_bt_insert_parent(Relation rel,
Buffer buf,
Buffer rbuf,
--- 1691,1698 ----
* have to be efficient (concurrent ROOT split, WAL recovery)
* is_root - we split the true root
* is_only - we split a page alone on its level (might have been fast root)
*/
! static void
_bt_insert_parent(Relation rel,
Buffer buf,
Buffer rbuf,
***************
*** 1685,1696 **** _bt_insert_parent(Relation rel,
* 05/27/97
*/
ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
-
pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
! /* Now we can unlock the children */
_bt_relbuf(rel, rbuf);
- _bt_relbuf(rel, buf);
/* Check for error only after writing children */
if (pbuf == InvalidBuffer)
--- 1771,1783 ----
* 05/27/97
*/
ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
! /*
! * Now we can unlock the right child. The left child will be unlocked
! * by _bt_insertonpg().
! */
_bt_relbuf(rel, rbuf);
/* Check for error only after writing children */
if (pbuf == InvalidBuffer)
***************
*** 1698,1704 **** _bt_insert_parent(Relation rel,
RelationGetRelationName(rel), bknum, rbknum);
/* Recursively update the parent */
! _bt_insertonpg(rel, pbuf, stack->bts_parent,
new_item, stack->bts_offset + 1,
is_only);
--- 1785,1791 ----
RelationGetRelationName(rel), bknum, rbknum);
/* Recursively update the parent */
! _bt_insertonpg(rel, pbuf, buf, stack->bts_parent,
new_item, stack->bts_offset + 1,
is_only);
***************
*** 1708,1713 **** _bt_insert_parent(Relation rel,
--- 1795,1895 ----
}
/*
+ * _bt_fixup() -- Finish incomplete actions on a page.
+ *
+ * A crash or other failure can leave a split or the deletion of a half-dead
+ * page incomplete. The insertion routines will not insert into a page
+ * that is incompletely split, or whose left sibling is half-dead (in which
+ * case the key of the page's downlink is too high, and needs to be replaced
+ * with the half-dead left sibling's downlink key). Before inserting on such
+ * a page, call _bt_fixup() to finish the incomplete action.
+ */
+ void
+ _bt_fixup(Relation rel, Buffer buf, BTStack stack)
+ {
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ if (P_ISHALFDEAD(opaque))
+ {
+ elog(LOG, "finishing deletion of half-dead page %u",
+ BufferGetBlockNumber(buf));
+ (void) _bt_pagedel(rel, buf, stack);
+ }
+ else if (P_LEFT_HALF_DEAD(opaque))
+ {
+ buf = _bt_walk_left(rel, buf);
+
+ elog(LOG, "finishing deletion of half-dead page %u",
+ BufferGetBlockNumber(buf));
+ (void) _bt_pagedel(rel, buf, NULL);
+ }
+ else if (P_INCOMPLETE_SPLIT(opaque))
+ {
+ /* upgrade the lock */
+ buf = _bt_relandgetbuf(rel, buf, BufferGetBlockNumber(buf), BT_WRITE);
+ page = BufferGetPage(buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ if (!P_INCOMPLETE_SPLIT(opaque))
+ _bt_relbuf(rel, buf);
+ else
+ _bt_finish_split(rel, buf, stack);
+ }
+ else
+ _bt_relbuf(rel, buf);
+ }
+
+ /*
+ * _bt_finish_split() -- Finish an incomplete split
+ *
+ * On entry, the caller holds a write-mode lock on lbuf; the lock is
+ * released on exit.
+ */
+ static void
+ _bt_finish_split(Relation rel, Buffer lbuf, BTStack stack)
+ {
+ Page lpage = BufferGetPage(lbuf);
+ BTPageOpaque lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
+ Buffer rbuf;
+ Page rpage;
+ BTPageOpaque rpageop;
+ bool was_root;
+ bool was_only;
+
+ /* Lock right sibling, the one missing the downlink */
+ rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE);
+ rpage = BufferGetPage(rbuf);
+ rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
+
+ /* Could this be a root split? */
+ if (!stack)
+ {
+ Buffer metabuf;
+ Page metapg;
+ BTMetaPageData *metad;
+
+ /* acquire lock on the metapage */
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metapg = BufferGetPage(metabuf);
+ metad = BTPageGetMeta(metapg);
+
+ was_root = (metad->btm_root == BufferGetBlockNumber(lbuf));
+
+ _bt_relbuf(rel, metabuf);
+ }
+ else
+ was_root = false;
+
+ /* Was this the only page on the level before split? */
+ was_only = (P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop));
+
+ elog(DEBUG1, "finishing incomplete split of %u/%u",
+ BufferGetBlockNumber(lbuf), BufferGetBlockNumber(rbuf));
+
+ _bt_insert_parent(rel, lbuf, rbuf, stack, was_root, was_only);
+ }
+
+ /*
* _bt_getstackbuf() -- Walk back up the tree one step, and find the item
* we last looked at in the parent.
*
***************
*** 1739,1744 **** _bt_getstackbuf(Relation rel, BTStack stack, int access)
--- 1921,1932 ----
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ if (P_NEEDS_FIXUP(opaque))
+ {
+ _bt_fixup(rel, buf, stack);
+ continue;
+ }
+
if (!P_IGNORE(opaque))
{
OffsetNumber offnum,
***************
*** 1843,1848 **** _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
--- 2031,2037 ----
rbkno;
BlockNumber rootblknum;
BTPageOpaque rootopaque;
+ BTPageOpaque lopaque;
ItemId itemid;
IndexTuple item;
Size itemsz;
***************
*** 1854,1859 **** _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
--- 2043,2049 ----
lbkno = BufferGetBlockNumber(lbuf);
rbkno = BufferGetBlockNumber(rbuf);
lpage = BufferGetPage(lbuf);
+ lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
/* get a new root page */
rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
***************
*** 1927,1932 **** _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
--- 2117,2127 ----
BufferGetBlockNumber(lbuf), RelationGetRelationName(rel));
pfree(new_item);
+ /* Clear the incomplete-split flag in the left child */
+ Assert(P_INCOMPLETE_SPLIT(lopaque));
+ lopaque->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+ MarkBufferDirty(lbuf);
+
MarkBufferDirty(rootbuf);
MarkBufferDirty(metabuf);
***************
*** 1935,1941 **** _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
{
xl_btree_newroot xlrec;
XLogRecPtr recptr;
! XLogRecData rdata[2];
xlrec.node = rel->rd_node;
xlrec.rootblk = rootblknum;
--- 2130,2136 ----
{
xl_btree_newroot xlrec;
XLogRecPtr recptr;
! XLogRecData rdata[3];
xlrec.node = rel->rd_node;
xlrec.rootblk = rootblknum;
***************
*** 1954,1960 **** _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
rdata[1].len = ((PageHeader) rootpage)->pd_special -
((PageHeader) rootpage)->pd_upper;
rdata[1].buffer = InvalidBuffer;
! rdata[1].next = NULL;
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_NEWROOT, rdata);
--- 2149,2161 ----
rdata[1].len = ((PageHeader) rootpage)->pd_special -
((PageHeader) rootpage)->pd_upper;
rdata[1].buffer = InvalidBuffer;
! rdata[1].next = &(rdata[2]);
!
! /* Make a full-page image of the left child if needed */
! rdata[2].data = NULL;
! rdata[2].len = 0;
! rdata[2].buffer = lbuf;
! rdata[2].next = NULL;
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_NEWROOT, rdata);
*** a/src/backend/access/nbtree/nbtpage.c
--- b/src/backend/access/nbtree/nbtpage.c
***************
*** 980,993 **** _bt_parent_deletion_safe(Relation rel, BlockNumber target, BTStack stack)
Page page;
BTPageOpaque opaque;
- /*
- * In recovery mode, assume the deletion being replayed is valid. We
- * can't always check it because we won't have a full search stack, and we
- * should complain if there's a problem, anyway.
- */
- if (InRecovery)
- return true;
-
/* Locate the parent's downlink (updating the stack entry if needed) */
ItemPointerSet(&(stack->bts_btentry.t_tid), target, P_HIKEY);
pbuf = _bt_getstackbuf(rel, stack, BT_READ);
--- 980,985 ----
***************
*** 1081,1089 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
ScanKey itup_scankey;
Buffer lbuf,
rbuf,
! pbuf;
bool parent_half_dead;
bool parent_one_child;
bool rightsib_empty;
Buffer metabuf = InvalidBuffer;
Page metapg = NULL;
--- 1073,1083 ----
ScanKey itup_scankey;
Buffer lbuf,
rbuf,
! pbuf,
! prbuf;
bool parent_half_dead;
bool parent_one_child;
+ bool leftsib_half_dead;
bool rightsib_empty;
Buffer metabuf = InvalidBuffer;
Page metapg = NULL;
***************
*** 1133,1182 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
*/
if (stack == NULL)
{
! if (!InRecovery)
! {
! /* we need an insertion scan key to do our search, so build one */
! itup_scankey = _bt_mkscankey(rel, targetkey);
! /* find the leftmost leaf page containing this key */
! stack = _bt_search(rel, rel->rd_rel->relnatts, itup_scankey, false,
! &lbuf, BT_READ);
! /* don't need a pin on that either */
! _bt_relbuf(rel, lbuf);
! /*
! * If we are trying to delete an interior page, _bt_search did
! * more than we needed. Locate the stack item pointing to our
! * parent level.
! */
! ilevel = 0;
! for (;;)
! {
! if (stack == NULL)
! elog(ERROR, "not enough stack items");
! if (ilevel == targetlevel)
! break;
! stack = stack->bts_parent;
! ilevel++;
! }
! }
! else
{
! /*
! * During WAL recovery, we can't use _bt_search (for one reason,
! * it might invoke user-defined comparison functions that expect
! * facilities not available in recovery mode). Instead, just set
! * up a dummy stack pointing to the left end of the parent tree
! * level, from which _bt_getstackbuf will walk right to the parent
! * page. Painful, but we don't care too much about performance in
! * this scenario.
! */
! pbuf = _bt_get_endpoint(rel, targetlevel + 1, false);
! stack = (BTStack) palloc(sizeof(BTStackData));
! stack->bts_blkno = BufferGetBlockNumber(pbuf);
! stack->bts_offset = InvalidOffsetNumber;
! /* bts_btentry will be initialized below */
! stack->bts_parent = NULL;
! _bt_relbuf(rel, pbuf);
}
}
--- 1127,1155 ----
*/
if (stack == NULL)
{
! /* we need an insertion scan key to do our search, so build one */
! itup_scankey = _bt_mkscankey(rel, targetkey);
! /* find the leftmost leaf page containing this key */
! stack = _bt_search(rel, rel->rd_rel->relnatts, itup_scankey, false,
! &lbuf, BT_READ);
! /* don't need a pin on that either */
! _bt_relbuf(rel, lbuf);
! lbuf = InvalidBuffer;
! /*
! * If we are trying to delete an interior page, _bt_search did
! * more than we needed. Locate the stack item pointing to our
! * parent level.
! */
! ilevel = 0;
! for (;;)
{
! if (stack == NULL)
! elog(ERROR, "not enough stack items");
! if (ilevel == targetlevel)
! break;
! stack = stack->bts_parent;
! ilevel++;
}
}
***************
*** 1199,1207 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
* target page. The sibling that was current a moment ago could have
* split, so we may have to move right. This search could fail if either
* the sibling or the target page was deleted by someone else meanwhile;
! * if so, give up. (Right now, that should never happen, since page
! * deletion is only done in VACUUM and there shouldn't be multiple VACUUMs
! * concurrently on the same table.)
*/
if (leftsib != P_NONE)
{
--- 1172,1180 ----
* target page. The sibling that was current a moment ago could have
* split, so we may have to move right. This search could fail if either
* the sibling or the target page was deleted by someone else meanwhile;
! * if so, give up. Although page deletion is only initiated by VACUUM,
! * other backends can delete half-dead pages they encounter during
! * insertions.
*/
if (leftsib != P_NONE)
{
***************
*** 1223,1228 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
--- 1196,1203 ----
page = BufferGetPage(lbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
}
+
+ leftsib_half_dead = P_ISHALFDEAD(opaque);
}
else
lbuf = InvalidBuffer;
***************
*** 1311,1316 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
--- 1286,1300 ----
}
/*
+ * Also lock the parent's right sibling, if we need to set its
+ * BTP_LEFT_HALF_DEAD flag.
+ */
+ if (parent_half_dead)
+ prbuf = _bt_getbuf(rel, opaque->btpo_next, BT_WRITE);
+ else
+ prbuf = InvalidBuffer;
+
+ /*
* If we are deleting the next-to-last page on the target's level, then
* the rightsib is a candidate to become the new fast root. (In theory, it
* might be possible to push the fast root even further down, but the odds
***************
*** 1397,1402 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
--- 1381,1390 ----
{
PageIndexTupleDelete(page, poffset);
opaque->btpo_flags |= BTP_HALF_DEAD;
+
+ page = BufferGetPage(prbuf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ opaque->btpo_flags |= BTP_LEFT_HALF_DEAD;
}
else
{
***************
*** 1426,1431 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
--- 1414,1424 ----
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
Assert(opaque->btpo_prev == target);
opaque->btpo_prev = leftsib;
+ /* Update the right page's left-is-half-dead flag */
+ if (leftsib_half_dead)
+ opaque->btpo_flags |= BTP_LEFT_HALF_DEAD;
+ else
+ opaque->btpo_flags &= ~BTP_LEFT_HALF_DEAD;
rightsib_empty = (P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page));
/*
***************
*** 1458,1463 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
--- 1451,1458 ----
MarkBufferDirty(buf);
if (BufferIsValid(lbuf))
MarkBufferDirty(lbuf);
+ if (BufferIsValid(prbuf))
+ MarkBufferDirty(prbuf);
/* XLOG stuff */
if (RelationNeedsWAL(rel))
***************
*** 1466,1472 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
xl_btree_metadata xlmeta;
uint8 xlinfo;
XLogRecPtr recptr;
! XLogRecData rdata[5];
XLogRecData *nextrdata;
xlrec.target.node = rel->rd_node;
--- 1461,1467 ----
xl_btree_metadata xlmeta;
uint8 xlinfo;
XLogRecPtr recptr;
! XLogRecData rdata[6];
XLogRecData *nextrdata;
xlrec.target.node = rel->rd_node;
***************
*** 1474,1479 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
--- 1469,1481 ----
xlrec.deadblk = target;
xlrec.leftblk = leftsib;
xlrec.rightblk = rightsib;
+ xlrec.flags = 0;
+ if (leftsib_half_dead)
+ xlrec.flags |= DP_LEFT_IS_HALF_DEAD;
+ if (BufferIsValid(prbuf))
+ xlrec.parentright = BufferGetBlockNumber(prbuf);
+ else
+ xlrec.parentright = InvalidBlockNumber;
xlrec.btpo_xact = opaque->btpo.xact;
rdata[0].data = (char *) &xlrec;
***************
*** 1502,1517 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
nextrdata->data = NULL;
nextrdata->len = 0;
- nextrdata->next = nextrdata + 1;
nextrdata->buffer = pbuf;
nextrdata->buffer_std = true;
nextrdata++;
nextrdata->data = NULL;
nextrdata->len = 0;
nextrdata->buffer = rbuf;
nextrdata->buffer_std = true;
- nextrdata->next = NULL;
if (BufferIsValid(lbuf))
{
--- 1504,1518 ----
nextrdata->data = NULL;
nextrdata->len = 0;
nextrdata->buffer = pbuf;
nextrdata->buffer_std = true;
+ nextrdata->next = nextrdata + 1;
nextrdata++;
nextrdata->data = NULL;
nextrdata->len = 0;
nextrdata->buffer = rbuf;
nextrdata->buffer_std = true;
if (BufferIsValid(lbuf))
{
***************
*** 1521,1529 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
nextrdata->len = 0;
nextrdata->buffer = lbuf;
nextrdata->buffer_std = true;
- nextrdata->next = NULL;
}
recptr = XLogInsert(RM_BTREE_ID, xlinfo, rdata);
if (BufferIsValid(metabuf))
--- 1522,1540 ----
nextrdata->len = 0;
nextrdata->buffer = lbuf;
nextrdata->buffer_std = true;
}
+ if (BufferIsValid(prbuf))
+ {
+ nextrdata->next = nextrdata + 1;
+ nextrdata++;
+ nextrdata->data = NULL;
+ nextrdata->len = 0;
+ nextrdata->buffer = prbuf;
+ nextrdata->buffer_std = true;
+ }
+ nextrdata->next = NULL;
+
recptr = XLogInsert(RM_BTREE_ID, xlinfo, rdata);
if (BufferIsValid(metabuf))
***************
*** 1541,1546 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
--- 1552,1562 ----
page = BufferGetPage(lbuf);
PageSetLSN(page, recptr);
}
+ if (BufferIsValid(prbuf))
+ {
+ page = BufferGetPage(prbuf);
+ PageSetLSN(page, recptr);
+ }
}
END_CRIT_SECTION();
***************
*** 1551,1593 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
CacheInvalidateRelcache(rel);
_bt_relbuf(rel, metabuf);
}
! /* can always release leftsib immediately */
if (BufferIsValid(lbuf))
_bt_relbuf(rel, lbuf);
/*
* If parent became half dead, recurse to delete it. Otherwise, if right
* sibling is empty and is now the last child of the parent, recurse to
* try to delete it. (These cases cannot apply at the same time, though
* the second case might itself recurse to the first.)
- *
- * When recursing to parent, we hold the lock on the target page until
- * done. This delays any insertions into the keyspace that was just
- * effectively reassigned to the parent's right sibling. If we allowed
- * that, and there were enough such insertions before we finish deleting
- * the parent, page splits within that keyspace could lead to inserting
- * out-of-order keys into the grandparent level. It is thought that that
- * wouldn't have any serious consequences, but it still seems like a
- * pretty bad idea.
*/
if (parent_half_dead)
{
/* recursive call will release pbuf */
_bt_relbuf(rel, rbuf);
result = _bt_pagedel(rel, pbuf, stack->bts_parent) + 1;
- _bt_relbuf(rel, buf);
}
else if (parent_one_child && rightsib_empty)
{
_bt_relbuf(rel, pbuf);
- _bt_relbuf(rel, buf);
/* recursive call will release rbuf */
result = _bt_pagedel(rel, rbuf, stack) + 1;
}
else
{
_bt_relbuf(rel, pbuf);
- _bt_relbuf(rel, buf);
_bt_relbuf(rel, rbuf);
result = 1;
}
--- 1567,1601 ----
CacheInvalidateRelcache(rel);
_bt_relbuf(rel, metabuf);
}
! /* can always release the target buffer and its left sibling immediately */
! _bt_relbuf(rel, buf);
if (BufferIsValid(lbuf))
_bt_relbuf(rel, lbuf);
+ if (BufferIsValid(prbuf))
+ _bt_relbuf(rel, prbuf);
/*
* If parent became half dead, recurse to delete it. Otherwise, if right
* sibling is empty and is now the last child of the parent, recurse to
* try to delete it. (These cases cannot apply at the same time, though
* the second case might itself recurse to the first.)
*/
if (parent_half_dead)
{
/* recursive call will release pbuf */
_bt_relbuf(rel, rbuf);
+
result = _bt_pagedel(rel, pbuf, stack->bts_parent) + 1;
}
else if (parent_one_child && rightsib_empty)
{
_bt_relbuf(rel, pbuf);
/* recursive call will release rbuf */
result = _bt_pagedel(rel, rbuf, stack) + 1;
}
else
{
_bt_relbuf(rel, pbuf);
_bt_relbuf(rel, rbuf);
result = 1;
}
*** a/src/backend/access/nbtree/nbtsearch.c
--- b/src/backend/access/nbtree/nbtsearch.c
***************
*** 29,35 **** static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
- static Buffer _bt_walk_left(Relation rel, Buffer buf);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
--- 29,34 ----
***************
*** 51,57 **** static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
* NOTE that the returned buffer is read-locked regardless of the access
* parameter. However, access = BT_WRITE will allow an empty root page
* to be created and returned. When access = BT_READ, an empty index
! * will result in *bufP being set to InvalidBuffer.
*/
BTStack
_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
--- 50,58 ----
* NOTE that the returned buffer is read-locked regardless of the access
* parameter. However, access = BT_WRITE will allow an empty root page
* to be created and returned. When access = BT_READ, an empty index
! * will result in *bufP being set to InvalidBuffer. Also, in BT_WRITE mode,
! * any incomplete splits encountered during the search will be finished,
! * and any half-dead pages deleted.
*/
BTStack
_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
***************
*** 82,89 **** _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
* Race -- the page we just grabbed may have split since we read its
* pointer in the parent (or metapage). If it has, we may need to
* move right to its new sibling. Do that.
*/
! *bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey, BT_READ);
/* if this is a leaf page, we're done */
page = BufferGetPage(*bufP);
--- 83,95 ----
* Race -- the page we just grabbed may have split since we read its
* pointer in the parent (or metapage). If it has, we may need to
* move right to its new sibling. Do that.
+ *
+ * In write-mode, allow _bt_moveright to finish any incomplete splits
+ * along the way.
*/
! *bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
! (access == BT_WRITE), stack_in,
! BT_READ);
/* if this is a leaf page, we're done */
page = BufferGetPage(*bufP);
***************
*** 148,153 **** _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
--- 154,164 ----
* item >= scankey. When nextkey is true, we are looking for the first
* item strictly greater than scankey.
*
+ * If forupdate is true, we will attempt to finish any incomplete splits and
+ * delete any half-dead pages we encounter. This is required when searching
+ * for a target page for an insertion, because we don't allow inserting on a
+ * page with an incomplete action. 'stack' is only used if forupdate is true.
+ *
* On entry, we have the buffer pinned and a lock of the type specified by
* 'access'. If we move right, we release the buffer and lock and acquire
* the same on the right sibling. Return value is the buffer we stop at.
***************
*** 158,163 **** _bt_moveright(Relation rel,
--- 169,176 ----
int keysz,
ScanKey scankey,
bool nextkey,
+ bool forupdate,
+ BTStack stack,
int access)
{
Page page;
***************
*** 186,199 **** _bt_moveright(Relation rel,
while (!P_RIGHTMOST(opaque) &&
(P_IGNORE(opaque) ||
_bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval))
{
! /* step right one page */
! BlockNumber rblkno = opaque->btpo_next;
! buf = _bt_relandgetbuf(rel, buf, rblkno, access);
! page = BufferGetPage(buf);
! opaque = (BTPageOpaque) PageGetSpecialPointer(page);
}
if (P_IGNORE(opaque))
--- 199,226 ----
while (!P_RIGHTMOST(opaque) &&
(P_IGNORE(opaque) ||
+ (forupdate && P_NEEDS_FIXUP(opaque)) ||
_bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval))
{
! /*
! * Finish any incomplete splits and remove any half-dead pages we
! * encounter along the way.
! */
! if (forupdate && P_NEEDS_FIXUP(opaque))
! {
! BlockNumber blkno = BufferGetBlockNumber(buf);
! _bt_fixup(rel, buf, stack);
! buf = _bt_getbuf(rel, blkno, access);
! page = BufferGetPage(buf);
! opaque = (BTPageOpaque) PageGetSpecialPointer(page);
! }
! else
! {
! /* step right one page */
! BlockNumber rblkno = opaque->btpo_next;
! buf = _bt_relandgetbuf(rel, buf, rblkno, access);
! page = BufferGetPage(buf);
! opaque = (BTPageOpaque) PageGetSpecialPointer(page);
! }
}
if (P_IGNORE(opaque))
***************
*** 1314,1320 **** _bt_steppage(IndexScanDesc scan, ScanDirection dir)
* to be half-dead; the caller should check that condition and step left
* again if it's important.
*/
! static Buffer
_bt_walk_left(Relation rel, Buffer buf)
{
Page page;
--- 1341,1347 ----
* to be half-dead; the caller should check that condition and step left
* again if it's important.
*/
! Buffer
_bt_walk_left(Relation rel, Buffer buf)
{
Page page;
*** a/src/backend/access/nbtree/nbtxlog.c
--- b/src/backend/access/nbtree/nbtxlog.c
***************
*** 21,122 ****
#include "miscadmin.h"
/*
- * We must keep track of expected insertions due to page splits, and apply
- * them manually if they are not seen in the WAL log during replay. This
- * makes it safe for page insertion to be a multiple-WAL-action process.
- *
- * Similarly, deletion of an only child page and deletion of its parent page
- * form multiple WAL log entries, and we have to be prepared to follow through
- * with the deletion if the log ends between.
- *
- * The data structure is a simple linked list --- this should be good enough,
- * since we don't expect a page split or multi deletion to remain incomplete
- * for long. In any case we need to respect the order of operations.
- */
- typedef struct bt_incomplete_action
- {
- RelFileNode node; /* the index */
- bool is_split; /* T = pending split, F = pending delete */
- /* these fields are for a split: */
- bool is_root; /* we split the root */
- BlockNumber leftblk; /* left half of split */
- BlockNumber rightblk; /* right half of split */
- /* these fields are for a delete: */
- BlockNumber delblk; /* parent block to be deleted */
- } bt_incomplete_action;
-
- static List *incomplete_actions;
-
-
- static void
- log_incomplete_split(RelFileNode node, BlockNumber leftblk,
- BlockNumber rightblk, bool is_root)
- {
- bt_incomplete_action *action = palloc(sizeof(bt_incomplete_action));
-
- action->node = node;
- action->is_split = true;
- action->is_root = is_root;
- action->leftblk = leftblk;
- action->rightblk = rightblk;
- incomplete_actions = lappend(incomplete_actions, action);
- }
-
- static void
- forget_matching_split(RelFileNode node, BlockNumber downlink, bool is_root)
- {
- ListCell *l;
-
- foreach(l, incomplete_actions)
- {
- bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
-
- if (RelFileNodeEquals(node, action->node) &&
- action->is_split &&
- downlink == action->rightblk)
- {
- if (is_root != action->is_root)
- elog(LOG, "forget_matching_split: fishy is_root data (expected %d, got %d)",
- action->is_root, is_root);
- incomplete_actions = list_delete_ptr(incomplete_actions, action);
- pfree(action);
- break; /* need not look further */
- }
- }
- }
-
- static void
- log_incomplete_deletion(RelFileNode node, BlockNumber delblk)
- {
- bt_incomplete_action *action = palloc(sizeof(bt_incomplete_action));
-
- action->node = node;
- action->is_split = false;
- action->delblk = delblk;
- incomplete_actions = lappend(incomplete_actions, action);
- }
-
- static void
- forget_matching_deletion(RelFileNode node, BlockNumber delblk)
- {
- ListCell *l;
-
- foreach(l, incomplete_actions)
- {
- bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
-
- if (RelFileNodeEquals(node, action->node) &&
- !action->is_split &&
- delblk == action->delblk)
- {
- incomplete_actions = list_delete_ptr(incomplete_actions, action);
- pfree(action);
- break; /* need not look further */
- }
- }
- }
-
- /*
* _bt_restore_page -- re-enter all the index tuples on a page
*
* The page is freshly init'd, and *from (length len) is a copy of what
--- 21,26 ----
***************
*** 190,212 **** _bt_restore_meta(RelFileNode rnode, XLogRecPtr lsn,
UnlockReleaseBuffer(metabuf);
}
static void
btree_xlog_insert(bool isleaf, bool ismeta,
XLogRecPtr lsn, XLogRecord *record)
{
xl_btree_insert *xlrec = (xl_btree_insert *) XLogRecGetData(record);
Buffer buffer;
Page page;
char *datapos;
int datalen;
xl_btree_metadata md;
! BlockNumber downlink = 0;
datapos = (char *) xlrec + SizeOfBtreeInsert;
datalen = record->xl_len - SizeOfBtreeInsert;
! if (!isleaf)
{
! memcpy(&downlink, datapos, sizeof(BlockNumber));
datapos += sizeof(BlockNumber);
datalen -= sizeof(BlockNumber);
}
--- 94,153 ----
UnlockReleaseBuffer(metabuf);
}
+ /*
+ * _bt_clear_incomplete_split -- clear INCOMPLETE_SPLIT flag on a page
+ *
+ * This is a common subroutine of the redo functions of all the WAL record
+ * types that can insert a downlink: insert, split, and newroot.
+ */
+ static void
+ _bt_clear_incomplete_split(XLogRecPtr lsn, XLogRecord *record,
+ RelFileNode rnode, BlockNumber cblock)
+ {
+ Buffer buf;
+
+ buf = XLogReadBuffer(rnode, cblock, false);
+ if (BufferIsValid(buf))
+ {
+ Page page = (Page) BufferGetPage(buf);
+
+ if (lsn > PageGetLSN(page))
+ {
+ BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+ Assert((pageop->btpo_flags & BTP_INCOMPLETE_SPLIT) != 0);
+ pageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ }
+ UnlockReleaseBuffer(buf);
+ }
+ }
+
static void
btree_xlog_insert(bool isleaf, bool ismeta,
XLogRecPtr lsn, XLogRecord *record)
{
xl_btree_insert *xlrec = (xl_btree_insert *) XLogRecGetData(record);
Buffer buffer;
+ Buffer cbuffer = InvalidBuffer;
Page page;
char *datapos;
int datalen;
xl_btree_metadata md;
! BlockNumber cblkno = 0;
! int main_blk_index;
datapos = (char *) xlrec + SizeOfBtreeInsert;
datalen = record->xl_len - SizeOfBtreeInsert;
! /*
! * If this insert finishes a split at a lower level, extract the block
! * number of the (left) child.
! */
! if (!isleaf && (record->xl_info & XLR_BKP_BLOCK(0)) == 0)
{
! memcpy(&cblkno, datapos, sizeof(BlockNumber));
! Assert(cblkno != 0);
datapos += sizeof(BlockNumber);
datalen -= sizeof(BlockNumber);
}
***************
*** 217,224 **** btree_xlog_insert(bool isleaf, bool ismeta,
datalen -= sizeof(xl_btree_metadata);
}
! if (record->xl_info & XLR_BKP_BLOCK(0))
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
else
{
buffer = XLogReadBuffer(xlrec->target.node,
--- 158,176 ----
datalen -= sizeof(xl_btree_metadata);
}
! if (!isleaf)
! {
! if (record->xl_info & XLR_BKP_BLOCK(0))
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
! else
! _bt_clear_incomplete_split(lsn, record, xlrec->target.node, cblkno);
! main_blk_index = 1;
! }
! else
! main_blk_index = 0;
!
! if (record->xl_info & XLR_BKP_BLOCK(main_blk_index))
! (void) RestoreBackupBlock(lsn, record, main_blk_index, false, false);
else
{
buffer = XLogReadBuffer(xlrec->target.node,
***************
*** 228,238 **** btree_xlog_insert(bool isleaf, bool ismeta,
{
page = (Page) BufferGetPage(buffer);
! if (lsn <= PageGetLSN(page))
! {
! UnlockReleaseBuffer(buffer);
! }
! else
{
if (PageAddItem(page, (Item) datapos, datalen,
ItemPointerGetOffsetNumber(&(xlrec->target.tid)),
--- 180,186 ----
{
page = (Page) BufferGetPage(buffer);
! if (lsn > PageGetLSN(page))
{
if (PageAddItem(page, (Item) datapos, datalen,
ItemPointerGetOffsetNumber(&(xlrec->target.tid)),
***************
*** 241,251 **** btree_xlog_insert(bool isleaf, bool ismeta,
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
- UnlockReleaseBuffer(buffer);
}
}
}
/*
* Note: in normal operation, we'd update the metapage while still holding
* lock on the page we inserted into. But during replay it's not
--- 189,202 ----
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
}
+ UnlockReleaseBuffer(buffer);
}
}
+ if (BufferIsValid(cbuffer))
+ UnlockReleaseBuffer(cbuffer);
+
/*
* Note: in normal operation, we'd update the metapage while still holding
* lock on the page we inserted into. But during replay it's not
***************
*** 257,266 **** btree_xlog_insert(bool isleaf, bool ismeta,
_bt_restore_meta(xlrec->target.node, lsn,
md.root, md.level,
md.fastroot, md.fastlevel);
-
- /* Forget any split this insertion completes */
- if (!isleaf)
- forget_matching_split(xlrec->target.node, downlink, false);
}
static void
--- 208,213 ----
***************
*** 268,273 **** btree_xlog_split(bool onleft, bool isroot,
--- 215,222 ----
XLogRecPtr lsn, XLogRecord *record)
{
xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
+ bool isleaf = (xlrec->level == 0);
+ Buffer lbuf;
Buffer rbuf;
Page rpage;
BTPageOpaque ropaque;
***************
*** 278,319 **** btree_xlog_split(bool onleft, bool isroot,
Size newitemsz = 0;
Item left_hikey = NULL;
Size left_hikeysz = 0;
datapos = (char *) xlrec + SizeOfBtreeSplit;
datalen = record->xl_len - SizeOfBtreeSplit;
! /* Forget any split this insertion completes */
! if (xlrec->level > 0)
! {
! /* we assume SizeOfBtreeSplit is at least 16-bit aligned */
! BlockNumber downlink = BlockIdGetBlockNumber((BlockId) datapos);
!
! datapos += sizeof(BlockIdData);
! datalen -= sizeof(BlockIdData);
!
! forget_matching_split(xlrec->node, downlink, false);
!
! /* Extract left hikey and its size (still assuming 16-bit alignment) */
! if (!(record->xl_info & XLR_BKP_BLOCK(0)))
! {
! /* We assume 16-bit alignment is enough for IndexTupleSize */
! left_hikey = (Item) datapos;
! left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
!
! datapos += left_hikeysz;
! datalen -= left_hikeysz;
! }
! }
!
! /* Extract newitem and newitemoff, if present */
if (onleft)
{
- /* Extract the offset (still assuming 16-bit alignment) */
memcpy(&newitemoff, datapos, sizeof(OffsetNumber));
datapos += sizeof(OffsetNumber);
datalen -= sizeof(OffsetNumber);
}
-
if (onleft && !(record->xl_info & XLR_BKP_BLOCK(0)))
{
/*
--- 227,244 ----
Size newitemsz = 0;
Item left_hikey = NULL;
Size left_hikeysz = 0;
+ BlockNumber cblkno = InvalidBlockNumber;
datapos = (char *) xlrec + SizeOfBtreeSplit;
datalen = record->xl_len - SizeOfBtreeSplit;
! /* Extract newitemoff and newitem, if present */
if (onleft)
{
memcpy(&newitemoff, datapos, sizeof(OffsetNumber));
datapos += sizeof(OffsetNumber);
datalen -= sizeof(OffsetNumber);
}
if (onleft && !(record->xl_info & XLR_BKP_BLOCK(0)))
{
/*
***************
*** 327,332 **** btree_xlog_split(bool onleft, bool isroot,
--- 252,288 ----
datalen -= newitemsz;
}
+ /* Extract left hikey and its size (still assuming 16-bit alignment) */
+ if (!isleaf && !(record->xl_info & XLR_BKP_BLOCK(0)))
+ {
+ left_hikey = (Item) datapos;
+ left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
+ datapos += left_hikeysz;
+ datalen -= left_hikeysz;
+ }
+ /*
+ * If this insertion finishes an incomplete split, get the block number
+ * of the child.
+ */
+ if (!isleaf && !(record->xl_info & XLR_BKP_BLOCK(1)))
+ {
+ memcpy(&cblkno, datapos, sizeof(BlockNumber));
+ datapos += sizeof(BlockNumber);
+ datalen -= sizeof(BlockNumber);
+ }
+
+ /*
+ * Clear the incomplete-split flag on the left sibling of the child page
+ * that this downlink points to.
+ */
+ if (!isleaf)
+ {
+ if (record->xl_info & XLR_BKP_BLOCK(2))
+ (void) RestoreBackupBlock(lsn, record, 2, false, false);
+ else
+ _bt_clear_incomplete_split(lsn, record, xlrec->node, cblkno);
+ }
+
/* Reconstruct right (new) sibling page from scratch */
rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true);
Assert(BufferIsValid(rbuf));
***************
*** 338,344 **** btree_xlog_split(bool onleft, bool isroot,
ropaque->btpo_prev = xlrec->leftsib;
ropaque->btpo_next = xlrec->rnext;
ropaque->btpo.level = xlrec->level;
! ropaque->btpo_flags = (xlrec->level == 0) ? BTP_LEAF : 0;
ropaque->btpo_cycleid = 0;
_bt_restore_page(rpage, datapos, datalen);
--- 294,300 ----
ropaque->btpo_prev = xlrec->leftsib;
ropaque->btpo_next = xlrec->rnext;
ropaque->btpo.level = xlrec->level;
! ropaque->btpo_flags = isleaf ? BTP_LEAF : 0;
ropaque->btpo_cycleid = 0;
_bt_restore_page(rpage, datapos, datalen);
***************
*** 347,353 **** btree_xlog_split(bool onleft, bool isroot,
* On leaf level, the high key of the left page is equal to the first key
* on the right page.
*/
! if (xlrec->level == 0)
{
ItemId hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
--- 303,309 ----
* On leaf level, the high key of the left page is equal to the first key
* on the right page.
*/
! if (isleaf)
{
ItemId hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
***************
*** 362,371 **** btree_xlog_split(bool onleft, bool isroot,
/* Now reconstruct left (original) sibling page */
if (record->xl_info & XLR_BKP_BLOCK(0))
! (void) RestoreBackupBlock(lsn, record, 0, false, false);
else
{
! Buffer lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);
if (BufferIsValid(lbuf))
{
--- 318,327 ----
/* Now reconstruct left (original) sibling page */
if (record->xl_info & XLR_BKP_BLOCK(0))
! lbuf = RestoreBackupBlock(lsn, record, 0, false, true);
else
{
! lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);
if (BufferIsValid(lbuf))
{
***************
*** 422,440 **** btree_xlog_split(bool onleft, bool isroot,
elog(PANIC, "failed to add high key to left page after split");
/* Fix opaque fields */
! lopaque->btpo_flags = (xlrec->level == 0) ? BTP_LEAF : 0;
lopaque->btpo_next = xlrec->rightsib;
lopaque->btpo_cycleid = 0;
PageSetLSN(lpage, lsn);
MarkBufferDirty(lbuf);
}
-
- UnlockReleaseBuffer(lbuf);
}
}
! /* We no longer need the right buffer */
UnlockReleaseBuffer(rbuf);
/*
--- 378,398 ----
elog(PANIC, "failed to add high key to left page after split");
/* Fix opaque fields */
! lopaque->btpo_flags = BTP_INCOMPLETE_SPLIT;
! if (isleaf)
! lopaque->btpo_flags |= BTP_LEAF;
lopaque->btpo_next = xlrec->rightsib;
lopaque->btpo_cycleid = 0;
PageSetLSN(lpage, lsn);
MarkBufferDirty(lbuf);
}
}
}
! /* We no longer need the buffers */
! if (BufferIsValid(lbuf))
! UnlockReleaseBuffer(lbuf);
UnlockReleaseBuffer(rbuf);
/*
***************
*** 445,476 **** btree_xlog_split(bool onleft, bool isroot,
* replay, because no other index update can be in progress, and readers
* will cope properly when following an obsolete left-link.
*/
! if (record->xl_info & XLR_BKP_BLOCK(1))
! (void) RestoreBackupBlock(lsn, record, 1, false, false);
! else if (xlrec->rnext != P_NONE)
{
! Buffer buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);
! if (BufferIsValid(buffer))
{
! Page page = (Page) BufferGetPage(buffer);
! if (lsn > PageGetLSN(page))
{
! BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
! pageop->btpo_prev = xlrec->rightsib;
! PageSetLSN(page, lsn);
! MarkBufferDirty(buffer);
}
- UnlockReleaseBuffer(buffer);
}
}
-
- /* The job ain't done till the parent link is inserted... */
- log_incomplete_split(xlrec->node,
- xlrec->leftsib, xlrec->rightsib, isroot);
}
static void
--- 403,441 ----
* replay, because no other index update can be in progress, and readers
* will cope properly when following an obsolete left-link.
*/
! if (xlrec->rnext != P_NONE)
{
! /*
! * The backup block containing the right sibling is 2 or 3, depending
! * on whether this was a leaf or an internal page.
! */
! int rnext_index = isleaf ? 2 : 3;
! if (record->xl_info & XLR_BKP_BLOCK(rnext_index))
! (void) RestoreBackupBlock(lsn, record, rnext_index, false, false);
! else
{
! Buffer buffer;
! buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);
!
! if (BufferIsValid(buffer))
{
! Page page = (Page) BufferGetPage(buffer);
! if (lsn > PageGetLSN(page))
! {
! BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
! pageop->btpo_prev = xlrec->rightsib;
!
! PageSetLSN(page, lsn);
! MarkBufferDirty(buffer);
! }
! UnlockReleaseBuffer(buffer);
}
}
}
}
static void
***************
*** 850,856 **** btree_xlog_delete_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
}
}
! /* Fix left-link of right sibling */
if (record->xl_info & XLR_BKP_BLOCK(1))
(void) RestoreBackupBlock(lsn, record, 1, false, false);
else
--- 815,821 ----
}
}
! /* Fix left-link and flags of right sibling */
if (record->xl_info & XLR_BKP_BLOCK(1))
(void) RestoreBackupBlock(lsn, record, 1, false, false);
else
***************
*** 867,872 **** btree_xlog_delete_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
--- 832,841 ----
{
pageop = (BTPageOpaque) PageGetSpecialPointer(page);
pageop->btpo_prev = leftsib;
+ if (xlrec->flags & DP_LEFT_IS_HALF_DEAD)
+ pageop->btpo_flags |= BTP_LEFT_HALF_DEAD;
+ else
+ pageop->btpo_flags &= ~BTP_LEFT_HALF_DEAD;
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
***************
*** 876,886 **** btree_xlog_delete_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
}
/* Fix right-link of left sibling, if any */
! if (record->xl_info & XLR_BKP_BLOCK(2))
! (void) RestoreBackupBlock(lsn, record, 2, false, false);
! else
{
! if (leftsib != P_NONE)
{
buffer = XLogReadBuffer(xlrec->target.node, leftsib, false);
if (BufferIsValid(buffer))
--- 845,855 ----
}
/* Fix right-link of left sibling, if any */
! if (leftsib != P_NONE)
{
! if (record->xl_info & XLR_BKP_BLOCK(2))
! (void) RestoreBackupBlock(lsn, record, 2, false, false);
! else
{
buffer = XLogReadBuffer(xlrec->target.node, leftsib, false);
if (BufferIsValid(buffer))
***************
*** 903,908 **** btree_xlog_delete_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
--- 872,910 ----
}
}
+ /*
+ * Set the BTP_LEFT_HALF_DEAD flag in the right sibling of the parent
+ * page that is now half-dead.
+ */
+ if (info == XLOG_BTREE_DELETE_PAGE_HALF)
+ {
+ int blkidx = (leftsib == P_NONE) ? 2 : 3;
+ if (record->xl_info & XLR_BKP_BLOCK(blkidx))
+ (void) RestoreBackupBlock(lsn, record, blkidx, false, false);
+ else
+ {
+ buffer = XLogReadBuffer(xlrec->target.node, xlrec->parentright, false);
+ if (BufferIsValid(buffer))
+ {
+ page = (Page) BufferGetPage(buffer);
+ if (lsn <= PageGetLSN(page))
+ {
+ UnlockReleaseBuffer(buffer);
+ }
+ else
+ {
+ pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+ pageop->btpo_flags |= BTP_LEFT_HALF_DEAD;
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+ }
+ }
+ }
+
+ }
+
/* Rewrite target page as empty deleted page */
buffer = XLogReadBuffer(xlrec->target.node, target, true);
Assert(BufferIsValid(buffer));
***************
*** 932,944 **** btree_xlog_delete_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
md.root, md.level,
md.fastroot, md.fastlevel);
}
-
- /* Forget any completed deletion */
- forget_matching_deletion(xlrec->target.node, target);
-
- /* If parent became half-dead, remember it for deletion */
- if (info == XLOG_BTREE_DELETE_PAGE_HALF)
- log_incomplete_deletion(xlrec->target.node, parent);
}
static void
--- 934,939 ----
***************
*** 948,957 **** btree_xlog_newroot(XLogRecPtr lsn, XLogRecord *record)
Buffer buffer;
Page page;
BTPageOpaque pageop;
! BlockNumber downlink = 0;
!
! /* Backup blocks are not used in newroot records */
! Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
buffer = XLogReadBuffer(xlrec->node, xlrec->rootblk, true);
Assert(BufferIsValid(buffer));
--- 943,949 ----
Buffer buffer;
Page page;
BTPageOpaque pageop;
! BlockNumber cblkno;
buffer = XLogReadBuffer(xlrec->node, xlrec->rootblk, true);
Assert(BufferIsValid(buffer));
***************
*** 974,983 **** btree_xlog_newroot(XLogRecPtr lsn, XLogRecord *record)
_bt_restore_page(page,
(char *) xlrec + SizeOfBtreeNewroot,
record->xl_len - SizeOfBtreeNewroot);
! /* extract downlink to the right-hand split page */
! itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_FIRSTKEY));
! downlink = ItemPointerGetBlockNumber(&(itup->t_tid));
Assert(ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
}
PageSetLSN(page, lsn);
--- 966,981 ----
_bt_restore_page(page,
(char *) xlrec + SizeOfBtreeNewroot,
record->xl_len - SizeOfBtreeNewroot);
! /* extract block number of the left-hand split page */
! itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
! cblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
Assert(ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+
+ /* Clear the incomplete-split flag in left child */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ else
+ _bt_clear_incomplete_split(lsn, record, xlrec->node, cblkno);
}
PageSetLSN(page, lsn);
***************
*** 987,996 **** btree_xlog_newroot(XLogRecPtr lsn, XLogRecord *record)
_bt_restore_meta(xlrec->node, lsn,
xlrec->rootblk, xlrec->level,
xlrec->rootblk, xlrec->level);
-
- /* Check to see if this satisfies any incomplete insertions */
- if (record->xl_len > SizeOfBtreeNewroot)
- forget_matching_split(xlrec->node, downlink, true);
}
static void
--- 985,990 ----
***************
*** 1068,1146 **** btree_redo(XLogRecPtr lsn, XLogRecord *record)
elog(PANIC, "btree_redo: unknown op code %u", info);
}
}
-
- void
- btree_xlog_startup(void)
- {
- incomplete_actions = NIL;
- }
-
- void
- btree_xlog_cleanup(void)
- {
- ListCell *l;
-
- foreach(l, incomplete_actions)
- {
- bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
-
- if (action->is_split)
- {
- /* finish an incomplete split */
- Buffer lbuf,
- rbuf;
- Page lpage,
- rpage;
- BTPageOpaque lpageop,
- rpageop;
- bool is_only;
- Relation reln;
-
- lbuf = XLogReadBuffer(action->node, action->leftblk, false);
- /* failure is impossible because we wrote this page earlier */
- if (!BufferIsValid(lbuf))
- elog(PANIC, "btree_xlog_cleanup: left block unfound");
- lpage = (Page) BufferGetPage(lbuf);
- lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
- rbuf = XLogReadBuffer(action->node, action->rightblk, false);
- /* failure is impossible because we wrote this page earlier */
- if (!BufferIsValid(rbuf))
- elog(PANIC, "btree_xlog_cleanup: right block unfound");
- rpage = (Page) BufferGetPage(rbuf);
- rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
-
- /* if the pages are all of their level, it's a only-page split */
- is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop);
-
- reln = CreateFakeRelcacheEntry(action->node);
- _bt_insert_parent(reln, lbuf, rbuf, NULL,
- action->is_root, is_only);
- FreeFakeRelcacheEntry(reln);
- }
- else
- {
- /* finish an incomplete deletion (of a half-dead page) */
- Buffer buf;
-
- buf = XLogReadBuffer(action->node, action->delblk, false);
- if (BufferIsValid(buf))
- {
- Relation reln;
-
- reln = CreateFakeRelcacheEntry(action->node);
- if (_bt_pagedel(reln, buf, NULL) == 0)
- elog(PANIC, "btree_xlog_cleanup: _bt_pagedel failed");
- FreeFakeRelcacheEntry(reln);
- }
- }
- }
- incomplete_actions = NIL;
- }
-
- bool
- btree_safe_restartpoint(void)
- {
- if (incomplete_actions)
- return false;
- return true;
- }
--- 1062,1064 ----
*** a/src/include/access/nbtree.h
--- b/src/include/access/nbtree.h
***************
*** 73,78 **** typedef BTPageOpaqueData *BTPageOpaque;
--- 73,80 ----
#define BTP_HALF_DEAD (1 << 4) /* empty, but still in tree */
#define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples */
+ #define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */
+ #define BTP_LEFT_HALF_DEAD (1 << 8) /* left sibling is half-dead */
/*
* The max allowed value of a cycle ID is a bit less than 64K. This is
***************
*** 178,183 **** typedef struct BTMetaPageData
--- 180,189 ----
#define P_ISHALFDEAD(opaque) ((opaque)->btpo_flags & BTP_HALF_DEAD)
#define P_IGNORE(opaque) ((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD))
#define P_HAS_GARBAGE(opaque) ((opaque)->btpo_flags & BTP_HAS_GARBAGE)
+ #define P_INCOMPLETE_SPLIT(opaque) ((opaque)->btpo_flags & BTP_INCOMPLETE_SPLIT)
+ #define P_LEFT_HALF_DEAD(opaque) ((opaque)->btpo_flags & BTP_LEFT_HALF_DEAD)
+ #define P_NEEDS_FIXUP(opaque) \
+ ((opaque)->btpo_flags & (BTP_INCOMPLETE_SPLIT | BTP_LEFT_HALF_DEAD))
/*
* Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
***************
*** 254,260 **** typedef struct xl_btree_metadata
typedef struct xl_btree_insert
{
xl_btreetid target; /* inserted tuple id */
! /* BlockNumber downlink field FOLLOWS IF NOT XLOG_BTREE_INSERT_LEAF */
/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_INSERT_META */
/* INDEX TUPLE FOLLOWS AT END OF STRUCT */
} xl_btree_insert;
--- 260,266 ----
typedef struct xl_btree_insert
{
xl_btreetid target; /* inserted tuple id */
! /* BlockNumber finishes_split field FOLLOWS IF NOT XLOG_BTREE_INSERT_LEAF */
/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_INSERT_META */
/* INDEX TUPLE FOLLOWS AT END OF STRUCT */
} xl_btree_insert;
***************
*** 287,305 **** typedef struct xl_btree_split
OffsetNumber firstright; /* first item moved to right page */
/*
! * If level > 0, BlockIdData downlink follows. (We use BlockIdData rather
! * than BlockNumber for alignment reasons: SizeOfBtreeSplit is only 16-bit
! * aligned.)
*
* If level > 0, an IndexTuple representing the HIKEY of the left page
* follows. We don't need this on leaf pages, because it's the same as
* the leftmost key in the new right page. Also, it's suppressed if
* XLogInsert chooses to store the left page's whole page image.
*
! * In the _L variants, next are OffsetNumber newitemoff and the new item.
! * (In the _R variants, the new item is one of the right page's tuples.)
! * The new item, but not newitemoff, is suppressed if XLogInsert chooses
! * to store the left page's whole page image.
*
* Last are the right page's tuples in the form used by _bt_restore_page.
*/
--- 293,310 ----
OffsetNumber firstright; /* first item moved to right page */
/*
! * In the _L variants, next are OffsetNumber newitemoff and the new item.
! * (In the _R variants, the new item is one of the right page's tuples.)
! * The new item, but not newitemoff, is suppressed if XLogInsert chooses
! * to store the left page's whole page image.
*
* If level > 0, an IndexTuple representing the HIKEY of the left page
* follows. We don't need this on leaf pages, because it's the same as
* the leftmost key in the new right page. Also, it's suppressed if
* XLogInsert chooses to store the left page's whole page image.
*
! * If level > 0, BlockNumber of the page whose incomplete-split flag
! * this insertion clears. (not aligned)
*
* Last are the right page's tuples in the form used by _bt_restore_page.
*/
***************
*** 379,393 **** typedef struct xl_btree_vacuum
--- 384,404 ----
typedef struct xl_btree_delete_page
{
xl_btreetid target; /* deleted tuple id in parent page */
+ uint16 flags; /* see below */
BlockNumber deadblk; /* child block being deleted */
BlockNumber leftblk; /* child block's left sibling, if any */
BlockNumber rightblk; /* child block's right sibling */
+ BlockNumber parentright; /* if parent is half-dead, its right sibling */
TransactionId btpo_xact; /* value of btpo.xact for use in recovery */
/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_DELETE_PAGE_META */
} xl_btree_delete_page;
#define SizeOfBtreeDeletePage (offsetof(xl_btree_delete_page, btpo_xact) + sizeof(TransactionId))
+ /* Flags for xl_btree_delete_page */
+ #define DP_LEFT_IS_HALF_DEAD 0x01 /* deleted page's left sibling is
+ * half-dead */
+
/*
* New root log record. There are zero tuples if this is to establish an
* empty root, or two if it is the result of splitting an old root.
***************
*** 617,624 **** extern Datum btoptions(PG_FUNCTION_ARGS);
extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, Relation heapRel);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
! extern void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
! BTStack stack, bool is_root, bool is_only);
/*
* prototypes for functions in nbtpage.c
--- 628,634 ----
extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, Relation heapRel);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
! extern void _bt_fixup(Relation rel, Buffer bbuf, BTStack stack);
/*
* prototypes for functions in nbtpage.c
***************
*** 648,654 **** extern BTStack _bt_search(Relation rel,
int keysz, ScanKey scankey, bool nextkey,
Buffer *bufP, int access);
extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
! ScanKey scankey, bool nextkey, int access);
extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
ScanKey scankey, bool nextkey);
extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
--- 658,665 ----
int keysz, ScanKey scankey, bool nextkey,
Buffer *bufP, int access);
extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
! ScanKey scankey, bool nextkey,
! bool finishsplits, BTStack stack, int access);
extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
ScanKey scankey, bool nextkey);
extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
***************
*** 656,661 **** extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
--- 667,673 ----
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost);
+ extern Buffer _bt_walk_left(Relation rel, Buffer buf);
/*
* prototypes for functions in nbtutils.c
***************
*** 697,704 **** extern void _bt_leafbuild(BTSpool *btspool, BTSpool *spool2);
*/
extern void btree_redo(XLogRecPtr lsn, XLogRecord *record);
extern void btree_desc(StringInfo buf, uint8 xl_info, char *rec);
- extern void btree_xlog_startup(void);
- extern void btree_xlog_cleanup(void);
- extern bool btree_safe_restartpoint(void);
#endif /* NBTREE_H */
--- 709,713 ----
*** a/src/include/access/rmgrlist.h
--- b/src/include/access/rmgrlist.h
***************
*** 36,42 **** PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, NULL, NULL, NULL)
PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, NULL, NULL, NULL)
PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, NULL, NULL, NULL)
PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, NULL, NULL, NULL)
! PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup, btree_safe_restartpoint)
PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, NULL, NULL, NULL)
PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup, gin_safe_restartpoint)
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup, NULL)
--- 36,42 ----
PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, NULL, NULL, NULL)
PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, NULL, NULL, NULL)
PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, NULL, NULL, NULL)
! PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, NULL, NULL, NULL)
PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, NULL, NULL, NULL)
PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup, gin_safe_restartpoint)
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup, NULL)
On 05.11.2013 15:07, Heikki Linnakangas wrote:
On 25.10.2013 22:13, Heikki Linnakangas wrote:
On 22.10.2013 19:55, Heikki Linnakangas wrote:
I fixed the same problem in GiST a few years back, by making it
tolerate missing downlinks, and inserting them lazily. The B-tree code
tolerates them already on scans, but gets confused on insertion, as seen
above. I propose that we use the same approach I used with GiST, and add
a flag to the page header to indicate "the downlink hasn't been inserted
yet". When insertion (or vacuum) bumps into a flagged page, it can
finish the incomplete action by inserting the downlink.

This is what I came up with.
One thing I'm not totally happy about is the way page deletions of
incompletely split pages are handled. Basically, it just bails out and
refuses to delete a page that is part of an incomplete split. That's
probably OK in practice, as incomplete splits should be very rare
anyway, but it's a bit dissatisfying to not handle the case because at
first glance it seems like it should be even simpler than usual to
delete a page that has no downlink. Nevertheless, I decided to just skip
that for now.

After this patch, deleting the only child of a parent and the parent
itself is still a multi-WAL-record operation that needs to be tracked
during recovery, and completed at the end of recovery. I'd like to
eliminate that too, but that's another patch.

Here's a new version of this, which uses a similar technique to handle
page deletions, eliminating the "incomplete action" tracking code
altogether (from btree). When an internal page is marked as half-dead,
its right sibling is atomically marked with a
"left-sibling-is-half-dead" flag. Whenever an insertion encounters a
page with that flag set, it will finish the deletion of the left sibling
before proceeding with the insertion.

This needs a lot more testing, but I wanted to get this out for review,
in case someone sees a fundamental problem with this.
Ok, here's a new version of the patch to handle incomplete B-tree
splits. I rejected the scheme I outlined above about handling half-dead
pages - instead, see the patch in the "Race condition in b-tree page
deletion" thread:
/messages/by-id/5283A2FC.6040606@vmware.com. That
approach to handling half-dead pages makes the incomplete-split stuff a
lot easier, as half-dead pages don't need any special treatment wrt.
splits. This patch should be applied after that patch.
- Heikki
Attachments:
btree-incomplete-splits-2.patch (text/x-diff)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index e57c211..2945e1e 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -168,6 +168,33 @@ parent item does still exist and can't have been deleted. Also, because
we are matching downlink page numbers and not data keys, we don't have any
problem with possibly misidentifying the parent item.
+VACUUM needs to do a linear scan of an index to search for deleted pages
+that can be reclaimed because they are older than all open transactions.
+For efficiency's sake, we'd like to use the same linear scan to search for
+deletable tuples. Before Postgres 8.2, btbulkdelete scanned the leaf pages
+in index order, but it is possible to visit them in physical order instead.
+The tricky part of this is to avoid missing any deletable tuples in the
+presence of concurrent page splits: a page split could easily move some
+tuples from a page not yet passed over by the sequential scan to a
+lower-numbered page already passed over. (This wasn't a concern for the
+index-order scan, because splits always split right.) To implement this,
+we provide a "vacuum cycle ID" mechanism that makes it possible to
+determine whether a page has been split since the current btbulkdelete
+cycle started. If btbulkdelete finds a page that has been split since
+it started, and has a right-link pointing to a lower page number, then
+it temporarily suspends its sequential scan and visits that page instead.
+It must continue to follow right-links and vacuum dead tuples until
+reaching a page that either hasn't been split since btbulkdelete started,
+or is above the location of the outer sequential scan. Then it can resume
+the sequential scan. This ensures that all tuples are visited. It may be
+that some tuples are visited twice, but that has no worse effect than an
+inaccurate index tuple count (and we can't guarantee an accurate count
+anyway in the face of concurrent activity). Note that this still works
+if the has-been-recently-split test has a small probability of false
+positives, so long as it never gives a false negative. This makes it
+possible to implement the test with a small counter value stored on each
+index page.
+
Page Deletion
-------------
@@ -315,33 +342,6 @@ as part of the atomic update for the delete (either way, the metapage has
to be the last page locked in the update to avoid deadlock risks). This
avoids race conditions if two such operations are executing concurrently.
-VACUUM needs to do a linear scan of an index to search for deleted pages
-that can be reclaimed because they are older than all open transactions.
-For efficiency's sake, we'd like to use the same linear scan to search for
-deletable tuples. Before Postgres 8.2, btbulkdelete scanned the leaf pages
-in index order, but it is possible to visit them in physical order instead.
-The tricky part of this is to avoid missing any deletable tuples in the
-presence of concurrent page splits: a page split could easily move some
-tuples from a page not yet passed over by the sequential scan to a
-lower-numbered page already passed over. (This wasn't a concern for the
-index-order scan, because splits always split right.) To implement this,
-we provide a "vacuum cycle ID" mechanism that makes it possible to
-determine whether a page has been split since the current btbulkdelete
-cycle started. If btbulkdelete finds a page that has been split since
-it started, and has a right-link pointing to a lower page number, then
-it temporarily suspends its sequential scan and visits that page instead.
-It must continue to follow right-links and vacuum dead tuples until
-reaching a page that either hasn't been split since btbulkdelete started,
-or is above the location of the outer sequential scan. Then it can resume
-the sequential scan. This ensures that all tuples are visited. It may be
-that some tuples are visited twice, but that has no worse effect than an
-inaccurate index tuple count (and we can't guarantee an accurate count
-anyway in the face of concurrent activity). Note that this still works
-if the has-been-recently-split test has a small probability of false
-positives, so long as it never gives a false negative. This makes it
-possible to implement the test with a small counter value stored on each
-index page.
-
On-the-Fly Deletion Of Index Tuples
-----------------------------------
@@ -401,12 +401,34 @@ an additional insertion above that, etc).
For a root split, the followon WAL entry is a "new root" entry rather than
an "insertion" entry, but details are otherwise much the same.
-Because insertion involves multiple atomic actions, the WAL replay logic
-has to detect the case where a page split isn't followed by a matching
-insertion on the parent level, and then do that insertion on its own (and
-recursively for any subsequent parent insertion, of course). This is
-feasible because the WAL entry for the split contains enough info to know
-what must be inserted in the parent level.
+Because splitting involves multiple atomic actions, it's possible that the
+system crashes between splitting a page and inserting the downlink for the
+new half to the parent. After recovery, the downlink for the new page will
+be missing. The search algorithm works correctly, as the page will be found
+by following the right-link from its left sibling, although if a lot of
+downlinks in the tree are missing, performance will suffer. A more serious
+consequence is that if the page without a downlink gets split again, the
+insertion algorithm will fail to find the location in the parent level to
+insert the downlink.
+
+Our approach is to create any missing downlinks on-the-fly, when
+searching the tree for a new insertion. It could be done during searches,
+too, but it seems best not to put any extra updates in what would otherwise
+be a read-only operation (updating is not possible in hot standby mode
+anyway). To identify missing downlinks, when a page is split, the left page
+is flagged to indicate that the split is not yet complete (INCOMPLETE_SPLIT).
+When the downlink is inserted to the parent, the flag is cleared atomically
+with the insertion. The child page is kept locked until the insertion in the
+parent is finished and the flag in the child cleared, but can be released
+immediately after that, before recursing up the tree, if the parent also
+needs to be split. This ensures that incompletely split pages are not
+seen under normal circumstances; they occur only when insertion into the
+parent fails for some reason.
+
+We flag the left page, even though it's the right page that's missing the
+downlink, because it's more convenient to know already when following the
+right-link from the left page to the right page that it will need to have
+its downlink inserted to the parent.
When splitting a non-root page that is alone on its level, the required
metapage update (of the "fast root" link) is performed and logged as part
@@ -419,6 +441,14 @@ page is a second record. If vacuum is interrupted for some reason, or the
system crashes, the tree is consistent for searches and insertions. The next
VACUUM will find the half-dead leaf page and continue the deletion.
+Before 9.4, we used to keep track of incomplete splits and page deletions
+during recovery and finish them immediately at end of recovery, instead of
+doing it lazily at the next insertion or vacuum. However, that made the
+recovery much more complicated, and only fixed the problem when crash
+recovery was performed. An incomplete split can also occur if an otherwise
+recoverable error, like out-of-memory or out-of-disk-space, happens while
+inserting the downlink to the parent.
+
Scans during Recovery
---------------------
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index a452fea..0c8d079 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -58,15 +58,18 @@ static void _bt_findinsertloc(Relation rel,
int keysz,
ScanKey scankey,
IndexTuple newtup,
+ BTStack stack,
Relation heapRel);
-static void _bt_insertonpg(Relation rel, Buffer buf,
+static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
bool split_only_page);
-static Buffer _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
- OffsetNumber newitemoff, Size newitemsz,
+static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
+ OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
IndexTuple newitem, bool newitemonleft);
+static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
+ BTStack stack, bool is_root, bool is_only);
static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
OffsetNumber newitemoff,
Size newitemsz,
@@ -130,7 +133,8 @@ top:
* move right in the tree. See Lehman and Yao for an excruciatingly
* precise description.
*/
- buf = _bt_moveright(rel, buf, natts, itup_scankey, false, BT_WRITE);
+ buf = _bt_moveright(rel, buf, natts, itup_scankey, false,
+ true, stack, BT_WRITE);
/*
* If we're not allowing duplicates, make sure the key isn't already in
@@ -183,8 +187,9 @@ top:
*/
CheckForSerializableConflictIn(rel, NULL, buf);
/* do the insertion */
- _bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup, heapRel);
- _bt_insertonpg(rel, buf, stack, itup, offset, false);
+ _bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup,
+ stack, heapRel);
+ _bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
}
else
{
@@ -508,10 +513,11 @@ _bt_findinsertloc(Relation rel,
int keysz,
ScanKey scankey,
IndexTuple newtup,
+ BTStack stack,
Relation heapRel)
{
Buffer buf = *bufptr;
- Page page = BufferGetPage(buf);
+ Page page;
Size itemsz;
BTPageOpaque lpageop;
bool movedright,
@@ -519,6 +525,7 @@ _bt_findinsertloc(Relation rel,
OffsetNumber newitemoff;
OffsetNumber firstlegaloff = *offsetptr;
+ page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
itemsz = IndexTupleDSize(*newtup);
@@ -570,6 +577,7 @@ _bt_findinsertloc(Relation rel,
while (PageGetFreeSpace(page) < itemsz)
{
Buffer rbuf;
+ BlockNumber rblkno;
/*
* before considering moving right, see if we can obtain enough space
@@ -607,18 +615,33 @@ _bt_findinsertloc(Relation rel,
*/
rbuf = InvalidBuffer;
+ rblkno = lpageop->btpo_next;
for (;;)
{
- BlockNumber rblkno = lpageop->btpo_next;
-
rbuf = _bt_relandgetbuf(rel, rbuf, rblkno, BT_WRITE);
page = BufferGetPage(rbuf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ /*
+ * If this page was incompletely split, finish that now. We do
+ * this while holding a lock on the left sibling, which is not
+ * good because finishing the split could be a fairly lengthy
+ * operation. But this should happen very seldom.
+ */
+ if (P_INCOMPLETE_SPLIT(lpageop))
+ {
+ _bt_finish_split(rel, rbuf, stack);
+ rbuf = InvalidBuffer;
+ continue;
+ }
+
if (!P_IGNORE(lpageop))
break;
if (P_RIGHTMOST(lpageop))
elog(ERROR, "fell off the end of index \"%s\"",
RelationGetRelationName(rel));
+
+ rblkno = lpageop->btpo_next;
}
_bt_relbuf(rel, buf);
buf = rbuf;
@@ -664,6 +687,10 @@ _bt_findinsertloc(Relation rel,
* insertion, and the buffer must be pinned and write-locked. On return,
* we will have dropped both the pin and the lock on the buffer.
*
+ * When inserting to a non-leaf page, 'cbuf' is the left-sibling of the
+ * page we're inserting the downlink for. This function will clear the
+ * INCOMPLETE_SPLIT flag on it, and release the buffer.
+ *
* The locking interactions in this code are critical. You should
* grok Lehman and Yao's paper before making any changes. In addition,
* you need to understand how we disambiguate duplicate keys in this
@@ -677,6 +704,7 @@ _bt_findinsertloc(Relation rel,
static void
_bt_insertonpg(Relation rel,
Buffer buf,
+ Buffer cbuf,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
@@ -690,6 +718,14 @@ _bt_insertonpg(Relation rel,
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+ /*
+ * The caller should've finished any incomplete splits and page deletions
+ * already
+ */
+ if (P_INCOMPLETE_SPLIT(lpageop))
+ elog(ERROR, "cannot insert to incompletely split page %u",
+ BufferGetBlockNumber(buf));
+
itemsz = IndexTupleDSize(*itup);
itemsz = MAXALIGN(itemsz); /* be safe, PageAddItem will do this but we
* need to be consistent */
@@ -714,7 +750,7 @@ _bt_insertonpg(Relation rel,
&newitemonleft);
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, buf, firstright,
+ rbuf = _bt_split(rel, buf, cbuf, firstright,
newitemoff, itemsz, itup, newitemonleft);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
@@ -788,11 +824,21 @@ _bt_insertonpg(Relation rel,
MarkBufferDirty(metabuf);
}
+ /* clear INCOMPLETE_SPLIT flag on child if this finishes a split */
+ if (!P_ISLEAF(lpageop))
+ {
+ Page cpage = BufferGetPage(cbuf);
+ BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
+ Assert(P_INCOMPLETE_SPLIT(cpageop));
+ cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+ MarkBufferDirty(cbuf);
+ }
+
/* XLOG stuff */
if (RelationNeedsWAL(rel))
{
xl_btree_insert xlrec;
- BlockNumber xldownlink;
+ BlockNumber xlleftchild;
xl_btree_metadata xlmeta;
uint8 xlinfo;
XLogRecPtr recptr;
@@ -812,12 +858,15 @@ _bt_insertonpg(Relation rel,
xlinfo = XLOG_BTREE_INSERT_LEAF;
else
{
- xldownlink = ItemPointerGetBlockNumber(&(itup->t_tid));
- Assert(ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
-
- nextrdata->data = (char *) &xldownlink;
+ /*
+ * Include the block number of the left child, whose
+ * INCOMPLETE_SPLIT flag is cleared.
+ */
+ xlleftchild = BufferGetBlockNumber(cbuf);
+ nextrdata->data = (char *) &xlleftchild;
nextrdata->len = sizeof(BlockNumber);
- nextrdata->buffer = InvalidBuffer;
+ nextrdata->buffer = cbuf;
+ nextrdata->buffer_std = true;
nextrdata->next = nextrdata + 1;
nextrdata++;
@@ -870,6 +919,8 @@ _bt_insertonpg(Relation rel,
END_CRIT_SECTION();
/* release buffers; send out relcache inval if metapage changed */
+ if (!P_ISLEAF(lpageop))
+ _bt_relbuf(rel, cbuf);
if (BufferIsValid(metabuf))
{
if (!InRecovery)
@@ -889,11 +940,15 @@ _bt_insertonpg(Relation rel,
* new right page. newitemoff etc. tell us about the new item that
* must be inserted along with the data from the old page.
*
+ * When splitting a non-leaf page, 'cbuf' is the left-sibling of the
+ * page we're inserting the downlink for. This function will clear the
+ * INCOMPLETE_SPLIT flag on it, and release the buffer.
+ *
* Returns the new right sibling of buf, pinned and write-locked.
* The pin and lock on buf are maintained.
*/
static Buffer
-_bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
+_bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
bool newitemonleft)
{
@@ -961,6 +1016,8 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
lopaque->btpo_flags = oopaque->btpo_flags;
lopaque->btpo_flags &= ~(BTP_ROOT | BTP_SPLIT_END | BTP_HAS_GARBAGE);
ropaque->btpo_flags = lopaque->btpo_flags;
+ /* set flag in left page indicating that the right page has no downlink */
+ lopaque->btpo_flags |= BTP_INCOMPLETE_SPLIT;
lopaque->btpo_prev = oopaque->btpo_prev;
lopaque->btpo_next = rightpagenumber;
ropaque->btpo_prev = origpagenumber;
@@ -1184,6 +1241,18 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
MarkBufferDirty(sbuf);
}
+ /*
+ * Clear INCOMPLETE_SPLIT flag on child if inserting the new item finishes
+ * a split.
+ */
+ if (ropaque->btpo.level > 0)
+ {
+ Page cpage = BufferGetPage(cbuf);
+ BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
+ cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+ MarkBufferDirty(cbuf);
+ }
+
/* XLOG stuff */
if (RelationNeedsWAL(rel))
{
@@ -1206,34 +1275,6 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
lastrdata = &rdata[0];
- if (ropaque->btpo.level > 0)
- {
- /* Log downlink on non-leaf pages */
- lastrdata->next = lastrdata + 1;
- lastrdata++;
-
- lastrdata->data = (char *) &newitem->t_tid.ip_blkid;
- lastrdata->len = sizeof(BlockIdData);
- lastrdata->buffer = InvalidBuffer;
-
- /*
- * We must also log the left page's high key, because the right
- * page's leftmost key is suppressed on non-leaf levels. Show it
- * as belonging to the left page buffer, so that it is not stored
- * if XLogInsert decides it needs a full-page image of the left
- * page.
- */
- lastrdata->next = lastrdata + 1;
- lastrdata++;
-
- itemid = PageGetItemId(origpage, P_HIKEY);
- item = (IndexTuple) PageGetItem(origpage, itemid);
- lastrdata->data = (char *) item;
- lastrdata->len = MAXALIGN(IndexTupleSize(item));
- lastrdata->buffer = buf; /* backup block 1 */
- lastrdata->buffer_std = true;
- }
-
/*
* Log the new item and its offset, if it was inserted on the left
* page. (If it was put on the right page, we don't need to explicitly
@@ -1260,17 +1301,40 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
lastrdata->buffer = buf; /* backup block 1 */
lastrdata->buffer_std = true;
}
- else if (ropaque->btpo.level == 0)
+
+ /* Log left page */
+ if (ropaque->btpo.level > 0)
{
+ lastrdata->next = lastrdata + 1;
+ lastrdata++;
+
/*
- * Although we don't need to WAL-log the new item, we still need
- * XLogInsert to consider storing a full-page image of the left
- * page, so make an empty entry referencing that buffer. This also
- * ensures that the left page is always backup block 1.
+ * We must also log the left page's high key, because the right
+ * page's leftmost key is suppressed on non-leaf levels. Show it
+ * as belonging to the left page buffer, so that it is not stored
+ * if XLogInsert decides it needs a full-page image of the left
+ * page.
*/
+ itemid = PageGetItemId(origpage, P_HIKEY);
+ item = (IndexTuple) PageGetItem(origpage, itemid);
+ lastrdata->data = (char *) item;
+ lastrdata->len = MAXALIGN(IndexTupleSize(item));
+ lastrdata->buffer = buf; /* backup block 1 */
+ lastrdata->buffer_std = true;
+ }
+
+ if (ropaque->btpo.level == 0 && !newitemonleft)
+ {
lastrdata->next = lastrdata + 1;
lastrdata++;
+ /*
+ * Although we don't need to WAL-log anything on the left page,
+ * we still need XLogInsert to consider storing a
+ * full-page image of the left page, so make an empty entry
+ * referencing that buffer. This also ensures that the left page
+ * is always backup block 1.
+ */
lastrdata->data = NULL;
lastrdata->len = 0;
lastrdata->buffer = buf; /* backup block 1 */
@@ -1278,6 +1342,22 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
}
/*
+ * Log block number of left child, whose INCOMPLETE_SPLIT flag this
+ * insertion clears.
+ */
+ if (ropaque->btpo.level > 0)
+ {
+ BlockNumber cblkno = BufferGetBlockNumber(cbuf);
+ lastrdata->next = lastrdata + 1;
+ lastrdata++;
+
+ lastrdata->data = (char *) &cblkno;
+ lastrdata->len = sizeof(BlockNumber);
+ lastrdata->buffer = cbuf; /* backup block 2 */
+ lastrdata->buffer_std = true;
+ }
+
+ /*
* Log the contents of the right page in the format understood by
* _bt_restore_page(). We set lastrdata->buffer to InvalidBuffer,
* because we're going to recreate the whole page anyway, so it should
@@ -1306,7 +1386,7 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
lastrdata->data = NULL;
lastrdata->len = 0;
- lastrdata->buffer = sbuf; /* backup block 2 */
+ lastrdata->buffer = sbuf; /* bkp block 2 (leaf) or 3 (non-leaf) */
lastrdata->buffer_std = true;
}
@@ -1333,6 +1413,10 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
if (!P_RIGHTMOST(ropaque))
_bt_relbuf(rel, sbuf);
+ /* release the child */
+ if (ropaque->btpo.level > 0)
+ _bt_relbuf(rel, cbuf);
+
/* split's done */
return rbuf;
}
@@ -1603,10 +1687,8 @@ _bt_checksplitloc(FindSplitData *state,
* have to be efficient (concurrent ROOT split, WAL recovery)
* is_root - we split the true root
* is_only - we split a page alone on its level (might have been fast root)
- *
- * This is exported so it can be called by nbtxlog.c.
*/
-void
+static void
_bt_insert_parent(Relation rel,
Buffer buf,
Buffer rbuf,
@@ -1685,12 +1767,13 @@ _bt_insert_parent(Relation rel,
* 05/27/97
*/
ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
-
pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
- /* Now we can unlock the children */
+ /*
+ * Now we can unlock the right child. The left child will be unlocked
+ * by _bt_insertonpg().
+ */
_bt_relbuf(rel, rbuf);
- _bt_relbuf(rel, buf);
/* Check for error only after writing children */
if (pbuf == InvalidBuffer)
@@ -1698,7 +1781,7 @@ _bt_insert_parent(Relation rel,
RelationGetRelationName(rel), bknum, rbknum);
/* Recursively update the parent */
- _bt_insertonpg(rel, pbuf, stack->bts_parent,
+ _bt_insertonpg(rel, pbuf, buf, stack->bts_parent,
new_item, stack->bts_offset + 1,
is_only);
@@ -1708,6 +1791,62 @@ _bt_insert_parent(Relation rel,
}
/*
+ * _bt_finish_split() -- Finish an incomplete split
+ *
+ * A crash or other failure can leave a split incomplete. The insertion
+ * routines won't allow insertion on a page that is incompletely split.
+ * Before inserting on such a page, call _bt_finish_split().
+ *
+ * On entry, we hold a write lock on 'lbuf', and the lock is released
+ * on exit.
+ */
+void
+_bt_finish_split(Relation rel, Buffer lbuf, BTStack stack)
+{
+ Page lpage = BufferGetPage(lbuf);
+ BTPageOpaque lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
+ Buffer rbuf;
+ Page rpage;
+ BTPageOpaque rpageop;
+ bool was_root;
+ bool was_only;
+
+ Assert(P_INCOMPLETE_SPLIT(lpageop));
+
+ /* Lock right sibling, the one missing the downlink */
+ rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE);
+ rpage = BufferGetPage(rbuf);
+ rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
+
+ /* Could this be a root split? */
+ if (!stack)
+ {
+ Buffer metabuf;
+ Page metapg;
+ BTMetaPageData *metad;
+
+ /* acquire lock on the metapage */
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metapg = BufferGetPage(metabuf);
+ metad = BTPageGetMeta(metapg);
+
+ was_root = (metad->btm_root == BufferGetBlockNumber(lbuf));
+
+ _bt_relbuf(rel, metabuf);
+ }
+ else
+ was_root = false;
+
+ /* Was this the only page on the level before split? */
+ was_only = (P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop));
+
+ elog(DEBUG1, "finishing incomplete split of %u/%u",
+ BufferGetBlockNumber(lbuf), BufferGetBlockNumber(rbuf));
+
+ _bt_insert_parent(rel, lbuf, rbuf, stack, was_root, was_only);
+}
+
+/*
* _bt_getstackbuf() -- Walk back up the tree one step, and find the item
* we last looked at in the parent.
*
@@ -1739,6 +1878,12 @@ _bt_getstackbuf(Relation rel, BTStack stack, int access)
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ if (access == BT_WRITE && P_INCOMPLETE_SPLIT(opaque))
+ {
+ _bt_finish_split(rel, buf, stack->bts_parent);
+ continue;
+ }
+
if (!P_IGNORE(opaque))
{
OffsetNumber offnum,
@@ -1843,6 +1988,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
rbkno;
BlockNumber rootblknum;
BTPageOpaque rootopaque;
+ BTPageOpaque lopaque;
ItemId itemid;
IndexTuple item;
Size itemsz;
@@ -1854,6 +2000,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
lbkno = BufferGetBlockNumber(lbuf);
rbkno = BufferGetBlockNumber(rbuf);
lpage = BufferGetPage(lbuf);
+ lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
/* get a new root page */
rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
@@ -1927,6 +2074,11 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
BufferGetBlockNumber(lbuf), RelationGetRelationName(rel));
pfree(new_item);
+ /* Clear the incomplete-split flag in the left child */
+ Assert(P_INCOMPLETE_SPLIT(lopaque));
+ lopaque->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+ MarkBufferDirty(lbuf);
+
MarkBufferDirty(rootbuf);
MarkBufferDirty(metabuf);
@@ -1935,7 +2087,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
{
xl_btree_newroot xlrec;
XLogRecPtr recptr;
- XLogRecData rdata[2];
+ XLogRecData rdata[3];
xlrec.node = rel->rd_node;
xlrec.rootblk = rootblknum;
@@ -1954,7 +2106,13 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
rdata[1].len = ((PageHeader) rootpage)->pd_special -
((PageHeader) rootpage)->pd_upper;
rdata[1].buffer = InvalidBuffer;
- rdata[1].next = NULL;
+ rdata[1].next = &(rdata[2]);
+
+ /* Make a full-page image of the left child if needed */
+ rdata[2].data = NULL;
+ rdata[2].len = 0;
+ rdata[2].buffer = lbuf;
+ rdata[2].next = NULL;
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_NEWROOT, rdata);
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index cf8dc47..bc718ea 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1022,21 +1022,60 @@ _bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
/* It's rightmost child... */
if (poffset == P_FIRSTDATAKEY(opaque))
{
+ BlockNumber leftsib;
+
/*
* It's only child, so safe if parent would itself be removable.
* We have to check the parent itself, and then recurse to test
* the conditions at the parent's parent.
+ *
+ * Also check that the page, or its left sibling, isn't
+ * incompletely split. These are the same checks that we already
+ * did for the leaf page, see comments in _bt_mark_page_halfdead().
*/
- if (P_RIGHTMOST(opaque) || P_ISROOT(opaque))
+ if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) ||
+ P_INCOMPLETE_SPLIT(opaque))
{
_bt_relbuf(rel, pbuf);
return false;
}
- *target = child;
+ *target = parent;
*rightsib = opaque->btpo_next;
+ leftsib = opaque->btpo_prev;
_bt_relbuf(rel, pbuf);
+
+ if (leftsib != P_NONE)
+ {
+ Buffer lbuf = _bt_getbuf(rel, leftsib, BT_READ);
+ page = BufferGetPage(lbuf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ while (P_ISDELETED(opaque) || opaque->btpo_next != parent)
+ {
+ /* step right one page */
+ leftsib = opaque->btpo_next;
+ _bt_relbuf(rel, lbuf);
+ if (leftsib == P_NONE)
+ {
+ elog(LOG, "no left sibling (concurrent deletion?) in \"%s\"",
+ RelationGetRelationName(rel));
+ return false;
+ }
+ lbuf = _bt_getbuf(rel, leftsib, BT_READ);
+ page = BufferGetPage(lbuf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ }
+
+ if (P_INCOMPLETE_SPLIT(opaque))
+ {
+ _bt_relbuf(rel, lbuf);
+ return false;
+ }
+ _bt_relbuf(rel, lbuf);
+ }
+
+ /* Ok, this page should have a downlink in the parent. Find it. */
return _bt_lock_branch_parent(rel, parent, stack->bts_parent,
topparent, topoff, target, rightsib);
}
@@ -1124,6 +1163,7 @@ retry:
* that page is not already deleted and is empty.
*/
if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
+ P_INCOMPLETE_SPLIT(opaque) ||
P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page))
{
/* Should never fail to delete a half-dead page */
@@ -1142,7 +1182,7 @@ retry:
{
if (!_bt_mark_page_halfdead(rel, buf, stack))
{
- _bt_relbuf(rel, buf);
+ /* _bt_mark_page_halfdead released the lock */
return 0;
}
}
@@ -1185,10 +1225,16 @@ retry:
return ndeleted;
}
+/*
+ * On success, returns true. In that case we hold a write-lock on 'leafbuf'
+ * on return. If the page cannot be deleted, returns false, and the lock
+ * and pin on 'leafbuf' are released.
+ */
static bool
_bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
{
BlockNumber leafblkno;
+ BlockNumber leafleftsib;
BlockNumber leafrightsib;
BlockNumber target;
BlockNumber rightsib;
@@ -1215,6 +1261,7 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
leafblkno = BufferGetBlockNumber(leafbuf);
itemid = PageGetItemId(page, P_HIKEY);
targetkey = CopyIndexTuple((IndexTuple) PageGetItem(page, itemid));
+ leafleftsib = opaque->btpo_prev;
/*
* To avoid deadlocks, we'd better drop the leaf page lock before going
@@ -1245,15 +1292,75 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
}
/*
+ * If we see any incompletely split pages in the branch, we punt. There
+ * are two scenarios that matter:
+ *
+ * 1. The child page's left sibling was incompletely split. The child has
+ * no downlink. We'd need to transfer the keyspace of the child to its
+ * right sibling. We could do that by removing the downlink for the right
+ * sibling, so that it would appear that the right sibling is the
+ * incompletely split right half of the left sibling. But it would be a
+ * lot more complicated.
+ *
+ * 2. The child page itself was incompletely split. Its right sibling has
+ * no downlink. The child's key space can be transferred to its right
+ * sibling by changing the child's downlink to point to its right sibling
+ * instead. We could do that quite easily, but it hasn't been implemented.
+ *
+ * We already checked the leaf page's incomplete-split flag above. Check
+ * the left sibling. Note that _bt_lock_branch_parent() will perform this
+ * same check for the left sibling of any parent pages we will delete.
+ */
+ if (leafleftsib != P_NONE)
+ {
+ Buffer lbuf = _bt_getbuf(rel, leafleftsib, BT_READ);
+ page = BufferGetPage(lbuf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ while (P_ISDELETED(opaque) || opaque->btpo_next != leafblkno)
+ {
+ /* step right one page */
+ leafleftsib = opaque->btpo_next;
+ _bt_relbuf(rel, lbuf);
+ if (leafleftsib == P_NONE)
+ {
+ elog(LOG, "no left sibling (concurrent deletion?) in \"%s\"",
+ RelationGetRelationName(rel));
+ ReleaseBuffer(leafbuf);
+ return false;
+ }
+ lbuf = _bt_getbuf(rel, leafleftsib, BT_READ);
+ page = BufferGetPage(lbuf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ }
+
+ /*
+ * Found it. Check that it's not incompletely split. Note that it's
+ * enough to check this once before locking the other pages, because
+ * once a page split is complete, it cannot become incomplete later.
+ */
+ if (P_INCOMPLETE_SPLIT(opaque))
+ {
+ _bt_relbuf(rel, lbuf);
+ ReleaseBuffer(leafbuf);
+ return false;
+ }
+ _bt_relbuf(rel, lbuf);
+ }
+
+ /*
* Re-acquire lock on the leaf page, and check that it can still be
* deleted.
*/
LockBuffer(leafbuf, BT_WRITE);
+ page = BufferGetPage(leafbuf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
P_ISHALFDEAD(opaque) || !P_ISLEAF(opaque) ||
+ P_INCOMPLETE_SPLIT(opaque) ||
P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page))
{
+ UnlockReleaseBuffer(leafbuf);
return false;
}
@@ -1274,7 +1381,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
target = leafblkno;
if (!_bt_lock_branch_parent(rel, leafblkno, stack,
&topparent, &topoff, &target, &rightsib))
+ {
+ UnlockReleaseBuffer(leafbuf);
return false;
+ }
/*
* Check that the parent-page index items we're about to delete/overwrite
@@ -1282,10 +1392,9 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
* for some reason. We want to throw any error before entering the
* critical section --- otherwise it'd be a PANIC.
*
- * The test on the target item is just an Assert because
- * _bt_lock_branch_parent should have guaranteed it has the expected
- * contents. The test on the next-child downlink is known to sometimes
- * fail in the field, though.
+ * The test on the target item is just an Assert because _bt_getstackbuf
+ * should have guaranteed it has the expected contents. The test on the
+ * next-child downlink is known to sometimes fail in the field, though.
*/
page = BufferGetPage(topparent);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -1388,6 +1497,8 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
END_CRIT_SECTION();
_bt_relbuf(rel, topparent);
+
+ /* keep the lock on leaf buffer */
return true;
}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index ac98589..ff8c714 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -29,7 +29,6 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
-static Buffer _bt_walk_left(Relation rel, Buffer buf);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
@@ -51,7 +50,9 @@ static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
* NOTE that the returned buffer is read-locked regardless of the access
* parameter. However, access = BT_WRITE will allow an empty root page
* to be created and returned. When access = BT_READ, an empty index
- * will result in *bufP being set to InvalidBuffer.
+ * will result in *bufP being set to InvalidBuffer. Also, in BT_WRITE mode,
+ * any incomplete splits or half-dead pages encountered during the search
+ * will be finished or deleted, respectively.
*/
BTStack
_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
@@ -82,8 +83,13 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
* Race -- the page we just grabbed may have split since we read its
* pointer in the parent (or metapage). If it has, we may need to
* move right to its new sibling. Do that.
+ *
+ * In write-mode, allow _bt_moveright to finish any incomplete splits
+ * along the way.
*/
- *bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey, BT_READ);
+ *bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
+ (access == BT_WRITE), stack_in,
+ BT_READ);
/* if this is a leaf page, we're done */
page = BufferGetPage(*bufP);
@@ -148,6 +154,11 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
* item >= scankey. When nextkey is true, we are looking for the first
* item strictly greater than scankey.
*
+ * If forupdate is true, we will attempt to finish any incomplete splits and
+ * delete any half-dead pages that we encounter. This is required when searching for
+ * a target page for an insertion, because we don't allow inserting on a
+ * page with incomplete actions. 'stack' is only used if forupdate is true.
+ *
* On entry, we have the buffer pinned and a lock of the type specified by
* 'access'. If we move right, we release the buffer and lock and acquire
* the same on the right sibling. Return value is the buffer we stop at.
@@ -158,6 +169,8 @@ _bt_moveright(Relation rel,
int keysz,
ScanKey scankey,
bool nextkey,
+ bool forupdate,
+ BTStack stack,
int access)
{
Page page;
@@ -186,14 +199,43 @@ _bt_moveright(Relation rel,
while (!P_RIGHTMOST(opaque) &&
(P_IGNORE(opaque) ||
+ (forupdate && P_INCOMPLETE_SPLIT(opaque)) ||
_bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval))
{
- /* step right one page */
- BlockNumber rblkno = opaque->btpo_next;
+ /*
+ * Finish any incomplete splits and remove any half-dead pages we
+ * encounter along the way.
+ */
+ if (forupdate && P_INCOMPLETE_SPLIT(opaque))
+ {
+ BlockNumber blkno = BufferGetBlockNumber(buf);
- buf = _bt_relandgetbuf(rel, buf, rblkno, access);
- page = BufferGetPage(buf);
- opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ if (access == BT_READ)
+ {
+ /* upgrade the lock, and re-check */
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(buf, BT_WRITE);
+ if (P_INCOMPLETE_SPLIT(opaque))
+ {
+ _bt_finish_split(rel, buf, stack);
+ buf = _bt_getbuf(rel, blkno, access);
+ }
+ }
+ else
+ {
+ _bt_finish_split(rel, buf, stack);
+ buf = _bt_getbuf(rel, blkno, access);
+ }
+ }
+ else
+ {
+ /* step right one page */
+ BlockNumber rblkno = opaque->btpo_next;
+
+ buf = _bt_relandgetbuf(rel, buf, rblkno, access);
+ page = BufferGetPage(buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ }
}
if (P_IGNORE(opaque))
@@ -1314,7 +1356,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
* to be half-dead; the caller should check that condition and step left
* again if it's important.
*/
-static Buffer
+Buffer
_bt_walk_left(Relation rel, Buffer buf)
{
Page page;
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index ebeb70a..1d0a2a4 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -21,61 +21,6 @@
#include "miscadmin.h"
/*
- * We must keep track of expected insertions due to page splits, and apply
- * them manually if they are not seen in the WAL log during replay. This
- * makes it safe for page insertion to be a multiple-WAL-action process.
- *
- * The data structure is a simple linked list --- this should be good enough,
- * since we don't expect a page split to remain incomplete for long. In any
- * case we need to respect the order of operations.
- */
-typedef struct bt_incomplete_split
-{
- RelFileNode node; /* the index */
- bool is_root; /* we split the root */
- BlockNumber leftblk; /* left half of split */
- BlockNumber rightblk; /* right half of split */
-} bt_incomplete_split;
-
-static List *incomplete_splits;
-
-
-static void
-log_incomplete_split(RelFileNode node, BlockNumber leftblk,
- BlockNumber rightblk, bool is_root)
-{
- bt_incomplete_split *action = palloc(sizeof(bt_incomplete_split));
-
- action->node = node;
- action->is_root = is_root;
- action->leftblk = leftblk;
- action->rightblk = rightblk;
- incomplete_splits = lappend(incomplete_splits, action);
-}
-
-static void
-forget_matching_split(RelFileNode node, BlockNumber downlink, bool is_root)
-{
- ListCell *l;
-
- foreach(l, incomplete_splits)
- {
- bt_incomplete_split *action = (bt_incomplete_split *) lfirst(l);
-
- if (RelFileNodeEquals(node, action->node) &&
- downlink == action->rightblk)
- {
- if (is_root != action->is_root)
- elog(LOG, "forget_matching_split: fishy is_root data (expected %d, got %d)",
- action->is_root, is_root);
- incomplete_splits = list_delete_ptr(incomplete_splits, action);
- pfree(action);
- break; /* need not look further */
- }
- }
-}
-
-/*
* _bt_restore_page -- re-enter all the index tuples on a page
*
* The page is freshly init'd, and *from (length len) is a copy of what
@@ -149,23 +94,60 @@ _bt_restore_meta(RelFileNode rnode, XLogRecPtr lsn,
UnlockReleaseBuffer(metabuf);
}
+/*
+ * _bt_clear_incomplete_split -- clear INCOMPLETE_SPLIT flag on a page
+ *
+ * This is a common subroutine of the redo functions of all the WAL record
+ * types that can insert a downlink: insert, split, and newroot.
+ */
+static void
+_bt_clear_incomplete_split(XLogRecPtr lsn, XLogRecord *record,
+ RelFileNode rnode, BlockNumber cblock)
+{
+ Buffer buf;
+
+ buf = XLogReadBuffer(rnode, cblock, false);
+ if (BufferIsValid(buf))
+ {
+ Page page = (Page) BufferGetPage(buf);
+
+ if (lsn > PageGetLSN(page))
+ {
+ BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+ Assert((pageop->btpo_flags & BTP_INCOMPLETE_SPLIT) != 0);
+ pageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ }
+ UnlockReleaseBuffer(buf);
+ }
+}
+
static void
btree_xlog_insert(bool isleaf, bool ismeta,
XLogRecPtr lsn, XLogRecord *record)
{
xl_btree_insert *xlrec = (xl_btree_insert *) XLogRecGetData(record);
Buffer buffer;
+ Buffer cbuffer = InvalidBuffer;
Page page;
char *datapos;
int datalen;
xl_btree_metadata md;
- BlockNumber downlink = 0;
+ BlockNumber cblkno = 0;
+ int main_blk_index;
datapos = (char *) xlrec + SizeOfBtreeInsert;
datalen = record->xl_len - SizeOfBtreeInsert;
- if (!isleaf)
+ /*
+ * If this insert finishes a split at a lower level, extract the block
+ * number of the (left) child.
+ */
+ if (!isleaf && (record->xl_info & XLR_BKP_BLOCK(0)) == 0)
{
- memcpy(&downlink, datapos, sizeof(BlockNumber));
+ memcpy(&cblkno, datapos, sizeof(BlockNumber));
+ Assert(cblkno != 0);
datapos += sizeof(BlockNumber);
datalen -= sizeof(BlockNumber);
}
@@ -176,8 +158,19 @@ btree_xlog_insert(bool isleaf, bool ismeta,
datalen -= sizeof(xl_btree_metadata);
}
- if (record->xl_info & XLR_BKP_BLOCK(0))
- (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ if (!isleaf)
+ {
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ else
+ _bt_clear_incomplete_split(lsn, record, xlrec->target.node, cblkno);
+ main_blk_index = 1;
+ }
+ else
+ main_blk_index = 0;
+
+ if (record->xl_info & XLR_BKP_BLOCK(main_blk_index))
+ (void) RestoreBackupBlock(lsn, record, main_blk_index, false, false);
else
{
buffer = XLogReadBuffer(xlrec->target.node,
@@ -187,11 +180,7 @@ btree_xlog_insert(bool isleaf, bool ismeta,
{
page = (Page) BufferGetPage(buffer);
- if (lsn <= PageGetLSN(page))
- {
- UnlockReleaseBuffer(buffer);
- }
- else
+ if (lsn > PageGetLSN(page))
{
if (PageAddItem(page, (Item) datapos, datalen,
ItemPointerGetOffsetNumber(&(xlrec->target.tid)),
@@ -200,11 +189,14 @@ btree_xlog_insert(bool isleaf, bool ismeta,
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
- UnlockReleaseBuffer(buffer);
}
+ UnlockReleaseBuffer(buffer);
}
}
+ if (BufferIsValid(cbuffer))
+ UnlockReleaseBuffer(cbuffer);
+
/*
* Note: in normal operation, we'd update the metapage while still holding
* lock on the page we inserted into. But during replay it's not
@@ -216,10 +208,6 @@ btree_xlog_insert(bool isleaf, bool ismeta,
_bt_restore_meta(xlrec->target.node, lsn,
md.root, md.level,
md.fastroot, md.fastlevel);
-
- /* Forget any split this insertion completes */
- if (!isleaf)
- forget_matching_split(xlrec->target.node, downlink, false);
}
static void
@@ -227,6 +215,8 @@ btree_xlog_split(bool onleft, bool isroot,
XLogRecPtr lsn, XLogRecord *record)
{
xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
+ bool isleaf = (xlrec->level == 0);
+ Buffer lbuf;
Buffer rbuf;
Page rpage;
BTPageOpaque ropaque;
@@ -237,42 +227,18 @@ btree_xlog_split(bool onleft, bool isroot,
Size newitemsz = 0;
Item left_hikey = NULL;
Size left_hikeysz = 0;
+ BlockNumber cblkno = InvalidBlockNumber;
datapos = (char *) xlrec + SizeOfBtreeSplit;
datalen = record->xl_len - SizeOfBtreeSplit;
- /* Forget any split this insertion completes */
- if (xlrec->level > 0)
- {
- /* we assume SizeOfBtreeSplit is at least 16-bit aligned */
- BlockNumber downlink = BlockIdGetBlockNumber((BlockId) datapos);
-
- datapos += sizeof(BlockIdData);
- datalen -= sizeof(BlockIdData);
-
- forget_matching_split(xlrec->node, downlink, false);
-
- /* Extract left hikey and its size (still assuming 16-bit alignment) */
- if (!(record->xl_info & XLR_BKP_BLOCK(0)))
- {
- /* We assume 16-bit alignment is enough for IndexTupleSize */
- left_hikey = (Item) datapos;
- left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
-
- datapos += left_hikeysz;
- datalen -= left_hikeysz;
- }
- }
-
- /* Extract newitem and newitemoff, if present */
+ /* Extract newitemoff and newitem, if present */
if (onleft)
{
- /* Extract the offset (still assuming 16-bit alignment) */
memcpy(&newitemoff, datapos, sizeof(OffsetNumber));
datapos += sizeof(OffsetNumber);
datalen -= sizeof(OffsetNumber);
}
-
if (onleft && !(record->xl_info & XLR_BKP_BLOCK(0)))
{
/*
@@ -286,6 +252,37 @@ btree_xlog_split(bool onleft, bool isroot,
datalen -= newitemsz;
}
+ /* Extract left hikey and its size (still assuming 16-bit alignment) */
+ if (!isleaf && !(record->xl_info & XLR_BKP_BLOCK(0)))
+ {
+ left_hikey = (Item) datapos;
+ left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
+ datapos += left_hikeysz;
+ datalen -= left_hikeysz;
+ }
+ /*
+ * If this insertion finishes an incomplete split, get the block number
+ * of the child.
+ */
+ if (!isleaf && !(record->xl_info & XLR_BKP_BLOCK(1)))
+ {
+ memcpy(&cblkno, datapos, sizeof(BlockNumber));
+ datapos += sizeof(BlockNumber);
+ datalen -= sizeof(BlockNumber);
+ }
+
+ /*
+ * Clear the incomplete split flag on the left sibling of the child page
+ * this is a downlink for.
+ */
+ if (!isleaf)
+ {
+ if (record->xl_info & XLR_BKP_BLOCK(2))
+ (void) RestoreBackupBlock(lsn, record, 2, false, false);
+ else
+ _bt_clear_incomplete_split(lsn, record, xlrec->node, cblkno);
+ }
+
/* Reconstruct right (new) sibling page from scratch */
rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true);
Assert(BufferIsValid(rbuf));
@@ -297,7 +294,7 @@ btree_xlog_split(bool onleft, bool isroot,
ropaque->btpo_prev = xlrec->leftsib;
ropaque->btpo_next = xlrec->rnext;
ropaque->btpo.level = xlrec->level;
- ropaque->btpo_flags = (xlrec->level == 0) ? BTP_LEAF : 0;
+ ropaque->btpo_flags = isleaf ? BTP_LEAF : 0;
ropaque->btpo_cycleid = 0;
_bt_restore_page(rpage, datapos, datalen);
@@ -306,7 +303,7 @@ btree_xlog_split(bool onleft, bool isroot,
* On leaf level, the high key of the left page is equal to the first key
* on the right page.
*/
- if (xlrec->level == 0)
+ if (isleaf)
{
ItemId hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
@@ -321,10 +318,10 @@ btree_xlog_split(bool onleft, bool isroot,
/* Now reconstruct left (original) sibling page */
if (record->xl_info & XLR_BKP_BLOCK(0))
- (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ lbuf = RestoreBackupBlock(lsn, record, 0, false, true);
else
{
- Buffer lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);
+ lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);
if (BufferIsValid(lbuf))
{
@@ -381,19 +378,21 @@ btree_xlog_split(bool onleft, bool isroot,
elog(PANIC, "failed to add high key to left page after split");
/* Fix opaque fields */
- lopaque->btpo_flags = (xlrec->level == 0) ? BTP_LEAF : 0;
+ lopaque->btpo_flags = BTP_INCOMPLETE_SPLIT;
+ if (isleaf)
+ lopaque->btpo_flags |= BTP_LEAF;
lopaque->btpo_next = xlrec->rightsib;
lopaque->btpo_cycleid = 0;
PageSetLSN(lpage, lsn);
MarkBufferDirty(lbuf);
}
-
- UnlockReleaseBuffer(lbuf);
}
}
- /* We no longer need the right buffer */
+ /* We no longer need the buffers */
+ if (BufferIsValid(lbuf))
+ UnlockReleaseBuffer(lbuf);
UnlockReleaseBuffer(rbuf);
/*
@@ -404,32 +403,39 @@ btree_xlog_split(bool onleft, bool isroot,
* replay, because no other index update can be in progress, and readers
* will cope properly when following an obsolete left-link.
*/
- if (record->xl_info & XLR_BKP_BLOCK(1))
- (void) RestoreBackupBlock(lsn, record, 1, false, false);
- else if (xlrec->rnext != P_NONE)
+ if (xlrec->rnext != P_NONE)
{
- Buffer buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);
+ /*
+ * The backup block containing the right sibling is 2 or 3, depending
+ * on whether this was a leaf or internal page.
+ */
+ int rnext_index = isleaf ? 2 : 3;
- if (BufferIsValid(buffer))
+ if (record->xl_info & XLR_BKP_BLOCK(rnext_index))
+ (void) RestoreBackupBlock(lsn, record, rnext_index, false, false);
+ else
{
- Page page = (Page) BufferGetPage(buffer);
+ Buffer buffer;
- if (lsn > PageGetLSN(page))
+ buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);
+
+ if (BufferIsValid(buffer))
{
- BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+ Page page = (Page) BufferGetPage(buffer);
- pageop->btpo_prev = xlrec->rightsib;
+ if (lsn > PageGetLSN(page))
+ {
+ BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
- PageSetLSN(page, lsn);
- MarkBufferDirty(buffer);
+ pageop->btpo_prev = xlrec->rightsib;
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ UnlockReleaseBuffer(buffer);
}
- UnlockReleaseBuffer(buffer);
}
}
-
- /* The job ain't done till the parent link is inserted... */
- log_incomplete_split(xlrec->node,
- xlrec->leftsib, xlrec->rightsib, isroot);
}
static void
@@ -938,10 +944,7 @@ btree_xlog_newroot(XLogRecPtr lsn, XLogRecord *record)
Buffer buffer;
Page page;
BTPageOpaque pageop;
- BlockNumber downlink = 0;
-
- /* Backup blocks are not used in newroot records */
- Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+ BlockNumber cblkno;
buffer = XLogReadBuffer(xlrec->node, xlrec->rootblk, true);
Assert(BufferIsValid(buffer));
@@ -964,10 +967,16 @@ btree_xlog_newroot(XLogRecPtr lsn, XLogRecord *record)
_bt_restore_page(page,
(char *) xlrec + SizeOfBtreeNewroot,
record->xl_len - SizeOfBtreeNewroot);
- /* extract downlink to the right-hand split page */
- itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_FIRSTKEY));
- downlink = ItemPointerGetBlockNumber(&(itup->t_tid));
+ /* extract block number of the left-hand split page */
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
+ cblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
Assert(ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+
+ /* Clear the incomplete-split flag in left child */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ else
+ _bt_clear_incomplete_split(lsn, record, xlrec->node, cblkno);
}
PageSetLSN(page, lsn);
@@ -977,10 +986,6 @@ btree_xlog_newroot(XLogRecPtr lsn, XLogRecord *record)
_bt_restore_meta(xlrec->node, lsn,
xlrec->rootblk, xlrec->level,
xlrec->rootblk, xlrec->level);
-
- /* Check to see if this satisfies any incomplete insertions */
- if (record->xl_len > SizeOfBtreeNewroot)
- forget_matching_split(xlrec->node, downlink, true);
}
static void
@@ -1060,59 +1065,3 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record)
elog(PANIC, "btree_redo: unknown op code %u", info);
}
}
-
-void
-btree_xlog_startup(void)
-{
- incomplete_splits = NIL;
-}
-
-void
-btree_xlog_cleanup(void)
-{
- ListCell *l;
-
- foreach(l, incomplete_splits)
- {
- /* finish an incomplete split */
- bt_incomplete_split *action = (bt_incomplete_split *) lfirst(l);
- Buffer lbuf,
- rbuf;
- Page lpage,
- rpage;
- BTPageOpaque lpageop,
- rpageop;
- bool is_only;
- Relation reln;
-
- lbuf = XLogReadBuffer(action->node, action->leftblk, false);
- /* failure is impossible because we wrote this page earlier */
- if (!BufferIsValid(lbuf))
- elog(PANIC, "btree_xlog_cleanup: left block unfound");
- lpage = (Page) BufferGetPage(lbuf);
- lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
- rbuf = XLogReadBuffer(action->node, action->rightblk, false);
- /* failure is impossible because we wrote this page earlier */
- if (!BufferIsValid(rbuf))
- elog(PANIC, "btree_xlog_cleanup: right block unfound");
- rpage = (Page) BufferGetPage(rbuf);
- rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
-
- /* if the pages are all of their level, it's a only-page split */
- is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop);
-
- reln = CreateFakeRelcacheEntry(action->node);
- _bt_insert_parent(reln, lbuf, rbuf, NULL,
- action->is_root, is_only);
- FreeFakeRelcacheEntry(reln);
- }
- incomplete_splits = NIL;
-}
-
-bool
-btree_safe_restartpoint(void)
-{
- if (incomplete_splits)
- return false;
- return true;
-}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d9c39df..ef56212 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -73,6 +73,7 @@ typedef BTPageOpaqueData *BTPageOpaque;
#define BTP_HALF_DEAD (1 << 4) /* empty, but still in tree */
#define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples */
+#define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */
/*
* The max allowed value of a cycle ID is a bit less than 64K. This is
@@ -178,6 +179,7 @@ typedef struct BTMetaPageData
#define P_ISHALFDEAD(opaque) ((opaque)->btpo_flags & BTP_HALF_DEAD)
#define P_IGNORE(opaque) ((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD))
#define P_HAS_GARBAGE(opaque) ((opaque)->btpo_flags & BTP_HAS_GARBAGE)
+#define P_INCOMPLETE_SPLIT(opaque) ((opaque)->btpo_flags & BTP_INCOMPLETE_SPLIT)
/*
* Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
@@ -253,7 +255,7 @@ typedef struct xl_btree_metadata
typedef struct xl_btree_insert
{
xl_btreetid target; /* inserted tuple id */
- /* BlockNumber downlink field FOLLOWS IF NOT XLOG_BTREE_INSERT_LEAF */
+ /* BlockNumber finishes_split field FOLLOWS IF NOT XLOG_BTREE_INSERT_LEAF */
/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_INSERT_META */
/* INDEX TUPLE FOLLOWS AT END OF STRUCT */
} xl_btree_insert;
@@ -286,19 +288,18 @@ typedef struct xl_btree_split
OffsetNumber firstright; /* first item moved to right page */
/*
- * If level > 0, BlockIdData downlink follows. (We use BlockIdData rather
- * than BlockNumber for alignment reasons: SizeOfBtreeSplit is only 16-bit
- * aligned.)
+ * In the _L variants, next are OffsetNumber newitemoff and the new item.
+ * (In the _R variants, the new item is one of the right page's tuples.)
+ * The new item, but not newitemoff, is suppressed if XLogInsert chooses
+ * to store the left page's whole page image.
*
* If level > 0, an IndexTuple representing the HIKEY of the left page
* follows. We don't need this on leaf pages, because it's the same as
* the leftmost key in the new right page. Also, it's suppressed if
* XLogInsert chooses to store the left page's whole page image.
*
- * In the _L variants, next are OffsetNumber newitemoff and the new item.
- * (In the _R variants, the new item is one of the right page's tuples.)
- * The new item, but not newitemoff, is suppressed if XLogInsert chooses
- * to store the left page's whole page image.
+ * If level > 0, BlockNumber of the page whose incomplete-split flag
+ * this insertion clears. (not aligned)
*
* Last are the right page's tuples in the form used by _bt_restore_page.
*/
@@ -635,8 +636,7 @@ extern Datum btoptions(PG_FUNCTION_ARGS);
extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, Relation heapRel);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
-extern void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
- BTStack stack, bool is_root, bool is_only);
+extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
/*
* prototypes for functions in nbtpage.c
@@ -666,7 +666,8 @@ extern BTStack _bt_search(Relation rel,
int keysz, ScanKey scankey, bool nextkey,
Buffer *bufP, int access);
extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
- ScanKey scankey, bool nextkey, int access);
+ ScanKey scankey, bool nextkey,
+ bool finishsplits, BTStack stack, int access);
extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
ScanKey scankey, bool nextkey);
extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
@@ -674,6 +675,7 @@ extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost);
+extern Buffer _bt_walk_left(Relation rel, Buffer buf);
/*
* prototypes for functions in nbtutils.c
@@ -715,8 +717,5 @@ extern void _bt_leafbuild(BTSpool *btspool, BTSpool *spool2);
*/
extern void btree_redo(XLogRecPtr lsn, XLogRecord *record);
extern void btree_desc(StringInfo buf, uint8 xl_info, char *rec);
-extern void btree_xlog_startup(void);
-extern void btree_xlog_cleanup(void);
-extern bool btree_safe_restartpoint(void);
#endif /* NBTREE_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 7ad71b3..7de2a96 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, NULL, NULL, NULL)
PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, NULL, NULL, NULL)
PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, NULL, NULL, NULL)
PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, NULL, NULL, NULL)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup, btree_safe_restartpoint)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, NULL, NULL, NULL)
PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, NULL, NULL, NULL)
PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup, gin_safe_restartpoint)
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup, NULL)
On Thu, Nov 14, 2013 at 9:23 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
Ok, here's a new version of the patch to handle incomplete B-tree splits.
I finally got around to taking a look at this. Unlike with the as-yet
uncommitted "Race condition in b-tree page deletion" patch that Kevin
looked at, on which this patch depends, I could not really
consider your patch in isolation. I'm really reviewing both patches,
but with a particular focus on this recent set of additions
(btree-incomplete-splits-2.patch), and the issues addressed by it.
However, since the distinction between this patch and the patch that
it has a dependency on is fuzzy, expect me to be fuzzy in
differentiating the two. This patch is only very loosely an atomic
unit. This is not a criticism - I understand that it just turned out
that way.
The first thing I noticed about this patchset is that it completely
expunges btree_xlog_startup(), btree_xlog_cleanup() and
btree_safe_restartpoint(). The post-recovery cleanup that previously
occurred to address both sets of problems (the problem addressed by
this patch, incomplete page splits, and the problem addressed by the
dependency patch, incomplete page deletions) now both occur
opportunistically/lazily. Notably, this means that there are now
exactly zero entries in the resource manager list with a
'restartpoint' callback. I guess we should consider removing it
entirely for that reason. As you said in the commit message where
gin_safe_restartpoint was similarly retired (commit 631118fe):
"""
The post-recovery cleanup mechanism was never totally reliable, as insertion
to the parent could fail e.g because of running out of memory or disk space,
leaving the tree in an inconsistent state.
"""
I'm of the general impression that post-recovery cleanup is
questionable. I'm surprised that you didn't mention this commit
previously. You just mentioned the original analogous work on GiST,
but this certainly seems to be something you've been thinking about
*systematically* eliminating for a while.
So while post-recovery callbacks no longer exist for any
rmgr-managed-resource, 100% of remaining startup and cleanup callbacks
concern the simple management of memory of AM-specific recovery
contexts (for GiST, GiN and SP-GiST). I have to wonder if there isn't
a better abstraction than that, such as a generic recovery memory
context, allowing you to retire all 3 callbacks. I mean, StartupXLOG()
just calls those callbacks for each resource at exactly the same time
anyway, just as it initializes resource managers in precisely the same
manner earlier on. Plus if you look at what those AM-local memory
management routines do, it all seems very simple.
I think you should bump XLOG_PAGE_MAGIC.
Perhaps you should increase the elevel here:
+ elog(DEBUG1, "finishing incomplete split of %u/%u",
+ BufferGetBlockNumber(lbuf), BufferGetBlockNumber(rbuf));
Since this is a very rare occurrence that involves some subtle
interactions, I can't think why you wouldn't want to LOG this for
forensic purposes.
Why did you remove the local linkage of _bt_walk_left(), given that it
is called in exactly the same way as before? I guess that that is just
a vestige of some earlier revision.
I think I see some bugs in _bt_moveright(). If you examine
_bt_finish_split() in detail, you'll see that it doesn't just drop the
write buffer lock that the caller will have provided (per its
comments) - it also drops the buffer pin. It calls _bt_insert_parent()
last, which was previously only called last thing by _bt_insertonpg()
(some of the time), and _bt_insertonpg() is indeed documented to drop
both the lock and the pin. And if you look at _bt_moveright() again,
you'll see that there is a tacit assumption that the buffer pin isn't
dropped, or at least that "opaque" (the BTPageOpaque BT special page
area pointer) continues to point to the same page held in the same
buffer, even though the code doesn't set the "opaque" pointer again
and it may not point to a pinned buffer or even the appropriate
buffer. Ditto "page". So "opaque" may point to the wrong thing on
subsequent iterations - you don't do anything with the return value of
_bt_getbuf(), unlike all of the existing call sites. I believe you
should store its return value, and get the held page and the special
pointer on the page, and assign that to the "opaque" pointer before
the next iteration (an iteration in which you very probably really do
make progress not limited to completing a page split, the occurrence
of which the _bt_moveright() loop gets "stuck on"). You know, do what
is done in the non-split-handling case. It may not always be the same
buffer as before, obviously.
For a moment I thought that you might have accounted for that here:
*************** _bt_insert_parent(Relation rel,
*** 1685,1696 ****
  		 * 05/27/97
  		 */
  		ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
- 
  		pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
  
! 		/* Now we can unlock the children */
  		_bt_relbuf(rel, rbuf);
- 		_bt_relbuf(rel, buf);
  
  		/* Check for error only after writing children */
  		if (pbuf == InvalidBuffer)
--- 1767,1779 ----
  		 * 05/27/97
  		 */
  		ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
  		pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
  
! 		/*
! 		 * Now we can unlock the right child. The left child will be unlocked
! 		 * by _bt_insertonpg().
! 		 */
  		_bt_relbuf(rel, rbuf);
  
  		/* Check for error only after writing children */
  		if (pbuf == InvalidBuffer)
But this is unrelated - the buffer + pin are still dropped by a later
_bt_insertonpg(), as it says right there. Actually, to see that there
is a bug in _bt_moveright() you don't really have to look much further
than _bt_moveright(). So I apologize for taking you on that
unnecessary detour, but FWIW that was my thought process, and I like
to document that for my own benefit. :-)
Do you suppose that there are similar problems in other direct callers
of _bt_finish_split()? I'm thinking in particular of
_bt_findinsertloc().
I'm also not sure about the lock escalation that may occur within
_bt_moveright() for callers with BT_READ access - there is nothing to
prevent a caller from ending up being left with a write lock where
before they only had a read lock if they find that
!P_INCOMPLETE_SPLIT() with the write lock (after lock promotion) and
conclude that there is no split finishing to be done after all. It
looks like all callers currently won't care about that, but it seems
like that should either be prominently documented, or just not occur.
These interactions are complex enough that we ought to be very
explicit about where pins are required and dropped, or locks held
before or after calling.
I suggest you consider refactoring the loop in _bt_moveright() to
reflect these subtleties. I think the structure of that routine could
use some polishing. I'm not sure that I like the way that that and
similar loops get "stuck on" page splits, although no better design
occurs to me right now.
That's all I have for now. I've written plenty of notes, and will work
back through other points of possible concern. I don't suppose you
have any testing infrastructure that you could publish?
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 01/23/2014 11:36 PM, Peter Geoghegan wrote:
The first thing I noticed about this patchset is that it completely
expunges btree_xlog_startup(), btree_xlog_cleanup() and
btree_safe_restartpoint(). The post-recovery cleanup that previously
occurred to address both sets of problems (the problem addressed by
this patch, incomplete page splits, and the problem addressed by the
dependency patch, incomplete page deletions) now both occur
opportunistically/lazily. Notably, this means that there are now
exactly zero entries in the resource manager list with a
'restartpoint' callback. I guess we should consider removing it
entirely for that reason. As you said in the commit message where
gin_safe_restartpoint was similarly retired (commit 631118fe):
"""
The post-recovery cleanup mechanism was never totally reliable, as insertion
to the parent could fail e.g because of running out of memory or disk space,
leaving the tree in an inconsistent state.
"""
I'm of the general impression that post-recovery cleanup is
questionable. I'm surprised that you didn't mention this commit
previously. You just mentioned the original analogous work on GiST,
but this certainly seems to be something you've been thinking about
*systematically* eliminating for a while.
Yes, that's correct, these b-tree patches are part of a grand plan to
eliminate post-recovery cleanup actions altogether.
So while post-recovery callbacks no longer exist for any
rmgr-managed-resource, 100% of remaining startup and cleanup callbacks
concern the simple management of memory of AM-specific recovery
contexts (for GiST, GiN and SP-GiST). I have to wonder if there isn't
a better abstraction than that, such as a generic recovery memory
context, allowing you to retire all 3 callbacks. I mean, StartupXLOG()
just calls those callbacks for each resource at exactly the same time
anyway, just as it initializes resource managers in precisely the same
manner earlier on. Plus if you look at what those AM-local memory
management routines do, it all seems very simple.

I think you should bump XLOG_PAGE_MAGIC.
Perhaps you should increase the elevel here:
+ elog(DEBUG1, "finishing incomplete split of %u/%u",
+      BufferGetBlockNumber(lbuf), BufferGetBlockNumber(rbuf));

Since this is a very rare occurrence that involves some subtle
interactions, I can't think why you wouldn't want to LOG this for
forensic purposes.
Hmm. I'm not sure I agree with that line of thinking in general - what
is an admin supposed to do about the message? But there's a precedent in
vacuumlazy.c, which logs a message when it finds all-zero pages in the
heap. So I guess that should be a LOG.
Why did you remove the local linkage of _bt_walk_left(), given that it
is called in exactly the same way as before? I guess that that is just
a vestige of some earlier revision.
Yeah, reverted.
I think I see some bugs in _bt_moveright(). If you examine
_bt_finish_split() in detail, you'll see that it doesn't just drop the
write buffer lock that the caller will have provided (per its
comments) - it also drops the buffer pin. It calls _bt_insert_parent()
last, which was previously only called last thing by _bt_insertonpg()
(some of the time), and _bt_insertonpg() is indeed documented to drop
both the lock and the pin. And if you look at _bt_moveright() again,
you'll see that there is a tacit assumption that the buffer pin isn't
dropped, or at least that "opaque" (the BTPageOpaque BT special page
area pointer) continues to point to the same page held in the same
buffer, even though the code doesn't set the "opaque" pointer again
and it may not point to a pinned buffer or even the appropriate
buffer. Ditto "page". So "opaque" may point to the wrong thing on
subsequent iterations - you don't do anything with the return value of
_bt_getbuf(), unlike all of the existing call sites. I believe you
should store its return value, and get the held page and the special
pointer on the page, and assign that to the "opaque" pointer before
the next iteration (an iteration in which you very probably really do
make progress not limited to completing a page split, the occurrence
of which the _bt_moveright() loop gets "stuck on"). You know, do what
is done in the non-split-handling case. It may not always be the same
buffer as before, obviously.
Yep, fixed.
Do you suppose that there are similar problems in other direct callers
of _bt_finish_split()? I'm thinking in particular of
_bt_findinsertloc().
Hmm, no, the other callers look OK to me.
I'm also not sure about the lock escalation that may occur within
_bt_moveright() for callers with BT_READ access - there is nothing to
prevent a caller from ending up being left with a write lock where
before they only had a read lock if they find that
!P_INCOMPLETE_SPLIT() with the write lock (after lock promotion) and
conclude that there is no split finishing to be done after all. It
looks like all callers currently won't care about that, but it seems
like that should either be prominently documented, or just not occur.
These interactions are complex enough that we ought to be very
explicit about where pins are required and dropped, or locks held
before or after calling.
I haven't looked into this in detail yet, but I admit I don't much like
the _bt_moveright() function signature. It's strange to pass 'access'
and 'forupdate' as separate arguments, and it's not totally clear what
combinations make sense and what they mean. Some kind of refactoring is
probably in order.
Here's a new version, rebased over the latest page-deletion patch,
fix-btree-page-deletion-3.patch
(/messages/by-id/52E66E84.2050109@vmware.com). I
haven't tested this extensively, but passes "make check"...
- Heikki
Attachments:
btree-incomplete-split-3.patch (text/x-diff)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 03efc29..43ee75f 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -404,12 +404,34 @@ an additional insertion above that, etc).
For a root split, the followon WAL entry is a "new root" entry rather than
an "insertion" entry, but details are otherwise much the same.
-Because insertion involves multiple atomic actions, the WAL replay logic
-has to detect the case where a page split isn't followed by a matching
-insertion on the parent level, and then do that insertion on its own (and
-recursively for any subsequent parent insertion, of course). This is
-feasible because the WAL entry for the split contains enough info to know
-what must be inserted in the parent level.
+Because splitting involves multiple atomic actions, it's possible that the
+system crashes between splitting a page and inserting the downlink for the
+new half to the parent. After recovery, the downlink for the new page will
+be missing. The search algorithm works correctly, as the page will be found
+by following the right-link from its left sibling, although if a lot of
+downlinks in the tree are missing, performance will suffer. A more serious
+consequence is that if the page without a downlink gets split again, the
+insertion algorithm will fail to find the location in the parent level to
+insert the downlink.
+
+Our approach is to create any missing downlinks on-the-fly, when
+searching the tree for a new insertion. It could be done during searches,
+too, but it seems best not to put any extra updates in what would otherwise
+be a read-only operation (updating is not possible in hot standby mode
+anyway). To identify missing downlinks, when a page is split, the left page
+is flagged to indicate that the split is not yet complete (INCOMPLETE_SPLIT).
+When the downlink is inserted to the parent, the flag is cleared atomically
+with the insertion. The child page is kept locked until the insertion in the
+parent is finished and the flag in the child cleared, but can be released
+immediately after that, before recursing up the tree, if the parent also
+needs to be split. This ensures that incompletely split pages are not seen
+under normal circumstances; they occur only when insertion to the parent
+fails for some reason.
+
+We flag the left page, even though it's the right page that's missing the
+downlink, because it's more convenient to know already when following the
+right-link from the left page to the right page that it will need to have
+its downlink inserted to the parent.
When splitting a non-root page that is alone on its level, the required
metapage update (of the "fast root" link) is performed and logged as part
@@ -422,6 +444,14 @@ page is a second record. If vacuum is interrupted for some reason, or the
system crashes, the tree is consistent for searches and insertions. The next
VACUUM will find the half-dead leaf page and continue the deletion.
+Before 9.4, we used to keep track of incomplete splits and page deletions
+during recovery and finish them immediately at end of recovery, instead of
+doing it lazily at the next insertion or vacuum. However, that made the
+recovery much more complicated, and only fixed the problem when crash
+recovery was performed. An incomplete split can also occur if an otherwise
+recoverable error, like out-of-memory or out-of-disk-space, happens while
+inserting the downlink to the parent.
+
Scans during Recovery
---------------------
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 49c084a..4d60ae0 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -58,15 +58,18 @@ static void _bt_findinsertloc(Relation rel,
int keysz,
ScanKey scankey,
IndexTuple newtup,
+ BTStack stack,
Relation heapRel);
-static void _bt_insertonpg(Relation rel, Buffer buf,
+static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
bool split_only_page);
-static Buffer _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
- OffsetNumber newitemoff, Size newitemsz,
+static Buffer _bt_split(Relation rel, Buffer buf, Buffer cbuf,
+ OffsetNumber firstright, OffsetNumber newitemoff, Size newitemsz,
IndexTuple newitem, bool newitemonleft);
+static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
+ BTStack stack, bool is_root, bool is_only);
static OffsetNumber _bt_findsplitloc(Relation rel, Page page,
OffsetNumber newitemoff,
Size newitemsz,
@@ -130,7 +133,8 @@ top:
* move right in the tree. See Lehman and Yao for an excruciatingly
* precise description.
*/
- buf = _bt_moveright(rel, buf, natts, itup_scankey, false, BT_WRITE);
+ buf = _bt_moveright(rel, buf, natts, itup_scankey, false,
+ true, stack, BT_WRITE);
/*
* If we're not allowing duplicates, make sure the key isn't already in
@@ -183,8 +187,9 @@ top:
*/
CheckForSerializableConflictIn(rel, NULL, buf);
/* do the insertion */
- _bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup, heapRel);
- _bt_insertonpg(rel, buf, stack, itup, offset, false);
+ _bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup,
+ stack, heapRel);
+ _bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
}
else
{
@@ -508,6 +513,7 @@ _bt_findinsertloc(Relation rel,
int keysz,
ScanKey scankey,
IndexTuple newtup,
+ BTStack stack,
Relation heapRel)
{
Buffer buf = *bufptr;
@@ -569,6 +575,7 @@ _bt_findinsertloc(Relation rel,
while (PageGetFreeSpace(page) < itemsz)
{
Buffer rbuf;
+ BlockNumber rblkno;
/*
* before considering moving right, see if we can obtain enough space
@@ -606,18 +613,33 @@ _bt_findinsertloc(Relation rel,
*/
rbuf = InvalidBuffer;
+ rblkno = lpageop->btpo_next;
for (;;)
{
- BlockNumber rblkno = lpageop->btpo_next;
-
rbuf = _bt_relandgetbuf(rel, rbuf, rblkno, BT_WRITE);
page = BufferGetPage(rbuf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ /*
+ * If this page was incompletely split, finish that now. We do
+ * this while holding a lock on the left sibling, which is not
+ * good because finishing the split could be a fairly lengthy
+ * operation. But this should happen very seldom.
+ */
+ if (P_INCOMPLETE_SPLIT(lpageop))
+ {
+ _bt_finish_split(rel, rbuf, stack);
+ rbuf = InvalidBuffer;
+ continue;
+ }
+
if (!P_IGNORE(lpageop))
break;
if (P_RIGHTMOST(lpageop))
elog(ERROR, "fell off the end of index \"%s\"",
RelationGetRelationName(rel));
+
+ rblkno = lpageop->btpo_next;
}
_bt_relbuf(rel, buf);
buf = rbuf;
@@ -663,6 +685,10 @@ _bt_findinsertloc(Relation rel,
* insertion, and the buffer must be pinned and write-locked. On return,
* we will have dropped both the pin and the lock on the buffer.
*
+ * When inserting to a non-leaf page, 'cbuf' is the left-sibling of the
+ * page we're inserting the downlink for. This function will clear the
+ * INCOMPLETE_SPLIT flag on it, and release the buffer.
+ *
* The locking interactions in this code are critical. You should
* grok Lehman and Yao's paper before making any changes. In addition,
* you need to understand how we disambiguate duplicate keys in this
@@ -676,6 +702,7 @@ _bt_findinsertloc(Relation rel,
static void
_bt_insertonpg(Relation rel,
Buffer buf,
+ Buffer cbuf,
BTStack stack,
IndexTuple itup,
OffsetNumber newitemoff,
@@ -689,6 +716,14 @@ _bt_insertonpg(Relation rel,
page = BufferGetPage(buf);
lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
+ /*
+ * The caller should've finished any incomplete splits and page deletions
+ * already
+ */
+ if (P_INCOMPLETE_SPLIT(lpageop))
+ elog(ERROR, "cannot insert to incompletely split page %u",
+ BufferGetBlockNumber(buf));
+
itemsz = IndexTupleDSize(*itup);
itemsz = MAXALIGN(itemsz); /* be safe, PageAddItem will do this but we
* need to be consistent */
@@ -713,7 +748,7 @@ _bt_insertonpg(Relation rel,
&newitemonleft);
/* split the buffer into left and right halves */
- rbuf = _bt_split(rel, buf, firstright,
+ rbuf = _bt_split(rel, buf, cbuf, firstright,
newitemoff, itemsz, itup, newitemonleft);
PredicateLockPageSplit(rel,
BufferGetBlockNumber(buf),
@@ -787,11 +822,21 @@ _bt_insertonpg(Relation rel,
MarkBufferDirty(metabuf);
}
+ /* clear INCOMPLETE_SPLIT flag on child if this finishes a split */
+ if (!P_ISLEAF(lpageop))
+ {
+ Page cpage = BufferGetPage(cbuf);
+ BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
+ Assert(P_INCOMPLETE_SPLIT(cpageop));
+ cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+ MarkBufferDirty(cbuf);
+ }
+
/* XLOG stuff */
if (RelationNeedsWAL(rel))
{
xl_btree_insert xlrec;
- BlockNumber xldownlink;
+ BlockNumber xlleftchild;
xl_btree_metadata xlmeta;
uint8 xlinfo;
XLogRecPtr recptr;
@@ -811,12 +856,15 @@ _bt_insertonpg(Relation rel,
xlinfo = XLOG_BTREE_INSERT_LEAF;
else
{
- xldownlink = ItemPointerGetBlockNumber(&(itup->t_tid));
- Assert(ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
-
- nextrdata->data = (char *) &xldownlink;
+ /*
+ * Include the block number of the left child, whose
+ * INCOMPLETE_SPLIT flag is cleared.
+ */
+ xlleftchild = BufferGetBlockNumber(cbuf);
+ nextrdata->data = (char *) &xlleftchild;
nextrdata->len = sizeof(BlockNumber);
- nextrdata->buffer = InvalidBuffer;
+ nextrdata->buffer = cbuf;
+ nextrdata->buffer_std = true;
nextrdata->next = nextrdata + 1;
nextrdata++;
@@ -869,6 +917,8 @@ _bt_insertonpg(Relation rel,
END_CRIT_SECTION();
/* release buffers; send out relcache inval if metapage changed */
+ if (!P_ISLEAF(lpageop))
+ _bt_relbuf(rel, cbuf);
if (BufferIsValid(metabuf))
{
if (!InRecovery)
@@ -888,11 +938,15 @@ _bt_insertonpg(Relation rel,
* new right page. newitemoff etc. tell us about the new item that
* must be inserted along with the data from the old page.
*
+ * When splitting a non-leaf page, 'cbuf' is the left-sibling of the
+ * page we're inserting the downlink for. This function will clear the
+ * INCOMPLETE_SPLIT flag on it, and release the buffer.
+ *
* Returns the new right sibling of buf, pinned and write-locked.
* The pin and lock on buf are maintained.
*/
static Buffer
-_bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
+_bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
bool newitemonleft)
{
@@ -960,6 +1014,8 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
lopaque->btpo_flags = oopaque->btpo_flags;
lopaque->btpo_flags &= ~(BTP_ROOT | BTP_SPLIT_END | BTP_HAS_GARBAGE);
ropaque->btpo_flags = lopaque->btpo_flags;
+ /* set flag in left page indicating that the right page has no downlink */
+ lopaque->btpo_flags |= BTP_INCOMPLETE_SPLIT;
lopaque->btpo_prev = oopaque->btpo_prev;
lopaque->btpo_next = rightpagenumber;
ropaque->btpo_prev = origpagenumber;
@@ -1183,6 +1239,18 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
MarkBufferDirty(sbuf);
}
+ /*
+ * Clear INCOMPLETE_SPLIT flag on child if inserting the new item finishes
+ * a split.
+ */
+ if (ropaque->btpo.level > 0)
+ {
+ Page cpage = BufferGetPage(cbuf);
+ BTPageOpaque cpageop = (BTPageOpaque) PageGetSpecialPointer(cpage);
+ cpageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+ MarkBufferDirty(cbuf);
+ }
+
/* XLOG stuff */
if (RelationNeedsWAL(rel))
{
@@ -1205,34 +1273,6 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
lastrdata = &rdata[0];
- if (ropaque->btpo.level > 0)
- {
- /* Log downlink on non-leaf pages */
- lastrdata->next = lastrdata + 1;
- lastrdata++;
-
- lastrdata->data = (char *) &newitem->t_tid.ip_blkid;
- lastrdata->len = sizeof(BlockIdData);
- lastrdata->buffer = InvalidBuffer;
-
- /*
- * We must also log the left page's high key, because the right
- * page's leftmost key is suppressed on non-leaf levels. Show it
- * as belonging to the left page buffer, so that it is not stored
- * if XLogInsert decides it needs a full-page image of the left
- * page.
- */
- lastrdata->next = lastrdata + 1;
- lastrdata++;
-
- itemid = PageGetItemId(origpage, P_HIKEY);
- item = (IndexTuple) PageGetItem(origpage, itemid);
- lastrdata->data = (char *) item;
- lastrdata->len = MAXALIGN(IndexTupleSize(item));
- lastrdata->buffer = buf; /* backup block 1 */
- lastrdata->buffer_std = true;
- }
-
/*
* Log the new item and its offset, if it was inserted on the left
* page. (If it was put on the right page, we don't need to explicitly
@@ -1259,17 +1299,40 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
lastrdata->buffer = buf; /* backup block 1 */
lastrdata->buffer_std = true;
}
- else if (ropaque->btpo.level == 0)
+
+ /* Log left page */
+ if (ropaque->btpo.level > 0)
{
+ lastrdata->next = lastrdata + 1;
+ lastrdata++;
+
/*
- * Although we don't need to WAL-log the new item, we still need
- * XLogInsert to consider storing a full-page image of the left
- * page, so make an empty entry referencing that buffer. This also
- * ensures that the left page is always backup block 1.
+ * We must also log the left page's high key, because the right
+ * page's leftmost key is suppressed on non-leaf levels. Show it
+ * as belonging to the left page buffer, so that it is not stored
+ * if XLogInsert decides it needs a full-page image of the left
+ * page.
*/
+ itemid = PageGetItemId(origpage, P_HIKEY);
+ item = (IndexTuple) PageGetItem(origpage, itemid);
+ lastrdata->data = (char *) item;
+ lastrdata->len = MAXALIGN(IndexTupleSize(item));
+ lastrdata->buffer = buf; /* backup block 1 */
+ lastrdata->buffer_std = true;
+ }
+
+ if (ropaque->btpo.level == 0 && !newitemonleft)
+ {
lastrdata->next = lastrdata + 1;
lastrdata++;
+ /*
+ * Although we don't need to WAL-log anything on the left page
+ * itself, we still need XLogInsert to consider storing a
+ * full-page image of the left page, so make an empty entry
+ * referencing that buffer. This also ensures that the left page
+ * is always backup block 1.
+ */
lastrdata->data = NULL;
lastrdata->len = 0;
lastrdata->buffer = buf; /* backup block 1 */
@@ -1277,6 +1340,22 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
}
/*
+ * Log block number of left child, whose INCOMPLETE_SPLIT flag this
+ * insertion clears.
+ */
+ if (ropaque->btpo.level > 0)
+ {
+ BlockNumber cblkno = BufferGetBlockNumber(cbuf);
+ lastrdata->next = lastrdata + 1;
+ lastrdata++;
+
+ lastrdata->data = (char *) &cblkno;
+ lastrdata->len = sizeof(BlockNumber);
+ lastrdata->buffer = cbuf; /* backup block 2 */
+ lastrdata->buffer_std = true;
+ }
+
+ /*
* Log the contents of the right page in the format understood by
* _bt_restore_page(). We set lastrdata->buffer to InvalidBuffer,
* because we're going to recreate the whole page anyway, so it should
@@ -1305,7 +1384,7 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
lastrdata->data = NULL;
lastrdata->len = 0;
- lastrdata->buffer = sbuf; /* backup block 2 */
+ lastrdata->buffer = sbuf; /* bkp block 2 (leaf) or 3 (non-leaf) */
lastrdata->buffer_std = true;
}
@@ -1332,6 +1411,10 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
if (!P_RIGHTMOST(ropaque))
_bt_relbuf(rel, sbuf);
+ /* release the child */
+ if (ropaque->btpo.level > 0)
+ _bt_relbuf(rel, cbuf);
+
/* split's done */
return rbuf;
}
@@ -1602,10 +1685,8 @@ _bt_checksplitloc(FindSplitData *state,
* have to be efficient (concurrent ROOT split, WAL recovery)
* is_root - we split the true root
* is_only - we split a page alone on its level (might have been fast root)
- *
- * This is exported so it can be called by nbtxlog.c.
*/
-void
+static void
_bt_insert_parent(Relation rel,
Buffer buf,
Buffer rbuf,
@@ -1624,8 +1705,7 @@ _bt_insert_parent(Relation rel,
*
* If we have to search for the parent level, we do so by re-descending
* from the root. This is not super-efficient, but it's rare enough not
- * to matter. (This path is also taken when called from WAL recovery ---
- * we have no stack in that case.)
+ * to matter.
*/
if (is_root)
{
@@ -1684,12 +1764,13 @@ _bt_insert_parent(Relation rel,
* 05/27/97
*/
ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
-
pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
- /* Now we can unlock the children */
+ /*
+ * Now we can unlock the right child. The left child will be unlocked
+ * by _bt_insertonpg().
+ */
_bt_relbuf(rel, rbuf);
- _bt_relbuf(rel, buf);
/* Check for error only after writing children */
if (pbuf == InvalidBuffer)
@@ -1697,7 +1778,7 @@ _bt_insert_parent(Relation rel,
RelationGetRelationName(rel), bknum, rbknum);
/* Recursively update the parent */
- _bt_insertonpg(rel, pbuf, stack->bts_parent,
+ _bt_insertonpg(rel, pbuf, buf, stack->bts_parent,
new_item, stack->bts_offset + 1,
is_only);
@@ -1707,6 +1788,62 @@ _bt_insert_parent(Relation rel,
}
/*
+ * _bt_finish_split() -- Finish an incomplete split
+ *
+ * A crash or other failure can leave a split incomplete. The insertion
+ * routines won't allow insertion on a page that is incompletely split.
+ * Before inserting on such a page, call _bt_finish_split().
+ *
+ * On entry, 'lbuf' must be write-locked. The lock (and pin) are
+ * released on exit.
+ */
+void
+_bt_finish_split(Relation rel, Buffer lbuf, BTStack stack)
+{
+ Page lpage = BufferGetPage(lbuf);
+ BTPageOpaque lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
+ Buffer rbuf;
+ Page rpage;
+ BTPageOpaque rpageop;
+ bool was_root;
+ bool was_only;
+
+ Assert(P_INCOMPLETE_SPLIT(lpageop));
+
+ /* Lock right sibling, the one missing the downlink */
+ rbuf = _bt_getbuf(rel, lpageop->btpo_next, BT_WRITE);
+ rpage = BufferGetPage(rbuf);
+ rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
+
+ /* Could this be a root split? */
+ if (!stack)
+ {
+ Buffer metabuf;
+ Page metapg;
+ BTMetaPageData *metad;
+
+ /* acquire lock on the metapage */
+ metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_WRITE);
+ metapg = BufferGetPage(metabuf);
+ metad = BTPageGetMeta(metapg);
+
+ was_root = (metad->btm_root == BufferGetBlockNumber(lbuf));
+
+ _bt_relbuf(rel, metabuf);
+ }
+ else
+ was_root = false;
+
+ /* Was this the only page on the level before split? */
+ was_only = (P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop));
+
+ elog(DEBUG1, "finishing incomplete split of %u/%u",
+ BufferGetBlockNumber(lbuf), BufferGetBlockNumber(rbuf));
+
+ _bt_insert_parent(rel, lbuf, rbuf, stack, was_root, was_only);
+}
+
+/*
* _bt_getstackbuf() -- Walk back up the tree one step, and find the item
* we last looked at in the parent.
*
@@ -1738,6 +1875,12 @@ _bt_getstackbuf(Relation rel, BTStack stack, int access)
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ if (access == BT_WRITE && P_INCOMPLETE_SPLIT(opaque))
+ {
+ _bt_finish_split(rel, buf, stack->bts_parent);
+ continue;
+ }
+
if (!P_IGNORE(opaque))
{
OffsetNumber offnum,
@@ -1842,6 +1985,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
rbkno;
BlockNumber rootblknum;
BTPageOpaque rootopaque;
+ BTPageOpaque lopaque;
ItemId itemid;
IndexTuple item;
Size itemsz;
@@ -1853,6 +1997,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
lbkno = BufferGetBlockNumber(lbuf);
rbkno = BufferGetBlockNumber(rbuf);
lpage = BufferGetPage(lbuf);
+ lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
/* get a new root page */
rootbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
@@ -1926,6 +2071,11 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
BufferGetBlockNumber(lbuf), RelationGetRelationName(rel));
pfree(new_item);
+ /* Clear the incomplete-split flag in the left child */
+ Assert(P_INCOMPLETE_SPLIT(lopaque));
+ lopaque->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+ MarkBufferDirty(lbuf);
+
MarkBufferDirty(rootbuf);
MarkBufferDirty(metabuf);
@@ -1934,7 +2084,7 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
{
xl_btree_newroot xlrec;
XLogRecPtr recptr;
- XLogRecData rdata[2];
+ XLogRecData rdata[3];
xlrec.node = rel->rd_node;
xlrec.rootblk = rootblknum;
@@ -1953,7 +2103,13 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
rdata[1].len = ((PageHeader) rootpage)->pd_special -
((PageHeader) rootpage)->pd_upper;
rdata[1].buffer = InvalidBuffer;
- rdata[1].next = NULL;
+ rdata[1].next = &(rdata[2]);
+
+ /* Make a full-page image of the left child if needed */
+ rdata[2].data = NULL;
+ rdata[2].len = 0;
+ rdata[2].buffer = lbuf;
+ rdata[2].next = NULL;
recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_NEWROOT, rdata);
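(As an aside, the rdata chains built in these XLOG sections are just singly-linked lists of data fragments that XLogInsert concatenates into one record. A rough standalone sketch of the idiom, where ToyRecData is an invented stand-in and not the real XLogRecData:)

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for XLogRecData: each node points at one data fragment;
 * the record payload is the concatenation of all fragments. */
typedef struct ToyRecData
{
	const char *data;
	size_t		len;
	struct ToyRecData *next;
} ToyRecData;

/* Mirrors the "lastrdata->next = lastrdata + 1; lastrdata++;" idiom:
 * chain consecutive array entries together, NULL-terminating the last. */
static void
toy_chain_array(ToyRecData *rdata, size_t n)
{
	size_t		i;

	for (i = 0; i + 1 < n; i++)
		rdata[i].next = &rdata[i + 1];
	if (n > 0)
		rdata[n - 1].next = NULL;
}

/* Walk the chain and sum up the payload length, as XLogInsert would. */
static size_t
toy_record_total_len(const ToyRecData *rdata)
{
	size_t		total = 0;

	for (; rdata != NULL; rdata = rdata->next)
		total += rdata->len;
	return total;
}
```

The chain order matters, which is why the backup-block numbering comments above have to change when a new entry is spliced in.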
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 7ccbfd3..6bf1341 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1026,8 +1026,13 @@ _bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
* It's only child, so safe if parent would itself be removable.
* We have to check the parent itself, and then recurse to test
* the conditions at the parent's parent.
+ *
+ * Also check that the page, or its left sibling, isn't
+ * incompletely split. These are the same checks that we already
+ * did for the leaf page, see comments in _bt_mark_page_halfdead().
*/
- if (P_RIGHTMOST(opaque) || P_ISROOT(opaque))
+ if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) ||
+ P_INCOMPLETE_SPLIT(opaque))
{
_bt_relbuf(rel, pbuf);
return false;
@@ -1128,7 +1133,8 @@ _bt_pagedel(Relation rel, Buffer buf)
* check that page is not already deleted and is empty.
*/
if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
- P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page))
+ P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page) ||
+ P_INCOMPLETE_SPLIT(opaque))
{
/* Should never fail to delete a half-dead page */
Assert(!P_ISHALFDEAD(opaque));
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index bc62fbc..d4220d9 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -51,7 +51,9 @@ static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
* NOTE that the returned buffer is read-locked regardless of the access
* parameter. However, access = BT_WRITE will allow an empty root page
* to be created and returned. When access = BT_READ, an empty index
- * will result in *bufP being set to InvalidBuffer.
+ * will result in *bufP being set to InvalidBuffer. Also, in BT_WRITE mode,
+ * any incomplete splits or half-dead pages encountered during the search
+ * will be finished.
*/
BTStack
_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
@@ -82,8 +84,13 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
* Race -- the page we just grabbed may have split since we read its
* pointer in the parent (or metapage). If it has, we may need to
* move right to its new sibling. Do that.
+ *
+ * In write-mode, allow _bt_moveright to finish any incomplete splits
+ * along the way.
*/
- *bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey, BT_READ);
+ *bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
+ (access == BT_WRITE), stack_in,
+ BT_READ);
/* if this is a leaf page, we're done */
page = BufferGetPage(*bufP);
@@ -148,6 +155,11 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
* item >= scankey. When nextkey is true, we are looking for the first
* item strictly greater than scankey.
*
+ * If forupdate is true, we will attempt to finish any incomplete splits
+ * that we encounter. This is required when searching for a target page for
+ * an insertion, because we don't allow inserting on a page that has an
+ * incomplete split pending. 'stack' is only used if forupdate is true.
+ *
* On entry, we have the buffer pinned and a lock of the type specified by
* 'access'. If we move right, we release the buffer and lock and acquire
* the same on the right sibling. Return value is the buffer we stop at.
@@ -158,15 +170,14 @@ _bt_moveright(Relation rel,
int keysz,
ScanKey scankey,
bool nextkey,
+ bool forupdate,
+ BTStack stack,
int access)
{
Page page;
BTPageOpaque opaque;
int32 cmpval;
- page = BufferGetPage(buf);
- opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
/*
* When nextkey = false (normal case): if the scan key that brought us to
* this page is > the high key stored on the page, then the page has split
@@ -184,16 +195,48 @@ _bt_moveright(Relation rel,
*/
cmpval = nextkey ? 0 : 1;
- while (!P_RIGHTMOST(opaque) &&
- (P_IGNORE(opaque) ||
- _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval))
+ for (;;)
{
- /* step right one page */
- BlockNumber rblkno = opaque->btpo_next;
-
- buf = _bt_relandgetbuf(rel, buf, rblkno, access);
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ if (P_RIGHTMOST(opaque))
+ break;
+
+ /*
+ * Finish any incomplete splits we encounter along the way.
+ */
+ if (forupdate && P_INCOMPLETE_SPLIT(opaque))
+ {
+ BlockNumber blkno = BufferGetBlockNumber(buf);
+
+ if (access == BT_READ)
+ {
+ /* upgrade the lock, and re-check */
+ LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+ LockBuffer(buf, BT_WRITE);
+ if (P_INCOMPLETE_SPLIT(opaque))
+ {
+ _bt_finish_split(rel, buf, stack);
+ _bt_getbuf(rel, blkno, access);
+ }
+ }
+ else
+ {
+ _bt_finish_split(rel, buf, stack);
+ _bt_getbuf(rel, blkno, access);
+ }
+ continue;
+ }
+
+ if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval)
+ {
+ /* step right one page */
+ buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
+ continue;
+ }
+ else
+ break;
}
if (P_IGNORE(opaque))
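(For clarity, the lock-upgrade-and-recheck pattern that the new _bt_moveright code relies on can be sketched in a self-contained toy model. There is no atomic read-to-write lock upgrade, so we must unlock, relock in write mode, and then re-check the flag, because another backend may have finished the split in between. Everything below is invented for illustration; ToyPage and the enum are not the real buffer-lock API:)

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock modes standing in for nbtree's BT_READ / BT_WRITE. */
typedef enum
{
	TOY_UNLOCKED,
	TOY_READ,
	TOY_WRITE
} ToyLockMode;

typedef struct ToyPage
{
	ToyLockMode lock;
	bool		incomplete_split;
} ToyPage;

/* Returns true if this call finished the split, false if there was
 * nothing to do (e.g. someone else finished it while we had the page
 * unlocked during the upgrade). */
static bool
toy_finish_split_if_needed(ToyPage *p)
{
	bool		finished = false;

	if (p->lock == TOY_READ)
	{
		p->lock = TOY_UNLOCKED; /* LockBuffer(buf, BUFFER_LOCK_UNLOCK) */
		p->lock = TOY_WRITE;	/* LockBuffer(buf, BT_WRITE) */
	}
	if (p->incomplete_split)	/* must re-check under the write lock */
	{
		p->incomplete_split = false;	/* stands in for _bt_finish_split() */
		finished = true;
	}
	p->lock = TOY_UNLOCKED;
	return finished;
}
```

The re-check after relocking is the important part; without it we could try to finish a split that no longer exists.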
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index e7317c4..8409bad 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -21,61 +21,6 @@
#include "miscadmin.h"
/*
- * We must keep track of expected insertions due to page splits, and apply
- * them manually if they are not seen in the WAL log during replay. This
- * makes it safe for page insertion to be a multiple-WAL-action process.
- *
- * The data structure is a simple linked list --- this should be good enough,
- * since we don't expect a page split to remain incomplete for long. In any
- * case we need to respect the order of operations.
- */
-typedef struct bt_incomplete_split
-{
- RelFileNode node; /* the index */
- bool is_root; /* we split the root */
- BlockNumber leftblk; /* left half of split */
- BlockNumber rightblk; /* right half of split */
-} bt_incomplete_split;
-
-static List *incomplete_splits;
-
-
-static void
-log_incomplete_split(RelFileNode node, BlockNumber leftblk,
- BlockNumber rightblk, bool is_root)
-{
- bt_incomplete_split *action = palloc(sizeof(bt_incomplete_split));
-
- action->node = node;
- action->is_root = is_root;
- action->leftblk = leftblk;
- action->rightblk = rightblk;
- incomplete_splits = lappend(incomplete_splits, action);
-}
-
-static void
-forget_matching_split(RelFileNode node, BlockNumber downlink, bool is_root)
-{
- ListCell *l;
-
- foreach(l, incomplete_splits)
- {
- bt_incomplete_split *action = (bt_incomplete_split *) lfirst(l);
-
- if (RelFileNodeEquals(node, action->node) &&
- downlink == action->rightblk)
- {
- if (is_root != action->is_root)
- elog(LOG, "forget_matching_split: fishy is_root data (expected %d, got %d)",
- action->is_root, is_root);
- incomplete_splits = list_delete_ptr(incomplete_splits, action);
- pfree(action);
- break; /* need not look further */
- }
- }
-}
-
-/*
* _bt_restore_page -- re-enter all the index tuples on a page
*
* The page is freshly init'd, and *from (length len) is a copy of what
@@ -149,23 +94,60 @@ _bt_restore_meta(RelFileNode rnode, XLogRecPtr lsn,
UnlockReleaseBuffer(metabuf);
}
+/*
+ * _bt_clear_incomplete_split -- clear INCOMPLETE_SPLIT flag on a page
+ *
+ * This is a common subroutine of the redo functions of all the WAL record
+ * types that can insert a downlink: insert, split, and newroot.
+ */
+static void
+_bt_clear_incomplete_split(XLogRecPtr lsn, XLogRecord *record,
+ RelFileNode rnode, BlockNumber cblock)
+{
+ Buffer buf;
+
+ buf = XLogReadBuffer(rnode, cblock, false);
+ if (BufferIsValid(buf))
+ {
+ Page page = (Page) BufferGetPage(buf);
+
+ if (lsn > PageGetLSN(page))
+ {
+ BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+ Assert((pageop->btpo_flags & BTP_INCOMPLETE_SPLIT) != 0);
+ pageop->btpo_flags &= ~BTP_INCOMPLETE_SPLIT;
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buf);
+ }
+ UnlockReleaseBuffer(buf);
+ }
+}
+
static void
btree_xlog_insert(bool isleaf, bool ismeta,
XLogRecPtr lsn, XLogRecord *record)
{
xl_btree_insert *xlrec = (xl_btree_insert *) XLogRecGetData(record);
Buffer buffer;
+ Buffer cbuffer = InvalidBuffer;
Page page;
char *datapos;
int datalen;
xl_btree_metadata md;
- BlockNumber downlink = 0;
+ BlockNumber cblkno = 0;
+ int main_blk_index;
datapos = (char *) xlrec + SizeOfBtreeInsert;
datalen = record->xl_len - SizeOfBtreeInsert;
- if (!isleaf)
+ /*
+ * If this insert finishes a split at a lower level, extract the block
+ * number of the (left) child.
+ */
+ if (!isleaf && (record->xl_info & XLR_BKP_BLOCK(0)) == 0)
{
- memcpy(&downlink, datapos, sizeof(BlockNumber));
+ memcpy(&cblkno, datapos, sizeof(BlockNumber));
+ Assert(cblkno != 0);
datapos += sizeof(BlockNumber);
datalen -= sizeof(BlockNumber);
}
@@ -176,8 +158,19 @@ btree_xlog_insert(bool isleaf, bool ismeta,
datalen -= sizeof(xl_btree_metadata);
}
- if (record->xl_info & XLR_BKP_BLOCK(0))
- (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ if (!isleaf)
+ {
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ else
+ _bt_clear_incomplete_split(lsn, record, xlrec->target.node, cblkno);
+ main_blk_index = 1;
+ }
+ else
+ main_blk_index = 0;
+
+ if (record->xl_info & XLR_BKP_BLOCK(main_blk_index))
+ (void) RestoreBackupBlock(lsn, record, main_blk_index, false, false);
else
{
buffer = XLogReadBuffer(xlrec->target.node,
@@ -187,11 +180,7 @@ btree_xlog_insert(bool isleaf, bool ismeta,
{
page = (Page) BufferGetPage(buffer);
- if (lsn <= PageGetLSN(page))
- {
- UnlockReleaseBuffer(buffer);
- }
- else
+ if (lsn > PageGetLSN(page))
{
if (PageAddItem(page, (Item) datapos, datalen,
ItemPointerGetOffsetNumber(&(xlrec->target.tid)),
@@ -200,11 +189,14 @@ btree_xlog_insert(bool isleaf, bool ismeta,
PageSetLSN(page, lsn);
MarkBufferDirty(buffer);
- UnlockReleaseBuffer(buffer);
}
+ UnlockReleaseBuffer(buffer);
}
}
+ if (BufferIsValid(cbuffer))
+ UnlockReleaseBuffer(cbuffer);
+
/*
* Note: in normal operation, we'd update the metapage while still holding
* lock on the page we inserted into. But during replay it's not
@@ -216,10 +208,6 @@ btree_xlog_insert(bool isleaf, bool ismeta,
_bt_restore_meta(xlrec->target.node, lsn,
md.root, md.level,
md.fastroot, md.fastlevel);
-
- /* Forget any split this insertion completes */
- if (!isleaf)
- forget_matching_split(xlrec->target.node, downlink, false);
}
static void
@@ -227,6 +215,8 @@ btree_xlog_split(bool onleft, bool isroot,
XLogRecPtr lsn, XLogRecord *record)
{
xl_btree_split *xlrec = (xl_btree_split *) XLogRecGetData(record);
+ bool isleaf = (xlrec->level == 0);
+ Buffer lbuf;
Buffer rbuf;
Page rpage;
BTPageOpaque ropaque;
@@ -237,42 +227,18 @@ btree_xlog_split(bool onleft, bool isroot,
Size newitemsz = 0;
Item left_hikey = NULL;
Size left_hikeysz = 0;
+ BlockNumber cblkno = InvalidBlockNumber;
datapos = (char *) xlrec + SizeOfBtreeSplit;
datalen = record->xl_len - SizeOfBtreeSplit;
- /* Forget any split this insertion completes */
- if (xlrec->level > 0)
- {
- /* we assume SizeOfBtreeSplit is at least 16-bit aligned */
- BlockNumber downlink = BlockIdGetBlockNumber((BlockId) datapos);
-
- datapos += sizeof(BlockIdData);
- datalen -= sizeof(BlockIdData);
-
- forget_matching_split(xlrec->node, downlink, false);
-
- /* Extract left hikey and its size (still assuming 16-bit alignment) */
- if (!(record->xl_info & XLR_BKP_BLOCK(0)))
- {
- /* We assume 16-bit alignment is enough for IndexTupleSize */
- left_hikey = (Item) datapos;
- left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
-
- datapos += left_hikeysz;
- datalen -= left_hikeysz;
- }
- }
-
- /* Extract newitem and newitemoff, if present */
+ /* Extract newitemoff and newitem, if present */
if (onleft)
{
- /* Extract the offset (still assuming 16-bit alignment) */
memcpy(&newitemoff, datapos, sizeof(OffsetNumber));
datapos += sizeof(OffsetNumber);
datalen -= sizeof(OffsetNumber);
}
-
if (onleft && !(record->xl_info & XLR_BKP_BLOCK(0)))
{
/*
@@ -286,6 +252,37 @@ btree_xlog_split(bool onleft, bool isroot,
datalen -= newitemsz;
}
+ /* Extract left hikey and its size (still assuming 16-bit alignment) */
+ if (!isleaf && !(record->xl_info & XLR_BKP_BLOCK(0)))
+ {
+ left_hikey = (Item) datapos;
+ left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));
+ datapos += left_hikeysz;
+ datalen -= left_hikeysz;
+ }
+ /*
+ * If this insertion finishes an incomplete split, get the block number
+ * of the child.
+ */
+ if (!isleaf && !(record->xl_info & XLR_BKP_BLOCK(1)))
+ {
+ memcpy(&cblkno, datapos, sizeof(BlockNumber));
+ datapos += sizeof(BlockNumber);
+ datalen -= sizeof(BlockNumber);
+ }
+
+ /*
+ * Clear the incomplete split flag on the left sibling of the child page
+ * this is a downlink for.
+ */
+ if (!isleaf)
+ {
+ if (record->xl_info & XLR_BKP_BLOCK(2))
+ (void) RestoreBackupBlock(lsn, record, 2, false, false);
+ else
+ _bt_clear_incomplete_split(lsn, record, xlrec->node, cblkno);
+ }
+
/* Reconstruct right (new) sibling page from scratch */
rbuf = XLogReadBuffer(xlrec->node, xlrec->rightsib, true);
Assert(BufferIsValid(rbuf));
@@ -297,7 +294,7 @@ btree_xlog_split(bool onleft, bool isroot,
ropaque->btpo_prev = xlrec->leftsib;
ropaque->btpo_next = xlrec->rnext;
ropaque->btpo.level = xlrec->level;
- ropaque->btpo_flags = (xlrec->level == 0) ? BTP_LEAF : 0;
+ ropaque->btpo_flags = isleaf ? BTP_LEAF : 0;
ropaque->btpo_cycleid = 0;
_bt_restore_page(rpage, datapos, datalen);
@@ -306,7 +303,7 @@ btree_xlog_split(bool onleft, bool isroot,
* On leaf level, the high key of the left page is equal to the first key
* on the right page.
*/
- if (xlrec->level == 0)
+ if (isleaf)
{
ItemId hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
@@ -321,10 +318,10 @@ btree_xlog_split(bool onleft, bool isroot,
/* Now reconstruct left (original) sibling page */
if (record->xl_info & XLR_BKP_BLOCK(0))
- (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ lbuf = RestoreBackupBlock(lsn, record, 0, false, true);
else
{
- Buffer lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);
+ lbuf = XLogReadBuffer(xlrec->node, xlrec->leftsib, false);
if (BufferIsValid(lbuf))
{
@@ -381,19 +378,21 @@ btree_xlog_split(bool onleft, bool isroot,
elog(PANIC, "failed to add high key to left page after split");
/* Fix opaque fields */
- lopaque->btpo_flags = (xlrec->level == 0) ? BTP_LEAF : 0;
+ lopaque->btpo_flags = BTP_INCOMPLETE_SPLIT;
+ if (isleaf)
+ lopaque->btpo_flags |= BTP_LEAF;
lopaque->btpo_next = xlrec->rightsib;
lopaque->btpo_cycleid = 0;
PageSetLSN(lpage, lsn);
MarkBufferDirty(lbuf);
}
-
- UnlockReleaseBuffer(lbuf);
}
}
- /* We no longer need the right buffer */
+ /* We no longer need the buffers */
+ if (BufferIsValid(lbuf))
+ UnlockReleaseBuffer(lbuf);
UnlockReleaseBuffer(rbuf);
/*
@@ -404,32 +403,39 @@ btree_xlog_split(bool onleft, bool isroot,
* replay, because no other index update can be in progress, and readers
* will cope properly when following an obsolete left-link.
*/
- if (record->xl_info & XLR_BKP_BLOCK(1))
- (void) RestoreBackupBlock(lsn, record, 1, false, false);
- else if (xlrec->rnext != P_NONE)
+ if (xlrec->rnext != P_NONE)
{
- Buffer buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);
+ /*
+ * The backup block containing the right sibling is 2 or 3, depending
+ * on whether this was a leaf or an internal page.
+ */
+ int rnext_index = isleaf ? 2 : 3;
- if (BufferIsValid(buffer))
+ if (record->xl_info & XLR_BKP_BLOCK(rnext_index))
+ (void) RestoreBackupBlock(lsn, record, rnext_index, false, false);
+ else
{
- Page page = (Page) BufferGetPage(buffer);
+ Buffer buffer;
- if (lsn > PageGetLSN(page))
+ buffer = XLogReadBuffer(xlrec->node, xlrec->rnext, false);
+
+ if (BufferIsValid(buffer))
{
- BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+ Page page = (Page) BufferGetPage(buffer);
- pageop->btpo_prev = xlrec->rightsib;
+ if (lsn > PageGetLSN(page))
+ {
+ BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(page);
- PageSetLSN(page, lsn);
- MarkBufferDirty(buffer);
+ pageop->btpo_prev = xlrec->rightsib;
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ }
+ UnlockReleaseBuffer(buffer);
}
- UnlockReleaseBuffer(buffer);
}
}
-
- /* The job ain't done till the parent link is inserted... */
- log_incomplete_split(xlrec->node,
- xlrec->leftsib, xlrec->rightsib, isroot);
}
static void
@@ -992,10 +998,7 @@ btree_xlog_newroot(XLogRecPtr lsn, XLogRecord *record)
Buffer buffer;
Page page;
BTPageOpaque pageop;
- BlockNumber downlink = 0;
-
- /* Backup blocks are not used in newroot records */
- Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
+ BlockNumber cblkno;
buffer = XLogReadBuffer(xlrec->node, xlrec->rootblk, true);
Assert(BufferIsValid(buffer));
@@ -1018,10 +1021,16 @@ btree_xlog_newroot(XLogRecPtr lsn, XLogRecord *record)
_bt_restore_page(page,
(char *) xlrec + SizeOfBtreeNewroot,
record->xl_len - SizeOfBtreeNewroot);
- /* extract downlink to the right-hand split page */
- itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_FIRSTKEY));
- downlink = ItemPointerGetBlockNumber(&(itup->t_tid));
+ /* extract block number of the left-hand split page */
+ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
+ cblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
Assert(ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+
+ /* Clear the incomplete-split flag in left child */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ else
+ _bt_clear_incomplete_split(lsn, record, xlrec->node, cblkno);
}
PageSetLSN(page, lsn);
@@ -1031,10 +1040,6 @@ btree_xlog_newroot(XLogRecPtr lsn, XLogRecord *record)
_bt_restore_meta(xlrec->node, lsn,
xlrec->rootblk, xlrec->level,
xlrec->rootblk, xlrec->level);
-
- /* Check to see if this satisfies any incomplete insertions */
- if (record->xl_len > SizeOfBtreeNewroot)
- forget_matching_split(xlrec->node, downlink, true);
}
static void
@@ -1114,59 +1119,3 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record)
elog(PANIC, "btree_redo: unknown op code %u", info);
}
}
-
-void
-btree_xlog_startup(void)
-{
- incomplete_splits = NIL;
-}
-
-void
-btree_xlog_cleanup(void)
-{
- ListCell *l;
-
- foreach(l, incomplete_splits)
- {
- /* finish an incomplete split */
- bt_incomplete_split *action = (bt_incomplete_split *) lfirst(l);
- Buffer lbuf,
- rbuf;
- Page lpage,
- rpage;
- BTPageOpaque lpageop,
- rpageop;
- bool is_only;
- Relation reln;
-
- lbuf = XLogReadBuffer(action->node, action->leftblk, false);
- /* failure is impossible because we wrote this page earlier */
- if (!BufferIsValid(lbuf))
- elog(PANIC, "btree_xlog_cleanup: left block unfound");
- lpage = (Page) BufferGetPage(lbuf);
- lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
- rbuf = XLogReadBuffer(action->node, action->rightblk, false);
- /* failure is impossible because we wrote this page earlier */
- if (!BufferIsValid(rbuf))
- elog(PANIC, "btree_xlog_cleanup: right block unfound");
- rpage = (Page) BufferGetPage(rbuf);
- rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
-
- /* if the pages are all of their level, it's a only-page split */
- is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop);
-
- reln = CreateFakeRelcacheEntry(action->node);
- _bt_insert_parent(reln, lbuf, rbuf, NULL,
- action->is_root, is_only);
- FreeFakeRelcacheEntry(reln);
- }
- incomplete_splits = NIL;
-}
-
-bool
-btree_safe_restartpoint(void)
-{
- if (incomplete_splits)
- return false;
- return true;
-}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index a1fb948..f974ea2 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -73,6 +73,7 @@ typedef BTPageOpaqueData *BTPageOpaque;
#define BTP_HALF_DEAD (1 << 4) /* empty, but still in tree */
#define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples */
+#define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */
/*
* The max allowed value of a cycle ID is a bit less than 64K. This is
@@ -178,6 +179,7 @@ typedef struct BTMetaPageData
#define P_ISHALFDEAD(opaque) ((opaque)->btpo_flags & BTP_HALF_DEAD)
#define P_IGNORE(opaque) ((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD))
#define P_HAS_GARBAGE(opaque) ((opaque)->btpo_flags & BTP_HAS_GARBAGE)
+#define P_INCOMPLETE_SPLIT(opaque) ((opaque)->btpo_flags & BTP_INCOMPLETE_SPLIT)
/*
* Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
@@ -253,7 +255,7 @@ typedef struct xl_btree_metadata
typedef struct xl_btree_insert
{
xl_btreetid target; /* inserted tuple id */
- /* BlockNumber downlink field FOLLOWS IF NOT XLOG_BTREE_INSERT_LEAF */
+ /* BlockNumber finishes_split field FOLLOWS IF NOT XLOG_BTREE_INSERT_LEAF */
/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_INSERT_META */
/* INDEX TUPLE FOLLOWS AT END OF STRUCT */
} xl_btree_insert;
@@ -286,19 +288,18 @@ typedef struct xl_btree_split
OffsetNumber firstright; /* first item moved to right page */
/*
- * If level > 0, BlockIdData downlink follows. (We use BlockIdData rather
- * than BlockNumber for alignment reasons: SizeOfBtreeSplit is only 16-bit
- * aligned.)
+ * In the _L variants, next are OffsetNumber newitemoff and the new item.
+ * (In the _R variants, the new item is one of the right page's tuples.)
+ * The new item, but not newitemoff, is suppressed if XLogInsert chooses
+ * to store the left page's whole page image.
*
* If level > 0, an IndexTuple representing the HIKEY of the left page
* follows. We don't need this on leaf pages, because it's the same as
* the leftmost key in the new right page. Also, it's suppressed if
* XLogInsert chooses to store the left page's whole page image.
*
- * In the _L variants, next are OffsetNumber newitemoff and the new item.
- * (In the _R variants, the new item is one of the right page's tuples.)
- * The new item, but not newitemoff, is suppressed if XLogInsert chooses
- * to store the left page's whole page image.
+ * If level > 0, BlockNumber of the page whose incomplete-split flag
+ * this insertion clears. (not aligned)
*
* Last are the right page's tuples in the form used by _bt_restore_page.
*/
@@ -642,8 +643,7 @@ extern Datum btoptions(PG_FUNCTION_ARGS);
extern bool _bt_doinsert(Relation rel, IndexTuple itup,
IndexUniqueCheck checkUnique, Relation heapRel);
extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, int access);
-extern void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
- BTStack stack, bool is_root, bool is_only);
+extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
/*
* prototypes for functions in nbtpage.c
@@ -673,7 +673,8 @@ extern BTStack _bt_search(Relation rel,
int keysz, ScanKey scankey, bool nextkey,
Buffer *bufP, int access);
extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
- ScanKey scankey, bool nextkey, int access);
+ ScanKey scankey, bool nextkey,
+ bool finishsplits, BTStack stack, int access);
extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
ScanKey scankey, bool nextkey);
extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
@@ -722,8 +723,5 @@ extern void _bt_leafbuild(BTSpool *btspool, BTSpool *spool2);
*/
extern void btree_redo(XLogRecPtr lsn, XLogRecord *record);
extern void btree_desc(StringInfo buf, uint8 xl_info, char *rec);
-extern void btree_xlog_startup(void);
-extern void btree_xlog_cleanup(void);
-extern bool btree_safe_restartpoint(void);
#endif /* NBTREE_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index c968101..d9ee57d 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,7 +36,7 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, NULL, NULL, NULL)
PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, NULL, NULL, NULL)
PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, NULL, NULL, NULL)
PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, NULL, NULL, NULL)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup, btree_safe_restartpoint)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, NULL, NULL, NULL)
PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, NULL, NULL, NULL)
PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup, NULL)
PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup, NULL)
On Mon, Jan 27, 2014 at 10:27 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
I think I see some bugs in _bt_moveright(). If you examine
_bt_finish_split() in detail, you'll see that it doesn't just drop the
write buffer lock that the caller will have provided (per its
comments) - it also drops the buffer pin. It calls _bt_insert_parent()
last, which was previously only called last thing by _bt_insertonpg()
(some of the time), and _bt_insertonpg() is indeed documented to drop
both the lock and the pin. And if you look at _bt_moveright() again,
you'll see that there is a tacit assumption that the buffer pin isn't
dropped, or at least that "opaque" (the BTPageOpaque BT special page
area pointer) continues to point to the same page held in the same
buffer, even though the code doesn't set the "opaque" pointer again
and it may not point to a pinned buffer or even the appropriate
buffer. Ditto "page". So "opaque" may point to the wrong thing on
subsequent iterations - you don't do anything with the return value of
_bt_getbuf(), unlike all of the existing call sites. I believe you
should store its return value, and get the held page and the special
pointer on the page, and assign that to the "opaque" pointer before
the next iteration (an iteration in which you very probably really do
make progress not limited to completing a page split, the occurrence
of which the _bt_moveright() loop gets "stuck on"). You know, do what
is done in the non-split-handling case. It may not always be the same
buffer as before, obviously.

Yep, fixed.
Can you explain what the fix was, please?
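The stale-pointer hazard discussed above can be modeled with a toy sketch (plain Python, not PostgreSQL code; the class and method names are illustrative stand-ins for the buffer manager, _bt_finish_split(), and _bt_getbuf()):

```python
class BufferManager:
    """Minimal stand-in for a buffer pool: the page seen by a caller can
    change once its buffer is released and re-acquired."""

    def __init__(self):
        self.version = 0
        self.page = {"opaque": {"incomplete_split": True}, "version": 0}

    def finish_split(self):
        # Like _bt_finish_split: drops the caller's pin and leaves the
        # page in a new state (downlink inserted, flag cleared).
        self.version += 1
        self.page = {"opaque": {"incomplete_split": False},
                     "version": self.version}

    def getbuf(self):
        # Like _bt_getbuf: returns the *current* buffer; callers must use
        # this return value rather than pointers cached before the release.
        return self.page


def moveright_fixed(mgr):
    page = mgr.getbuf()
    opaque = page["opaque"]
    while opaque["incomplete_split"]:
        mgr.finish_split()
        # The fix: re-acquire the buffer and recompute "page" and "opaque"
        # before the next iteration, instead of reusing stale pointers.
        page = mgr.getbuf()
        opaque = page["opaque"]
    return page
```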
--
Peter Geoghegan
On 01/23/2014 11:36 PM, Peter Geoghegan wrote:
That's all I have for now. I've written plenty of notes, and will work
back through other points of possible concern. I don't suppose you
have any testing infrastructure that you could publish?
Okay, promise not to laugh. I did write a bunch of hacks, to generate
graphviz .dot files from the btree pages, and render them into pictures.
It consists of multiple parts, all in the attached tarball. Here's how to
use it:
After installing all the parts (instructions below), launch the
btree-snapshot.sh script. It will poll once per second, and generate a
.dot graph out of an index called 'i_foo'. It compares the snapshot with
the previous one, and if it differs, it generates a new .png file from
it under /tmp.
With that, you can get a slideshow of how the index changes, when you
execute commands that modify it. However, because it only polls once per
second, you'll have to insert some sleeps into the code you're testing,
so that the script can catch the changes in action. See attached
nbtinsert-sleeps.patch.
For example:
create table foo (t text);
create index i_foo on foo (t);
-- launch btree-snapshot.sh in another terminal
insert into foo select 'aaa' || g from generate_series(1, 10000) g;
Once that finishes, the btree-snapshot.sh script should've generated a
bunch of .png files in /tmp/:
~$ ls /tmp/*.png
/tmp/g1.png /tmp/g3.png /tmp/g5.png /tmp/g7.png
/tmp/g2.png /tmp/g4.png /tmp/g6.png /tmp/g8.png
Each shows the structure of the tree, after something changed. It shows
how a page is split, and then the downlink to it is inserted into the
parent as a separate step. I've attached those files in
graphs-example.tar. I view them with "eog /tmp/g*.png", which lets you flip
through the pictures easily.
To install this hack, do the following:
1. Patch the pageinspect contrib module so that it does not lock pages while
it looks at them. (Otherwise you won't be able to snapshot transient states
where a backend is holding pages locked.)
2. Install extensions pageinspect and pgstattuple:
create extension pageinspect;
create extension pgstattuple;
3. Run btree-graphviz2.sql. It creates a bunch of functions.
That's it. Have fun :-)
- Heikki
Attachments:
btree-graph-hack.tar
btree-graphviz2.sql:
CREATE OR REPLACE FUNCTION hex_to_int(hexval varchar) RETURNS integer AS $$
DECLARE
result int;
BEGIN
EXECUTE 'SELECT x''' || hexval || '''::int' INTO result;
RETURN result;
END;
$$ LANGUAGE 'plpgsql' IMMUTABLE STRICT;
create or replace function parsekey(data text) returns integer as $$
declare
i int4;
hitext text;
hiint int4;
begin
hitext := replace(data, ' ', '');
hiint := 0;
for i in 1..8 loop
hiint := hiint * 256 + hex_to_int(right(hitext,2));
hitext := left(hitext, -2);
end loop;
return hiint;
end;
$$ language plpgsql immutable strict;
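The parsekey() helper above decodes the space-separated hex dump that bt_page_items returns, treating the first eight bytes as a little-endian integer. A minimal Python equivalent (the sample hex string in the test is an assumption for illustration, not taken from the thread):

```python
def parsekey(data: str) -> int:
    """Interpret a bt_page_items-style hex dump ('2a 00 00 00 ...') as a
    little-endian integer, mirroring the plpgsql loop above."""
    raw = bytes.fromhex(data.replace(" ", ""))
    # The plpgsql version consumes exactly 8 byte pairs; do the same here.
    return int.from_bytes(raw[:8], "little")
```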
create or replace function parsetid(tid) returns text as $$
select (regexp_matches($1::text, '\((\d+),\d+\)'::text))[1];
$$ language sql immutable strict;
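Likewise, parsetid() just extracts the block-number half of a ctid like '(4,1)'. The same regex in Python:

```python
import re

def parsetid(tid: str) -> str:
    """Extract the block number from a tid string such as '(4,1)',
    using the same pattern as the regexp_matches() call above."""
    m = re.match(r"\((\d+),\d+\)", tid)
    return m.group(1)
```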
create or replace function getgraph(idxname text) returns text as $$
declare
nblocks bigint;
blkno bigint;
i bigint;
t text;
downlink text;
label text;
statsr record;
r record;
hitext text;
hiint int4;
firstdatakey int4;
rightlink int4;
begin
nblocks = pg_relation_size(idxname) / 8192;
t := E'digraph g {\n';
t := t || E' node [shape = record,height=.1];\n ratio="auto"\n';
for blkno in 1..(nblocks-1) loop
select * into statsr from bt_page_stats(idxname, blkno::int4);
-- ignore deleted pages
continue when statsr.type = 'd';
-- extract hikey
if statsr.btpo_next <> 0 then
select data into hitext from bt_page_items(idxname, blkno::int4) where itemoffset = 1;
--hiint := parsekey(hitext);
firstdatakey := 2;
else
-- rightmost
hitext := '';
firstdatakey := 1;
end if;
hitext = left(hitext, 20);
-- print right-link
if statsr.btpo_next <> 0 then
t := t || format(E' \"node%s\" -> \"node%s\" [constraint=false];\n', blkno, statsr.btpo_next);
end if;
label := '';
if statsr.type = 'l' then
label := 'LEAF|<' || hitext;
end if;
if statsr.type = 'i' OR statsr.type = 'r' OR statsr.type = 'e' then
i := 0;
-- print downlinks
for r in select * from bt_page_items(idxname, blkno) where itemoffset >= firstdatakey loop
label := label || left(r.data, 20) || '|';
downlink := parsetid(r.ctid);
t := t || format(E' \"node%s\":f%s -> \"node%s\":f0 [style=bold];\n', blkno, i, downlink);
i := i + 1;
end loop;
label := label || '<' || hiint;
end if;
if statsr.type = 'r' then
label := label || ' (ROOT)';
end if;
if (statsr.btpo_flags & 16) <> 0 then
label := label || ' (HALF_DEAD)';
-- "uplink" to the top of the deleted branch
downlink = parsetid((select ctid from bt_page_items(idxname, blkno::int4) where itemoffset = 1));
if downlink <> '4294967295' then
t := t || format(E' \"node%s\" -> \"node%s\" [style=dotted];\n', blkno, downlink);
end if;
end if;
t := t || format(E' node%s[label = "%s|%s"];\n', blkno, blkno, label);
end loop;
t := t || E'}\n';
return t;
end;
$$ language plpgsql;
/*
\o ~/a.dot
\t on
\pset format unaligned
\pset border 0
select getgraph('i_foo');
\! sed -i 's/+//g' ~/a.dot
\! dot -Tpng -o /tmp/a.png ~/a.dot
\! eog /tmp/a.png
*/
pageinspect-no-locking.patch:
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index e3f3c28..a46d7f2 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -204,7 +204,7 @@ bt_page_stats(PG_FUNCTION_ARGS)
CHECK_RELATION_BLOCK_RANGE(rel, blkno);
buffer = ReadBuffer(rel, blkno);
- LockBuffer(buffer, BUFFER_LOCK_SHARE);
+ //LockBuffer(buffer, BUFFER_LOCK_SHARE);
/* keep compiler quiet */
stat.btpo_prev = stat.btpo_next = InvalidBlockNumber;
@@ -212,7 +212,8 @@ bt_page_stats(PG_FUNCTION_ARGS)
GetBTPageStatistics(blkno, buffer, &stat);
- UnlockReleaseBuffer(buffer);
+ //UnlockReleaseBuffer(buffer);
+ ReleaseBuffer(buffer);
relation_close(rel, AccessShareLock);
/* Build a tuple descriptor for our result type */
@@ -308,7 +309,7 @@ bt_page_items(PG_FUNCTION_ARGS)
CHECK_RELATION_BLOCK_RANGE(rel, blkno);
buffer = ReadBuffer(rel, blkno);
- LockBuffer(buffer, BUFFER_LOCK_SHARE);
+ //LockBuffer(buffer, BUFFER_LOCK_SHARE);
/*
* We copy the page into local storage to avoid holding pin on the
@@ -322,7 +323,8 @@ bt_page_items(PG_FUNCTION_ARGS)
uargs->page = palloc(BLCKSZ);
memcpy(uargs->page, BufferGetPage(buffer), BLCKSZ);
- UnlockReleaseBuffer(buffer);
+ //UnlockReleaseBuffer(buffer);
+ ReleaseBuffer(buffer);
relation_close(rel, AccessShareLock);
uargs->offset = FirstOffsetNumber;
btree-snapshot.sh:
#!/bin/bash
rm -f /tmp/prev.dot
touch /tmp/prev.dot
CNT=1
while true
do
~/pgsql.master/bin/psql -t postgres -c "select getgraph('i_foo')" | sed 's/+//g' > /tmp/a.dot
diff -q /tmp/prev.dot /tmp/a.dot
D=$?
echo $D
if [ $D -ne 0 ]; then
dot -Tpng -o /tmp/g$CNT.png /tmp/a.dot
CNT=$(($CNT + 1))
cp /tmp/a.dot /tmp/prev.dot
fi
sleep 1
done
#| dot -Tpng -o /tmp/a.png
nbtinsert-sleeps.patch:
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 4d60ae0..5262fbe 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -742,6 +742,8 @@ _bt_insertonpg(Relation rel,
bool newitemonleft;
Buffer rbuf;
+ sleep(2);
+
/* Choose the split point */
firstright = _bt_findsplitloc(rel, page,
newitemoff, itemsz,
@@ -754,6 +756,8 @@ _bt_insertonpg(Relation rel,
BufferGetBlockNumber(buf),
BufferGetBlockNumber(rbuf));
+ sleep(2);
+
/*----------
* By here,
*
graphs-example.tar (binary PNG images g1.png through g8.png; contents omitted)
� HAb )H � � 1 �$ ��� �Bw�� ���366644������l� �c P!GGG�����������o����W_e+* �0c P!gg����������PXXXhoo����s�N�Cf *t��m�����+W�����;wnyy9�Q���Z���B�N��v �Bb P�������o����v8 ���$ ��|8����g�}��?���x ���@��<yr�����^�����b�XL]�$RRRv���vP �Bb P���s��Brrr�`XH jED���`;�a!1 ��������?������������M���`;4�a��$ ******Thaa����J<