Race condition in b-tree page deletion

Started by Heikki Linnakangasabout 12 years ago20 messages

hlinnakangas@vmware.com

about 12 years ago

The B-tree page deletion algorithm has a race condition. We don't allow
a page to be deleted if it's the rightmost child of its parent, but that
situation can change after we check for it.

Problem
-------

We check that the page to be deleted is not the rightmost child of its
parent, and then lock its left sibling, the page itself, its right
sibling, and the parent, in that order. However, if the parent page is
split after the check but before acquiring the locks, the target page
might become the rightmost child, if the split happens at the right
place. That leads to an error in vacuum (I reproduced this by setting a
breakpoint in debugger):

ERROR: failed to delete rightmost child 41 of block 3 in index "foo_pkey"

We currently re-check that the page is still the rightmost child, and
throw the above error if it's not. We could easily just give up rather
than throw an error, but that approach doesn't scale to half-dead pages.
To recap, although we don't normally allow deleting the rightmost child,
if the page is the *only* child of its parent, we delete the child page
and mark the parent page as half-dead in one atomic operation. But
before we do that, we check that the parent can later be deleted, by
checking that it in turn is not the rightmost child of the grandparent
(potentially recursing all the way up to the root). But the same
situation can arise there - the grandparent can be split while we're not
holding the locks. We end up with a half-dead page that we cannot delete.

To make things worse, the keyspace of the deleted page has already been
transferred to its right sibling. As the README points out, the keyspace
at the grandparent level is "out-of-whack" until the half-dead page is
deleted, and if enough tuples with keys in the transferred keyspace are
inserted, the page might get split and a downlink might be inserted into
the grandparent that is out-of-order. That might not cause any serious
problem if it's transient (as the README ponders), but is surely bad if
it stays that way.

Solutions
---------

1. The simplest solution I can see is to never delete internal pages,
avoiding the whole concept of half-dead pages. That obviously leads to
some bloat, though.

2. The second-simplest solution I see is to keep locked the whole chain
of pages that will be deleted, and delete all of them as one atomic
WAL-logged operation. Ie. the leaf page, and all the parent pages above
it that will become half-dead, and the parent of the last half-dead page.

3. Another approach would be to get rid of the "can't delete rightmost
child" limitation. We currently have that limitation because it ensures
that we never need to change the high-key of a page. If we delete a page
that is the rightmost child of its parent, we transfer the deleted
keyspace from the parent page to its right sibling. To do that, we need
to update the high key of the parent, as well as the downlink of the
right sibling at the grandparent level. That's a bit complicated,
because updating the high key might require splitting the page.

Still, I think it would be doable: When we delete a page that is the
rightmost child of its parent, we mark the right sibling with an
"out-of-whack" flag. It indicates that the keyspace of that page is not
in sync in the parent level. Searches work fine, as there are no keys in
the keyspace that is out-of-whack, but before allowing any insertions to
the page, the situation needs to be fixed. That is the responsibility of
the next insert to the page that comes along. To fix it, the parent's
left sibling's high key needs to be updated, as well as the parent's
downlink in the grandparent. We can do that in a few discrete steps;
first update the high key, then update the downlink, and finally once
the high key and parent's downlink are in sync with the reality at the
bottom level, clear the "out-of-whack" flag.

On reflection, approach 2 seems best. It's fairly simple, although it
potentially requires holding many pages locked simultaneously. In
practice, B-trees are typically not very deep. We have a limit of 4
full-page images in a single WAL record, but since all the pages are
empty, we can just WAL-log all the information needed to reconstruct the
pages in deleted state without using full-page images. This also might
be back-patchable, although it requires changing the WAL record format,
so it would break replication from higher minor version to lower.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tom Lane

tgl@sss.pgh.pa.us

about 12 years ago

In reply to: Heikki Linnakangas (#1)

Re: Race condition in b-tree page deletion

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

2. The second-simplest solution I see is to keep locked the whole chain
of pages that will be deleted, and delete all of them as one atomic
WAL-logged operation. Ie. the leaf page, and all the parent pages above
it that will become half-dead, and the parent of the last half-dead page.

This would be more tenable if we could put a known limit on the number of
pages to be changed at once. I'm not too awake at the moment, so maybe
this is not possible, but could we simply decide in advance that we won't
propagate page deletion up more than a-small-number of levels? It'd
require allowing a page to remain half-dead until some other vacuum comes
along and fixes it, but that seems OK.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Heikki Linnakangas

hlinnakangas@vmware.com

about 12 years ago

In reply to: Tom Lane (#2)

Re: Race condition in b-tree page deletion

On 09.11.2013 18:24, Tom Lane wrote:

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

2. The second-simplest solution I see is to keep locked the whole chain
of pages that will be deleted, and delete all of them as one atomic
WAL-logged operation. Ie. the leaf page, and all the parent pages above
it that will become half-dead, and the parent of the last half-dead page.

This would be more tenable if we could put a known limit on the number of
pages to be changed at once. I'm not too awake at the moment, so maybe
this is not possible, but could we simply decide in advance that we won't
propagate page deletion up more than a-small-number of levels? It'd
require allowing a page to remain half-dead until some other vacuum comes
along and fixes it, but that seems OK.

I don't feel comfortable leaving pages in half-dead state. Looking back
at the archives, your original design ten years ago did exactly that
(/messages/by-id/8281.1045089764@sss.pgh.pa.us),
but that turned out to be a bad idea
(http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=70ce5c908202ada7616f7afded8a91bbf2742471).
Mind you, even if we check that the half-dead page at some upper level
is eventually deletable because it's not the rightmost child, it might
become the rightmost child if we don't hold the lock so that the next
vacuum cannot delete it, and we're back to square one.

We could just punt if more than X pages would need to be changed. That
would mean that we never delete pages at the top (h - X) levels of the
tree. In practice that should be fine if X is high enough.
As a data point, GIN list page deletion holds 16 pages locked at once
(GIN_NDELETE_AT_ONCE). Normally, a 16-level deep B-tree is pretty huge.
As another data point, in the worst case the keys are so wide that only
2 keys fit on each B-tree page. With that assumption, the tree can be at
most 32 levels deep if you just insert into it with no deletions, since
MaxBlockNumber ~= 2^32 (I may be off by one in either direction, not
sure). Deletions make it more complicated, but I would be pretty
surprised if you can construct a B-tree tallers than, say 40 levels.

Things gets tricky if shared_buffers is very small; with
shared_buffers=16, you certainly can't hold more than 16 buffers pinned
at once. (in fact, I think ginfast.c already has a problem with that...)

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Heikki Linnakangas

hlinnakangas@vmware.com

about 12 years ago

In reply to: Heikki Linnakangas (#3)

Re: Race condition in b-tree page deletion

On 09.11.2013 18:49, Heikki Linnakangas wrote:

We could just punt if more than X pages would need to be changed. That
would mean that we never delete pages at the top (h - X) levels of the
tree. In practice that should be fine if X is high enough.
As a data point, GIN list page deletion holds 16 pages locked at once
(GIN_NDELETE_AT_ONCE). Normally, a 16-level deep B-tree is pretty huge.
As another data point, in the worst case the keys are so wide that only
2 keys fit on each B-tree page. With that assumption, the tree can be at
most 32 levels deep if you just insert into it with no deletions, since
MaxBlockNumber ~= 2^32 (I may be off by one in either direction, not
sure). Deletions make it more complicated, but I would be pretty
surprised if you can construct a B-tree tallers than, say 40 levels.

On further thought, it's worse than that. To delete a page, you need to
lock the left and right siblings, so you need 3 pages locked per each
level you delete...

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Heikki Linnakangas

hlinnakangas@vmware.com

about 12 years ago

In reply to: Heikki Linnakangas (#4)

Re: Race condition in b-tree page deletion

On 09.11.2013 19:18, Heikki Linnakangas wrote:

On 09.11.2013 18:49, Heikki Linnakangas wrote:

We could just punt if more than X pages would need to be changed. That
would mean that we never delete pages at the top (h - X) levels of the
tree. In practice that should be fine if X is high enough.
As a data point, GIN list page deletion holds 16 pages locked at once
(GIN_NDELETE_AT_ONCE). Normally, a 16-level deep B-tree is pretty huge.
As another data point, in the worst case the keys are so wide that only
2 keys fit on each B-tree page. With that assumption, the tree can be at
most 32 levels deep if you just insert into it with no deletions, since
MaxBlockNumber ~= 2^32 (I may be off by one in either direction, not
sure). Deletions make it more complicated, but I would be pretty
surprised if you can construct a B-tree tallers than, say 40 levels.

On further thought, it's worse than that. To delete a page, you need to
lock the left and right siblings, so you need 3 pages locked per each
level you delete...

On further further thought, we don't need to unlink the pages
immediately. It's enough to mark them as half-dead and remove the
downlink to the upmost half-dead page. Marking a page as half-dead is as
good as deleting it outright as far as searches and insertions are
concerned. Actually unlinking the pages from the left and right siblings
can be done at any later time, and doesn't need to be done in any
particular order.

So, my original musings about the number of concurrent locks needed
still holds.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Robert Haas

robertmhaas@gmail.com

about 12 years ago

In reply to: Heikki Linnakangas (#5)

Re: Race condition in b-tree page deletion

On Sat, Nov 9, 2013 at 12:40 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 09.11.2013 19:18, Heikki Linnakangas wrote:

On 09.11.2013 18:49, Heikki Linnakangas wrote:

We could just punt if more than X pages would need to be changed. That
would mean that we never delete pages at the top (h - X) levels of the
tree. In practice that should be fine if X is high enough.
As a data point, GIN list page deletion holds 16 pages locked at once
(GIN_NDELETE_AT_ONCE). Normally, a 16-level deep B-tree is pretty huge.
As another data point, in the worst case the keys are so wide that only
2 keys fit on each B-tree page. With that assumption, the tree can be at
most 32 levels deep if you just insert into it with no deletions, since
MaxBlockNumber ~= 2^32 (I may be off by one in either direction, not
sure). Deletions make it more complicated, but I would be pretty
surprised if you can construct a B-tree tallers than, say 40 levels.

On further thought, it's worse than that. To delete a page, you need to
lock the left and right siblings, so you need 3 pages locked per each
level you delete...

On further further thought, we don't need to unlink the pages immediately.
It's enough to mark them as half-dead and remove the downlink to the upmost
half-dead page. Marking a page as half-dead is as good as deleting it
outright as far as searches and insertions are concerned. Actually unlinking
the pages from the left and right siblings can be done at any later time,
and doesn't need to be done in any particular order.

So, my original musings about the number of concurrent locks needed still
holds.

I think we've tried pretty hard to avoid algorithms where the maximum
number of lwlocks that must be held at one time is not a constant, and
I think we're in for a bad time of it if we start to deviate from that
principal. I'm not sure what to do about this problem, but I think
locking N levels of the tree at once, where N can be as large as the
tree is deep, is probably a bad plan, whether the number of locks
required is N or 3N.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Heikki Linnakangas

hlinnakangas@vmware.com

about 12 years ago

In reply to: Robert Haas (#6)

Re: Race condition in b-tree page deletion

On 10.11.2013 01:47, Robert Haas wrote:

I think we've tried pretty hard to avoid algorithms where the maximum
number of lwlocks that must be held at one time is not a constant, and
I think we're in for a bad time of it if we start to deviate from that
principal. I'm not sure what to do about this problem, but I think
locking N levels of the tree at once, where N can be as large as the
tree is deep, is probably a bad plan, whether the number of locks
required is N or 3N.

I think I found a solution that accomplishes that. It's actually not
that complicated:

Like we currently do, first climb up the tree to check that it's safe to
delete, ie. the downlink in the first non-empty parent is not the
rightmost entry. But when we reach the level where the parent is
non-empty - I'll call that the "topmost" parent - we keep that page
locked. The leaf page is kept locked while we climb.

This is enough to plug the race condition. As long as we hold a lock on
the topmost parent containing the downlink to the branch of pages we're
about to delete, it cannot become the rightmost entry. And as long as we
hold a lock on the leaf page, no new insertions can happen on any of the
internal pages in the branch, as insertions to internal pages only
happen when a child is split. However, the rest of the algorithm needs
to be slightly modified, as we cannot re-lock the lower-level pages
until we release the lock on the topmost page, to avoid deadlock.

So at this point, we hold two locks: the leaf page, and the topmost
parent containing the downlink to the branch we're deleting. Next, we
remove the downlink from the topmost parent, and mark the leaf page as
half-dead in one atomic operation. Also, a pointer to the highest
internal page in the branch we're deleting - the one the removed
downlink pointed to - is put on the leaf page. We can now release the locks.

At this point, searches and inserts work fine. The leaf page has been
marked as half-dead, so any insertions to the deleted page's keyspace
will go to its right sibling. The downlink is to the top of the branch
is gone, so even if the right sibling is split many times, any keys in
the transferred keyspace that propagate to the higher levels won't be
out-of-order.

All that is left is to unlink the all the lingering pages in the branch
we're deleting from their left and right siblings. This can be done at
any later time, and if we error out or crash for some reason, next
vacuum that comes along can finish the job. This is done one level at a
time. Lock the leaf page, and the internal page the leaf page points to,
and the internal page's left and right siblings (in the right order, not
this order). Change the left and right sibling's right- and left-links,
mark the internal page as deleted, and update the pointer in the leaf
page to point to the child of the deleted internal page. Then recurse to
the child, until we reach the leaf level.

This has the nice extra property that we don't need the
incomplete-action tracking in WAL recovery. I'd like to get rid of that
anyway.

I'm not sure what to do about stable branches. This could be
back-patched, with the caveat that this introduces new WAL record types
so standbys need to be upgraded before the master. But given the lack of
field reports in the ten years this race condition has existed, I'm not
sure it's worth the hassle.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Robert Haas

robertmhaas@gmail.com

about 12 years ago

In reply to: Heikki Linnakangas (#7)

Re: Race condition in b-tree page deletion

On Mon, Nov 11, 2013 at 3:03 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 10.11.2013 01:47, Robert Haas wrote:

I think we've tried pretty hard to avoid algorithms where the maximum
number of lwlocks that must be held at one time is not a constant, and
I think we're in for a bad time of it if we start to deviate from that
principal. I'm not sure what to do about this problem, but I think
locking N levels of the tree at once, where N can be as large as the
tree is deep, is probably a bad plan, whether the number of locks
required is N or 3N.

I think I found a solution that accomplishes that. It's actually not that
complicated:

Like we currently do, first climb up the tree to check that it's safe to
delete, ie. the downlink in the first non-empty parent is not the rightmost
entry. But when we reach the level where the parent is non-empty - I'll call
that the "topmost" parent - we keep that page locked. The leaf page is kept
locked while we climb.

This is enough to plug the race condition. As long as we hold a lock on the
topmost parent containing the downlink to the branch of pages we're about to
delete, it cannot become the rightmost entry. And as long as we hold a lock
on the leaf page, no new insertions can happen on any of the internal pages
in the branch, as insertions to internal pages only happen when a child is
split. However, the rest of the algorithm needs to be slightly modified, as
we cannot re-lock the lower-level pages until we release the lock on the
topmost page, to avoid deadlock.

So at this point, we hold two locks: the leaf page, and the topmost parent
containing the downlink to the branch we're deleting. Next, we remove the
downlink from the topmost parent, and mark the leaf page as half-dead in one
atomic operation. Also, a pointer to the highest internal page in the branch
we're deleting - the one the removed downlink pointed to - is put on the
leaf page. We can now release the locks.

At this point, searches and inserts work fine. The leaf page has been marked
as half-dead, so any insertions to the deleted page's keyspace will go to
its right sibling. The downlink is to the top of the branch is gone, so even
if the right sibling is split many times, any keys in the transferred
keyspace that propagate to the higher levels won't be out-of-order.

All that is left is to unlink the all the lingering pages in the branch
we're deleting from their left and right siblings. This can be done at any
later time, and if we error out or crash for some reason, next vacuum that
comes along can finish the job. This is done one level at a time. Lock the
leaf page, and the internal page the leaf page points to, and the internal
page's left and right siblings (in the right order, not this order). Change
the left and right sibling's right- and left-links, mark the internal page
as deleted, and update the pointer in the leaf page to point to the child of
the deleted internal page. Then recurse to the child, until we reach the
leaf level.

This has the nice extra property that we don't need the incomplete-action
tracking in WAL recovery. I'd like to get rid of that anyway.

I am not very knowledgeable of this code, but at least with my limited
knowledge I can't spot any problems with that approach. The only
thing that seems a little unfortunate is that the next vacuum may need
to finish cleaning things up; I thought at one point you were trying
to get rid of the amvacuumcleanup stuff. But maybe I'm
misremembering, and anyway it still seems like a step forward.

I'm not sure what to do about stable branches. This could be back-patched,
with the caveat that this introduces new WAL record types so standbys need
to be upgraded before the master. But given the lack of field reports in the
ten years this race condition has existed, I'm not sure it's worth the
hassle.

In the absence of at least one (and maybe several) reports from the
field, -1 for back-patching this. At this point, it seems like far
more people will be confused by the need to upgrade things in order
than will benefit from the fix.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Heikki Linnakangas

hlinnakangas@vmware.com

about 12 years ago

In reply to: Robert Haas (#8)

1 attachment(s)

Re: Race condition in b-tree page deletion

On 12.11.2013 17:13, Robert Haas wrote:

On Mon, Nov 11, 2013 at 3:03 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

All that is left is to unlink the all the lingering pages in the branch
we're deleting from their left and right siblings. This can be done at any
later time, and if we error out or crash for some reason, next vacuum that
comes along can finish the job. This is done one level at a time. Lock the
leaf page, and the internal page the leaf page points to, and the internal
page's left and right siblings (in the right order, not this order). Change
the left and right sibling's right- and left-links, mark the internal page
as deleted, and update the pointer in the leaf page to point to the child of
the deleted internal page. Then recurse to the child, until we reach the
leaf level.

This has the nice extra property that we don't need the incomplete-action
tracking in WAL recovery. I'd like to get rid of that anyway.

I am not very knowledgeable of this code, but at least with my limited
knowledge I can't spot any problems with that approach. The only
thing that seems a little unfortunate is that the next vacuum may need
to finish cleaning things up; I thought at one point you were trying
to get rid of the amvacuumcleanup stuff. But maybe I'm
misremembering, and anyway it still seems like a step forward.

Ah no, it only requires the next vacuum to clean up, if the first vacuum
is interrupted for some reason. Normally, all of the steps are performed
one after another in the same vacuum process, and half-dead pages will
only be seen by backends that look at the affected pages at the same
instant. But in case of an error or crash, the next vacuum will pick up
where the previous one left off.

Here's a patch implementing this scheme. One noteworthy detail if you're
familiar with the old scheme: in this patch, only leaf pages are marked
as half-dead. Never internal pages. This is the opposite of what we do
currently; currently only internal pages can be marked half-dead, never
leaf pages. This also gives us a chance to tell apart half-dead pages
lingering after pg_upgrade from an older version, and pages marked as
half-dead with the new code. The incomplete-action mechanism should have
ensured that the former don't exist, but in case something went wrong
some time before the upgrade and there is such a page in the tree after
all, the patch gives a LOG message about it, advising to reindex.
Searches will work fine in any case, but insertions to unfortunate pages
might cause page splits and out-of-order keys in the upper levels, if
the pre-upgrade half-dead pages are not cleaned up by reindexing.

- Heikki

Attachments:

fix-btree-page-deletion-2.patchtext/x-diff; name=fix-btree-page-deletion-2.patchDownload

*** a/src/backend/access/nbtree/README
--- b/src/backend/access/nbtree/README
***************
*** 168,173 **** parent item does still exist and can't have been deleted.  Also, because
--- 168,176 ----
  we are matching downlink page numbers and not data keys, we don't have any
  problem with possibly misidentifying the parent item.
  
+ Page Deletion
+ -------------
+ 
  We consider deleting an entire page from the btree only when it's become
  completely empty of items.  (Merging partly-full pages would allow better
  space reuse, but it seems impractical to move existing data items left or
***************
*** 176,193 **** might miss the items if so.)  Also, we *never* delete the rightmost page
  on a tree level (this restriction simplifies the traversal algorithms, as
  explained below).
  
! To delete an empty page, we acquire write lock on its left sibling (if
! any), the target page itself, the right sibling (there must be one), and
! the parent page, in that order.  The parent page must be found using the
! same type of search as used to find the parent during an insertion split.
! Then we update the side-links in the siblings, mark the target page
! deleted, and remove the downlink from the parent, as well as the parent's
! upper bounding key for the target (the one separating it from its right
! sibling).  This causes the target page's key space to effectively belong
! to its right sibling.  (Neither the left nor right sibling pages need to
! change their "high key" if any; so there is no problem with possibly not
! having enough space to replace a high key.)  The side-links in the target
! page are not changed.
  
  (Note: Lanin and Shasha prefer to make the key space move left, but their
  argument for doing so hinges on not having left-links, which we have
--- 179,198 ----
  on a tree level (this restriction simplifies the traversal algorithms, as
  explained below).
  
! Page deletion always begins at the leaf level. An internal page can only be
! deleted as part of a branch leading to a leaf page, where each internal page
! has only one child and that child is also to be deleted.
! 
! Page deletion is a two-stage process. First, the parent page must be found
! using the same type of search as used to find the parent during an insertion
! split. We lock the target and the parent pages, change the target's downlink
! to point to the right sibling, and remove its old downlink. This causes the
! target page's key space to effectively belong to its right sibling.  (Neither
! the left nor right sibling pages need to change their "high key" if any; so
! there is no problem with possibly not having enough space to replace a high
! key.)  At the same time, we mark the target page as half-dead, which causes
! any subsequent searches to ignore it and move right (or left, in a backwards
! scan).
  
  (Note: Lanin and Shasha prefer to make the key space move left, but their
  argument for doing so hinges on not having left-links, which we have
***************
*** 199,229 **** the same parent --- otherwise, the parent's key space assignment changes
  too, meaning we'd have to make bounding-key updates in its parent, and
  perhaps all the way up the tree.  Since we can't possibly do that
  atomically, we forbid this case.  That means that the rightmost child of a
! parent node can't be deleted unless it's the only remaining child.
! 
! When we delete the last remaining child of a parent page, we mark the
! parent page "half-dead" as part of the atomic update that deletes the
! child page.  This implicitly transfers the parent's key space to its right
! sibling (which it must have, since we never delete the overall-rightmost
! page of a level).  Searches ignore the half-dead page and immediately move
! right.  We need not worry about insertions into a half-dead page --- insertions
! into upper tree levels happen only as a result of splits of child pages, and
! the half-dead page no longer has any children that could split.  Therefore
! the page stays empty even when we don't have lock on it, and we can complete
! its deletion in a second atomic action.
! 
! The notion of a half-dead page means that the key space relationship between
! the half-dead page's level and its parent's level may be a little out of
! whack: key space that appears to belong to the half-dead page's parent on the
! parent level may really belong to its right sibling.  To prevent any possible
! problems, we hold lock on the deleted child page until we have finished
! deleting any now-half-dead parent page(s).  This prevents any insertions into
! the transferred keyspace until the operation is complete.  The reason for
! doing this is that a sufficiently large number of insertions into the
! transferred keyspace, resulting in multiple page splits, could propagate keys
! from that keyspace into the parent level, resulting in transiently
! out-of-order keys in that level.  It is thought that that wouldn't cause any
! serious problem, but it seems too risky to allow.
  
  A deleted page cannot be reclaimed immediately, since there may be other
  processes waiting to reference it (ie, search processes that just left the
--- 204,246 ----
  too, meaning we'd have to make bounding-key updates in its parent, and
  perhaps all the way up the tree.  Since we can't possibly do that
  atomically, we forbid this case.  That means that the rightmost child of a
! parent node can't be deleted unless it's the only remaining child, in which
! case we will delete the parent too.
! 
! When we're about to delete the last remaining child of a parent page, we
! leave the parent page alone and remove the downlink to the parent page
! instead, in the grandparent. If it's the last child of the grandparent too,
! we recurse up while keeping the leaf page locked, until we find a parent with
! more than one child. We don't need to hold locks on the intermediate pages
! between the leaf and the final parent -- insertions into upper tree levels
! happen only as a result of splits of child pages, and we never split a
! half-dead leaf page. Removing the downlink to the top of the to-be-deleted
! branch effectively transfers the key space to the right sibling for all the
! intermediate levels too, in one atomic operation. A concurrent search might
! still visit the intermediate pages, but it will move right when it reaches
! the half-dead page at the leaf level.
! 
! When we mark the leaf page as half-dead, we also store a link to the top
! of the to-be-deleted branch, ie. the block number of the page whose downlink
! we removed. This is needed in the second phase, to unlink the pages from
! their siblings.
! 
! The second phase is to unlink the half-dead leaf page, and any internal pages
! in the to-be-deleted branch. If the half-dead page still has a parent pointing
! to it, ie. if we're removing a whole branch, we unlink the topmost remaining
! parent first. We use the link we stashed in the half-dead leaf page to get its
! block number. That is the target of the unlink operation. We first lock the
! leaf page (unless we're removing the leaf), then the left sibling (if any) of
! the target, the target page itself, and its right sibling (there must be one)
! in that order. Then we update the side-links in the siblings, and mark the
! target page deleted. If the target was the leaf page, we're done. Otherwise we
! update the top-link in leaf page with the next page in the branch we're
! deleting, ie. the only child of the target page. The side-links in the target
! page are not changed.
! 
! This is repeated, chewing pages in the branch top-down, until all the internal
! pages above the half-dead leaf page have been removed. Finally, the half-dead
! leaf page itself is removed.
  
  A deleted page cannot be reclaimed immediately, since there may be other
  processes waiting to reference it (ie, search processes that just left the
***************
*** 396,407 **** metapage update (of the "fast root" link) is performed and logged as part
  of the insertion into the parent level.  When splitting the root page, the
  metapage update is handled as part of the "new root" action.
  
! A page deletion is logged as a single WAL entry covering all four
! required page updates (target page, left and right siblings, and parent)
! as an atomic event.  (Any required fast-root link update is also part
! of the WAL entry.)  If the parent page becomes half-dead but is not
! immediately deleted due to a subsequent crash, there is no loss of
! consistency, and the empty page will be picked up by the next VACUUM.
  
  Scans during Recovery
  ---------------------
--- 413,423 ----
  of the insertion into the parent level.  When splitting the root page, the
  metapage update is handled as part of the "new root" action.
  
! Each step in page deletion are logged as separate WAL entries: marking the
! leaf as half-dead and removing the downlink is one record, and unlinking a
! page is a second record. If vacuum is interrupted for some reason, or the
! system crashes, the tree is consistent for searches and insertions. The next
! VACUUM will find the half-dead leaf page and continue the deletion.
  
  Scans during Recovery
  ---------------------
*** a/src/backend/access/nbtree/nbtpage.c
--- b/src/backend/access/nbtree/nbtpage.c
***************
*** 31,36 ****
--- 31,44 ----
  #include "utils/inval.h"
  #include "utils/snapmgr.h"
  
+ static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
+ static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
+ 						 bool *rightsib_empty);
+ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
+ 					   BTStack stack, Buffer *topparent, OffsetNumber *topoff,
+ 					   BlockNumber *target, BlockNumber *rightsib);
+ static void  _bt_log_reuse_page(Relation rel, BlockNumber blkno,
+ 				   TransactionId latestRemovedXid);
  
  /*
   *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
***************
*** 954,977 **** _bt_delitems_delete(Relation rel, Buffer buf,
  }
  
  /*
!  * Subroutine to pre-check whether a page deletion is safe, that is, its
!  * parent page would be left in a valid or deletable state.
   *
!  * "target" is the page we wish to delete, and "stack" is a search stack
   * leading to it (approximately).  Note that we will update the stack
   * entry(s) to reflect current downlink positions --- this is harmless and
!  * indeed saves later search effort in _bt_pagedel.
   *
!  * Note: it's OK to release page locks after checking, because a safe
!  * deletion can't become unsafe due to concurrent activity.  A non-rightmost
!  * page cannot become rightmost unless there's a concurrent page deletion,
!  * but only VACUUM does page deletion and we only allow one VACUUM on an index
!  * at a time.  An only child could acquire a sibling (of the same parent) only
!  * by being split ... but that would make it a non-rightmost child so the
!  * deletion is still safe.
   */
  static bool
! _bt_parent_deletion_safe(Relation rel, BlockNumber target, BTStack stack)
  {
  	BlockNumber parent;
  	OffsetNumber poffset,
--- 962,997 ----
  }
  
  /*
!  * Subroutine to find the parent of the branch we're deleting.  This climbs
!  * up the tree until it finds a page with more than one child, ie. a page
!  * that will not be totally emptied by the deletion.  The chain of pages below
!  * it, with one downlink each, will be part of the branch that we need to
!  * delete.
   *
!  * If we cannot remove the downlink from the parent, because it's the
!  * rightmost entry, returns false.  On success, *topparent and *topoff are set
!  * to the buffer holding the parent, and the offsetnumber of the downlink in
!  * it.  The parent buffer is write-locked, the caller is responsible for
!  * releasing it when done.  *target is set to the topmost page in the branch
!  * to-be-deleted, ie. the page whose downlink *topparent / *topoff point to,
!  * and *rightsib to its right sibling.
!  *
!  * "child" is the leaf page we wish to delete, and "stack" is a search stack
   * leading to it (approximately).  Note that we will update the stack
   * entry(s) to reflect current downlink positions --- this is harmless and
!  * indeed saves later search effort in _bt_pagedel.  The caller should
!  * initialize *target and *rightsib to the leaf page and its right sibling.
   *
!  * Note: it's OK to release page locks on any internal pages between the leaf
!  * and *topparent, because a safe deletion can't become unsafe due to
!  * concurrent activity.  An internal page can only acquire an entry if the
!  * child is split, but that cannot happen as long as we hold a lock on the
!  * leaf.
   */
  static bool
! _bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
! 					   Buffer *topparent, OffsetNumber *topoff,
! 					   BlockNumber *target, BlockNumber *rightsib)
  {
  	BlockNumber parent;
  	OffsetNumber poffset,
***************
*** 980,999 **** _bt_parent_deletion_safe(Relation rel, BlockNumber target, BTStack stack)
  	Page		page;
  	BTPageOpaque opaque;
  
- 	/*
- 	 * In recovery mode, assume the deletion being replayed is valid.  We
- 	 * can't always check it because we won't have a full search stack, and we
- 	 * should complain if there's a problem, anyway.
- 	 */
- 	if (InRecovery)
- 		return true;
- 
  	/* Locate the parent's downlink (updating the stack entry if needed) */
! 	ItemPointerSet(&(stack->bts_btentry.t_tid), target, P_HIKEY);
  	pbuf = _bt_getstackbuf(rel, stack, BT_READ);
  	if (pbuf == InvalidBuffer)
  		elog(ERROR, "failed to re-find parent key in index \"%s\" for deletion target page %u",
! 			 RelationGetRelationName(rel), target);
  	parent = stack->bts_blkno;
  	poffset = stack->bts_offset;
  
--- 1000,1011 ----
  	Page		page;
  	BTPageOpaque opaque;
  
  	/* Locate the parent's downlink (updating the stack entry if needed) */
! 	ItemPointerSet(&(stack->bts_btentry.t_tid), child, P_HIKEY);
  	pbuf = _bt_getstackbuf(rel, stack, BT_READ);
  	if (pbuf == InvalidBuffer)
  		elog(ERROR, "failed to re-find parent key in index \"%s\" for deletion target page %u",
! 			 RelationGetRelationName(rel), child);
  	parent = stack->bts_blkno;
  	poffset = stack->bts_offset;
  
***************
*** 1021,1028 **** _bt_parent_deletion_safe(Relation rel, BlockNumber target, BTStack stack)
  				return false;
  			}
  
  			_bt_relbuf(rel, pbuf);
! 			return _bt_parent_deletion_safe(rel, parent, stack->bts_parent);
  		}
  		else
  		{
--- 1033,1044 ----
  				return false;
  			}
  
+ 			*target = child;
+ 			*rightsib = opaque->btpo_next;
+ 
  			_bt_relbuf(rel, pbuf);
! 			return _bt_lock_branch_parent(rel, parent, stack->bts_parent,
! 										  topparent, topoff, target, rightsib);
  		}
  		else
  		{
***************
*** 1034,1040 **** _bt_parent_deletion_safe(Relation rel, BlockNumber target, BTStack stack)
  	else
  	{
  		/* Not rightmost child, so safe to delete */
! 		_bt_relbuf(rel, pbuf);
  		return true;
  	}
  }
--- 1050,1057 ----
  	else
  	{
  		/* Not rightmost child, so safe to delete */
! 		*topparent = pbuf;
! 		*topoff = poffset;
  		return true;
  	}
  }
***************
*** 1057,1063 **** _bt_parent_deletion_safe(Relation rel, BlockNumber target, BTStack stack)
   * to avoid duplicated search effort.
   *
   * Returns the number of pages successfully deleted (zero if page cannot
!  * be deleted now; could be more than one if parent pages were deleted too).
   *
   * NOTE: this leaks memory.  Rather than trying to clean up everything
   * carefully, it's better to run it in a temp context that can be reset
--- 1074,1081 ----
   * to avoid duplicated search effort.
   *
   * Returns the number of pages successfully deleted (zero if page cannot
!  * be deleted now; could be more than one if parent or sibling pages were
!  * deleted too).
   *
   * NOTE: this leaks memory.  Rather than trying to clean up everything
   * carefully, it's better to run it in a temp context that can be reset
***************
*** 1066,1102 **** _bt_parent_deletion_safe(Relation rel, BlockNumber target, BTStack stack)
  int
  _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
  {
! 	int			result;
! 	BlockNumber target,
! 				leftsib,
! 				rightsib,
! 				parent;
! 	OffsetNumber poffset,
! 				maxoff;
! 	uint32		targetlevel,
! 				ilevel;
! 	ItemId		itemid;
! 	IndexTuple	targetkey,
! 				itup;
! 	ScanKey		itup_scankey;
! 	Buffer		lbuf,
! 				rbuf,
! 				pbuf;
! 	bool		parent_half_dead;
! 	bool		parent_one_child;
  	bool		rightsib_empty;
- 	Buffer		metabuf = InvalidBuffer;
- 	Page		metapg = NULL;
- 	BTMetaPageData *metad = NULL;
  	Page		page;
  	BTPageOpaque opaque;
  
  	/*
  	 * We can never delete rightmost pages nor root pages.	While at it, check
  	 * that page is not already deleted and is empty.
  	 */
- 	page = BufferGetPage(buf);
- 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
  	if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
  		P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page))
  	{
--- 1084,1128 ----
  int
  _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
  {
! 	int			ndeleted = 0;
! 	BlockNumber rightsib;
  	bool		rightsib_empty;
  	Page		page;
  	BTPageOpaque opaque;
  
+ retry:
+ 	page = BufferGetPage(buf);
+ 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ 
+ 	/*
+ 	 * Internal pages are never deleted directly, only as part of deleting
+ 	 * the whole branch all the way down to leaf level.
+ 	 */
+ 	if (!P_ISLEAF(opaque))
+ 	{
+ 		/*
+ 		 * Pre-9.4 page deletion only marked internal pages as half-dead, but
+ 		 * now we only use that flag on leaf pages. The old algorithm was
+ 		 * never supposed to leave half-dead pages in the tree, it was just
+ 		 * a transient state, but it was nevertheless possible in error
+ 		 * scenarios. We don't know how to deal with them here. They are
+ 		 * harmless as far as searches are considered, but inserts into the
+ 		 * deleted keyspace could add out-of-order downlinks in the upper
+ 		 * levels. Log a notice, hopefully the admin will notice and reindex.
+ 		 */
+ 		if (P_ISHALFDEAD(opaque))
+ 			ereport(LOG,
+ 					(errcode(ERRCODE_INDEX_CORRUPTED),
+ 					 errmsg("index \"%s\" contains a half-dead internal page",
+ 							RelationGetRelationName(rel)),
+ 					 errhint("This can be caused by an interrupt VACUUM in version 9.3 or older, before upgrade. Please REINDEX it.")));
+ 		return 0;
+ 	}
+ 
  	/*
  	 * We can never delete rightmost pages nor root pages.	While at it, check
  	 * that page is not already deleted and is empty.
  	 */
  	if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
  		P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page))
  	{
***************
*** 1108,1127 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
  	}
  
  	/*
  	 * Save info about page, including a copy of its high key (it must have
  	 * one, being non-rightmost).
  	 */
! 	target = BufferGetBlockNumber(buf);
! 	targetlevel = opaque->btpo.level;
! 	leftsib = opaque->btpo_prev;
  	itemid = PageGetItemId(page, P_HIKEY);
  	targetkey = CopyIndexTuple((IndexTuple) PageGetItem(page, itemid));
  
  	/*
! 	 * To avoid deadlocks, we'd better drop the target page lock before going
  	 * further.
  	 */
! 	_bt_relbuf(rel, buf);
  
  	/*
  	 * We need an approximate pointer to the page's parent page.  We use the
--- 1134,1226 ----
  	}
  
  	/*
+ 	 * First, remove downlink pointing to the page (or a parent of the page,
+ 	 * if we are going to delete a taller branch), and mark the page as
+ 	 * half-dead.
+ 	 */
+ 	if (!P_ISHALFDEAD(opaque))
+ 	{
+ 		if (!_bt_mark_page_halfdead(rel, buf, stack))
+ 		{
+ 			_bt_relbuf(rel, buf);
+ 			return 0;
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * Then unlink it from its siblings. Each call to _bt_unlink_halfdead_pages
+ 	 * unlinks a single page from the branch, making it shallower. Iterate
+ 	 * until the leaf page is gone.
+ 	 */
+ 	rightsib_empty = false;
+ 	while (P_ISHALFDEAD(opaque))
+ 	{
+ 		if (_bt_unlink_halfdead_page(rel, buf, &rightsib_empty))
+ 			ndeleted++;
+ 		else
+ 			break;
+ 	}
+ 
+ 	rightsib = opaque->btpo_next;
+ 
+ 	_bt_relbuf(rel, buf);
+ 
+ 	/*
+ 	 * The page has now been deleted. If its right sibling is completely
+ 	 * empty, it's possible that the reason we haven't deleted it earlier is
+ 	 * that it was the rightmost child of the parent. Now that we removed the
+ 	 * downlink for this page, the right sibling might now be the only child
+ 	 * of the parent, and could be removed. It would be picked up by the next
+ 	 * vacuum anyway, but might as well try to remove it now.
+ 	 *
+ 	 * XXX: this could recurse very deeply. Turn this into iteration to
+ 	 * avoid running out of stack.
+ 	 */
+ 	if (rightsib_empty)
+ 	{
+ 		buf = _bt_getbuf(rel, rightsib, BT_WRITE);
+ 		goto retry;
+ 	}
+ 
+ 	return ndeleted;
+ }
+ 
+ static bool
+ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
+ {
+ 	BlockNumber	leafblkno;
+ 	BlockNumber leafrightsib;
+ 	BlockNumber	target;
+ 	BlockNumber rightsib;
+ 	ItemId		itemid;
+ 	Page		page;
+ 	BTPageOpaque opaque;
+ 	Buffer		topparent;
+ 	OffsetNumber topoff;
+ 	OffsetNumber nextoffset;
+ 	IndexTuple	itup;
+ 	IndexTuple	targetkey;
+ 
+ 	page = BufferGetPage(leafbuf);
+ 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ 
+ 	Assert(!P_RIGHTMOST(opaque) && !P_ISROOT(opaque) && !P_ISDELETED(opaque) &&
+ 		   !P_ISHALFDEAD(opaque) && P_ISLEAF(opaque) &&
+ 		   P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page));
+ 
+ 	/*
  	 * Save info about page, including a copy of its high key (it must have
  	 * one, being non-rightmost).
  	 */
! 	leafblkno = BufferGetBlockNumber(leafbuf);
  	itemid = PageGetItemId(page, P_HIKEY);
  	targetkey = CopyIndexTuple((IndexTuple) PageGetItem(page, itemid));
  
  	/*
! 	 * To avoid deadlocks, we'd better drop the leaf page lock before going
  	 * further.
  	 */
! 	LockBuffer(leafbuf, BUFFER_LOCK_UNLOCK);
  
  	/*
  	 * We need an approximate pointer to the page's parent page.  We use the
***************
*** 1133,1201 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
  	 */
  	if (stack == NULL)
  	{
! 		if (!InRecovery)
! 		{
! 			/* we need an insertion scan key to do our search, so build one */
! 			itup_scankey = _bt_mkscankey(rel, targetkey);
! 			/* find the leftmost leaf page containing this key */
! 			stack = _bt_search(rel, rel->rd_rel->relnatts, itup_scankey, false,
! 							   &lbuf, BT_READ);
! 			/* don't need a pin on that either */
! 			_bt_relbuf(rel, lbuf);
  
! 			/*
! 			 * If we are trying to delete an interior page, _bt_search did
! 			 * more than we needed.  Locate the stack item pointing to our
! 			 * parent level.
! 			 */
! 			ilevel = 0;
! 			for (;;)
! 			{
! 				if (stack == NULL)
! 					elog(ERROR, "not enough stack items");
! 				if (ilevel == targetlevel)
! 					break;
! 				stack = stack->bts_parent;
! 				ilevel++;
! 			}
! 		}
! 		else
! 		{
! 			/*
! 			 * During WAL recovery, we can't use _bt_search (for one reason,
! 			 * it might invoke user-defined comparison functions that expect
! 			 * facilities not available in recovery mode).	Instead, just set
! 			 * up a dummy stack pointing to the left end of the parent tree
! 			 * level, from which _bt_getstackbuf will walk right to the parent
! 			 * page.  Painful, but we don't care too much about performance in
! 			 * this scenario.
! 			 */
! 			pbuf = _bt_get_endpoint(rel, targetlevel + 1, false);
! 			stack = (BTStack) palloc(sizeof(BTStackData));
! 			stack->bts_blkno = BufferGetBlockNumber(pbuf);
! 			stack->bts_offset = InvalidOffsetNumber;
! 			/* bts_btentry will be initialized below */
! 			stack->bts_parent = NULL;
! 			_bt_relbuf(rel, pbuf);
! 		}
  	}
  
  	/*
  	 * We cannot delete a page that is the rightmost child of its immediate
  	 * parent, unless it is the only child --- in which case the parent has to
  	 * be deleted too, and the same condition applies recursively to it. We
! 	 * have to check this condition all the way up before trying to delete. We
! 	 * don't need to re-test when deleting a non-leaf page, though.
  	 */
! 	if (targetlevel == 0 &&
! 		!_bt_parent_deletion_safe(rel, target, stack))
! 		return 0;
  
  	/*
  	 * We have to lock the pages we need to modify in the standard order:
  	 * moving right, then up.  Else we will deadlock against other writers.
  	 *
! 	 * So, we need to find and write-lock the current left sibling of the
  	 * target page.  The sibling that was current a moment ago could have
  	 * split, so we may have to move right.  This search could fail if either
  	 * the sibling or the target page was deleted by someone else meanwhile;
--- 1232,1474 ----
  	 */
  	if (stack == NULL)
  	{
! 		ScanKey itup_scankey;
! 		Buffer lbuf;
! 
! 		/* we need an insertion scan key to do our search, so build one */
! 		itup_scankey = _bt_mkscankey(rel, targetkey);
! 		/* find the leftmost leaf page containing this key */
! 		stack = _bt_search(rel, rel->rd_rel->relnatts, itup_scankey, false,
! 						   &lbuf, BT_READ);
! 		/* don't need a pin on that either */
! 		_bt_relbuf(rel, lbuf);
! 	}
  
! 	/*
! 	 * Re-acquire lock on the leaf page, and check that it can still be
! 	 * deleted.
! 	 */
! 	LockBuffer(leafbuf, BT_WRITE);
! 
! 	if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
! 		P_ISHALFDEAD(opaque) || !P_ISLEAF(opaque) ||
! 		P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page))
! 	{
! 		return false;
  	}
  
  	/*
+ 	 * Save info about page, including a copy of its high key (it must have
+ 	 * one, being non-rightmost).
+ 	 */
+ 	leafrightsib = opaque->btpo_next;
+ 
+ 	/*
  	 * We cannot delete a page that is the rightmost child of its immediate
  	 * parent, unless it is the only child --- in which case the parent has to
  	 * be deleted too, and the same condition applies recursively to it. We
! 	 * have to check this condition all the way up before trying to delete,
! 	 * and lock the final parent of the to-be-deleted branch.
  	 */
! 	rightsib = leafrightsib;
! 	target = leafblkno;
! 	if (!_bt_lock_branch_parent(rel, leafblkno, stack,
! 								&topparent, &topoff, &target, &rightsib))
! 		return false;
! 
! 	/*
! 	 * Check that the parent-page index items we're about to delete/overwrite
! 	 * contain what we expect.	This can fail if the index has become corrupt
! 	 * for some reason.  We want to throw any error before entering the
! 	 * critical section --- otherwise it'd be a PANIC.
! 	 *
! 	 * The test on the target item is just an Assert because
! 	 * _bt_lock_branch_parent should have guaranteed it has the expected
! 	 * contents.  The test on the next-child downlink is known to sometimes
! 	 * fail in the field, though.
! 	 */
! 	page = BufferGetPage(topparent);
! 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
! 
! #ifdef USE_ASSERT_CHECKING
! 	itemid = PageGetItemId(page, topoff);
! 	itup = (IndexTuple) PageGetItem(page, itemid);
! 	Assert(ItemPointerGetBlockNumber(&(itup->t_tid)) == target);
! #endif
! 
! 	nextoffset = OffsetNumberNext(topoff);
! 	itemid = PageGetItemId(page, nextoffset);
! 	itup = (IndexTuple) PageGetItem(page, itemid);
! 	if (ItemPointerGetBlockNumber(&(itup->t_tid)) != rightsib)
! 		elog(ERROR, "right sibling %u of block %u is not next child %u of block %u in index \"%s\"",
! 			 rightsib, target, ItemPointerGetBlockNumber(&(itup->t_tid)),
! 			 BufferGetBlockNumber(topparent), RelationGetRelationName(rel));
! 
! 	/*
! 	 * Any insert which would have gone on the target block will now go to the
! 	 * right sibling block.
! 	 */
! 	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
! 
! 	/* No ereport(ERROR) until changes are logged */
! 	START_CRIT_SECTION();
! 
! 	/*
! 	 * Update parent.  The normal case is a tad tricky because we want to
! 	 * delete the target's downlink and the *following* key.  Easiest way is
! 	 * to copy the right sibling's downlink over the target downlink, and then
! 	 * delete the following item.
! 	 */
! 	page = BufferGetPage(topparent);
! 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
! 
! 	itemid = PageGetItemId(page, topoff);
! 	itup = (IndexTuple) PageGetItem(page, itemid);
! 	ItemPointerSet(&(itup->t_tid), rightsib, P_HIKEY);
! 
! 	nextoffset = OffsetNumberNext(topoff);
! 	PageIndexTupleDelete(page, nextoffset);
! 
! 	/*
! 	 * Mark the leaf page as half-dead, and stamp it with a pointer to the
! 	 * highest internal page in the branch we're deleting. We use the tid of
! 	 * the high key to store it.
! 	 */
! 	page = BufferGetPage(leafbuf);
! 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
! 	opaque->btpo_flags |= BTP_HALF_DEAD;
! 
! 	itemid = PageGetItemId(page, P_HIKEY);
! 	itup = (IndexTuple) PageGetItem(page, itemid);
! 	if (target == leafblkno)
! 		ItemPointerSetInvalid(&(itup->t_tid));
! 	else
! 		ItemPointerSet(&(itup->t_tid), target, P_HIKEY);
! 
! 	/* Must mark buffers dirty before XLogInsert */
! 	MarkBufferDirty(topparent);
! 	MarkBufferDirty(leafbuf);
! 
! 	/* XLOG stuff */
! 	if (RelationNeedsWAL(rel))
! 	{
! 		xl_btree_mark_page_halfdead xlrec;
! 		XLogRecPtr	recptr;
! 		XLogRecData rdata[2];
! 
! 		xlrec.target.node = rel->rd_node;
! 		ItemPointerSet(&(xlrec.target.tid), BufferGetBlockNumber(topparent), topoff);
! 		xlrec.leafblk = leafblkno;
! 		xlrec.downlink = target;
! 
! 		page = BufferGetPage(leafbuf);
! 		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
! 		xlrec.leftblk = opaque->btpo_prev;
! 		xlrec.rightblk = opaque->btpo_next;
! 
! 		rdata[0].data = (char *) &xlrec;
! 		rdata[0].len = SizeOfBtreeMarkPageHalfDead;
! 		rdata[0].buffer = InvalidBuffer;
! 		rdata[0].next = &(rdata[1]);
! 
! 		rdata[1].data = NULL;
! 		rdata[1].len = 0;
! 		rdata[1].buffer = topparent;
! 		rdata[1].buffer_std = true;
! 		rdata[1].next = NULL;
! 
! 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_MARK_PAGE_HALFDEAD, rdata);
! 
! 		page = BufferGetPage(topparent);
! 		PageSetLSN(page, recptr);
! 		page = BufferGetPage(leafbuf);
! 		PageSetLSN(page, recptr);
! 	}
! 
! 	END_CRIT_SECTION();
! 
! 	_bt_relbuf(rel, topparent);
! 	return true;
! }
! 
! /*
!  * Unlink a page in a branch of half-dead pages from its siblings.
!  *
!  * If the leaf page still has a downlink pointing to it, unlinks the highest
!  * parent in the to-be-deleted branch instead, and returns false. To get rid
!  * of the whole branch, including the leaf page itself, iterate until returns
!  * true.
!  */
! static bool
! _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
! {
! 	BlockNumber leafblkno = BufferGetBlockNumber(leafbuf);
! 	BlockNumber	target;
! 	BlockNumber	leftsib;
! 	BlockNumber	rightsib;
! 	Buffer		lbuf = InvalidBuffer;
! 	Buffer		buf;
! 	Buffer		rbuf;
! 	Buffer		metabuf = InvalidBuffer;
! 	Page		metapg;
! 	BTMetaPageData *metad;
! 	ItemId		itemid;
! 	Page		page;
! 	BTPageOpaque opaque;
! 	bool		rightsib_is_rightmost;
! 	int			targetlevel;
! 	ItemPointer leafhikey;
! 	BlockNumber nextchild;
! 	BlockNumber	topblkno;
! 
! 	page = BufferGetPage(leafbuf);
! 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
! 
! 	Assert(P_ISLEAF(opaque) && P_ISHALFDEAD(opaque));
! 
! 	/*
! 	 * If the leaf page still has a parent pointing to it (or a chain of
! 	 * parents), we don't unlink the leaf page yet, but the topmost remaining
! 	 * parent in the branch.
! 	 */
! 	itemid = PageGetItemId(page, P_HIKEY);
! 	leafhikey = &((IndexTuple) PageGetItem(page, itemid))->t_tid;
! 
! 	LockBuffer(leafbuf, BUFFER_LOCK_UNLOCK);
! 
! 	if (ItemPointerIsValid(leafhikey))
! 	{
! 		topblkno = ItemPointerGetBlockNumber(leafhikey);
! 		target = topblkno;
! 
! 		buf = _bt_getbuf(rel, topblkno, BT_READ);
! 		page = BufferGetPage(buf);
! 		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
! 		leftsib = opaque->btpo_prev;
! 		targetlevel = opaque->btpo.level;
! 
! 		/*
! 		 * To avoid deadlocks, we'd better drop the target page lock before
! 		 * going further.
! 		 */
! 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
! 	}
! 	else
! 	{
! 		topblkno = InvalidBlockNumber;
! 		target = leafblkno;
! 		buf = leafbuf;
! 		targetlevel = 0;
! 	}
! 
! 	leftsib = opaque->btpo_prev;
  
  	/*
  	 * We have to lock the pages we need to modify in the standard order:
  	 * moving right, then up.  Else we will deadlock against other writers.
  	 *
! 	 * So, first lock the leaf page, if it's not the target.
! 	 * Then find and write-lock the current left sibling of the
  	 * target page.  The sibling that was current a moment ago could have
  	 * split, so we may have to move right.  This search could fail if either
  	 * the sibling or the target page was deleted by someone else meanwhile;
***************
*** 1203,1208 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
--- 1476,1483 ----
  	 * deletion is only done in VACUUM and there shouldn't be multiple VACUUMs
  	 * concurrently on the same table.)
  	 */
+ 	if (target != leafblkno)
+ 		LockBuffer(leafbuf, BT_WRITE);
  	if (leftsib != P_NONE)
  	{
  		lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
***************
*** 1217,1223 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
  			{
  				elog(LOG, "no left sibling (concurrent deletion?) in \"%s\"",
  					 RelationGetRelationName(rel));
! 				return 0;
  			}
  			lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
  			page = BufferGetPage(lbuf);
--- 1492,1498 ----
  			{
  				elog(LOG, "no left sibling (concurrent deletion?) in \"%s\"",
  					 RelationGetRelationName(rel));
! 				return false;
  			}
  			lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
  			page = BufferGetPage(lbuf);
***************
*** 1232,1258 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
  	 * a write lock not a superexclusive lock, since no scans would stop on an
  	 * empty page.
  	 */
! 	buf = _bt_getbuf(rel, target, BT_WRITE);
  	page = BufferGetPage(buf);
  	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
  
  	/*
! 	 * Check page is still empty etc, else abandon deletion.  The empty check
! 	 * is necessary since someone else might have inserted into it while we
! 	 * didn't have it locked; the others are just for paranoia's sake.
  	 */
! 	if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
! 		P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page))
  	{
! 		_bt_relbuf(rel, buf);
! 		if (BufferIsValid(lbuf))
! 			_bt_relbuf(rel, lbuf);
! 		return 0;
  	}
  	if (opaque->btpo_prev != leftsib)
  		elog(ERROR, "left link changed unexpectedly in block %u of index \"%s\"",
  			 target, RelationGetRelationName(rel));
  
  	/*
  	 * And next write-lock the (current) right sibling.
  	 */
--- 1507,1550 ----
  	 * a write lock not a superexclusive lock, since no scans would stop on an
  	 * empty page.
  	 */
! 	LockBuffer(buf, BT_WRITE);
  	page = BufferGetPage(buf);
  	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
  
  	/*
! 	 * Check page is still empty etc, else abandon deletion.  This is just
! 	 * for paranoia's sake; a half-dead page cannot resurrect because there
! 	 * can be only one vacuum process running at a time.
  	 */
! 	if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque))
  	{
! 		elog(ERROR, "half-dead page changed status unexpectedly in block %u of index \"%s\"",
! 			 target, RelationGetRelationName(rel));
  	}
  	if (opaque->btpo_prev != leftsib)
  		elog(ERROR, "left link changed unexpectedly in block %u of index \"%s\"",
  			 target, RelationGetRelationName(rel));
  
+ 	if (target == leafblkno)
+ 	{
+ 		if (P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page) ||
+ 			!P_ISLEAF(opaque) || !P_ISHALFDEAD(opaque))
+ 			elog(ERROR, "half-dead page changed status unexpectedly in block %u of index \"%s\"",
+ 				 target, RelationGetRelationName(rel));
+ 		nextchild = InvalidBlockNumber;
+ 	}
+ 	else
+ 	{
+ 		if (P_FIRSTDATAKEY(opaque) != PageGetMaxOffsetNumber(page) ||
+ 			P_ISLEAF(opaque))
+ 			elog(ERROR, "half-dead page changed status unexpectedly in block %u of index \"%s\"",
+ 				 target, RelationGetRelationName(rel));
+ 
+ 		/* remember the next child down in the branch. */
+ 		itemid = PageGetItemId(page, P_FIRSTDATAKEY(opaque));
+ 		nextchild = ItemPointerGetBlockNumber(&((IndexTuple) PageGetItem(page, itemid))->t_tid);
+ 	}
+ 
  	/*
  	 * And next write-lock the (current) right sibling.
  	 */
***************
*** 1265,1314 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
  			 "block %u links to %u instead of expected %u in index \"%s\"",
  			 rightsib, opaque->btpo_prev, target,
  			 RelationGetRelationName(rel));
! 
! 	/*
! 	 * Any insert which would have gone on the target block will now go to the
! 	 * right sibling block.
! 	 */
! 	PredicateLockPageCombine(rel, target, rightsib);
! 
! 	/*
! 	 * Next find and write-lock the current parent of the target page. This is
! 	 * essentially the same as the corresponding step of splitting.
! 	 */
! 	ItemPointerSet(&(stack->bts_btentry.t_tid), target, P_HIKEY);
! 	pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
! 	if (pbuf == InvalidBuffer)
! 		elog(ERROR, "failed to re-find parent key in index \"%s\" for deletion target page %u",
! 			 RelationGetRelationName(rel), target);
! 	parent = stack->bts_blkno;
! 	poffset = stack->bts_offset;
! 
! 	/*
! 	 * If the target is the rightmost child of its parent, then we can't
! 	 * delete, unless it's also the only child --- in which case the parent
! 	 * changes to half-dead status.  The "can't delete" case should have been
! 	 * detected by _bt_parent_deletion_safe, so complain if we see it now.
! 	 */
! 	page = BufferGetPage(pbuf);
! 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
! 	maxoff = PageGetMaxOffsetNumber(page);
! 	parent_half_dead = false;
! 	parent_one_child = false;
! 	if (poffset >= maxoff)
! 	{
! 		if (poffset == P_FIRSTDATAKEY(opaque))
! 			parent_half_dead = true;
! 		else
! 			elog(ERROR, "failed to delete rightmost child %u of block %u in index \"%s\"",
! 				 target, parent, RelationGetRelationName(rel));
! 	}
! 	else
! 	{
! 		/* Will there be exactly one child left in this parent? */
! 		if (OffsetNumberNext(P_FIRSTDATAKEY(opaque)) == maxoff)
! 			parent_one_child = true;
! 	}
  
  	/*
  	 * If we are deleting the next-to-last page on the target's level, then
--- 1557,1564 ----
  			 "block %u links to %u instead of expected %u in index \"%s\"",
  			 rightsib, opaque->btpo_prev, target,
  			 RelationGetRelationName(rel));
! 	rightsib_is_rightmost = P_RIGHTMOST(opaque);
! 	*rightsib_empty = (P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page));
  
  	/*
  	 * If we are deleting the next-to-last page on the target's level, then
***************
*** 1322,1332 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
  	 * We can safely acquire a lock on the metapage here --- see comments for
  	 * _bt_newroot().
  	 */
! 	if (leftsib == P_NONE && !parent_half_dead)
  	{
  		page = BufferGetPage(rbuf);
  		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- 		Assert(opaque->btpo.level == targetlevel);
  		if (P_RIGHTMOST(opaque))
  		{
  			/* rightsib will be the only one left on the level */
--- 1572,1581 ----
  	 * We can safely acquire a lock on the metapage here --- see comments for
  	 * _bt_newroot().
  	 */
! 	if (leftsib == P_NONE && rightsib_is_rightmost)
  	{
  		page = BufferGetPage(rbuf);
  		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
  		if (P_RIGHTMOST(opaque))
  		{
  			/* rightsib will be the only one left on the level */
***************
*** 1349,1386 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
  	}
  
  	/*
- 	 * Check that the parent-page index items we're about to delete/overwrite
- 	 * contain what we expect.	This can fail if the index has become corrupt
- 	 * for some reason.  We want to throw any error before entering the
- 	 * critical section --- otherwise it'd be a PANIC.
- 	 *
- 	 * The test on the target item is just an Assert because _bt_getstackbuf
- 	 * should have guaranteed it has the expected contents.  The test on the
- 	 * next-child downlink is known to sometimes fail in the field, though.
- 	 */
- 	page = BufferGetPage(pbuf);
- 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
- 
- #ifdef USE_ASSERT_CHECKING
- 	itemid = PageGetItemId(page, poffset);
- 	itup = (IndexTuple) PageGetItem(page, itemid);
- 	Assert(ItemPointerGetBlockNumber(&(itup->t_tid)) == target);
- #endif
- 
- 	if (!parent_half_dead)
- 	{
- 		OffsetNumber nextoffset;
- 
- 		nextoffset = OffsetNumberNext(poffset);
- 		itemid = PageGetItemId(page, nextoffset);
- 		itup = (IndexTuple) PageGetItem(page, itemid);
- 		if (ItemPointerGetBlockNumber(&(itup->t_tid)) != rightsib)
- 			elog(ERROR, "right sibling %u of block %u is not next child %u of block %u in index \"%s\"",
- 				 rightsib, target, ItemPointerGetBlockNumber(&(itup->t_tid)),
- 				 parent, RelationGetRelationName(rel));
- 	}
- 
- 	/*
  	 * Here we begin doing the deletion.
  	 */
  
--- 1598,1603 ----
***************
*** 1388,1416 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
  	START_CRIT_SECTION();
  
  	/*
- 	 * Update parent.  The normal case is a tad tricky because we want to
- 	 * delete the target's downlink and the *following* key.  Easiest way is
- 	 * to copy the right sibling's downlink over the target downlink, and then
- 	 * delete the following item.
- 	 */
- 	if (parent_half_dead)
- 	{
- 		PageIndexTupleDelete(page, poffset);
- 		opaque->btpo_flags |= BTP_HALF_DEAD;
- 	}
- 	else
- 	{
- 		OffsetNumber nextoffset;
- 
- 		itemid = PageGetItemId(page, poffset);
- 		itup = (IndexTuple) PageGetItem(page, itemid);
- 		ItemPointerSet(&(itup->t_tid), rightsib, P_HIKEY);
- 
- 		nextoffset = OffsetNumberNext(poffset);
- 		PageIndexTupleDelete(page, nextoffset);
- 	}
- 
- 	/*
  	 * Update siblings' side-links.  Note the target page's side-links will
  	 * continue to point to the siblings.  Asserts here are just rechecking
  	 * things we already verified above.
--- 1605,1610 ----
***************
*** 1426,1432 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
  	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
  	Assert(opaque->btpo_prev == target);
  	opaque->btpo_prev = leftsib;
! 	rightsib_empty = (P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page));
  
  	/*
  	 * Mark the page itself deleted.  It can be recycled when all current
--- 1620,1640 ----
  	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
  	Assert(opaque->btpo_prev == target);
  	opaque->btpo_prev = leftsib;
! 
! 	/*
! 	 * If we deleted a parent of the targeted leaf page, instead of the leaf
! 	 * itself, update the leaf to point to the next remaining child in the
! 	 * branch.
! 	 */
! 	if (target != leafblkno)
! 	{
! 		if (nextchild == leafblkno)
! 			ItemPointerSetInvalid(leafhikey);
! 		else
! 			ItemPointerSet(leafhikey, nextchild, P_HIKEY);
! 
! 		MarkBufferDirty(leafbuf);
! 	}
  
  	/*
  	 * Mark the page itself deleted.  It can be recycled when all current
***************
*** 1453,1459 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
  	}
  
  	/* Must mark buffers dirty before XLogInsert */
- 	MarkBufferDirty(pbuf);
  	MarkBufferDirty(rbuf);
  	MarkBufferDirty(buf);
  	if (BufferIsValid(lbuf))
--- 1661,1666 ----
***************
*** 1462,1483 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
  	/* XLOG stuff */
  	if (RelationNeedsWAL(rel))
  	{
! 		xl_btree_delete_page xlrec;
  		xl_btree_metadata xlmeta;
  		uint8		xlinfo;
  		XLogRecPtr	recptr;
! 		XLogRecData rdata[5];
  		XLogRecData *nextrdata;
  
! 		xlrec.target.node = rel->rd_node;
! 		ItemPointerSet(&(xlrec.target.tid), parent, poffset);
  		xlrec.deadblk = target;
  		xlrec.leftblk = leftsib;
  		xlrec.rightblk = rightsib;
  		xlrec.btpo_xact = opaque->btpo.xact;
  
  		rdata[0].data = (char *) &xlrec;
! 		rdata[0].len = SizeOfBtreeDeletePage;
  		rdata[0].buffer = InvalidBuffer;
  		rdata[0].next = nextrdata = &(rdata[1]);
  
--- 1669,1691 ----
  	/* XLOG stuff */
  	if (RelationNeedsWAL(rel))
  	{
! 		xl_btree_unlink_page xlrec;
  		xl_btree_metadata xlmeta;
  		uint8		xlinfo;
  		XLogRecPtr	recptr;
! 		XLogRecData rdata[4];
  		XLogRecData *nextrdata;
  
! 		xlrec.node = rel->rd_node;
  		xlrec.deadblk = target;
  		xlrec.leftblk = leftsib;
  		xlrec.rightblk = rightsib;
+ 		xlrec.leafblk = leafblkno;
+ 		xlrec.downlink = nextchild;
  		xlrec.btpo_xact = opaque->btpo.xact;
  
  		rdata[0].data = (char *) &xlrec;
! 		rdata[0].len = SizeOfBtreeUnlinkPage;
  		rdata[0].buffer = InvalidBuffer;
  		rdata[0].next = nextrdata = &(rdata[1]);
  
***************
*** 1493,1511 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
  			nextrdata->buffer = InvalidBuffer;
  			nextrdata->next = nextrdata + 1;
  			nextrdata++;
! 			xlinfo = XLOG_BTREE_DELETE_PAGE_META;
  		}
- 		else if (parent_half_dead)
- 			xlinfo = XLOG_BTREE_DELETE_PAGE_HALF;
  		else
! 			xlinfo = XLOG_BTREE_DELETE_PAGE;
! 
! 		nextrdata->data = NULL;
! 		nextrdata->len = 0;
! 		nextrdata->next = nextrdata + 1;
! 		nextrdata->buffer = pbuf;
! 		nextrdata->buffer_std = true;
! 		nextrdata++;
  
  		nextrdata->data = NULL;
  		nextrdata->len = 0;
--- 1701,1710 ----
  			nextrdata->buffer = InvalidBuffer;
  			nextrdata->next = nextrdata + 1;
  			nextrdata++;
! 			xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
  		}
  		else
! 			xlinfo = XLOG_BTREE_UNLINK_PAGE;
  
  		nextrdata->data = NULL;
  		nextrdata->len = 0;
***************
*** 1530,1537 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
  		{
  			PageSetLSN(metapg, recptr);
  		}
- 		page = BufferGetPage(pbuf);
- 		PageSetLSN(page, recptr);
  		page = BufferGetPage(rbuf);
  		PageSetLSN(page, recptr);
  		page = BufferGetPage(buf);
--- 1729,1734 ----
***************
*** 1551,1596 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
  		CacheInvalidateRelcache(rel);
  		_bt_relbuf(rel, metabuf);
  	}
! 	/* can always release leftsib immediately */
  	if (BufferIsValid(lbuf))
  		_bt_relbuf(rel, lbuf);
  
! 	/*
! 	 * If parent became half dead, recurse to delete it. Otherwise, if right
! 	 * sibling is empty and is now the last child of the parent, recurse to
! 	 * try to delete it.  (These cases cannot apply at the same time, though
! 	 * the second case might itself recurse to the first.)
! 	 *
! 	 * When recursing to parent, we hold the lock on the target page until
! 	 * done.  This delays any insertions into the keyspace that was just
! 	 * effectively reassigned to the parent's right sibling.  If we allowed
! 	 * that, and there were enough such insertions before we finish deleting
! 	 * the parent, page splits within that keyspace could lead to inserting
! 	 * out-of-order keys into the grandparent level.  It is thought that that
! 	 * wouldn't have any serious consequences, but it still seems like a
! 	 * pretty bad idea.
! 	 */
! 	if (parent_half_dead)
! 	{
! 		/* recursive call will release pbuf */
! 		_bt_relbuf(rel, rbuf);
! 		result = _bt_pagedel(rel, pbuf, stack->bts_parent) + 1;
  		_bt_relbuf(rel, buf);
- 	}
- 	else if (parent_one_child && rightsib_empty)
- 	{
- 		_bt_relbuf(rel, pbuf);
- 		_bt_relbuf(rel, buf);
- 		/* recursive call will release rbuf */
- 		result = _bt_pagedel(rel, rbuf, stack) + 1;
- 	}
- 	else
- 	{
- 		_bt_relbuf(rel, pbuf);
- 		_bt_relbuf(rel, buf);
- 		_bt_relbuf(rel, rbuf);
- 		result = 1;
- 	}
  
! 	return result;
  }
--- 1748,1760 ----
  		CacheInvalidateRelcache(rel);
  		_bt_relbuf(rel, metabuf);
  	}
! 	/* release siblings */
  	if (BufferIsValid(lbuf))
  		_bt_relbuf(rel, lbuf);
+ 	_bt_relbuf(rel, rbuf);
  
! 	if (target != leafblkno)
  		_bt_relbuf(rel, buf);
  
! 	return true;
  }
*** a/src/backend/access/nbtree/nbtxlog.c
--- b/src/backend/access/nbtree/nbtxlog.c
***************
*** 25,65 ****
   * them manually if they are not seen in the WAL log during replay.  This
   * makes it safe for page insertion to be a multiple-WAL-action process.
   *
-  * Similarly, deletion of an only child page and deletion of its parent page
-  * form multiple WAL log entries, and we have to be prepared to follow through
-  * with the deletion if the log ends between.
-  *
   * The data structure is a simple linked list --- this should be good enough,
!  * since we don't expect a page split or multi deletion to remain incomplete
!  * for long.  In any case we need to respect the order of operations.
   */
! typedef struct bt_incomplete_action
  {
  	RelFileNode node;			/* the index */
- 	bool		is_split;		/* T = pending split, F = pending delete */
- 	/* these fields are for a split: */
  	bool		is_root;		/* we split the root */
  	BlockNumber leftblk;		/* left half of split */
  	BlockNumber rightblk;		/* right half of split */
! 	/* these fields are for a delete: */
! 	BlockNumber delblk;			/* parent block to be deleted */
! } bt_incomplete_action;
  
! static List *incomplete_actions;
  
  
  static void
  log_incomplete_split(RelFileNode node, BlockNumber leftblk,
  					 BlockNumber rightblk, bool is_root)
  {
! 	bt_incomplete_action *action = palloc(sizeof(bt_incomplete_action));
  
  	action->node = node;
- 	action->is_split = true;
  	action->is_root = is_root;
  	action->leftblk = leftblk;
  	action->rightblk = rightblk;
! 	incomplete_actions = lappend(incomplete_actions, action);
  }
  
  static void
--- 25,56 ----
   * them manually if they are not seen in the WAL log during replay.  This
   * makes it safe for page insertion to be a multiple-WAL-action process.
   *
   * The data structure is a simple linked list --- this should be good enough,
!  * since we don't expect a page split to remain incomplete for long.  In any
!  * case we need to respect the order of operations.
   */
! typedef struct bt_incomplete_split
  {
  	RelFileNode node;			/* the index */
  	bool		is_root;		/* we split the root */
  	BlockNumber leftblk;		/* left half of split */
  	BlockNumber rightblk;		/* right half of split */
! } bt_incomplete_split;
  
! static List *incomplete_splits;
  
  
  static void
  log_incomplete_split(RelFileNode node, BlockNumber leftblk,
  					 BlockNumber rightblk, bool is_root)
  {
! 	bt_incomplete_split *action = palloc(sizeof(bt_incomplete_split));
  
  	action->node = node;
  	action->is_root = is_root;
  	action->leftblk = leftblk;
  	action->rightblk = rightblk;
! 	incomplete_splits = lappend(incomplete_splits, action);
  }
  
  static void
***************
*** 67,115 **** forget_matching_split(RelFileNode node, BlockNumber downlink, bool is_root)
  {
  	ListCell   *l;
  
! 	foreach(l, incomplete_actions)
  	{
! 		bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
  
  		if (RelFileNodeEquals(node, action->node) &&
- 			action->is_split &&
  			downlink == action->rightblk)
  		{
  			if (is_root != action->is_root)
  				elog(LOG, "forget_matching_split: fishy is_root data (expected %d, got %d)",
  					 action->is_root, is_root);
! 			incomplete_actions = list_delete_ptr(incomplete_actions, action);
! 			pfree(action);
! 			break;				/* need not look further */
! 		}
! 	}
! }
! 
! static void
! log_incomplete_deletion(RelFileNode node, BlockNumber delblk)
! {
! 	bt_incomplete_action *action = palloc(sizeof(bt_incomplete_action));
! 
! 	action->node = node;
! 	action->is_split = false;
! 	action->delblk = delblk;
! 	incomplete_actions = lappend(incomplete_actions, action);
! }
! 
! static void
! forget_matching_deletion(RelFileNode node, BlockNumber delblk)
! {
! 	ListCell   *l;
! 
! 	foreach(l, incomplete_actions)
! 	{
! 		bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
! 
! 		if (RelFileNodeEquals(node, action->node) &&
! 			!action->is_split &&
! 			delblk == action->delblk)
! 		{
! 			incomplete_actions = list_delete_ptr(incomplete_actions, action);
  			pfree(action);
  			break;				/* need not look further */
  		}
--- 58,74 ----
  {
  	ListCell   *l;
  
! 	foreach(l, incomplete_splits)
  	{
! 		bt_incomplete_split *action = (bt_incomplete_split *) lfirst(l);
  
  		if (RelFileNodeEquals(node, action->node) &&
  			downlink == action->rightblk)
  		{
  			if (is_root != action->is_root)
  				elog(LOG, "forget_matching_split: fishy is_root data (expected %d, got %d)",
  					 action->is_root, is_root);
! 			incomplete_splits = list_delete_ptr(incomplete_splits, action);
  			pfree(action);
  			break;				/* need not look further */
  		}
***************
*** 779,799 **** btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
  }
  
  static void
! btree_xlog_delete_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
  {
! 	xl_btree_delete_page *xlrec = (xl_btree_delete_page *) XLogRecGetData(record);
  	BlockNumber parent;
- 	BlockNumber target;
- 	BlockNumber leftsib;
- 	BlockNumber rightsib;
  	Buffer		buffer;
  	Page		page;
  	BTPageOpaque pageop;
  
  	parent = ItemPointerGetBlockNumber(&(xlrec->target.tid));
- 	target = xlrec->deadblk;
- 	leftsib = xlrec->leftblk;
- 	rightsib = xlrec->rightblk;
  
  	/*
  	 * In normal operation, we would lock all the pages this WAL record
--- 738,753 ----
  }
  
  static void
! btree_xlog_mark_page_halfdead(uint8 info, XLogRecPtr lsn, XLogRecord *record)
  {
! 	xl_btree_mark_page_halfdead *xlrec = (xl_btree_mark_page_halfdead *) XLogRecGetData(record);
  	BlockNumber parent;
  	Buffer		buffer;
  	Page		page;
  	BTPageOpaque pageop;
+ 	IndexTupleData trunctuple;
  
  	parent = ItemPointerGetBlockNumber(&(xlrec->target.tid));
  
  	/*
  	 * In normal operation, we would lock all the pages this WAL record
***************
*** 813,861 **** btree_xlog_delete_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
  		{
  			page = (Page) BufferGetPage(buffer);
  			pageop = (BTPageOpaque) PageGetSpecialPointer(page);
! 			if (lsn <= PageGetLSN(page))
! 			{
! 				UnlockReleaseBuffer(buffer);
! 			}
! 			else
  			{
  				OffsetNumber poffset;
  
  				poffset = ItemPointerGetOffsetNumber(&(xlrec->target.tid));
! 				if (poffset >= PageGetMaxOffsetNumber(page))
! 				{
! 					Assert(info == XLOG_BTREE_DELETE_PAGE_HALF);
! 					Assert(poffset == P_FIRSTDATAKEY(pageop));
! 					PageIndexTupleDelete(page, poffset);
! 					pageop->btpo_flags |= BTP_HALF_DEAD;
! 				}
! 				else
! 				{
! 					ItemId		itemid;
! 					IndexTuple	itup;
! 					OffsetNumber nextoffset;
! 
! 					Assert(info != XLOG_BTREE_DELETE_PAGE_HALF);
! 					itemid = PageGetItemId(page, poffset);
! 					itup = (IndexTuple) PageGetItem(page, itemid);
! 					ItemPointerSet(&(itup->t_tid), rightsib, P_HIKEY);
! 					nextoffset = OffsetNumberNext(poffset);
! 					PageIndexTupleDelete(page, nextoffset);
! 				}
  
  				PageSetLSN(page, lsn);
  				MarkBufferDirty(buffer);
- 				UnlockReleaseBuffer(buffer);
  			}
  		}
  	}
  
  	/* Fix left-link of right sibling */
! 	if (record->xl_info & XLR_BKP_BLOCK(1))
! 		(void) RestoreBackupBlock(lsn, record, 1, false, false);
  	else
  	{
! 		buffer = XLogReadBuffer(xlrec->target.node, rightsib, false);
  		if (BufferIsValid(buffer))
  		{
  			page = (Page) BufferGetPage(buffer);
--- 767,856 ----
  		{
  			page = (Page) BufferGetPage(buffer);
  			pageop = (BTPageOpaque) PageGetSpecialPointer(page);
! 			if (lsn > PageGetLSN(page))
  			{
  				OffsetNumber poffset;
+ 				ItemId		itemid;
+ 				IndexTuple	itup;
+ 				OffsetNumber nextoffset;
+ 				BlockNumber	rightsib;
  
  				poffset = ItemPointerGetOffsetNumber(&(xlrec->target.tid));
! 
! 				nextoffset = OffsetNumberNext(poffset);
! 				itemid = PageGetItemId(page, nextoffset);
! 				itup = (IndexTuple) PageGetItem(page, itemid);
! 				rightsib = ItemPointerGetBlockNumber(&itup->t_tid);
! 
! 				itemid = PageGetItemId(page, poffset);
! 				itup = (IndexTuple) PageGetItem(page, itemid);
! 				ItemPointerSet(&(itup->t_tid), rightsib, P_HIKEY);
! 				nextoffset = OffsetNumberNext(poffset);
! 				PageIndexTupleDelete(page, nextoffset);
  
  				PageSetLSN(page, lsn);
  				MarkBufferDirty(buffer);
  			}
+ 			UnlockReleaseBuffer(buffer);
  		}
  	}
  
+ 	/* Rewrite the leaf page as a halfdead page */
+ 	buffer = XLogReadBuffer(xlrec->target.node, xlrec->leafblk, true);
+ 	Assert(BufferIsValid(buffer));
+ 	page = (Page) BufferGetPage(buffer);
+ 
+ 	_bt_pageinit(page, BufferGetPageSize(buffer));
+ 	pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+ 
+ 	pageop->btpo_prev = xlrec->leftblk;
+ 	pageop->btpo_next = xlrec->rightblk;
+ 	pageop->btpo.level = 0;
+ 	pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF;
+ 	pageop->btpo_cycleid = 0;
+ 
+ 	/* a dummy hikey that points to the next parent to be deleted (if any) */
+ 	trunctuple.t_info = sizeof(IndexTupleData);
+ 	ItemPointerSet(&trunctuple.t_tid, xlrec->downlink, P_HIKEY);
+ 	if (PageAddItem(page, (Item) &trunctuple, sizeof(IndexTupleData), P_HIKEY,
+ 					false, false) == InvalidOffsetNumber)
+ 		elog(ERROR, "could not add dummy high key to half-dead page");
+ 
+ 	PageSetLSN(page, lsn);
+ 	MarkBufferDirty(buffer);
+ 	UnlockReleaseBuffer(buffer);
+ }
+ 
+ 
+ static void
+ btree_xlog_unlink_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
+ {
+ 	xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) XLogRecGetData(record);
+ 	BlockNumber target;
+ 	BlockNumber leftsib;
+ 	BlockNumber rightsib;
+ 	Buffer		buffer;
+ 	Page		page;
+ 	BTPageOpaque pageop;
+ 
+ 	target = xlrec->deadblk;
+ 	leftsib = xlrec->leftblk;
+ 	rightsib = xlrec->rightblk;
+ 
+ 	/*
+ 	 * In normal operation, we would lock all the pages this WAL record
+ 	 * touches before changing any of them.  In WAL replay, it should be okay
+ 	 * to lock just one page at a time, since no concurrent index updates can
+ 	 * be happening, and readers should not care whether they arrive at the
+ 	 * target page or not (since it's surely empty).
+ 	 */
+ 
  	/* Fix left-link of right sibling */
! 	if (record->xl_info & XLR_BKP_BLOCK(0))
! 		(void) RestoreBackupBlock(lsn, record, 0, false, false);
  	else
  	{
! 		buffer = XLogReadBuffer(xlrec->node, rightsib, false);
  		if (BufferIsValid(buffer))
  		{
  			page = (Page) BufferGetPage(buffer);
***************
*** 876,888 **** btree_xlog_delete_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
  	}
  
  	/* Fix right-link of left sibling, if any */
! 	if (record->xl_info & XLR_BKP_BLOCK(2))
! 		(void) RestoreBackupBlock(lsn, record, 2, false, false);
  	else
  	{
  		if (leftsib != P_NONE)
  		{
! 			buffer = XLogReadBuffer(xlrec->target.node, leftsib, false);
  			if (BufferIsValid(buffer))
  			{
  				page = (Page) BufferGetPage(buffer);
--- 871,883 ----
  	}
  
  	/* Fix right-link of left sibling, if any */
! 	if (record->xl_info & XLR_BKP_BLOCK(1))
! 		(void) RestoreBackupBlock(lsn, record, 1, false, false);
  	else
  	{
  		if (leftsib != P_NONE)
  		{
! 			buffer = XLogReadBuffer(xlrec->node, leftsib, false);
  			if (BufferIsValid(buffer))
  			{
  				page = (Page) BufferGetPage(buffer);
***************
*** 904,910 **** btree_xlog_delete_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
  	}
  
  	/* Rewrite target page as empty deleted page */
! 	buffer = XLogReadBuffer(xlrec->target.node, target, true);
  	Assert(BufferIsValid(buffer));
  	page = (Page) BufferGetPage(buffer);
  
--- 899,905 ----
  	}
  
  	/* Rewrite target page as empty deleted page */
! 	buffer = XLogReadBuffer(xlrec->node, target, true);
  	Assert(BufferIsValid(buffer));
  	page = (Page) BufferGetPage(buffer);
  
***************
*** 921,944 **** btree_xlog_delete_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
  	MarkBufferDirty(buffer);
  	UnlockReleaseBuffer(buffer);
  
  	/* Update metapage if needed */
! 	if (info == XLOG_BTREE_DELETE_PAGE_META)
  	{
  		xl_btree_metadata md;
  
! 		memcpy(&md, (char *) xlrec + SizeOfBtreeDeletePage,
  			   sizeof(xl_btree_metadata));
! 		_bt_restore_meta(xlrec->target.node, lsn,
  						 md.root, md.level,
  						 md.fastroot, md.fastlevel);
  	}
- 
- 	/* Forget any completed deletion */
- 	forget_matching_deletion(xlrec->target.node, target);
- 
- 	/* If parent became half-dead, remember it for deletion */
- 	if (info == XLOG_BTREE_DELETE_PAGE_HALF)
- 		log_incomplete_deletion(xlrec->target.node, parent);
  }
  
  static void
--- 916,934 ----
  	MarkBufferDirty(buffer);
  	UnlockReleaseBuffer(buffer);
  
+ 	/* Update leaf block */
+ 
  	/* Update metapage if needed */
! 	if (info == XLOG_BTREE_UNLINK_PAGE_META)
  	{
  		xl_btree_metadata md;
  
! 		memcpy(&md, (char *) xlrec + SizeOfBtreeUnlinkPage,
  			   sizeof(xl_btree_metadata));
! 		_bt_restore_meta(xlrec->node, lsn,
  						 md.root, md.level,
  						 md.fastroot, md.fastlevel);
  	}
  }
  
  static void
***************
*** 1053,1062 **** btree_redo(XLogRecPtr lsn, XLogRecord *record)
  		case XLOG_BTREE_DELETE:
  			btree_xlog_delete(lsn, record);
  			break;
! 		case XLOG_BTREE_DELETE_PAGE:
! 		case XLOG_BTREE_DELETE_PAGE_META:
! 		case XLOG_BTREE_DELETE_PAGE_HALF:
! 			btree_xlog_delete_page(info, lsn, record);
  			break;
  		case XLOG_BTREE_NEWROOT:
  			btree_xlog_newroot(lsn, record);
--- 1043,1054 ----
  		case XLOG_BTREE_DELETE:
  			btree_xlog_delete(lsn, record);
  			break;
! 		case XLOG_BTREE_MARK_PAGE_HALFDEAD:
! 			btree_xlog_mark_page_halfdead(info, lsn, record);
! 			break;
! 		case XLOG_BTREE_UNLINK_PAGE:
! 		case XLOG_BTREE_UNLINK_PAGE_META:
! 			btree_xlog_unlink_page(info, lsn, record);
  			break;
  		case XLOG_BTREE_NEWROOT:
  			btree_xlog_newroot(lsn, record);
***************
*** 1072,1078 **** btree_redo(XLogRecPtr lsn, XLogRecord *record)
  void
  btree_xlog_startup(void)
  {
! 	incomplete_actions = NIL;
  }
  
  void
--- 1064,1070 ----
  void
  btree_xlog_startup(void)
  {
! 	incomplete_splits = NIL;
  }
  
  void
***************
*** 1080,1146 **** btree_xlog_cleanup(void)
  {
  	ListCell   *l;
  
! 	foreach(l, incomplete_actions)
  	{
! 		bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
! 
! 		if (action->is_split)
! 		{
! 			/* finish an incomplete split */
! 			Buffer		lbuf,
! 						rbuf;
! 			Page		lpage,
! 						rpage;
! 			BTPageOpaque lpageop,
! 						rpageop;
! 			bool		is_only;
! 			Relation	reln;
! 
! 			lbuf = XLogReadBuffer(action->node, action->leftblk, false);
! 			/* failure is impossible because we wrote this page earlier */
! 			if (!BufferIsValid(lbuf))
! 				elog(PANIC, "btree_xlog_cleanup: left block unfound");
! 			lpage = (Page) BufferGetPage(lbuf);
! 			lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
! 			rbuf = XLogReadBuffer(action->node, action->rightblk, false);
! 			/* failure is impossible because we wrote this page earlier */
! 			if (!BufferIsValid(rbuf))
! 				elog(PANIC, "btree_xlog_cleanup: right block unfound");
! 			rpage = (Page) BufferGetPage(rbuf);
! 			rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
! 
! 			/* if the pages are all of their level, it's a only-page split */
! 			is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop);
! 
! 			reln = CreateFakeRelcacheEntry(action->node);
! 			_bt_insert_parent(reln, lbuf, rbuf, NULL,
! 							  action->is_root, is_only);
! 			FreeFakeRelcacheEntry(reln);
! 		}
! 		else
! 		{
! 			/* finish an incomplete deletion (of a half-dead page) */
! 			Buffer		buf;
! 
! 			buf = XLogReadBuffer(action->node, action->delblk, false);
! 			if (BufferIsValid(buf))
! 			{
! 				Relation	reln;
! 
! 				reln = CreateFakeRelcacheEntry(action->node);
! 				if (_bt_pagedel(reln, buf, NULL) == 0)
! 					elog(PANIC, "btree_xlog_cleanup: _bt_pagedel failed");
! 				FreeFakeRelcacheEntry(reln);
! 			}
! 		}
  	}
! 	incomplete_actions = NIL;
  }
  
  bool
  btree_safe_restartpoint(void)
  {
! 	if (incomplete_actions)
  		return false;
  	return true;
  }
--- 1072,1118 ----
  {
  	ListCell   *l;
  
! 	foreach(l, incomplete_splits)
  	{
! 		/* finish an incomplete split */
! 		bt_incomplete_split *action = (bt_incomplete_split *) lfirst(l);
! 		Buffer		lbuf,
! 					rbuf;
! 		Page		lpage,
! 					rpage;
! 		BTPageOpaque lpageop,
! 					rpageop;
! 		bool		is_only;
! 		Relation	reln;
! 
! 		lbuf = XLogReadBuffer(action->node, action->leftblk, false);
! 		/* failure is impossible because we wrote this page earlier */
! 		if (!BufferIsValid(lbuf))
! 			elog(PANIC, "btree_xlog_cleanup: left block unfound");
! 		lpage = (Page) BufferGetPage(lbuf);
! 		lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
! 		rbuf = XLogReadBuffer(action->node, action->rightblk, false);
! 		/* failure is impossible because we wrote this page earlier */
! 		if (!BufferIsValid(rbuf))
! 			elog(PANIC, "btree_xlog_cleanup: right block unfound");
! 		rpage = (Page) BufferGetPage(rbuf);
! 		rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
! 
! 		/* if the pages are all of their level, it's a only-page split */
! 		is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop);
! 
! 		reln = CreateFakeRelcacheEntry(action->node);
! 		_bt_insert_parent(reln, lbuf, rbuf, NULL,
! 						  action->is_root, is_only);
! 		FreeFakeRelcacheEntry(reln);
  	}
! 	incomplete_splits = NIL;
  }
  
  bool
  btree_safe_restartpoint(void)
  {
! 	if (incomplete_splits)
  		return false;
  	return true;
  }
*** a/src/backend/access/rmgrdesc/nbtdesc.c
--- b/src/backend/access/rmgrdesc/nbtdesc.c
***************
*** 124,139 **** btree_desc(StringInfo buf, uint8 xl_info, char *rec)
  								 xlrec->hnode.spcNode, xlrec->hnode.dbNode, xlrec->hnode.relNode);
  				break;
  			}
! 		case XLOG_BTREE_DELETE_PAGE:
! 		case XLOG_BTREE_DELETE_PAGE_META:
! 		case XLOG_BTREE_DELETE_PAGE_HALF:
  			{
! 				xl_btree_delete_page *xlrec = (xl_btree_delete_page *) rec;
  
! 				appendStringInfoString(buf, "delete_page: ");
  				out_target(buf, &(xlrec->target));
! 				appendStringInfo(buf, "; dead %u; left %u; right %u",
! 							xlrec->deadblk, xlrec->leftblk, xlrec->rightblk);
  				break;
  			}
  		case XLOG_BTREE_NEWROOT:
--- 124,149 ----
  								 xlrec->hnode.spcNode, xlrec->hnode.dbNode, xlrec->hnode.relNode);
  				break;
  			}
! 		case XLOG_BTREE_MARK_PAGE_HALFDEAD:
  			{
! 				xl_btree_mark_page_halfdead *xlrec = (xl_btree_mark_page_halfdead *) rec;
  
! 				appendStringInfoString(buf, "mark_page_halfdead: ");
  				out_target(buf, &(xlrec->target));
! 				appendStringInfo(buf, "; downlink %u; leaf %u; left %u; right %u",
! 								 xlrec->leafblk, xlrec->leftblk, xlrec->rightblk, xlrec->downlink);
! 				break;
! 			}
! 		case XLOG_BTREE_UNLINK_PAGE_META:
! 		case XLOG_BTREE_UNLINK_PAGE:
! 			{
! 				xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) rec;
! 
! 				appendStringInfoString(buf, "unlink_page: ");
! 				appendStringInfo(buf, "rel %u/%u/%u; ",
! 								 xlrec->node.spcNode, xlrec->node.dbNode, xlrec->node.relNode);
! 				appendStringInfo(buf, "; dead %u; left %u; right %u; leaf %u; downlink %u, btpo_xact %u",
! 								 xlrec->deadblk, xlrec->leftblk, xlrec->rightblk, xlrec->leafblk, xlrec->downlink, xlrec->btpo_xact);
  				break;
  			}
  		case XLOG_BTREE_NEWROOT:
*** a/src/include/access/nbtree.h
--- b/src/include/access/nbtree.h
***************
*** 215,225 **** typedef struct BTMetaPageData
  #define XLOG_BTREE_SPLIT_L_ROOT 0x50	/* add tuple with split of root */
  #define XLOG_BTREE_SPLIT_R_ROOT 0x60	/* as above, new item on right */
  #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
! #define XLOG_BTREE_DELETE_PAGE	0x80	/* delete an entire page */
! #define XLOG_BTREE_DELETE_PAGE_META 0x90		/* same, and update metapage */
  #define XLOG_BTREE_NEWROOT		0xA0	/* new root page */
! #define XLOG_BTREE_DELETE_PAGE_HALF 0xB0		/* page deletion that makes
! 												 * parent half-dead */
  #define XLOG_BTREE_VACUUM		0xC0	/* delete entries on a page during
  										 * vacuum */
  #define XLOG_BTREE_REUSE_PAGE	0xD0	/* old page is about to be reused from
--- 215,224 ----
  #define XLOG_BTREE_SPLIT_L_ROOT 0x50	/* add tuple with split of root */
  #define XLOG_BTREE_SPLIT_R_ROOT 0x60	/* as above, new item on right */
  #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
! #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
! #define XLOG_BTREE_UNLINK_PAGE_META 0x90 /* same, and update metapage */
  #define XLOG_BTREE_NEWROOT		0xA0	/* new root page */
! #define XLOG_BTREE_MARK_PAGE_HALFDEAD 0xB0 /* mark a leaf as half-dead */
  #define XLOG_BTREE_VACUUM		0xC0	/* delete entries on a page during
  										 * vacuum */
  #define XLOG_BTREE_REUSE_PAGE	0xD0	/* old page is about to be reused from
***************
*** 370,392 **** typedef struct xl_btree_vacuum
  #define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
  
  /*
!  * This is what we need to know about deletion of a btree page.  The target
!  * identifies the tuple removed from the parent page (note that we remove
!  * this tuple's downlink and the *following* tuple's key).	Note we do not
!  * store any content for the deleted page --- it is just rewritten as empty
!  * during recovery, apart from resetting the btpo.xact.
   */
! typedef struct xl_btree_delete_page
  {
  	xl_btreetid target;			/* deleted tuple id in parent page */
! 	BlockNumber deadblk;		/* child block being deleted */
! 	BlockNumber leftblk;		/* child block's left sibling, if any */
! 	BlockNumber rightblk;		/* child block's right sibling */
  	TransactionId btpo_xact;	/* value of btpo.xact for use in recovery */
! 	/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_DELETE_PAGE_META */
! } xl_btree_delete_page;
  
! #define SizeOfBtreeDeletePage	(offsetof(xl_btree_delete_page, btpo_xact) + sizeof(TransactionId))
  
  /*
   * New root log record.  There are zero tuples if this is to establish an
--- 369,410 ----
  #define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
  
  /*
!  * This is what we need to know about marking an empty branch for deletion.
!  * The target identifies the tuple removed from the parent page (note that we
!  * remove this tuple's downlink and the *following* tuple's key).  Note that
!  * the leaf page is empty, so we don't need to store its content --- it is
!  * just reinitialized during recovery using the rest of the fields.
   */
! typedef struct xl_btree_mark_page_halfdead
  {
  	xl_btreetid target;			/* deleted tuple id in parent page */
! 	BlockNumber leafblk;		/* leaf block ultimately being deleted */
! 	BlockNumber leftblk;		/* leaf block's left sibling, if any */
! 	BlockNumber rightblk;		/* leaf block's right sibling */
! 	BlockNumber downlink;		/* next child down in the branch */
! } xl_btree_mark_page_halfdead;
! 
! #define SizeOfBtreeMarkPageHalfDead	(offsetof(xl_btree_mark_page_halfdead, leafblk) + sizeof(BlockNumber))
! 
! /*
!  * This is what we need to know about deletion of a btree page.  Note we do
!  * not store any content for the deleted page --- it is just rewritten as empty
!  * during recovery, apart from resetting the btpo.xact.
!  */
! typedef struct xl_btree_unlink_page
! {
! 	RelFileNode	node;
! 	BlockNumber deadblk;		/* target block being deleted */
! 	BlockNumber leftblk;		/* taregt block's left sibling, if any */
! 	BlockNumber rightblk;		/* target block's right sibling */
! 	BlockNumber	leafblk;		/* if target is aninternal page, the leaf of
! 								 * the branch */
! 	BlockNumber	downlink;		/* next child down in the branch */
  	TransactionId btpo_xact;	/* value of btpo.xact for use in recovery */
! 	/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_UNLINK_PAGE_META */
! } xl_btree_unlink_page;
  
! #define SizeOfBtreeUnlinkPage	(offsetof(xl_btree_unlink_page, btpo_xact) + sizeof(TransactionId))
  
  /*
   * New root log record.  There are zero tuples if this is to establish an

#10

Peter Eisentraut

peter_e@gmx.net

about 12 years ago

In reply to: Heikki Linnakangas (#9)

Re: Race condition in b-tree page deletion

On 11/13/13, 11:04 AM, Heikki Linnakangas wrote:

Here's a patch implementing this scheme.

Compiler warnings:

nbtpage.c: In function ï¿½_bt_pagedelï¿½:
nbtpage.c:1695:24: warning: ï¿½metadï¿½ may be used uninitialized in this function [-Wmaybe-uninitialized]
nbtpage.c:1414:18: note: ï¿½metadï¿½ was declared here
nbtpage.c:1730:117: warning: ï¿½metapgï¿½ may be used uninitialized in this function [-Wmaybe-uninitialized]
nbtpage.c:1413:7: note: ï¿½metapgï¿½ was declared here

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11

Kevin Grittner

kgrittn@ymail.com

about 12 years ago

In reply to: Heikki Linnakangas (#9)

Re: Race condition in b-tree page deletion

Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

Here's a patch implementing this scheme.

I've read through the papers on the btree technique, read the
thread, and have been reviewing the patch. It will take me another
day or two to complete a close review and testing, but I wanted to
give preliminary results, and just let people know that I am
working on it.

After reading through the papers and the README for btree, I have
to say that the approach to deletion seemed like the weak link.
Searches, scans, and insertions all were pretty cohesive. The
logic was all based on the fact that missing *upper* levels was OK
because the sibling pointers would be automatically used to search
until the correct page was found at each level. The old deletion
approach turned this on its head by having missing pages at the
bottom, which introduces whole new classes of problems which
otherwise don't exist. This patch makes interim states in the
deletion process look almost exactly the same as interim states in
the insertion process, which I think is A Very Good Thing.

The approach to limiting the number of locks to a finite (and
small) number is clever, but seems quite sound. The locks are
always taken in an order which cannot cause a deadlock.

The write-up lacks any explicit mention of what happens if the
branch being considered for deletion reaches all the way to the
root. Although an astute reader will notice that since root page
will always be the only page at its level, it must always be the
rightmost page at its level, and therefore is correctly handled by
the logic dealing with that, I think it merits an explicit mention
in the README.

I agree with others that this patch is not suitable for
back-patching. I wonder whether some other solution might be worth
back-patching, since it is clearly a bug. The "never delete
internal pages" might be safe enough to consider. It would lead to
some degree of bloat, but perhaps bloat is better than errors. Of
course, the argument can certainly be made that people hit the bug
so rarely that we're better off just leaving it alone in stable
branches. Anyway, that would be a different patch.

Attachments:

fix-btree-page-deletion-3.patchtext/x-diff; name=fix-btree-page-deletion-3.patchDownload

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 40f09e3..03efc29 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -168,26 +168,33 @@ parent item does still exist and can't have been deleted.  Also, because
 we are matching downlink page numbers and not data keys, we don't have any
 problem with possibly misidentifying the parent item.
 
+Page Deletion
+-------------
+
 We consider deleting an entire page from the btree only when it's become
 completely empty of items.  (Merging partly-full pages would allow better
 space reuse, but it seems impractical to move existing data items left or
 right to make this happen --- a scan moving in the opposite direction
 might miss the items if so.)  Also, we *never* delete the rightmost page
 on a tree level (this restriction simplifies the traversal algorithms, as
-explained below).
-
-To delete an empty page, we acquire write lock on its left sibling (if
-any), the target page itself, the right sibling (there must be one), and
-the parent page, in that order.  The parent page must be found using the
-same type of search as used to find the parent during an insertion split.
-Then we update the side-links in the siblings, mark the target page
-deleted, and remove the downlink from the parent, as well as the parent's
-upper bounding key for the target (the one separating it from its right
-sibling).  This causes the target page's key space to effectively belong
-to its right sibling.  (Neither the left nor right sibling pages need to
-change their "high key" if any; so there is no problem with possibly not
-having enough space to replace a high key.)  The side-links in the target
-page are not changed.
+explained below).  Page deletion always begins from an empty leaf page. An
+internal page can only be deleted as part of a branch leading to a leaf
+page, where each internal page has only one child and that child is also to
+be deleted.
+
+Deleting a leaf page is a two-stage process. In the first stage, the page is
+unlinked from its parent, by removing its downlink, and marked as half-dead.
+The parent page must be found using the same type of search as used to find
+the parent during an insertion split.  We lock the target and the parent
+pages, change the target's downlink to point to the right sibling, and remove
+its old downlink.  This causes the target page's key space to effectively
+belong to its right sibling.  (Neither the left nor right sibling pages need
+to change their "high key" if any; so there is no problem with possibly not
+having enough space to replace a high key.)  At the same time, we mark the
+target page as half-dead, which causes any subsequent searches to ignore it
+and move right (or left, in a backwards scan).  This leaves the tree in a
+similar state as during a page split: the the page has no downlink pointing
+to it, but it's still linked to its siblings.
 
 (Note: Lanin and Shasha prefer to make the key space move left, but their
 argument for doing so hinges on not having left-links, which we have
@@ -199,31 +206,44 @@ the same parent --- otherwise, the parent's key space assignment changes
 too, meaning we'd have to make bounding-key updates in its parent, and
 perhaps all the way up the tree.  Since we can't possibly do that
 atomically, we forbid this case.  That means that the rightmost child of a
-parent node can't be deleted unless it's the only remaining child.
-
-When we delete the last remaining child of a parent page, we mark the
-parent page "half-dead" as part of the atomic update that deletes the
-child page.  This implicitly transfers the parent's key space to its right
-sibling (which it must have, since we never delete the overall-rightmost
-page of a level).  Searches ignore the half-dead page and immediately move
-right.  We need not worry about insertions into a half-dead page --- insertions
-into upper tree levels happen only as a result of splits of child pages, and
-the half-dead page no longer has any children that could split.  Therefore
-the page stays empty even when we don't have lock on it, and we can complete
-its deletion in a second atomic action.
-
-The notion of a half-dead page means that the key space relationship between
-the half-dead page's level and its parent's level may be a little out of
-whack: key space that appears to belong to the half-dead page's parent on the
-parent level may really belong to its right sibling.  To prevent any possible
-problems, we hold lock on the deleted child page until we have finished
-deleting any now-half-dead parent page(s).  This prevents any insertions into
-the transferred keyspace until the operation is complete.  The reason for
-doing this is that a sufficiently large number of insertions into the
-transferred keyspace, resulting in multiple page splits, could propagate keys
-from that keyspace into the parent level, resulting in transiently
-out-of-order keys in that level.  It is thought that that wouldn't cause any
-serious problem, but it seems too risky to allow.
+parent node can't be deleted unless it's the only remaining child, in which
+case we will delete the parent too (see below).
+
+In the second-stage, the half-dead leaf page is unlinked from its siblings.
+We first lock the left sibling (if any) of the target, the target page itself,
+and its right sibling (there must be one) in that order. Then we update the
+side-links in the siblings, and mark the target page deleted.
+
+When we're about to delete the last remaining child of a parent page, things
+are slightly more complicated.  In the first stage, we leave the immediate
+parent of the leaf page alone, and remove the downlink to the parent page
+instead, from the grandparent.  If it's the last child of the grandparent too,
+we recurse up until we find a parent with more than one child, and remove the
+downlink of that page.  The leaf page is marked as half-dead, and the block
+number of the page whose downlink was removed is stashed in the half-dead leaf
+page.  This leaves us with a chain of internal pages, with one downlink each,
+leading to the half-dead leaf page, and no downlink pointing to the topmost
+page in the chain.
+
+While we recurse up to find the topmost parent in the chain, we keep the
+leaf page locked, but don't need to hold locks on the intermediate pages
+between the leaf and the topmost parent -- insertions into upper tree levels
+happen only as a result of splits of child pages, and that can't happen as
+long as we're keeping the leaf locked. The internal pages in the chain cannot
+acquire new children afterwards either, because the leaf page is marked as
+half-dead and won't be split.
+
+Removing the downlink to the top of the to-be-deleted chain effectively
+transfers the key space to the right sibling for all the intermediate levels
+too, in one atomic operation.  A concurrent search might still visit the
+intermediate pages, but it will move right when it reaches the half-dead page
+at the leaf level.
+
+In the second stage, the topmost page in the chain is unlinked from its
+siblings, and the half-dead leaf page is updated to point to the next page
+down in the chain.  This is repeated until there are no internal pages left
+in the chain.  Finally, the half-dead leaf page itself is unlinked from its
+siblings.
 
 A deleted page cannot be reclaimed immediately, since there may be other
 processes waiting to reference it (ie, search processes that just left the
@@ -396,12 +416,11 @@ metapage update (of the "fast root" link) is performed and logged as part
 of the insertion into the parent level.  When splitting the root page, the
 metapage update is handled as part of the "new root" action.
 
-A page deletion is logged as a single WAL entry covering all four
-required page updates (target page, left and right siblings, and parent)
-as an atomic event.  (Any required fast-root link update is also part
-of the WAL entry.)  If the parent page becomes half-dead but is not
-immediately deleted due to a subsequent crash, there is no loss of
-consistency, and the empty page will be picked up by the next VACUUM.
+Each step in page deletion are logged as separate WAL entries: marking the
+leaf as half-dead and removing the downlink is one record, and unlinking a
+page is a second record.  If vacuum is interrupted for some reason, or the
+system crashes, the tree is consistent for searches and insertions.  The next
+VACUUM will find the half-dead leaf page and continue the deletion.
 
 Scans during Recovery
 ---------------------
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index c7c65a6..7ccbfd3 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -31,6 +31,14 @@
 #include "utils/inval.h"
 #include "utils/snapmgr.h"
 
+static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
+static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
+						 bool *rightsib_empty);
+static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
+					   BTStack stack, Buffer *topparent, OffsetNumber *topoff,
+					   BlockNumber *target, BlockNumber *rightsib);
+static void  _bt_log_reuse_page(Relation rel, BlockNumber blkno,
+				   TransactionId latestRemovedXid);
 
 /*
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
@@ -954,24 +962,36 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 }
 
 /*
- * Subroutine to pre-check whether a page deletion is safe, that is, its
- * parent page would be left in a valid or deletable state.
+ * Subroutine to find the parent of the branch we're deleting.  This climbs
+ * up the tree until it finds a page with more than one child, ie. a page
+ * that will not be totally emptied by the deletion.  The chain of pages below
+ * it, with one downlink each, will be part of the branch that we need to
+ * delete.
  *
- * "target" is the page we wish to delete, and "stack" is a search stack
+ * If we cannot remove the downlink from the parent, because it's the
+ * rightmost entry, returns false.  On success, *topparent and *topoff are set
+ * to the buffer holding the parent, and the offset of the downlink in it.
+ * *topparent is write-locked, the caller is responsible for releasing it when
+ * done.  *target is set to the topmost page in the branch to-be-deleted, ie.
+ * the page whose downlink *topparent / *topoff point to, and *rightsib to its
+ * right sibling.
+ *
+ * "child" is the leaf page we wish to delete, and "stack" is a search stack
  * leading to it (approximately).  Note that we will update the stack
  * entry(s) to reflect current downlink positions --- this is harmless and
- * indeed saves later search effort in _bt_pagedel.
+ * indeed saves later search effort in _bt_pagedel.  The caller should
+ * initialize *target and *rightsib to the leaf page and its right sibling.
  *
- * Note: it's OK to release page locks after checking, because a safe
- * deletion can't become unsafe due to concurrent activity.  A non-rightmost
- * page cannot become rightmost unless there's a concurrent page deletion,
- * but only VACUUM does page deletion and we only allow one VACUUM on an index
- * at a time.  An only child could acquire a sibling (of the same parent) only
- * by being split ... but that would make it a non-rightmost child so the
- * deletion is still safe.
+ * Note: it's OK to release page locks on any internal pages between the leaf
+ * and *topparent, because a safe deletion can't become unsafe due to
+ * concurrent activity.  An internal page can only acquire an entry if the
+ * child is split, but that cannot happen as long as we hold a lock on the
+ * leaf.
  */
 static bool
-_bt_parent_deletion_safe(Relation rel, BlockNumber target, BTStack stack)
+_bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
+					   Buffer *topparent, OffsetNumber *topoff,
+					   BlockNumber *target, BlockNumber *rightsib)
 {
 	BlockNumber parent;
 	OffsetNumber poffset,
@@ -980,20 +1000,12 @@ _bt_parent_deletion_safe(Relation rel, BlockNumber target, BTStack stack)
 	Page		page;
 	BTPageOpaque opaque;
 
-	/*
-	 * In recovery mode, assume the deletion being replayed is valid.  We
-	 * can't always check it because we won't have a full search stack, and we
-	 * should complain if there's a problem, anyway.
-	 */
-	if (InRecovery)
-		return true;
-
 	/* Locate the parent's downlink (updating the stack entry if needed) */
-	ItemPointerSet(&(stack->bts_btentry.t_tid), target, P_HIKEY);
+	ItemPointerSet(&(stack->bts_btentry.t_tid), child, P_HIKEY);
 	pbuf = _bt_getstackbuf(rel, stack, BT_READ);
 	if (pbuf == InvalidBuffer)
 		elog(ERROR, "failed to re-find parent key in index \"%s\" for deletion target page %u",
-			 RelationGetRelationName(rel), target);
+			 RelationGetRelationName(rel), child);
 	parent = stack->bts_blkno;
 	poffset = stack->bts_offset;
 
@@ -1021,8 +1033,12 @@ _bt_parent_deletion_safe(Relation rel, BlockNumber target, BTStack stack)
 				return false;
 			}
 
+			*target = child;
+			*rightsib = opaque->btpo_next;
+
 			_bt_relbuf(rel, pbuf);
-			return _bt_parent_deletion_safe(rel, parent, stack->bts_parent);
+			return _bt_lock_branch_parent(rel, parent, stack->bts_parent,
+										  topparent, topoff, target, rightsib);
 		}
 		else
 		{
@@ -1034,7 +1050,8 @@ _bt_parent_deletion_safe(Relation rel, BlockNumber target, BTStack stack)
 	else
 	{
 		/* Not rightmost child, so safe to delete */
-		_bt_relbuf(rel, pbuf);
+		*topparent = pbuf;
+		*topoff = poffset;
 		return true;
 	}
 }
@@ -1051,158 +1068,420 @@ _bt_parent_deletion_safe(Relation rel, BlockNumber target, BTStack stack)
  * On entry, the target buffer must be pinned and locked (either read or write
  * lock is OK).  This lock and pin will be dropped before exiting.
  *
- * The "stack" argument can be a search stack leading (approximately) to the
- * target page, or NULL --- outside callers typically pass NULL since they
- * have not done such a search, but internal recursion cases pass the stack
- * to avoid duplicated search effort.
- *
  * Returns the number of pages successfully deleted (zero if page cannot
- * be deleted now; could be more than one if parent pages were deleted too).
+ * be deleted now; could be more than one if parent or sibling pages were
+ * deleted too).
  *
  * NOTE: this leaks memory.  Rather than trying to clean up everything
  * carefully, it's better to run it in a temp context that can be reset
  * frequently.
  */
 int
-_bt_pagedel(Relation rel, Buffer buf, BTStack stack)
+_bt_pagedel(Relation rel, Buffer buf)
 {
-	int			result;
-	BlockNumber target,
-				leftsib,
-				rightsib,
-				parent;
-	OffsetNumber poffset,
-				maxoff;
-	uint32		targetlevel,
-				ilevel;
-	ItemId		itemid;
-	IndexTuple	targetkey,
-				itup;
-	ScanKey		itup_scankey;
-	Buffer		lbuf,
-				rbuf,
-				pbuf;
-	bool		parent_half_dead;
-	bool		parent_one_child;
+	int			ndeleted = 0;
+	BlockNumber rightsib;
 	bool		rightsib_empty;
-	Buffer		metabuf = InvalidBuffer;
-	Page		metapg = NULL;
-	BTMetaPageData *metad = NULL;
 	Page		page;
 	BTPageOpaque opaque;
-
 	/*
-	 * We can never delete rightmost pages nor root pages.	While at it, check
-	 * that page is not already deleted and is empty.
+	 * "stack" is a search stack leading (approximately) to the target page.
+	 * It is initially NULL, but when iterating, we keep it to avoid
+	 * duplicated search effort.
 	 */
-	page = BufferGetPage(buf);
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-	if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
-		P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page))
+	BTStack		stack = NULL;
+
+	for (;;)
 	{
-		/* Should never fail to delete a half-dead page */
-		Assert(!P_ISHALFDEAD(opaque));
+		page = BufferGetPage(buf);
+		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
-		_bt_relbuf(rel, buf);
-		return 0;
-	}
+		/*
+		 * Internal pages are never deleted directly, only as part of deleting
+		 * the whole branch all the way down to leaf level.
+		 */
+		if (!P_ISLEAF(opaque))
+		{
+			/*
+			 * Pre-9.4 page deletion only marked internal pages as half-dead,
+			 * but now we only use that flag on leaf pages. The old algorithm
+			 * was never supposed to leave half-dead pages in the tree, it was
+			 * just a transient state, but it was nevertheless possible in
+			 * error scenarios. We don't know how to deal with them here. They
+			 * are harmless as far as searches are considered, but inserts into
+			 * the deleted keyspace could add out-of-order downlinks in the
+			 * upper levels. Log a notice, hopefully the admin will notice and
+			 * reindex.
+			 */
+			if (P_ISHALFDEAD(opaque))
+				ereport(LOG,
+						(errcode(ERRCODE_INDEX_CORRUPTED),
+						 errmsg("index \"%s\" contains a half-dead internal page",
+								RelationGetRelationName(rel)),
+						 errhint("This can be caused by an interrupt VACUUM in version 9.3 or older, before upgrade. Please REINDEX it.")));
+			_bt_relbuf(rel, buf);
+			return ndeleted;
+		}
 
-	/*
-	 * Save info about page, including a copy of its high key (it must have
-	 * one, being non-rightmost).
-	 */
-	target = BufferGetBlockNumber(buf);
-	targetlevel = opaque->btpo.level;
-	leftsib = opaque->btpo_prev;
-	itemid = PageGetItemId(page, P_HIKEY);
-	targetkey = CopyIndexTuple((IndexTuple) PageGetItem(page, itemid));
+		/*
+		 * We can never delete rightmost pages nor root pages.	While at it,
+		 * check that page is not already deleted and is empty.
+		 */
+		if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
+			P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page))
+		{
+			/* Should never fail to delete a half-dead page */
+			Assert(!P_ISHALFDEAD(opaque));
 
-	/*
-	 * To avoid deadlocks, we'd better drop the target page lock before going
-	 * further.
-	 */
-	_bt_relbuf(rel, buf);
+			_bt_relbuf(rel, buf);
+			return ndeleted;
+		}
 
-	/*
-	 * We need an approximate pointer to the page's parent page.  We use the
-	 * standard search mechanism to search for the page's high key; this will
-	 * give us a link to either the current parent or someplace to its left
-	 * (if there are multiple equal high keys).  In recursion cases, the
-	 * caller already generated a search stack and we can just re-use that
-	 * work.
-	 */
-	if (stack == NULL)
-	{
-		if (!InRecovery)
+		/*
+		 * First, remove downlink pointing to the page (or a parent of the page,
+		 * if we are going to delete a taller branch), and mark the page as
+		 * half-dead.
+		 */
+		if (!P_ISHALFDEAD(opaque))
 		{
-			/* we need an insertion scan key to do our search, so build one */
-			itup_scankey = _bt_mkscankey(rel, targetkey);
-			/* find the leftmost leaf page containing this key */
-			stack = _bt_search(rel, rel->rd_rel->relnatts, itup_scankey, false,
-							   &lbuf, BT_READ);
-			/* don't need a pin on that either */
-			_bt_relbuf(rel, lbuf);
-
 			/*
-			 * If we are trying to delete an interior page, _bt_search did
-			 * more than we needed.  Locate the stack item pointing to our
-			 * parent level.
+			 * We need an approximate pointer to the page's parent page.  We
+			 * use the standard search mechanism to search for the page's high
+			 * key; this will give us a link to either the current parent or
+			 * someplace to its left (if there are multiple equal high keys).
 			 */
-			ilevel = 0;
-			for (;;)
+			if (!stack)
 			{
-				if (stack == NULL)
-					elog(ERROR, "not enough stack items");
-				if (ilevel == targetlevel)
-					break;
-				stack = stack->bts_parent;
-				ilevel++;
+				ScanKey		itup_scankey;
+				ItemId		itemid;
+				IndexTuple	targetkey;
+				Buffer		lbuf;
+
+				itemid = PageGetItemId(page, P_HIKEY);
+				targetkey = CopyIndexTuple((IndexTuple) PageGetItem(page, itemid));
+
+				/*
+				 * To avoid deadlocks, we'd better drop the leaf page lock
+				 * before going further.
+				 */
+				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+				/* we need an insertion scan key for the search, so build one */
+				itup_scankey = _bt_mkscankey(rel, targetkey);
+				/* find the leftmost leaf page containing this key */
+				stack = _bt_search(rel, rel->rd_rel->relnatts, itup_scankey,
+								   false, &lbuf, BT_READ);
+				/* don't need a pin on the page */
+				_bt_relbuf(rel, lbuf);
+
+				/*
+				 * Re-lock the leaf page, and start over, to re-check that the
+				 * page can still be deleted.
+				 */
+				LockBuffer(buf, BT_WRITE);
+				continue;
+			}
+
+			if (!_bt_mark_page_halfdead(rel, buf, stack))
+			{
+				_bt_relbuf(rel, buf);
+				return ndeleted;
 			}
 		}
-		else
+
+		/*
+		 * Then unlink it from its siblings. Each call to
+		 *_bt_unlink_halfdead_page unlinks the topmost page from the branch,
+		 * making it shallower. Iterate until the leaf page is gone.
+		 */
+		rightsib_empty = false;
+		while (P_ISHALFDEAD(opaque))
 		{
-			/*
-			 * During WAL recovery, we can't use _bt_search (for one reason,
-			 * it might invoke user-defined comparison functions that expect
-			 * facilities not available in recovery mode).	Instead, just set
-			 * up a dummy stack pointing to the left end of the parent tree
-			 * level, from which _bt_getstackbuf will walk right to the parent
-			 * page.  Painful, but we don't care too much about performance in
-			 * this scenario.
-			 */
-			pbuf = _bt_get_endpoint(rel, targetlevel + 1, false);
-			stack = (BTStack) palloc(sizeof(BTStackData));
-			stack->bts_blkno = BufferGetBlockNumber(pbuf);
-			stack->bts_offset = InvalidOffsetNumber;
-			/* bts_btentry will be initialized below */
-			stack->bts_parent = NULL;
-			_bt_relbuf(rel, pbuf);
+			if (!_bt_unlink_halfdead_page(rel, buf, &rightsib_empty))
+			{
+				_bt_relbuf(rel, buf);
+				return ndeleted;
+			}
+			ndeleted++;
 		}
+
+		rightsib = opaque->btpo_next;
+
+		_bt_relbuf(rel, buf);
+
+		/*
+		 * The page has now been deleted. If its right sibling is completely
+		 * empty, it's possible that the reason we haven't deleted it earlier
+		 * is that it was the rightmost child of the parent. Now that we
+		 * removed the downlink for this page, the right sibling might now be
+		 * the only child of the parent, and could be removed. It would be
+		 * picked up by the next vacuum anyway, but might as well try to remove
+		 * it now, so loop back to process the right sibling.
+		 */
+		if (!rightsib_empty)
+			break;
+
+		buf = _bt_getbuf(rel, rightsib, BT_WRITE);
 	}
 
+	return ndeleted;
+}
+
+static bool
+_bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
+{
+	BlockNumber	leafblkno;
+	BlockNumber leafrightsib;
+	BlockNumber	target;
+	BlockNumber rightsib;
+	ItemId		itemid;
+	Page		page;
+	BTPageOpaque opaque;
+	Buffer		topparent;
+	OffsetNumber topoff;
+	OffsetNumber nextoffset;
+	IndexTuple	itup;
+
+	page = BufferGetPage(leafbuf);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	Assert(!P_RIGHTMOST(opaque) && !P_ISROOT(opaque) && !P_ISDELETED(opaque) &&
+		   !P_ISHALFDEAD(opaque) && P_ISLEAF(opaque) &&
+		   P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page));
+
+	/*
+	 * Save info about page.
+	 */
+	leafblkno = BufferGetBlockNumber(leafbuf);
+	leafrightsib = opaque->btpo_next;
+
 	/*
 	 * We cannot delete a page that is the rightmost child of its immediate
 	 * parent, unless it is the only child --- in which case the parent has to
 	 * be deleted too, and the same condition applies recursively to it. We
-	 * have to check this condition all the way up before trying to delete. We
-	 * don't need to re-test when deleting a non-leaf page, though.
+	 * have to check this condition all the way up before trying to delete,
+	 * and lock the final parent of the to-be-deleted branch.
+	 */
+	rightsib = leafrightsib;
+	target = leafblkno;
+	if (!_bt_lock_branch_parent(rel, leafblkno, stack,
+								&topparent, &topoff, &target, &rightsib))
+		return false;
+
+	/*
+	 * Check that the parent-page index items we're about to delete/overwrite
+	 * contain what we expect.	This can fail if the index has become corrupt
+	 * for some reason.  We want to throw any error before entering the
+	 * critical section --- otherwise it'd be a PANIC.
+	 *
+	 * The test on the target item is just an Assert because
+	 * _bt_lock_branch_parent should have guaranteed it has the expected
+	 * contents.  The test on the next-child downlink is known to sometimes
+	 * fail in the field, though.
+	 */
+	page = BufferGetPage(topparent);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+#ifdef USE_ASSERT_CHECKING
+	itemid = PageGetItemId(page, topoff);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	Assert(ItemPointerGetBlockNumber(&(itup->t_tid)) == target);
+#endif
+
+	nextoffset = OffsetNumberNext(topoff);
+	itemid = PageGetItemId(page, nextoffset);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (ItemPointerGetBlockNumber(&(itup->t_tid)) != rightsib)
+		elog(ERROR, "right sibling %u of block %u is not next child %u of block %u in index \"%s\"",
+			 rightsib, target, ItemPointerGetBlockNumber(&(itup->t_tid)),
+			 BufferGetBlockNumber(topparent), RelationGetRelationName(rel));
+
+	/*
+	 * Any insert which would have gone on the target block will now go to the
+	 * right sibling block.
+	 */
+	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
+
+	/* No ereport(ERROR) until changes are logged */
+	START_CRIT_SECTION();
+
+	/*
+	 * Update parent.  The normal case is a tad tricky because we want to
+	 * delete the target's downlink and the *following* key.  Easiest way is
+	 * to copy the right sibling's downlink over the target downlink, and then
+	 * delete the following item.
 	 */
-	if (targetlevel == 0 &&
-		!_bt_parent_deletion_safe(rel, target, stack))
-		return 0;
+	page = BufferGetPage(topparent);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	itemid = PageGetItemId(page, topoff);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	ItemPointerSet(&(itup->t_tid), rightsib, P_HIKEY);
+
+	nextoffset = OffsetNumberNext(topoff);
+	PageIndexTupleDelete(page, nextoffset);
+
+	/*
+	 * Mark the leaf page as half-dead, and stamp it with a pointer to the
+	 * highest internal page in the branch we're deleting. We use the tid of
+	 * the high key to store it.
+	 */
+	page = BufferGetPage(leafbuf);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	opaque->btpo_flags |= BTP_HALF_DEAD;
+
+	itemid = PageGetItemId(page, P_HIKEY);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (target == leafblkno)
+		ItemPointerSetInvalid(&(itup->t_tid));
+	else
+		ItemPointerSet(&(itup->t_tid), target, P_HIKEY);
+
+	/* Must mark buffers dirty before XLogInsert */
+	MarkBufferDirty(topparent);
+	MarkBufferDirty(leafbuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_btree_mark_page_halfdead xlrec;
+		XLogRecPtr	recptr;
+		XLogRecData rdata[2];
+
+		xlrec.target.node = rel->rd_node;
+		ItemPointerSet(&(xlrec.target.tid), BufferGetBlockNumber(topparent), topoff);
+		xlrec.leafblk = leafblkno;
+		xlrec.downlink = target;
+
+		page = BufferGetPage(leafbuf);
+		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		xlrec.leftblk = opaque->btpo_prev;
+		xlrec.rightblk = opaque->btpo_next;
+
+		rdata[0].data = (char *) &xlrec;
+		rdata[0].len = SizeOfBtreeMarkPageHalfDead;
+		rdata[0].buffer = InvalidBuffer;
+		rdata[0].next = &(rdata[1]);
+
+		rdata[1].data = NULL;
+		rdata[1].len = 0;
+		rdata[1].buffer = topparent;
+		rdata[1].buffer_std = true;
+		rdata[1].next = NULL;
+
+		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_MARK_PAGE_HALFDEAD, rdata);
+
+		page = BufferGetPage(topparent);
+		PageSetLSN(page, recptr);
+		page = BufferGetPage(leafbuf);
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	_bt_relbuf(rel, topparent);
+	return true;
+}
+
+/*
+ * Unlink a page in a branch of half-dead pages from its siblings.
+ *
+ * If the leaf page still has a downlink pointing to it, unlinks the highest
+ * parent in the to-be-deleted branch instead of the leaf page. To get rid
+ * of the whole branch, including the leaf page itself, iterate until the
+ * leaf page is deleted.
+ *
+ * Returns 'false' if the page could not be unlinked (shouldn't happen).
+ * If the (new) right sibling of the page is empty, *rightsib_empty is set
+ * to true.
+ */
+static bool
+_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
+{
+	BlockNumber leafblkno = BufferGetBlockNumber(leafbuf);
+	BlockNumber leafleftsib;
+	BlockNumber leafrightsib;
+	BlockNumber	target;
+	BlockNumber	leftsib;
+	BlockNumber	rightsib;
+	Buffer		lbuf = InvalidBuffer;
+	Buffer		buf;
+	Buffer		rbuf;
+	Buffer		metabuf = InvalidBuffer;
+	Page		metapg = NULL;
+	BTMetaPageData *metad = NULL;
+	ItemId		itemid;
+	Page		page;
+	BTPageOpaque opaque;
+	bool		rightsib_is_rightmost;
+	int			targetlevel;
+	ItemPointer leafhikey;
+	BlockNumber nextchild;
+	BlockNumber	topblkno;
+
+	page = BufferGetPage(leafbuf);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	Assert(P_ISLEAF(opaque) && P_ISHALFDEAD(opaque));
+
+	/*
+	 * Remember some information about the leaf page.
+	 */
+	itemid = PageGetItemId(page, P_HIKEY);
+	leafhikey = &((IndexTuple) PageGetItem(page, itemid))->t_tid;
+	leafleftsib = opaque->btpo_prev;
+	leafrightsib = opaque->btpo_next;
+
+	LockBuffer(leafbuf, BUFFER_LOCK_UNLOCK);
+
+	/*
+	 * If the leaf page still has a parent pointing to it (or a chain of
+	 * parents), we don't unlink the leaf page yet, but the topmost remaining
+	 * parent in the branch.
+	 */
+	if (ItemPointerIsValid(leafhikey))
+	{
+		topblkno = ItemPointerGetBlockNumber(leafhikey);
+		target = topblkno;
+
+		/* fetch the block number of the topmost parent's left sibling */
+		buf = _bt_getbuf(rel, topblkno, BT_READ);
+		page = BufferGetPage(buf);
+		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		leftsib = opaque->btpo_prev;
+		targetlevel = opaque->btpo.level;
+
+		/*
+		 * To avoid deadlocks, we'd better drop the target page lock before
+		 * going further.
+		 */
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+	}
+	else
+	{
+		topblkno = InvalidBlockNumber;
+		target = leafblkno;
+		buf = leafbuf;
+		targetlevel = 0;
+	}
+
+	leftsib = opaque->btpo_prev;
 
 	/*
 	 * We have to lock the pages we need to modify in the standard order:
 	 * moving right, then up.  Else we will deadlock against other writers.
 	 *
-	 * So, we need to find and write-lock the current left sibling of the
-	 * target page.  The sibling that was current a moment ago could have
-	 * split, so we may have to move right.  This search could fail if either
-	 * the sibling or the target page was deleted by someone else meanwhile;
-	 * if so, give up.	(Right now, that should never happen, since page
-	 * deletion is only done in VACUUM and there shouldn't be multiple VACUUMs
-	 * concurrently on the same table.)
+	 * So, first lock the leaf page, if it's not the target.  Then find and
+	 * write-lock the current left sibling of the target page.  The sibling
+	 * that was current a moment ago could have split, so we may have to move
+	 * right.  This search could fail if either the sibling or the target page
+	 * was deleted by someone else meanwhile; if so, give up.  (Right now,
+	 * that should never happen, since page deletion is only done in VACUUM
+	 * and there shouldn't be multiple VACUUMs concurrently on the same
+	 * table.)
 	 */
+	if (target != leafblkno)
+		LockBuffer(leafbuf, BT_WRITE);
 	if (leftsib != P_NONE)
 	{
 		lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
@@ -1217,7 +1496,7 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 			{
 				elog(LOG, "no left sibling (concurrent deletion?) in \"%s\"",
 					 RelationGetRelationName(rel));
-				return 0;
+				return false;
 			}
 			lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
 			page = BufferGetPage(lbuf);
@@ -1232,27 +1511,44 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 	 * a write lock not a superexclusive lock, since no scans would stop on an
 	 * empty page.
 	 */
-	buf = _bt_getbuf(rel, target, BT_WRITE);
+	LockBuffer(buf, BT_WRITE);
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
 	/*
-	 * Check page is still empty etc, else abandon deletion.  The empty check
-	 * is necessary since someone else might have inserted into it while we
-	 * didn't have it locked; the others are just for paranoia's sake.
+	 * Check page is still empty etc, else abandon deletion.  This is just
+	 * for paranoia's sake; a half-dead page cannot resurrect because there
+	 * can be only one vacuum process running at a time.
 	 */
-	if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
-		P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page))
+	if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque))
 	{
-		_bt_relbuf(rel, buf);
-		if (BufferIsValid(lbuf))
-			_bt_relbuf(rel, lbuf);
-		return 0;
+		elog(ERROR, "half-dead page changed status unexpectedly in block %u of index \"%s\"",
+			 target, RelationGetRelationName(rel));
 	}
 	if (opaque->btpo_prev != leftsib)
 		elog(ERROR, "left link changed unexpectedly in block %u of index \"%s\"",
 			 target, RelationGetRelationName(rel));
 
+	if (target == leafblkno)
+	{
+		if (P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page) ||
+			!P_ISLEAF(opaque) || !P_ISHALFDEAD(opaque))
+			elog(ERROR, "half-dead page changed status unexpectedly in block %u of index \"%s\"",
+				 target, RelationGetRelationName(rel));
+		nextchild = InvalidBlockNumber;
+	}
+	else
+	{
+		if (P_FIRSTDATAKEY(opaque) != PageGetMaxOffsetNumber(page) ||
+			P_ISLEAF(opaque))
+			elog(ERROR, "half-dead page changed status unexpectedly in block %u of index \"%s\"",
+				 target, RelationGetRelationName(rel));
+
+		/* remember the next child down in the branch. */
+		itemid = PageGetItemId(page, P_FIRSTDATAKEY(opaque));
+		nextchild = ItemPointerGetBlockNumber(&((IndexTuple) PageGetItem(page, itemid))->t_tid);
+	}
+
 	/*
 	 * And next write-lock the (current) right sibling.
 	 */
@@ -1265,50 +1561,8 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 			 "block %u links to %u instead of expected %u in index \"%s\"",
 			 rightsib, opaque->btpo_prev, target,
 			 RelationGetRelationName(rel));
-
-	/*
-	 * Any insert which would have gone on the target block will now go to the
-	 * right sibling block.
-	 */
-	PredicateLockPageCombine(rel, target, rightsib);
-
-	/*
-	 * Next find and write-lock the current parent of the target page. This is
-	 * essentially the same as the corresponding step of splitting.
-	 */
-	ItemPointerSet(&(stack->bts_btentry.t_tid), target, P_HIKEY);
-	pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
-	if (pbuf == InvalidBuffer)
-		elog(ERROR, "failed to re-find parent key in index \"%s\" for deletion target page %u",
-			 RelationGetRelationName(rel), target);
-	parent = stack->bts_blkno;
-	poffset = stack->bts_offset;
-
-	/*
-	 * If the target is the rightmost child of its parent, then we can't
-	 * delete, unless it's also the only child --- in which case the parent
-	 * changes to half-dead status.  The "can't delete" case should have been
-	 * detected by _bt_parent_deletion_safe, so complain if we see it now.
-	 */
-	page = BufferGetPage(pbuf);
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-	maxoff = PageGetMaxOffsetNumber(page);
-	parent_half_dead = false;
-	parent_one_child = false;
-	if (poffset >= maxoff)
-	{
-		if (poffset == P_FIRSTDATAKEY(opaque))
-			parent_half_dead = true;
-		else
-			elog(ERROR, "failed to delete rightmost child %u of block %u in index \"%s\"",
-				 target, parent, RelationGetRelationName(rel));
-	}
-	else
-	{
-		/* Will there be exactly one child left in this parent? */
-		if (OffsetNumberNext(P_FIRSTDATAKEY(opaque)) == maxoff)
-			parent_one_child = true;
-	}
+	rightsib_is_rightmost = P_RIGHTMOST(opaque);
+	*rightsib_empty = (P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page));
 
 	/*
 	 * If we are deleting the next-to-last page on the target's level, then
@@ -1322,11 +1576,10 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 	 * We can safely acquire a lock on the metapage here --- see comments for
 	 * _bt_newroot().
 	 */
-	if (leftsib == P_NONE && !parent_half_dead)
+	if (leftsib == P_NONE && rightsib_is_rightmost)
 	{
 		page = BufferGetPage(rbuf);
 		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-		Assert(opaque->btpo.level == targetlevel);
 		if (P_RIGHTMOST(opaque))
 		{
 			/* rightsib will be the only one left on the level */
@@ -1349,38 +1602,6 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 	}
 
 	/*
-	 * Check that the parent-page index items we're about to delete/overwrite
-	 * contain what we expect.	This can fail if the index has become corrupt
-	 * for some reason.  We want to throw any error before entering the
-	 * critical section --- otherwise it'd be a PANIC.
-	 *
-	 * The test on the target item is just an Assert because _bt_getstackbuf
-	 * should have guaranteed it has the expected contents.  The test on the
-	 * next-child downlink is known to sometimes fail in the field, though.
-	 */
-	page = BufferGetPage(pbuf);
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
-#ifdef USE_ASSERT_CHECKING
-	itemid = PageGetItemId(page, poffset);
-	itup = (IndexTuple) PageGetItem(page, itemid);
-	Assert(ItemPointerGetBlockNumber(&(itup->t_tid)) == target);
-#endif
-
-	if (!parent_half_dead)
-	{
-		OffsetNumber nextoffset;
-
-		nextoffset = OffsetNumberNext(poffset);
-		itemid = PageGetItemId(page, nextoffset);
-		itup = (IndexTuple) PageGetItem(page, itemid);
-		if (ItemPointerGetBlockNumber(&(itup->t_tid)) != rightsib)
-			elog(ERROR, "right sibling %u of block %u is not next child %u of block %u in index \"%s\"",
-				 rightsib, target, ItemPointerGetBlockNumber(&(itup->t_tid)),
-				 parent, RelationGetRelationName(rel));
-	}
-
-	/*
 	 * Here we begin doing the deletion.
 	 */
 
@@ -1388,29 +1609,6 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 	START_CRIT_SECTION();
 
 	/*
-	 * Update parent.  The normal case is a tad tricky because we want to
-	 * delete the target's downlink and the *following* key.  Easiest way is
-	 * to copy the right sibling's downlink over the target downlink, and then
-	 * delete the following item.
-	 */
-	if (parent_half_dead)
-	{
-		PageIndexTupleDelete(page, poffset);
-		opaque->btpo_flags |= BTP_HALF_DEAD;
-	}
-	else
-	{
-		OffsetNumber nextoffset;
-
-		itemid = PageGetItemId(page, poffset);
-		itup = (IndexTuple) PageGetItem(page, itemid);
-		ItemPointerSet(&(itup->t_tid), rightsib, P_HIKEY);
-
-		nextoffset = OffsetNumberNext(poffset);
-		PageIndexTupleDelete(page, nextoffset);
-	}
-
-	/*
 	 * Update siblings' side-links.  Note the target page's side-links will
 	 * continue to point to the siblings.  Asserts here are just rechecking
 	 * things we already verified above.
@@ -1426,7 +1624,19 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	Assert(opaque->btpo_prev == target);
 	opaque->btpo_prev = leftsib;
-	rightsib_empty = (P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page));
+
+	/*
+	 * If we deleted a parent of the targeted leaf page, instead of the leaf
+	 * itself, update the leaf to point to the next remaining child in the
+	 * branch.
+	 */
+	if (target != leafblkno)
+	{
+		if (nextchild == leafblkno)
+			ItemPointerSetInvalid(leafhikey);
+		else
+			ItemPointerSet(leafhikey, nextchild, P_HIKEY);
+	}
 
 	/*
 	 * Mark the page itself deleted.  It can be recycled when all current
@@ -1453,31 +1663,39 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 	}
 
 	/* Must mark buffers dirty before XLogInsert */
-	MarkBufferDirty(pbuf);
 	MarkBufferDirty(rbuf);
 	MarkBufferDirty(buf);
 	if (BufferIsValid(lbuf))
 		MarkBufferDirty(lbuf);
+	if (target != leafblkno)
+		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
 	if (RelationNeedsWAL(rel))
 	{
-		xl_btree_delete_page xlrec;
+		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
 		uint8		xlinfo;
 		XLogRecPtr	recptr;
-		XLogRecData rdata[5];
+		XLogRecData rdata[4];
 		XLogRecData *nextrdata;
 
-		xlrec.target.node = rel->rd_node;
-		ItemPointerSet(&(xlrec.target.tid), parent, poffset);
+		xlrec.node = rel->rd_node;
+
+		/* information on the unlinked block */
 		xlrec.deadblk = target;
-		xlrec.leftblk = leftsib;
-		xlrec.rightblk = rightsib;
+		xlrec.leftsib = leftsib;
+		xlrec.rightsib = rightsib;
 		xlrec.btpo_xact = opaque->btpo.xact;
 
+		/* information needed to recreate the leaf block (if not the target) */
+		xlrec.leafblk = leafblkno;
+		xlrec.leafleftsib = leafleftsib;
+		xlrec.leafrightsib = leafrightsib;
+		xlrec.downlink = nextchild;
+
 		rdata[0].data = (char *) &xlrec;
-		rdata[0].len = SizeOfBtreeDeletePage;
+		rdata[0].len = SizeOfBtreeUnlinkPage;
 		rdata[0].buffer = InvalidBuffer;
 		rdata[0].next = nextrdata = &(rdata[1]);
 
@@ -1493,19 +1711,10 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 			nextrdata->buffer = InvalidBuffer;
 			nextrdata->next = nextrdata + 1;
 			nextrdata++;
-			xlinfo = XLOG_BTREE_DELETE_PAGE_META;
+			xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
 		}
-		else if (parent_half_dead)
-			xlinfo = XLOG_BTREE_DELETE_PAGE_HALF;
 		else
-			xlinfo = XLOG_BTREE_DELETE_PAGE;
-
-		nextrdata->data = NULL;
-		nextrdata->len = 0;
-		nextrdata->next = nextrdata + 1;
-		nextrdata->buffer = pbuf;
-		nextrdata->buffer_std = true;
-		nextrdata++;
+			xlinfo = XLOG_BTREE_UNLINK_PAGE;
 
 		nextrdata->data = NULL;
 		nextrdata->len = 0;
@@ -1530,8 +1739,6 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 		{
 			PageSetLSN(metapg, recptr);
 		}
-		page = BufferGetPage(pbuf);
-		PageSetLSN(page, recptr);
 		page = BufferGetPage(rbuf);
 		PageSetLSN(page, recptr);
 		page = BufferGetPage(buf);
@@ -1541,6 +1748,11 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 			page = BufferGetPage(lbuf);
 			PageSetLSN(page, recptr);
 		}
+		if (target != leafblkno)
+		{
+			page = BufferGetPage(leafbuf);
+			PageSetLSN(page, recptr);
+		}
 	}
 
 	END_CRIT_SECTION();
@@ -1551,46 +1763,17 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 		CacheInvalidateRelcache(rel);
 		_bt_relbuf(rel, metabuf);
 	}
-	/* can always release leftsib immediately */
+	/* release siblings */
 	if (BufferIsValid(lbuf))
 		_bt_relbuf(rel, lbuf);
+	_bt_relbuf(rel, rbuf);
 
 	/*
-	 * If parent became half dead, recurse to delete it. Otherwise, if right
-	 * sibling is empty and is now the last child of the parent, recurse to
-	 * try to delete it.  (These cases cannot apply at the same time, though
-	 * the second case might itself recurse to the first.)
-	 *
-	 * When recursing to parent, we hold the lock on the target page until
-	 * done.  This delays any insertions into the keyspace that was just
-	 * effectively reassigned to the parent's right sibling.  If we allowed
-	 * that, and there were enough such insertions before we finish deleting
-	 * the parent, page splits within that keyspace could lead to inserting
-	 * out-of-order keys into the grandparent level.  It is thought that that
-	 * wouldn't have any serious consequences, but it still seems like a
-	 * pretty bad idea.
+	 * Release the target, if it was not the leaf block. The leaf is always
+	 * kept locked.
 	 */
-	if (parent_half_dead)
-	{
-		/* recursive call will release pbuf */
-		_bt_relbuf(rel, rbuf);
-		result = _bt_pagedel(rel, pbuf, stack->bts_parent) + 1;
+	if (target != leafblkno)
 		_bt_relbuf(rel, buf);
-	}
-	else if (parent_one_child && rightsib_empty)
-	{
-		_bt_relbuf(rel, pbuf);
-		_bt_relbuf(rel, buf);
-		/* recursive call will release rbuf */
-		result = _bt_pagedel(rel, rbuf, stack) + 1;
-	}
-	else
-	{
-		_bt_relbuf(rel, pbuf);
-		_bt_relbuf(rel, buf);
-		_bt_relbuf(rel, rbuf);
-		result = 1;
-	}
 
-	return result;
+	return true;
 }
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b7f9e2e..aab4644 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1090,7 +1090,7 @@ restart:
 		MemoryContextReset(vstate->pagedelcontext);
 		oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
 
-		ndel = _bt_pagedel(rel, buf, NULL);
+		ndel = _bt_pagedel(rel, buf);
 
 		/* count only this page, else may double-count parent */
 		if (ndel)
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 98d9763..e7317c4 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -25,41 +25,32 @@
  * them manually if they are not seen in the WAL log during replay.  This
  * makes it safe for page insertion to be a multiple-WAL-action process.
  *
- * Similarly, deletion of an only child page and deletion of its parent page
- * form multiple WAL log entries, and we have to be prepared to follow through
- * with the deletion if the log ends between.
- *
  * The data structure is a simple linked list --- this should be good enough,
- * since we don't expect a page split or multi deletion to remain incomplete
- * for long.  In any case we need to respect the order of operations.
+ * since we don't expect a page split to remain incomplete for long.  In any
+ * case we need to respect the order of operations.
  */
-typedef struct bt_incomplete_action
+typedef struct bt_incomplete_split
 {
 	RelFileNode node;			/* the index */
-	bool		is_split;		/* T = pending split, F = pending delete */
-	/* these fields are for a split: */
 	bool		is_root;		/* we split the root */
 	BlockNumber leftblk;		/* left half of split */
 	BlockNumber rightblk;		/* right half of split */
-	/* these fields are for a delete: */
-	BlockNumber delblk;			/* parent block to be deleted */
-} bt_incomplete_action;
+} bt_incomplete_split;
 
-static List *incomplete_actions;
+static List *incomplete_splits;
 
 
 static void
 log_incomplete_split(RelFileNode node, BlockNumber leftblk,
 					 BlockNumber rightblk, bool is_root)
 {
-	bt_incomplete_action *action = palloc(sizeof(bt_incomplete_action));
+	bt_incomplete_split *action = palloc(sizeof(bt_incomplete_split));
 
 	action->node = node;
-	action->is_split = true;
 	action->is_root = is_root;
 	action->leftblk = leftblk;
 	action->rightblk = rightblk;
-	incomplete_actions = lappend(incomplete_actions, action);
+	incomplete_splits = lappend(incomplete_splits, action);
 }
 
 static void
@@ -67,49 +58,17 @@ forget_matching_split(RelFileNode node, BlockNumber downlink, bool is_root)
 {
 	ListCell   *l;
 
-	foreach(l, incomplete_actions)
+	foreach(l, incomplete_splits)
 	{
-		bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
+		bt_incomplete_split *action = (bt_incomplete_split *) lfirst(l);
 
 		if (RelFileNodeEquals(node, action->node) &&
-			action->is_split &&
 			downlink == action->rightblk)
 		{
 			if (is_root != action->is_root)
 				elog(LOG, "forget_matching_split: fishy is_root data (expected %d, got %d)",
 					 action->is_root, is_root);
-			incomplete_actions = list_delete_ptr(incomplete_actions, action);
-			pfree(action);
-			break;				/* need not look further */
-		}
-	}
-}
-
-static void
-log_incomplete_deletion(RelFileNode node, BlockNumber delblk)
-{
-	bt_incomplete_action *action = palloc(sizeof(bt_incomplete_action));
-
-	action->node = node;
-	action->is_split = false;
-	action->delblk = delblk;
-	incomplete_actions = lappend(incomplete_actions, action);
-}
-
-static void
-forget_matching_deletion(RelFileNode node, BlockNumber delblk)
-{
-	ListCell   *l;
-
-	foreach(l, incomplete_actions)
-	{
-		bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
-
-		if (RelFileNodeEquals(node, action->node) &&
-			!action->is_split &&
-			delblk == action->delblk)
-		{
-			incomplete_actions = list_delete_ptr(incomplete_actions, action);
+			incomplete_splits = list_delete_ptr(incomplete_splits, action);
 			pfree(action);
 			break;				/* need not look further */
 		}
@@ -798,21 +757,16 @@ btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
 }
 
 static void
-btree_xlog_delete_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
+btree_xlog_mark_page_halfdead(uint8 info, XLogRecPtr lsn, XLogRecord *record)
 {
-	xl_btree_delete_page *xlrec = (xl_btree_delete_page *) XLogRecGetData(record);
+	xl_btree_mark_page_halfdead *xlrec = (xl_btree_mark_page_halfdead *) XLogRecGetData(record);
 	BlockNumber parent;
-	BlockNumber target;
-	BlockNumber leftsib;
-	BlockNumber rightsib;
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque pageop;
+	IndexTupleData trunctuple;
 
 	parent = ItemPointerGetBlockNumber(&(xlrec->target.tid));
-	target = xlrec->deadblk;
-	leftsib = xlrec->leftblk;
-	rightsib = xlrec->rightblk;
 
 	/*
 	 * In normal operation, we would lock all the pages this WAL record
@@ -832,49 +786,90 @@ btree_xlog_delete_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
 		{
 			page = (Page) BufferGetPage(buffer);
 			pageop = (BTPageOpaque) PageGetSpecialPointer(page);
-			if (lsn <= PageGetLSN(page))
-			{
-				UnlockReleaseBuffer(buffer);
-			}
-			else
+			if (lsn > PageGetLSN(page))
 			{
 				OffsetNumber poffset;
+				ItemId		itemid;
+				IndexTuple	itup;
+				OffsetNumber nextoffset;
+				BlockNumber	rightsib;
 
 				poffset = ItemPointerGetOffsetNumber(&(xlrec->target.tid));
-				if (poffset >= PageGetMaxOffsetNumber(page))
-				{
-					Assert(info == XLOG_BTREE_DELETE_PAGE_HALF);
-					Assert(poffset == P_FIRSTDATAKEY(pageop));
-					PageIndexTupleDelete(page, poffset);
-					pageop->btpo_flags |= BTP_HALF_DEAD;
-				}
-				else
-				{
-					ItemId		itemid;
-					IndexTuple	itup;
-					OffsetNumber nextoffset;
-
-					Assert(info != XLOG_BTREE_DELETE_PAGE_HALF);
-					itemid = PageGetItemId(page, poffset);
-					itup = (IndexTuple) PageGetItem(page, itemid);
-					ItemPointerSet(&(itup->t_tid), rightsib, P_HIKEY);
-					nextoffset = OffsetNumberNext(poffset);
-					PageIndexTupleDelete(page, nextoffset);
-				}
+
+				nextoffset = OffsetNumberNext(poffset);
+				itemid = PageGetItemId(page, nextoffset);
+				itup = (IndexTuple) PageGetItem(page, itemid);
+				rightsib = ItemPointerGetBlockNumber(&itup->t_tid);
+
+				itemid = PageGetItemId(page, poffset);
+				itup = (IndexTuple) PageGetItem(page, itemid);
+				ItemPointerSet(&(itup->t_tid), rightsib, P_HIKEY);
+				nextoffset = OffsetNumberNext(poffset);
+				PageIndexTupleDelete(page, nextoffset);
 
 				PageSetLSN(page, lsn);
 				MarkBufferDirty(buffer);
-				UnlockReleaseBuffer(buffer);
 			}
+			UnlockReleaseBuffer(buffer);
 		}
 	}
 
+	/* Rewrite the leaf page as a halfdead page */
+	buffer = XLogReadBuffer(xlrec->target.node, xlrec->leafblk, true);
+	Assert(BufferIsValid(buffer));
+	page = (Page) BufferGetPage(buffer);
+
+	_bt_pageinit(page, BufferGetPageSize(buffer));
+	pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	pageop->btpo_prev = xlrec->leftblk;
+	pageop->btpo_next = xlrec->rightblk;
+	pageop->btpo.level = 0;
+	pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF;
+	pageop->btpo_cycleid = 0;
+
+	/* a dummy hikey that points to the next parent to be deleted (if any) */
+	trunctuple.t_info = sizeof(IndexTupleData);
+	ItemPointerSet(&trunctuple.t_tid, xlrec->downlink, P_HIKEY);
+	if (PageAddItem(page, (Item) &trunctuple, sizeof(IndexTupleData), P_HIKEY,
+					false, false) == InvalidOffsetNumber)
+		elog(ERROR, "could not add dummy high key to half-dead page");
+
+	PageSetLSN(page, lsn);
+	MarkBufferDirty(buffer);
+	UnlockReleaseBuffer(buffer);
+}
+
+
+static void
+btree_xlog_unlink_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
+{
+	xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) XLogRecGetData(record);
+	BlockNumber target;
+	BlockNumber leftsib;
+	BlockNumber rightsib;
+	Buffer		buffer;
+	Page		page;
+	BTPageOpaque pageop;
+
+	target = xlrec->deadblk;
+	leftsib = xlrec->leftsib;
+	rightsib = xlrec->rightsib;
+
+	/*
+	 * In normal operation, we would lock all the pages this WAL record
+	 * touches before changing any of them.  In WAL replay, it should be okay
+	 * to lock just one page at a time, since no concurrent index updates can
+	 * be happening, and readers should not care whether they arrive at the
+	 * target page or not (since it's surely empty).
+	 */
+
 	/* Fix left-link of right sibling */
-	if (record->xl_info & XLR_BKP_BLOCK(1))
-		(void) RestoreBackupBlock(lsn, record, 1, false, false);
+	if (record->xl_info & XLR_BKP_BLOCK(0))
+		(void) RestoreBackupBlock(lsn, record, 0, false, false);
 	else
 	{
-		buffer = XLogReadBuffer(xlrec->target.node, rightsib, false);
+		buffer = XLogReadBuffer(xlrec->node, rightsib, false);
 		if (BufferIsValid(buffer))
 		{
 			page = (Page) BufferGetPage(buffer);
@@ -895,13 +890,13 @@ btree_xlog_delete_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
 	}
 
 	/* Fix right-link of left sibling, if any */
-	if (record->xl_info & XLR_BKP_BLOCK(2))
-		(void) RestoreBackupBlock(lsn, record, 2, false, false);
+	if (record->xl_info & XLR_BKP_BLOCK(1))
+		(void) RestoreBackupBlock(lsn, record, 1, false, false);
 	else
 	{
 		if (leftsib != P_NONE)
 		{
-			buffer = XLogReadBuffer(xlrec->target.node, leftsib, false);
+			buffer = XLogReadBuffer(xlrec->node, leftsib, false);
 			if (BufferIsValid(buffer))
 			{
 				page = (Page) BufferGetPage(buffer);
@@ -923,7 +918,7 @@ btree_xlog_delete_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
 	}
 
 	/* Rewrite target page as empty deleted page */
-	buffer = XLogReadBuffer(xlrec->target.node, target, true);
+	buffer = XLogReadBuffer(xlrec->node, target, true);
 	Assert(BufferIsValid(buffer));
 	page = (Page) BufferGetPage(buffer);
 
@@ -940,24 +935,54 @@ btree_xlog_delete_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
 	MarkBufferDirty(buffer);
 	UnlockReleaseBuffer(buffer);
 
+	/*
+	 * If we deleted a parent of the targeted leaf page, instead of the leaf
+	 * itself, update the leaf to point to the next remaining child in the
+	 * branch.
+	 */
+	if (target != xlrec->leafblk)
+	{
+		ItemId		itemid;
+		ItemPointer leafhikey;
+
+		/*
+		 * There is no real data on the page, so we just re-create it from
+		 * scratch using the information from the WAL record.
+		 */
+		buffer = XLogReadBuffer(xlrec->node, xlrec->leafblk, true);
+		Assert(BufferIsValid(buffer));
+		page = (Page) BufferGetPage(buffer);
+
+		_bt_pageinit(page, BufferGetPageSize(buffer));
+		pageop->btpo_flags = BTP_LEAF;
+		pageop->btpo_prev = xlrec->leafleftsib;
+		pageop->btpo_next = xlrec->leafrightsib;
+		pageop->btpo.level = 0;
+		pageop->btpo_cycleid = 0;
+
+		itemid = PageGetItemId(page, P_HIKEY);
+		leafhikey = &((IndexTuple) PageGetItem(page, itemid))->t_tid;
+
+		if (xlrec->downlink == xlrec->leafblk)
+			ItemPointerSetInvalid(leafhikey);
+		else
+			ItemPointerSet(leafhikey, xlrec->downlink, P_HIKEY);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+
 	/* Update metapage if needed */
-	if (info == XLOG_BTREE_DELETE_PAGE_META)
+	if (info == XLOG_BTREE_UNLINK_PAGE_META)
 	{
 		xl_btree_metadata md;
 
-		memcpy(&md, (char *) xlrec + SizeOfBtreeDeletePage,
+		memcpy(&md, (char *) xlrec + SizeOfBtreeUnlinkPage,
 			   sizeof(xl_btree_metadata));
-		_bt_restore_meta(xlrec->target.node, lsn,
+		_bt_restore_meta(xlrec->node, lsn,
 						 md.root, md.level,
 						 md.fastroot, md.fastlevel);
 	}
-
-	/* Forget any completed deletion */
-	forget_matching_deletion(xlrec->target.node, target);
-
-	/* If parent became half-dead, remember it for deletion */
-	if (info == XLOG_BTREE_DELETE_PAGE_HALF)
-		log_incomplete_deletion(xlrec->target.node, parent);
 }
 
 static void
@@ -1072,10 +1097,12 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record)
 		case XLOG_BTREE_DELETE:
 			btree_xlog_delete(lsn, record);
 			break;
-		case XLOG_BTREE_DELETE_PAGE:
-		case XLOG_BTREE_DELETE_PAGE_META:
-		case XLOG_BTREE_DELETE_PAGE_HALF:
-			btree_xlog_delete_page(info, lsn, record);
+		case XLOG_BTREE_MARK_PAGE_HALFDEAD:
+			btree_xlog_mark_page_halfdead(info, lsn, record);
+			break;
+		case XLOG_BTREE_UNLINK_PAGE:
+		case XLOG_BTREE_UNLINK_PAGE_META:
+			btree_xlog_unlink_page(info, lsn, record);
 			break;
 		case XLOG_BTREE_NEWROOT:
 			btree_xlog_newroot(lsn, record);
@@ -1091,7 +1118,7 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record)
 void
 btree_xlog_startup(void)
 {
-	incomplete_actions = NIL;
+	incomplete_splits = NIL;
 }
 
 void
@@ -1099,67 +1126,47 @@ btree_xlog_cleanup(void)
 {
 	ListCell   *l;
 
-	foreach(l, incomplete_actions)
+	foreach(l, incomplete_splits)
 	{
-		bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
-
-		if (action->is_split)
-		{
-			/* finish an incomplete split */
-			Buffer		lbuf,
-						rbuf;
-			Page		lpage,
-						rpage;
-			BTPageOpaque lpageop,
-						rpageop;
-			bool		is_only;
-			Relation	reln;
-
-			lbuf = XLogReadBuffer(action->node, action->leftblk, false);
-			/* failure is impossible because we wrote this page earlier */
-			if (!BufferIsValid(lbuf))
-				elog(PANIC, "btree_xlog_cleanup: left block unfound");
-			lpage = (Page) BufferGetPage(lbuf);
-			lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
-			rbuf = XLogReadBuffer(action->node, action->rightblk, false);
-			/* failure is impossible because we wrote this page earlier */
-			if (!BufferIsValid(rbuf))
-				elog(PANIC, "btree_xlog_cleanup: right block unfound");
-			rpage = (Page) BufferGetPage(rbuf);
-			rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
-
-			/* if the pages are all of their level, it's a only-page split */
-			is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop);
-
-			reln = CreateFakeRelcacheEntry(action->node);
-			_bt_insert_parent(reln, lbuf, rbuf, NULL,
-							  action->is_root, is_only);
-			FreeFakeRelcacheEntry(reln);
-		}
-		else
-		{
-			/* finish an incomplete deletion (of a half-dead page) */
-			Buffer		buf;
-
-			buf = XLogReadBuffer(action->node, action->delblk, false);
-			if (BufferIsValid(buf))
-			{
-				Relation	reln;
-
-				reln = CreateFakeRelcacheEntry(action->node);
-				if (_bt_pagedel(reln, buf, NULL) == 0)
-					elog(PANIC, "btree_xlog_cleanup: _bt_pagedel failed");
-				FreeFakeRelcacheEntry(reln);
-			}
-		}
+		/* finish an incomplete split */
+		bt_incomplete_split *action = (bt_incomplete_split *) lfirst(l);
+		Buffer		lbuf,
+					rbuf;
+		Page		lpage,
+					rpage;
+		BTPageOpaque lpageop,
+					rpageop;
+		bool		is_only;
+		Relation	reln;
+
+		lbuf = XLogReadBuffer(action->node, action->leftblk, false);
+		/* failure is impossible because we wrote this page earlier */
+		if (!BufferIsValid(lbuf))
+			elog(PANIC, "btree_xlog_cleanup: left block unfound");
+		lpage = (Page) BufferGetPage(lbuf);
+		lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
+		rbuf = XLogReadBuffer(action->node, action->rightblk, false);
+		/* failure is impossible because we wrote this page earlier */
+		if (!BufferIsValid(rbuf))
+			elog(PANIC, "btree_xlog_cleanup: right block unfound");
+		rpage = (Page) BufferGetPage(rbuf);
+		rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
+
+		/* if the pages are all of their level, it's a only-page split */
+		is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop);
+
+		reln = CreateFakeRelcacheEntry(action->node);
+		_bt_insert_parent(reln, lbuf, rbuf, NULL,
+						  action->is_root, is_only);
+		FreeFakeRelcacheEntry(reln);
 	}
-	incomplete_actions = NIL;
+	incomplete_splits = NIL;
 }
 
 bool
 btree_safe_restartpoint(void)
 {
-	if (incomplete_actions)
+	if (incomplete_splits)
 		return false;
 	return true;
 }
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 57ac021..c1b0018 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -124,16 +124,27 @@ btree_desc(StringInfo buf, uint8 xl_info, char *rec)
 								 xlrec->hnode.spcNode, xlrec->hnode.dbNode, xlrec->hnode.relNode);
 				break;
 			}
-		case XLOG_BTREE_DELETE_PAGE:
-		case XLOG_BTREE_DELETE_PAGE_META:
-		case XLOG_BTREE_DELETE_PAGE_HALF:
+		case XLOG_BTREE_MARK_PAGE_HALFDEAD:
 			{
-				xl_btree_delete_page *xlrec = (xl_btree_delete_page *) rec;
+				xl_btree_mark_page_halfdead *xlrec = (xl_btree_mark_page_halfdead *) rec;
 
-				appendStringInfoString(buf, "delete_page: ");
+				appendStringInfoString(buf, "mark_page_halfdead: ");
 				out_target(buf, &(xlrec->target));
-				appendStringInfo(buf, "; dead %u; left %u; right %u",
-							xlrec->deadblk, xlrec->leftblk, xlrec->rightblk);
+				appendStringInfo(buf, "; downlink %u; leaf %u; left %u; right %u",
+								 xlrec->leafblk, xlrec->leftblk, xlrec->rightblk, xlrec->downlink);
+				break;
+			}
+		case XLOG_BTREE_UNLINK_PAGE_META:
+		case XLOG_BTREE_UNLINK_PAGE:
+			{
+				xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) rec;
+
+				appendStringInfo(buf, "unlink_page: rel %u/%u/%u; ",
+								 xlrec->node.spcNode, xlrec->node.dbNode, xlrec->node.relNode);
+				appendStringInfo(buf, "dead %u; left %u; right %u; btpo_xact %u",
+								 xlrec->deadblk, xlrec->leftsib, xlrec->rightsib, xlrec->btpo_xact);
+				appendStringInfo(buf, "leaf %u; leafleft %u; leafright %u; downlink %u",
+								 xlrec->leafblk, xlrec->leafleftsib, xlrec->leafrightsib, xlrec->downlink);
 				break;
 			}
 		case XLOG_BTREE_NEWROOT:
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 6628886..a1fb948 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -215,11 +215,10 @@ typedef struct BTMetaPageData
 #define XLOG_BTREE_SPLIT_L_ROOT 0x50	/* add tuple with split of root */
 #define XLOG_BTREE_SPLIT_R_ROOT 0x60	/* as above, new item on right */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
-#define XLOG_BTREE_DELETE_PAGE	0x80	/* delete an entire page */
-#define XLOG_BTREE_DELETE_PAGE_META 0x90		/* same, and update metapage */
+#define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
+#define XLOG_BTREE_UNLINK_PAGE_META 0x90 /* same, and update metapage */
 #define XLOG_BTREE_NEWROOT		0xA0	/* new root page */
-#define XLOG_BTREE_DELETE_PAGE_HALF 0xB0		/* page deletion that makes
-												 * parent half-dead */
+#define XLOG_BTREE_MARK_PAGE_HALFDEAD 0xB0 /* mark a leaf as half-dead */
 #define XLOG_BTREE_VACUUM		0xC0	/* delete entries on a page during
 										 * vacuum */
 #define XLOG_BTREE_REUSE_PAGE	0xD0	/* old page is about to be reused from
@@ -370,23 +369,49 @@ typedef struct xl_btree_vacuum
 #define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
 
 /*
- * This is what we need to know about deletion of a btree page.  The target
- * identifies the tuple removed from the parent page (note that we remove
- * this tuple's downlink and the *following* tuple's key).	Note we do not
- * store any content for the deleted page --- it is just rewritten as empty
- * during recovery, apart from resetting the btpo.xact.
+ * This is what we need to know about marking an empty branch for deletion.
+ * The target identifies the tuple removed from the parent page (note that we
+ * remove this tuple's downlink and the *following* tuple's key).  Note that
+ * the leaf page is empty, so we don't need to store its content --- it is
+ * just reinitialized during recovery using the rest of the fields.
  */
-typedef struct xl_btree_delete_page
+typedef struct xl_btree_mark_page_halfdead
 {
 	xl_btreetid target;			/* deleted tuple id in parent page */
-	BlockNumber deadblk;		/* child block being deleted */
-	BlockNumber leftblk;		/* child block's left sibling, if any */
-	BlockNumber rightblk;		/* child block's right sibling */
+	BlockNumber leafblk;		/* leaf block ultimately being deleted */
+	BlockNumber leftblk;		/* leaf block's left sibling, if any */
+	BlockNumber rightblk;		/* leaf block's right sibling */
+	BlockNumber downlink;		/* next child down in the branch */
+} xl_btree_mark_page_halfdead;
+
+#define SizeOfBtreeMarkPageHalfDead	(offsetof(xl_btree_mark_page_halfdead, leafblk) + sizeof(BlockNumber))
+
+/*
+ * This is what we need to know about deletion of a btree page.  Note we do
+ * not store any content for the deleted page --- it is just rewritten as empty
+ * during recovery, apart from resetting the btpo.xact.
+ */
+typedef struct xl_btree_unlink_page
+{
+	RelFileNode	node;
+	BlockNumber deadblk;		/* target block being deleted */
+	BlockNumber leftsib;		/* taregt block's left sibling, if any */
+	BlockNumber rightsib;		/* target block's right sibling */
+
+	/*
+	 * Information needed to recreate the leaf page, when target is an internal
+	 * page.
+	 */
+	BlockNumber	leafblk;
+	BlockNumber	leafleftsib;
+	BlockNumber	leafrightsib;
+	BlockNumber	downlink;		/* next child down in the branch */
+
 	TransactionId btpo_xact;	/* value of btpo.xact for use in recovery */
-	/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_DELETE_PAGE_META */
-} xl_btree_delete_page;
+	/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_UNLINK_PAGE_META */
+} xl_btree_unlink_page;
 
-#define SizeOfBtreeDeletePage	(offsetof(xl_btree_delete_page, btpo_xact) + sizeof(TransactionId))
+#define SizeOfBtreeUnlinkPage	(offsetof(xl_btree_unlink_page, btpo_xact) + sizeof(TransactionId))
 
 /*
  * New root log record.  There are zero tuples if this is to establish an
@@ -639,7 +664,7 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
 					BlockNumber lastBlockVacuumed);
-extern int	_bt_pagedel(Relation rel, Buffer buf, BTStack stack);
+extern int	_bt_pagedel(Relation rel, Buffer buf);
 
 /*
  * prototypes for functions in nbtsearch.c

#18

Heikki Linnakangas

hlinnakangas@vmware.com

almost 12 years ago

In reply to: Jim Nasby (#12)

Re: Race condition in b-tree page deletion

On 12/17/2013 04:55 AM, Jim Nasby wrote:

On 11/9/13, 10:02 AM, Heikki Linnakangas wrote:

3. Another approach would be to get rid of the "can't delete
rightmost child" limitation. We currently have that limitation
because it ensures that we never need to change the high-key of a
page. If we delete a page that is the rightmost child of its
parent, we transfer the deleted keyspace from the parent page to
its right sibling. To do that, we need to update the high key of
the parent, as well as the downlink of the right sibling at the
grandparent level. That's a bit complicated, because updating the
high key might require splitting the page.

Is the rightmost child issue likely to affect indexes on increasing
values, like a queue table?

No. In a FIFO queue, you keep adding new items to the right-end of the
index, so old pages become non-rightmost fairly quickly.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19

Peter Geoghegan

pg@heroku.com

almost 12 years ago

In reply to: Kevin Grittner (#16)

Re: Race condition in b-tree page deletion

On Sat, Jan 18, 2014 at 11:45 AM, Kevin Grittner <kgrittn@ymail.com> wrote:

Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

Here's a patch implementing this scheme.

I thought I'd weigh in here too, since this is closely tied to the
page split patch, which is dependent on this patch.

The basic approach is sound. The papers on which we based our
btree implementation did not discuss entry deletion, and our
current approach introduces new possible states into the on-disk
index structure which are not covered by the papers and are not
always handled correctly by the current code.

Lehman and Yao don't discuss deletion. But V. Lanin and D. Shasha. do,
in the paper "A Symmetric Concurrent B-Tree Algorithm"[1]L&S: https://archive.org/stream/symmetricconcurr00lani/symmetricconcurr00lani_djvu.txt, as the
README notes - we currently follow a "simplified" version of that.
Apparently a number of people proposed solutions to address the
omission of L&Y, as one survey of those approaches notes [2]http://www.dtic.mil/get-tr-doc/pdf?AD=ADA232287&Location=U2&doc=GetTRDoc.pdf (Chapter 3, "The Coherent Shared Memory Algorithm"). One of
the approaches appearing in that survey is L&S.

The classic L&Y algorithm only has right-links, and not left-links.
The same applies to L&S. But I'd describe this patch as bringing what
we do closer to L&S, which is more or less a refinement of B-Link
trees (that is, L&Y B-Trees). L&S does have a concept of an outlink,
crucial to their "symmetric deletion approach", which is something
that we don't need, because we already have left-links (we have them
primarily to support backwards index scans). L&S says "In general, the
link technique calls for processes that change sections of a data
structure in a way that would cause other processes to fail to provide
a link to another section of the data structure where recovery is
possible". So loosely speaking we're doing a "conceptual
_bt_moveleft()" for deletion as the inverse of a literal _bt_moveright
for insertion.

L&S says "a deletion needs no more than two write locks at a time
during its ascent" (the descent is just a garden variety _bt_search()
insertion scankey descent). However, we currently may obtain as many
as 4 write locks for "page deletion". As the README notes:

"""
To delete an empty page, we acquire write lock on its left sibling (if
any), the target page itself, the right sibling (there must be one), and
the parent page, in that order.
"""

We obtain 2 write locks in the first phase (on target and parent). In
the second phase, we do what is described above: lock 4 pages in that
order. However, the reason for this difference from the L&S paper is
made clear in "2.5 Freeing Empty nodes", which kind of cops out of
addressing how to *unlink* empty/half-dead pages, with a couple of
brief descriptions of approaches that others have come up with that
are not at all applicable to us.

For that reason, might I suggest that you change this in the README:

+ Page Deletion
+ -------------

I suggest that it should read "Page Deletion and Unlinking" to
emphasize the distinction.

The general feeling is that this patch is not suitable for
back-patching. I agree with this, but this is a bug which could
lead to index corruption which exists in all supported branches.
If nobody can suggest an alternative which we can back-patch, we
might want to reconsider this after this has been in production
releases for several months.

That was what I thought. Let's revisit the question of back-patching
this when 9.4 has been released for 6 months or so.

Other than that, I have not found any problems.

Incidentally, during my research of these techniques, I discovered
that Johnson and Shasha argue that "for most concurrent B-tree
applications, merge-at-empty produces significantly lower
restructuring rates, and only a slightly lower space utilization, than
merge-at-half". This is covered in the paper "B-trees with inserts,
deletes, and modifies: Why Free-at-empty is Better than Merge-at-half"
[3]: http://cs.nyu.edu/shasha/papers/bjournal.ps -- Peter Geoghegan
theoretical basis for that behavior, even if the worst-case bloat can
be unpleasant. Though having said that, that paper doesn't weigh what
we do in the event of non-HOT updates, and that could change the
equation. That isn't an argument against merge-at-empty; it's an
argument against the current handling of non-HOT updates.

Currently, the comments above _bt_doinsert() say at one point:

* This routine is called by the public interface routines, btbuild
* and btinsert.

This is obsolete; since commit 89bda95d, only the insert public
interface routine calls _bt_doinsert(). Perhaps fix this in passing.

[1]: L&S: https://archive.org/stream/symmetricconcurr00lani/symmetricconcurr00lani_djvu.txt

[2]: http://www.dtic.mil/get-tr-doc/pdf?AD=ADA232287&Location=U2&doc=GetTRDoc.pdf (Chapter 3, "The Coherent Shared Memory Algorithm")
(Chapter 3, "The Coherent Shared Memory Algorithm")

[3]: http://cs.nyu.edu/shasha/papers/bjournal.ps -- Peter Geoghegan
--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20

Heikki Linnakangas

hlinnakangas@vmware.com

almost 12 years ago

In reply to: Heikki Linnakangas (#17)

1 attachment(s)

Re: Race condition in b-tree page deletion

On 01/27/2014 04:34 PM, Heikki Linnakangas wrote:

While fixing the above issues, and reviewing this again in general, I
noticed that btree_xlog_unlink_page() was a few bricks shy of a load.
When unlinking an internal page, it didn't update the pointer to the new
topmost page in the to-be-deleted chain. Fixed that.

Attached is a new version. I also re-worded the README changes; I hope
it reads better now.

I dusted off the scripts I posted yesterday, to visualize b-tree
structure using graphviz, and tried it on this patch. I found a number
of bugs. Attached is a new version with those bugs fixed.

- Heikki

Attachments:

fix-btree-page-deletion-4.patchtext/x-diff; name=fix-btree-page-deletion-4.patchDownload

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 40f09e3..03efc29 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -168,26 +168,33 @@ parent item does still exist and can't have been deleted.  Also, because
 we are matching downlink page numbers and not data keys, we don't have any
 problem with possibly misidentifying the parent item.
 
+Page Deletion
+-------------
+
 We consider deleting an entire page from the btree only when it's become
 completely empty of items.  (Merging partly-full pages would allow better
 space reuse, but it seems impractical to move existing data items left or
 right to make this happen --- a scan moving in the opposite direction
 might miss the items if so.)  Also, we *never* delete the rightmost page
 on a tree level (this restriction simplifies the traversal algorithms, as
-explained below).
-
-To delete an empty page, we acquire write lock on its left sibling (if
-any), the target page itself, the right sibling (there must be one), and
-the parent page, in that order.  The parent page must be found using the
-same type of search as used to find the parent during an insertion split.
-Then we update the side-links in the siblings, mark the target page
-deleted, and remove the downlink from the parent, as well as the parent's
-upper bounding key for the target (the one separating it from its right
-sibling).  This causes the target page's key space to effectively belong
-to its right sibling.  (Neither the left nor right sibling pages need to
-change their "high key" if any; so there is no problem with possibly not
-having enough space to replace a high key.)  The side-links in the target
-page are not changed.
+explained below).  Page deletion always begins from an empty leaf page. An
+internal page can only be deleted as part of a branch leading to a leaf
+page, where each internal page has only one child and that child is also to
+be deleted.
+
+Deleting a leaf page is a two-stage process. In the first stage, the page is
+unlinked from its parent, by removing its downlink, and marked as half-dead.
+The parent page must be found using the same type of search as used to find
+the parent during an insertion split.  We lock the target and the parent
+pages, change the target's downlink to point to the right sibling, and remove
+its old downlink.  This causes the target page's key space to effectively
+belong to its right sibling.  (Neither the left nor right sibling pages need
+to change their "high key" if any; so there is no problem with possibly not
+having enough space to replace a high key.)  At the same time, we mark the
+target page as half-dead, which causes any subsequent searches to ignore it
+and move right (or left, in a backwards scan).  This leaves the tree in a
+similar state as during a page split: the the page has no downlink pointing
+to it, but it's still linked to its siblings.
 
 (Note: Lanin and Shasha prefer to make the key space move left, but their
 argument for doing so hinges on not having left-links, which we have
@@ -199,31 +206,44 @@ the same parent --- otherwise, the parent's key space assignment changes
 too, meaning we'd have to make bounding-key updates in its parent, and
 perhaps all the way up the tree.  Since we can't possibly do that
 atomically, we forbid this case.  That means that the rightmost child of a
-parent node can't be deleted unless it's the only remaining child.
-
-When we delete the last remaining child of a parent page, we mark the
-parent page "half-dead" as part of the atomic update that deletes the
-child page.  This implicitly transfers the parent's key space to its right
-sibling (which it must have, since we never delete the overall-rightmost
-page of a level).  Searches ignore the half-dead page and immediately move
-right.  We need not worry about insertions into a half-dead page --- insertions
-into upper tree levels happen only as a result of splits of child pages, and
-the half-dead page no longer has any children that could split.  Therefore
-the page stays empty even when we don't have lock on it, and we can complete
-its deletion in a second atomic action.
-
-The notion of a half-dead page means that the key space relationship between
-the half-dead page's level and its parent's level may be a little out of
-whack: key space that appears to belong to the half-dead page's parent on the
-parent level may really belong to its right sibling.  To prevent any possible
-problems, we hold lock on the deleted child page until we have finished
-deleting any now-half-dead parent page(s).  This prevents any insertions into
-the transferred keyspace until the operation is complete.  The reason for
-doing this is that a sufficiently large number of insertions into the
-transferred keyspace, resulting in multiple page splits, could propagate keys
-from that keyspace into the parent level, resulting in transiently
-out-of-order keys in that level.  It is thought that that wouldn't cause any
-serious problem, but it seems too risky to allow.
+parent node can't be deleted unless it's the only remaining child, in which
+case we will delete the parent too (see below).
+
+In the second-stage, the half-dead leaf page is unlinked from its siblings.
+We first lock the left sibling (if any) of the target, the target page itself,
+and its right sibling (there must be one) in that order. Then we update the
+side-links in the siblings, and mark the target page deleted.
+
+When we're about to delete the last remaining child of a parent page, things
+are slightly more complicated.  In the first stage, we leave the immediate
+parent of the leaf page alone, and remove the downlink to the parent page
+instead, from the grandparent.  If it's the last child of the grandparent too,
+we recurse up until we find a parent with more than one child, and remove the
+downlink of that page.  The leaf page is marked as half-dead, and the block
+number of the page whose downlink was removed is stashed in the half-dead leaf
+page.  This leaves us with a chain of internal pages, with one downlink each,
+leading to the half-dead leaf page, and no downlink pointing to the topmost
+page in the chain.
+
+While we recurse up to find the topmost parent in the chain, we keep the
+leaf page locked, but don't need to hold locks on the intermediate pages
+between the leaf and the topmost parent -- insertions into upper tree levels
+happen only as a result of splits of child pages, and that can't happen as
+long as we're keeping the leaf locked. The internal pages in the chain cannot
+acquire new children afterwards either, because the leaf page is marked as
+half-dead and won't be split.
+
+Removing the downlink to the top of the to-be-deleted chain effectively
+transfers the key space to the right sibling for all the intermediate levels
+too, in one atomic operation.  A concurrent search might still visit the
+intermediate pages, but it will move right when it reaches the half-dead page
+at the leaf level.
+
+In the second stage, the topmost page in the chain is unlinked from its
+siblings, and the half-dead leaf page is updated to point to the next page
+down in the chain.  This is repeated until there are no internal pages left
+in the chain.  Finally, the half-dead leaf page itself is unlinked from its
+siblings.
 
 A deleted page cannot be reclaimed immediately, since there may be other
 processes waiting to reference it (ie, search processes that just left the
@@ -396,12 +416,11 @@ metapage update (of the "fast root" link) is performed and logged as part
 of the insertion into the parent level.  When splitting the root page, the
 metapage update is handled as part of the "new root" action.
 
-A page deletion is logged as a single WAL entry covering all four
-required page updates (target page, left and right siblings, and parent)
-as an atomic event.  (Any required fast-root link update is also part
-of the WAL entry.)  If the parent page becomes half-dead but is not
-immediately deleted due to a subsequent crash, there is no loss of
-consistency, and the empty page will be picked up by the next VACUUM.
+Each step in page deletion are logged as separate WAL entries: marking the
+leaf as half-dead and removing the downlink is one record, and unlinking a
+page is a second record.  If vacuum is interrupted for some reason, or the
+system crashes, the tree is consistent for searches and insertions.  The next
+VACUUM will find the half-dead leaf page and continue the deletion.
 
 Scans during Recovery
 ---------------------
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index c7c65a6..c38bc1c 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -31,6 +31,14 @@
 #include "utils/inval.h"
 #include "utils/snapmgr.h"
 
+static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
+static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
+						 bool *rightsib_empty);
+static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
+					   BTStack stack, Buffer *topparent, OffsetNumber *topoff,
+					   BlockNumber *target, BlockNumber *rightsib);
+static void  _bt_log_reuse_page(Relation rel, BlockNumber blkno,
+				   TransactionId latestRemovedXid);
 
 /*
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
@@ -954,24 +962,36 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 }
 
 /*
- * Subroutine to pre-check whether a page deletion is safe, that is, its
- * parent page would be left in a valid or deletable state.
+ * Subroutine to find the parent of the branch we're deleting.  This climbs
+ * up the tree until it finds a page with more than one child, ie. a page
+ * that will not be totally emptied by the deletion.  The chain of pages below
+ * it, with one downlink each, will be part of the branch that we need to
+ * delete.
  *
- * "target" is the page we wish to delete, and "stack" is a search stack
+ * If we cannot remove the downlink from the parent, because it's the
+ * rightmost entry, returns false.  On success, *topparent and *topoff are set
+ * to the buffer holding the parent, and the offset of the downlink in it.
+ * *topparent is write-locked, the caller is responsible for releasing it when
+ * done.  *target is set to the topmost page in the branch to-be-deleted, ie.
+ * the page whose downlink *topparent / *topoff point to, and *rightsib to its
+ * right sibling.
+ *
+ * "child" is the leaf page we wish to delete, and "stack" is a search stack
  * leading to it (approximately).  Note that we will update the stack
  * entry(s) to reflect current downlink positions --- this is harmless and
- * indeed saves later search effort in _bt_pagedel.
+ * indeed saves later search effort in _bt_pagedel.  The caller should
+ * initialize *target and *rightsib to the leaf page and its right sibling.
  *
- * Note: it's OK to release page locks after checking, because a safe
- * deletion can't become unsafe due to concurrent activity.  A non-rightmost
- * page cannot become rightmost unless there's a concurrent page deletion,
- * but only VACUUM does page deletion and we only allow one VACUUM on an index
- * at a time.  An only child could acquire a sibling (of the same parent) only
- * by being split ... but that would make it a non-rightmost child so the
- * deletion is still safe.
+ * Note: it's OK to release page locks on any internal pages between the leaf
+ * and *topparent, because a safe deletion can't become unsafe due to
+ * concurrent activity.  An internal page can only acquire an entry if the
+ * child is split, but that cannot happen as long as we hold a lock on the
+ * leaf.
  */
 static bool
-_bt_parent_deletion_safe(Relation rel, BlockNumber target, BTStack stack)
+_bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
+					   Buffer *topparent, OffsetNumber *topoff,
+					   BlockNumber *target, BlockNumber *rightsib)
 {
 	BlockNumber parent;
 	OffsetNumber poffset,
@@ -980,20 +1000,12 @@ _bt_parent_deletion_safe(Relation rel, BlockNumber target, BTStack stack)
 	Page		page;
 	BTPageOpaque opaque;
 
-	/*
-	 * In recovery mode, assume the deletion being replayed is valid.  We
-	 * can't always check it because we won't have a full search stack, and we
-	 * should complain if there's a problem, anyway.
-	 */
-	if (InRecovery)
-		return true;
-
 	/* Locate the parent's downlink (updating the stack entry if needed) */
-	ItemPointerSet(&(stack->bts_btentry.t_tid), target, P_HIKEY);
+	ItemPointerSet(&(stack->bts_btentry.t_tid), child, P_HIKEY);
 	pbuf = _bt_getstackbuf(rel, stack, BT_READ);
 	if (pbuf == InvalidBuffer)
 		elog(ERROR, "failed to re-find parent key in index \"%s\" for deletion target page %u",
-			 RelationGetRelationName(rel), target);
+			 RelationGetRelationName(rel), child);
 	parent = stack->bts_blkno;
 	poffset = stack->bts_offset;
 
@@ -1021,8 +1033,12 @@ _bt_parent_deletion_safe(Relation rel, BlockNumber target, BTStack stack)
 				return false;
 			}
 
+			*target = parent;
+			*rightsib = opaque->btpo_next;
+
 			_bt_relbuf(rel, pbuf);
-			return _bt_parent_deletion_safe(rel, parent, stack->bts_parent);
+			return _bt_lock_branch_parent(rel, parent, stack->bts_parent,
+										  topparent, topoff, target, rightsib);
 		}
 		else
 		{
@@ -1034,7 +1050,8 @@ _bt_parent_deletion_safe(Relation rel, BlockNumber target, BTStack stack)
 	else
 	{
 		/* Not rightmost child, so safe to delete */
-		_bt_relbuf(rel, pbuf);
+		*topparent = pbuf;
+		*topoff = poffset;
 		return true;
 	}
 }
@@ -1051,158 +1068,420 @@ _bt_parent_deletion_safe(Relation rel, BlockNumber target, BTStack stack)
  * On entry, the target buffer must be pinned and locked (either read or write
  * lock is OK).  This lock and pin will be dropped before exiting.
  *
- * The "stack" argument can be a search stack leading (approximately) to the
- * target page, or NULL --- outside callers typically pass NULL since they
- * have not done such a search, but internal recursion cases pass the stack
- * to avoid duplicated search effort.
- *
  * Returns the number of pages successfully deleted (zero if page cannot
- * be deleted now; could be more than one if parent pages were deleted too).
+ * be deleted now; could be more than one if parent or sibling pages were
+ * deleted too).
  *
  * NOTE: this leaks memory.  Rather than trying to clean up everything
  * carefully, it's better to run it in a temp context that can be reset
  * frequently.
  */
 int
-_bt_pagedel(Relation rel, Buffer buf, BTStack stack)
+_bt_pagedel(Relation rel, Buffer buf)
 {
-	int			result;
-	BlockNumber target,
-				leftsib,
-				rightsib,
-				parent;
-	OffsetNumber poffset,
-				maxoff;
-	uint32		targetlevel,
-				ilevel;
-	ItemId		itemid;
-	IndexTuple	targetkey,
-				itup;
-	ScanKey		itup_scankey;
-	Buffer		lbuf,
-				rbuf,
-				pbuf;
-	bool		parent_half_dead;
-	bool		parent_one_child;
+	int			ndeleted = 0;
+	BlockNumber rightsib;
 	bool		rightsib_empty;
-	Buffer		metabuf = InvalidBuffer;
-	Page		metapg = NULL;
-	BTMetaPageData *metad = NULL;
 	Page		page;
 	BTPageOpaque opaque;
-
 	/*
-	 * We can never delete rightmost pages nor root pages.	While at it, check
-	 * that page is not already deleted and is empty.
+	 * "stack" is a search stack leading (approximately) to the target page.
+	 * It is initially NULL, but when iterating, we keep it to avoid
+	 * duplicated search effort.
 	 */
-	page = BufferGetPage(buf);
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-	if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
-		P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page))
+	BTStack		stack = NULL;
+
+	for (;;)
 	{
-		/* Should never fail to delete a half-dead page */
-		Assert(!P_ISHALFDEAD(opaque));
+		page = BufferGetPage(buf);
+		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
-		_bt_relbuf(rel, buf);
-		return 0;
-	}
+		/*
+		 * Internal pages are never deleted directly, only as part of deleting
+		 * the whole branch all the way down to leaf level.
+		 */
+		if (!P_ISLEAF(opaque))
+		{
+			/*
+			 * Pre-9.4 page deletion only marked internal pages as half-dead,
+			 * but now we only use that flag on leaf pages. The old algorithm
+			 * was never supposed to leave half-dead pages in the tree, it was
+			 * just a transient state, but it was nevertheless possible in
+			 * error scenarios. We don't know how to deal with them here. They
+			 * are harmless as far as searches are considered, but inserts into
+			 * the deleted keyspace could add out-of-order downlinks in the
+			 * upper levels. Log a notice, hopefully the admin will notice and
+			 * reindex.
+			 */
+			if (P_ISHALFDEAD(opaque))
+				ereport(LOG,
+						(errcode(ERRCODE_INDEX_CORRUPTED),
+						 errmsg("index \"%s\" contains a half-dead internal page",
+								RelationGetRelationName(rel)),
+						 errhint("This can be caused by an interrupt VACUUM in version 9.3 or older, before upgrade. Please REINDEX it.")));
+			_bt_relbuf(rel, buf);
+			return ndeleted;
+		}
 
-	/*
-	 * Save info about page, including a copy of its high key (it must have
-	 * one, being non-rightmost).
-	 */
-	target = BufferGetBlockNumber(buf);
-	targetlevel = opaque->btpo.level;
-	leftsib = opaque->btpo_prev;
-	itemid = PageGetItemId(page, P_HIKEY);
-	targetkey = CopyIndexTuple((IndexTuple) PageGetItem(page, itemid));
+		/*
+		 * We can never delete rightmost pages nor root pages.	While at it,
+		 * check that page is not already deleted and is empty.
+		 */
+		if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
+			P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page))
+		{
+			/* Should never fail to delete a half-dead page */
+			Assert(!P_ISHALFDEAD(opaque));
 
-	/*
-	 * To avoid deadlocks, we'd better drop the target page lock before going
-	 * further.
-	 */
-	_bt_relbuf(rel, buf);
+			_bt_relbuf(rel, buf);
+			return ndeleted;
+		}
 
-	/*
-	 * We need an approximate pointer to the page's parent page.  We use the
-	 * standard search mechanism to search for the page's high key; this will
-	 * give us a link to either the current parent or someplace to its left
-	 * (if there are multiple equal high keys).  In recursion cases, the
-	 * caller already generated a search stack and we can just re-use that
-	 * work.
-	 */
-	if (stack == NULL)
-	{
-		if (!InRecovery)
+		/*
+		 * First, remove downlink pointing to the page (or a parent of the page,
+		 * if we are going to delete a taller branch), and mark the page as
+		 * half-dead.
+		 */
+		if (!P_ISHALFDEAD(opaque))
 		{
-			/* we need an insertion scan key to do our search, so build one */
-			itup_scankey = _bt_mkscankey(rel, targetkey);
-			/* find the leftmost leaf page containing this key */
-			stack = _bt_search(rel, rel->rd_rel->relnatts, itup_scankey, false,
-							   &lbuf, BT_READ);
-			/* don't need a pin on that either */
-			_bt_relbuf(rel, lbuf);
-
 			/*
-			 * If we are trying to delete an interior page, _bt_search did
-			 * more than we needed.  Locate the stack item pointing to our
-			 * parent level.
+			 * We need an approximate pointer to the page's parent page.  We
+			 * use the standard search mechanism to search for the page's high
+			 * key; this will give us a link to either the current parent or
+			 * someplace to its left (if there are multiple equal high keys).
 			 */
-			ilevel = 0;
-			for (;;)
+			if (!stack)
+			{
+				ScanKey		itup_scankey;
+				ItemId		itemid;
+				IndexTuple	targetkey;
+				Buffer		lbuf;
+
+				itemid = PageGetItemId(page, P_HIKEY);
+				targetkey = CopyIndexTuple((IndexTuple) PageGetItem(page, itemid));
+
+				/*
+				 * To avoid deadlocks, we'd better drop the leaf page lock
+				 * before going further.
+				 */
+				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+				/* we need an insertion scan key for the search, so build one */
+				itup_scankey = _bt_mkscankey(rel, targetkey);
+				/* find the leftmost leaf page containing this key */
+				stack = _bt_search(rel, rel->rd_rel->relnatts, itup_scankey,
+								   false, &lbuf, BT_READ);
+				/* don't need a pin on the page */
+				_bt_relbuf(rel, lbuf);
+
+				/*
+				 * Re-lock the leaf page, and start over, to re-check that the
+				 * page can still be deleted.
+				 */
+				LockBuffer(buf, BT_WRITE);
+				continue;
+			}
+
+			if (!_bt_mark_page_halfdead(rel, buf, stack))
 			{
-				if (stack == NULL)
-					elog(ERROR, "not enough stack items");
-				if (ilevel == targetlevel)
-					break;
-				stack = stack->bts_parent;
-				ilevel++;
+				_bt_relbuf(rel, buf);
+				return ndeleted;
 			}
 		}
-		else
+
+		/*
+		 * Then unlink it from its siblings. Each call to
+		 *_bt_unlink_halfdead_page unlinks the topmost page from the branch,
+		 * making it shallower. Iterate until the leaf page is gone.
+		 */
+		rightsib_empty = false;
+		while (P_ISHALFDEAD(opaque))
 		{
-			/*
-			 * During WAL recovery, we can't use _bt_search (for one reason,
-			 * it might invoke user-defined comparison functions that expect
-			 * facilities not available in recovery mode).	Instead, just set
-			 * up a dummy stack pointing to the left end of the parent tree
-			 * level, from which _bt_getstackbuf will walk right to the parent
-			 * page.  Painful, but we don't care too much about performance in
-			 * this scenario.
-			 */
-			pbuf = _bt_get_endpoint(rel, targetlevel + 1, false);
-			stack = (BTStack) palloc(sizeof(BTStackData));
-			stack->bts_blkno = BufferGetBlockNumber(pbuf);
-			stack->bts_offset = InvalidOffsetNumber;
-			/* bts_btentry will be initialized below */
-			stack->bts_parent = NULL;
-			_bt_relbuf(rel, pbuf);
+			if (!_bt_unlink_halfdead_page(rel, buf, &rightsib_empty))
+			{
+				_bt_relbuf(rel, buf);
+				return ndeleted;
+			}
+			ndeleted++;
 		}
+
+		rightsib = opaque->btpo_next;
+
+		_bt_relbuf(rel, buf);
+
+		/*
+		 * The page has now been deleted. If its right sibling is completely
+		 * empty, it's possible that the reason we haven't deleted it earlier
+		 * is that it was the rightmost child of the parent. Now that we
+		 * removed the downlink for this page, the right sibling might now be
+		 * the only child of the parent, and could be removed. It would be
+		 * picked up by the next vacuum anyway, but might as well try to remove
+		 * it now, so loop back to process the right sibling.
+		 */
+		if (!rightsib_empty)
+			break;
+
+		buf = _bt_getbuf(rel, rightsib, BT_WRITE);
 	}
 
+	return ndeleted;
+}
+
+static bool
+_bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
+{
+	BlockNumber	leafblkno;
+	BlockNumber leafrightsib;
+	BlockNumber	target;
+	BlockNumber rightsib;
+	ItemId		itemid;
+	Page		page;
+	BTPageOpaque opaque;
+	Buffer		topparent;
+	OffsetNumber topoff;
+	OffsetNumber nextoffset;
+	IndexTuple	itup;
+
+	page = BufferGetPage(leafbuf);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	Assert(!P_RIGHTMOST(opaque) && !P_ISROOT(opaque) && !P_ISDELETED(opaque) &&
+		   !P_ISHALFDEAD(opaque) && P_ISLEAF(opaque) &&
+		   P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page));
+
+	/*
+	 * Save info about the leaf page.
+	 */
+	leafblkno = BufferGetBlockNumber(leafbuf);
+	leafrightsib = opaque->btpo_next;
+
 	/*
 	 * We cannot delete a page that is the rightmost child of its immediate
 	 * parent, unless it is the only child --- in which case the parent has to
 	 * be deleted too, and the same condition applies recursively to it. We
-	 * have to check this condition all the way up before trying to delete. We
-	 * don't need to re-test when deleting a non-leaf page, though.
+	 * have to check this condition all the way up before trying to delete,
+	 * and lock the final parent of the to-be-deleted branch.
+	 */
+	rightsib = leafrightsib;
+	target = leafblkno;
+	if (!_bt_lock_branch_parent(rel, leafblkno, stack,
+								&topparent, &topoff, &target, &rightsib))
+		return false;
+
+	/*
+	 * Check that the parent-page index items we're about to delete/overwrite
+	 * contain what we expect.	This can fail if the index has become corrupt
+	 * for some reason.  We want to throw any error before entering the
+	 * critical section --- otherwise it'd be a PANIC.
+	 *
+	 * The test on the target item is just an Assert because
+	 * _bt_lock_branch_parent should have guaranteed it has the expected
+	 * contents.  The test on the next-child downlink is known to sometimes
+	 * fail in the field, though.
+	 */
+	page = BufferGetPage(topparent);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+#ifdef USE_ASSERT_CHECKING
+	itemid = PageGetItemId(page, topoff);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	Assert(ItemPointerGetBlockNumber(&(itup->t_tid)) == target);
+#endif
+
+	nextoffset = OffsetNumberNext(topoff);
+	itemid = PageGetItemId(page, nextoffset);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (ItemPointerGetBlockNumber(&(itup->t_tid)) != rightsib)
+		elog(ERROR, "right sibling %u of block %u is not next child %u of block %u in index \"%s\"",
+			 rightsib, target, ItemPointerGetBlockNumber(&(itup->t_tid)),
+			 BufferGetBlockNumber(topparent), RelationGetRelationName(rel));
+
+	/*
+	 * Any insert which would have gone on the leaf block will now go to its
+	 * right sibling.
+	 */
+	PredicateLockPageCombine(rel, leafblkno, leafrightsib);
+
+	/* No ereport(ERROR) until changes are logged */
+	START_CRIT_SECTION();
+
+	/*
+	 * Update parent.  The normal case is a tad tricky because we want to
+	 * delete the target's downlink and the *following* key.  Easiest way is
+	 * to copy the right sibling's downlink over the target downlink, and then
+	 * delete the following item.
 	 */
-	if (targetlevel == 0 &&
-		!_bt_parent_deletion_safe(rel, target, stack))
-		return 0;
+	page = BufferGetPage(topparent);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	itemid = PageGetItemId(page, topoff);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	ItemPointerSet(&(itup->t_tid), rightsib, P_HIKEY);
+
+	nextoffset = OffsetNumberNext(topoff);
+	PageIndexTupleDelete(page, nextoffset);
+
+	/*
+	 * Mark the leaf page as half-dead, and stamp it with a pointer to the
+	 * highest internal page in the branch we're deleting. We use the tid of
+	 * the high key to store it.
+	 */
+	page = BufferGetPage(leafbuf);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	opaque->btpo_flags |= BTP_HALF_DEAD;
+
+	itemid = PageGetItemId(page, P_HIKEY);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (target == leafblkno)
+		ItemPointerSetInvalid(&(itup->t_tid));
+	else
+		ItemPointerSet(&(itup->t_tid), target, P_HIKEY);
+
+	/* Must mark buffers dirty before XLogInsert */
+	MarkBufferDirty(topparent);
+	MarkBufferDirty(leafbuf);
+
+	/* XLOG stuff */
+	if (RelationNeedsWAL(rel))
+	{
+		xl_btree_mark_page_halfdead xlrec;
+		XLogRecPtr	recptr;
+		XLogRecData rdata[2];
+
+		xlrec.target.node = rel->rd_node;
+		ItemPointerSet(&(xlrec.target.tid), BufferGetBlockNumber(topparent), topoff);
+		xlrec.leafblk = leafblkno;
+		xlrec.downlink = target;
+
+		page = BufferGetPage(leafbuf);
+		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		xlrec.leftblk = opaque->btpo_prev;
+		xlrec.rightblk = opaque->btpo_next;
+
+		rdata[0].data = (char *) &xlrec;
+		rdata[0].len = SizeOfBtreeMarkPageHalfDead;
+		rdata[0].buffer = InvalidBuffer;
+		rdata[0].next = &(rdata[1]);
+
+		rdata[1].data = NULL;
+		rdata[1].len = 0;
+		rdata[1].buffer = topparent;
+		rdata[1].buffer_std = true;
+		rdata[1].next = NULL;
+
+		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_MARK_PAGE_HALFDEAD, rdata);
+
+		page = BufferGetPage(topparent);
+		PageSetLSN(page, recptr);
+		page = BufferGetPage(leafbuf);
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	_bt_relbuf(rel, topparent);
+	return true;
+}
+
+/*
+ * Unlink a page in a branch of half-dead pages from its siblings.
+ *
+ * If the leaf page still has a downlink pointing to it, unlinks the highest
+ * parent in the to-be-deleted branch instead of the leaf page. To get rid
+ * of the whole branch, including the leaf page itself, iterate until the
+ * leaf page is deleted.
+ *
+ * Returns 'false' if the page could not be unlinked (shouldn't happen).
+ * If the (new) right sibling of the page is empty, *rightsib_empty is set
+ * to true.
+ */
+static bool
+_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
+{
+	BlockNumber leafblkno = BufferGetBlockNumber(leafbuf);
+	BlockNumber leafleftsib;
+	BlockNumber leafrightsib;
+	BlockNumber	target;
+	BlockNumber	leftsib;
+	BlockNumber	rightsib;
+	Buffer		lbuf = InvalidBuffer;
+	Buffer		buf;
+	Buffer		rbuf;
+	Buffer		metabuf = InvalidBuffer;
+	Page		metapg = NULL;
+	BTMetaPageData *metad = NULL;
+	ItemId		itemid;
+	Page		page;
+	BTPageOpaque opaque;
+	bool		rightsib_is_rightmost;
+	int			targetlevel;
+	ItemPointer leafhikey;
+	BlockNumber nextchild;
+	BlockNumber	topblkno;
+
+	page = BufferGetPage(leafbuf);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	Assert(P_ISLEAF(opaque) && P_ISHALFDEAD(opaque));
+
+	/*
+	 * Remember some information about the leaf page.
+	 */
+	itemid = PageGetItemId(page, P_HIKEY);
+	leafhikey = &((IndexTuple) PageGetItem(page, itemid))->t_tid;
+	leafleftsib = opaque->btpo_prev;
+	leafrightsib = opaque->btpo_next;
+
+	LockBuffer(leafbuf, BUFFER_LOCK_UNLOCK);
+
+	/*
+	 * If the leaf page still has a parent pointing to it (or a chain of
+	 * parents), we don't unlink the leaf page yet, but the topmost remaining
+	 * parent in the branch.
+	 */
+	if (ItemPointerIsValid(leafhikey))
+	{
+		topblkno = ItemPointerGetBlockNumber(leafhikey);
+		target = topblkno;
+
+		/* fetch the block number of the topmost parent's left sibling */
+		buf = _bt_getbuf(rel, topblkno, BT_READ);
+		page = BufferGetPage(buf);
+		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		leftsib = opaque->btpo_prev;
+		targetlevel = opaque->btpo.level;
+
+		/*
+		 * To avoid deadlocks, we'd better drop the target page lock before
+		 * going further.
+		 */
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+	}
+	else
+	{
+		topblkno = InvalidBlockNumber;
+		target = leafblkno;
+
+		buf = leafbuf;
+		leftsib = opaque->btpo_prev;
+		targetlevel = 0;
+	}
 
 	/*
 	 * We have to lock the pages we need to modify in the standard order:
 	 * moving right, then up.  Else we will deadlock against other writers.
 	 *
-	 * So, we need to find and write-lock the current left sibling of the
-	 * target page.  The sibling that was current a moment ago could have
-	 * split, so we may have to move right.  This search could fail if either
-	 * the sibling or the target page was deleted by someone else meanwhile;
-	 * if so, give up.	(Right now, that should never happen, since page
-	 * deletion is only done in VACUUM and there shouldn't be multiple VACUUMs
-	 * concurrently on the same table.)
+	 * So, first lock the leaf page, if it's not the target.  Then find and
+	 * write-lock the current left sibling of the target page.  The sibling
+	 * that was current a moment ago could have split, so we may have to move
+	 * right.  This search could fail if either the sibling or the target page
+	 * was deleted by someone else meanwhile; if so, give up.  (Right now,
+	 * that should never happen, since page deletion is only done in VACUUM
+	 * and there shouldn't be multiple VACUUMs concurrently on the same
+	 * table.)
 	 */
+	if (target != leafblkno)
+		LockBuffer(leafbuf, BT_WRITE);
 	if (leftsib != P_NONE)
 	{
 		lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
@@ -1217,7 +1496,7 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 			{
 				elog(LOG, "no left sibling (concurrent deletion?) in \"%s\"",
 					 RelationGetRelationName(rel));
-				return 0;
+				return false;
 			}
 			lbuf = _bt_getbuf(rel, leftsib, BT_WRITE);
 			page = BufferGetPage(lbuf);
@@ -1232,27 +1511,44 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 	 * a write lock not a superexclusive lock, since no scans would stop on an
 	 * empty page.
 	 */
-	buf = _bt_getbuf(rel, target, BT_WRITE);
+	LockBuffer(buf, BT_WRITE);
 	page = BufferGetPage(buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
 	/*
-	 * Check page is still empty etc, else abandon deletion.  The empty check
-	 * is necessary since someone else might have inserted into it while we
-	 * didn't have it locked; the others are just for paranoia's sake.
+	 * Check page is still empty etc, else abandon deletion.  This is just
+	 * for paranoia's sake; a half-dead page cannot resurrect because there
+	 * can be only one vacuum process running at a time.
 	 */
-	if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
-		P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page))
+	if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque))
 	{
-		_bt_relbuf(rel, buf);
-		if (BufferIsValid(lbuf))
-			_bt_relbuf(rel, lbuf);
-		return 0;
+		elog(ERROR, "half-dead page changed status unexpectedly in block %u of index \"%s\"",
+			 target, RelationGetRelationName(rel));
 	}
 	if (opaque->btpo_prev != leftsib)
 		elog(ERROR, "left link changed unexpectedly in block %u of index \"%s\"",
 			 target, RelationGetRelationName(rel));
 
+	if (target == leafblkno)
+	{
+		if (P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page) ||
+			!P_ISLEAF(opaque) || !P_ISHALFDEAD(opaque))
+			elog(ERROR, "half-dead page changed status unexpectedly in block %u of index \"%s\"",
+				 target, RelationGetRelationName(rel));
+		nextchild = InvalidBlockNumber;
+	}
+	else
+	{
+		if (P_FIRSTDATAKEY(opaque) != PageGetMaxOffsetNumber(page) ||
+			P_ISLEAF(opaque))
+			elog(ERROR, "half-dead page changed status unexpectedly in block %u of index \"%s\"",
+				 target, RelationGetRelationName(rel));
+
+		/* remember the next child down in the branch. */
+		itemid = PageGetItemId(page, P_FIRSTDATAKEY(opaque));
+		nextchild = ItemPointerGetBlockNumber(&((IndexTuple) PageGetItem(page, itemid))->t_tid);
+	}
+
 	/*
 	 * And next write-lock the (current) right sibling.
 	 */
@@ -1265,50 +1561,8 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 			 "block %u links to %u instead of expected %u in index \"%s\"",
 			 rightsib, opaque->btpo_prev, target,
 			 RelationGetRelationName(rel));
-
-	/*
-	 * Any insert which would have gone on the target block will now go to the
-	 * right sibling block.
-	 */
-	PredicateLockPageCombine(rel, target, rightsib);
-
-	/*
-	 * Next find and write-lock the current parent of the target page. This is
-	 * essentially the same as the corresponding step of splitting.
-	 */
-	ItemPointerSet(&(stack->bts_btentry.t_tid), target, P_HIKEY);
-	pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
-	if (pbuf == InvalidBuffer)
-		elog(ERROR, "failed to re-find parent key in index \"%s\" for deletion target page %u",
-			 RelationGetRelationName(rel), target);
-	parent = stack->bts_blkno;
-	poffset = stack->bts_offset;
-
-	/*
-	 * If the target is the rightmost child of its parent, then we can't
-	 * delete, unless it's also the only child --- in which case the parent
-	 * changes to half-dead status.  The "can't delete" case should have been
-	 * detected by _bt_parent_deletion_safe, so complain if we see it now.
-	 */
-	page = BufferGetPage(pbuf);
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-	maxoff = PageGetMaxOffsetNumber(page);
-	parent_half_dead = false;
-	parent_one_child = false;
-	if (poffset >= maxoff)
-	{
-		if (poffset == P_FIRSTDATAKEY(opaque))
-			parent_half_dead = true;
-		else
-			elog(ERROR, "failed to delete rightmost child %u of block %u in index \"%s\"",
-				 target, parent, RelationGetRelationName(rel));
-	}
-	else
-	{
-		/* Will there be exactly one child left in this parent? */
-		if (OffsetNumberNext(P_FIRSTDATAKEY(opaque)) == maxoff)
-			parent_one_child = true;
-	}
+	rightsib_is_rightmost = P_RIGHTMOST(opaque);
+	*rightsib_empty = (P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page));
 
 	/*
 	 * If we are deleting the next-to-last page on the target's level, then
@@ -1322,11 +1576,10 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 	 * We can safely acquire a lock on the metapage here --- see comments for
 	 * _bt_newroot().
 	 */
-	if (leftsib == P_NONE && !parent_half_dead)
+	if (leftsib == P_NONE && rightsib_is_rightmost)
 	{
 		page = BufferGetPage(rbuf);
 		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-		Assert(opaque->btpo.level == targetlevel);
 		if (P_RIGHTMOST(opaque))
 		{
 			/* rightsib will be the only one left on the level */
@@ -1349,38 +1602,6 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 	}
 
 	/*
-	 * Check that the parent-page index items we're about to delete/overwrite
-	 * contain what we expect.	This can fail if the index has become corrupt
-	 * for some reason.  We want to throw any error before entering the
-	 * critical section --- otherwise it'd be a PANIC.
-	 *
-	 * The test on the target item is just an Assert because _bt_getstackbuf
-	 * should have guaranteed it has the expected contents.  The test on the
-	 * next-child downlink is known to sometimes fail in the field, though.
-	 */
-	page = BufferGetPage(pbuf);
-	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
-
-#ifdef USE_ASSERT_CHECKING
-	itemid = PageGetItemId(page, poffset);
-	itup = (IndexTuple) PageGetItem(page, itemid);
-	Assert(ItemPointerGetBlockNumber(&(itup->t_tid)) == target);
-#endif
-
-	if (!parent_half_dead)
-	{
-		OffsetNumber nextoffset;
-
-		nextoffset = OffsetNumberNext(poffset);
-		itemid = PageGetItemId(page, nextoffset);
-		itup = (IndexTuple) PageGetItem(page, itemid);
-		if (ItemPointerGetBlockNumber(&(itup->t_tid)) != rightsib)
-			elog(ERROR, "right sibling %u of block %u is not next child %u of block %u in index \"%s\"",
-				 rightsib, target, ItemPointerGetBlockNumber(&(itup->t_tid)),
-				 parent, RelationGetRelationName(rel));
-	}
-
-	/*
 	 * Here we begin doing the deletion.
 	 */
 
@@ -1388,29 +1609,6 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 	START_CRIT_SECTION();
 
 	/*
-	 * Update parent.  The normal case is a tad tricky because we want to
-	 * delete the target's downlink and the *following* key.  Easiest way is
-	 * to copy the right sibling's downlink over the target downlink, and then
-	 * delete the following item.
-	 */
-	if (parent_half_dead)
-	{
-		PageIndexTupleDelete(page, poffset);
-		opaque->btpo_flags |= BTP_HALF_DEAD;
-	}
-	else
-	{
-		OffsetNumber nextoffset;
-
-		itemid = PageGetItemId(page, poffset);
-		itup = (IndexTuple) PageGetItem(page, itemid);
-		ItemPointerSet(&(itup->t_tid), rightsib, P_HIKEY);
-
-		nextoffset = OffsetNumberNext(poffset);
-		PageIndexTupleDelete(page, nextoffset);
-	}
-
-	/*
 	 * Update siblings' side-links.  Note the target page's side-links will
 	 * continue to point to the siblings.  Asserts here are just rechecking
 	 * things we already verified above.
@@ -1426,7 +1624,19 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	Assert(opaque->btpo_prev == target);
 	opaque->btpo_prev = leftsib;
-	rightsib_empty = (P_FIRSTDATAKEY(opaque) > PageGetMaxOffsetNumber(page));
+
+	/*
+	 * If we deleted a parent of the targeted leaf page, instead of the leaf
+	 * itself, update the leaf to point to the next remaining child in the
+	 * branch.
+	 */
+	if (target != leafblkno)
+	{
+		if (nextchild == leafblkno)
+			ItemPointerSetInvalid(leafhikey);
+		else
+			ItemPointerSet(leafhikey, nextchild, P_HIKEY);
+	}
 
 	/*
 	 * Mark the page itself deleted.  It can be recycled when all current
@@ -1453,31 +1663,39 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 	}
 
 	/* Must mark buffers dirty before XLogInsert */
-	MarkBufferDirty(pbuf);
 	MarkBufferDirty(rbuf);
 	MarkBufferDirty(buf);
 	if (BufferIsValid(lbuf))
 		MarkBufferDirty(lbuf);
+	if (target != leafblkno)
+		MarkBufferDirty(leafbuf);
 
 	/* XLOG stuff */
 	if (RelationNeedsWAL(rel))
 	{
-		xl_btree_delete_page xlrec;
+		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
 		uint8		xlinfo;
 		XLogRecPtr	recptr;
-		XLogRecData rdata[5];
+		XLogRecData rdata[4];
 		XLogRecData *nextrdata;
 
-		xlrec.target.node = rel->rd_node;
-		ItemPointerSet(&(xlrec.target.tid), parent, poffset);
+		xlrec.node = rel->rd_node;
+
+		/* information on the unlinked block */
 		xlrec.deadblk = target;
-		xlrec.leftblk = leftsib;
-		xlrec.rightblk = rightsib;
+		xlrec.leftsib = leftsib;
+		xlrec.rightsib = rightsib;
 		xlrec.btpo_xact = opaque->btpo.xact;
 
+		/* information needed to recreate the leaf block (if not the target) */
+		xlrec.leafblk = leafblkno;
+		xlrec.leafleftsib = leafleftsib;
+		xlrec.leafrightsib = leafrightsib;
+		xlrec.downlink = nextchild;
+
 		rdata[0].data = (char *) &xlrec;
-		rdata[0].len = SizeOfBtreeDeletePage;
+		rdata[0].len = SizeOfBtreeUnlinkPage;
 		rdata[0].buffer = InvalidBuffer;
 		rdata[0].next = nextrdata = &(rdata[1]);
 
@@ -1493,19 +1711,10 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 			nextrdata->buffer = InvalidBuffer;
 			nextrdata->next = nextrdata + 1;
 			nextrdata++;
-			xlinfo = XLOG_BTREE_DELETE_PAGE_META;
+			xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
 		}
-		else if (parent_half_dead)
-			xlinfo = XLOG_BTREE_DELETE_PAGE_HALF;
 		else
-			xlinfo = XLOG_BTREE_DELETE_PAGE;
-
-		nextrdata->data = NULL;
-		nextrdata->len = 0;
-		nextrdata->next = nextrdata + 1;
-		nextrdata->buffer = pbuf;
-		nextrdata->buffer_std = true;
-		nextrdata++;
+			xlinfo = XLOG_BTREE_UNLINK_PAGE;
 
 		nextrdata->data = NULL;
 		nextrdata->len = 0;
@@ -1530,8 +1739,6 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 		{
 			PageSetLSN(metapg, recptr);
 		}
-		page = BufferGetPage(pbuf);
-		PageSetLSN(page, recptr);
 		page = BufferGetPage(rbuf);
 		PageSetLSN(page, recptr);
 		page = BufferGetPage(buf);
@@ -1541,6 +1748,11 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 			page = BufferGetPage(lbuf);
 			PageSetLSN(page, recptr);
 		}
+		if (target != leafblkno)
+		{
+			page = BufferGetPage(leafbuf);
+			PageSetLSN(page, recptr);
+		}
 	}
 
 	END_CRIT_SECTION();
@@ -1551,46 +1763,17 @@ _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
 		CacheInvalidateRelcache(rel);
 		_bt_relbuf(rel, metabuf);
 	}
-	/* can always release leftsib immediately */
+	/* release siblings */
 	if (BufferIsValid(lbuf))
 		_bt_relbuf(rel, lbuf);
+	_bt_relbuf(rel, rbuf);
 
 	/*
-	 * If parent became half dead, recurse to delete it. Otherwise, if right
-	 * sibling is empty and is now the last child of the parent, recurse to
-	 * try to delete it.  (These cases cannot apply at the same time, though
-	 * the second case might itself recurse to the first.)
-	 *
-	 * When recursing to parent, we hold the lock on the target page until
-	 * done.  This delays any insertions into the keyspace that was just
-	 * effectively reassigned to the parent's right sibling.  If we allowed
-	 * that, and there were enough such insertions before we finish deleting
-	 * the parent, page splits within that keyspace could lead to inserting
-	 * out-of-order keys into the grandparent level.  It is thought that that
-	 * wouldn't have any serious consequences, but it still seems like a
-	 * pretty bad idea.
+	 * Release the target, if it was not the leaf block. The leaf is always
+	 * kept locked.
 	 */
-	if (parent_half_dead)
-	{
-		/* recursive call will release pbuf */
-		_bt_relbuf(rel, rbuf);
-		result = _bt_pagedel(rel, pbuf, stack->bts_parent) + 1;
+	if (target != leafblkno)
 		_bt_relbuf(rel, buf);
-	}
-	else if (parent_one_child && rightsib_empty)
-	{
-		_bt_relbuf(rel, pbuf);
-		_bt_relbuf(rel, buf);
-		/* recursive call will release rbuf */
-		result = _bt_pagedel(rel, rbuf, stack) + 1;
-	}
-	else
-	{
-		_bt_relbuf(rel, pbuf);
-		_bt_relbuf(rel, buf);
-		_bt_relbuf(rel, rbuf);
-		result = 1;
-	}
 
-	return result;
+	return true;
 }
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b7f9e2e..aab4644 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1090,7 +1090,7 @@ restart:
 		MemoryContextReset(vstate->pagedelcontext);
 		oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
 
-		ndel = _bt_pagedel(rel, buf, NULL);
+		ndel = _bt_pagedel(rel, buf);
 
 		/* count only this page, else may double-count parent */
 		if (ndel)
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 98d9763..bc376fd 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -25,41 +25,32 @@
  * them manually if they are not seen in the WAL log during replay.  This
  * makes it safe for page insertion to be a multiple-WAL-action process.
  *
- * Similarly, deletion of an only child page and deletion of its parent page
- * form multiple WAL log entries, and we have to be prepared to follow through
- * with the deletion if the log ends between.
- *
  * The data structure is a simple linked list --- this should be good enough,
- * since we don't expect a page split or multi deletion to remain incomplete
- * for long.  In any case we need to respect the order of operations.
+ * since we don't expect a page split to remain incomplete for long.  In any
+ * case we need to respect the order of operations.
  */
-typedef struct bt_incomplete_action
+typedef struct bt_incomplete_split
 {
 	RelFileNode node;			/* the index */
-	bool		is_split;		/* T = pending split, F = pending delete */
-	/* these fields are for a split: */
 	bool		is_root;		/* we split the root */
 	BlockNumber leftblk;		/* left half of split */
 	BlockNumber rightblk;		/* right half of split */
-	/* these fields are for a delete: */
-	BlockNumber delblk;			/* parent block to be deleted */
-} bt_incomplete_action;
+} bt_incomplete_split;
 
-static List *incomplete_actions;
+static List *incomplete_splits;
 
 
 static void
 log_incomplete_split(RelFileNode node, BlockNumber leftblk,
 					 BlockNumber rightblk, bool is_root)
 {
-	bt_incomplete_action *action = palloc(sizeof(bt_incomplete_action));
+	bt_incomplete_split *action = palloc(sizeof(bt_incomplete_split));
 
 	action->node = node;
-	action->is_split = true;
 	action->is_root = is_root;
 	action->leftblk = leftblk;
 	action->rightblk = rightblk;
-	incomplete_actions = lappend(incomplete_actions, action);
+	incomplete_splits = lappend(incomplete_splits, action);
 }
 
 static void
@@ -67,49 +58,17 @@ forget_matching_split(RelFileNode node, BlockNumber downlink, bool is_root)
 {
 	ListCell   *l;
 
-	foreach(l, incomplete_actions)
+	foreach(l, incomplete_splits)
 	{
-		bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
+		bt_incomplete_split *action = (bt_incomplete_split *) lfirst(l);
 
 		if (RelFileNodeEquals(node, action->node) &&
-			action->is_split &&
 			downlink == action->rightblk)
 		{
 			if (is_root != action->is_root)
 				elog(LOG, "forget_matching_split: fishy is_root data (expected %d, got %d)",
 					 action->is_root, is_root);
-			incomplete_actions = list_delete_ptr(incomplete_actions, action);
-			pfree(action);
-			break;				/* need not look further */
-		}
-	}
-}
-
-static void
-log_incomplete_deletion(RelFileNode node, BlockNumber delblk)
-{
-	bt_incomplete_action *action = palloc(sizeof(bt_incomplete_action));
-
-	action->node = node;
-	action->is_split = false;
-	action->delblk = delblk;
-	incomplete_actions = lappend(incomplete_actions, action);
-}
-
-static void
-forget_matching_deletion(RelFileNode node, BlockNumber delblk)
-{
-	ListCell   *l;
-
-	foreach(l, incomplete_actions)
-	{
-		bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
-
-		if (RelFileNodeEquals(node, action->node) &&
-			!action->is_split &&
-			delblk == action->delblk)
-		{
-			incomplete_actions = list_delete_ptr(incomplete_actions, action);
+			incomplete_splits = list_delete_ptr(incomplete_splits, action);
 			pfree(action);
 			break;				/* need not look further */
 		}
@@ -798,21 +757,16 @@ btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
 }
 
 static void
-btree_xlog_delete_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
+btree_xlog_mark_page_halfdead(uint8 info, XLogRecPtr lsn, XLogRecord *record)
 {
-	xl_btree_delete_page *xlrec = (xl_btree_delete_page *) XLogRecGetData(record);
+	xl_btree_mark_page_halfdead *xlrec = (xl_btree_mark_page_halfdead *) XLogRecGetData(record);
 	BlockNumber parent;
-	BlockNumber target;
-	BlockNumber leftsib;
-	BlockNumber rightsib;
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque pageop;
+	IndexTupleData trunctuple;
 
 	parent = ItemPointerGetBlockNumber(&(xlrec->target.tid));
-	target = xlrec->deadblk;
-	leftsib = xlrec->leftblk;
-	rightsib = xlrec->rightblk;
 
 	/*
 	 * In normal operation, we would lock all the pages this WAL record
@@ -832,49 +786,97 @@ btree_xlog_delete_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
 		{
 			page = (Page) BufferGetPage(buffer);
 			pageop = (BTPageOpaque) PageGetSpecialPointer(page);
-			if (lsn <= PageGetLSN(page))
-			{
-				UnlockReleaseBuffer(buffer);
-			}
-			else
+			if (lsn > PageGetLSN(page))
 			{
 				OffsetNumber poffset;
+				ItemId		itemid;
+				IndexTuple	itup;
+				OffsetNumber nextoffset;
+				BlockNumber	rightsib;
 
 				poffset = ItemPointerGetOffsetNumber(&(xlrec->target.tid));
-				if (poffset >= PageGetMaxOffsetNumber(page))
-				{
-					Assert(info == XLOG_BTREE_DELETE_PAGE_HALF);
-					Assert(poffset == P_FIRSTDATAKEY(pageop));
-					PageIndexTupleDelete(page, poffset);
-					pageop->btpo_flags |= BTP_HALF_DEAD;
-				}
-				else
-				{
-					ItemId		itemid;
-					IndexTuple	itup;
-					OffsetNumber nextoffset;
-
-					Assert(info != XLOG_BTREE_DELETE_PAGE_HALF);
-					itemid = PageGetItemId(page, poffset);
-					itup = (IndexTuple) PageGetItem(page, itemid);
-					ItemPointerSet(&(itup->t_tid), rightsib, P_HIKEY);
-					nextoffset = OffsetNumberNext(poffset);
-					PageIndexTupleDelete(page, nextoffset);
-				}
+
+				nextoffset = OffsetNumberNext(poffset);
+				itemid = PageGetItemId(page, nextoffset);
+				itup = (IndexTuple) PageGetItem(page, itemid);
+				rightsib = ItemPointerGetBlockNumber(&itup->t_tid);
+
+				itemid = PageGetItemId(page, poffset);
+				itup = (IndexTuple) PageGetItem(page, itemid);
+				ItemPointerSet(&(itup->t_tid), rightsib, P_HIKEY);
+				nextoffset = OffsetNumberNext(poffset);
+				PageIndexTupleDelete(page, nextoffset);
 
 				PageSetLSN(page, lsn);
 				MarkBufferDirty(buffer);
-				UnlockReleaseBuffer(buffer);
 			}
+			UnlockReleaseBuffer(buffer);
 		}
 	}
 
+	/* Rewrite the leaf page as a halfdead page */
+	buffer = XLogReadBuffer(xlrec->target.node, xlrec->leafblk, true);
+	Assert(BufferIsValid(buffer));
+	page = (Page) BufferGetPage(buffer);
+
+	_bt_pageinit(page, BufferGetPageSize(buffer));
+	pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	pageop->btpo_prev = xlrec->leftblk;
+	pageop->btpo_next = xlrec->rightblk;
+	pageop->btpo.level = 0;
+	pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF;
+	pageop->btpo_cycleid = 0;
+
+	/*
+	 * Construct a dummy hikey item that points to the next parent to be
+	 * deleted (if any).
+	 */
+	MemSet(&trunctuple, 0, sizeof(IndexTupleData));
+	trunctuple.t_info = sizeof(IndexTupleData);
+	if (xlrec->downlink != InvalidBlockNumber)
+		ItemPointerSet(&trunctuple.t_tid, xlrec->downlink, P_HIKEY);
+	else
+		ItemPointerSetInvalid(&trunctuple.t_tid);
+	if (PageAddItem(page, (Item) &trunctuple, sizeof(IndexTupleData), P_HIKEY,
+					false, false) == InvalidOffsetNumber)
+		elog(ERROR, "could not add dummy high key to half-dead page");
+
+	PageSetLSN(page, lsn);
+	MarkBufferDirty(buffer);
+	UnlockReleaseBuffer(buffer);
+}
+
+
+static void
+btree_xlog_unlink_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
+{
+	xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) XLogRecGetData(record);
+	BlockNumber target;
+	BlockNumber leftsib;
+	BlockNumber rightsib;
+	Buffer		buffer;
+	Page		page;
+	BTPageOpaque pageop;
+
+	target = xlrec->deadblk;
+	leftsib = xlrec->leftsib;
+	rightsib = xlrec->rightsib;
+
+	/*
+	 * In normal operation, we would lock all the pages this WAL record
+	 * touches before changing any of them.  In WAL replay, it should be okay
+	 * to lock just one page at a time, since no concurrent index updates can
+	 * be happening, and readers should not care whether they arrive at the
+	 * target page or not (since it's surely empty).
+	 */
+
 	/* Fix left-link of right sibling */
-	if (record->xl_info & XLR_BKP_BLOCK(1))
-		(void) RestoreBackupBlock(lsn, record, 1, false, false);
+	if (record->xl_info & XLR_BKP_BLOCK(0))
+		(void) RestoreBackupBlock(lsn, record, 0, false, false);
 	else
 	{
-		buffer = XLogReadBuffer(xlrec->target.node, rightsib, false);
+		buffer = XLogReadBuffer(xlrec->node, rightsib, false);
 		if (BufferIsValid(buffer))
 		{
 			page = (Page) BufferGetPage(buffer);
@@ -895,13 +897,13 @@ btree_xlog_delete_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
 	}
 
 	/* Fix right-link of left sibling, if any */
-	if (record->xl_info & XLR_BKP_BLOCK(2))
-		(void) RestoreBackupBlock(lsn, record, 2, false, false);
+	if (record->xl_info & XLR_BKP_BLOCK(1))
+		(void) RestoreBackupBlock(lsn, record, 1, false, false);
 	else
 	{
 		if (leftsib != P_NONE)
 		{
-			buffer = XLogReadBuffer(xlrec->target.node, leftsib, false);
+			buffer = XLogReadBuffer(xlrec->node, leftsib, false);
 			if (BufferIsValid(buffer))
 			{
 				page = (Page) BufferGetPage(buffer);
@@ -923,7 +925,7 @@ btree_xlog_delete_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
 	}
 
 	/* Rewrite target page as empty deleted page */
-	buffer = XLogReadBuffer(xlrec->target.node, target, true);
+	buffer = XLogReadBuffer(xlrec->node, target, true);
 	Assert(BufferIsValid(buffer));
 	page = (Page) BufferGetPage(buffer);
 
@@ -940,24 +942,58 @@ btree_xlog_delete_page(uint8 info, XLogRecPtr lsn, XLogRecord *record)
 	MarkBufferDirty(buffer);
 	UnlockReleaseBuffer(buffer);
 
+	/*
+	 * If we deleted a parent of the targeted leaf page, instead of the leaf
+	 * itself, update the leaf to point to the next remaining child in the
+	 * branch.
+	 */
+	if (target != xlrec->leafblk)
+	{
+		/*
+		 * There is no real data on the page, so we just re-create it from
+		 * scratch using the information from the WAL record.
+		 */
+		IndexTupleData trunctuple;
+
+		buffer = XLogReadBuffer(xlrec->node, xlrec->leafblk, true);
+		Assert(BufferIsValid(buffer));
+		page = (Page) BufferGetPage(buffer);
+		pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+
+		_bt_pageinit(page, BufferGetPageSize(buffer));
+		pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF;
+		pageop->btpo_prev = xlrec->leafleftsib;
+		pageop->btpo_next = xlrec->leafrightsib;
+		pageop->btpo.level = 0;
+		pageop->btpo_cycleid = 0;
+
+		/* Add a dummy hikey item */
+		MemSet(&trunctuple, 0, sizeof(IndexTupleData));
+		trunctuple.t_info = sizeof(IndexTupleData);
+		if (xlrec->downlink != InvalidBlockNumber)
+			ItemPointerSet(&trunctuple.t_tid, xlrec->downlink, P_HIKEY);
+		else
+			ItemPointerSetInvalid(&trunctuple.t_tid);
+		if (PageAddItem(page, (Item) &trunctuple, sizeof(IndexTupleData), P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "could not add dummy high key to half-dead page");
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+		UnlockReleaseBuffer(buffer);
+	}
+
 	/* Update metapage if needed */
-	if (info == XLOG_BTREE_DELETE_PAGE_META)
+	if (info == XLOG_BTREE_UNLINK_PAGE_META)
 	{
 		xl_btree_metadata md;
 
-		memcpy(&md, (char *) xlrec + SizeOfBtreeDeletePage,
+		memcpy(&md, (char *) xlrec + SizeOfBtreeUnlinkPage,
 			   sizeof(xl_btree_metadata));
-		_bt_restore_meta(xlrec->target.node, lsn,
+		_bt_restore_meta(xlrec->node, lsn,
 						 md.root, md.level,
 						 md.fastroot, md.fastlevel);
 	}
-
-	/* Forget any completed deletion */
-	forget_matching_deletion(xlrec->target.node, target);
-
-	/* If parent became half-dead, remember it for deletion */
-	if (info == XLOG_BTREE_DELETE_PAGE_HALF)
-		log_incomplete_deletion(xlrec->target.node, parent);
 }
 
 static void
@@ -1072,10 +1108,12 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record)
 		case XLOG_BTREE_DELETE:
 			btree_xlog_delete(lsn, record);
 			break;
-		case XLOG_BTREE_DELETE_PAGE:
-		case XLOG_BTREE_DELETE_PAGE_META:
-		case XLOG_BTREE_DELETE_PAGE_HALF:
-			btree_xlog_delete_page(info, lsn, record);
+		case XLOG_BTREE_MARK_PAGE_HALFDEAD:
+			btree_xlog_mark_page_halfdead(info, lsn, record);
+			break;
+		case XLOG_BTREE_UNLINK_PAGE:
+		case XLOG_BTREE_UNLINK_PAGE_META:
+			btree_xlog_unlink_page(info, lsn, record);
 			break;
 		case XLOG_BTREE_NEWROOT:
 			btree_xlog_newroot(lsn, record);
@@ -1091,7 +1129,7 @@ btree_redo(XLogRecPtr lsn, XLogRecord *record)
 void
 btree_xlog_startup(void)
 {
-	incomplete_actions = NIL;
+	incomplete_splits = NIL;
 }
 
 void
@@ -1099,67 +1137,47 @@ btree_xlog_cleanup(void)
 {
 	ListCell   *l;
 
-	foreach(l, incomplete_actions)
+	foreach(l, incomplete_splits)
 	{
-		bt_incomplete_action *action = (bt_incomplete_action *) lfirst(l);
-
-		if (action->is_split)
-		{
-			/* finish an incomplete split */
-			Buffer		lbuf,
-						rbuf;
-			Page		lpage,
-						rpage;
-			BTPageOpaque lpageop,
-						rpageop;
-			bool		is_only;
-			Relation	reln;
-
-			lbuf = XLogReadBuffer(action->node, action->leftblk, false);
-			/* failure is impossible because we wrote this page earlier */
-			if (!BufferIsValid(lbuf))
-				elog(PANIC, "btree_xlog_cleanup: left block unfound");
-			lpage = (Page) BufferGetPage(lbuf);
-			lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
-			rbuf = XLogReadBuffer(action->node, action->rightblk, false);
-			/* failure is impossible because we wrote this page earlier */
-			if (!BufferIsValid(rbuf))
-				elog(PANIC, "btree_xlog_cleanup: right block unfound");
-			rpage = (Page) BufferGetPage(rbuf);
-			rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
-
-			/* if the pages are all of their level, it's a only-page split */
-			is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop);
-
-			reln = CreateFakeRelcacheEntry(action->node);
-			_bt_insert_parent(reln, lbuf, rbuf, NULL,
-							  action->is_root, is_only);
-			FreeFakeRelcacheEntry(reln);
-		}
-		else
-		{
-			/* finish an incomplete deletion (of a half-dead page) */
-			Buffer		buf;
-
-			buf = XLogReadBuffer(action->node, action->delblk, false);
-			if (BufferIsValid(buf))
-			{
-				Relation	reln;
-
-				reln = CreateFakeRelcacheEntry(action->node);
-				if (_bt_pagedel(reln, buf, NULL) == 0)
-					elog(PANIC, "btree_xlog_cleanup: _bt_pagedel failed");
-				FreeFakeRelcacheEntry(reln);
-			}
-		}
+		/* finish an incomplete split */
+		bt_incomplete_split *action = (bt_incomplete_split *) lfirst(l);
+		Buffer		lbuf,
+					rbuf;
+		Page		lpage,
+					rpage;
+		BTPageOpaque lpageop,
+					rpageop;
+		bool		is_only;
+		Relation	reln;
+
+		lbuf = XLogReadBuffer(action->node, action->leftblk, false);
+		/* failure is impossible because we wrote this page earlier */
+		if (!BufferIsValid(lbuf))
+			elog(PANIC, "btree_xlog_cleanup: left block unfound");
+		lpage = (Page) BufferGetPage(lbuf);
+		lpageop = (BTPageOpaque) PageGetSpecialPointer(lpage);
+		rbuf = XLogReadBuffer(action->node, action->rightblk, false);
+		/* failure is impossible because we wrote this page earlier */
+		if (!BufferIsValid(rbuf))
+			elog(PANIC, "btree_xlog_cleanup: right block unfound");
+		rpage = (Page) BufferGetPage(rbuf);
+		rpageop = (BTPageOpaque) PageGetSpecialPointer(rpage);
+
+		/* if the pages are all of their level, it's a only-page split */
+		is_only = P_LEFTMOST(lpageop) && P_RIGHTMOST(rpageop);
+
+		reln = CreateFakeRelcacheEntry(action->node);
+		_bt_insert_parent(reln, lbuf, rbuf, NULL,
+						  action->is_root, is_only);
+		FreeFakeRelcacheEntry(reln);
 	}
-	incomplete_actions = NIL;
+	incomplete_splits = NIL;
 }
 
 bool
 btree_safe_restartpoint(void)
 {
-	if (incomplete_actions)
+	if (incomplete_splits)
 		return false;
 	return true;
 }
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 57ac021..c1b0018 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -124,16 +124,27 @@ btree_desc(StringInfo buf, uint8 xl_info, char *rec)
 								 xlrec->hnode.spcNode, xlrec->hnode.dbNode, xlrec->hnode.relNode);
 				break;
 			}
-		case XLOG_BTREE_DELETE_PAGE:
-		case XLOG_BTREE_DELETE_PAGE_META:
-		case XLOG_BTREE_DELETE_PAGE_HALF:
+		case XLOG_BTREE_MARK_PAGE_HALFDEAD:
 			{
-				xl_btree_delete_page *xlrec = (xl_btree_delete_page *) rec;
+				xl_btree_mark_page_halfdead *xlrec = (xl_btree_mark_page_halfdead *) rec;
 
-				appendStringInfoString(buf, "delete_page: ");
+				appendStringInfoString(buf, "mark_page_halfdead: ");
 				out_target(buf, &(xlrec->target));
-				appendStringInfo(buf, "; dead %u; left %u; right %u",
-							xlrec->deadblk, xlrec->leftblk, xlrec->rightblk);
+				appendStringInfo(buf, "; downlink %u; leaf %u; left %u; right %u",
+								 xlrec->leafblk, xlrec->leftblk, xlrec->rightblk, xlrec->downlink);
+				break;
+			}
+		case XLOG_BTREE_UNLINK_PAGE_META:
+		case XLOG_BTREE_UNLINK_PAGE:
+			{
+				xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) rec;
+
+				appendStringInfo(buf, "unlink_page: rel %u/%u/%u; ",
+								 xlrec->node.spcNode, xlrec->node.dbNode, xlrec->node.relNode);
+				appendStringInfo(buf, "dead %u; left %u; right %u; btpo_xact %u",
+								 xlrec->deadblk, xlrec->leftsib, xlrec->rightsib, xlrec->btpo_xact);
+				appendStringInfo(buf, "leaf %u; leafleft %u; leafright %u; downlink %u",
+								 xlrec->leafblk, xlrec->leafleftsib, xlrec->leafrightsib, xlrec->downlink);
 				break;
 			}
 		case XLOG_BTREE_NEWROOT:
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 6628886..7b26f98 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -215,11 +215,10 @@ typedef struct BTMetaPageData
 #define XLOG_BTREE_SPLIT_L_ROOT 0x50	/* add tuple with split of root */
 #define XLOG_BTREE_SPLIT_R_ROOT 0x60	/* as above, new item on right */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
-#define XLOG_BTREE_DELETE_PAGE	0x80	/* delete an entire page */
-#define XLOG_BTREE_DELETE_PAGE_META 0x90		/* same, and update metapage */
+#define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
+#define XLOG_BTREE_UNLINK_PAGE_META 0x90 /* same, and update metapage */
 #define XLOG_BTREE_NEWROOT		0xA0	/* new root page */
-#define XLOG_BTREE_DELETE_PAGE_HALF 0xB0		/* page deletion that makes
-												 * parent half-dead */
+#define XLOG_BTREE_MARK_PAGE_HALFDEAD 0xB0 /* mark a leaf as half-dead */
 #define XLOG_BTREE_VACUUM		0xC0	/* delete entries on a page during
 										 * vacuum */
 #define XLOG_BTREE_REUSE_PAGE	0xD0	/* old page is about to be reused from
@@ -370,23 +369,49 @@ typedef struct xl_btree_vacuum
 #define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
 
 /*
- * This is what we need to know about deletion of a btree page.  The target
- * identifies the tuple removed from the parent page (note that we remove
- * this tuple's downlink and the *following* tuple's key).	Note we do not
- * store any content for the deleted page --- it is just rewritten as empty
- * during recovery, apart from resetting the btpo.xact.
+ * This is what we need to know about marking an empty branch for deletion.
+ * The target identifies the tuple removed from the parent page (note that we
+ * remove this tuple's downlink and the *following* tuple's key).  Note that
+ * the leaf page is empty, so we don't need to store its content --- it is
+ * just reinitialized during recovery using the rest of the fields.
  */
-typedef struct xl_btree_delete_page
+typedef struct xl_btree_mark_page_halfdead
 {
 	xl_btreetid target;			/* deleted tuple id in parent page */
-	BlockNumber deadblk;		/* child block being deleted */
-	BlockNumber leftblk;		/* child block's left sibling, if any */
-	BlockNumber rightblk;		/* child block's right sibling */
+	BlockNumber leafblk;		/* leaf block ultimately being deleted */
+	BlockNumber leftblk;		/* leaf block's left sibling, if any */
+	BlockNumber rightblk;		/* leaf block's right sibling */
+	BlockNumber downlink;		/* next child down in the branch */
+} xl_btree_mark_page_halfdead;
+
+#define SizeOfBtreeMarkPageHalfDead	(offsetof(xl_btree_mark_page_halfdead, downlink) + sizeof(BlockNumber))
+
+/*
+ * This is what we need to know about deletion of a btree page.  Note we do
+ * not store any content for the deleted page --- it is just rewritten as empty
+ * during recovery, apart from resetting the btpo.xact.
+ */
+typedef struct xl_btree_unlink_page
+{
+	RelFileNode	node;
+	BlockNumber deadblk;		/* target block being deleted */
+	BlockNumber leftsib;		/* taregt block's left sibling, if any */
+	BlockNumber rightsib;		/* target block's right sibling */
+
+	/*
+	 * Information needed to recreate the leaf page, when target is an internal
+	 * page.
+	 */
+	BlockNumber	leafblk;
+	BlockNumber	leafleftsib;
+	BlockNumber	leafrightsib;
+	BlockNumber	downlink;		/* next child down in the branch */
+
 	TransactionId btpo_xact;	/* value of btpo.xact for use in recovery */
-	/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_DELETE_PAGE_META */
-} xl_btree_delete_page;
+	/* xl_btree_metadata FOLLOWS IF XLOG_BTREE_UNLINK_PAGE_META */
+} xl_btree_unlink_page;
 
-#define SizeOfBtreeDeletePage	(offsetof(xl_btree_delete_page, btpo_xact) + sizeof(TransactionId))
+#define SizeOfBtreeUnlinkPage	(offsetof(xl_btree_unlink_page, btpo_xact) + sizeof(TransactionId))
 
 /*
  * New root log record.  There are zero tuples if this is to establish an
@@ -639,7 +664,7 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
 					BlockNumber lastBlockVacuumed);
-extern int	_bt_pagedel(Relation rel, Buffer buf, BTStack stack);
+extern int	_bt_pagedel(Relation rel, Buffer buf);
 
 /*
  * prototypes for functions in nbtsearch.c