9.4 btree index corruption

Started by Jeff Janes over 11 years ago, 5 messages
#1 Jeff Janes
jeff.janes@gmail.com

Testing partial-write crash-recovery in 9.4 (e12d7320ca494fd05134847e30)
with foreign keys, I found some btree index corruption.

28807 VACUUM 2014-05-21 15:33:46.878 PDT:ERROR: right sibling 4044 of
block 460 is not next child 23513 of block 1264 in index "foo_p_id_idx"
28807 VACUUM 2014-05-21 15:33:46.878 PDT:STATEMENT: VACUUM;

It took ~8 hours on 8 cores to encounter this problem. This is a single
occurrence; it has not yet been reproduced.

I don't know whether the partial-write, crash-recovery, or foreign-key
parts of this test are important--it could be a more generic problem
that only happened to be observed here. Nor do I know yet whether it
occurs in 9_3_STABLE.

Below are the testing harness and the data directory (massively bloated at
3.7GB once uncompressed). It is currently in wrap-around shutdown, but
that is the effect of persistent vacuum failures, not the cause of them.
You can restart the data directory and it will repeat the above sibling
error once autovac kicks in. I don't know if the bloat is due to the
vacuum failure or if it was already in progress before the failures started.
I've cranked up the logging on that front for future efforts.

I'm using some fast-forward code on the xid consumption so that freezing
occurs more often. Some people have expressed reservations that the code
might be imperfect, and I can't rule that out as the cause (but I've never
traced any other problems back to that code). It did make it through 4
complete wraps before this problem was encountered, though, so if that is
the problem it must be probabilistic rather than deterministic.

https://drive.google.com/folderview?id=0Bzqrh1SO9FcENWd6ZXlwVWpxU0E&usp=sharing

Cheers,

Jeff

#2 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Jeff Janes (#1)
1 attachment(s)
Re: 9.4 btree index corruption

On 05/21/2014 10:22 PM, Jeff Janes wrote:

Testing partial-write crash-recovery in 9.4 (e12d7320ca494fd05134847e30)
with foreign keys, I found some btree index corruption.

28807 VACUUM 2014-05-21 15:33:46.878 PDT:ERROR: right sibling 4044 of
block 460 is not next child 23513 of block 1264 in index "foo_p_id_idx"
28807 VACUUM 2014-05-21 15:33:46.878 PDT:STATEMENT: VACUUM;

It took ~8 hours on 8 cores to encounter this problem. This is a single
occurrence; it has not yet been reproduced.

I don't know whether the partial-write, crash-recovery, or foreign-key
parts of this test are important--it could be a more generic problem
that only happened to be observed here. Nor do I know yet whether it
occurs in 9_3_STABLE.

Below are the testing harness and the data directory (massively bloated at
3.7GB once uncompressed). It is currently in wrap-around shutdown, but
that is the effect of persistent vacuum failures, not the cause of them.
You can restart the data directory and it will repeat the above sibling
error once autovac kicks in. I don't know if the bloat is due to the
vacuum failure or if it was already in progress before the failures started.
I've cranked up the logging on that front for future efforts.

I'm using some fast-forward code on the xid consumption so that freezing
occurs more often. Some people have expressed reservations that the code
might be imperfect, and I can't rule that out as the cause (but I've never
traced any other problems back to that code). It did make it through 4
complete wraps before this problem was encountered, though, so if that is
the problem it must be probabilistic rather than deterministic.

https://drive.google.com/folderview?id=0Bzqrh1SO9FcENWd6ZXlwVWpxU0E&usp=sharing

I downloaded the data directory and investigated. I got this message
when I started it up:

20392 2014-05-25 05:51:37.835 PDT:ERROR: right sibling 4044 of block
460 is not next child 23513 of block 86458 in index "foo_p_id_idx"
20392 2014-05-25 05:51:37.835 PDT:CONTEXT: automatic vacuum of table
"jjanes.public.foo"

Interestingly, it's complaining about parent page 86458, while yours
claimed it was 1264. I don't know why; perhaps a huge number of
insertions happened after that error, causing the parent level pages to
be split, moving the downlinks it complains about to the right. Did you
continue running the test after that error occurred?

This is what the tree looks like around those pages:

Level 1:
+-------------+     +-------------+     +-------------+
| Blk 1264    |     | Blk 160180  |     | Blk 86458   |
|             |     |             |     |             |
| Downlinks:  | <-> | Downlinks:  | <-> | Downlinks:  |
| ...         |     | ...         |     |     1269    |
|             |     |             |     |      460    |
|             |     |             |     |    23513    |
+-------------+     +-------------+     +-------------+
Leaf level:
+-------------+     +---------+     +-----------+     +-------------+
| Blk 1269    |     | Blk 460 |     | Blk 4044  |     | Blk 23513   |
|             | <-> |         | <-> | HALF_DEAD | <-> |             |
| (314 items) |     | (empty) |     |           |     | (212 items) |
+-------------+     +---------+     +-----------+     +-------------+

Leaf block 4044 is marked as half-dead, and there is no downlink to it
on the parent block, 86458. That is normal when a vacuum is interrupted
during page deletion (in 9.4); the next vacuum ought to continue it.
However, because block 460 is empty, the next vacuum that comes along
will try to start the page deletion process on that block. Page deletion
does a cross-check at the parent level, checking that the next downlink
on the parent page points to the right sibling of the page we're about
to delete. However, because the right sibling is already marked as
half-dead, and the downlink pointing to it has already been removed, the
cross-check fails.
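
For reference, a state like this can also be inspected from SQL with the
contrib pageinspect extension. A rough sketch, using the index name and
block numbers from the diagram above:

CREATE EXTENSION IF NOT EXISTS pageinspect;

-- the empty leaf (460) and its half-dead right sibling (4044)
SELECT blkno, type, live_items, btpo_prev, btpo_next, btpo_flags
FROM bt_page_stats('foo_p_id_idx', 460);

SELECT blkno, type, live_items, btpo_prev, btpo_next, btpo_flags
FROM bt_page_stats('foo_p_id_idx', 4044);

A half-dead page shows up there with type 'e' and with the BTP_HALF_DEAD
bit (1 << 4, i.e. 16) set in btpo_flags.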

So, the sanity check is a bit overzealous. It's normal that the right
sibling has no downlink, if it's marked as half-dead. The trivial fix is
to just remove the cross-check, but I'd rather not do that because
generally-speaking more sanity checks are good.

I think we could fetch the right sibling page to see if it's marked as
half-dead, and if it is, just give up on deleting the current page. This
vacuum will eventually get to the right sibling page and finish the
deletion, and the next vacuum will then be able to delete the empty
page. Or we could finish the deletion of the right page immediately, but
I'd like to keep the rarely-occurring corner-case codepaths to a minimum.

With the attached patch, I was able to get past that error, but when
VACUUM reached the end, I got this:

jjanes=# vacuum foo;
ERROR: database is not accepting commands to avoid wraparound data loss
in database "jjanes"
HINT: Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions.

That seems like an unrelated issue, I'll start a new thread about that.

Many thanks for the testing, Jeff!

- Heikki

Attachments:

fix-deletion-with-half-dead-right-sibling-1.patch (text/x-diff)
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index d357b33..fd7e9b5 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -954,18 +954,36 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 	END_CRIT_SECTION();
 }
 
+
+static bool
+_bt_is_page_halfdead(Relation rel, BlockNumber blk)
+{
+	Buffer		buf;
+	Page		page;
+	BTPageOpaque opaque;
+	bool		result;
+
+	buf = _bt_getbuf(rel, blk, BT_READ);
+	page = BufferGetPage(buf);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	result = P_ISHALFDEAD(opaque);
+	_bt_relbuf(rel, buf);
+
+	return result;
+}
+
 /*
  * Subroutine to find the parent of the branch we're deleting.  This climbs
  * up the tree until it finds a page with more than one child, i.e. a page
  * that will not be totally emptied by the deletion.  The chain of pages below
- * it, with one downlink each, will be part of the branch that we need to
- * delete.
+ * it, with one downlink each, will form the branch that we need to delete.
  *
  * If we cannot remove the downlink from the parent, because it's the
  * rightmost entry, returns false.  On success, *topparent and *topoff are set
  * to the buffer holding the parent, and the offset of the downlink in it.
  * *topparent is write-locked, the caller is responsible for releasing it when
- * done.  *target is set to the topmost page in the branch to-be-deleted, ie.
+ * done.  *target is set to the topmost page in the branch to-be-deleted, i.e.
  * the page whose downlink *topparent / *topoff point to, and *rightsib to its
  * right sibling.
  *
@@ -994,7 +1012,7 @@ _bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
 	BTPageOpaque opaque;
 	BlockNumber leftsib;
 
-	/* Locate the parent's downlink (updating the stack entry if needed) */
+	/* Locate the downlink in the parent (updating the stack entry if needed) */
 	ItemPointerSet(&(stack->bts_btentry.t_tid), child, P_HIKEY);
 	pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
 	if (pbuf == InvalidBuffer)
@@ -1066,6 +1084,13 @@ _bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
 				_bt_relbuf(rel, lbuf);
 			}
 
+			if (_bt_is_page_halfdead(rel, *rightsib))
+			{
+				elog(LOG, "could not delete page %u because its right sibling %u is half-dead",
+					 parent, *rightsib);
+				return false;
+			}
+
 			return _bt_lock_branch_parent(rel, parent, stack->bts_parent,
 										topparent, topoff, target, rightsib);
 		}
@@ -1234,6 +1259,13 @@ _bt_pagedel(Relation rel, Buffer buf)
 					lbuf = _bt_getbuf(rel, leftsib, BT_READ);
 					lpage = BufferGetPage(lbuf);
 					lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
+					/*
+					 * If the left sibling is split again by another backend,
+					 * after we released the lock, we know that the first
+					 * split must be finished, because we don't allow an
+					 * incompletely-split page to be split again. So we don't
+					 * need to walk right here.
+					 */
 					if (lopaque->btpo_next == BufferGetBlockNumber(buf) &&
 						P_INCOMPLETE_SPLIT(lopaque))
 					{
@@ -1338,6 +1370,13 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	leafblkno = BufferGetBlockNumber(leafbuf);
 	leafrightsib = opaque->btpo_next;
 
+	if (_bt_is_page_halfdead(rel, leafrightsib))
+	{
+		elog(LOG, "could not delete page %u because its right sibling %u is half-dead",
+			 leafblkno, leafrightsib);
+		return false;
+	}
+
 	/*
 	 * We cannot delete a page that is the rightmost child of its immediate
 	 * parent, unless it is the only child --- in which case the parent has to
#3 Jeff Janes
jeff.janes@gmail.com
In reply to: Heikki Linnakangas (#2)
Re: 9.4 btree index corruption

On Sun, May 25, 2014 at 7:16 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 05/21/2014 10:22 PM, Jeff Janes wrote:

Testing partial-write crash-recovery in 9.4 (e12d7320ca494fd05134847e30)
with foreign keys, I found some btree index corruption.

...

https://drive.google.com/folderview?id=0Bzqrh1SO9FcENWd6ZXlwVWpxU0E&usp=sharing

I downloaded the data directory and investigated. I got this message when
I started it up:

20392 2014-05-25 05:51:37.835 PDT:ERROR: right sibling 4044 of block 460
is not next child 23513 of block 86458 in index "foo_p_id_idx"
20392 2014-05-25 05:51:37.835 PDT:CONTEXT: automatic vacuum of table
"jjanes.public.foo"

Interestingly, it's complaining about parent page 86458, while yours
claimed it was 1264. I don't know why; perhaps a huge number of insertions
happened after that error, causing the parent level pages to be split,
moving the downlinks it complains about to the right. Did you continue
running the test after that error occurred?

Yes, I didn't set it up to abort upon vacuum errors, so it continued to run
until it reached the 1,000,000 wraparound limit.

Between when the vacuums started failing and when it reached the wraparound
limit, it seemed to run normally (other than the increasing bloat and
non-advancement of frozenxid).

I only noticed the ERROR when I was looking through the log file in
postmortem. Looking back, I see 10 errors with 1264, then it switched to
86458. I didn't notice the change and reported only the first ERROR
message.

I'll apply your patch and see what happens, but 90 further hours of testing
gave no more occurrences of the error, so I'm afraid that not seeing errors
is not much evidence that the error was fixed. So I'll have to rely on
your knowledge of the code.

This is what the tree looks like around those pages:

Level 1:
+-------------+     +-------------+     +-------------+
| Blk 1264    |     | Blk 160180  |     | Blk 86458   |
|             |     |             |     |             |
| Downlinks:  | <-> | Downlinks:  | <-> | Downlinks:  |
| ...         |     | ...         |     |     1269    |
|             |     |             |     |      460    |
|             |     |             |     |    23513    |
+-------------+     +-------------+     +-------------+

Is this done with the pageinspect extension?

With the attached patch, I was able to get past that error, but when
VACUUM reached the end, I got this:

jjanes=# vacuum foo;
ERROR: database is not accepting commands to avoid wraparound data loss
in database "jjanes"
HINT: Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions.

That seems like an unrelated issue, I'll start a new thread about that.

Yeah, that's expected. Since vacuum failures don't abort the test, it ran
on until something else stopped it.

Many thanks for the testing, Jeff!

You're welcome. If only I was as good at fixing things as at breaking
them.

Cheers,

Jeff

#4 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Jeff Janes (#3)
Re: 9.4 btree index corruption

On 05/25/2014 05:45 PM, Jeff Janes wrote:

On Sun, May 25, 2014 at 7:16 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 05/21/2014 10:22 PM, Jeff Janes wrote:

Testing partial-write crash-recovery in 9.4 (e12d7320ca494fd05134847e30)
with foreign keys, I found some btree index corruption.

...

https://drive.google.com/folderview?id=0Bzqrh1SO9FcENWd6ZXlwVWpxU0E&usp=sharing

I downloaded the data directory and investigated. I got this message when
I started it up:

20392 2014-05-25 05:51:37.835 PDT:ERROR: right sibling 4044 of block 460
is not next child 23513 of block 86458 in index "foo_p_id_idx"
20392 2014-05-25 05:51:37.835 PDT:CONTEXT: automatic vacuum of table
"jjanes.public.foo"

Interestingly, it's complaining about parent page 86458, while yours
claimed it was 1264. I don't know why; perhaps a huge number of insertions
happened after that error, causing the parent level pages to be split,
moving the downlinks it complains about to the right. Did you continue
running the test after that error occurred?

Yes, I didn't set it up to abort upon vacuum errors, so it continued to run
until it reached the 1,000,000 wraparound limit.

Between when the vacuums started failing and when it reached the wraparound
limit, it seemed to run normally (other than the increasing bloat and
non-advancement of frozenxid).

I only noticed the ERROR when I was looking through the log file in
postmortem. Looking back, I see 10 errors with 1264, then it switched to
86458. I didn't notice the change and reported only the first ERROR
message.

Ok, that explains the difference in the error message.

I'll apply your patch and see what happens, but 90 further hours of testing
gave no more occurrences of the error, so I'm afraid that not seeing errors
is not much evidence that the error was fixed. So I'll have to rely on
your knowledge of the code.

I've pushed the fix.

This is what the tree looks like around those pages:

Level 1:
+-------------+     +-------------+     +-------------+
| Blk 1264    |     | Blk 160180  |     | Blk 86458   |
|             |     |             |     |             |
| Downlinks:  | <-> | Downlinks:  | <-> | Downlinks:  |
| ...         |     | ...         |     |     1269    |
|             |     |             |     |      460    |
|             |     |             |     |    23513    |
+-------------+     +-------------+     +-------------+

Is this done with the pageinspect extension?

Yeah, with some more tooling on top of it. The elegant diagram I drew by
hand.
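
If you want to poke at it yourself, the raw ingredients are roughly the
following (a sketch only, without the extra tooling; the index name and
block numbers are the ones from the diagram):

-- root block and tree height, from the metapage
SELECT * FROM bt_metap('foo_p_id_idx');

-- items on the level-1 page 86458: the block-number half of each ctid is
-- the child page its downlink points to (on non-rightmost pages the first
-- item is the high key rather than a downlink)
SELECT itemoffset, ctid FROM bt_page_items('foo_p_id_idx', 86458);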

Thanks again! BTW, it would be useful to keep the WAL logs for as long
as you have disk space. I'm not sure if that would've helped in this
case, but it can be really useful to see all the actions that were done
to the page before something goes wrong.

- Heikki

#5 Andres Freund
andres@2ndquadrant.com
In reply to: Jeff Janes (#3)
Re: 9.4 btree index corruption

Hi,

On 2014-05-25 14:45:38 -0700, Jeff Janes wrote:

You're welcome. If only I was as good at fixing things as at breaking
them.

At the moment there seems to be a bigger shortage of folks being good at
breaking stuff - before the release! - than of people fixing the
resulting breakage...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
