amcheck (B-Tree integrity checking tool)

Started by Peter Geogheganabout 10 years ago89 messageshackers

pg@bowt.ie

about 10 years ago

I was assigned an "action point" during the FOSDEM developer meeting:
"Post new version of btree consistency checker patch". I attach a new
WIP version of my consistency checker tool, amcheck. This patch is
proposed for 9.6, as an extension in contrib -- maybe we can still get
it in. This is the first time I've added any version of this to a
commitfest, although I've posted a couple of rough versions of this in
the past couple of years. The attached version has received a major
overhaul, and is primarily aimed at finding corruption in production
systems, although I think it will still have significant value for
debugging too. Maybe it can help with some of the B-Tree patches in
the final commitfest, for example. I also have some hope that it will
become a learning tool for people interested in how B-Tree indexes
work.

To recap, the extension adds some SQL-callable functions that verify
certain invariant conditions hold within some particular B-Tree index.
These are the conditions that index scans rely on always being true.
The tool's scope may eventually cover other AMs, including heapam, but
nbtree seems like the best place to start.

Note that no function currently checks that the index is consistent
with the heap, which would be very useful (that's probably how I'd
eventually target the heapam, actually).

Invariants
========

nbtree invariants that the tool verifies with just an AccessShareLock
on the relation are:

* That all items are in the correct, opclass order on each page.

* That the page "high key", if any, actually bounds the items on the page.

* That the last item on a page is less than or equal to the first item
on the next page (the page to its right). The idea here is that the
key space spans multiple pages, not just one page, so it make sense to
check the last item where we can.

With an ExclusiveLock + ShareLock, some addition invariants are verified:

* That child pages actually have their parent's downlink as a lower bound.

* Sane right links and left links at each level.

Obviously, this tool is all about distrusting the structure of a
B-Tree index. That being the case, it's not always totally clear where
to draw the line. I think I have the balance about right, though.

Interface
=======

There are only 2 SQL callable functions in the extension, which are
very similar:

bt_index_check(index regclass)

bt_index_parent_check(index regclass)

The latter is more thorough than the former -- it performs all checks,
including those checks that I mentioned required an ExclusiveLock. So,
bt_index_check() only requires an AccessShareLock.
bt_index_parent_check() requires an ExclusiveLock on the index
relation, and a ShareLock on its heap relation, almost like REINDEX.
bt_index_parent_check() performs verification that is a superset of
the verification performed by bt_index_check() -- mostly, the extra
verification/work is that it verifies downlinks against child pages.

Both functions raise an error in the event of observing that an
invariant in a B-Tree was violated, such as items being out of order
on a page. I've written extensive documentation, which goes into
practical aspects of using amcheck effectively. It doesn't go into
significant detail about every invariant that is checked, but gives a
good idea of roughly what checks are performed.

I could almost justify only having one function with an argument about
the downlink/child verification, but that would be a significant
footgun given the varying locking requirements that such a function
would have.

Locking
======

We never rely on something like holding on to a buffer pin as an
interlock for correctness (the vacuum interlock thing isn't generally
necessary, because we don't look at the heap at all). We simply pin +
BT_READ lock a buffer, copy it into local memory allocated by
palloc(), and then immediately release the buffer lock and drop the
pin. This is the same in all instances. There is never more than one
buffer lock or pin held at a time.

We do, on the other hand, have a detailed rationale for why it's okay
that we don't have an ExclusiveLock on the index relation for checks
that span the key space of more than one page by following right links
to compare items across sibling pages. This isn't the same thing as
having an explicit interlock like a pin -- our interlock is one
against *recycling* by vacuum, which is based on recentGlobalXmin.
This rationale requires expert review.

Performance
==========

Trying to keep the tool as simple as possible, while still making it
do verification that is as useful as possible was my priority here,
not performance. Still, verification completes fairly quickly.
Certainly, it takes far less time than having to REINDEX the index,
and doesn't need too much memory. I think that in practice most
problems that can be detected by the B-Tree checker functions will be
detected with the lighter variant.

--
Peter Geoghegan

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 10 years ago

In reply to: Peter Geoghegan (#1)

Re: amcheck (B-Tree integrity checking tool)

Hi,

I've looked at this patch today, mostly to educate myself, so this
probably should not count as a full review. Anyway, the patch seems in
excellent shape - it'd be great if all patches (including those written
by me) had this level of comments/docs.

On Mon, 2016-02-29 at 16:09 -0800, Peter Geoghegan wrote:

I was assigned an "action point" during the FOSDEM developer meeting:
"Post new version of btree consistency checker patch". I attach a new
WIP version of my consistency checker tool, amcheck.

...

To recap, the extension adds some SQL-callable functions that verify
certain invariant conditions hold within some particular B-Tree index.
These are the conditions that index scans rely on always being true.
The tool's scope may eventually cover other AMs, including heapam, but
nbtree seems like the best place to start.

Note that no function currently checks that the index is consistent
with the heap, which would be very useful (that's probably how I'd
eventually target the heapam, actually).

I'm afraid I don't understand what "target the heapam" means. Could you
elaborate?

Invariants
========

...

Obviously, this tool is all about distrusting the structure of a
B-Tree index. That being the case, it's not always totally clear where
to draw the line. I think I have the balance about right, though.

I agree. It'd be nice to have a tool that also checks for data
corruption the a lower level (e.g. that varlena headers are not
corrupted etc.) but that seems like a task for some other tool.

Interface
=======

There are only 2 SQL callable functions in the extension, which are
very similar:

bt_index_check(index regclass)

bt_index_parent_check(index regclass)

Can we come up with names that more clearly identify the difference
between those two functions? I mean, _parent_ does not make it
particularly obvious that the second function acquires exclusive lock
and performs more thorough checks.

This also applies to the name of the contrib module - it's not
particularly obvious what "amcheck" unless you're familiar with the
concept of access methods. Which is quite unlikely for regular users.
Maybe something like "check_index" would be more illustrative?

Locking
======

...

We do, on the other hand, have a detailed rationale for why it's okay
that we don't have an ExclusiveLock on the index relation for checks
that span the key space of more than one page by following right links
to compare items across sibling pages. This isn't the same thing as
having an explicit interlock like a pin -- our interlock is one
against *recycling* by vacuum, which is based on recentGlobalXmin.
This rationale requires expert review.

Well, I wouldn't count myself as an expert here, but do we actually need
such protection? I mean, we only do such checks when holding an
exclusive lock on the parent relation, no? And even if we don't vacuum
can only remove entries from the pages - why should that cause
violations of any invariants?

A few minor comments:

1) This should probably say "one block per level":

/*
* A B-Tree cannot possibly have this many levels, since there must be
* one block per page, and that is bound by the range of BlockNumber:
*/

2) comment before bt_check_every_level() says:

... assumed to prevent any kind of physically modification ...

probably should be "physical modification" instead.

3) s/targecontext/targetcontext/ in BtreeCheckState comment

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

about 10 years ago

In reply to: Peter Geoghegan (#1)

Re: amcheck (B-Tree integrity checking tool)

01.03.2016 03:09, Peter Geoghegan:

I was assigned an "action point" during the FOSDEM developer meeting:
"Post new version of btree consistency checker patch". I attach a new
WIP version of my consistency checker tool, amcheck. This patch is
proposed for 9.6, as an extension in contrib -- maybe we can still get
it in. This is the first time I've added any version of this to a
commitfest, although I've posted a couple of rough versions of this in
the past couple of years. The attached version has received a major
overhaul, and is primarily aimed at finding corruption in production
systems, although I think it will still have significant value for
debugging too. Maybe it can help with some of the B-Tree patches in
the final commitfest, for example. I also have some hope that it will
become a learning tool for people interested in how B-Tree indexes
work.

To recap, the extension adds some SQL-callable functions that verify
certain invariant conditions hold within some particular B-Tree index.
These are the conditions that index scans rely on always being true.
The tool's scope may eventually cover other AMs, including heapam, but
nbtree seems like the best place to start.

Note that no function currently checks that the index is consistent
with the heap, which would be very useful (that's probably how I'd
eventually target the heapam, actually).

Hi,
thank you for the contrib module. The patch itself looks great.
It was really useful to have such a tool while working on b-tree patches.

Invariants
========

nbtree invariants that the tool verifies with just an AccessShareLock
on the relation are:

* That all items are in the correct, opclass order on each page.

* That the page "high key", if any, actually bounds the items on the page.

* That the last item on a page is less than or equal to the first item
on the next page (the page to its right). The idea here is that the
key space spans multiple pages, not just one page, so it make sense to
check the last item where we can.

With an ExclusiveLock + ShareLock, some addition invariants are verified:

* That child pages actually have their parent's downlink as a lower bound.

* Sane right links and left links at each level.

Obviously, this tool is all about distrusting the structure of a
B-Tree index. That being the case, it's not always totally clear where
to draw the line. I think I have the balance about right, though.

Interface
=======

There are only 2 SQL callable functions in the extension, which are
very similar:

bt_index_check(index regclass)

bt_index_parent_check(index regclass)

The latter is more thorough than the former -- it performs all checks,
including those checks that I mentioned required an ExclusiveLock. So,
bt_index_check() only requires an AccessShareLock.
bt_index_parent_check() requires an ExclusiveLock on the index
relation, and a ShareLock on its heap relation, almost like REINDEX.
bt_index_parent_check() performs verification that is a superset of
the verification performed by bt_index_check() -- mostly, the extra
verification/work is that it verifies downlinks against child pages.

Both functions raise an error in the event of observing that an
invariant in a B-Tree was violated, such as items being out of order
on a page. I've written extensive documentation, which goes into
practical aspects of using amcheck effectively. It doesn't go into
significant detail about every invariant that is checked, but gives a
good idea of roughly what checks are performed.

I could almost justify only having one function with an argument about
the downlink/child verification, but that would be a significant
footgun given the varying locking requirements that such a function
would have.

I've done some testing. And I can say that both announced functions work
well.
BTW, if you know a good way to corrupt index (and do it reproducible)
I'd be very glad to see it.

I created an index, then opened the file and changed the order of elements.
And bt_index_check() found the error successfully.

/* Ensure that the data is changed. */
SELECT * FROM bt_page_items('idx', 1);

itemoffset | ctid | itemlen | nulls | vars | data
------------+-------+---------+-------+------+-------------------------
1 | (0,3) | 16 | f | f | 03 00 00 00 00 00 00 00
2 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00
3 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00
(3 rows)

/* Great, amcheck found an error. */
select bt_index_check('idx');
ERROR: page order invariant violated for index "idx"
DETAIL: Lower index tid=(1,1) (points to heap tid=(0,3)) higher index
tid=(1,2) (points to heap tid=(0,2)) page lsn=0/17800D0.

Locking
======

We never rely on something like holding on to a buffer pin as an
interlock for correctness (the vacuum interlock thing isn't generally
necessary, because we don't look at the heap at all). We simply pin +
BT_READ lock a buffer, copy it into local memory allocated by
palloc(), and then immediately release the buffer lock and drop the
pin. This is the same in all instances. There is never more than one
buffer lock or pin held at a time.

We do, on the other hand, have a detailed rationale for why it's okay
that we don't have an ExclusiveLock on the index relation for checks
that span the key space of more than one page by following right links
to compare items across sibling pages. This isn't the same thing as
having an explicit interlock like a pin -- our interlock is one
against *recycling* by vacuum, which is based on recentGlobalXmin.
This rationale requires expert review.

I'm not an expert, but I promise to take a closer look at locking.
I will send another email soon.

Performance
==========

Trying to keep the tool as simple as possible, while still making it
do verification that is as useful as possible was my priority here,
not performance. Still, verification completes fairly quickly.
Certainly, it takes far less time than having to REINDEX the index,
and doesn't need too much memory. I think that in practice most
problems that can be detected by the B-Tree checker functions will be
detected with the lighter variant.

I do not see any performance issues.
I'm sure that if someone wants to check whether the index is corrupted,
he certainly can wait a minute (.

But I have some concerns about compatibility with my patches.
I've tried to call bt_index_check() over my "including" patch [1]https://commitfest.postgresql.org/9/433/ and
caught a segfault.

LOG: server process (PID 31794) was terminated by signal 11:
Segmentation fault
DETAIL: Failed process was running: select bt_index_check('idx');

I do hope that my patch will be accepted in 9.6, so this conflict looks
really bad.
I think that error is caused by changes in pages layout. To save some
space nonkey attributes are truncated
when btree copies the indexed value into inner pages or into high key.
You can look at index_reform_tuple() calls.

I wonder, whether you can add some check of catalog version to your
module or this case requires more changes?

With second patch which implements compression [2]https://commitfest.postgresql.org/9/494/, amcheck causes
another error.
postgres=# insert into tbl select 1 from generate_series(0,5);
INSERT 0 6
postgres=# SELECT * FROM bt_page_items('idx', 1);
itemoffset | ctid | itemlen | nulls | vars | data

------------+----------------+---------+-------+------+----------------------------------------------------------------------------------------
---------------------------------------------------------
1 | (2147483664,6) | 56 | f | f | 01 00 00 00 00
00 00 00 00 00 00 00 01 00 00 00 00 00 02 00 00 00 00 00 03 00 00 00 00
00 04 00 00 00 00 00 05 00 00 00 00 00 06 00 00 00 00 00

postgres=# select bt_index_check('idx');
ERROR: cannot check index "idx"
DETAIL: index is not yet ready for insertions

But I'm sure that it's a problem of my patch. So I'll fix it and try again.

[1]: https://commitfest.postgresql.org/9/433/
[2]: https://commitfest.postgresql.org/9/494/

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Michael Paquier

michael@paquier.xyz

about 10 years ago

In reply to: Anastasia Lubennikova (#3)

Re: amcheck (B-Tree integrity checking tool)

On Fri, Mar 11, 2016 at 5:19 PM, Anastasia Lubennikova wrote:

BTW, if you know a good way to corrupt index (and do it reproducible) I'd be
very glad to see it.

You can use for example dd in non-truncate mode to corrupt on-disk
page data, say that for example:
dd if=/dev/random bs=8192 count=1 \
seek=$BLOCK_ID of=base/$DBOID/$RELFILENODE \
conv=notrunc
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tomas Vondra

tomas.vondra@2ndquadrant.com

about 10 years ago

In reply to: Michael Paquier (#4)

Re: amcheck (B-Tree integrity checking tool)

Hi,

On Fri, 2016-03-11 at 17:24 +0100, Michael Paquier wrote:

On Fri, Mar 11, 2016 at 5:19 PM, Anastasia Lubennikova wrote:

BTW, if you know a good way to corrupt index (and do it
reproducible) I'd be
very glad to see it.

You can use for example dd in non-truncate mode to corrupt on-disk
page data, say that for example:
dd if=/dev/random bs=8192 count=1 \
seek=$BLOCK_ID of=base/$DBOID/$RELFILENODE \
conv=notrunc

I guess /dev/random is not very compatible with the "reproducible"
requirement. I mean, it will reproducibly break the page, but pretty
much completely, which is mostly what checksums are for.

I think the main goal of the amcheck module is to identify issues that
are much more subtle - perhaps a coding error or unhandled cornercase
which results in page with a valid structure and checksum, but broken
page. Or perhaps issues coming from outside of PostgreSQL, like a
subtle change in the locales.

Simulating those issues requires a tool that allows minor tweaks to the
page, dd is a bit too dull for that.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

amcheck (B-Tree integrity checking tool)

Attachments: