amcheck (B-Tree integrity checking tool)
I was assigned an "action point" during the FOSDEM developer meeting:
"Post new version of btree consistency checker patch". I attach a new
WIP version of my consistency checker tool, amcheck. This patch is
proposed for 9.6, as an extension in contrib -- maybe we can still get
it in. This is the first time I've added any version of this to a
commitfest, although I've posted a couple of rough versions of this in
the past couple of years. The attached version has received a major
overhaul, and is primarily aimed at finding corruption in production
systems, although I think it will still have significant value for
debugging too. Maybe it can help with some of the B-Tree patches in
the final commitfest, for example. I also have some hope that it will
become a learning tool for people interested in how B-Tree indexes
work.
To recap, the extension adds some SQL-callable functions that verify
certain invariant conditions hold within some particular B-Tree index.
These are the conditions that index scans rely on always being true.
The tool's scope may eventually cover other AMs, including heapam, but
nbtree seems like the best place to start.
Note that no function currently checks that the index is consistent
with the heap, which would be very useful (that's probably how I'd
eventually target the heapam, actually).
Invariants
========
nbtree invariants that the tool verifies with just an AccessShareLock
on the relation are:
* That all items are in the correct, opclass order on each page.
* That the page "high key", if any, actually bounds the items on the page.
* That the last item on a page is less than or equal to the first item
on the next page (the page to its right). The idea here is that the
key space spans multiple pages, not just one page, so it make sense to
check the last item where we can.
With an ExclusiveLock + ShareLock, some addition invariants are verified:
* That child pages actually have their parent's downlink as a lower bound.
* Sane right links and left links at each level.
Obviously, this tool is all about distrusting the structure of a
B-Tree index. That being the case, it's not always totally clear where
to draw the line. I think I have the balance about right, though.
Interface
=======
There are only 2 SQL callable functions in the extension, which are
very similar:
bt_index_check(index regclass)
bt_index_parent_check(index regclass)
The latter is more thorough than the former -- it performs all checks,
including those checks that I mentioned required an ExclusiveLock. So,
bt_index_check() only requires an AccessShareLock.
bt_index_parent_check() requires an ExclusiveLock on the index
relation, and a ShareLock on its heap relation, almost like REINDEX.
bt_index_parent_check() performs verification that is a superset of
the verification performed by bt_index_check() -- mostly, the extra
verification/work is that it verifies downlinks against child pages.
Both functions raise an error in the event of observing that an
invariant in a B-Tree was violated, such as items being out of order
on a page. I've written extensive documentation, which goes into
practical aspects of using amcheck effectively. It doesn't go into
significant detail about every invariant that is checked, but gives a
good idea of roughly what checks are performed.
I could almost justify only having one function with an argument about
the downlink/child verification, but that would be a significant
footgun given the varying locking requirements that such a function
would have.
Locking
======
We never rely on something like holding on to a buffer pin as an
interlock for correctness (the vacuum interlock thing isn't generally
necessary, because we don't look at the heap at all). We simply pin +
BT_READ lock a buffer, copy it into local memory allocated by
palloc(), and then immediately release the buffer lock and drop the
pin. This is the same in all instances. There is never more than one
buffer lock or pin held at a time.
We do, on the other hand, have a detailed rationale for why it's okay
that we don't have an ExclusiveLock on the index relation for checks
that span the key space of more than one page by following right links
to compare items across sibling pages. This isn't the same thing as
having an explicit interlock like a pin -- our interlock is one
against *recycling* by vacuum, which is based on recentGlobalXmin.
This rationale requires expert review.
Performance
==========
Trying to keep the tool as simple as possible, while still making it
do verification that is as useful as possible was my priority here,
not performance. Still, verification completes fairly quickly.
Certainly, it takes far less time than having to REINDEX the index,
and doesn't need too much memory. I think that in practice most
problems that can be detected by the B-Tree checker functions will be
detected with the lighter variant.
--
Peter Geoghegan
Attachments:
0001-Add-amcheck-extension-to-contrib.patchtext/x-patch; charset=US-ASCII; name=0001-Add-amcheck-extension-to-contrib.patchDownload+1407-1
Hi,
I've looked at this patch today, mostly to educate myself, so this
probably should not count as a full review. Anyway, the patch seems in
excellent shape - it'd be great if all patches (including those written
by me) had this level of comments/docs.
On Mon, 2016-02-29 at 16:09 -0800, Peter Geoghegan wrote:
I was assigned an "action point" during the FOSDEM developer meeting:
"Post new version of btree consistency checker patch". I attach a new
WIP version of my consistency checker tool, amcheck.
...
To recap, the extension adds some SQL-callable functions that verify
certain invariant conditions hold within some particular B-Tree index.
These are the conditions that index scans rely on always being true.
The tool's scope may eventually cover other AMs, including heapam, but
nbtree seems like the best place to start.Note that no function currently checks that the index is consistent
with the heap, which would be very useful (that's probably how I'd
eventually target the heapam, actually).
I'm afraid I don't understand what "target the heapam" means. Could you
elaborate?
Invariants
========
...
Obviously, this tool is all about distrusting the structure of a
B-Tree index. That being the case, it's not always totally clear where
to draw the line. I think I have the balance about right, though.
I agree. It'd be nice to have a tool that also checks for data
corruption the a lower level (e.g. that varlena headers are not
corrupted etc.) but that seems like a task for some other tool.
Interface
=======There are only 2 SQL callable functions in the extension, which are
very similar:bt_index_check(index regclass)
bt_index_parent_check(index regclass)
Can we come up with names that more clearly identify the difference
between those two functions? I mean, _parent_ does not make it
particularly obvious that the second function acquires exclusive lock
and performs more thorough checks.
This also applies to the name of the contrib module - it's not
particularly obvious what "amcheck" unless you're familiar with the
concept of access methods. Which is quite unlikely for regular users.
Maybe something like "check_index" would be more illustrative?
Locking
======
...
We do, on the other hand, have a detailed rationale for why it's okay
that we don't have an ExclusiveLock on the index relation for checks
that span the key space of more than one page by following right links
to compare items across sibling pages. This isn't the same thing as
having an explicit interlock like a pin -- our interlock is one
against *recycling* by vacuum, which is based on recentGlobalXmin.
This rationale requires expert review.
Well, I wouldn't count myself as an expert here, but do we actually need
such protection? I mean, we only do such checks when holding an
exclusive lock on the parent relation, no? And even if we don't vacuum
can only remove entries from the pages - why should that cause
violations of any invariants?
A few minor comments:
1) This should probably say "one block per level":
/*
* A B-Tree cannot possibly have this many levels, since there must be
* one block per page, and that is bound by the range of BlockNumber:
*/
2) comment before bt_check_every_level() says:
... assumed to prevent any kind of physically modification ...
probably should be "physical modification" instead.
3) s/targecontext/targetcontext/ in BtreeCheckState comment
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
01.03.2016 03:09, Peter Geoghegan:
I was assigned an "action point" during the FOSDEM developer meeting:
"Post new version of btree consistency checker patch". I attach a new
WIP version of my consistency checker tool, amcheck. This patch is
proposed for 9.6, as an extension in contrib -- maybe we can still get
it in. This is the first time I've added any version of this to a
commitfest, although I've posted a couple of rough versions of this in
the past couple of years. The attached version has received a major
overhaul, and is primarily aimed at finding corruption in production
systems, although I think it will still have significant value for
debugging too. Maybe it can help with some of the B-Tree patches in
the final commitfest, for example. I also have some hope that it will
become a learning tool for people interested in how B-Tree indexes
work.To recap, the extension adds some SQL-callable functions that verify
certain invariant conditions hold within some particular B-Tree index.
These are the conditions that index scans rely on always being true.
The tool's scope may eventually cover other AMs, including heapam, but
nbtree seems like the best place to start.Note that no function currently checks that the index is consistent
with the heap, which would be very useful (that's probably how I'd
eventually target the heapam, actually).
Hi,
thank you for the contrib module. The patch itself looks great.
It was really useful to have such a tool while working on b-tree patches.
Invariants
========nbtree invariants that the tool verifies with just an AccessShareLock
on the relation are:* That all items are in the correct, opclass order on each page.
* That the page "high key", if any, actually bounds the items on the page.
* That the last item on a page is less than or equal to the first item
on the next page (the page to its right). The idea here is that the
key space spans multiple pages, not just one page, so it make sense to
check the last item where we can.With an ExclusiveLock + ShareLock, some addition invariants are verified:
* That child pages actually have their parent's downlink as a lower bound.
* Sane right links and left links at each level.
Obviously, this tool is all about distrusting the structure of a
B-Tree index. That being the case, it's not always totally clear where
to draw the line. I think I have the balance about right, though.Interface
=======There are only 2 SQL callable functions in the extension, which are
very similar:bt_index_check(index regclass)
bt_index_parent_check(index regclass)
The latter is more thorough than the former -- it performs all checks,
including those checks that I mentioned required an ExclusiveLock. So,
bt_index_check() only requires an AccessShareLock.
bt_index_parent_check() requires an ExclusiveLock on the index
relation, and a ShareLock on its heap relation, almost like REINDEX.
bt_index_parent_check() performs verification that is a superset of
the verification performed by bt_index_check() -- mostly, the extra
verification/work is that it verifies downlinks against child pages.Both functions raise an error in the event of observing that an
invariant in a B-Tree was violated, such as items being out of order
on a page. I've written extensive documentation, which goes into
practical aspects of using amcheck effectively. It doesn't go into
significant detail about every invariant that is checked, but gives a
good idea of roughly what checks are performed.I could almost justify only having one function with an argument about
the downlink/child verification, but that would be a significant
footgun given the varying locking requirements that such a function
would have.
I've done some testing. And I can say that both announced functions work
well.
BTW, if you know a good way to corrupt index (and do it reproducible)
I'd be very glad to see it.
I created an index, then opened the file and changed the order of elements.
And bt_index_check() found the error successfully.
/* Ensure that the data is changed. */
SELECT * FROM bt_page_items('idx', 1);
itemoffset | ctid | itemlen | nulls | vars | data
------------+-------+---------+-------+------+-------------------------
1 | (0,3) | 16 | f | f | 03 00 00 00 00 00 00 00
2 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00
3 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00
(3 rows)
/* Great, amcheck found an error. */
select bt_index_check('idx');
ERROR: page order invariant violated for index "idx"
DETAIL: Lower index tid=(1,1) (points to heap tid=(0,3)) higher index
tid=(1,2) (points to heap tid=(0,2)) page lsn=0/17800D0.
Locking
======We never rely on something like holding on to a buffer pin as an
interlock for correctness (the vacuum interlock thing isn't generally
necessary, because we don't look at the heap at all). We simply pin +
BT_READ lock a buffer, copy it into local memory allocated by
palloc(), and then immediately release the buffer lock and drop the
pin. This is the same in all instances. There is never more than one
buffer lock or pin held at a time.We do, on the other hand, have a detailed rationale for why it's okay
that we don't have an ExclusiveLock on the index relation for checks
that span the key space of more than one page by following right links
to compare items across sibling pages. This isn't the same thing as
having an explicit interlock like a pin -- our interlock is one
against *recycling* by vacuum, which is based on recentGlobalXmin.
This rationale requires expert review.
I'm not an expert, but I promise to take a closer look at locking.
I will send another email soon.
Performance
==========Trying to keep the tool as simple as possible, while still making it
do verification that is as useful as possible was my priority here,
not performance. Still, verification completes fairly quickly.
Certainly, it takes far less time than having to REINDEX the index,
and doesn't need too much memory. I think that in practice most
problems that can be detected by the B-Tree checker functions will be
detected with the lighter variant.
I do not see any performance issues.
I'm sure that if someone wants to check whether the index is corrupted,
he certainly can wait a minute (.
But I have some concerns about compatibility with my patches.
I've tried to call bt_index_check() over my "including" patch [1]https://commitfest.postgresql.org/9/433/ and
caught a segfault.
LOG: server process (PID 31794) was terminated by signal 11:
Segmentation fault
DETAIL: Failed process was running: select bt_index_check('idx');
I do hope that my patch will be accepted in 9.6, so this conflict looks
really bad.
I think that error is caused by changes in pages layout. To save some
space nonkey attributes are truncated
when btree copies the indexed value into inner pages or into high key.
You can look at index_reform_tuple() calls.
I wonder, whether you can add some check of catalog version to your
module or this case requires more changes?
With second patch which implements compression [2]https://commitfest.postgresql.org/9/494/, amcheck causes
another error.
postgres=# insert into tbl select 1 from generate_series(0,5);
INSERT 0 6
postgres=# SELECT * FROM bt_page_items('idx', 1);
itemoffset | ctid | itemlen | nulls | vars | data
------------+----------------+---------+-------+------+----------------------------------------------------------------------------------------
---------------------------------------------------------
1 | (2147483664,6) | 56 | f | f | 01 00 00 00 00
00 00 00 00 00 00 00 01 00 00 00 00 00 02 00 00 00 00 00 03 00 00 00 00
00 04 00 00 00 00 00 05 00 00 00 00 00 06 00 00 00 00 00
postgres=# select bt_index_check('idx');
ERROR: cannot check index "idx"
DETAIL: index is not yet ready for insertions
But I'm sure that it's a problem of my patch. So I'll fix it and try again.
[1]: https://commitfest.postgresql.org/9/433/
[2]: https://commitfest.postgresql.org/9/494/
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Mar 11, 2016 at 5:19 PM, Anastasia Lubennikova wrote:
BTW, if you know a good way to corrupt index (and do it reproducible) I'd be
very glad to see it.
You can use for example dd in non-truncate mode to corrupt on-disk
page data, say that for example:
dd if=/dev/random bs=8192 count=1 \
seek=$BLOCK_ID of=base/$DBOID/$RELFILENODE \
conv=notrunc
--
Michael
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi,
On Fri, 2016-03-11 at 17:24 +0100, Michael Paquier wrote:
On Fri, Mar 11, 2016 at 5:19 PM, Anastasia Lubennikova wrote:
BTW, if you know a good way to corrupt index (and do it
reproducible) I'd be
very glad to see it.You can use for example dd in non-truncate mode to corrupt on-disk
page data, say that for example:
dd if=/dev/random bs=8192 count=1 \
seek=$BLOCK_ID of=base/$DBOID/$RELFILENODE \
conv=notrunc
I guess /dev/random is not very compatible with the "reproducible"
requirement. I mean, it will reproducibly break the page, but pretty
much completely, which is mostly what checksums are for.
I think the main goal of the amcheck module is to identify issues that
are much more subtle - perhaps a coding error or unhandled cornercase
which results in page with a valid structure and checksum, but broken
page. Or perhaps issues coming from outside of PostgreSQL, like a
subtle change in the locales.
Simulating those issues requires a tool that allows minor tweaks to the
page, dd is a bit too dull for that.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Mar 11, 2016 at 8:24 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
You can use for example dd in non-truncate mode to corrupt on-disk
page data, say that for example:
dd if=/dev/random bs=8192 count=1 \
seek=$BLOCK_ID of=base/$DBOID/$RELFILENODE \
conv=notrunc
Sure, but that would probably fail at the first hurdle -- the page
header would be corrupt. Which is a valid test, but not all that
interesting.
One testing workflow I tried is overwriting some page in a B-Tree
relfilenode with some other page in the same file:
$:~/pgdata/base/12413$ dd if=somefile of=somefile conv=notrunc bs=8192
count=1 skip=2 seek=3
That should fail due to the key space not being in order across pages,
which is slightly interesting. Or, you could selectively change one
item with a hex editor, as Anastasia did.
Or, you could add code like this to comparetup_index_btree(), to
simulate a broken opclass:
diff --git a/src/backend/utils/sort/tuplesort.c
b/src/backend/utils/sort/tuplesort.c
index 67d86ed..23712ff 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -3562,6 +3562,9 @@ comparetup_index_btree(const SortTuple *a, const
SortTuple *b,
compare = ApplySortComparator(a->datum1, a->isnull1,
b->datum1, b->isnull1,
sortKey);
+
+ if (random() <= (MAX_RANDOM_VALUE / 1000))
+ compare = -compare;
if (compare != 0)
return compare;
There are many options when you want to produce a corrupt B-Tree index!
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Mar 11, 2016 at 1:09 PM, Peter Geoghegan <pg@heroku.com> wrote:
Or, you could add code like this to comparetup_index_btree(), to
simulate a broken opclass:diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 67d86ed..23712ff 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -3562,6 +3562,9 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b, compare = ApplySortComparator(a->datum1, a->isnull1,b->datum1, b->isnull1, sortKey); + + if (random() <= (MAX_RANDOM_VALUE / 1000)) + compare = -compare; if (compare != 0) return compare;
Note that this patch that I sketched would make CREATE INDEX produce
corrupt indexes, but the tool's verification itself would not be
affected. Although even if it was (even if _bt_compare() gave the same
wrong answers), it would still very likely detect corruption.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Mar 10, 2016 at 9:18 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
I've looked at this patch today, mostly to educate myself, so this
probably should not count as a full review. Anyway, the patch seems in
excellent shape - it'd be great if all patches (including those written
by me) had this level of comments/docs.
Thanks. I try. This patch is something I've been working on slowly and
steadily for over 2 years. It was time to see it through.
Note that no function currently checks that the index is consistent
with the heap, which would be very useful (that's probably how I'd
eventually target the heapam, actually).I'm afraid I don't understand what "target the heapam" means. Could you
elaborate?
Just that that's how I'd make amcheck verify that the heap was sane,
if I was going to undertake that project. Or, I'd start there.
I agree. It'd be nice to have a tool that also checks for data
corruption the a lower level (e.g. that varlena headers are not
corrupted etc.) but that seems like a task for some other tool.
I'm not sure that that's a task for another tool. I think that this
tool is scoped at detecting corruption, and that could work well for
the heap, too. There might be fewer interesting things to test about
the heap when indexes aren't involved. Following HOT chains through an
index, and verifying things like that about the heap as we go could
detect a lot of problematic cases.
Can we come up with names that more clearly identify the difference
between those two functions? I mean, _parent_ does not make it
particularly obvious that the second function acquires exclusive lock
and performs more thorough checks.
Dunno about that. It's defining characteristic is that it checks child
pages against their parent IMV. Things are not often defined in terms
of their locking requirements.
This also applies to the name of the contrib module - it's not
particularly obvious what "amcheck" unless you're familiar with the
concept of access methods. Which is quite unlikely for regular users.
Maybe something like "check_index" would be more illustrative?
I think that given the heap may be targeted in the future, amcheck is
more future-proof. I think Robert might have liked that name a year
ago or something. In general, I'm not too worried about what the
contrib module is called in the end.
Well, I wouldn't count myself as an expert here, but do we actually need
such protection? I mean, we only do such checks when holding an
exclusive lock on the parent relation, no? And even if we don't vacuum
can only remove entries from the pages - why should that cause
violations of any invariants?
I think that a more worked out explanation for why the ExclusiveLock
is needed is appropriate. I meant to do that.
Basically, a heavier lock is needed because of page deletion by
VACUUM, which is the major source of complexity (much more than page
splits, I'd say). In general, the key space can be consolidated by
VACUUM in a way that breaks child page checking because the downlink
we followed from our target page is no longer the current downlink.
Page deletion deletes the right sibling's downlink, and the deleted
page's downlink is used for its sibling. You could have a race, where
there was a concurrent page deletion of the left sibling of the child
page, then a concurrent insertion into the newly expanded keyspace of
the parent. Therefore, the downlink in the parent (which is the
"target", to use the patch's terminology) would not be a lower bound
on items in the page.
That's a very low probability race, because it involves deletion a
page representing a range in the keyspace that has no live items, but
then immediately getting one just in time for our check. But, I'm
pretty determined that false positives like that need to be impossible
(or else it's a bug).
I have a paranoid feeling that there is a similar very low probability
race with the left-right keyspace checks, which don't have relation
ExclusiveLock protection (IOW, I think that that might be buggy). I
need to think about that some more, but current thinking is that it
would hardly matter if we used the highkey from right page rather than
the first data item, which definitely would be correct. And it would
be simpler.
A few minor comments:
Thanks for catching those issues. Will fix.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Mar 11, 2016 at 1:31 PM, Peter Geoghegan <pg@heroku.com> wrote:
You could have a race, where
there was a concurrent page deletion of the left sibling of the child
page, then a concurrent insertion into the newly expanded keyspace of
the parent. Therefore, the downlink in the parent (which is the
"target", to use the patch's terminology) would not be a lower bound
on items in the page.
Excuse me: I meant the newly expanded keyspace of the *child*. (The
parent's keyspace would have covered everything. It's naturally far
larger than either child's keyspace, since it typically has several
hundred pages.)
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Mar 11, 2016 at 8:19 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
I do hope that my patch will be accepted in 9.6, so this conflict looks
really bad.
I hope so too. I'll have to look into this issue.
I think that error is caused by changes in pages layout. To save some space
nonkey attributes are truncated
when btree copies the indexed value into inner pages or into high key. You
can look at index_reform_tuple() calls.
That seems like the kind of problem that would be expected when things
like that change. I think it's going to be hard to add new B-Tree
features without a tool like this, which was a big reason to work on
it; to make these new projects possible to test and review. I see many
opportunities to improve the B-Tree code, just as I imagine you do.
These projects are quite strategically important, because B-Trees are
so frequently used. I think that Postgres B-Trees produce excessive
random I/O for checkpoints for various reasons, and this disadvantages
Postgres when it is compared to other major systems. As things get
better in other areas of Postgres, these hard problems become more
important to solve.
I wonder, whether you can add some check of catalog version to your module
or this case requires more changes?
I think that it's just going to be tied to the Postgres version. So,
if your B-Tree patches are committed first, it's on me to make sure
they're handled correctly. Or vice-versa. Not worried that that will
be a problem.
We already take special steps to avoid the "minus infinity" item on
internal pages. I think that in the future, if Postgres B-Trees get
suffix truncation for internal page items, there are new problems for
amcheck (suffix truncation remove unneeded information from internal
page items, sometimes greatly increasing B-Tree fan-out. Internal page
items need only be sufficient to guide index scans correctly.).
Specially, with suffix truncation there might be "minus infinity"
*attributes*, too (these could make it safe to completely remove
attributes/columns past the first distinguishing/distinct attribute on
each item on internal pages). That's a case that amcheck then needs to
care about, just like it currently cares about the existing concept of
minus infinity items. That's how it goes for amcheck.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 3/11/16 3:31 PM, Peter Geoghegan wrote:
Can we come up with names that more clearly identify the difference
between those two functions? I mean,_parent_ does not make it
particularly obvious that the second function acquires exclusive lock
and performs more thorough checks.Dunno about that. It's defining characteristic is that it checks child
pages against their parent IMV. Things are not often defined in terms
of their locking requirements.
First, thanks for your work on this. I've wanted it in the past.
I agree the name isn't very clear. Perhaps _recurse?
I also agree that the nmodule name isn't very clear. If this is meant to
be the start of a generic consistency checker, lets call it that.
Otherwise, it should be marked as being specific to btrees, because
presumably we might eventually want similar tools for GIN, etc. (FWIW
I'd vote for a general consistency checker).
I know the vacuum race condition would be very rare, but I don't think
it can be ignored. Last thing you want out of a consistency checker is
false negatives/positives. I do think it would be reasonable to just
wholesale block against concurrent vacuums, but I don't think there's
any reasonable way to do that.
I would prefer the ability to do something other than raising an error
when corruption is found, so that you could find all corruption in an
index. Obviously could log to a different level. Another option would be
SRFs that return info about all the corruption found, but that's
probably overkill.
It'd be nice if you had the option to obey vacuum_cost_delay when
running this, but that's clearly just a nice-to-have (or maybe just obey
it all the time, since it defaults to 0).
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Mar 11, 2016 at 3:50 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
I also agree that the nmodule name isn't very clear. If this is meant to be
the start of a generic consistency checker, lets call it that. Otherwise, it
should be marked as being specific to btrees, because presumably we might
eventually want similar tools for GIN, etc. (FWIW I'd vote for a general
consistency checker).
It's a generic consistency checker -- that's the intent. Robert wanted
to go that way about a year ago, and I think it makes sense (this
module started out life as btreecheck). As I said, I don't really care
what the name ends up being. amcheck seems fine to me, but I'll go
with any reasonable suggestion that reflects the scope of the tool as
a generic consistency checker for AMs, including heapam.
I don't know that I understand the concern about the naming of
bt_index_parent_check(). I called it something that accurately
reflects what it does. That particular SQL-callable function is more
orientated towards helping hackers than towards helping users detect
routine problems, as the docs say, but it could help users working
with hackers, and it might even help users acting on their own. I
don't want to make it sound scary, because it isn't. I don't know what
problems will be detected by this tool most often, and I don't think
it would be wise to try to predict how that will work out. And so,
concealing the internal workings of each function/check by giving them
all totally non-technical names seems like a bad idea. It's not that
hard to understand that a B-Tree has multiple levels, and checking the
levels against each other requires more restrictive/stronger
heavyweight locks. I've only added 2 SQL-callable functions, each of
which has one argument, and the first of which has a generic name and
very light locking requirements, like a SELECT. It's not that hard to
get the general idea, as an ordinary user.
I know the vacuum race condition would be very rare, but I don't think it
can be ignored. Last thing you want out of a consistency checker is false
negatives/positives.
That's what I said. And the docs also say that there should never be a
false positive. That's definitely an important design goal here,
because it enables routine testing.
I do think it would be reasonable to just wholesale
block against concurrent vacuums, but I don't think there's any reasonable
way to do that.
Actually, that's exactly what bt_index_parent_check() does already.
VACUUM requires a SHARE UPDATE EXCLUSIVE lock on the heap relation,
which cannot be concurrently acquired due to bt_index_parent_check()'s
acquisition of a SHARE lock. The locking for bt_index_parent_check()
is almost the same as REINDEX, except that that acquires an ACCESS
EXCLUSIVE lock on the index relation; bt_index_parent_check() only
requires an EXCLUSIVE lock on the index relation.
Not sure about the cost delay thing. Delays are disabled by default
for manually issued VACUUM, so have doubts that that's useful.
If you want the tool to limp on when it finds an error, that can be
done by changing the constant for the CORRUPTION macro in amcheck.c.
But having that be dynamically configurable is not really compatible
with the goal of having amcheck be run routinely.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 3/11/16 6:17 PM, Peter Geoghegan wrote:
Not sure about the cost delay thing. Delays are disabled by default
for manually issued VACUUM, so have doubts that that's useful.
Right, but you still have the option to enable them if you don't want to
swamp your IO system. That's why CIC obeys it too. If I was running a
consistency check on a production system I'd certainly want the option
to throttle it. Without that option, I don't see running this on
production systems as being an option. If that's not a goal then fine,
but if it is a goal I think it needs to be there.
Isn't it just a few extra lines of code to support it?
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Mar 11, 2016 at 4:17 PM, Peter Geoghegan <pg@heroku.com> wrote:
If you want the tool to limp on when it finds an error, that can be
done by changing the constant for the CORRUPTION macro in amcheck.c.
But having that be dynamically configurable is not really compatible
with the goal of having amcheck be run routinely.
Also, it's just really hard to reason about what remains OK to check
after the first problem is encountered in the general case. It's
"unreasonable" for any of the checks to ever fail. So, by that
standard, assuming that they might fail at all could be called
paranoid. Who can say what "paranoid enough" should be? I think it's
useful to have a low-impact, generic check function for B-Tree
indexes, but I don't think we need to hold back on being exhaustive
elsewhere.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Mar 11, 2016 at 4:30 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
Right, but you still have the option to enable them if you don't want to
swamp your IO system. That's why CIC obeys it too. If I was running a
consistency check on a production system I'd certainly want the option to
throttle it. Without that option, I don't see running this on production
systems as being an option. If that's not a goal then fine, but if it is a
goal I think it needs to be there.Isn't it just a few extra lines of code to support it?
I see your point.
I'll add that if people like the interface you propose. (Overloading
the VACUUM cost delay functions to cause a delay for amcheck
functions, too). Note that the functions already use an appropriate
buffer access strategy (it avoids blowing shared_buffers, much like
VACUUM itself).
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, 2016-03-11 at 16:40 -0800, Peter Geoghegan wrote:
On Fri, Mar 11, 2016 at 4:17 PM, Peter Geoghegan <pg@heroku.com>
wrote:If you want the tool to limp on when it finds an error, that can be
done by changing the constant for the CORRUPTION macro in
amcheck.c.
But having that be dynamically configurable is not really
compatible
with the goal of having amcheck be run routinely.Also, it's just really hard to reason about what remains OK to check
after the first problem is encountered in the general case. It's
"unreasonable" for any of the checks to ever fail. So, by that
standard, assuming that they might fail at all could be called
paranoid. Who can say what "paranoid enough" should be? I think it's
useful to have a low-impact, generic check function for B-Tree
indexes, but I don't think we need to hold back on being exhaustive
elsewhere.
Right, but isn't there a difference between the two functions in this
respect? Once you find corruption involving relationship between
multiple pages, then I agree it's complicated to do any reasoning about
what additional checks are safe.
But does that problem also apply to bt_index_check, which checks pages
independently?
Admittedly, this also depends on the use case. If all we need to do is
answering a question "Is the index corrupted?" then sure, bailing out
on the first error is perfectly appropriate.
But perhaps this would be useful for some recovery/forensics tasks?
From time to time we need to investigate corruption in a database, i.e.
see how much of the data is actually corrupted, list pages that need to
be zeroed to get the cluster up to salvage as much as possible, etc.
Currently this is tedious because we essentially find/fix the pages one
by one. It'd be very useful to list all broken pages in one go and then
fix all of them.
Obviously, that's about heapam checks, but perhaps it would be useful
for an index too?
regard
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, Mar 11, 2016 at 6:33 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
Right, but isn't there a difference between the two functions in this
respect? Once you find corruption involving relationship between
multiple pages, then I agree it's complicated to do any reasoning about
what additional checks are safe.But does that problem also apply to bt_index_check, which checks pages
independently?
I think so, yes.
Admittedly, this also depends on the use case. If all we need to do is
answering a question "Is the index corrupted?" then sure, bailing out
on the first error is perfectly appropriate.But perhaps this would be useful for some recovery/forensics tasks?
Maybe, but I feel like making it possible to change the CORRUPTION
elevel macro was the right trade-off. I don't want to have to reason
about the universe of possible problems that could occur when the tool
must limp on in the event of corruption. For example, I don't want to
have to deal with infinite loops. In practice, an expert would
probably be fine to change the constant themselves if they needed to.
Indexes can always be rebuilt. The tool is for identifying and
diagnosing corruption, but if you want to diagnose a faulty opclass or
something, then I think you need to get out pageinspect. You need
human judgement for that.
From time to time we need to investigate corruption in a database, i.e.
see how much of the data is actually corrupted, list pages that need to
be zeroed to get the cluster up to salvage as much as possible, etc.
Currently this is tedious because we essentially find/fix the pages one
by one. It'd be very useful to list all broken pages in one go and then
fix all of them.Obviously, that's about heapam checks, but perhaps it would be useful
for an index too?
Only insofar as it helps diagnose the underlying issue, when it is a
more subtle issue. Actually fixing the index is almost certainly a
REINDEX. Once you're into the messy business of diagnosing a
problematic opclass, you have to be an expert, and tweaking amcheck
for your requirements (i.e. rebuilding from source) becomes
reasonable. Part of the reason that the code is so heavily commented
is to make it hackable, because I do not feel optimistic that I can
get an expert-orientated interface right, but I still want to make the
tool as useful as possible to experts.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Fri, 2016-03-11 at 17:24 +0100, Michael Paquier wrote:
On Fri, Mar 11, 2016 at 5:19 PM, Anastasia Lubennikova wrote:
BTW, if you know a good way to corrupt index (and do it
reproducible) I'd be
very glad to see it.You can use for example dd in non-truncate mode to corrupt on-disk
page data, say that for example:
dd if=/dev/random bs=8192 count=1 \
seek=$BLOCK_ID of=base/$DBOID/$RELFILENODE \
conv=notruncI guess /dev/random is not very compatible with the "reproducible"
requirement. I mean, it will reproducibly break the page, but pretty
much completely, which is mostly what checksums are for.
You can actually pretty easily produce a test case by setting up streaming
replication between servers running two different version of glibc.
I actually wrote a tool that spins up a pair of VMs using vagrant and then
sets them up as streaming replica's using ansible. It provides a nice one
liner to get a streaming replica test environment going and it will easily
provide the cross glibc test case. Technically, though it belongs to Trip
because I wrote it on company time. Let me see if I can open source a
version of it later this week that way you can use it for testing.
- Matt K.
On Sat, Mar 12, 2016 at 7:46 PM, Matt Kelly <mkellycs@gmail.com> wrote:
You can actually pretty easily produce a test case by setting up streaming
replication between servers running two different version of glibc.I actually wrote a tool that spins up a pair of VMs using vagrant and then
sets them up as streaming replica's using ansible. It provides a nice one
liner to get a streaming replica test environment going and it will easily
provide the cross glibc test case. Technically, though it belongs to Trip
because I wrote it on company time. Let me see if I can open source a
version of it later this week that way you can use it for testing.
That could be interesting. The earlier prototypes of this tool are
known to have detected glibc collation incompatibilities in real
production systems.
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 3/11/16 6:45 PM, Peter Geoghegan wrote:
I'll add that if people like the interface you propose. (Overloading
the VACUUM cost delay functions to cause a delay for amcheck
functions, too).
I thought that had already been overloaded by CIC, but I'm not finding
reference to it... ANALYZE does use it though, so the ship has already
sorta sailed.
I'm actually a bit surprised cost delay isn't used anywhere else. As
more background operations are added I suspect users will want it at
some point.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers