amcheck hardening
Peter,
I'd like your advice on the following observations, if you'd be so kind:
Using the pg_amcheck command committed yesterday (thanks, Robert! thanks, Tom!), I have begun investigating segfaults that sometimes occur when running the amcheck routines on intentionally corrupted databases. We've discussed this before, and there are limits to how much hardening is possible, particularly if it comes at the expense of backend performance under normal conditions. There are also questions about how much corruption schemes that differ from what is likely to go wrong in the wild really tell us.
These segfaults present a real usability problem for pg_amcheck. We made the decision [3] not to continue checking after a FATAL or PANIC error, so a segfault while checking a single index in a single database can abort a pg_amcheck run spanning multiple databases, tables, and indexes. That is not easy to blow off. Perhaps the decision in [3] was wrong, but if not, some hardening of amcheck would make this problem less common.
The testing strategy I'm using is to corrupt heap and btree pages in schema "public" of the "regression" database created by `make installcheck`, overwriting random bytes at randomly selected locations after the page header. I then run `pg_amcheck regression` and see if anything segfaults. Repeating this, with the bytes and locations varying from run to run, lets me find the locations where segfaults are particularly common.
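For what it's worth, the corruption step can be sketched roughly like this (just a sketch: the 8192-byte page size and 24-byte header are assumptions matching a default build, and corrupt_page() is a hypothetical helper, not part of pg_amcheck):

```c
/*
 * Sketch of the fuzzing step described above: overwrite a few random
 * bytes after the page header of one page in a relation file.  The
 * 8192-byte page size and 24-byte header size are assumptions
 * matching a default build; corrupt_page() is a hypothetical helper.
 */
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE 8192
#define PAGE_HEADER_SIZE 24		/* sizeof(PageHeaderData) on typical builds */

static int
corrupt_page(const char *path, long pageno, int nbytes, unsigned seed)
{
	FILE	   *f = fopen(path, "r+b");

	if (f == NULL)
		return -1;
	srand(seed);
	while (nbytes-- > 0)
	{
		/* pick an offset strictly after the page header */
		long		off = pageno * PAGE_SIZE + PAGE_HEADER_SIZE +
			rand() % (PAGE_SIZE - PAGE_HEADER_SIZE);

		fseek(f, off, SEEK_SET);
		fputc(rand() % 256, f);
	}
	fclose(f);
	return 0;
}
```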
Admittedly, this is not what is likely to go wrong in real-world installations. I had a patch in the pg_amcheck series, ultimately abandoned for v14 given the schedule, that reverts tables and indexes (or portions thereof) to prior states. I'll probably move on to testing with that in a bit.
The first common problem [1] happens at verify_nbtree.c:1422 when a corruption report is being generated. The generation does not seem entirely safe, and the problematic bit can be avoided, though I suspect you could do better than the brute-force solution I'm using locally:
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index c4ca633918..fa8b7d5163 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -1418,9 +1418,13 @@ bt_target_page_check(BtreeCheckState *state)
OffsetNumberNext(offset));
itup = (IndexTuple) PageGetItem(state->target, itemid);
tid = BTreeTupleGetPointsToTID(itup);
+#if 0
nhtid = psprintf("(%u,%u)",
ItemPointerGetBlockNumberNoCheck(tid),
ItemPointerGetOffsetNumberNoCheck(tid));
+#else
+ nhtid = "(UNRESOLVED,UNRESOLVED)";
+#endif
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
The ItemPointerGetBlockNumberNoCheck(tid) call seems to be unsafe here. I get much the same crash at verify_nbtree.c:1136, but it's so similar that I won't bother to include a crash report for it.
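One less drastic guard I can imagine (only a sketch: tid_within_tuple() is a hypothetical helper, and the struct below merely mimics the 6-byte shape of the real ItemPointerData) would be to format the heap TID only when the pointer returned for it actually lies within the space the line pointer says the tuple occupies:

```c
/*
 * Hypothetical guard: only dereference the TID if it lies entirely
 * within the tuple's space on the page.  ItemPointerData here is a
 * stand-in with the same 6-byte shape as the real struct in
 * storage/itemptr.h.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
	uint16_t	bi_hi;
	uint16_t	bi_lo;
	uint16_t	ip_posid;
} ItemPointerData;

static bool
tid_within_tuple(const void *tid, const void *itup, size_t itup_len)
{
	const char *p = (const char *) tid;
	const char *base = (const char *) itup;

	return p >= base &&
		(size_t) (p - base) + sizeof(ItemPointerData) <= itup_len;
}
```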
The second common problem [2] happens at verify_nbtree.c:2762, where _bt_compare is called and ends up invoking pg_detoast_datum_packed on a garbage value. I'm not sure we can fix that at the level of _bt_compare, since that would have performance consequences for backends under normal conditions, but perhaps we could have a function that sanity-checks datums, and call it from invariant_l_offset() prior to _bt_compare? I have observed a variant of this crash where the text data is not toasted but crashes anyway:
0 libsystem_platform.dylib 0x00007fff738ec929 _platform_memmove$VARIANT$Haswell + 41
1 postgres 0x000000010bf1af34 varstr_cmp + 532 (varlena.c:1678)
2 postgres 0x000000010bf1b6c9 text_cmp + 617 (varlena.c:1770)
3 postgres 0x000000010bf1bfe5 bttextcmp + 69 (varlena.c:1990)
4 postgres 0x000000010bf68c87 FunctionCall2Coll + 167 (fmgr.c:1163)
5 postgres 0x000000010b8a7cb5 _bt_compare + 1445 (nbtsearch.c:721)
6 amcheck.so 0x000000011525eaeb invariant_l_offset + 123 (verify_nbtree.c:2758)
7 amcheck.so 0x000000011525cd92 bt_target_page_check + 4754 (verify_nbtree.c:1398)
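To be concrete about the sanity-check idea, here is the kind of thing I have in mind (a sketch under big assumptions: little-endian build, 4-byte varlena headers only, and datum_len_sane() is a hypothetical name; a real check would also have to handle 1-byte headers and toast pointers):

```c
/*
 * Hypothetical datum sanity check: before handing a varlena datum to
 * _bt_compare, verify that its self-reported length fits within the
 * space remaining in the tuple.  The length extraction below is a
 * simplified stand-in for VARSIZE_4B on a little-endian build.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VARHDRSZ 4

static bool
datum_len_sane(const char *datum, size_t space_left)
{
	uint32_t	header;
	uint32_t	len;

	if (space_left < VARHDRSZ)
		return false;			/* no room even for a 4-byte header */
	memcpy(&header, datum, sizeof(header));
	len = (header >> 2) & 0x3FFFFFFF;	/* as VARSIZE_4B does on LE builds */
	return len >= VARHDRSZ && len <= space_left;
}
```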
[1]:
On Sat, Mar 13, 2021 at 10:35 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> The testing strategy I'm using is to corrupt heap and btree pages in schema "public" of the "regression" database created by `make installcheck`, by overwriting random bytes in randomly selected locations on pages after the page header. Then I run `pg_amcheck regression` and see if anything segfaults. Doing this repeatedly, with random bytes and locations within files not the same from one run to the next, I can find the locations of segfaults that are particularly common.
That seems like a reasonable starting point to me.
> The first common problem [1] happens at verify_nbtree.c:1422 when a corruption report is being generated. The generation does not seem entirely safe, and the problematic bit can be avoided
> The ItemPointerGetBlockNumberNoCheck(tid) seems to be unsafe here.
I think that the best way to further harden verify_nbtree.c at this
point is to do the most basic validation of IndexTuples in our own new
variant of a core bufpage.h macro: PageGetItemCareful(). In other
words, we should invent the IndexTuple equivalent of the existing
PageGetItemIdCareful() function (which already does the same thing for
item pointers), and then replace all current PageGetItem() calls with
PageGetItemCareful() calls -- we ban raw PageGetItem() calls from
verify_nbtree.c forever.
Here is some of the stuff that could go in PageGetItemCareful(), just
off the top of my head:
* The existing "if (tupsize != ItemIdGetLength(itemid))" check at the
top of bt_target_page_check() -- make sure the length from the
caller's line pointer agrees with IndexTupleSize().
* Basic validation against the index's tuple descriptor -- in
particular, that varlena headers are basically sane, and that the
apparent range of datums is safely within the space on the page for
the tuple.
* Similarly, BTreeTupleGetHeapTID() should not be able to return a
pointer that doesn't actually point somewhere inside the space that
the target page has for the IndexTuple.
* No external TOAST pointers, since this is an index AM, and so
doesn't allow that.
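A minimal sketch of the shape this could take (the structs here are simplified stand-ins for the bufpage.h/itup.h definitions -- the real IndexTupleData starts with a heap TID before t_info -- and page_get_item_careful() is just a placeholder name):

```c
/*
 * Sketch of a PageGetItemCareful() along the lines described above:
 * validate the line pointer against the page bounds, and check that
 * the tuple's self-reported size agrees with lp_len.  Returns NULL
 * on failure here; in amcheck this would be an ereport(ERROR)
 * reporting corruption instead.
 */
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ 8192
#define PAGE_HEADER_SIZE 24		/* stand-in for SizeOfPageHeaderData */
#define INDEX_SIZE_MASK 0x1FFF	/* as in access/itup.h */

typedef struct
{
	unsigned	lp_off:15,		/* offset to tuple (from start of page) */
				lp_flags:2,
				lp_len:15;		/* byte length of tuple */
} ItemIdData;					/* same layout idea as storage/itemid.h */

typedef struct
{
	uint16_t	t_info;			/* tuple size in the low 13 bits */
} IndexTupleData;				/* simplified: the real struct has t_tid first */

static IndexTupleData *
page_get_item_careful(char *page, ItemIdData itemid)
{
	IndexTupleData *itup;

	/* tuple must lie wholly within the page, past the page header */
	if (itemid.lp_off < PAGE_HEADER_SIZE ||
		itemid.lp_len < sizeof(IndexTupleData) ||
		(size_t) itemid.lp_off + itemid.lp_len > BLCKSZ)
		return NULL;
	itup = (IndexTupleData *) (page + itemid.lp_off);

	/* the line pointer's length must agree with IndexTupleSize() */
	if ((itup->t_info & INDEX_SIZE_MASK) != itemid.lp_len)
		return NULL;
	return itup;
}
```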
In general this kind of very basic validation should be pushed down to
the lowest level code, so that it detects the problem as early as
possible, before slightly higher level code has the opportunity to
run. Higher level code is always going to be at risk of making
assumptions about the data not being corrupt, because there is so much
more of it, and also because it tends to roughly look like idiomatic
AM code.
--
Peter Geoghegan
On Mar 13, 2021, at 11:06 AM, Peter Geoghegan <pg@bowt.ie> wrote:
Excellent write-up! Thanks!
I'll work on this and get a patch set going if time allows.
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company