btreecheck extension
As discussed at the developer meeting at pgCon, I think that there is
a lot to be said for a tool that checks nbtree index invariants on
live systems.
Attached prototype patch adds contrib extension, btreecheck. This
extension provides SQL-callable functions for checking these
conditions on nbtree indexes on live systems. This code generally
works by using the same insertion ScanKey infrastructure that the
B-Tree code itself uses in servicing all kinds of index scans.
There are two major goals for this patch. In order of their importance
in my mind:
1. Improve testing around the B-Tree code. Give 9.4 beta testers the
tools needed to detect subtle errors in a semi-automated way. I'm
quite encouraged by recent efforts along these lines by Jeff Janes,
Heikki, and others. Beyond 9.4, I think that stress testing is likely
to become indispensable as more complex performance optimizations are
pursued. Personally, I see some good opportunities to improve the
performance of the B-Tree code, and if those ideas are ever to be put
into action, there is clearly a big need for automated stress-testing.
I see this tool as a start. I'm not just talking about my recent work
on sort support for text, and the idea of generalizing that; I think
that there are many techniques that we could pursue for decreasing the
I/O and CPU overhead of using B-Tree indexes that are just too risky
to pursue today because of the subtle interactions of many moving
parts. I'd like to lower the risk.
2. Allow users to sanity check indexes in production. This is not my
immediate goal here, though; I will not add this patch to the next
CommitFest. That's something to think about in more detail a bit
later. Clearly there is a big demand for this kind of thing, though.
Since the B-Tree code changes in 9.4 are generally thought to be the
most plausible source of serious bugs in 9.4, purely because of their
subtlety and fundamental importance, it seemed valuable to get this
prototype into the hands of beta testers as soon as possible.
For that reason, I have not given some aspects of the design as much
thought as I would have preferred. I haven't really come up with a
plausible plan of attack for finding bugs in the B-Tree code. Rather,
I've more or less just considered what invariant conditions exist that
can be correctly verified without too much effort, and then
implemented checks for each. I've also stuck to a design that has the
tool share lock only a single B-Tree buffer at a time, copying the
contents into local memory before operating on them (relation-level
heavyweight locks are used in various ways too), in a similar manner
to contrib/pageinspect. Copying the entire page into local memory is
probably the right thing to do even from a performance perspective,
since in each case the entire page is inspected, which takes much
longer than, say, a binary search. Even though it's supposed to be
possible to run this on a live system without causing too much
disruption, subtle buffer locking protocols were avoided (though the
module does include a rationale for why things are okay, one that
considers the current buffer locking protocols particular to the
B-Tree code). This is a principle that I think I can stick to, and
would certainly prefer to stick to, for obvious reasons.
The checks performed (the entire set of invariant conditions checked
by all SQL-callable functions in the patch) are currently:
* That data items on each B-Tree page are in logical order.
* That non-rightmost pages (which are the only pages that have a high
key) have only data items that respect that high key as an upper
bound.
* That a down-link data item in a parent page points to a child page
where the parent down-link data item is respected as a lower bound.
* That a down-link data item in a parent page points to a child page
whose high key (if any) is equal to the next data item on the parent
page (if any), while taking into account incomplete page splits and
right-most-ness.
* That a down-link data item in a parent page points to a child page
whose right link (if any) points to the same block/page as pointed to
by the next data item on the parent page (if any). In other words, the
children and their parent are in agreement about the children being
siblings, while taking into account incomplete page splits and
right-most-ness. Put another way, this verifies that the second phase
of a page split, where a down-link is inserted into the parent, was
sane (so this check is similar to the previous invariant check listed).
The two prior checks may not be sufficient, though, if we want to
operate with weak, non-disruptive locks -- what about cousins and
grandparents, and even great-grandparents and second cousins? It's a
good start, though.
* That on each level, starting at the true root level and working down
to the leaf level (level 0), there is mutual agreement between
siblings that they are siblings (that is, their right and left links
comport).
I won't go into locking protocols and their correctness here. That's
something that I would like others to look over, though - there are
fairly detailed comments. I would also like to coordinate with anyone
who thinks they can come up with a good stress testing workload that
introduces a lot of variability for the B-Tree code to deal with. I
think it's appropriate that that general effort should mostly drive
the development of this tool, and not the other way around.
When you consider the difficulty of isolating the kind of bugs we're
looking for here, a lower lock strength is not just a nice-to-have. I
think it might actually be an important contribution to the
effectiveness of the tool generally. I envisage tests running with a
high amount of concurrency on big databases over days or even longer.
Thoughts?
--
Peter Geoghegan
Attachments:
btreecheck-prototype.2014_06_16.patch (text/x-patch, charset=US-ASCII)
*** a/contrib/btreecheck/Makefile
--- b/contrib/btreecheck/Makefile
***************
*** 0 ****
--- 1,18 ----
+ # contrib/btreecheck/Makefile
+
+ MODULE_big = btreecheck
+ OBJS = btreecheck.o
+
+ EXTENSION = btreecheck
+ DATA = btreecheck--1.0.sql
+
+ ifdef USE_PGXS
+ PG_CONFIG = pg_config
+ PGXS := $(shell $(PG_CONFIG) --pgxs)
+ include $(PGXS)
+ else
+ subdir = contrib/btreecheck
+ top_builddir = ../..
+ include $(top_builddir)/src/Makefile.global
+ include $(top_srcdir)/contrib/contrib-global.mk
+ endif
*** a/contrib/btreecheck/btreecheck--1.0.sql
--- b/contrib/btreecheck/btreecheck--1.0.sql
***************
*** 0 ****
--- 1,44 ----
+ /* contrib/btreecheck/btreecheck--1.0.sql */
+
+ -- complain if script is sourced in psql, rather than via CREATE EXTENSION
+ \echo Use "CREATE EXTENSION btreecheck" to load this file. \quit
+
+ --
+ -- bt_page_verify()
+ --
+ CREATE FUNCTION bt_page_verify(relname text, blkno int4)
+ RETURNS VOID
+ AS 'MODULE_PATHNAME', 'bt_page_verify'
+ LANGUAGE C STRICT;
+
+ --
+ -- bt_parent_page_verify()
+ --
+ CREATE FUNCTION bt_parent_page_verify(relname text, blkno int4)
+ RETURNS VOID
+ AS 'MODULE_PATHNAME', 'bt_parent_page_verify'
+ LANGUAGE C STRICT;
+
+ --
+ -- bt_index_verify()
+ --
+ CREATE FUNCTION bt_index_verify(relname text)
+ RETURNS VOID
+ AS 'MODULE_PATHNAME', 'bt_index_verify'
+ LANGUAGE C STRICT;
+
+ --
+ -- bt_parent_index_verify()
+ --
+ CREATE FUNCTION bt_parent_index_verify(relname text)
+ RETURNS VOID
+ AS 'MODULE_PATHNAME', 'bt_parent_index_verify'
+ LANGUAGE C STRICT;
+
+ --
+ -- bt_leftright_verify()
+ --
+ CREATE FUNCTION bt_leftright_verify(relname text)
+ RETURNS VOID
+ AS 'MODULE_PATHNAME', 'bt_leftright_verify'
+ LANGUAGE C STRICT;
*** a/contrib/btreecheck/btreecheck.c
--- b/contrib/btreecheck/btreecheck.c
***************
*** 0 ****
--- 1,836 ----
+ /*-------------------------------------------------------------------------
+ *
+ * btreecheck.c
+ * Verifies the integrity of nbtree indexes.
+ *
+ * Provides SQL-callable functions for verifying that various invariants in the
+ * structure of nbtree indexes are respected. This includes for example the
+ * invariant that each page with a high key has data items bound by the high
+ * key. Some functions also check invariant conditions that must hold across
+ * multiple pages, such as the requirement that each left link within a page
+ * (if any) points to a page whose right link points back to the original
+ * page.
+ *
+ *
+ * Copyright (c) 2014, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * contrib/btreecheck/btreecheck.c
+ *
+ *-------------------------------------------------------------------------
+ */
+ #include "postgres.h"
+
+ #include "access/nbtree.h"
+ #include "catalog/index.h"
+ #include "catalog/namespace.h"
+ #include "commands/tablecmds.h"
+ #include "funcapi.h"
+ #include "miscadmin.h"
+ #include "storage/lmgr.h"
+ #include "utils/builtins.h"
+ #include "utils/lsyscache.h"
+ #include "utils/rel.h"
+
+
+ PG_MODULE_MAGIC;
+
+ #define CHECK_SUPERUSER() { \
+ if (!superuser()) \
+ ereport(ERROR, \
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
+ (errmsg("must be superuser to use verification functions")))); }
+ #define CHECK_RELATION_BTREE(r) { \
+ if ((r)->rd_rel->relkind != RELKIND_INDEX || (r)->rd_rel->relam != BTREE_AM_OID) \
+ elog(ERROR, "relation \"%s\" is not a btree index", \
+ RelationGetRelationName(r)); }
+ /* reject attempts to read non-local temporary relations */
+ #define CHECK_RELATION_IS_NOT_OTHER_TEMP(rel) { \
+ if (RELATION_IS_OTHER_TEMP(rel)) \
+ ereport(ERROR, \
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), \
+ errmsg("cannot access temporary tables of other sessions"))); }
+ /* note: BlockNumber is unsigned, hence can't be negative */
+ #define CHECK_RELATION_BLOCK_RANGE(rel, blkno) { \
+ if (blkno == 0) \
+ elog(ERROR, "block 0 is a meta page"); \
+ if (RelationGetNumberOfBlocks(rel) <= (BlockNumber) (blkno)) \
+ elog(ERROR, "block number out of range"); }
+
+ /*
+ * Callers to verification functions can reasonably expect to never receive a
+ * warning. Therefore, when using btreecheck functions for stress testing, it
+ * may be useful to temporarily change CHCKELEVEL to PANIC, to immediately halt
+ * the server when an invariant violation is detected. This may preserve more
+ * information about the nature of the underlying problem.
+ */
+ #define CHCKELEVEL WARNING
+
+ PG_FUNCTION_INFO_V1(bt_page_verify);
+ PG_FUNCTION_INFO_V1(bt_parent_page_verify);
+ PG_FUNCTION_INFO_V1(bt_index_verify);
+ PG_FUNCTION_INFO_V1(bt_parent_index_verify);
+ PG_FUNCTION_INFO_V1(bt_leftright_verify);
+
+ static void rangevar_callback_for_childcheck(const RangeVar *relation,
+ Oid relId, Oid oldRelId,
+ void *arg);
+ static void bt_do_page_verify(Relation rel, BlockNumber blockNum,
+ bool childCheck);
+ static ScanKey verify_btree_items_le(Relation rel, Page childPage,
+ ScanKey downLinkParent, IndexTuple pItup);
+ static BlockNumber verify_leftright_level(Relation rel, BlockNumber leftMost);
+ static char *form_btree_data(IndexTuple itup);
+
+ /*
+ * Verify specified B-Tree block/page.
+ *
+ * Only acquires AccessShareLock on index relation.
+ */
+ Datum
+ bt_page_verify(PG_FUNCTION_ARGS)
+ {
+ text *relname = PG_GETARG_TEXT_P(0);
+ uint32 blkno = PG_GETARG_UINT32(1);
+ Relation rel;
+ RangeVar *relrv;
+
+ CHECK_SUPERUSER();
+
+ relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
+ rel = relation_openrv(relrv, AccessShareLock);
+
+ CHECK_RELATION_BTREE(rel);
+
+ CHECK_RELATION_IS_NOT_OTHER_TEMP(rel);
+
+ CHECK_RELATION_BLOCK_RANGE(rel, blkno);
+
+ bt_do_page_verify(rel, blkno, false);
+
+ relation_close(rel, AccessShareLock);
+
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * Verify specified B-Tree parent block/page. Make sure that each child page
+ * has as its right link the page that the parent page has as the next
+ * down-link, and other checks between parent and child pages when examining a
+ * parent page.
+ *
+ * Acquires AccessExclusiveLock on index relation, and ShareLock on the
+ * associated heap relation, much like REINDEX.
+ */
+ Datum
+ bt_parent_page_verify(PG_FUNCTION_ARGS)
+ {
+ text *relname = PG_GETARG_TEXT_P(0);
+ uint32 blkno = PG_GETARG_UINT32(1);
+ Relation iRel, heapRel;
+ RangeVar *relrv;
+ Oid indexId, heapId = InvalidOid;
+
+ CHECK_SUPERUSER();
+
+ relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
+
+ /*
+ * Open the target index relation and get an exclusive lock on it, to
+ * ensure that no one else is touching this particular index.
+ */
+ indexId = RangeVarGetRelidExtended(relrv, AccessExclusiveLock,
+ false, false,
+ rangevar_callback_for_childcheck,
+ (void *) &heapId);
+
+ /*
+ * Open the heap and index relations separately (like relation_openrv() but
+ * with heap relation locked first to prevent deadlocking)
+ */
+ heapRel = heap_open(heapId, NoLock);
+ iRel = index_open(indexId, NoLock);
+
+ /* check for active uses of the index in the current transaction */
+ CheckTableNotInUse(iRel, "bt_parent_page_verify");
+
+ CHECK_RELATION_BTREE(iRel);
+
+ CHECK_RELATION_IS_NOT_OTHER_TEMP(iRel);
+
+ CHECK_RELATION_BLOCK_RANGE(iRel, blkno);
+
+ /*
+ * XXX: It might be a good idea to throw an error if this isn't an internal
+ * page for this variant
+ */
+ bt_do_page_verify(iRel, blkno, true);
+
+ index_close(iRel, AccessExclusiveLock);
+ heap_close(heapRel, ShareLock);
+
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * Verify entire index, without looking at cross-page invariant conditions.
+ *
+ * Only acquires AccessShareLock on index relation.
+ */
+ Datum
+ bt_index_verify(PG_FUNCTION_ARGS)
+ {
+ text *relname = PG_GETARG_TEXT_P(0);
+ Relation rel;
+ RangeVar *relrv;
+ BlockNumber i;
+ BlockNumber nBlocks;
+
+ CHECK_SUPERUSER();
+
+ relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
+ rel = relation_openrv(relrv, AccessShareLock);
+
+ CHECK_RELATION_BTREE(rel);
+
+ CHECK_RELATION_IS_NOT_OTHER_TEMP(rel);
+
+ nBlocks = RelationGetNumberOfBlocks(rel);
+
+ /* Verify every page */
+ for (i = 1; i < nBlocks; i++)
+ {
+ bt_do_page_verify(rel, i, false);
+ }
+
+ relation_close(rel, AccessShareLock);
+
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * Verify entire index, considering cross-page invariant conditions between
+ * parent and child pages.
+ *
+ * Acquires AccessExclusiveLock on index relation, and ShareLock on the
+ * associated heap relation, much like REINDEX.
+ */
+ Datum
+ bt_parent_index_verify(PG_FUNCTION_ARGS)
+ {
+ text *relname = PG_GETARG_TEXT_P(0);
+ RangeVar *relrv;
+ BlockNumber i;
+ BlockNumber nBlocks;
+ Relation iRel, heapRel;
+ Oid indexId, heapId = InvalidOid;
+
+ CHECK_SUPERUSER();
+
+ relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
+
+ /*
+ * Open the target index relation and get an exclusive lock on it, to
+ * ensure that no one else is touching this particular index.
+ */
+ indexId = RangeVarGetRelidExtended(relrv, AccessExclusiveLock,
+ false, false,
+ rangevar_callback_for_childcheck,
+ (void *) &heapId);
+
+ /*
+ * Open the heap and index relations separately (like relation_openrv() but
+ * with heap relation locked first to prevent deadlocking)
+ */
+ heapRel = heap_open(heapId, NoLock);
+ iRel = index_open(indexId, NoLock);
+
+ /* check for active uses of the index in the current transaction */
+ CheckTableNotInUse(iRel, "bt_parent_index_verify");
+
+ CHECK_RELATION_BTREE(iRel);
+
+ CHECK_RELATION_IS_NOT_OTHER_TEMP(iRel);
+
+ nBlocks = RelationGetNumberOfBlocks(iRel);
+
+ /* Verify every page, and parent/child relationships */
+ for (i = 1; i < nBlocks; i++)
+ {
+ bt_do_page_verify(iRel, i, true);
+ }
+
+ index_close(iRel, AccessExclusiveLock);
+ heap_close(heapRel, ShareLock);
+
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * Verify left-right link consistency at each level, starting from the true
+ * root.
+ *
+ * Only acquires AccessShareLock on index relation.
+ */
+ Datum
+ bt_leftright_verify(PG_FUNCTION_ARGS)
+ {
+ text *relname = PG_GETARG_TEXT_P(0);
+ Relation rel;
+ RangeVar *relrv;
+ Buffer buffer;
+ Page page;
+ BlockNumber leftMost;
+ BTMetaPageData *metad;
+
+ CHECK_SUPERUSER();
+
+ relrv = makeRangeVarFromNameList(textToQualifiedNameList(relname));
+ rel = relation_openrv(relrv, AccessShareLock);
+
+ CHECK_RELATION_BTREE(rel);
+
+ CHECK_RELATION_IS_NOT_OTHER_TEMP(rel);
+
+ /* Get true root block from meta-page */
+ buffer = ReadBuffer(rel, 0);
+ LockBuffer(buffer, BT_READ);
+
+ page = BufferGetPage(buffer);
+ metad = BTPageGetMeta(page);
+
+ UnlockReleaseBuffer(buffer);
+
+ /*
+ * Verify every level, starting from the true root's level. Move left to right.
+ * Process one level at a time.
+ */
+ leftMost = metad->btm_root;
+
+ while (leftMost != P_NONE)
+ {
+ leftMost = verify_leftright_level(rel, leftMost);
+ }
+
+ relation_close(rel, AccessShareLock);
+
+ PG_RETURN_VOID();
+ }
+
+ /*
+ * Worker for bt_page_verify(), bt_index_verify() and "parent" variants of
+ * both.
+ *
+ * It is the caller's responsibility to handle heavyweight locking. Checking
+ * children from down-links has race conditions with only an AccessShareLock,
+ * since by design there is never an attempt to get a consistent view of
+ * multiple pages using multiple concurrent buffer locks.
+ */
+ static void
+ bt_do_page_verify(Relation rel, BlockNumber blockNum, bool childCheck)
+ {
+ Buffer buffer;
+ Page page;
+ OffsetNumber offset, maxoffset;
+ BTPageOpaque opaque;
+ int16 relnatts;
+
+ buffer = ReadBuffer(rel, blockNum);
+ LockBuffer(buffer, BT_READ);
+
+ /*
+ * We copy the page into local storage to avoid holding a buffer lock
+ * and pin any longer than we must; all inspection is then performed
+ * on the local copy.
+ */
+ page = palloc(BLCKSZ);
+ memcpy(page, BufferGetPage(buffer), BLCKSZ);
+
+ UnlockReleaseBuffer(buffer);
+
+ relnatts = rel->rd_rel->relnatts;
+
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ if (P_ISDELETED(opaque))
+ elog(NOTICE, "page %u of index \"%s\" is deleted", blockNum,
+ RelationGetRelationName(rel));
+
+ maxoffset = PageGetMaxOffsetNumber(page);
+
+ for (offset = P_FIRSTDATAKEY(opaque); offset <= maxoffset; offset++)
+ {
+ ItemId itemId;
+ IndexTuple itup;
+ ScanKey skey;
+ int32 eqs_hikey, eqs_item;
+
+ /*
+ * As noted in comments above _bt_compare(), there is special handling
+ * of the first data item on a non-leaf page. There is clearly no
+ * point in building a ScanKey of whatever data happens to be in this
+ * first data IndexTuple, since its contents are undefined. (This is
+ * because _bt_compare() is hard-coded to specially treat it as "minus
+ * infinity").
+ */
+ if (!P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque))
+ continue;
+
+ itemId = PageGetItemId(page, offset);
+
+ if (!ItemIdIsValid(itemId))
+ elog(Max(CHCKELEVEL, ERROR), "invalid itemId");
+
+ itup = (IndexTuple) PageGetItem(page, itemId);
+
+ /*
+ * Build ScanKey that relates to the current item/offset's value
+ */
+ skey = _bt_mkscankey(rel, itup);
+
+ eqs_hikey = eqs_item = 0;
+
+ /* If there is a high key, compare this item to it */
+ if (!P_RIGHTMOST(opaque))
+ eqs_hikey = _bt_compare(rel, relnatts, skey, page, P_HIKEY);
+
+ /*
+ * Check that items are stored on page in logical order.
+ *
+ * In each iteration, check that next item is equal to or greater than
+ * current item, until there is no next item (i.e. don't do this on the
+ * last iteration).
+ */
+ if (offset < maxoffset)
+ eqs_item = _bt_compare(rel, relnatts, skey, page, offset + 1);
+
+ /*
+ * The current iteration's ScanKey is expected to be less than or equal
+ * to the high key on this page, and less than or equal to the next
+ * item/offset.
+ */
+ if (eqs_hikey > 0 || eqs_item > 0)
+ {
+ char *hblk, *iblk, *niblk, *dump;
+
+ hblk = psprintf("(%u,%u)",
+ BlockIdGetBlockNumber(&(itup->t_tid.ip_blkid)),
+ itup->t_tid.ip_posid);
+
+ iblk = psprintf("(%u,%u)",
+ blockNum,
+ offset);
+
+ niblk = psprintf("(%u,%u)",
+ blockNum,
+ offset + 1);
+
+ dump = form_btree_data(itup);
+
+ if (eqs_hikey > 0)
+ elog(CHCKELEVEL, "high key invariant violated for index \"%s\" %s with data %s (points to: %s)",
+ RelationGetRelationName(rel), iblk, dump, hblk);
+ if (eqs_item > 0)
+ elog(CHCKELEVEL, "page order invariant violated for index \"%s\" (%s and %s) with data %s (points to: %s)",
+ RelationGetRelationName(rel), iblk, niblk, dump, hblk);
+ }
+
+ /*
+ * For internal pages, verify that child pages comport with the
+ * down-link in the parent page. This is safe against concurrent page
+ * splits because caller must have taken appropriate heavyweight locks
+ * for this case. If there is an old incomplete page split, detect
+ * that and handle it appropriately.
+ */
+ if (childCheck && !P_ISLEAF(opaque))
+ {
+ Page childPage;
+ BlockNumber childBlockNum;
+ Buffer childBuffer;
+ ScanKey childHkey;
+
+ childBlockNum = ItemPointerGetBlockNumber(&(itup->t_tid));
+
+ childBuffer = ReadBuffer(rel, childBlockNum);
+ LockBuffer(childBuffer, BT_READ);
+
+ /* Copy the page into local storage too */
+ childPage = palloc(BLCKSZ);
+ memcpy(childPage, BufferGetPage(childBuffer), BLCKSZ);
+
+ UnlockReleaseBuffer(childBuffer);
+
+ /*
+ * Verify child page that this down-link points to has the down-link
+ * key as a lower bound.
+ */
+ childHkey = verify_btree_items_le(rel, childPage, skey, itup);
+
+ if (childHkey && offset < maxoffset)
+ {
+ /* state for child page, and next data item/offset */
+ BlockNumber childRightLinkBlockNum,
+ parentNextOffsetBlockNum;
+ IndexTuple nIndTup;
+ ItemId nItemId;
+ BTPageOpaque childOpaque;
+ int32 res;
+
+ /*
+ * Since this isn't the last iteration, and there can't have
+ * been an incomplete page split on child page just inspected
+ * (since a child high key was returned), further check that
+ * next item on this page (the parent) is equal to child's high
+ * key.
+ */
+ childOpaque = (BTPageOpaque) PageGetSpecialPointer(childPage);
+ Assert(!P_RIGHTMOST(childOpaque));
+ Assert(!P_INCOMPLETE_SPLIT(childOpaque));
+
+ res = _bt_compare(rel, relnatts, childHkey, page, offset + 1);
+
+ if (res != 0)
+ {
+ char *iblk, *piblk;
+
+ /*
+ * Principally blame next item, not this item for problem,
+ * since it was the down-link that completed the split,
+ * which seems most plausible as a proximate cause.
+ */
+ iblk = psprintf("(%u,%u)",
+ blockNum,
+ offset + 1);
+ piblk = psprintf("(%u,%u)",
+ blockNum,
+ offset);
+
+ elog(CHCKELEVEL, "down-link in parent page in index \"%s\" %s is %s (not equal to) immediately prior item's %s child's (block %u) high key",
+ RelationGetRelationName(rel), iblk,
+ res < 0 ? "greater than":"less than",
+ piblk, childBlockNum);
+ }
+
+ /*
+ * Match block number of child's right link to next parent
+ * offset item's block number, too. This is similar to but
+ * slightly distinct from the child high key matching
+ * verification.
+ */
+ nItemId = PageGetItemId(page, offset + 1);
+
+ if (!ItemIdIsValid(nItemId))
+ elog(Max(CHCKELEVEL, ERROR), "invalid nItemId for next item");
+
+ childRightLinkBlockNum = childOpaque->btpo_next;
+ nIndTup = (IndexTuple) PageGetItem(page, nItemId);
+ parentNextOffsetBlockNum =
+ BlockIdGetBlockNumber(&(nIndTup->t_tid.ip_blkid));
+
+ if (childRightLinkBlockNum != parentNextOffsetBlockNum)
+ elog(CHCKELEVEL, "next item/offset down-link (%u) in parent page does not match child (block %u) right link block number",
+ parentNextOffsetBlockNum, childBlockNum);
+
+ pfree(childHkey);
+ }
+
+ pfree(childPage);
+ }
+ pfree(skey);
+
+ /* check for interrupts while we're not holding any buffer lock */
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ pfree(page);
+ }
+
+ /*
+ * Lock the heap before RangeVarGetRelidExtended() takes the index lock, to
+ * avoid deadlocks.
+ *
+ * If there was ever a requirement to check permissions, this would be the
+ * place to do it. Currently, we just insist upon superuser.
+ */
+ static void
+ rangevar_callback_for_childcheck(const RangeVar *relation,
+ Oid relId, Oid oldRelId, void *arg)
+ {
+ Oid *heapOid = (Oid *) arg;
+
+ /*
+ * If we previously locked some other index's heap, and the name we're
+ * looking up no longer refers to that relation, release the now-useless
+ * lock.
+ */
+ if (relId != oldRelId && OidIsValid(oldRelId))
+ {
+ /* lock level here should match reindex_index() heap lock */
+ UnlockRelationOid(*heapOid, ShareLock);
+ *heapOid = InvalidOid;
+ }
+
+ /* If the relation does not exist, there's nothing more to do. */
+ if (!OidIsValid(relId))
+ return;
+
+ /* Lock heap before index to avoid deadlock. */
+ if (relId != oldRelId)
+ {
+ /*
+ * Lock level here should match elsewhere. If the OID isn't valid, it
+ * means the index was concurrently dropped, which is not a problem for
+ * us; just return normally.
+ */
+ *heapOid = IndexGetRelation(relId, true);
+ if (OidIsValid(*heapOid))
+ LockRelationOid(*heapOid, ShareLock);
+ }
+ }
+
+ /*
+ * Verify that each item on page is greater than or equal to the down-link key
+ * in the parent. Don't assume that the items in the child page are in logical
+ * order as generally expected -- exhaustively check each and every data item,
+ * and not just the first.
+ *
+ * XXX: Presently, the checking of every item here is usually somewhat
+ * redundant. Revisit the question of how necessary this is in light of how
+ * the code is actually called from a higher level.
+ *
+ * Returns ScanKey that is initialized with high key value, that can be checked
+ * within parent against next offset/down-link. However, if there was an
+ * incomplete page split (and therefore no down-link in parent can reasonably
+ * be expected yet) or this is the right-most page (and therefore does not have
+ * a high key), just return NULL.
+ */
+ static ScanKey
+ verify_btree_items_le(Relation rel, Page childPage, ScanKey downLinkParent,
+ IndexTuple pItup)
+ {
+ OffsetNumber offset, maxoffset;
+ BTPageOpaque opaque;
+ int16 relnatts = rel->rd_rel->relnatts;
+
+ opaque = (BTPageOpaque) PageGetSpecialPointer(childPage);
+ maxoffset = PageGetMaxOffsetNumber(childPage);
+
+ for (offset = P_FIRSTDATAKEY(opaque); offset <= maxoffset; offset++)
+ {
+ int32 res;
+
+ /*
+ * No point in checking first data key on non-leaf page, since it's
+ * logically "minus infinity". All items on child page should be
+ * greater than or equal to supplied down-link ScanKey (so the first
+ * data item on a non-leaf page is only "minus infinity" in a way that
+ * relies on the very invariant now being verified -- it's not "minus
+ * infinity" in any general sense. It's "minus infinity" with respect
+ * to values less than the first "real" data item, but greater than or
+ * equal to the page's left-link's high key, iff there is a left link).
+ *
+ * Leaving aside discussion of invariant conditions, clearly there is
+ * no point in doing anything with the first data key on a non-leaf
+ * page because its data is undefined, as noted in comments above
+ * _bt_compare().
+ */
+ if (!P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque))
+ continue;
+
+ res = _bt_compare(rel, relnatts, downLinkParent, childPage, offset);
+
+ /*
+ * Parent down-link is lower bound. Report when this invariant
+ * condition has been violated.
+ */
+ if (res > 0)
+ {
+ char *dump;
+
+ dump = form_btree_data(pItup);
+
+ elog(CHCKELEVEL, "down-link low key invariant violated for index \"%s\" with data %s (points to: %u)",
+ RelationGetRelationName(rel), dump,
+ BlockIdGetBlockNumber(&(pItup->t_tid.ip_blkid)));
+ }
+ }
+
+ if (P_INCOMPLETE_SPLIT(opaque))
+ {
+ elog(NOTICE, "incomplete split in child page (block %u) found when following down-link",
+ BlockIdGetBlockNumber(&(pItup->t_tid.ip_blkid)));
+ }
+ else if (!P_RIGHTMOST(opaque))
+ {
+ /*
+ * Useful ScanKey for child page high key exists -- return it to caller
+ * for further checks on parent page
+ */
+ ItemId hKeyItemId;
+ IndexTuple hKeyItup;
+
+ hKeyItemId = PageGetItemId(childPage, P_HIKEY);
+
+ if (!ItemIdIsValid(hKeyItemId))
+ elog(Max(CHCKELEVEL, ERROR), "invalid hKeyItemId");
+
+ hKeyItup = (IndexTuple) PageGetItem(childPage, hKeyItemId);
+
+ return _bt_mkscankey(rel, hKeyItup);
+ }
+
+ return NULL;
+ }
+
+ /*
+ * Given a left-most block at some level, move right, verifying that left/right
+ * sibling links match as pages are passed over. It is the caller's
+ * responsibility to acquire appropriate heavyweight locks on the index
+ * relation.
+ *
+ * Moves from left to right, so there is no need to worry about concurrent page
+ * splits, as after a page split the original/left page has the same left-link
+ * as before. (And, for what it's worth, the page just under consideration has
+ * the same right link as before, so logically nothing is skipped. Everything
+ * moves right, and so is denied the opportunity to be skipped over). Page
+ * deletion receives no special consideration either, because the second stage
+ * of page deletion (where a half-dead leaf page is unlinked from its siblings)
+ * involves acquiring a lock on the left-sibling, on the deletion target
+ * itself, and on the right sibling, all concurrently and in that order.
+ * That's the only one of the two stages of deletion where side-links are
+ * updated.
+ *
+ * Returns left-most block number one level lower that should be passed on next
+ * level/call, or P_NONE when done. Caller should call with the true root page
+ * initially, and work their way down to level 0 (leaf page level).
+ *
+ * The return value may become stale (from, say, concurrent page deletion).
+ * When passed back here, this is handled in a similar though less invasive
+ * manner to _bt_moveright(). It may prove necessary to "move right" one or
+ * more times if the left-most page passed here is deleted or half dead, by
+ * telling caller about a new left-most page on the same level.
+ */
+ static BlockNumber
+ verify_leftright_level(Relation rel, BlockNumber leftMost)
+ {
+ Buffer buffer;
+ BTPageOpaque opaque;
+ Page page;
+ BlockNumber current, currentsLeft, next;
+
+ buffer = ReadBuffer(rel, leftMost);
+ LockBuffer(buffer, BT_READ);
+
+ /* as always, copy the page into local storage */
+ page = palloc(BLCKSZ);
+ memcpy(page, BufferGetPage(buffer), BLCKSZ);
+
+ UnlockReleaseBuffer(buffer);
+
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ if (P_IGNORE(opaque))
+ {
+ /*
+ * Handle race -- tell caller to try again with prior left-most's right
+ * sibling, in a manner not unlike _bt_moveright()
+ */
+ if (opaque->btpo_next == P_NONE)
+ elog(Max(CHCKELEVEL, ERROR), "fell off the end of index \"%s\"",
+ RelationGetRelationName(rel));
+ next = opaque->btpo_next;
+ pfree(page);
+ elog(NOTICE, "left-most page was found deleted or half dead");
+ return next;
+ }
+
+ if (!P_LEFTMOST(opaque))
+ elog(CHCKELEVEL, "page %u of index \"%s\" is not leftmost",
+ leftMost, RelationGetRelationName(rel));
+
+ /* remember down-link block to return */
+ if (P_ISLEAF(opaque))
+ {
+ next = P_NONE;
+ }
+ else
+ {
+ IndexTuple itup;
+ ItemId itemId;
+
+ itemId = PageGetItemId(page, P_FIRSTDATAKEY(opaque));
+
+ if (!ItemIdIsValid(itemId))
+ elog(Max(CHCKELEVEL, ERROR), "invalid itemId");
+
+ itup = (IndexTuple) PageGetItem(page, itemId);
+
+ next = ItemPointerGetBlockNumber(&(itup->t_tid));
+ }
+
+ currentsLeft = leftMost;
+ current = opaque->btpo_next;
+
+ while (!P_RIGHTMOST(opaque))
+ {
+ buffer = ReadBuffer(rel, current);
+ LockBuffer(buffer, BT_READ);
+
+ memcpy(page, BufferGetPage(buffer), BLCKSZ);
+
+ UnlockReleaseBuffer(buffer);
+
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ if (P_ISDELETED(opaque))
+ elog(NOTICE, "page %u of index \"%s\" is deleted",
+ current, RelationGetRelationName(rel));
+
+ if (currentsLeft != opaque->btpo_prev)
+ {
+ if (!P_ISDELETED(opaque))
+ elog(CHCKELEVEL, "left link/right link pair don't comport at level %u, block %u, last: %u, current left: %u",
+ opaque->btpo.level, current, currentsLeft, opaque->btpo_prev);
+ else /* level unavailable */
+ elog(CHCKELEVEL, "left link/right link pair don't comport, deleted block %u, last: %u, current left: %u",
+ current, currentsLeft, opaque->btpo_prev);
+ }
+
+ currentsLeft = current;
+ current = opaque->btpo_next;
+
+ /* check for interrupts while we're not holding any buffer lock */
+ CHECK_FOR_INTERRUPTS();
+ }
+
+ pfree(page);
+
+ return next;
+ }
+
+ /*
+ * Given an IndexTuple, form hex cstring of data for error reporting purposes
+ */
+ static char *
+ form_btree_data(IndexTuple itup)
+ {
+ char *dump, *odump;
+ char *ptr;
+ int off;
+ int dlen;
+
+ dlen = IndexTupleSize(itup) - IndexInfoFindDataOffset(itup->t_info);
+ ptr = (char *) itup + IndexInfoFindDataOffset(itup->t_info);
+ dump = palloc0(dlen * 3 + 1);
+ odump = dump;
+
+ for (off = 0; off < dlen; off++)
+ {
+ if (off > 0)
+ *dump++ = ' ';
+ sprintf(dump, "%02x", *(ptr + off) & 0xff);
+ dump += 2;
+ }
+
+ return odump;
+ }
*** a/contrib/btreecheck/btreecheck.control
--- b/contrib/btreecheck/btreecheck.control
***************
*** 0 ****
--- 1,5 ----
+ # btreecheck extension
+ comment = 'verify the integrity of btree indexes at a low level'
+ default_version = '1.0'
+ module_pathname = '$libdir/btreecheck'
+ relocatable = true
On Mon, Jun 16, 2014 at 9:47 PM, Peter Geoghegan <pg@heroku.com> wrote:
As discussed at the developer meeting at pgCon, I think that there is
a lot to be said for a tool that checks nbtree index invariants on
live systems.
Me too.
Attached prototype patch adds contrib extension, btreecheck.
I don't feel qualified to comment on any of the substantive issues you
raise, so instead I'd like to bikeshed the name. I suggest that we
create one extension to be a repository for index-checking machinery
(and perhaps also heap-checking machinery) and view this as the first
of possibly several checkers to live there. Otherwise, we may
eventually end up with separate extensions for btreecheck, hashcheck,
gistcheck, gincheck, spgistcheck, minmaxcheck, vodkacheck, heapcheck,
toastcheck, etc. which seems like excessive namespace pollution.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Jun 17, 2014 at 1:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I don't feel qualified to comment on any of the substantive issues you
raise, so instead I'd like to bikeshed the name. I suggest that we
create one extension to be a repository for index-checking machinery
(and perhaps also heap-checking machinery) and view this as the first
of possibly several checkers to live there. Otherwise, we may
eventually end up with separate extensions for btreecheck, hashcheck,
gistcheck, gincheck, spgistcheck, minmaxcheck, vodkacheck, heapcheck,
toastcheck, etc. which seems like excessive namespace pollution.
I agree.
I hope that we'll eventually be able to move code like this into each
AM, with something like an amverifyintegrity pg_am entry optionally
provided. There'd also be a utility statement that would perform this
kind of verification. It seems natural to do this, as the patch I've
posted arguably adds a big modularity violation. Besides, it seems
worthwhile to pepper the regular regression tests with calls like
these, at least in some places, and putting something in core is the
most convenient way to do that.
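To illustrate the kind of regression-test call I have in mind (the function name here is purely hypothetical; the patch's actual SQL-callable function names may differ):

```sql
-- Hypothetical example only; bt_index_verify is an assumed name,
-- not necessarily what btreecheck exposes.
CREATE EXTENSION btreecheck;
CREATE INDEX test_idx ON test_table (val);
SELECT bt_index_verify('test_idx'::regclass);  -- raises on invariant violation
```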
--
Peter Geoghegan
On Tue, Jun 17, 2014 at 5:10 PM, Peter Geoghegan <pg@heroku.com> wrote:
On Tue, Jun 17, 2014 at 1:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I don't feel qualified to comment on any of the substantive issues you
raise, so instead I'd like to bikeshed the name. I suggest that we
create one extension to be a repository for index-checking machinery
(and perhaps also heap-checking machinery) and view this as the first
of possibly several checkers to live there. Otherwise, we may
eventually end up with separate extensions for btreecheck, hashcheck,
gistcheck, gincheck, spgistcheck, minmaxcheck, vodkacheck, heapcheck,
toastcheck, etc. which seems like excessive namespace pollution.
I agree.
I hope that we'll eventually be able to move code like this into each
AM, with something like an amverifyintegrity pg_am entry optionally
provided. There'd also be a utility statement that would perform this
kind of verification. It seems natural to do this, as the patch I've
posted arguably adds a big modularity violation. Besides, it seems
worthwhile to pepper the regular regression tests with calls like
these, at least in some places, and putting something in core is the
most convenient way to do that.
I think there's something to be said for that, but I think at the
moment I like the idea of a functional interface better. The reason
is that I'm not sure we can predict all of the checks we're going to
want to add. For example, maybe somebody will come up with another
btree checker that's different from your btree checker, and maybe
there will be good reasons not to merge the two - e.g. different
locking requirements, or different runtimes, or whatever. Even more
likely, I think there will be things we want to do that fall under the
broad umbrella of integrity checking that we just can't predict now:
scan the table and check whether all the keys are present in every
index, scan the TOAST table and make sure that all the chunks are the
right size, etc. If we have a bunch of functions for this sort of
thing, it's easy and relatively uncontroversial to add more. If we
put it in core and give it real grammar support, then we've got to
fight with keywords and contemplate grammar bloat and, basically, I
think every change will be likely to get a higher level of scrutiny.
I'd rather not go there.
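For what it's worth, a very crude version of the "all keys present in every index" check can already be approximated from SQL (table, column, and index names here are hypothetical); a purpose-built checker would of course do much better:

```sql
-- Crude cross-check sketch: count the same rows with and without
-- index use.  The two counts should agree; a discrepancy hints at
-- missing or extra index entries.
SET enable_seqscan = off;                 -- steer the planner to the index
SELECT count(*) FROM some_table WHERE id IS NOT NULL;
SET enable_seqscan = on;
SET enable_indexscan = off;
SET enable_bitmapscan = off;
SELECT count(*) FROM some_table WHERE id IS NOT NULL;  -- heap-driven count
```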
Now, we could. We could come up with an extensible syntax, like this:
CHECK relation [ USING { checktype [ '(' arg [, ...] ')' ] } [, ...] ];
But frankly I'm kind of uncompelled by that. This isn't a feature
that seems to me to really need to be in core. It doesn't
particularly need grammar support, and it doesn't need WAL logging,
and not everyone needs it at all, and especially if we eventually end
up with a robust suite of tools in this area, not everyone may even
want it, if it means a bigger installed footprint or more security
concerns to worry about.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jun 17, 2014 at 5:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I think there's something to be said for that, but I think at the
moment I like the idea of a functional interface better. The reason
is that I'm not sure we can predict all of the checks we're going to
want to add.
That's true. Clearly this is very hand-wavy, and as I said I think
it's appropriate that the tool evolve in response to our testing
requirements, but I would like to figure out a way to have
verification tooling available for some kind of semi-standard tests,
possibly even including the standard regression tests. It's probably
too early to discuss, though.
Now, we could. We could come up with an extensible syntax, like this:
CHECK relation [ USING { checktype [ '(' arg [, ...] ')' ] } [, ...] ];
That's what I had in mind. Using the same trick that you came up with
for EXPLAIN, to avoid grammar bloat and let the am figure out for
itself what to name the various check types, with a generic default
check.
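Concretely, the EXPLAIN-style parenthesized options might make invocations look something like this (purely illustrative; none of these statement forms or option names exist):

```sql
-- Hypothetical syntax sketch, modeled on EXPLAIN (ANALYZE, BUFFERS):
CHECK INDEX some_index;                               -- generic default check
CHECK INDEX some_index (sibling_links, downlinks off);
CHECK TABLE some_table (toast, visibility);
```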
--
Peter Geoghegan
* Peter Geoghegan (pg@heroku.com) wrote:
On Tue, Jun 17, 2014 at 5:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Now, we could. We could come up with an extensible syntax, like this:
CHECK relation [ USING { checktype [ '(' arg [, ...] ')' ] } [, ...] ];
That's what I had in mind. Using the same trick that you came up with
for EXPLAIN, to avoid grammar bloat and let the am figure out for
itself what to name the various check types, with a generic default
check.
I'm fine with having these start out as external tools which are doing
checks, but I've been specifically asked about (and have desired myself
from time-to-time...) an in-core capability to check index/heap/etc
validity. Folks coming from certain other RDBMS's find it amazing that
we don't have any support for that when what they really want is a
background worker which is just automatically going around doing these
checks.
Now, perhaps we could have the background worker without the syntax for
running these by hand, but I don't particularly like that idea. Being
able to run these checks by hand is extremely useful and I'd much prefer
to be able to force that than to have some mechanism where I have to
submit a request for a check to another process through a queue or
something along those lines.
Thanks,
Stephen
On Wed, Jun 18, 2014 at 4:48 AM, Stephen Frost <sfrost@snowman.net> wrote:
I'm fine with having these start out as external tools which are doing
checks, but I've been specifically asked about (and have desired myself
from time-to-time...) an in-core capability to check index/heap/etc
validity. Folks coming from certain other RDBMS's find it amazing that
we don't have any support for that when what they really want is a
background worker which is just automatically going around doing these
checks.
Yes, I think that being able to verify the integrity of index and heap
relations is an important feature, and I think it's notable that we're
the only major RDBMS that lacks this support. I'm not quite sure what
that should entail just yet, but clearly it should be somewhat like
what I have here. I think the big open questions to make something
like this work are mostly around the cost/benefit ratio of each of the
checks I've outlined. Certainly, for that use case minimizing
disruption on a live system becomes even more important. I'll probably
look at a buffer access strategy, just to give an example of that off
the top of my head.
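If that pans out, the checker's page-read path might route through the existing bufmgr ring-buffer support, along these lines (a sketch only -- backend code, not runnable standalone, with rel, current, and page as in the patch):

```c
/*
 * Sketch: read pages through a ring buffer so a whole-index pass
 * doesn't evict the rest of shared_buffers.  GetAccessStrategy(),
 * ReadBufferExtended(), and FreeAccessStrategy() are existing bufmgr
 * APIs; the rest follows the patch's read-copy-release shape.
 */
BufferAccessStrategy strategy = GetAccessStrategy(BAS_BULKREAD);
Buffer		buffer;

buffer = ReadBufferExtended(rel, MAIN_FORKNUM, current, RBM_NORMAL, strategy);
LockBuffer(buffer, BT_READ);
memcpy(page, BufferGetPage(buffer), BLCKSZ);
UnlockReleaseBuffer(buffer);
FreeAccessStrategy(strategy);
```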
--
Peter Geoghegan
--On 16. Juni 2014 18:47:30 -0700 Peter Geoghegan <pg@heroku.com> wrote:
Attached prototype patch adds contrib extension, btreecheck. This
extension provides SQL-callable functions for checking these
conditions on nbtree indexes on live systems.
What's the current state of this module? I see you are interested in stress
testing, but I'm not sure how far along this has gotten.
This tool actually did a very good job of identifying index
corruption due to collation issues[1].
[1]: </messages/by-id/CB4D1C6BAA80CF146CB0D4F2@eje.credativ.lan>
--
Thanks
Bernd
On Wed, Oct 7, 2015 at 2:59 AM, Bernd Helmle <mailings@oopsware.de> wrote:
What's the current state of this module? I see you are interested in stress
testing, but I'm not sure how far along this has gotten.
This tool actually did a very good job of identifying index
corruption due to collation issues[1].
I tweaked it a bit. I actually don't really know if I'll pick it up
anytime soon. I'm glad that it helped you, but the goals for the tool
need to be thrashed out a bit more. We should discuss it again.
--
Peter Geoghegan