amcheck (B-Tree integrity checking tool)

Started by Peter Geogheganalmost 10 years ago89 messages

pg@heroku.com

almost 10 years ago

1 attachment(s)

I was assigned an "action point" during the FOSDEM developer meeting:
"Post new version of btree consistency checker patch". I attach a new
WIP version of my consistency checker tool, amcheck. This patch is
proposed for 9.6, as an extension in contrib -- maybe we can still get
it in. This is the first time I've added any version of this to a
commitfest, although I've posted a couple of rough versions of this in
the past couple of years. The attached version has received a major
overhaul, and is primarily aimed at finding corruption in production
systems, although I think it will still have significant value for
debugging too. Maybe it can help with some of the B-Tree patches in
the final commitfest, for example. I also have some hope that it will
become a learning tool for people interested in how B-Tree indexes
work.

To recap, the extension adds some SQL-callable functions that verify
certain invariant conditions hold within some particular B-Tree index.
These are the conditions that index scans rely on always being true.
The tool's scope may eventually cover other AMs, including heapam, but
nbtree seems like the best place to start.

Note that no function currently checks that the index is consistent
with the heap, which would be very useful (that's probably how I'd
eventually target the heapam, actually).

Invariants
========

nbtree invariants that the tool verifies with just an AccessShareLock
on the relation are:

* That all items are in the correct, opclass order on each page.

* That the page "high key", if any, actually bounds the items on the page.

* That the last item on a page is less than or equal to the first item
on the next page (the page to its right). The idea here is that the
key space spans multiple pages, not just one page, so it make sense to
check the last item where we can.

With an ExclusiveLock + ShareLock, some addition invariants are verified:

* That child pages actually have their parent's downlink as a lower bound.

* Sane right links and left links at each level.

Obviously, this tool is all about distrusting the structure of a
B-Tree index. That being the case, it's not always totally clear where
to draw the line. I think I have the balance about right, though.

Interface
=======

There are only 2 SQL callable functions in the extension, which are
very similar:

bt_index_check(index regclass)

bt_index_parent_check(index regclass)

The latter is more thorough than the former -- it performs all checks,
including those checks that I mentioned required an ExclusiveLock. So,
bt_index_check() only requires an AccessShareLock.
bt_index_parent_check() requires an ExclusiveLock on the index
relation, and a ShareLock on its heap relation, almost like REINDEX.
bt_index_parent_check() performs verification that is a superset of
the verification performed by bt_index_check() -- mostly, the extra
verification/work is that it verifies downlinks against child pages.

Both functions raise an error in the event of observing that an
invariant in a B-Tree was violated, such as items being out of order
on a page. I've written extensive documentation, which goes into
practical aspects of using amcheck effectively. It doesn't go into
significant detail about every invariant that is checked, but gives a
good idea of roughly what checks are performed.

I could almost justify only having one function with an argument about
the downlink/child verification, but that would be a significant
footgun given the varying locking requirements that such a function
would have.

Locking
======

We never rely on something like holding on to a buffer pin as an
interlock for correctness (the vacuum interlock thing isn't generally
necessary, because we don't look at the heap at all). We simply pin +
BT_READ lock a buffer, copy it into local memory allocated by
palloc(), and then immediately release the buffer lock and drop the
pin. This is the same in all instances. There is never more than one
buffer lock or pin held at a time.

We do, on the other hand, have a detailed rationale for why it's okay
that we don't have an ExclusiveLock on the index relation for checks
that span the key space of more than one page by following right links
to compare items across sibling pages. This isn't the same thing as
having an explicit interlock like a pin -- our interlock is one
against *recycling* by vacuum, which is based on recentGlobalXmin.
This rationale requires expert review.

Performance
==========

Trying to keep the tool as simple as possible, while still making it
do verification that is as useful as possible was my priority here,
not performance. Still, verification completes fairly quickly.
Certainly, it takes far less time than having to REINDEX the index,
and doesn't need too much memory. I think that in practice most
problems that can be detected by the B-Tree checker functions will be
detected with the lighter variant.

--
Peter Geoghegan

Attachments:

0001-Add-amcheck-extension-to-contrib.patchtext/x-patch; charset=US-ASCII; name=0001-Add-amcheck-extension-to-contrib.patchDownload

From 573810d8d3c994ce1a16ecffb2f5d208c0ff93e3 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 10 Jun 2014 22:20:40 -0700
Subject: [PATCH] Add amcheck extension to contrib

The new extension makes available two SQL-callable functions, each of
which accept a single argument (a regclass/nbtree index relation
argument) and return void.

The SQL-callable function bt_index_check() checks invariants of the
target index by walking the tree, performing various checks of the
sanity of the page space using insertion scankeys to compare items.  The
SQL-callable function bt_index_parent_check() performs a superset of the
checks performed by bt_index_check(): the same checks, as well as
another check verifying that each downlink is actually respected as a
lower bound on its child page.

bt_index_check() requires only an AccessShareLock on the target.
bt_index_parent_check() requires an ExclusiveLock on the target,
and ShareLock on its heap relation.
---
 contrib/amcheck/Makefile         |   19 +
 contrib/amcheck/amcheck--1.0.sql |   20 +
 contrib/amcheck/amcheck.c        | 1117 ++++++++++++++++++++++++++++++++++++++
 contrib/amcheck/amcheck.control  |    5 +
 doc/src/sgml/amcheck.sgml        |  244 +++++++++
 doc/src/sgml/contrib.sgml        |    1 +
 doc/src/sgml/filelist.sgml       |    1 +
 7 files changed, 1407 insertions(+)
 create mode 100644 contrib/amcheck/Makefile
 create mode 100644 contrib/amcheck/amcheck--1.0.sql
 create mode 100644 contrib/amcheck/amcheck.c
 create mode 100644 contrib/amcheck/amcheck.control
 create mode 100644 doc/src/sgml/amcheck.sgml

diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
new file mode 100644
index 0000000..7bb29aa
--- /dev/null
+++ b/contrib/amcheck/Makefile
@@ -0,0 +1,19 @@
+# contrib/amcheck/Makefile
+
+MODULE_big	= amcheck
+OBJS		= amcheck.o $(WIN32RES)
+
+EXTENSION = amcheck
+DATA = amcheck--1.0.sql
+PGFILEDESC = "amcheck - functions to verify access method invariants"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/amcheck
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/amcheck/amcheck--1.0.sql b/contrib/amcheck/amcheck--1.0.sql
new file mode 100644
index 0000000..ebbd6ac
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0.sql
@@ -0,0 +1,20 @@
+/* contrib/amcheck/amcheck--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION amcheck" to load this file. \quit
+
+--
+-- bt_index_check()
+--
+CREATE FUNCTION bt_index_check(index regclass)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT;
+
+--
+-- bt_index_parent_check()
+--
+CREATE FUNCTION bt_index_parent_check(index regclass)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT;
diff --git a/contrib/amcheck/amcheck.c b/contrib/amcheck/amcheck.c
new file mode 100644
index 0000000..44bc7bb
--- /dev/null
+++ b/contrib/amcheck/amcheck.c
@@ -0,0 +1,1117 @@
+/*-------------------------------------------------------------------------
+ *
+ * amcheck.c
+ *		Verifies the integrity of access methods based on invariants.
+ *
+ * Provides SQL-callable functions for verifying that various logical
+ * invariants in the structure of index access methods are respected.  This
+ * includes, for example, the invariant that each page in the target B-Tree
+ * index has "real" items in logical order as reported by an insertion scankey
+ * (the insertion scankey sort-wise NULL semantics are useful for
+ * verification).
+ *
+ *
+ * Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  contrib/amcheck/amcheck.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/transam.h"
+#include "catalog/index.h"
+#include "catalog/pg_am.h"
+#include "commands/tablecmds.h"
+#include "miscadmin.h"
+#include "storage/lmgr.h"
+#include "utils/memutils.h"
+#include "utils/snapmgr.h"
+
+
+PG_MODULE_MAGIC;
+
+#define CHECK_SUPERUSER() { \
+		if (!superuser()) \
+			ereport(ERROR, \
+					(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
+					 (errmsg("must be superuser to use verification functions")))); }
+
+/*
+* As noted in comments above _bt_compare(), there is special handling of the
+* first data item (that is, the first item with a valid downlink -- not the
+* high key item) on a non-leaf (internal) page.  There is clearly no point in
+* having amcheck functions make any comparison of or against these "minus
+* infinity" items, because they contain no actual information other than the
+* downlink.
+*/
+#define OFFSET_IS_MINUS_INFINITY(opaque, offset) \
+	(!P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque))
+
+/*
+ * Callers to verification functions should never receive a false positive
+ * indication of corruption.  Therefore, when using amcheck functions for
+ * stress testing, it may be useful to temporally change the CORRUPTION elevel
+ * to PANIC, to immediately halt the server in the event of detecting an
+ * invariant condition violation.  This may preserve more information about the
+ * nature of the underlying problem.  Note that modifying the CORRUPTION
+ * constant to be an elevel < ERROR is not well tested.
+ */
+#ifdef NOT_USED
+#define CORRUPTION		PANIC
+#define CONCERN			WARNING
+#else
+#define CORRUPTION		ERROR
+#define CONCERN			INFO
+#endif
+
+/*
+ * A B-Tree cannot possibly have this many levels, since there must be one
+ * block per page, and that is bound by the range of BlockNumber:
+ */
+#define InvalidBtreeLevel	((uint32) InvalidBlockNumber)
+
+/*
+ * State associated with verifying a B-Tree index
+ */
+typedef struct BtreeCheckState
+{
+	/*
+	 * Unchanging state, established at start of verification:
+	 *
+	 * rel:                 B-Tree Index Relation
+	 * exclusivelock:       ExclusiveLock held on rel; else AccessShareLock
+	 * checkstrategy:       Buffer access strategy
+	 * targetcontext:       Per-target-page memory context
+	 */
+	Relation				rel;
+	bool					exclusivelock;
+	BufferAccessStrategy 	checkstrategy;
+	MemoryContext			targetcontext;
+
+	/*
+	 * Mutable state, for verification of particular page:
+	 *
+	 * target:              Main target page
+	 * targetblock:         Main target block number of page
+	 *
+	 * target is the point of reference for a verification operation.
+	 *
+	 * Other B-Tree pages may be allocated, but those are always auxiliary
+	 * (e.g. they are target's child pages).  Conceptually, only the target
+	 * page is checked.  Each page found by verification's left/right,
+	 * top/bottom scan becomes the target once.
+	 *
+	 * Memory is managed by resetting targecontext after verification of some
+	 * target page finishes (possibly including target verification that
+	 * depends on non-target page state).
+	 */
+	Page					target;
+	BlockNumber				targetblock;
+	XLogRecPtr				targetlsn;
+
+} BtreeCheckState;
+
+/*
+ * Starting point for verifying an entire B-Tree index level
+ */
+typedef struct BtreeLevel
+{
+	/* Level number (0 is leaf page level). */
+	uint32			level;
+
+	/* Left most block on level.  Scan of level begins here. */
+	BlockNumber		leftmost;
+
+	/* Is this level reported as "true" root level by meta page? */
+	bool			istruerootlevel;
+} BtreeLevel;
+
+PG_FUNCTION_INFO_V1(bt_index_check);
+PG_FUNCTION_INFO_V1(bt_index_parent_check);
+
+static void btree_index_checkable(Relation rel);
+static void bt_check_every_level(Relation rel, bool exclusivelock);
+static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
+											   BtreeLevel level);
+static void bt_target_page_check(BtreeCheckState *state);
+static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
+static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
+							  ScanKey targetkey);
+static bool invariant_key_less_than_equal_offset(BtreeCheckState *state,
+												 ScanKey key,
+												 OffsetNumber upperbound);
+static bool invariant_key_greater_than_equal_offset(BtreeCheckState *state,
+													ScanKey key,
+													OffsetNumber lowerbound);
+static bool invariant_key_less_than_equal_nontarget_offset(BtreeCheckState *state,
+														   Page other,
+														   ScanKey key,
+														   OffsetNumber upperbound);
+static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
+
+/*
+ * bt_index_check(index regclass)
+ *
+ * Verify integrity of B-Tree index.
+ *
+ * Only acquires AccessShareLock on index relation.  Does not consider
+ * invariants that exist between parent/child pages.
+ */
+Datum
+bt_index_check(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	Relation	indrel;
+
+	CHECK_SUPERUSER();
+
+	indrel = relation_open(relid, AccessShareLock);
+
+	/* Relation suitable for checking as B-Tree? */
+	btree_index_checkable(indrel);
+
+	/* Check index */
+	bt_check_every_level(indrel, false);
+
+	relation_close(indrel, AccessShareLock);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * bt_index_parent_check(index regclass)
+ *
+ * Verify integrity of B-Tree index.
+ *
+ * Acquires ExclusiveLock on index relation, and ShareLock on the associated
+ * heap relation, a bit like REINDEX.  Verifies that downlinks in parent pages
+ * are valid lower bounds on child pages.
+ */
+Datum
+bt_index_parent_check(PG_FUNCTION_ARGS)
+{
+	Oid			indrelid = PG_GETARG_OID(0);
+	Oid			heapid;
+	Relation	indrel;
+	Relation	heaprel;
+
+	CHECK_SUPERUSER();
+
+	/*
+	 * We must lock table before index to avoid deadlocks.  However, if the
+	 * passed indrelid isn't an index then IndexGetRelation() will fail.
+	 * Rather than emitting a not-very-helpful error message, postpone
+	 * complaining, expecting that the is-it-an-index test below will fail.
+	 */
+	heapid = IndexGetRelation(indrelid, true);
+	if (OidIsValid(heapid))
+		heaprel = heap_open(heapid, ShareLock);
+	else
+		heaprel = NULL;
+
+	/*
+	 * Open the target index relations separately (like relation_openrv(), but
+	 * with heap relation locked first to prevent deadlocking).  In hot standby
+	 * mode this will raise an error.
+	 */
+	indrel = index_open(indrelid, ExclusiveLock);
+
+	/* Check for active uses of the index in the current transaction */
+	CheckTableNotInUse(indrel, "bt_index_parent_check");
+
+	/* Relation suitable for checking as B-Tree? */
+	btree_index_checkable(indrel);
+
+	/* Check index */
+	bt_check_every_level(indrel, true);
+
+	index_close(indrel, ExclusiveLock);
+	if (heaprel)
+		heap_close(heaprel, ShareLock);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * btree_index_checkable()
+ *
+ * Basic checks about the suitability of a relation for checking as a B-Tree
+ * index.
+ */
+static void
+btree_index_checkable(Relation rel)
+{
+	if (rel->rd_rel->relkind != RELKIND_INDEX ||
+		rel->rd_rel->relam != BTREE_AM_OID)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("only nbtree access method indexes are supported"),
+				 errdetail("index \"%s\" does not using the nbtree access method.",
+						   RelationGetRelationName(rel))));
+
+	if (RELATION_IS_OTHER_TEMP(rel))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot access temporary tables of other sessions"),
+				 errdetail("index \"%s\" is associated with temporary relation.",
+						   RelationGetRelationName(rel))));
+
+	if (!rel->rd_index->indisready)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot check index \"%s\"",
+						RelationGetRelationName(rel)),
+				 errdetail("index is not yet ready for insertions")));
+
+	if (!rel->rd_index->indisvalid)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot check index \"%s\"",
+						RelationGetRelationName(rel)),
+				 errdetail("index is not valid")));
+}
+
+/*
+ * bt_check_every_level()
+ *
+ * Main entry point for B-Tree SQL-callable functions.  Walks the B-Tree in
+ * logical order, verifying invariants as it goes.
+ *
+ * It is the caller's responsibility to acquire appropriate heavyweight lock on
+ * the index relation, and advise us if extra checks are safe when an
+ * ExclusiveLock is held.  An ExclusiveLock is generally assumed to prevent any
+ * kind of physically modification to the index structure, including
+ * modifications that VACUUM may make.
+ */
+static void
+bt_check_every_level(Relation rel, bool exclusivelock)
+{
+	BtreeCheckState	   *state;
+	Page				metapage;
+	BTMetaPageData	   *metad;
+	uint32				previouslevel;
+	BtreeLevel			current;
+
+	/*
+	 * RecentGlobalXmin assertion matches index_getnext_tid().  See note on
+	 * RecentGlobalXmin/B-Tree page deletion.
+	 */
+	Assert(TransactionIdIsValid(RecentGlobalXmin));
+
+	/*
+	 * Initialize state for entire verification operation
+	 */
+	state = palloc(sizeof(BtreeCheckState));
+	state->rel = rel;
+	state->exclusivelock = exclusivelock;
+	state->checkstrategy = GetAccessStrategy(BAS_BULKREAD);
+	/* Create context for page */
+	state->targetcontext = AllocSetContextCreate(CurrentMemoryContext,
+												 "amcheck page data",
+												 ALLOCSET_DEFAULT_MINSIZE,
+												 ALLOCSET_DEFAULT_INITSIZE,
+												 ALLOCSET_DEFAULT_MAXSIZE);
+
+	/* Get true root block from meta-page */
+	metapage = palloc_btree_page(state, BTREE_METAPAGE);
+	metad = BTPageGetMeta(metapage);
+
+	/*
+	 * Certain deletion patterns can result in "skinny" B-Tree indexes, where
+	 * the fast root and true root differ.
+	 *
+	 * Start from the true root, not the fast root, unlike conventional index
+	 * scans.  This approach is more thorough, and removes the risk of
+	 * following a stale fast root from the meta page.
+	 */
+	if (metad->btm_fastroot != metad->btm_root)
+		ereport(CONCERN,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				 errmsg("fast root mismatch in index %s",
+						RelationGetRelationName(rel)),
+				 errdetail_internal("Fast block %u (level %u) "
+									"differs from true root "
+									"block %u (level %u).",
+									metad->btm_fastroot, metad->btm_fastlevel,
+									metad->btm_root, metad->btm_level)));
+
+	/*
+	 * Starting at the root, verify every level.  Move left to right, top to
+	 * bottom.  Note that there may be no pages other than the meta page (meta
+	 * page can indicate that root is P_NONE when the index is totally empty).
+	 */
+	previouslevel = InvalidBtreeLevel;
+	current.level = metad->btm_level;
+	current.leftmost = metad->btm_root;
+	current.istruerootlevel =  true;
+	while (current.leftmost != P_NONE)
+	{
+		/*
+		 * Verify this level, and get left most page for next level down, if
+		 * not at leaf level
+		 */
+		current = bt_check_level_from_leftmost(state, current);
+
+		if (current.leftmost == InvalidBlockNumber)
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("index \"%s\" has no valid pages on level below %u or first level",
+							RelationGetRelationName(rel), previouslevel)));
+
+		/*
+		 * Even when concurrent page splits are possible, we should reliably
+		 * end up on level immediately below that of last iteration.  In
+		 * general, it's impossible for the height of the tree to decrease.
+		 */
+		if (previouslevel != InvalidBtreeLevel &&
+			previouslevel - 1 != current.level)
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("index \"%s\" is missing expected level %u",
+							RelationGetRelationName(rel),
+							previouslevel - 1)));
+
+		previouslevel = current.level;
+	}
+
+	if (metad->btm_root != P_NONE && current.level != 0)
+		ereport(CORRUPTION,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("index \"%s\" has internal pages but lacks leaf pages",
+						RelationGetRelationName(rel))));
+
+	/* Be tidy: */
+	MemoryContextDelete(state->targetcontext);
+}
+
+/*
+ * bt_check_level_from_leftmost()
+ *
+ * Given a left-most block at some level, move right, verifying each page
+ * individually (with more verification across pages for "exclusivelock"
+ * callers).  Caller should pass the true root page as the leftmost initially,
+ * working their way down by passing what is returned for the last call here
+ * until level 0 (leaf page level) was reached.
+ *
+ * Returns state for next call, if any.  This includes left-most block number
+ * one level lower that should be passed on next level/call, or P_NONE leaf
+ * level is checked.  Level numbers follow the nbtree convention: higher levels
+ * have higher numbers, because new levels are added only due to a root page
+ * split.  Note that prior to the first root page split, the root is also a
+ * leaf page.  This means that there is always a level 0, and it's always the
+ * last level processed.
+ *
+ * Note on memory management:  State's per-page context is reset here, between
+ * each call to bt_target_page_check().
+ */
+static BtreeLevel
+bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
+{
+	/* State to establish early, concerning entire level */
+	BTPageOpaque	opaque;
+	MemoryContext 	oldcontext;
+	BtreeLevel		nextleveldown;
+
+	/* Variables for iterating across level using right links */
+	BlockNumber		leftcurrent = P_NONE;
+	BlockNumber		current = level.leftmost;
+
+	/* Initialize return state */
+	nextleveldown.leftmost = InvalidBlockNumber;
+	nextleveldown.level = InvalidBtreeLevel;
+	nextleveldown.istruerootlevel = false;
+
+	/* Use page-level context for duration of this call */
+	oldcontext = MemoryContextSwitchTo(state->targetcontext);
+
+	do
+	{
+		/* Don't rely on CHECK_FOR_INTERRUPTS() calls at lower level */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Initialize state for this iteration */
+		state->targetblock = current;
+		state->target = palloc_btree_page(state, state->targetblock);
+		state->targetlsn = PageGetLSN(state->target);
+
+		opaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+
+		if (P_IGNORE(opaque))
+		{
+			if (P_RIGHTMOST(opaque))
+				ereport(CORRUPTION,
+						(errcode(ERRCODE_INDEX_CORRUPTED),
+						 errmsg("block %u fell off the end of index \"%s\"",
+								current, RelationGetRelationName(state->rel))));
+			else
+				ereport(CONCERN,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("block %u of index \"%s\" ignored",
+								current, RelationGetRelationName(state->rel))));
+			goto nextpage;
+		}
+		else if (nextleveldown.leftmost == InvalidBlockNumber)
+		{
+			/*
+			 * A concurrent page split could make the caller supplied leftmost
+			 * block no longer contain the leftmost page, or no longer be the
+			 * true root, but where that isn't possible due to heavyweight
+			 * locking, check that the first valid page meets caller's
+			 * expectations.
+			 */
+			if (state->exclusivelock)
+			{
+				if (!P_LEFTMOST(opaque))
+					ereport(CORRUPTION,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("block %u is not leftmost in index \"%s\"",
+									current, RelationGetRelationName(state->rel))));
+
+				if (level.istruerootlevel && !P_ISROOT(opaque))
+					ereport(CORRUPTION,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("block %u is not root in index \"%s\"",
+									current, RelationGetRelationName(state->rel))));
+			}
+
+			/*
+			 * Before beginning any non-trivial examination of level, establish
+			 * next level down's leftmost block number, which next call here
+			 * will pass as its leftmost (iff this isn't leaf level).
+			 */
+			if (!P_ISLEAF(opaque))
+			{
+				IndexTuple		itup;
+				ItemId			itemid;
+
+				/* Internal page -- downlink gets leftmost on next level */
+				itemid = PageGetItemId(state->target, P_FIRSTDATAKEY(opaque));
+				itup = (IndexTuple) PageGetItem(state->target, itemid);
+				nextleveldown.leftmost = ItemPointerGetBlockNumber(&(itup->t_tid));
+			}
+			else
+			{
+				/*
+				 * Leaf page -- final level caller must process.
+				 *
+				 * Note that this could also be the root page, if there has
+				 * been no root page split yet.
+				 */
+				nextleveldown.leftmost = P_NONE;
+			}
+
+			/*
+			 * Store level, now that non-deleted page was found (there must be
+			 * at least one per level)
+			 */
+			nextleveldown.level = opaque->btpo.level;
+
+			/*
+			 * Finished setting up state for this call/level.  Control will
+			 * never end up back here in any future loop iteration for this
+			 * level.
+			 */
+		}
+
+		if (state->exclusivelock && leftcurrent != opaque->btpo_prev)
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("left link/right link pair in index \"%s\" don't comport",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Deleted block=%u left block=%u link reported block=%u.",
+										current, leftcurrent, opaque->btpo_prev)));
+
+		/* Verify invariants for page -- all important checks occur here */
+		bt_target_page_check(state);
+
+nextpage:
+
+		/* Try to detect circular links */
+		if (current == leftcurrent || current == opaque->btpo_prev)
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("circular link chain found in block %u of index \"%s\"",
+							current, RelationGetRelationName(state->rel))));
+
+		leftcurrent = current;
+		current = opaque->btpo_next;
+
+		/* Free page and associated memory for this iteration */
+		MemoryContextReset(state->targetcontext);
+	}
+	while (current != P_NONE);
+
+	/* Don't change context for caller */
+	MemoryContextSwitchTo(oldcontext);
+
+	return nextleveldown;
+}
+
+/*
+ * bt_target_page_check()
+ *
+ * Function performs the following checks on target page, or pages ancillary to
+ * target page:
+ *
+ * - That every "real" data item is less than or equal to the high key, which
+ *   is an upper bound on the items on the pages (where there is a high key at
+ *   all -- pages that are rightmost lack one).
+ *
+ * - That within the page, every "real" item is less than or equal to the item
+ *   immediately to its right, if any (i.e., that the items are in order within
+ *   the page, so that the binary searches performed by index scans are sane).
+ *
+ * - That the last item stored on the page is less than or equal to the first
+ *   "real" data item on the page to the right (if such a first item is
+ *   available).
+ *
+ * Furthermore, when state passed shows ExclusiveLock held, function also
+ * checks:
+ *
+ * - That all child pages comport with downlinks (internal pages only).
+ *
+ * Note:  This routine is not especially proactive in freeing memory.  High
+ * watermark memory consumption is bound to some small fixed multiple of
+ * BLCKSZ, though.  Caller should reset the current context between calls here.
+ */
+static void
+bt_target_page_check(BtreeCheckState *state)
+{
+	OffsetNumber	offset;
+	OffsetNumber	max;
+	BTPageOpaque	topaque;
+
+	topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+	max = PageGetMaxOffsetNumber(state->target);
+
+	/*
+	 * Loop over page items, but don't start from P_HIKEY (don't have iteration
+	 * directly considering high key item, if any).  That's something that is
+	 * used as part of verifying all other items, but doesn't get its own
+	 * iteration.
+	 */
+	for (offset = P_FIRSTDATAKEY(topaque); offset <= max; offset++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+		ScanKey		skey;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Don't try to generate scankey using "minus infinity" garbage data */
+		if (OFFSET_IS_MINUS_INFINITY(topaque, offset))
+			continue;
+
+		/* Build insertion scankey for current page offset */
+		itemid = PageGetItemId(state->target, offset);
+		itup = (IndexTuple) PageGetItem(state->target, itemid);
+		skey = _bt_mkscankey(state->rel, itup);
+
+		/*
+		 *   ********************
+		 *   * High key check   *
+		 *   ********************
+		 *
+		 * If there is a high key, which there must be for a non-rightmost
+		 * page, check that it actually is upper bound on page items.
+		 */
+		if (!P_RIGHTMOST(topaque) &&
+			!invariant_key_less_than_equal_offset(state, skey, P_HIKEY))
+		{
+			char	   *itid,  *htid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumber(&(itup->t_tid)),
+							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("high key invariant violated for index \"%s\"",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Index tid=%s points to %s tid=%s "
+										"page lsn=%X/%X.",
+										itid,
+										P_ISLEAF(topaque) ?  "heap":"index",
+										htid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
+
+		/*
+		 *   ********************
+		 *   * Page order check *
+		 *   ********************
+		 *
+		 * Check that items are stored on page in logical order, by checking
+		 * current item is less than or equal to next item (if any).
+		 */
+		if (OffsetNumberNext(offset) <= max &&
+			!invariant_key_less_than_equal_offset(state, skey,
+												  OffsetNumberNext(offset)))
+		{
+			char	   *itid,  *htid,
+					   *nitid, *nhtid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumber(&(itup->t_tid)),
+							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+			nitid = psprintf("(%u,%u)", state->targetblock,
+							 OffsetNumberNext(offset));
+
+			/* Reuse itup to get pointed-to heap location of second item */
+			itemid = PageGetItemId(state->target, OffsetNumberNext(offset));
+			itup = (IndexTuple) PageGetItem(state->target, itemid);
+			nhtid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumber(&(itup->t_tid)),
+							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("page order invariant violated for index \"%s\"",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Lower index tid=%s (points to %s tid=%s) "
+										"higher index tid=%s (points to %s tid=%s) "
+										"page lsn=%X/%X.",
+										itid,
+										P_ISLEAF(topaque) ?  "heap":"index",
+										htid,
+										nitid,
+										P_ISLEAF(topaque) ? "heap":"index",
+										nhtid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
+		/*
+		 *   ********************
+		 *   * Last item check  *
+		 *   ********************
+		 *
+		 * Check last item against next page when last item on page is reached.
+		 *
+		 * The general idea here is that checking the ordering of items on the
+		 * page should still perform some check on the last item on the page,
+		 * if at all possible.  In other words, this is roughly the same
+		 * process as the page order check that has already been performed for
+		 * every other "real" item on target page by now; we just need to reach
+		 * into the next page to get a scankey to compare against lower bound
+		 * of max.
+		 */
+		else if (offset == max)
+		{
+			ScanKey		rightkey;
+
+			/* Get first item in next/right page */
+			rightkey = bt_right_page_check_scankey(state);
+
+			/*
+			 * Why this is correct for !exclusivelock callers:
+			 *
+			 * Concurrent page splits are not a problem for ordinary index
+			 * scans, since the key space always moves in a way that lets index
+			 * scans not miss things:  they might have to move right, but they
+			 * never have to move left (leaving aside index scans backwards, a
+			 * special case).  A concurrent page split could occur here, but
+			 * just as with index scans we're following the stale right link,
+			 * which will reliably get us further along in the key space, which
+			 * is all we really need to get an item further along in key space
+			 * to check invariant in target page.
+			 *
+			 * (Note that routines like _bt_search() don't require *any* page
+			 * split interlock when descending the tree, including something
+			 * very light like a buffer pin.  That's why it's okay that we
+			 * don't either.)
+			 *
+			 * That just leaves concurrent page deletion.  It's okay if we
+			 * follow a rightlink and find a half-dead or dead page.  Either
+			 * way, there must be a sane further right link to follow for these
+			 * ignorable pages, because page deletion refuses to merge the key
+			 * space between adjacent pages that do not share a common parent
+			 * (that is, merging of the key space has to be among true sibling
+			 * pages, never cousin pages).  We should succeed in finding a page
+			 * to the right that isn't ignorable before too long.  And so, here
+			 * too we can always get further in the key space by moving right
+			 * one or more times.
+			 *
+			 * A deleted page won't actually be recycled by VACUUM early enough
+			 * for us to fail to be able follow its right link, because it
+			 * doesn't do so until it knows that no possible index scan could
+			 * land on the page with the expectation of at least being able to
+			 * move right and eventually find a non-ignorable page.  (see page
+			 * recycling/RecentGlobalXmin notes in nbtree README.)
+			 */
+			if (rightkey &&
+				!invariant_key_greater_than_equal_offset(state, rightkey, max))
+				ereport(CORRUPTION,
+						(errcode(ERRCODE_INDEX_CORRUPTED),
+						 errmsg("cross page order invariant violated for index \"%s\"",
+								RelationGetRelationName(state->rel)),
+						 errdetail_internal("Last item on page tid=(%u,%u) "
+											"right page block=%u "
+											"page lsn=%X/%X.",
+											state->targetblock, offset,
+											topaque->btpo_next,
+											(uint32) (state->targetlsn >> 32),
+											(uint32) state->targetlsn)));
+		}
+
+		/*
+		 *   ********************
+		 *   * Downlink check   *
+		 *   ********************
+		 *
+		 * Additional check of child items against target page (their parent),
+		 * iff this is internal page and holds ExclusiveLock on index relation.
+		 */
+		if (!P_ISLEAF(topaque) && state->exclusivelock)
+		{
+			BlockNumber childblock = ItemPointerGetBlockNumber(&(itup->t_tid));
+
+			bt_downlink_check(state, childblock, skey);
+		}
+	}
+}
+
+/*
+ * bt_right_page_check_scankey()
+ *
+ * Return a scankey for the first real data item on page to right of current
+ * target (or the first non-ignorable page), sufficient to check ordering
+ * invariant on last item in current target page.  Returned scankey relies on
+ * local memory allocated for the child page, which caller cannot pfree().
+ * Caller's memory context should be reset between calls here.
+ *
+ * Note that we always get the first real data item -- not the high key, and
+ * not the undefined first "real" item that internal pages must have (the
+ * "minus infinity" item).
+ *
+ * Iff there is no sane first real data item, return NULL.  Typically, this
+ * happens because the target page is rightmost and there simply is no page to
+ * find the first item on, but it's also possible because leaf pages may have
+ * no "real" items after VACUUM is run, or because the only item available in a
+ * "minus infinity" item.
+ */
+static ScanKey
+bt_right_page_check_scankey(BtreeCheckState *state)
+{
+	BTPageOpaque	opaque;
+	ItemId			firstsane;
+	OffsetNumber	nline;
+	BlockNumber		targetnext;
+	Page			rightpage;
+
+	/* Determine target's next block number */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+
+	/* No right-link; target it already at end of key space */
+	if (P_RIGHTMOST(opaque))
+		return NULL;
+
+	targetnext = opaque->btpo_next;
+	for (;;)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		rightpage = palloc_btree_page(state, targetnext);
+		opaque = (BTPageOpaque) PageGetSpecialPointer(rightpage);
+
+		if (!P_IGNORE(opaque))
+			break;
+
+		/*
+		 * Right-most page should not be ignorable:
+		 */
+		if (P_RIGHTMOST(opaque))
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("fell off the end of index \"%s\"",
+							RelationGetRelationName(state->rel))));
+
+		/*
+		 * We landed on a deleted page, so step right to find a live page
+		 * (there should be one; if not, the "fell off index" check would have
+		 * just failed)
+		 */
+		targetnext = opaque->btpo_next;
+		ereport(CONCERN,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("level %u leftmost page of index \"%s\" was found deleted or half dead",
+						opaque->btpo.level, RelationGetRelationName(state->rel)),
+				 errdetail_internal("Deleted page found when getting scankey of first item to right.")));
+
+		pfree(rightpage);
+	}
+
+	nline = PageGetMaxOffsetNumber(rightpage);
+
+	if (P_ISLEAF(opaque) && nline >= P_FIRSTDATAKEY(opaque))
+	{
+		/* For leaf page, simply return first data item (if any) */
+		firstsane = PageGetItemId(rightpage, P_FIRSTDATAKEY(opaque));
+	}
+	else if (!P_ISLEAF(opaque) &&
+			 nline >= OffsetNumberNext(P_FIRSTDATAKEY(opaque)))
+	{
+		/*
+		 * Internal page must have at least one downlink (real item), but that
+		 * could just be a "minus infinity" item, which would be useless to
+		 * caller.  Return first item after the internal page's undefined
+		 * "minus infinity" item, if any.
+		 */
+		firstsane = PageGetItemId(rightpage,
+								  OffsetNumberNext(P_FIRSTDATAKEY(opaque)));
+	}
+	else
+	{
+		/*
+		 * No first item.  Page must be empty leaf page, or internal page with
+		 * only a "minus infinity" item.
+		 *
+		 * XXX: Maybe we should return a scankey with the high key, if any,
+		 * here, which would suit the caller's purposes.  It's simpler for this
+		 * function to have one clear purpose, though, and it's far from clear
+		 * that not doing so will ever actually matter.
+		 */
+		ereport(CONCERN,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("%s block %u of index \"%s\" has no first data item",
+						P_ISLEAF(opaque) ? "leaf":"internal", targetnext,
+						RelationGetRelationName(state->rel))));
+		return NULL;
+	}
+
+	/*
+	 * Return scankey of first real item.  Note that this relies on right page
+	 * memory remaining allocated.
+	 */
+	return _bt_mkscankey(state->rel,
+						 (IndexTuple) PageGetItem(rightpage, firstsane));
+}
+
+/*
+ * bt_downlink_check()
+ *
+ * Checks one of target's downlink against its child page.
+ *
+ * Conceptually, the target page continues to be what is checked here.  The
+ * target block is still blamed in the event of finding an invariant violation.
+ * The downlink insertion into the target is probably where any problem raised
+ * here arises, and their is no such thing as a parent link, so doing the
+ * verification this way around is much more practical.
+ */
+static void
+bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
+				  ScanKey targetkey)
+{
+	OffsetNumber	offset;
+	OffsetNumber	maxoffset;
+	Page			child;
+	BTPageOpaque	copaque;
+
+	Assert(state->exclusivelock);
+
+	/*
+	 * Verify child page has the down-link key from target page (its parent) as
+	 * a lower bound.
+	 *
+	 * This check would probably be almost as effective if only the first item
+	 * was checked, since the ordering of all items will be checked when we
+	 * descend to the next level anyway.  This behavior seems simpler, though,
+	 * and it might well make an appreciable difference to examine everything
+	 * when the operator class has subtle bugs.
+	 */
+	child = palloc_btree_page(state, childblock);
+	copaque = (BTPageOpaque) PageGetSpecialPointer(child);
+	maxoffset = PageGetMaxOffsetNumber(child);
+
+	for (offset = P_FIRSTDATAKEY(copaque); offset <= maxoffset; offset++)
+	{
+		/*
+		 * Skip comparison of target page key against "minus infinity" item, if
+		 * any.  Checking it would indicate that its not an upper bound, but
+		 * that's only because of the hard-coding within _bt_compare().
+		 */
+		if (OFFSET_IS_MINUS_INFINITY(copaque, offset))
+			continue;
+
+		if (!invariant_key_less_than_equal_nontarget_offset(state, child,
+															targetkey, offset))
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("down-link lower bound invariant violated for index \"%s\"",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Parent block=%u child index tid=(%u,%u) "
+										"parent page lsn=%X/%X.",
+										state->targetblock, childblock, offset,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+	}
+
+	pfree(child);
+}
+
+/*
+ * invariant_key_less_than_equal_offset()
+ *
+ * Does the invariant hold that the key is less than or equal to a given upper
+ * bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static bool
+invariant_key_less_than_equal_offset(BtreeCheckState *state, ScanKey key,
+									 OffsetNumber upperbound)
+{
+	int16		natts = state->rel->rd_rel->relnatts;
+	int32		cmp;
+
+	cmp = _bt_compare(state->rel, natts, key, state->target, upperbound);
+
+	return cmp <= 0;
+}
+
+/*
+ * invariant_key_less_than_equal_offset()
+ *
+ * Does the invariant hold that the key is greater than or equal to a given
+ * lower bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static bool
+invariant_key_greater_than_equal_offset(BtreeCheckState *state, ScanKey key,
+										OffsetNumber lowerbound)
+{
+	int16		natts = state->rel->rd_rel->relnatts;
+	int32		cmp;
+
+	cmp = _bt_compare(state->rel, natts, key, state->target, lowerbound);
+
+	return cmp >= 0;
+}
+
+/*
+ * invariant_key_less_than_equal_nontarget_offset()
+ *
+ * Does the invariant hold that the key is less than or equal to a given upper
+ * bound offset item, with the offset relating to a caller-supplied page that
+ * is not the current target page? Caller's non-target page is typically a
+ * child page of the target, checked as part of checking a property of the
+ * target page (i.e. the key comes from the target).
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static bool
+invariant_key_less_than_equal_nontarget_offset(BtreeCheckState *state,
+											   Page nontarget, ScanKey key,
+											   OffsetNumber upperbound)
+{
+	int16		natts = state->rel->rd_rel->relnatts;
+	int32		cmp;
+
+	cmp = _bt_compare(state->rel, natts, key, nontarget, upperbound);
+
+	return cmp <= 0;
+}
+
+/*
+ * palloc_btree_page()
+ *
+ * Given a block number of a B-Tree page, return page in palloc()'d memory.
+ * While at it, perform some basic checks of the page.
+ *
+ * There is never an attempt to get a consistent view of multiple pages using
+ * multiple concurrent buffer locks; in general, we prefer to have only one pin
+ * and buffer lock at a time, which is often all that the nbtree code requires.
+ *
+ * Operating on a copy of the page is useful because it prevents control
+ * getting stuck in an uninterruptible state when an underlying operator class
+ * misbehaves.
+ */
+static Page
+palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
+{
+	Buffer			buffer;
+	Page			page;
+	BTPageOpaque	opaque;
+
+	page = palloc(BLCKSZ);
+
+	/*
+	 * We copy the page into local storage to avoid holding pin on the buffer
+	 * longer than we must.
+	 */
+	buffer = ReadBufferExtended(state->rel, MAIN_FORKNUM, blocknum, RBM_NORMAL,
+								state->checkstrategy);
+	LockBuffer(buffer, BT_READ);
+
+	/*
+	 * Perform the same basic sanity checking that nbtree itself performs for
+	 * every page:
+	 */
+	_bt_checkpage(state-> rel, buffer);
+
+	/* Only use copy of page in palloc()'d memory */
+	memcpy(page, BufferGetPage(buffer), BLCKSZ);
+	UnlockReleaseBuffer(buffer);
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	if (opaque->btpo_flags & BTP_META && blocknum != BTREE_METAPAGE)
+		ereport(CORRUPTION,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("invalid meta page found at block %u in index \"%s\"",
+						blocknum, RelationGetRelationName(state->rel))));
+
+	/* Check page from block that ought to be meta page */
+	if (blocknum == BTREE_METAPAGE)
+	{
+		BTMetaPageData *metad = BTPageGetMeta(page);
+
+		if (!(opaque->btpo_flags & BTP_META) ||
+			metad->btm_magic != BTREE_MAGIC)
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("index \"%s\" meta page is corrupt",
+							RelationGetRelationName(state->rel))));
+
+		if (metad->btm_version != BTREE_VERSION)
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("version mismatch in index \"%s\": file version %d, code version %d",
+							RelationGetRelationName(state->rel),
+							metad->btm_version, BTREE_VERSION)));
+	}
+
+	/*
+	 * Deleted pages have no sane "level" field, so can only check non-deleted
+	 * page level
+	 */
+	if (P_ISLEAF(opaque) && !P_ISDELETED(opaque) && opaque->btpo.level != 0)
+		ereport(CORRUPTION,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("invalid leaf page level %u for block %u in index \"%s\"",
+						opaque->btpo.level, blocknum, RelationGetRelationName(state->rel))));
+
+	if (blocknum != BTREE_METAPAGE && !P_ISLEAF(opaque) &&
+		!P_ISDELETED(opaque) && opaque->btpo.level == 0)
+		ereport(CORRUPTION,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("invalid internal page level 0 for block %u in index \"%s\"",
+						opaque->btpo.level, RelationGetRelationName(state->rel))));
+
+	if (!P_ISLEAF(opaque) && P_HAS_GARBAGE(opaque))
+		ereport(CORRUPTION,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("internal page block %u in index \"%s\" has garbage items",
+						blocknum, RelationGetRelationName(state->rel))));
+
+	return page;
+}
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
new file mode 100644
index 0000000..180f150
--- /dev/null
+++ b/contrib/amcheck/amcheck.control
@@ -0,0 +1,5 @@
+# amcheck extension
+comment = 'verify access method invariants'
+default_version = '1.0'
+module_pathname = '$libdir/amcheck'
+relocatable = true
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
new file mode 100644
index 0000000..b042e58
--- /dev/null
+++ b/doc/src/sgml/amcheck.sgml
@@ -0,0 +1,244 @@
+<!-- doc/src/sgml/amcheck.sgml -->
+
+<sect1 id="amcheck" xreflabel="amcheck">
+ <title>amcheck</title>
+
+ <indexterm zone="amcheck">
+  <primary>amcheck</primary>
+ </indexterm>
+
+ <para>
+  The <filename>amcheck</> module provides functions that allow you to
+  verify the logical consistency of the structure of indexes.  If the
+  structure appears to be valid, no error is raised.
+ </para>
+
+ <para>
+  The functions verify various <emphasis>invariants</> in the
+  structure of the representation of particular indexes.  The
+  correctness of the access method functions behind index scans and
+  other important operations is predicated on these invariants always
+  holding.  For example, certain functions verify, among other things,
+  that all B-Tree pages have items in <quote>logical</> order; if that
+  particular invariant somehow fails to hold, we can expect binary
+  searches on any affected page to produce wrong answers.  As a
+  consequence of that, index scans may subtly produce wrong answers.
+ </para>
+
+ <para>
+  Verification is performed using the same procedures as those used by
+  index scans themselves, which may be user-defined operator class
+  code.  For example, B-Tree index verification relies on comparisons
+  made with one or more B-Tree support function 1 routines, much like
+  B-Tree index scans rely on the routines to guide the scan to a point
+  in the underlying table;  see <xref linkend="xindex-support"> for
+  details of operator class support functions.
+ </para>
+
+ <para>
+  <filename>amcheck</> functions may be used only by superusers.
+ </para>
+
+ <sect2>
+  <title>Functions</title>
+
+  <variablelist>
+   <varlistentry>
+    <term>
+     <function>bt_index_check(index regclass) returns void</function>
+     <indexterm>
+      <primary>bt_index_check</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <function>bt_index_check</function> tests that its target, a
+      B-Tree index, respects a variety of invariants.  Example usage:
+<screen>
+test=# SELECT bt_index_check(c.oid), c.relname, c.relpages
+FROM pg_index i
+JOIN pg_opclass op ON i.indclass[0] = op.oid
+JOIN pg_am am ON op.opcmethod = am.oid
+JOIN pg_class c ON i.indexrelid = c.oid
+JOIN pg_namespace n ON c.relnamespace = n.oid
+WHERE am.amname = 'btree' AND n.nspname = 'pg_catalog'
+-- Don't check pg_class (bt_index_parent_check() requires this):
+AND c.relname NOT LIKE 'pg_class%'
+-- Function may throw an error when this is omitted:
+AND i.indisready AND i.indisvalid
+ORDER BY c.relpages DESC LIMIT 10;
+ bt_index_check |             relname             | relpages 
+----------------+---------------------------------+----------
+                | pg_depend_reference_index       |       43
+                | pg_depend_depender_index        |       40
+                | pg_proc_proname_args_nsp_index  |       31
+                | pg_description_o_c_o_index      |       21
+                | pg_attribute_relid_attnam_index |       14
+                | pg_proc_oid_index               |       10
+                | pg_attribute_relid_attnum_index |        9
+                | pg_amproc_fam_proc_index        |        5
+                | pg_amop_opr_fam_index           |        5
+                | pg_amop_fam_strat_index         |        5
+(10 rows)
+</screen>
+      This example shows a session that performs verification of every
+      catalog index in the database <quote>test</> (except those
+      associated with the <structname>pg_class</structname> catalog).
+      Details of just the 10 largest indexes verified are displayed.
+      Naturally, this query could easily be changed to call
+      <function>bt_index_check</function> for every index in the
+      database where verification is supported.
+     </para>
+     <para>
+      An <literal>AccessShareLock</> is acquired on the target index
+      by <function>bt_index_check</function>.  This lock mode is the
+      same lock mode acquired on relations by simple
+      <literal>SELECT</> statements.
+      <function>bt_index_check</function> does not verify invariants
+      that span child/parent relationships, nor does it verify that
+      the target index is consistent with its heap relation.  When a
+      routine, lightweight test for corruption is required in a live
+      production environment, using
+      <function>bt_index_check</function> often provides the best
+      trade-off between thoroughness of verification and limiting the
+      impact on application performance and availability.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
+     <function>bt_index_parent_check(index regclass) returns void</function>
+     <indexterm>
+      <primary>bt_index_parent_check</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <function>bt_index_parent_check</function> tests that its
+      target, a B-Tree index, respects a variety of invariants.  The
+      checks performed by <function>bt_index_parent_check</function>
+      are a superset of the checks performed by
+      <function>bt_index_check</function>.
+      <function>bt_index_parent_check</function> can be thought of as
+      a more thorough variant of <function>bt_index_check</function>:
+      unlike <function>bt_index_check</function>,
+      <function>bt_index_parent_check</function> also checks
+      invariants that span parent/child relationships.  However, it
+      does not verify that the target index is consistent with its
+      heap relation.
+     </para>
+     <para>
+      An <literal>ExclusiveLock</> is required on the target index by
+      <function>bt_index_parent_check</function> (a
+      <literal>ShareLock</> is also acquired on the heap relation).
+      These locks prevent concurrent data modification from
+      <command>INSERT</>, <command>UPDATE</>, and <command>DELETE</>
+      commands.  The locks also prevent the underlying relation from
+      being concurrently processed by <command>VACUUM</> and certain
+      other utility commands.  Note that the function holds locks for
+      as short a duration as possible, so there is no advantage to
+      verifying each index individually in a series of transactions.
+     </para>
+     <para>
+      <function>bt_index_parent_check</function>'s additional
+      verification is more likely to detect various pathological
+      cases.  These cases may involve an incorrectly implemented
+      B-Tree operator class used by the index that is checked, or,
+      hypothetically, undiscovered bugs in the underlying B-Tree index
+      access method code.  Note that
+      <function>bt_index_parent_check</function> will raise an error
+      when called while Hot Standby mode is enabled, unlike
+      <function>bt_index_check</function>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </sect2>
+
+ <sect2>
+  <title>Using <filename>amcheck</> effectively</title>
+
+ <para>
+  <filename>amcheck</> can be effective at detecting various types of
+  failure modes that <link
+  linkend="app-initdb-data-checksums"><application>data page
+  checksums</></link> will always fail to catch.  These include:
+
+  <itemizedlist>
+   <listitem>
+    <para>
+     Filesystem or storage subsystem faults where checksums happen to
+     simply not be enabled.  Note that <filename>amcheck</> examines a
+     page as represented in some shared memory buffer at the time of
+     verification if there is only a shared buffer hit when accessing
+     the block.  Consequently, <filename>amcheck</> does not
+     necessarily examine data read from the filesystem at the time of
+     verification.  Note that when checksums are enabled,
+     <filename>amcheck</> may raise an error due to a checksum failure
+     when a corrupt block is read into a buffer.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     Corruption caused by faulty RAM, and the broader memory subsystem
+     and operating system.  <productname>PostgreSQL</> does not
+     protect against correctable memory errors and it is assumed you
+     will operate using RAM that uses industry standard Error
+     Correcting Codes (ECC) or better protection.  However, ECC memory
+     is typically only immune to single-bit errors, and should not be
+     assumed to provide <emphasis>absolute</emphasis> protection
+     against failures that result in memory corruption.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     Structural inconsistencies caused by incorrect operator class
+     implementations, including issues due to the comparison rules of
+     operating system collations changing the ordering implied by
+     comparisons of an underlying collatable type, such as
+     <type>text</>.  Though rare, updates to operating system
+     collation rules can cause these issues.  More commonly, an
+     inconsistency in the collation order between a master server and
+     a standby server is implicated, possibly because the
+     <emphasis>major</> operating system version in use is
+     inconsistent.  Such inconsistencies will generally only arise on
+     standby servers, and so can generally only be detected on standby
+     servers.  If a problem like this arises, it may not affect each
+     individual index that is ordered using an affected collation,
+     simply because <emphasis>indexed</> values might happen to have
+     the same absolute ordering regardless of the behavioral
+     inconsistency.  See <xref linkend="locale"> and <xref
+     linkend="collation"> for further details about how
+     <productname>PostgreSQL</> uses operating system locales and
+     collations.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     Corruption caused by hypothetical undiscovered bugs in the
+     underlying <productname>PostgreSQL</> access method code.
+     Automatic verification of the structural integrity of indexes
+     plays a role in the general testing of new or proposed
+     <productname>PostgreSQL</> features that could plausibly allow a
+     logical inconsistency to be introduced.  One obvious testing
+     strategy is to call <filename>amcheck</> functions continuously
+     when running the standard regression tests.  See <xref
+     linkend="regress-run"> for details on running the tests.
+    </para>
+   </listitem>
+  </itemizedlist>
+
+  No error concerning corruption raised by <filename>amcheck</> should
+  ever be a false positive.  Users are strongly advised to perform a
+  <command>REINDEX </> for any index where corruption is indicated,
+  although it is also important to identify and correct the underlying
+  problem.  In general, <filename>amcheck</> can only prove the
+  presence of corruption; it cannot prove its absence.
+ </para>
+
+ </sect2>
+
+</sect1>
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index 1b3d2d9..164f594 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -103,6 +103,7 @@ CREATE EXTENSION <replaceable>module_name</> FROM unpackaged;
  </para>
 
  &adminpack;
+ &amcheck;
  &auth-delay;
  &auto-explain;
  &btree-gin;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index a12fee7..48476bd 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -104,6 +104,7 @@
 <!-- contrib information -->
 <!ENTITY contrib         SYSTEM "contrib.sgml">
 <!ENTITY adminpack       SYSTEM "adminpack.sgml">
+<!ENTITY amcheck         SYSTEM "amcheck.sgml">
 <!ENTITY auth-delay      SYSTEM "auth-delay.sgml">
 <!ENTITY auto-explain    SYSTEM "auto-explain.sgml">
 <!ENTITY btree-gin       SYSTEM "btree-gin.sgml">
-- 
1.9.1

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 10 years ago

In reply to: Peter Geoghegan (#1)

Re: amcheck (B-Tree integrity checking tool)

Hi,

I've looked at this patch today, mostly to educate myself, so this
probably should not count as a full review. Anyway, the patch seems in
excellent shape - it'd be great if all patches (including those written
by me) had this level of comments/docs.

On Mon, 2016-02-29 at 16:09 -0800, Peter Geoghegan wrote:

I was assigned an "action point" during the FOSDEM developer meeting:
"Post new version of btree consistency checker patch". I attach a new
WIP version of my consistency checker tool, amcheck.

...

To recap, the extension adds some SQL-callable functions that verify
certain invariant conditions hold within some particular B-Tree index.
These are the conditions that index scans rely on always being true.
The tool's scope may eventually cover other AMs, including heapam, but
nbtree seems like the best place to start.

Note that no function currently checks that the index is consistent
with the heap, which would be very useful (that's probably how I'd
eventually target the heapam, actually).

I'm afraid I don't understand what "target the heapam" means. Could you
elaborate?

Invariants
========

...

Obviously, this tool is all about distrusting the structure of a
B-Tree index. That being the case, it's not always totally clear where
to draw the line. I think I have the balance about right, though.

I agree. It'd be nice to have a tool that also checks for data
corruption the a lower level (e.g. that varlena headers are not
corrupted etc.) but that seems like a task for some other tool.

Interface
=======

There are only 2 SQL callable functions in the extension, which are
very similar:

bt_index_check(index regclass)

bt_index_parent_check(index regclass)

Can we come up with names that more clearly identify the difference
between those two functions? I mean, _parent_ does not make it
particularly obvious that the second function acquires exclusive lock
and performs more thorough checks.

This also applies to the name of the contrib module - it's not
particularly obvious what "amcheck" unless you're familiar with the
concept of access methods. Which is quite unlikely for regular users.
Maybe something like "check_index" would be more illustrative?

Locking
======

...

We do, on the other hand, have a detailed rationale for why it's okay
that we don't have an ExclusiveLock on the index relation for checks
that span the key space of more than one page by following right links
to compare items across sibling pages. This isn't the same thing as
having an explicit interlock like a pin -- our interlock is one
against *recycling* by vacuum, which is based on recentGlobalXmin.
This rationale requires expert review.

Well, I wouldn't count myself as an expert here, but do we actually need
such protection? I mean, we only do such checks when holding an
exclusive lock on the parent relation, no? And even if we don't vacuum
can only remove entries from the pages - why should that cause
violations of any invariants?

A few minor comments:

1) This should probably say "one block per level":

/*
* A B-Tree cannot possibly have this many levels, since there must be
* one block per page, and that is bound by the range of BlockNumber:
*/

2) comment before bt_check_every_level() says:

... assumed to prevent any kind of physically modification ...

probably should be "physical modification" instead.

3) s/targecontext/targetcontext/ in BtreeCheckState comment

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

almost 10 years ago

In reply to: Peter Geoghegan (#1)

Re: amcheck (B-Tree integrity checking tool)

01.03.2016 03:09, Peter Geoghegan:

I was assigned an "action point" during the FOSDEM developer meeting:
"Post new version of btree consistency checker patch". I attach a new
WIP version of my consistency checker tool, amcheck. This patch is
proposed for 9.6, as an extension in contrib -- maybe we can still get
it in. This is the first time I've added any version of this to a
commitfest, although I've posted a couple of rough versions of this in
the past couple of years. The attached version has received a major
overhaul, and is primarily aimed at finding corruption in production
systems, although I think it will still have significant value for
debugging too. Maybe it can help with some of the B-Tree patches in
the final commitfest, for example. I also have some hope that it will
become a learning tool for people interested in how B-Tree indexes
work.

To recap, the extension adds some SQL-callable functions that verify
certain invariant conditions hold within some particular B-Tree index.
These are the conditions that index scans rely on always being true.
The tool's scope may eventually cover other AMs, including heapam, but
nbtree seems like the best place to start.

Note that no function currently checks that the index is consistent
with the heap, which would be very useful (that's probably how I'd
eventually target the heapam, actually).

Hi,
thank you for the contrib module. The patch itself looks great.
It was really useful to have such a tool while working on b-tree patches.

Invariants
========

nbtree invariants that the tool verifies with just an AccessShareLock
on the relation are:

* That all items are in the correct, opclass order on each page.

* That the page "high key", if any, actually bounds the items on the page.

* That the last item on a page is less than or equal to the first item
on the next page (the page to its right). The idea here is that the
key space spans multiple pages, not just one page, so it make sense to
check the last item where we can.

With an ExclusiveLock + ShareLock, some addition invariants are verified:

* That child pages actually have their parent's downlink as a lower bound.

* Sane right links and left links at each level.

Obviously, this tool is all about distrusting the structure of a
B-Tree index. That being the case, it's not always totally clear where
to draw the line. I think I have the balance about right, though.

Interface
=======

There are only 2 SQL callable functions in the extension, which are
very similar:

bt_index_check(index regclass)

bt_index_parent_check(index regclass)

The latter is more thorough than the former -- it performs all checks,
including those checks that I mentioned required an ExclusiveLock. So,
bt_index_check() only requires an AccessShareLock.
bt_index_parent_check() requires an ExclusiveLock on the index
relation, and a ShareLock on its heap relation, almost like REINDEX.
bt_index_parent_check() performs verification that is a superset of
the verification performed by bt_index_check() -- mostly, the extra
verification/work is that it verifies downlinks against child pages.

Both functions raise an error in the event of observing that an
invariant in a B-Tree was violated, such as items being out of order
on a page. I've written extensive documentation, which goes into
practical aspects of using amcheck effectively. It doesn't go into
significant detail about every invariant that is checked, but gives a
good idea of roughly what checks are performed.

I could almost justify only having one function with an argument about
the downlink/child verification, but that would be a significant
footgun given the varying locking requirements that such a function
would have.

I've done some testing. And I can say that both announced functions work
well.
BTW, if you know a good way to corrupt index (and do it reproducible)
I'd be very glad to see it.

I created an index, then opened the file and changed the order of elements.
And bt_index_check() found the error successfully.

/* Ensure that the data is changed. */
SELECT * FROM bt_page_items('idx', 1);

itemoffset | ctid | itemlen | nulls | vars | data
------------+-------+---------+-------+------+-------------------------
1 | (0,3) | 16 | f | f | 03 00 00 00 00 00 00 00
2 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00
3 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00
(3 rows)

/* Great, amcheck found an error. */
select bt_index_check('idx');
ERROR: page order invariant violated for index "idx"
DETAIL: Lower index tid=(1,1) (points to heap tid=(0,3)) higher index
tid=(1,2) (points to heap tid=(0,2)) page lsn=0/17800D0.

Locking
======

We never rely on something like holding on to a buffer pin as an
interlock for correctness (the vacuum interlock thing isn't generally
necessary, because we don't look at the heap at all). We simply pin +
BT_READ lock a buffer, copy it into local memory allocated by
palloc(), and then immediately release the buffer lock and drop the
pin. This is the same in all instances. There is never more than one
buffer lock or pin held at a time.

We do, on the other hand, have a detailed rationale for why it's okay
that we don't have an ExclusiveLock on the index relation for checks
that span the key space of more than one page by following right links
to compare items across sibling pages. This isn't the same thing as
having an explicit interlock like a pin -- our interlock is one
against *recycling* by vacuum, which is based on recentGlobalXmin.
This rationale requires expert review.

I'm not an expert, but I promise to take a closer look at locking.
I will send another email soon.

Performance
==========

Trying to keep the tool as simple as possible, while still making it
do verification that is as useful as possible was my priority here,
not performance. Still, verification completes fairly quickly.
Certainly, it takes far less time than having to REINDEX the index,
and doesn't need too much memory. I think that in practice most
problems that can be detected by the B-Tree checker functions will be
detected with the lighter variant.

I do not see any performance issues.
I'm sure that if someone wants to check whether the index is corrupted,
he certainly can wait a minute (.

But I have some concerns about compatibility with my patches.
I've tried to call bt_index_check() over my "including" patch [1]https://commitfest.postgresql.org/9/433/ and
caught a segfault.

LOG: server process (PID 31794) was terminated by signal 11:
Segmentation fault
DETAIL: Failed process was running: select bt_index_check('idx');

I do hope that my patch will be accepted in 9.6, so this conflict looks
really bad.
I think that error is caused by changes in pages layout. To save some
space nonkey attributes are truncated
when btree copies the indexed value into inner pages or into high key.
You can look at index_reform_tuple() calls.

I wonder, whether you can add some check of catalog version to your
module or this case requires more changes?

With second patch which implements compression [2]https://commitfest.postgresql.org/9/494/, amcheck causes
another error.
postgres=# insert into tbl select 1 from generate_series(0,5);
INSERT 0 6
postgres=# SELECT * FROM bt_page_items('idx', 1);
itemoffset | ctid | itemlen | nulls | vars | data

------------+----------------+---------+-------+------+----------------------------------------------------------------------------------------
---------------------------------------------------------
1 | (2147483664,6) | 56 | f | f | 01 00 00 00 00
00 00 00 00 00 00 00 01 00 00 00 00 00 02 00 00 00 00 00 03 00 00 00 00
00 04 00 00 00 00 00 05 00 00 00 00 00 06 00 00 00 00 00

postgres=# select bt_index_check('idx');
ERROR: cannot check index "idx"
DETAIL: index is not yet ready for insertions

But I'm sure that it's a problem of my patch. So I'll fix it and try again.

[1]: https://commitfest.postgresql.org/9/433/
[2]: https://commitfest.postgresql.org/9/494/

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Michael Paquier

michael.paquier@gmail.com

almost 10 years ago

In reply to: Anastasia Lubennikova (#3)

Re: amcheck (B-Tree integrity checking tool)

On Fri, Mar 11, 2016 at 5:19 PM, Anastasia Lubennikova wrote:

BTW, if you know a good way to corrupt index (and do it reproducible) I'd be
very glad to see it.

You can use for example dd in non-truncate mode to corrupt on-disk
page data, say that for example:
dd if=/dev/random bs=8192 count=1 \
seek=$BLOCK_ID of=base/$DBOID/$RELFILENODE \
conv=notrunc
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 10 years ago

In reply to: Michael Paquier (#4)

Re: amcheck (B-Tree integrity checking tool)

Hi,

On Fri, 2016-03-11 at 17:24 +0100, Michael Paquier wrote:

On Fri, Mar 11, 2016 at 5:19 PM, Anastasia Lubennikova wrote:

BTW, if you know a good way to corrupt index (and do it
reproducible) I'd be
very glad to see it.

You can use for example dd in non-truncate mode to corrupt on-disk
page data, say that for example:
dd if=/dev/random bs=8192 count=1 \
seek=$BLOCK_ID of=base/$DBOID/$RELFILENODE \
conv=notrunc

I guess /dev/random is not very compatible with the "reproducible"
requirement. I mean, it will reproducibly break the page, but pretty
much completely, which is mostly what checksums are for.

I think the main goal of the amcheck module is to identify issues that
are much more subtle - perhaps a coding error or unhandled cornercase
which results in page with a valid structure and checksum, but broken
page. Or perhaps issues coming from outside of PostgreSQL, like a
subtle change in the locales.

Simulating those issues requires a tool that allows minor tweaks to the
page, dd is a bit too dull for that.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Peter Geoghegan

pg@heroku.com

almost 10 years ago

In reply to: Michael Paquier (#4)

Re: amcheck (B-Tree integrity checking tool)

On Fri, Mar 11, 2016 at 8:24 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

You can use for example dd in non-truncate mode to corrupt on-disk
page data, say that for example:
dd if=/dev/random bs=8192 count=1 \
seek=$BLOCK_ID of=base/$DBOID/$RELFILENODE \
conv=notrunc

Sure, but that would probably fail at the first hurdle -- the page
header would be corrupt. Which is a valid test, but not all that
interesting.

One testing workflow I tried is overwriting some page in a B-Tree
relfilenode with some other page in the same file:

$:~/pgdata/base/12413$ dd if=somefile of=somefile conv=notrunc bs=8192
count=1 skip=2 seek=3

That should fail due to the key space not being in order across pages,
which is slightly interesting. Or, you could selectively change one
item with a hex editor, as Anastasia did.

Or, you could add code like this to comparetup_index_btree(), to
simulate a broken opclass:

diff --git a/src/backend/utils/sort/tuplesort.c
b/src/backend/utils/sort/tuplesort.c
index 67d86ed..23712ff 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -3562,6 +3562,9 @@ comparetup_index_btree(const SortTuple *a, const
SortTuple *b,
        compare = ApplySortComparator(a->datum1, a->isnull1,

b->datum1, b->isnull1,
                                                                  sortKey);
+
+       if (random() <= (MAX_RANDOM_VALUE / 1000))
+               compare = -compare;
        if (compare != 0)
                return compare;

There are many options when you want to produce a corrupt B-Tree index!

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Peter Geoghegan

pg@heroku.com

almost 10 years ago

In reply to: Peter Geoghegan (#6)

Re: amcheck (B-Tree integrity checking tool)

On Fri, Mar 11, 2016 at 1:09 PM, Peter Geoghegan <pg@heroku.com> wrote:

Or, you could add code like this to comparetup_index_btree(), to
simulate a broken opclass:

diff --git a/src/backend/utils/sort/tuplesort.c
b/src/backend/utils/sort/tuplesort.c
index 67d86ed..23712ff 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -3562,6 +3562,9 @@ comparetup_index_btree(const SortTuple *a, const
SortTuple *b,
compare = ApplySortComparator(a->datum1, a->isnull1,

b->datum1, b->isnull1,
sortKey);
+
+       if (random() <= (MAX_RANDOM_VALUE / 1000))
+               compare = -compare;
if (compare != 0)
return compare;

Note that this patch that I sketched would make CREATE INDEX produce
corrupt indexes, but the tool's verification itself would not be
affected. Although even if it was (even if _bt_compare() gave the same
wrong answers), it would still very likely detect corruption.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Peter Geoghegan

pg@heroku.com

almost 10 years ago

In reply to: Tomas Vondra (#2)

Re: amcheck (B-Tree integrity checking tool)

On Thu, Mar 10, 2016 at 9:18 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I've looked at this patch today, mostly to educate myself, so this
probably should not count as a full review. Anyway, the patch seems in
excellent shape - it'd be great if all patches (including those written
by me) had this level of comments/docs.

Thanks. I try. This patch is something I've been working on slowly and
steadily for over 2 years. It was time to see it through.

Note that no function currently checks that the index is consistent
with the heap, which would be very useful (that's probably how I'd
eventually target the heapam, actually).

I'm afraid I don't understand what "target the heapam" means. Could you
elaborate?

Just that that's how I'd make amcheck verify that the heap was sane,
if I was going to undertake that project. Or, I'd start there.

I agree. It'd be nice to have a tool that also checks for data
corruption the a lower level (e.g. that varlena headers are not
corrupted etc.) but that seems like a task for some other tool.

I'm not sure that that's a task for another tool. I think that this
tool is scoped at detecting corruption, and that could work well for
the heap, too. There might be fewer interesting things to test about
the heap when indexes aren't involved. Following HOT chains through an
index, and verifying things like that about the heap as we go could
detect a lot of problematic cases.

Can we come up with names that more clearly identify the difference
between those two functions? I mean, _parent_ does not make it
particularly obvious that the second function acquires exclusive lock
and performs more thorough checks.

Dunno about that. It's defining characteristic is that it checks child
pages against their parent IMV. Things are not often defined in terms
of their locking requirements.

This also applies to the name of the contrib module - it's not
particularly obvious what "amcheck" unless you're familiar with the
concept of access methods. Which is quite unlikely for regular users.
Maybe something like "check_index" would be more illustrative?

I think that given the heap may be targeted in the future, amcheck is
more future-proof. I think Robert might have liked that name a year
ago or something. In general, I'm not too worried about what the
contrib module is called in the end.

Well, I wouldn't count myself as an expert here, but do we actually need
such protection? I mean, we only do such checks when holding an
exclusive lock on the parent relation, no? And even if we don't vacuum
can only remove entries from the pages - why should that cause
violations of any invariants?

I think that a more worked out explanation for why the ExclusiveLock
is needed is appropriate. I meant to do that.

Basically, a heavier lock is needed because of page deletion by
VACUUM, which is the major source of complexity (much more than page
splits, I'd say). In general, the key space can be consolidated by
VACUUM in a way that breaks child page checking because the downlink
we followed from our target page is no longer the current downlink.
Page deletion deletes the right sibling's downlink, and the deleted
page's downlink is used for its sibling. You could have a race, where
there was a concurrent page deletion of the left sibling of the child
page, then a concurrent insertion into the newly expanded keyspace of
the parent. Therefore, the downlink in the parent (which is the
"target", to use the patch's terminology) would not be a lower bound
on items in the page.

That's a very low probability race, because it involves deletion a
page representing a range in the keyspace that has no live items, but
then immediately getting one just in time for our check. But, I'm
pretty determined that false positives like that need to be impossible
(or else it's a bug).

I have a paranoid feeling that there is a similar very low probability
race with the left-right keyspace checks, which don't have relation
ExclusiveLock protection (IOW, I think that that might be buggy). I
need to think about that some more, but current thinking is that it
would hardly matter if we used the highkey from right page rather than
the first data item, which definitely would be correct. And it would
be simpler.

A few minor comments:

Thanks for catching those issues. Will fix.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Peter Geoghegan

pg@heroku.com

almost 10 years ago

In reply to: Peter Geoghegan (#8)

Re: amcheck (B-Tree integrity checking tool)

On Fri, Mar 11, 2016 at 1:31 PM, Peter Geoghegan <pg@heroku.com> wrote:

You could have a race, where
there was a concurrent page deletion of the left sibling of the child
page, then a concurrent insertion into the newly expanded keyspace of
the parent. Therefore, the downlink in the parent (which is the
"target", to use the patch's terminology) would not be a lower bound
on items in the page.

Excuse me: I meant the newly expanded keyspace of the *child*. (The
parent's keyspace would have covered everything. It's naturally far
larger than either child's keyspace, since it typically has several
hundred pages.)

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10

Peter Geoghegan

pg@heroku.com

almost 10 years ago

In reply to: Anastasia Lubennikova (#3)

Re: amcheck (B-Tree integrity checking tool)

On Fri, Mar 11, 2016 at 8:19 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

I do hope that my patch will be accepted in 9.6, so this conflict looks
really bad.

I hope so too. I'll have to look into this issue.

I think that error is caused by changes in pages layout. To save some space
nonkey attributes are truncated
when btree copies the indexed value into inner pages or into high key. You
can look at index_reform_tuple() calls.

That seems like the kind of problem that would be expected when things
like that change. I think it's going to be hard to add new B-Tree
features without a tool like this, which was a big reason to work on
it; to make these new projects possible to test and review. I see many
opportunities to improve the B-Tree code, just as I imagine you do.
These projects are quite strategically important, because B-Trees are
so frequently used. I think that Postgres B-Trees produce excessive
random I/O for checkpoints for various reasons, and this disadvantages
Postgres when it is compared to other major systems. As things get
better in other areas of Postgres, these hard problems become more
important to solve.

I wonder, whether you can add some check of catalog version to your module
or this case requires more changes?

I think that it's just going to be tied to the Postgres version. So,
if your B-Tree patches are committed first, it's on me to make sure
they're handled correctly. Or vice-versa. Not worried that that will
be a problem.

We already take special steps to avoid the "minus infinity" item on
internal pages. I think that in the future, if Postgres B-Trees get
suffix truncation for internal page items, there are new problems for
amcheck (suffix truncation remove unneeded information from internal
page items, sometimes greatly increasing B-Tree fan-out. Internal page
items need only be sufficient to guide index scans correctly.).

Specially, with suffix truncation there might be "minus infinity"
*attributes*, too (these could make it safe to completely remove
attributes/columns past the first distinguishing/distinct attribute on
each item on internal pages). That's a case that amcheck then needs to
care about, just like it currently cares about the existing concept of
minus infinity items. That's how it goes for amcheck.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11

Jim Nasby

Jim.Nasby@BlueTreble.com

almost 10 years ago

In reply to: Peter Geoghegan (#8)

Re: amcheck (B-Tree integrity checking tool)

On 3/11/16 3:31 PM, Peter Geoghegan wrote:

Can we come up with names that more clearly identify the difference

between those two functions? I mean,_parent_ does not make it
particularly obvious that the second function acquires exclusive lock
and performs more thorough checks.

Dunno about that. It's defining characteristic is that it checks child
pages against their parent IMV. Things are not often defined in terms
of their locking requirements.

First, thanks for your work on this. I've wanted it in the past.

I agree the name isn't very clear. Perhaps _recurse?

I also agree that the nmodule name isn't very clear. If this is meant to
be the start of a generic consistency checker, lets call it that.
Otherwise, it should be marked as being specific to btrees, because
presumably we might eventually want similar tools for GIN, etc. (FWIW
I'd vote for a general consistency checker).

I know the vacuum race condition would be very rare, but I don't think
it can be ignored. Last thing you want out of a consistency checker is
false negatives/positives. I do think it would be reasonable to just
wholesale block against concurrent vacuums, but I don't think there's
any reasonable way to do that.

I would prefer the ability to do something other than raising an error
when corruption is found, so that you could find all corruption in an
index. Obviously could log to a different level. Another option would be
SRFs that return info about all the corruption found, but that's
probably overkill.

It'd be nice if you had the option to obey vacuum_cost_delay when
running this, but that's clearly just a nice-to-have (or maybe just obey
it all the time, since it defaults to 0).
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12

Peter Geoghegan

pg@heroku.com

almost 10 years ago

In reply to: Jim Nasby (#11)

Re: amcheck (B-Tree integrity checking tool)

On Fri, Mar 11, 2016 at 3:50 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

I also agree that the nmodule name isn't very clear. If this is meant to be
the start of a generic consistency checker, lets call it that. Otherwise, it
should be marked as being specific to btrees, because presumably we might
eventually want similar tools for GIN, etc. (FWIW I'd vote for a general
consistency checker).

It's a generic consistency checker -- that's the intent. Robert wanted
to go that way about a year ago, and I think it makes sense (this
module started out life as btreecheck). As I said, I don't really care
what the name ends up being. amcheck seems fine to me, but I'll go
with any reasonable suggestion that reflects the scope of the tool as
a generic consistency checker for AMs, including heapam.

I don't know that I understand the concern about the naming of
bt_index_parent_check(). I called it something that accurately
reflects what it does. That particular SQL-callable function is more
orientated towards helping hackers than towards helping users detect
routine problems, as the docs say, but it could help users working
with hackers, and it might even help users acting on their own. I
don't want to make it sound scary, because it isn't. I don't know what
problems will be detected by this tool most often, and I don't think
it would be wise to try to predict how that will work out. And so,
concealing the internal workings of each function/check by giving them
all totally non-technical names seems like a bad idea. It's not that
hard to understand that a B-Tree has multiple levels, and checking the
levels against each other requires more restrictive/stronger
heavyweight locks. I've only added 2 SQL-callable functions, each of
which has one argument, and the first of which has a generic name and
very light locking requirements, like a SELECT. It's not that hard to
get the general idea, as an ordinary user.

I know the vacuum race condition would be very rare, but I don't think it
can be ignored. Last thing you want out of a consistency checker is false
negatives/positives.

That's what I said. And the docs also say that there should never be a
false positive. That's definitely an important design goal here,
because it enables routine testing.

I do think it would be reasonable to just wholesale
block against concurrent vacuums, but I don't think there's any reasonable
way to do that.

Actually, that's exactly what bt_index_parent_check() does already.
VACUUM requires a SHARE UPDATE EXCLUSIVE lock on the heap relation,
which cannot be concurrently acquired due to bt_index_parent_check()'s
acquisition of a SHARE lock. The locking for bt_index_parent_check()
is almost the same as REINDEX, except that that acquires an ACCESS
EXCLUSIVE lock on the index relation; bt_index_parent_check() only
requires an EXCLUSIVE lock on the index relation.

Not sure about the cost delay thing. Delays are disabled by default
for manually issued VACUUM, so have doubts that that's useful.

If you want the tool to limp on when it finds an error, that can be
done by changing the constant for the CORRUPTION macro in amcheck.c.
But having that be dynamically configurable is not really compatible
with the goal of having amcheck be run routinely.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13

Jim Nasby

Jim.Nasby@BlueTreble.com

almost 10 years ago

In reply to: Peter Geoghegan (#12)

Re: amcheck (B-Tree integrity checking tool)

On 3/11/16 6:17 PM, Peter Geoghegan wrote:

Not sure about the cost delay thing. Delays are disabled by default
for manually issued VACUUM, so have doubts that that's useful.

Right, but you still have the option to enable them if you don't want to
swamp your IO system. That's why CIC obeys it too. If I was running a
consistency check on a production system I'd certainly want the option
to throttle it. Without that option, I don't see running this on
production systems as being an option. If that's not a goal then fine,
but if it is a goal I think it needs to be there.

Isn't it just a few extra lines of code to support it?
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14

Peter Geoghegan

pg@heroku.com

almost 10 years ago

In reply to: Peter Geoghegan (#12)

Re: amcheck (B-Tree integrity checking tool)

On Fri, Mar 11, 2016 at 4:17 PM, Peter Geoghegan <pg@heroku.com> wrote:

If you want the tool to limp on when it finds an error, that can be
done by changing the constant for the CORRUPTION macro in amcheck.c.
But having that be dynamically configurable is not really compatible
with the goal of having amcheck be run routinely.

Also, it's just really hard to reason about what remains OK to check
after the first problem is encountered in the general case. It's
"unreasonable" for any of the checks to ever fail. So, by that
standard, assuming that they might fail at all could be called
paranoid. Who can say what "paranoid enough" should be? I think it's
useful to have a low-impact, generic check function for B-Tree
indexes, but I don't think we need to hold back on being exhaustive
elsewhere.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15

Peter Geoghegan

pg@heroku.com

almost 10 years ago

In reply to: Jim Nasby (#13)

Re: amcheck (B-Tree integrity checking tool)

On Fri, Mar 11, 2016 at 4:30 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

Right, but you still have the option to enable them if you don't want to
swamp your IO system. That's why CIC obeys it too. If I was running a
consistency check on a production system I'd certainly want the option to
throttle it. Without that option, I don't see running this on production
systems as being an option. If that's not a goal then fine, but if it is a
goal I think it needs to be there.

Isn't it just a few extra lines of code to support it?

I see your point.

I'll add that if people like the interface you propose. (Overloading
the VACUUM cost delay functions to cause a delay for amcheck
functions, too). Note that the functions already use an appropriate
buffer access strategy (it avoids blowing shared_buffers, much like
VACUUM itself).

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16

Tomas Vondra

tomas.vondra@2ndquadrant.com

almost 10 years ago

In reply to: Peter Geoghegan (#14)

Re: amcheck (B-Tree integrity checking tool)

On Fri, 2016-03-11 at 16:40 -0800, Peter Geoghegan wrote:

On Fri, Mar 11, 2016 at 4:17 PM, Peter Geoghegan <pg@heroku.com>
wrote:

If you want the tool to limp on when it finds an error, that can be
done by changing the constant for the CORRUPTION macro in
amcheck.c.
But having that be dynamically configurable is not really
compatible
with the goal of having amcheck be run routinely.

Also, it's just really hard to reason about what remains OK to check
after the first problem is encountered in the general case. It's
"unreasonable" for any of the checks to ever fail. So, by that
standard, assuming that they might fail at all could be called
paranoid. Who can say what "paranoid enough" should be? I think it's
useful to have a low-impact, generic check function for B-Tree
indexes, but I don't think we need to hold back on being exhaustive
elsewhere.

Right, but isn't there a difference between the two functions in this
respect? Once you find corruption involving relationship between
multiple pages, then I agree it's complicated to do any reasoning about
what additional checks are safe.

But does that problem also apply to bt_index_check, which checks pages
independently?

Admittedly, this also depends on the use case. If all we need to do is
answering a question "Is the index corrupted?" then sure, bailing out
on the first error is perfectly appropriate.

But perhaps this would be useful for some recovery/forensics tasks?

From time to time we need to investigate corruption in a database, i.e.
see how much of the data is actually corrupted, list pages that need to
be zeroed to get the cluster up to salvage as much as possible, etc.
Currently this is tedious because we essentially find/fix the pages one
by one. It'd be very useful to list all broken pages in one go and then
fix all of them.

Obviously, that's about heapam checks, but perhaps it would be useful
for an index too?

regard

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17

Peter Geoghegan

pg@heroku.com

almost 10 years ago

In reply to: Tomas Vondra (#16)

Re: amcheck (B-Tree integrity checking tool)

On Fri, Mar 11, 2016 at 6:33 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

Right, but isn't there a difference between the two functions in this
respect? Once you find corruption involving relationship between
multiple pages, then I agree it's complicated to do any reasoning about
what additional checks are safe.

But does that problem also apply to bt_index_check, which checks pages
independently?

I think so, yes.

Admittedly, this also depends on the use case. If all we need to do is
answering a question "Is the index corrupted?" then sure, bailing out
on the first error is perfectly appropriate.

But perhaps this would be useful for some recovery/forensics tasks?

Maybe, but I feel like making it possible to change the CORRUPTION
elevel macro was the right trade-off. I don't want to have to reason
about the universe of possible problems that could occur when the tool
must limp on in the event of corruption. For example, I don't want to
have to deal with infinite loops. In practice, an expert would
probably be fine to change the constant themselves if they needed to.

Indexes can always be rebuilt. The tool is for identifying and
diagnosing corruption, but if you want to diagnose a faulty opclass or
something, then I think you need to get out pageinspect. You need
human judgement for that.

From time to time we need to investigate corruption in a database, i.e.
see how much of the data is actually corrupted, list pages that need to
be zeroed to get the cluster up to salvage as much as possible, etc.
Currently this is tedious because we essentially find/fix the pages one
by one. It'd be very useful to list all broken pages in one go and then
fix all of them.

Obviously, that's about heapam checks, but perhaps it would be useful
for an index too?

Only insofar as it helps diagnose the underlying issue, when it is a
more subtle issue. Actually fixing the index is almost certainly a
REINDEX. Once you're into the messy business of diagnosing a
problematic opclass, you have to be an expert, and tweaking amcheck
for your requirements (i.e. rebuilding from source) becomes
reasonable. Part of the reason that the code is so heavily commented
is to make it hackable, because I do not feel optimistic that I can
get an expert-orientated interface right, but I still want to make the
tool as useful as possible to experts.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18

Matt Kelly

mkellycs@gmail.com

almost 10 years ago

In reply to: Peter Geoghegan (#17)

Re: amcheck (B-Tree integrity checking tool)

On Fri, 2016-03-11 at 17:24 +0100, Michael Paquier wrote:

On Fri, Mar 11, 2016 at 5:19 PM, Anastasia Lubennikova wrote:

BTW, if you know a good way to corrupt index (and do it
reproducible) I'd be
very glad to see it.

You can use for example dd in non-truncate mode to corrupt on-disk
page data, say that for example:
dd if=/dev/random bs=8192 count=1 \
seek=$BLOCK_ID of=base/$DBOID/$RELFILENODE \
conv=notrunc

I guess /dev/random is not very compatible with the "reproducible"
requirement. I mean, it will reproducibly break the page, but pretty
much completely, which is mostly what checksums are for.

You can actually pretty easily produce a test case by setting up streaming
replication between servers running two different version of glibc.

I actually wrote a tool that spins up a pair of VMs using vagrant and then
sets them up as streaming replica's using ansible. It provides a nice one
liner to get a streaming replica test environment going and it will easily
provide the cross glibc test case. Technically, though it belongs to Trip
because I wrote it on company time. Let me see if I can open source a
version of it later this week that way you can use it for testing.

- Matt K.

#19

Peter Geoghegan

pg@heroku.com

almost 10 years ago

In reply to: Matt Kelly (#18)

Re: amcheck (B-Tree integrity checking tool)

On Sat, Mar 12, 2016 at 7:46 PM, Matt Kelly <mkellycs@gmail.com> wrote:

You can actually pretty easily produce a test case by setting up streaming
replication between servers running two different version of glibc.

I actually wrote a tool that spins up a pair of VMs using vagrant and then
sets them up as streaming replica's using ansible. It provides a nice one
liner to get a streaming replica test environment going and it will easily
provide the cross glibc test case. Technically, though it belongs to Trip
because I wrote it on company time. Let me see if I can open source a
version of it later this week that way you can use it for testing.

That could be interesting. The earlier prototypes of this tool are
known to have detected glibc collation incompatibilities in real
production systems.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20

Jim Nasby

Jim.Nasby@BlueTreble.com

almost 10 years ago

In reply to: Peter Geoghegan (#15)

Re: amcheck (B-Tree integrity checking tool)

On 3/11/16 6:45 PM, Peter Geoghegan wrote:

I'll add that if people like the interface you propose. (Overloading
the VACUUM cost delay functions to cause a delay for amcheck
functions, too).

I thought that had already been overloaded by CIC, but I'm not finding
reference to it... ANALYZE does use it though, so the ship has already
sorta sailed.

I'm actually a bit surprised cost delay isn't used anywhere else. As
more background operations are added I suspect users will want it at
some point.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21

Robert Haas

robertmhaas@gmail.com

almost 10 years ago

In reply to: Peter Geoghegan (#15)

Re: amcheck (B-Tree integrity checking tool)

On Fri, Mar 11, 2016 at 7:45 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Fri, Mar 11, 2016 at 4:30 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

Right, but you still have the option to enable them if you don't want to
swamp your IO system. That's why CIC obeys it too. If I was running a
consistency check on a production system I'd certainly want the option to
throttle it. Without that option, I don't see running this on production
systems as being an option. If that's not a goal then fine, but if it is a
goal I think it needs to be there.

Isn't it just a few extra lines of code to support it?

I see your point.

I'll add that if people like the interface you propose. (Overloading
the VACUUM cost delay functions to cause a delay for amcheck
functions, too). Note that the functions already use an appropriate
buffer access strategy (it avoids blowing shared_buffers, much like
VACUUM itself).

I don't particularly like that interface. I also suggest that it
would be better to leave throttling to a future commit, and focus on
getting the basic feature in first.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#22

Peter Geoghegan

pg@heroku.com

almost 10 years ago

In reply to: Robert Haas (#21)

Re: amcheck (B-Tree integrity checking tool)

On Mon, Mar 14, 2016 at 10:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I don't particularly like that interface. I also suggest that it
would be better to leave throttling to a future commit, and focus on
getting the basic feature in first.

Works for me. I don't think throttling is an especially compelling feature.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#23

Peter Geoghegan

pg@heroku.com

almost 10 years ago

In reply to: Anastasia Lubennikova (#3)

Re: amcheck (B-Tree integrity checking tool)

On Fri, Mar 11, 2016 at 8:19 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

But I have some concerns about compatibility with my patches.
I've tried to call bt_index_check() over my "including" patch [1] and caught
a segfault.

LOG: server process (PID 31794) was terminated by signal 11: Segmentation
fault
DETAIL: Failed process was running: select bt_index_check('idx');

I do hope that my patch will be accepted in 9.6, so this conflict looks
really bad.
I think that error is caused by changes in pages layout. To save some space
nonkey attributes are truncated

[1] https://commitfest.postgresql.org/9/433/

I posted a review of your "Covering + unique indexes" patch, where I
made an educated guess about what the problem is here (I sort of
hinted at what I thought it was already, in this thread, actually). I
haven't actually tested this theory of mine myself just yet, but let
me know what you think of it on the thread for your patch.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#24

Amit Langote

Langote_Amit_f8@lab.ntt.co.jp

almost 10 years ago

In reply to: Peter Geoghegan (#8)

Re: amcheck (B-Tree integrity checking tool)

On 2016/03/12 6:31, Peter Geoghegan wrote:

On Thu, Mar 10, 2016 at 9:18 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

I've looked at this patch today, mostly to educate myself, so this
probably should not count as a full review. Anyway, the patch seems in
excellent shape - it'd be great if all patches (including those written
by me) had this level of comments/docs.

Can we come up with names that more clearly identify the difference
between those two functions? I mean, _parent_ does not make it
particularly obvious that the second function acquires exclusive lock
and performs more thorough checks.

Dunno about that. It's defining characteristic is that it checks child
pages against their parent IMV. Things are not often defined in terms
of their locking requirements.

At the risk of sounding a bit verbose, do bt_check_level() for a check
that inspects a level at a time and bt_check_multi_level() for a check
that spans levels sound descriptive?

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#25

Peter Geoghegan

pg@heroku.com

almost 10 years ago

In reply to: Amit Langote (#24)

Re: amcheck (B-Tree integrity checking tool)

On Mon, Mar 14, 2016 at 11:48 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

Dunno about that. It's defining characteristic is that it checks child
pages against their parent IMV. Things are not often defined in terms
of their locking requirements.

At the risk of sounding a bit verbose, do bt_check_level() for a check
that inspects a level at a time and bt_check_multi_level() for a check
that spans levels sound descriptive?

Hmm. But all functions verify multiple levels. What distinguishes
bt_index_parent_check()'s verification is that the downlinks in
internal pages are checked against actual child pages (every item in
the child page, in fact). It's the parent/child relationship that is
verified in addition to the standard checks of every page on and
across (not between) every level.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#26

Amit Langote

Langote_Amit_f8@lab.ntt.co.jp

almost 10 years ago

In reply to: Peter Geoghegan (#25)

Re: amcheck (B-Tree integrity checking tool)

Hi Peter,

On 2016/03/15 16:11, Peter Geoghegan wrote:

On Mon, Mar 14, 2016 at 11:48 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

Dunno about that. It's defining characteristic is that it checks child
pages against their parent IMV. Things are not often defined in terms
of their locking requirements.

At the risk of sounding a bit verbose, do bt_check_level() for a check
that inspects a level at a time and bt_check_multi_level() for a check
that spans levels sound descriptive?

Hmm. But all functions verify multiple levels. What distinguishes
bt_index_parent_check()'s verification is that the downlinks in
internal pages are checked against actual child pages (every item in
the child page, in fact). It's the parent/child relationship that is
verified in addition to the standard checks of every page on and
across (not between) every level.

Ah, I see the nuance. Thanks for the explanation. Maybe,
bt_index_check() and bt_index_parent_child_check() /
bt_index_check_parent_child(). IMHO, the latter more clearly highlights
the fact that parent/child relationships in the form of down-links are
checked.

By the way, one request (as a non-native speaker of English language, who
ends up looking up quite a few words regularly) -

Could we use "conform" or "correspond" instead of "comport" in the
following error message:

"left link/right link pair in index \"%s\" don't comport"

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#27

Peter Geoghegan

pg@heroku.com

almost 10 years ago

In reply to: Amit Langote (#26)

Re: amcheck (B-Tree integrity checking tool)

On Tue, Mar 15, 2016 at 12:31 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

Ah, I see the nuance. Thanks for the explanation. Maybe,
bt_index_check() and bt_index_parent_child_check() /
bt_index_check_parent_child(). IMHO, the latter more clearly highlights
the fact that parent/child relationships in the form of down-links are
checked.

Well, the downlink is in the parent, because there is no such thing as
an "uplink". So I prefer bt_index_parent_check(), since it usefully
hints at starting from the parent. It's also more concise.

By the way, one request (as a non-native speaker of English language, who
ends up looking up quite a few words regularly) -

Could we use "conform" or "correspond" instead of "comport" in the
following error message:

"left link/right link pair in index \"%s\" don't comport"

OK. I'll do something about that.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#28

Amit Langote

Langote_Amit_f8@lab.ntt.co.jp

almost 10 years ago

In reply to: Peter Geoghegan (#27)

Re: amcheck (B-Tree integrity checking tool)

On 2016/03/15 16:42, Peter Geoghegan wrote:

On Tue, Mar 15, 2016 at 12:31 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

By the way, one request (as a non-native speaker of English language, who
ends up looking up quite a few words regularly) -

Could we use "conform" or "correspond" instead of "comport" in the
following error message:

"left link/right link pair in index \"%s\" don't comport"

OK. I'll do something about that.

Thanks a lot.

- Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#29

David Steele

david@pgmasters.net

almost 10 years ago

In reply to: Peter Geoghegan (#27)

Re: amcheck (B-Tree integrity checking tool)

On 3/15/16 3:42 AM, Peter Geoghegan wrote:

On Tue, Mar 15, 2016 at 12:31 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

Ah, I see the nuance. Thanks for the explanation. Maybe,
bt_index_check() and bt_index_parent_child_check() /
bt_index_check_parent_child(). IMHO, the latter more clearly highlights
the fact that parent/child relationships in the form of down-links are
checked.

Well, the downlink is in the parent, because there is no such thing as
an "uplink". So I prefer bt_index_parent_check(), since it usefully
hints at starting from the parent. It's also more concise.

By the way, one request (as a non-native speaker of English language, who
ends up looking up quite a few words regularly) -

Could we use "conform" or "correspond" instead of "comport" in the
following error message:

"left link/right link pair in index \"%s\" don't comport"

OK. I'll do something about that.

It looks like an updated patch is expected here, though it seems that
the only requests are for updates to comments.

--
-David
david@pgmasters.net

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#30

Peter Geoghegan

pg@heroku.com

almost 10 years ago

In reply to: David Steele (#29)

Re: amcheck (B-Tree integrity checking tool)

On Tue, Mar 22, 2016 at 9:33 AM, David Steele <david@pgmasters.net> wrote:

It looks like an updated patch is expected here, though it seems that
the only requests are for updates to comments.

That's right - I have a small number of feedback items to work
through. I also determined myself that there could be a very low
probability race condition when checking the key space across sibling
pages, and will work to address that. If I'm right about that then
it's not a lot of work to fix; I'm probably just going to use the
right page's high key rather than its first data item.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#31

Peter Geoghegan

pg@heroku.com

almost 10 years ago

In reply to: Peter Geoghegan (#30)

Re: amcheck (B-Tree integrity checking tool)

On Tue, Mar 22, 2016 at 10:57 AM, Peter Geoghegan <pg@heroku.com> wrote:

That's right - I have a small number of feedback items to work
through. I also determined myself that there could be a very low
probability race condition when checking the key space across sibling
pages, and will work to address that. If I'm right about that then
it's not a lot of work to fix; I'm probably just going to use the
right page's high key rather than its first data item.

I also want to use amcheck to test sorting, especially external
sorting which is currently totally untested. It would be nice to see
testing of strxfrm() on the buildfarm, too -- amcheck provides a nice
way to make sure strxfrm() and strcoll() are in agreement for at least
those cases that are tested, without having to worry about portability
in the same way as a simple pg_regress approach would require.

Note that there are reportedly systems in mainstream use where
strxfrm() is broken; it doesn't agree with strcoll().

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#32

Peter Geoghegan

pg@heroku.com

almost 10 years ago

In reply to: Peter Geoghegan (#30)

1 attachment(s)

Re: amcheck (B-Tree integrity checking tool)

On Tue, Mar 22, 2016 at 10:57 AM, Peter Geoghegan <pg@heroku.com> wrote:

That's right - I have a small number of feedback items to work
through. I also determined myself that there could be a very low
probability race condition when checking the key space across sibling
pages, and will work to address that.

I attach a V2 revision.

Behavioral changes:

* Fixes very low probability race condition when verifying ordering
across sibling pages with bt_index_check()/AccessShareLock.

Notably, the fix does not involve using the high key from the next
page instead of the first item, despite my recent suggestion that it
would; I changed my mind. In this revision,
bt_right_page_check_scankey() continues to return the first item on
the next/right page (as a scankey), just as in the first version. The
locking considerations here are more complicated than before, and
hopefully correct. I abandoned the high key idea when I realized that
a concurrent page split could happen in the right sibling, making the
high key generally no more safe in the face of concurrent target page
deletion than continuing to use the first data item. Using the first
data item from the right page is better than using the high key item
from the right page verification-wise anyway, so that's good. I found
a new way of locking down and recovering from any race conditions (the
low probability bug in V1 for bt_index_check()/AccessShareLock calls
to the C function bt_right_page_check_scankey()). I won't go into the
gory details here, but I will mention that my new approach is inspired
by how backwards scans work already.

* Adds DEBUG1 traces that indicate the level and page currently being checked.

Advanced users will probably find these traces useful on occasion.
It's nice to be able to see how amcheck walks the B-Tree in practice.

Non-behavioral changes:

* Explains why an ExclusiveLock is needed on the target index by
bt_index_parent_check() (and a ShareLock on its heap relation).

This new explanation talks about why an ExclusiveLock *is* necessary
when checking downlinks against child pages, per Tomas Vondra's
request. (Note that nothing changes about child/downlink verification
itself). Before now, I focused on why they were not necessary, but
Tomas was right to point out that that isn't enough information.

* No longer uses the word "comport" in an error message, per Amit
Langote. Similarly, minor typos fixed, per Tomas.

* Adds a variety of pg_regress-based tests for external sorting. (Via
various CREATE INDEX statements involving a variety of datatypes. Data
is generated pseudo-randomly.)

Notably, the regression tests *never* test external sorting today.

ISTM that amcheck is a convenient, reliable, and portable way of
testing the sorting of collatable types like text, so this is the best
way of testing external sorting. External sorting uses abbreviated
keys for runs, but discards them as runs are written out, and so the
merging of runs does not use abbreviated keys. Abbreviated comparisons
and authoritative comparisons are therefore reliably used within the
same sort. One particular goal here is that something like the recent
strxfrm() debacle might be caught by this testing on some platforms
(if we ever get back to trusting strxfrm() at all, that is). The style
of these new tests is unorthodox, because a variety of collations are
tested, which varies from system to system; we should discuss all
this. The tests can take a minute or two to run on my laptop, but that
could vary significantly. And so, whether or not that needs to be
targeted by something other than "check" and "installcheck" is a
discussion that needs to happen. I imagine a lot of buildfarm animals
have slow storage, but on the other hand I don't think hackers run the
contrib regression tests so often. I prefer to not follow the example
of test_decoding if possible; test_decoding's Makefile only runs tests
when wal_level=logical is expected (that's how the relevant Makefile
targets are scoped), but it's a complicated special case.

Review
------

As already noted after the first bullet point, there are some tricky
considerations on fixing the race when the user calls
bt_index_check(). Reviewers should pay close attention to this; could
I be wrong about that being race-safe even now? This seems to me to be
by far the best place for reviewers to look for problems in this
patch, as it's the only bt_index_check()/AccessShareLock case that
compares anything *across* pages. The measures taken to prevent false
positives described within bt_target_page_check() require expert
review, especially because this is the one case where a faulty
rationale may result in under-protection from false positives, and not
just harmless over-protection.

The rationale for why we can reliably recover from a race described
within the C function bt_right_page_check_scankey() is *complicated*.
I may have failed to make it as simple as possible despite significant
effort. It's hard to describe these things in an accessible way. Maybe
someone can try and understand what I've said there, and let me know
how I might have fallen short. I have used a new term, "cousin page"
in this explanation (cf. sibling page). This seems like useful
terminology that makes these difficult explanations a bit more
accessible.

--
Peter Geoghegan

Attachments:

0001-Add-amcheck-extension-to-contrib.patchtext/x-patch; charset=US-ASCII; name=0001-Add-amcheck-extension-to-contrib.patchDownload

From 06085b27dc36c1e05e0049d90374b9c687e5ea9a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 10 Jun 2014 22:20:40 -0700
Subject: [PATCH] Add amcheck extension to contrib

The new extension makes available two SQL-callable functions, each of
which accept a single argument (a regclass/nbtree index relation
argument) and return void.

The SQL-callable function bt_index_check() checks invariants of the
target index by walking the tree, performing various checks of the
sanity of the page space using insertion scankeys to compare items.  The
SQL-callable function bt_index_parent_check() performs a superset of the
checks performed by bt_index_check(): the same checks, as well as
another check verifying that each downlink is actually respected as a
lower bound on its child page.

bt_index_check() requires only an AccessShareLock on the target.
bt_index_parent_check() requires an ExclusiveLock on the target,
and ShareLock on its heap relation.

To test amcheck, add regression tests that feature a variety of CREATE
INDEX statements (these are always external sort cases), and verify the
indexes afterwards in each case.  Aside from testing amcheck itself,
this approach also fills a gap in the test coverage of the standard
regression tests, since external sorts were previously completely
untested.  This also makes it possible to test OS collations without any
specific expectation about an "expected" pg_regress output, or even
specific advanced knowledge of which collations are to be tested.
---
 contrib/Makefile                                   |    1 +
 contrib/amcheck/Makefile                           |   22 +
 contrib/amcheck/amcheck--1.0.sql                   |   20 +
 contrib/amcheck/amcheck.c                          | 1265 ++++++++++++++++++++
 contrib/amcheck/amcheck.control                    |    5 +
 contrib/amcheck/expected/extern_sort_bytea.out     |   44 +
 .../amcheck/expected/extern_sort_collations.out    |   42 +
 contrib/amcheck/expected/extern_sort_numeric.out   |   44 +
 contrib/amcheck/expected/install_amcheck.out       |    1 +
 contrib/amcheck/sql/extern_sort_bytea.sql          |   23 +
 contrib/amcheck/sql/extern_sort_collations.sql     |   42 +
 contrib/amcheck/sql/extern_sort_numeric.sql        |   29 +
 contrib/amcheck/sql/install_amcheck.sql            |    1 +
 doc/src/sgml/amcheck.sgml                          |  247 ++++
 doc/src/sgml/contrib.sgml                          |    1 +
 doc/src/sgml/filelist.sgml                         |    1 +
 16 files changed, 1788 insertions(+)
 create mode 100644 contrib/amcheck/Makefile
 create mode 100644 contrib/amcheck/amcheck--1.0.sql
 create mode 100644 contrib/amcheck/amcheck.c
 create mode 100644 contrib/amcheck/amcheck.control
 create mode 100644 contrib/amcheck/expected/extern_sort_bytea.out
 create mode 100644 contrib/amcheck/expected/extern_sort_collations.out
 create mode 100644 contrib/amcheck/expected/extern_sort_numeric.out
 create mode 100644 contrib/amcheck/expected/install_amcheck.out
 create mode 100644 contrib/amcheck/sql/extern_sort_bytea.sql
 create mode 100644 contrib/amcheck/sql/extern_sort_collations.sql
 create mode 100644 contrib/amcheck/sql/extern_sort_numeric.sql
 create mode 100644 contrib/amcheck/sql/install_amcheck.sql
 create mode 100644 doc/src/sgml/amcheck.sgml

diff --git a/contrib/Makefile b/contrib/Makefile
index d12dd63..cecfb6d 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -6,6 +6,7 @@ include $(top_builddir)/src/Makefile.global
 
 SUBDIRS = \
 		adminpack	\
+		amcheck		\
 		auth_delay	\
 		auto_explain	\
 		btree_gin	\
diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
new file mode 100644
index 0000000..26b7b9b
--- /dev/null
+++ b/contrib/amcheck/Makefile
@@ -0,0 +1,22 @@
+# contrib/amcheck/Makefile
+
+MODULE_big	= amcheck
+OBJS		= amcheck.o $(WIN32RES)
+
+EXTENSION = amcheck
+DATA = amcheck--1.0.sql
+PGFILEDESC = "amcheck - functions to verify access method invariants"
+
+REGRESS = install_amcheck extern_sort_bytea \
+	extern_sort_collations extern_sort_numeric
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/amcheck
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/amcheck/amcheck--1.0.sql b/contrib/amcheck/amcheck--1.0.sql
new file mode 100644
index 0000000..ebbd6ac
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0.sql
@@ -0,0 +1,20 @@
+/* contrib/amcheck/amcheck--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION amcheck" to load this file. \quit
+
+--
+-- bt_index_check()
+--
+CREATE FUNCTION bt_index_check(index regclass)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT;
+
+--
+-- bt_index_parent_check()
+--
+CREATE FUNCTION bt_index_parent_check(index regclass)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT;
diff --git a/contrib/amcheck/amcheck.c b/contrib/amcheck/amcheck.c
new file mode 100644
index 0000000..66d1f4f
--- /dev/null
+++ b/contrib/amcheck/amcheck.c
@@ -0,0 +1,1265 @@
+/*-------------------------------------------------------------------------
+ *
+ * amcheck.c
+ *		Verifies the integrity of access methods based on invariants.
+ *
+ * Provides SQL-callable functions for verifying that various logical
+ * invariants in the structure of index access methods are respected.  This
+ * includes, for example, the invariant that each page in the target B-Tree
+ * index has "real" items in logical order as reported by an insertion scankey
+ * (the insertion scankey sort-wise NULL semantics are useful for
+ * verification).
+ *
+ *
+ * Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  contrib/amcheck/amcheck.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/transam.h"
+#include "catalog/index.h"
+#include "catalog/pg_am.h"
+#include "commands/tablecmds.h"
+#include "miscadmin.h"
+#include "storage/lmgr.h"
+#include "utils/memutils.h"
+#include "utils/snapmgr.h"
+
+
+PG_MODULE_MAGIC;
+
+#define CHECK_SUPERUSER() { \
+		if (!superuser()) \
+			ereport(ERROR, \
+					(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
+					 (errmsg("must be superuser to use verification functions")))); }
+
+/*
+* As noted in comments above _bt_compare(), there is special handling of the
+* first data item (that is, the first item with a valid downlink -- not the
+* high key item) on a non-leaf (internal) page.  There is clearly no point in
+* having amcheck functions make any comparison of or against these "minus
+* infinity" items, because they contain no actual information other than the
+* downlink.
+*/
+#define OFFSET_IS_MINUS_INFINITY(opaque, offset) \
+	(!P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque))
+
+/*
+ * Callers to verification functions should never receive a false positive
+ * indication of corruption.  Therefore, when using amcheck functions for
+ * stress testing, it may be useful to temporally change the CORRUPTION elevel
+ * to PANIC, to immediately halt the server in the event of detecting an
+ * invariant condition violation.  This may preserve more information about the
+ * nature of the underlying problem.  Note that modifying the CORRUPTION
+ * constant to be an elevel < ERROR is not well tested.
+ */
+#ifdef NOT_USED
+#define CORRUPTION		PANIC
+#define CONCERN			WARNING
+#else
+#define CORRUPTION		ERROR
+#define CONCERN			INFO
+#endif
+
+/*
+ * A B-Tree cannot possibly have this many levels, since there must be one
+ * block per level, which is bound by the range of BlockNumber:
+ */
+#define InvalidBtreeLevel	((uint32) InvalidBlockNumber)
+
+/*
+ * State associated with verifying a B-Tree index
+ */
+typedef struct BtreeCheckState
+{
+	/*
+	 * Unchanging state, established at start of verification:
+	 *
+	 * rel:                 B-Tree Index Relation
+	 * exclusivelock:       ExclusiveLock held on rel; else AccessShareLock
+	 * checkstrategy:       Buffer access strategy
+	 * targetcontext:       Target page memory context
+	 */
+	Relation				rel;
+	bool					exclusivelock;
+	BufferAccessStrategy 	checkstrategy;
+	MemoryContext			targetcontext;
+
+	/*
+	 * Mutable state, for verification of particular page:
+	 *
+	 * target:              Main target page
+	 * targetblock:         Main target page's block number
+	 * targetlsn:           Main target page's LSN
+	 *
+	 * target is the point of reference for a verification operation.
+	 *
+	 * Note: targetlsn is always from "target".  It is stashed here out of
+	 * convenience.
+	 *
+	 * Other B-Tree pages may be allocated, but those are always auxiliary
+	 * (e.g. they are target's child pages).  Conceptually, only the target
+	 * page is checked.  Each page found by verification's left/right,
+	 * top/bottom scan becomes the target once.
+	 *
+	 * Memory is managed by resetting targetcontext after verification of some
+	 * target page finishes (possibly including target verification that
+	 * depends on non-target page state).
+	 */
+	Page					target;
+	BlockNumber				targetblock;
+	XLogRecPtr				targetlsn;
+
+} BtreeCheckState;
+
+/*
+ * Starting point for verifying an entire B-Tree index level
+ */
+typedef struct BtreeLevel
+{
+	/* Level number (0 is leaf page level). */
+	uint32			level;
+
+	/* Left most block on level.  Scan of level begins here. */
+	BlockNumber		leftmost;
+
+	/* Is this level reported as "true" root level by meta page? */
+	bool			istruerootlevel;
+} BtreeLevel;
+
+PG_FUNCTION_INFO_V1(bt_index_check);
+PG_FUNCTION_INFO_V1(bt_index_parent_check);
+
+static void btree_index_checkable(Relation rel);
+static void bt_check_every_level(Relation rel, bool exclusivelock);
+static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
+											   BtreeLevel level);
+static void bt_target_page_check(BtreeCheckState *state);
+static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
+static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
+							  ScanKey targetkey);
+static bool invariant_key_less_than_equal_offset(BtreeCheckState *state,
+												 ScanKey key,
+												 OffsetNumber upperbound);
+static bool invariant_key_greater_than_equal_offset(BtreeCheckState *state,
+													ScanKey key,
+													OffsetNumber lowerbound);
+static bool invariant_key_less_than_equal_nontarget_offset(BtreeCheckState *state,
+														   Page other,
+														   ScanKey key,
+														   OffsetNumber upperbound);
+static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
+
+/*
+ * bt_index_check(index regclass)
+ *
+ * Verify integrity of B-Tree index.
+ *
+ * Only acquires AccessShareLock on index relation.  Does not consider
+ * invariants that exist between parent/child pages.
+ */
+Datum
+bt_index_check(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	Relation	indrel;
+
+	CHECK_SUPERUSER();
+
+	indrel = relation_open(relid, AccessShareLock);
+
+	/* Relation suitable for checking as B-Tree? */
+	btree_index_checkable(indrel);
+
+	/* Check index */
+	bt_check_every_level(indrel, false);
+
+	relation_close(indrel, AccessShareLock);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * bt_index_parent_check(index regclass)
+ *
+ * Verify integrity of B-Tree index.
+ *
+ * Acquires ExclusiveLock on index relation, and ShareLock on the associated
+ * heap relation, a bit like REINDEX.  Verifies that downlinks in parent pages
+ * are valid lower bounds on child pages.
+ */
+Datum
+bt_index_parent_check(PG_FUNCTION_ARGS)
+{
+	Oid			indrelid = PG_GETARG_OID(0);
+	Oid			heapid;
+	Relation	indrel;
+	Relation	heaprel;
+
+	CHECK_SUPERUSER();
+
+	/*
+	 * We must lock table before index to avoid deadlocks.  However, if the
+	 * passed indrelid isn't an index then IndexGetRelation() will fail.
+	 * Rather than emitting a not-very-helpful error message, postpone
+	 * complaining, expecting that the is-it-an-index test below will fail.
+	 */
+	heapid = IndexGetRelation(indrelid, true);
+	if (OidIsValid(heapid))
+		heaprel = heap_open(heapid, ShareLock);
+	else
+		heaprel = NULL;
+
+	/*
+	 * Open the target index relations separately (like relation_openrv(), but
+	 * with heap relation locked first to prevent deadlocking).  In hot standby
+	 * mode this will raise an error.
+	 */
+	indrel = index_open(indrelid, ExclusiveLock);
+
+	/* Check for active uses of the index in the current transaction */
+	CheckTableNotInUse(indrel, "bt_index_parent_check");
+
+	/* Relation suitable for checking as B-Tree? */
+	btree_index_checkable(indrel);
+
+	/* Check index */
+	bt_check_every_level(indrel, true);
+
+	index_close(indrel, ExclusiveLock);
+	if (heaprel)
+		heap_close(heaprel, ShareLock);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * btree_index_checkable()
+ *
+ * Basic checks about the suitability of a relation for checking as a B-Tree
+ * index.
+ */
+static void
+btree_index_checkable(Relation rel)
+{
+	if (rel->rd_rel->relkind != RELKIND_INDEX ||
+		rel->rd_rel->relam != BTREE_AM_OID)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("only nbtree access method indexes are supported"),
+				 errdetail("index \"%s\" does not using the nbtree access method.",
+						   RelationGetRelationName(rel))));
+
+	if (RELATION_IS_OTHER_TEMP(rel))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot access temporary tables of other sessions"),
+				 errdetail("index \"%s\" is associated with temporary relation.",
+						   RelationGetRelationName(rel))));
+
+	if (!rel->rd_index->indisready)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot check index \"%s\"",
+						RelationGetRelationName(rel)),
+				 errdetail("index is not yet ready for insertions")));
+
+	if (!rel->rd_index->indisvalid)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot check index \"%s\"",
+						RelationGetRelationName(rel)),
+				 errdetail("index is not valid")));
+}
+
+/*
+ * bt_check_every_level()
+ *
+ * Main entry point for B-Tree SQL-callable functions.  Walks the B-Tree in
+ * logical order, verifying invariants as it goes.
+ *
+ * It is the caller's responsibility to acquire appropriate heavyweight lock on
+ * the index relation, and advise us if extra checks are safe when an
+ * ExclusiveLock is held.  An ExclusiveLock is generally assumed to prevent any
+ * kind of physical modification to the index structure, including
+ * modifications that VACUUM may make.
+ */
+static void
+bt_check_every_level(Relation rel, bool exclusivelock)
+{
+	BtreeCheckState	   *state;
+	Page				metapage;
+	BTMetaPageData	   *metad;
+	uint32				previouslevel;
+	BtreeLevel			current;
+
+	/*
+	 * RecentGlobalXmin assertion matches index_getnext_tid().  See note on
+	 * RecentGlobalXmin/B-Tree page deletion.
+	 */
+	Assert(TransactionIdIsValid(RecentGlobalXmin));
+
+	/*
+	 * Initialize state for entire verification operation
+	 */
+	state = palloc(sizeof(BtreeCheckState));
+	state->rel = rel;
+	state->exclusivelock = exclusivelock;
+	state->checkstrategy = GetAccessStrategy(BAS_BULKREAD);
+	/* Create context for page */
+	state->targetcontext = AllocSetContextCreate(CurrentMemoryContext,
+												 "amcheck page data",
+												 ALLOCSET_DEFAULT_MINSIZE,
+												 ALLOCSET_DEFAULT_INITSIZE,
+												 ALLOCSET_DEFAULT_MAXSIZE);
+
+	/* Get true root block from meta-page */
+	metapage = palloc_btree_page(state, BTREE_METAPAGE);
+	metad = BTPageGetMeta(metapage);
+
+	/*
+	 * Certain deletion patterns can result in "skinny" B-Tree indexes, where
+	 * the fast root and true root differ.
+	 *
+	 * Start from the true root, not the fast root, unlike conventional index
+	 * scans.  This approach is more thorough, and removes the risk of
+	 * following a stale fast root from the meta page.
+	 */
+	if (metad->btm_fastroot != metad->btm_root)
+		ereport(CONCERN,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				 errmsg("fast root mismatch in index %s",
+						RelationGetRelationName(rel)),
+				 errdetail_internal("Fast block %u (level %u) "
+									"differs from true root "
+									"block %u (level %u).",
+									metad->btm_fastroot, metad->btm_fastlevel,
+									metad->btm_root, metad->btm_level)));
+
+	/*
+	 * Starting at the root, verify every level.  Move left to right, top to
+	 * bottom.  Note that there may be no pages other than the meta page (meta
+	 * page can indicate that root is P_NONE when the index is totally empty).
+	 */
+	previouslevel = InvalidBtreeLevel;
+	current.level = metad->btm_level;
+	current.leftmost = metad->btm_root;
+	current.istruerootlevel =  true;
+	while (current.leftmost != P_NONE)
+	{
+		/*
+		 * Verify this level, and get left most page for next level down, if
+		 * not at leaf level
+		 */
+		current = bt_check_level_from_leftmost(state, current);
+
+		if (current.leftmost == InvalidBlockNumber)
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("index \"%s\" has no valid pages on level below %u or first level",
+							RelationGetRelationName(rel), previouslevel)));
+
+		previouslevel = current.level;
+	}
+
+	/* Be tidy: */
+	MemoryContextDelete(state->targetcontext);
+}
+
+/*
+ * bt_check_level_from_leftmost()
+ *
+ * Given a left-most block at some level, move right, verifying each page
+ * individually (with more verification across pages for "exclusivelock"
+ * callers).  Caller should pass the true root page as the leftmost initially,
+ * working their way down by passing what is returned for the last call here
+ * until level 0 (leaf page level) was reached.
+ *
+ * Returns state for next call, if any.  This includes left-most block number
+ * one level lower that should be checked on next call here, or P_NONE when
+ * leaf level is to be checked.  The nbtree level number convention is that
+ * higher levels have higher numbers, because new levels are added only due to
+ * a root page split.  Prior to the first root page split, when there are only
+ * two pages (the root page and the meta page), the root page is classified as
+ * a leaf page, not an internal page.
+ *
+ * Note on memory management:  State's per-page context is reset here, between
+ * each call to bt_target_page_check().
+ */
+static BtreeLevel
+bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
+{
+	/* State to establish early, concerning entire level */
+	BTPageOpaque	opaque;
+	MemoryContext 	oldcontext;
+	BtreeLevel		nextleveldown;
+
+	/* Variables for iterating across level using right links */
+	BlockNumber		leftcurrent = P_NONE;
+	BlockNumber		current = level.leftmost;
+
+	/* Initialize return state */
+	nextleveldown.leftmost = InvalidBlockNumber;
+	nextleveldown.level = InvalidBtreeLevel;
+	nextleveldown.istruerootlevel = false;
+
+	/* Use page-level context for duration of this call */
+	oldcontext = MemoryContextSwitchTo(state->targetcontext);
+
+	elog(DEBUG1, "verifying level %u%s", level.level,
+		 level.istruerootlevel?
+		 " (true root level)" : level.level == 0 ? " (leaf level)" : "");
+
+	do
+	{
+		/* Don't rely on CHECK_FOR_INTERRUPTS() calls at lower level */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Initialize state for this iteration */
+		state->targetblock = current;
+		state->target = palloc_btree_page(state, state->targetblock);
+		state->targetlsn = PageGetLSN(state->target);
+
+		opaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+
+		if (P_IGNORE(opaque))
+		{
+			if (P_RIGHTMOST(opaque))
+				ereport(CORRUPTION,
+						(errcode(ERRCODE_INDEX_CORRUPTED),
+						 errmsg("block %u fell off the end of index \"%s\"",
+								current, RelationGetRelationName(state->rel))));
+			else
+				ereport(CONCERN,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("block %u of index \"%s\" ignored",
+								current, RelationGetRelationName(state->rel))));
+			goto nextpage;
+		}
+		else if (nextleveldown.leftmost == InvalidBlockNumber)
+		{
+			/*
+			 * A concurrent page split could make the caller supplied leftmost
+			 * block no longer contain the leftmost page, or no longer be the
+			 * true root, but where that isn't possible due to heavyweight
+			 * locking, check that the first valid page meets caller's
+			 * expectations.
+			 */
+			if (state->exclusivelock)
+			{
+				if (!P_LEFTMOST(opaque))
+					ereport(CORRUPTION,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("block %u is not leftmost in index \"%s\"",
+									current, RelationGetRelationName(state->rel))));
+
+				if (level.istruerootlevel && !P_ISROOT(opaque))
+					ereport(CORRUPTION,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("block %u is not true root in index \"%s\"",
+									current, RelationGetRelationName(state->rel))));
+			}
+
+			/*
+			 * Before beginning any non-trivial examination of level, establish
+			 * next level down's leftmost block number, which next call here
+			 * will pass as its leftmost (iff this isn't leaf level).
+			 *
+			 * There should be at least one non-ignorable page per level.
+			 */
+			if (!P_ISLEAF(opaque))
+			{
+				IndexTuple		itup;
+				ItemId			itemid;
+
+				/* Internal page -- downlink gets leftmost on next level */
+				itemid = PageGetItemId(state->target, P_FIRSTDATAKEY(opaque));
+				itup = (IndexTuple) PageGetItem(state->target, itemid);
+				nextleveldown.leftmost = ItemPointerGetBlockNumber(&(itup->t_tid));
+				nextleveldown.level = opaque->btpo.level - 1;
+			}
+			else
+			{
+				/*
+				 * Leaf page -- final level caller must process.
+				 *
+				 * Note that this could also be the root page, if there has
+				 * been no root page split yet.
+				 */
+				nextleveldown.leftmost = P_NONE;
+				nextleveldown.level = InvalidBtreeLevel;
+			}
+
+			/*
+			 * Finished setting up state for this call/level.  Control will
+			 * never end up back here in any future loop iteration for this
+			 * level.
+			 */
+		}
+
+		if (state->exclusivelock && opaque->btpo_prev != leftcurrent)
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("right link/left link pair in index \"%s\" not in mutual agreement",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Deleted block=%u left block=%u link reported block=%u.",
+										current, leftcurrent, opaque->btpo_prev)));
+
+		/* Verify invariants for page -- all important checks occur here */
+		bt_target_page_check(state);
+
+nextpage:
+
+		/* Try to detect circular links */
+		if (current == leftcurrent || current == opaque->btpo_prev)
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("circular link chain found in block %u of index \"%s\"",
+							current, RelationGetRelationName(state->rel))));
+
+		leftcurrent = current;
+		current = opaque->btpo_next;
+
+		/* Free page and associated memory for this iteration */
+		MemoryContextReset(state->targetcontext);
+	}
+	while (current != P_NONE);
+
+	/* Don't change context for caller */
+	MemoryContextSwitchTo(oldcontext);
+
+	return nextleveldown;
+}
+
+/*
+ * bt_target_page_check()
+ *
+ * Function performs the following checks on target page, or pages ancillary to
+ * target page:
+ *
+ * - That every "real" data item is less than or equal to the high key, which
+ *   is an upper bound on the items on the pages (where there is a high key at
+ *   all -- pages that are rightmost lack one).
+ *
+ * - That within the page, every "real" item is less than or equal to the item
+ *   immediately to its right, if any (i.e., that the items are in order within
+ *   the page, so that the binary searches performed by index scans are sane).
+ *
+ * - That the last item stored on the page is less than or equal to the first
+ *   "real" data item on the page to the right (if such a first item is
+ *   available from next page).
+ *
+ * Furthermore, when state passed shows ExclusiveLock held, function also
+ * checks:
+ *
+ * - That all child pages respect downlinks lower bound (internal pages only).
+ *
+ * Note:  This routine is not especially proactive in freeing memory.  High
+ * watermark memory consumption is bound to some small fixed multiple of
+ * BLCKSZ, though.  Caller should reset the current context between calls here.
+ */
+static void
+bt_target_page_check(BtreeCheckState *state)
+{
+	OffsetNumber	offset;
+	OffsetNumber	max;
+	BTPageOpaque	topaque;
+
+	topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+	max = PageGetMaxOffsetNumber(state->target);
+
+	elog(DEBUG1, "verifying %u items on %s block %u", max,
+		 P_ISLEAF(topaque) ? "leaf":"internal", state->targetblock);
+
+	/*
+	 * Loop over page items, but don't start from P_HIKEY (don't have iteration
+	 * directly considering high key item, if any).  That's something that is
+	 * used as part of verifying all other items, but doesn't get its own
+	 * iteration.
+	 */
+	for (offset = P_FIRSTDATAKEY(topaque);
+		 offset <= max;
+		 offset = OffsetNumberNext(offset))
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+		ScanKey		skey;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Don't try to generate scankey using "minus infinity" garbage data */
+		if (OFFSET_IS_MINUS_INFINITY(topaque, offset))
+			continue;
+
+		/* Build insertion scankey for current page offset */
+		itemid = PageGetItemId(state->target, offset);
+		itup = (IndexTuple) PageGetItem(state->target, itemid);
+		skey = _bt_mkscankey(state->rel, itup);
+
+		/*
+		 *   ********************
+		 *   * High key check   *
+		 *   ********************
+		 *
+		 * If there is a high key, which there must be for a non-rightmost
+		 * page, check that it actually is upper bound on all page items.
+		 *
+		 * We prefer to check all items, rather than checking just the first
+		 * and trusting that the operator class obeys the transitive law (which
+		 * implies that all subsequent items also respected the high key
+		 * invariant if they pass the page order check).
+		 *
+		 * Ideally, we'd compare every item in the index against every other
+		 * item in the index, and not trust opclass obedience of the transitive
+		 * law to bridge the gap between children and their grandparents (as
+		 * well as great-grandparents, and so on).  We don't go to those
+		 * lengths because that would be prohibitively expensive, and probably
+		 * not markedly more effective in practice.
+		 */
+		if (!P_RIGHTMOST(topaque) &&
+			!invariant_key_less_than_equal_offset(state, skey, P_HIKEY))
+		{
+			char	   *itid,  *htid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumber(&(itup->t_tid)),
+							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("high key invariant violated for index \"%s\"",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Index tid=%s points to %s tid=%s "
+										"page lsn=%X/%X.",
+										itid,
+										P_ISLEAF(topaque) ? "heap":"index",
+										htid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
+
+		/*
+		 *   ********************
+		 *   * Page order check *
+		 *   ********************
+		 *
+		 * Check that items are stored on page in logical order, by checking
+		 * current item is less than or equal to next item (if any).
+		 */
+		if (OffsetNumberNext(offset) <= max &&
+			!invariant_key_less_than_equal_offset(state, skey,
+												  OffsetNumberNext(offset)))
+		{
+			char	   *itid,  *htid,
+					   *nitid, *nhtid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumber(&(itup->t_tid)),
+							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+			nitid = psprintf("(%u,%u)", state->targetblock,
+							 OffsetNumberNext(offset));
+
+			/* Reuse itup to get pointed-to heap location of second item */
+			itemid = PageGetItemId(state->target, OffsetNumberNext(offset));
+			itup = (IndexTuple) PageGetItem(state->target, itemid);
+			nhtid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumber(&(itup->t_tid)),
+							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("page order invariant violated for index \"%s\"",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Lower index tid=%s (points to %s tid=%s) "
+										"higher index tid=%s (points to %s tid=%s) "
+										"page lsn=%X/%X.",
+										itid,
+										P_ISLEAF(topaque) ? "heap":"index",
+										htid,
+										nitid,
+										P_ISLEAF(topaque) ? "heap":"index",
+										nhtid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
+		/*
+		 *   ********************
+		 *   * Last item check  *
+		 *   ********************
+		 *
+		 * Check last item against next/right page's first data item's when
+		 * last item on page is reached.
+		 *
+		 * The general idea here is that checking the ordering of items on the
+		 * page should still perform some check on the last item on the page,
+		 * if at all possible.  In other words, this is roughly the same
+		 * process as the page order check that has already been performed for
+		 * every other "real" item on target page by now; we just need to reach
+		 * into the next page to get a scankey to compare against lower bound
+		 * of max.
+		 */
+		else if (offset == max)
+		{
+			ScanKey		rightkey;
+
+			/* Get item in next/right page */
+			rightkey = bt_right_page_check_scankey(state);
+
+			if (rightkey &&
+				!invariant_key_greater_than_equal_offset(state, rightkey, max))
+			{
+				/*
+				 * As discussed in bt_right_page_check_scankey(),
+				 * non-ExclusiveLock case might have had target page deleted,
+				 * in which case no error is raised
+				 */
+				if (!state->exclusivelock)
+				{
+					/* Get fresh copy of target page */
+					state->target = palloc_btree_page(state, state->targetblock);
+					topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+
+					/*
+					 * Because of RecentGlobalXmin interlock against VACUUM's
+					 * recycling of blocks, we can safely assume that this is
+					 * substantively the same target page as before.
+					 *
+					 * Just return, because all !exclusivelock checks already
+					 * performed against target.
+					 */
+					if (P_IGNORE(topaque))
+						return;
+				}
+				ereport(CORRUPTION,
+						(errcode(ERRCODE_INDEX_CORRUPTED),
+						 errmsg("cross page order invariant violated for index \"%s\"",
+								RelationGetRelationName(state->rel)),
+						 errdetail_internal("Last item on page tid=(%u,%u) "
+											"right page block=%u "
+											"page lsn=%X/%X.",
+											state->targetblock, offset,
+											topaque->btpo_next,
+											(uint32) (state->targetlsn >> 32),
+											(uint32) state->targetlsn)));
+			}
+		}
+
+		/*
+		 *   ********************
+		 *   * Downlink check   *
+		 *   ********************
+		 *
+		 * Additional check of child items against target page (their parent),
+		 * iff this is internal page and caller holds ExclusiveLock on index
+		 * relation.  This involves a pass over each child page at the end of
+		 * each iteration (excluding the minus infinity iteration which
+		 * internal pages always have, which is immediately skipped).
+		 */
+		if (!P_ISLEAF(topaque) && state->exclusivelock)
+		{
+			BlockNumber childblock = ItemPointerGetBlockNumber(&(itup->t_tid));
+
+			bt_downlink_check(state, childblock, skey);
+		}
+	}
+}
+
+/*
+ * bt_right_page_check_scankey()
+ *
+ * Return a scankey for an item on page to right of current target (or the
+ * first non-ignorable page), sufficient to check ordering invariant on last
+ * item in current target page.  Returned scankey relies on local memory
+ * allocated for the child page, which caller cannot pfree().  Caller's memory
+ * context should be reset between calls here.
+ *
+ * This is the first data item, and so all adjacent items are checked against
+ * their immediate sibling item (which may be on the same page, or may be the
+ * cases this function helps with: comparisons of items across sibling pages,
+ * or even comparisons of items across "cousin" pages).
+ *
+ * If no first data item is available on right page, just return NULL.  Note
+ * that !exclusivelock callers must reverify that target page has not been
+ * concurrently deleted.  There is a complex explanation for why this is below.
+ */
+static ScanKey
+bt_right_page_check_scankey(BtreeCheckState *state)
+{
+	BTPageOpaque	opaque;
+	ItemId			rightitem;
+	BlockNumber		targetnext;
+	Page			rightpage;
+	OffsetNumber	nline;
+
+	/* Determine target's next block number */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+
+	/* If target is already rightmost, no right sibling; nothing to do here */
+	if (P_RIGHTMOST(opaque))
+		return NULL;
+
+	/*
+	 * General notes on concurrent page splits and page deletion:
+	 *
+	 * Concurrent page splits are not a problem for ordinary index scans, since
+	 * the key space always moves in a way that lets index scans not miss
+	 * things:  they might have to move right, but they never have to move left
+	 * (leaving aside index scans backwards, a special case).  A concurrent
+	 * page split could occur here, but just as with index scans we're
+	 * following the stale right link, which will reliably get us further along
+	 * in the key space, which is all we really need to get an item further
+	 * along in key space to check invariant in target page.
+	 *
+	 * A deleted page won't actually be recycled by VACUUM early enough for us
+	 * to fail to be able follow its right link (or left link, or downlink),
+	 * because it doesn't do so until it knows that no possible index scan
+	 * could land on the page with the expectation of at least being able to
+	 * move right and eventually find a non-ignorable page.  (see page
+	 * recycling/RecentGlobalXmin notes in nbtree README.)
+	 *
+	 * It's okay if we follow a rightlink and find a half-dead or dead
+	 * (ignorable) page.  Either way, there must be a sane further right link
+	 * to follow for these ignorable pages, because page deletion refuses to
+	 * merge the key space between adjacent pages that do not share a common
+	 * parent (that is, merging of the key space has to be among true sibling
+	 * pages, never cousin pages).  We should succeed in finding a page to the
+	 * right that isn't ignorable before too long.
+	 */
+	targetnext = opaque->btpo_next;
+	for (;;)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		rightpage = palloc_btree_page(state, targetnext);
+		opaque = (BTPageOpaque) PageGetSpecialPointer(rightpage);
+
+		if (!P_IGNORE(opaque) || P_RIGHTMOST(opaque))
+			break;
+
+		/* We landed on a deleted page, so step right to find a live page */
+		targetnext = opaque->btpo_next;
+		ereport(CONCERN,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("level %u leftmost page of index \"%s\" was found deleted or half dead",
+						opaque->btpo.level, RelationGetRelationName(state->rel)),
+				 errdetail_internal("Deleted page found when building scankey from right sibling.")));
+
+		/* Be slightly more pro-active in freeing this memory, just in case */
+		pfree(rightpage);
+	}
+
+	/*
+	 * No ExclusiveLock held case -- why it's safe to proceed.
+	 *
+	 * Problem:
+	 *
+	 * We must avoid false positive reports of corruption when caller treats
+	 * item returned here as an upper bound on target's last item.  In general,
+	 * false positives are disallowed.  Ensuring they don't happen in
+	 * !exclusivelock case is subtle.
+	 *
+	 * A concurrent page deletion by VACUUM of the target page can result in
+	 * the insertion of items on to this right sibling page that would
+	 * previously have been inserted on our target page.  There might have been
+	 * insertions that followed target's downlink after it was made to point to
+	 * right sibling instead of target by page deletion's first phase.  The
+	 * inserters insert items that would belong on target page.  This race is
+	 * very tight, but it's possible.  This is our only problem.
+	 *
+	 * Non-problems:
+	 *
+	 * We are not hindered by a concurrent page split of the target; we'll
+	 * never land on the second half of the page anyway.  A concurrent split of
+	 * the right page will also not matter, because the first data item remains
+	 * the same within the left half, which we'll reliably land on.  If we had
+	 * to skip over ignorable/deleted pages, it cannot matter because their key
+	 * space has already been atomically merged with the first non-ignorable
+	 * page we eventually find (doesn't matter whether the page we eventually
+	 * find is a true sibling or a cousin of target, which we go into below).
+	 *
+	 * Solution:
+	 *
+	 * Caller knows that it should reverify that target is not ignorable
+	 * (half-dead or deleted) when cross-page sibling item comparison appears
+	 * to indicate corruption (invariant fails).  This detects the single race
+	 * condition that exists for caller.  This is correct because the continued
+	 * existence of target block as non-ignorable (not half-dead or deleted)
+	 * implies that target page was not merged into from the right by deletion;
+	 * the key space at or after target never moved left.  Target's parent
+	 * either has the same downlink to target as before, or a <= downlink due
+	 * to deletion at the left of target.  Target either has the same high key
+	 * as before, or a high key <= before when there is a page split. (The
+	 * rightmost concurrently-split-from-target-page page will still have the
+	 * same high key as target was originally found to have, which for our
+	 * purposes is equivalent to target's high key itself never changing, since
+	 * we reliably skip over concurrently-split-from-target-page pages.)
+	 *
+	 * In simpler terms, we allow that the key space of the target may expand
+	 * left (the key space can move left on the left side of target only), but
+	 * the target key space cannot expand right and get ahead of us without our
+	 * detecting it.  The key space of the target cannot shrink, unless it
+	 * shrinks to zero due to the deletion of the original page, our canary
+	 * condition.  (To be very precise, we're a bit stricter than that because
+	 * it might just have been that the target page split and only the original
+	 * target page was deleted.  We can be more strict, just not more lax.)
+	 *
+	 * Top level tree walk caller moves on to next page (makes it the new
+	 * target) following recovery from this race.  (cf.  The rationale for
+	 * child/downlink verification needing an ExclusiveLock within
+	 * bt_downlink_check(), where page deletion is also the main source of
+	 * trouble.)
+	 *
+	 * Note that it doesn't matter if right sibling page here is actually a
+	 * cousin page, because in order for the key space to be readjusted in a
+	 * way that causes us issues in next level up (guiding problematic
+	 * concurrent insertions to the cousin from the grandparent rather than to
+	 * the sibling from the parent), there'd have to be page deletion of
+	 * target's parent page (affecting target's parent's downlink in target's
+	 * grandparent page).  Internal page deletion only occurs when there are no
+	 * child pages (they were all fully deleted), and caller is checking that
+	 * the target's parent has at least one non-deleted (so non-ignorable)
+	 * child: the target page.  (Note that the first phase of deletion
+	 * atomically marks the page to be deleted half-dead/ignorable at the same
+	 * time downlink in its parent is removed, so we'll definitely be able to
+	 * detect that this might have happened just from the target page.)
+	 *
+	 * This trick is inspired by the method backward scans use for dealing with
+	 * concurrent page splits; concurrent page deletion is a problem that
+	 * similarly receives special consideration sometimes (it's possible that
+	 * the backwards scan will re-read its "original" block after failing to
+	 * find a right-link to it, having already moved in the opposite direction
+	 * (right/"forwards") a few times to try to locate one).  Just like us,
+	 * that happens only to determine if there was a concurrent page deletion
+	 * of a reference page, and just like us if there was a page deletion of
+	 * that reference page it means we can move on from caring about the
+	 * reference page.  See the nbtree README for a full description of how
+	 * that works.
+	 */
+	nline = PageGetMaxOffsetNumber(rightpage);
+
+	/*
+	 * Get first data item.
+	 *
+	 * Importantly, this allows the verification of page order across target
+	 * and rightmost page when rightmost page is the target's right sibling.
+	 * Moreover, the only way logical inconsistencies can really be missed
+	 * across a given level is if the transitive law is broken by an opclass,
+	 * because contiguous pairs are always compared, even across page
+	 * boundaries. (Assuming we have a "stable snapshot" of structure, as in
+	 * exclusivelock case.  Otherwise, strictly speaking, we might just miss
+	 * corruption that would have been caught in exclusivelock case.)
+	 */
+	if (P_ISLEAF(opaque) && nline >= P_FIRSTDATAKEY(opaque))
+	{
+		/*
+		 * For leaf page, return first data item (if any), which cannot be a
+		 * "minus infinity" item
+		 */
+		rightitem = PageGetItemId(rightpage, P_FIRSTDATAKEY(opaque));
+	}
+	else if (!P_ISLEAF(opaque) &&
+			 nline >= OffsetNumberNext(P_FIRSTDATAKEY(opaque)))
+	{
+		/*
+		 * Return first item after the internal page's undefined "minus
+		 * infinity" item.  Also skip the high key when page isn't rightmost on
+		 * level.
+		 */
+		rightitem = PageGetItemId(rightpage,
+								  OffsetNumberNext(P_FIRSTDATAKEY(opaque)));
+	}
+	else
+	{
+		/*
+		 * No first item.  Page is probably empty leaf page, but it's also
+		 * possible that it's an internal page with only a minus infinity item,
+		 * because caller moves left to right.
+		 */
+		ereport(CONCERN,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("%s block %u of index \"%s\" has no first data item",
+						P_ISLEAF(opaque) ? "leaf":"internal", targetnext,
+						RelationGetRelationName(state->rel))));
+		return NULL;
+	}
+
+	/*
+	 * Return first real item scankey.  Note that this relies on right page
+	 * memory remaining allocated.
+	 */
+	return _bt_mkscankey(state->rel,
+						 (IndexTuple) PageGetItem(rightpage, rightitem));
+}
+
+/*
+ * bt_downlink_check()
+ *
+ * Checks one of target's downlink against its child page.
+ *
+ * Conceptually, the target page continues to be what is checked here.  The
+ * target block is still blamed in the event of finding an invariant violation.
+ * The downlink insertion into the target is probably where any problem raised
+ * here arises, and there is no such thing as a parent link, so doing the
+ * verification this way around is much more practical.
+ */
+static void
+bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
+				  ScanKey targetkey)
+{
+	OffsetNumber	offset;
+	OffsetNumber	maxoffset;
+	Page			child;
+	BTPageOpaque	copaque;
+
+	/*
+	 * Caller must have ExclusiveLock on target relation, because of
+	 * considerations around page deletion by VACUUM.
+	 *
+	 * N.B.: In general, page deletion deletes the right sibling's downlink,
+	 * not the downlink of the page being deleted; the deleted page's downlink
+	 * is reused for its sibling.  The key space is thereby consolidated
+	 * between the deleted page and its right sibling.  (We cannot delete a
+	 * parent page's rightmost page unless it is the last child page, and we
+	 * intend to delete the parent itself.)
+	 *
+	 * If this verification happened without an ExclusiveLock, the following
+	 * race condition could cause false positives (which are generally
+	 * disallowed):
+	 *
+	 * Not having an ExclusiveLock would allow concurrent page deletion,
+	 * including deletion of the left sibling of the child page that is
+	 * examined here.  If such a page deletion occurred, and was then closely
+	 * followed by an insertion into the newly expanded key space of the child,
+	 * a false positive may result: our stale parent/target downlink would
+	 * legitimately not be a lower bound on all items in the page anymore,
+	 * because the key space was concurrently expanded "left" (insertion
+	 * followed the "new" downlink for the child, not our now-stale downlink,
+	 * which was concurrently physically removed in target/parent as part of
+	 * deletion's first phase).
+	 *
+	 * Note that while the cross-page-same-level check uses a trick that allows
+	 * it to perform verification for !exclusivelock callers, an analogous
+	 * trick seems very difficult here.  The trick that that other check uses
+	 * is, in essence, to lock down race conditions to those that occur due to
+	 * concurrent page deletion of the target; that's a race that can be
+	 * reliably detected before actually reporting corruption.  On the other
+	 * hand, we'd need to lock down race conditions involving deletion of
+	 * child's left page, at least for long enough to read the child page into
+	 * memory.  Any more granular locking schemes all seem to involve multiple
+	 * concurrently held buffer locks.  That's unacceptable for amcheck on
+	 * general principle, though; amcheck functions never hold more than one
+	 * buffer lock at a time.
+	 */
+	Assert(state->exclusivelock);
+
+	/*
+	 * Verify child page has the downlink key from target page (its parent) as
+	 * a lower bound.
+	 *
+	 * We prefer to check all items, rather than checking just the first and
+	 * trusting that the operator class obeys the transitive law (which implies
+	 * that all subsequent items also respected the downlink-as-lower-bound
+	 * invariant if they pass the page order check when the child later becomes
+	 * our target for verification).
+	 */
+	child = palloc_btree_page(state, childblock);
+	copaque = (BTPageOpaque) PageGetSpecialPointer(child);
+	maxoffset = PageGetMaxOffsetNumber(child);
+
+	for (offset = P_FIRSTDATAKEY(copaque);
+		 offset <= maxoffset;
+		 offset = OffsetNumberNext(offset))
+	{
+		/*
+		 * Skip comparison of target page key against "minus infinity" item, if
+		 * any.  Checking it would indicate that its not an upper bound, but
+		 * that's only because of the hard-coding within _bt_compare().
+		 */
+		if (OFFSET_IS_MINUS_INFINITY(copaque, offset))
+			continue;
+
+		if (!invariant_key_less_than_equal_nontarget_offset(state, child,
+															targetkey, offset))
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("down-link lower bound invariant violated for index \"%s\"",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Parent block=%u child index tid=(%u,%u) "
+										"parent page lsn=%X/%X.",
+										state->targetblock, childblock, offset,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+	}
+
+	pfree(child);
+}
+
+/*
+ * invariant_key_less_than_equal_offset()
+ *
+ * Does the invariant hold that the key is less than or equal to a given upper
+ * bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static bool
+invariant_key_less_than_equal_offset(BtreeCheckState *state, ScanKey key,
+									 OffsetNumber upperbound)
+{
+	int16		natts = state->rel->rd_rel->relnatts;
+	int32		cmp;
+
+	cmp = _bt_compare(state->rel, natts, key, state->target, upperbound);
+
+	return cmp <= 0;
+}
+
+/*
+ * invariant_key_less_than_equal_offset()
+ *
+ * Does the invariant hold that the key is greater than or equal to a given
+ * lower bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static bool
+invariant_key_greater_than_equal_offset(BtreeCheckState *state, ScanKey key,
+										OffsetNumber lowerbound)
+{
+	int16		natts = state->rel->rd_rel->relnatts;
+	int32		cmp;
+
+	cmp = _bt_compare(state->rel, natts, key, state->target, lowerbound);
+
+	return cmp >= 0;
+}
+
+/*
+ * invariant_key_less_than_equal_nontarget_offset()
+ *
+ * Does the invariant hold that the key is less than or equal to a given upper
+ * bound offset item, with the offset relating to a caller-supplied page that
+ * is not the current target page? Caller's non-target page is typically a
+ * child page of the target, checked as part of checking a property of the
+ * target page (i.e. the key comes from the target).
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static bool
+invariant_key_less_than_equal_nontarget_offset(BtreeCheckState *state,
+											   Page nontarget, ScanKey key,
+											   OffsetNumber upperbound)
+{
+	int16		natts = state->rel->rd_rel->relnatts;
+	int32		cmp;
+
+	cmp = _bt_compare(state->rel, natts, key, nontarget, upperbound);
+
+	return cmp <= 0;
+}
+
+/*
+ * palloc_btree_page()
+ *
+ * Given a block number of a B-Tree page, return page in palloc()'d memory.
+ * While at it, perform some basic checks of the page.
+ *
+ * There is never an attempt to get a consistent view of multiple pages using
+ * multiple concurrent buffer locks; in general, we prefer to have only one pin
+ * and buffer lock at a time, which is often all that the nbtree code requires.
+ *
+ * Operating on a copy of the page is useful because it prevents control
+ * getting stuck in an uninterruptible state when an underlying operator class
+ * misbehaves.
+ */
+static Page
+palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
+{
+	Buffer			buffer;
+	Page			page;
+	BTPageOpaque	opaque;
+
+	page = palloc(BLCKSZ);
+
+	/*
+	 * We copy the page into local storage to avoid holding pin on the buffer
+	 * longer than we must.
+	 */
+	buffer = ReadBufferExtended(state->rel, MAIN_FORKNUM, blocknum, RBM_NORMAL,
+								state->checkstrategy);
+	LockBuffer(buffer, BT_READ);
+
+	/*
+	 * Perform the same basic sanity checking that nbtree itself performs for
+	 * every page:
+	 */
+	_bt_checkpage(state-> rel, buffer);
+
+	/* Only use copy of page in palloc()'d memory */
+	memcpy(page, BufferGetPage(buffer), BLCKSZ);
+	UnlockReleaseBuffer(buffer);
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	if (opaque->btpo_flags & BTP_META && blocknum != BTREE_METAPAGE)
+		ereport(CORRUPTION,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("invalid meta page found at block %u in index \"%s\"",
+						blocknum, RelationGetRelationName(state->rel))));
+
+	/* Check page from block that ought to be meta page */
+	if (blocknum == BTREE_METAPAGE)
+	{
+		BTMetaPageData *metad = BTPageGetMeta(page);
+
+		if (!(opaque->btpo_flags & BTP_META) ||
+			metad->btm_magic != BTREE_MAGIC)
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("index \"%s\" meta page is corrupt",
+							RelationGetRelationName(state->rel))));
+
+		if (metad->btm_version != BTREE_VERSION)
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("version mismatch in index \"%s\": file version %d, code version %d",
+							RelationGetRelationName(state->rel),
+							metad->btm_version, BTREE_VERSION)));
+	}
+
+	/*
+	 * Deleted pages have no sane "level" field, so can only check non-deleted
+	 * page level
+	 */
+	if (P_ISLEAF(opaque) && !P_ISDELETED(opaque) && opaque->btpo.level != 0)
+		ereport(CORRUPTION,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("invalid leaf page level %u for block %u in index \"%s\"",
+						opaque->btpo.level, blocknum, RelationGetRelationName(state->rel))));
+
+	if (blocknum != BTREE_METAPAGE && !P_ISLEAF(opaque) &&
+		!P_ISDELETED(opaque) && opaque->btpo.level == 0)
+		ereport(CORRUPTION,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("invalid internal page level 0 for block %u in index \"%s\"",
+						opaque->btpo.level, RelationGetRelationName(state->rel))));
+
+	if (!P_ISLEAF(opaque) && P_HAS_GARBAGE(opaque))
+		ereport(CORRUPTION,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("internal page block %u in index \"%s\" has garbage items",
+						blocknum, RelationGetRelationName(state->rel))));
+
+	return page;
+}
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
new file mode 100644
index 0000000..180f150
--- /dev/null
+++ b/contrib/amcheck/amcheck.control
@@ -0,0 +1,5 @@
+# amcheck extension
+comment = 'verify access method invariants'
+default_version = '1.0'
+module_pathname = '$libdir/amcheck'
+relocatable = true
diff --git a/contrib/amcheck/expected/extern_sort_bytea.out b/contrib/amcheck/expected/extern_sort_bytea.out
new file mode 100644
index 0000000..54a8f3b
--- /dev/null
+++ b/contrib/amcheck/expected/extern_sort_bytea.out
@@ -0,0 +1,44 @@
+CREATE TEMP TABLE bytea_test (i bytea);
+-- Seed random number generator to make outcome deterministic.
+SELECT setseed(0.5);
+ setseed 
+---------
+ 
+(1 row)
+
+INSERT INTO bytea_test SELECT decode(repeat(md5(random()::text), (random() * 100)::int4), 'hex')
+      FROM generate_series(0, 100000);
+SET maintenance_work_mem = '1MB';
+CREATE INDEX idx_bytea ON bytea_test(i);
+SELECT bt_index_parent_check('idx_bytea');
+ bt_index_parent_check 
+-----------------------
+ 
+(1 row)
+
+SET maintenance_work_mem = '3MB';
+REINDEX INDEX idx_bytea;
+SELECT bt_index_parent_check('idx_bytea');
+ bt_index_parent_check 
+-----------------------
+ 
+(1 row)
+
+SET maintenance_work_mem = '5MB';
+REINDEX INDEX idx_bytea;
+SELECT bt_index_parent_check('idx_bytea');
+ bt_index_parent_check 
+-----------------------
+ 
+(1 row)
+
+SET maintenance_work_mem = '20MB';
+REINDEX INDEX idx_bytea;
+SELECT bt_index_parent_check('idx_bytea');
+ bt_index_parent_check 
+-----------------------
+ 
+(1 row)
+
+-- drop the temporary table
+DROP TABLE bytea_test;
diff --git a/contrib/amcheck/expected/extern_sort_collations.out b/contrib/amcheck/expected/extern_sort_collations.out
new file mode 100644
index 0000000..63ee4d0
--- /dev/null
+++ b/contrib/amcheck/expected/extern_sort_collations.out
@@ -0,0 +1,42 @@
+-- Perform tests of external sorting collated text.  Only test at most 10
+-- available collations, to make sure that the tests complete in a reasonable
+-- amount of time.
+DO $x$
+DECLARE
+  r record;
+  index_build_mem int4;
+BEGIN
+	FOR r IN SELECT oid, collname FROM pg_collation ORDER BY oid LIMIT 10 LOOP
+
+  EXECUTE format($y$CREATE TEMP TABLE text_test_collation (i text COLLATE "%s");$y$, r.collname);
+
+  -- Seed random number generator so that each collation that happens to be
+  -- available on the system consistently sorts the same strings, using the
+  -- same maintenance_work_mem setting.  In theory, varying these makes the
+  -- tests more likely to catch problems.
+  PERFORM setseed(get_byte(decode(md5(r.collname), 'hex'), 5) / 256.0);
+  -- Don't actually output collation names
+  -- RAISE NOTICE '%', r.collname;
+
+  -- Make sure MD5 strings contain some space characters, and a doublequote, so
+  -- multiple "weight levels" are represented.  Typically, OS strcoll()
+  -- implementations follow the conventions of the standard Unicode collation
+  -- algorithm, and primarily weight alphabetical ordering when comparing
+  -- strings.
+  INSERT INTO text_test_collation
+      SELECT replace(replace(md5(random()::text), '4', ' '), '9', '"')
+      FROM generate_series(0, 100000);
+
+  -- Set maintenance_work_mem  to at least mimimum value, 1MB, and at most 10MB
+  -- to test external sorting:
+  index_build_mem := (SELECT greatest((random() * 1024 * 10)::int4, 1024));
+  -- Don't actually output pseudo-random maintenance_work_mem:
+  -- RAISE NOTICE '%', index_build_mem;
+  EXECUTE 'SET maintenance_work_mem = ' || index_build_mem;
+  CREATE INDEX idx_text ON text_test_collation(i);
+  PERFORM bt_index_parent_check('idx_text');
+  -- drop the temporary table
+  DROP TABLE text_test_collation;
+	END LOOP;
+END;
+$x$;
diff --git a/contrib/amcheck/expected/extern_sort_numeric.out b/contrib/amcheck/expected/extern_sort_numeric.out
new file mode 100644
index 0000000..253f9b3
--- /dev/null
+++ b/contrib/amcheck/expected/extern_sort_numeric.out
@@ -0,0 +1,44 @@
+CREATE TEMP TABLE numeric_test (i numeric);
+-- Seed random number generator to make outcome deterministic.
+SELECT setseed(0.5);
+ setseed 
+---------
+ 
+(1 row)
+
+INSERT INTO numeric_test SELECT random() * 100000000 FROM generate_series(0, 5000000);
+-- special cases
+INSERT INTO numeric_test SELECT exp(0.0);
+INSERT INTO numeric_test SELECT exp(1.0);
+INSERT INTO numeric_test SELECT exp(1.0::numeric(71,70));
+INSERT INTO numeric_test SELECT 0.0 ^ 0.0;
+INSERT INTO numeric_test SELECT (-12.34) ^ 0.0;
+INSERT INTO numeric_test SELECT 12.34 ^ 0.0;
+INSERT INTO numeric_test SELECT 0.0 ^ 12.34;
+SET maintenance_work_mem = '1MB';
+CREATE INDEX idx_numeric ON numeric_test(i);
+SELECT bt_index_parent_check('idx_numeric');
+ bt_index_parent_check 
+-----------------------
+ 
+(1 row)
+
+SET maintenance_work_mem = '5MB';
+REINDEX INDEX idx_numeric;
+SELECT bt_index_parent_check('idx_numeric');
+ bt_index_parent_check 
+-----------------------
+ 
+(1 row)
+
+-- Only this final REINDEX requires just one merge pass:
+SET maintenance_work_mem = '20MB';
+REINDEX INDEX idx_numeric;
+SELECT bt_index_parent_check('idx_numeric');
+ bt_index_parent_check 
+-----------------------
+ 
+(1 row)
+
+-- drop the temporary table
+DROP TABLE numeric_test;
diff --git a/contrib/amcheck/expected/install_amcheck.out b/contrib/amcheck/expected/install_amcheck.out
new file mode 100644
index 0000000..5b3e6d5
--- /dev/null
+++ b/contrib/amcheck/expected/install_amcheck.out
@@ -0,0 +1 @@
+CREATE EXTENSION amcheck;
diff --git a/contrib/amcheck/sql/extern_sort_bytea.sql b/contrib/amcheck/sql/extern_sort_bytea.sql
new file mode 100644
index 0000000..a302c55e
--- /dev/null
+++ b/contrib/amcheck/sql/extern_sort_bytea.sql
@@ -0,0 +1,23 @@
+CREATE TEMP TABLE bytea_test (i bytea);
+
+-- Seed random number generator to make outcome deterministic.
+SELECT setseed(0.5);
+
+INSERT INTO bytea_test SELECT decode(repeat(md5(random()::text), (random() * 100)::int4), 'hex')
+      FROM generate_series(0, 100000);
+
+SET maintenance_work_mem = '1MB';
+CREATE INDEX idx_bytea ON bytea_test(i);
+SELECT bt_index_parent_check('idx_bytea');
+SET maintenance_work_mem = '3MB';
+REINDEX INDEX idx_bytea;
+SELECT bt_index_parent_check('idx_bytea');
+SET maintenance_work_mem = '5MB';
+REINDEX INDEX idx_bytea;
+SELECT bt_index_parent_check('idx_bytea');
+SET maintenance_work_mem = '20MB';
+REINDEX INDEX idx_bytea;
+SELECT bt_index_parent_check('idx_bytea');
+
+-- drop the temporary table
+DROP TABLE bytea_test;
diff --git a/contrib/amcheck/sql/extern_sort_collations.sql b/contrib/amcheck/sql/extern_sort_collations.sql
new file mode 100644
index 0000000..63ee4d0
--- /dev/null
+++ b/contrib/amcheck/sql/extern_sort_collations.sql
@@ -0,0 +1,42 @@
+-- Perform tests of external sorting collated text.  Only test at most 10
+-- available collations, to make sure that the tests complete in a reasonable
+-- amount of time.
+DO $x$
+DECLARE
+  r record;
+  index_build_mem int4;
+BEGIN
+	FOR r IN SELECT oid, collname FROM pg_collation ORDER BY oid LIMIT 10 LOOP
+
+  EXECUTE format($y$CREATE TEMP TABLE text_test_collation (i text COLLATE "%s");$y$, r.collname);
+
+  -- Seed random number generator so that each collation that happens to be
+  -- available on the system consistently sorts the same strings, using the
+  -- same maintenance_work_mem setting.  In theory, varying these makes the
+  -- tests more likely to catch problems.
+  PERFORM setseed(get_byte(decode(md5(r.collname), 'hex'), 5) / 256.0);
+  -- Don't actually output collation names
+  -- RAISE NOTICE '%', r.collname;
+
+  -- Make sure MD5 strings contain some space characters, and a doublequote, so
+  -- multiple "weight levels" are represented.  Typically, OS strcoll()
+  -- implementations follow the conventions of the standard Unicode collation
+  -- algorithm, and primarily weight alphabetical ordering when comparing
+  -- strings.
+  INSERT INTO text_test_collation
+      SELECT replace(replace(md5(random()::text), '4', ' '), '9', '"')
+      FROM generate_series(0, 100000);
+
+  -- Set maintenance_work_mem  to at least mimimum value, 1MB, and at most 10MB
+  -- to test external sorting:
+  index_build_mem := (SELECT greatest((random() * 1024 * 10)::int4, 1024));
+  -- Don't actually output pseudo-random maintenance_work_mem:
+  -- RAISE NOTICE '%', index_build_mem;
+  EXECUTE 'SET maintenance_work_mem = ' || index_build_mem;
+  CREATE INDEX idx_text ON text_test_collation(i);
+  PERFORM bt_index_parent_check('idx_text');
+  -- drop the temporary table
+  DROP TABLE text_test_collation;
+	END LOOP;
+END;
+$x$;
diff --git a/contrib/amcheck/sql/extern_sort_numeric.sql b/contrib/amcheck/sql/extern_sort_numeric.sql
new file mode 100644
index 0000000..018b87a
--- /dev/null
+++ b/contrib/amcheck/sql/extern_sort_numeric.sql
@@ -0,0 +1,29 @@
+CREATE TEMP TABLE numeric_test (i numeric);
+
+-- Seed random number generator to make outcome deterministic.
+SELECT setseed(0.5);
+
+INSERT INTO numeric_test SELECT random() * 100000000 FROM generate_series(0, 5000000);
+
+-- special cases
+INSERT INTO numeric_test SELECT exp(0.0);
+INSERT INTO numeric_test SELECT exp(1.0);
+INSERT INTO numeric_test SELECT exp(1.0::numeric(71,70));
+INSERT INTO numeric_test SELECT 0.0 ^ 0.0;
+INSERT INTO numeric_test SELECT (-12.34) ^ 0.0;
+INSERT INTO numeric_test SELECT 12.34 ^ 0.0;
+INSERT INTO numeric_test SELECT 0.0 ^ 12.34;
+
+SET maintenance_work_mem = '1MB';
+CREATE INDEX idx_numeric ON numeric_test(i);
+SELECT bt_index_parent_check('idx_numeric');
+SET maintenance_work_mem = '5MB';
+REINDEX INDEX idx_numeric;
+SELECT bt_index_parent_check('idx_numeric');
+-- Only this final REINDEX requires just one merge pass:
+SET maintenance_work_mem = '20MB';
+REINDEX INDEX idx_numeric;
+SELECT bt_index_parent_check('idx_numeric');
+
+-- drop the temporary table
+DROP TABLE numeric_test;
diff --git a/contrib/amcheck/sql/install_amcheck.sql b/contrib/amcheck/sql/install_amcheck.sql
new file mode 100644
index 0000000..5b3e6d5
--- /dev/null
+++ b/contrib/amcheck/sql/install_amcheck.sql
@@ -0,0 +1 @@
+CREATE EXTENSION amcheck;
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
new file mode 100644
index 0000000..524f88d
--- /dev/null
+++ b/doc/src/sgml/amcheck.sgml
@@ -0,0 +1,247 @@
+<!-- doc/src/sgml/amcheck.sgml -->
+
+<sect1 id="amcheck" xreflabel="amcheck">
+ <title>amcheck</title>
+
+ <indexterm zone="amcheck">
+  <primary>amcheck</primary>
+ </indexterm>
+
+ <para>
+  The <filename>amcheck</> module provides functions that allow you to
+  verify the logical consistency of the structure of indexes.  If the
+  structure appears to be valid, no error is raised.
+ </para>
+
+ <para>
+  The functions verify various <emphasis>invariants</> in the
+  structure of the representation of particular indexes.  The
+  correctness of the access method functions behind index scans and
+  other important operations is predicated on these invariants always
+  holding.  For example, certain functions verify, among other things,
+  that all B-Tree pages have items in <quote>logical</> order; if that
+  particular invariant somehow fails to hold, we can expect binary
+  searches on any affected page to produce wrong answers.  As a
+  consequence of that, index scans may subtly produce wrong answers.
+ </para>
+
+ <para>
+  Verification is performed using the same procedures as those used by
+  index scans themselves, which may be user-defined operator class
+  code.  For example, B-Tree index verification relies on comparisons
+  made with one or more B-Tree support function 1 routines, much like
+  B-Tree index scans rely on the routines to guide the scan to a point
+  in the underlying table;  see <xref linkend="xindex-support"> for
+  details of operator class support functions.
+ </para>
+
+ <para>
+  <filename>amcheck</> functions may be used only by superusers.
+ </para>
+
+ <sect2>
+  <title>Functions</title>
+
+  <variablelist>
+   <varlistentry>
+    <term>
+     <function>bt_index_check(index regclass) returns void</function>
+     <indexterm>
+      <primary>bt_index_check</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <function>bt_index_check</function> tests that its target, a
+      B-Tree index, respects a variety of invariants.  Example usage:
+<screen>
+test=# SELECT bt_index_check(c.oid), c.relname, c.relpages
+FROM pg_index i
+JOIN pg_opclass op ON i.indclass[0] = op.oid
+JOIN pg_am am ON op.opcmethod = am.oid
+JOIN pg_class c ON i.indexrelid = c.oid
+JOIN pg_namespace n ON c.relnamespace = n.oid
+WHERE am.amname = 'btree' AND n.nspname = 'pg_catalog'
+-- Don't check pg_class (bt_index_parent_check() requires this):
+AND c.relname NOT LIKE 'pg_class%'
+-- Function may throw an error when this is omitted:
+AND i.indisready AND i.indisvalid
+ORDER BY c.relpages DESC LIMIT 10;
+ bt_index_check |             relname             | relpages 
+----------------+---------------------------------+----------
+                | pg_depend_reference_index       |       43
+                | pg_depend_depender_index        |       40
+                | pg_proc_proname_args_nsp_index  |       31
+                | pg_description_o_c_o_index      |       21
+                | pg_attribute_relid_attnam_index |       14
+                | pg_proc_oid_index               |       10
+                | pg_attribute_relid_attnum_index |        9
+                | pg_amproc_fam_proc_index        |        5
+                | pg_amop_opr_fam_index           |        5
+                | pg_amop_fam_strat_index         |        5
+(10 rows)
+</screen>
+      This example shows a session that performs verification of every
+      catalog index in the database <quote>test</> (except those
+      associated with the <structname>pg_class</structname> catalog).
+      Details of just the 10 largest indexes verified are displayed.
+      Since no error is raised, all indexes tested appear to be
+      logically consistent.  Naturally, this query could easily be
+      changed to call <function>bt_index_check</function> for every
+      index in the database where verification is supported.
+     </para>
+     <para>
+      An <literal>AccessShareLock</> is acquired on the target index
+      by <function>bt_index_check</function>.  This lock mode is the
+      same lock mode acquired on relations by simple
+      <literal>SELECT</> statements.
+      <function>bt_index_check</function> does not verify invariants
+      that span child/parent relationships, nor does it verify that
+      the target index is consistent with its heap relation.  When a
+      routine, lightweight test for corruption is required in a live
+      production environment, using
+      <function>bt_index_check</function> often provides the best
+      trade-off between thoroughness of verification and limiting the
+      impact on application performance and availability.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
+     <function>bt_index_parent_check(index regclass) returns void</function>
+     <indexterm>
+      <primary>bt_index_parent_check</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <function>bt_index_parent_check</function> tests that its
+      target, a B-Tree index, respects a variety of invariants.  The
+      checks performed by <function>bt_index_parent_check</function>
+      are a superset of the checks performed by
+      <function>bt_index_check</function>.
+      <function>bt_index_parent_check</function> can be thought of as
+      a more thorough variant of <function>bt_index_check</function>:
+      unlike <function>bt_index_check</function>,
+      <function>bt_index_parent_check</function> also checks
+      invariants that span parent/child relationships.  However, it
+      does not verify that the target index is consistent with its
+      heap relation.  <function>bt_index_parent_check</function>
+      follows the general convention of raising an error if it finds a
+      logical inconsistency or other problem.
+     </para>
+     <para>
+      An <literal>ExclusiveLock</> is required on the target index by
+      <function>bt_index_parent_check</function> (a
+      <literal>ShareLock</> is also acquired on the heap relation).
+      These locks prevent concurrent data modification from
+      <command>INSERT</>, <command>UPDATE</>, and <command>DELETE</>
+      commands.  The locks also prevent the underlying relation from
+      being concurrently processed by <command>VACUUM</> and certain
+      other utility commands.  Note that the function holds locks for
+      as short a duration as possible, so there is no advantage to
+      verifying each index individually in a series of transactions.
+     </para>
+     <para>
+      <function>bt_index_parent_check</function>'s additional
+      verification is more likely to detect various pathological
+      cases.  These cases may involve an incorrectly implemented
+      B-Tree operator class used by the index that is checked, or,
+      hypothetically, undiscovered bugs in the underlying B-Tree index
+      access method code.  Note that
+      <function>bt_index_parent_check</function> will raise an error
+      when called while Hot Standby mode is enabled, unlike
+      <function>bt_index_check</function>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </sect2>
+
+ <sect2>
+  <title>Using <filename>amcheck</> effectively</title>
+
+ <para>
+  <filename>amcheck</> can be effective at detecting various types of
+  failure modes that <link
+  linkend="app-initdb-data-checksums"><application>data page
+  checksums</></link> will always fail to catch.  These include:
+
+  <itemizedlist>
+   <listitem>
+    <para>
+     Filesystem or storage subsystem faults where checksums happen to
+     simply not be enabled.  Note that <filename>amcheck</> examines a
+     page as represented in some shared memory buffer at the time of
+     verification if there is only a shared buffer hit when accessing
+     the block.  Consequently, <filename>amcheck</> does not
+     necessarily examine data read from the filesystem at the time of
+     verification.  Note that when checksums are enabled,
+     <filename>amcheck</> may raise an error due to a checksum failure
+     when a corrupt block is read into a buffer.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     Corruption caused by faulty RAM, and the broader memory subsystem
+     and operating system.  <productname>PostgreSQL</> does not
+     protect against correctable memory errors and it is assumed you
+     will operate using RAM that uses industry standard Error
+     Correcting Codes (ECC) or better protection.  However, ECC memory
+     is typically only immune to single-bit errors, and should not be
+     assumed to provide <emphasis>absolute</emphasis> protection
+     against failures that result in memory corruption.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     Structural inconsistencies caused by incorrect operator class
+     implementations, including issues due to the comparison rules of
+     operating system collations changing the ordering implied by
+     comparisons of an underlying collatable type, such as
+     <type>text</>.  Though rare, updates to operating system
+     collation rules can cause these issues.  More commonly, an
+     inconsistency in the collation order between a master server and
+     a standby server is implicated, possibly because the
+     <emphasis>major</> operating system version in use is
+     inconsistent.  Such inconsistencies will generally only arise on
+     standby servers, and so can generally only be detected on standby
+     servers.  If a problem like this arises, it may not affect each
+     individual index that is ordered using an affected collation,
+     simply because <emphasis>indexed</> values might happen to have
+     the same absolute ordering regardless of the behavioral
+     inconsistency.  See <xref linkend="locale"> and <xref
+     linkend="collation"> for further details about how
+     <productname>PostgreSQL</> uses operating system locales and
+     collations.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     Corruption caused by hypothetical undiscovered bugs in the
+     underlying <productname>PostgreSQL</> access method code.
+     Automatic verification of the structural integrity of indexes
+     plays a role in the general testing of new or proposed
+     <productname>PostgreSQL</> features that could plausibly allow a
+     logical inconsistency to be introduced.  One obvious testing
+     strategy is to call <filename>amcheck</> functions continuously
+     when running the standard regression tests.  See <xref
+     linkend="regress-run"> for details on running the tests.
+    </para>
+   </listitem>
+  </itemizedlist>
+
+  No error concerning corruption raised by <filename>amcheck</> should
+  ever be a false positive.  Users are strongly advised to perform a
+  <command>REINDEX</> for any index where corruption is indicated,
+  although it is also important to identify and correct the underlying
+  problem.  In general, <filename>amcheck</> can only prove the
+  presence of corruption; it cannot prove its absence.
+ </para>
+
+ </sect2>
+
+</sect1>
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index 4e3f337..9f6389c 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -103,6 +103,7 @@ CREATE EXTENSION <replaceable>module_name</> FROM unpackaged;
  </para>
 
  &adminpack;
+ &amcheck;
  &auth-delay;
  &auto-explain;
  &btree-gin;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 30adece..5675d6d 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -104,6 +104,7 @@
 <!-- contrib information -->
 <!ENTITY contrib         SYSTEM "contrib.sgml">
 <!ENTITY adminpack       SYSTEM "adminpack.sgml">
+<!ENTITY amcheck         SYSTEM "amcheck.sgml">
 <!ENTITY auth-delay      SYSTEM "auth-delay.sgml">
 <!ENTITY auto-explain    SYSTEM "auto-explain.sgml">
 <!ENTITY btree-gin       SYSTEM "btree-gin.sgml">
-- 
1.9.1

#33

Anastasia Lubennikova

a.lubennikova@postgrespro.ru

almost 10 years ago

In reply to: Peter Geoghegan (#32)

1 attachment(s)

Re: amcheck (B-Tree integrity checking tool)

29.03.2016 06:13, Peter Geoghegan:

On Tue, Mar 22, 2016 at 10:57 AM, Peter Geoghegan <pg@heroku.com> wrote:

That's right - I have a small number of feedback items to work
through. I also determined myself that there could be a very low
probability race condition when checking the key space across sibling
pages, and will work to address that.

I attach a V2 revision.

Behavioral changes:

* Fixes very low probability race condition when verifying ordering
across sibling pages with bt_index_check()/AccessShareLock.

Notably, the fix does not involve using the high key from the next
page instead of the first item, despite my recent suggestion that it
would; I changed my mind. In this revision,
bt_right_page_check_scankey() continues to return the first item on
the next/right page (as a scankey), just as in the first version. The
locking considerations here are more complicated than before, and
hopefully correct. I abandoned the high key idea when I realized that
a concurrent page split could happen in the right sibling, making the
high key generally no more safe in the face of concurrent target page
deletion than continuing to use the first data item. Using the first
data item from the right page is better than using the high key item
from the right page verification-wise anyway, so that's good. I found
a new way of locking down and recovering from any race conditions (the
low probability bug in V1 for bt_index_check()/AccessShareLock calls
to the C function bt_right_page_check_scankey()). I won't go into the
gory details here, but I will mention that my new approach is inspired
by how backwards scans work already.

* Adds DEBUG1 traces that indicate the level and page currently being checked.

Advanced users will probably find these traces useful on occasion.
It's nice to be able to see how amcheck walks the B-Tree in practice.

Non-behavioral changes:

* Explains why an ExclusiveLock is needed on the target index by
bt_index_parent_check() (and a ShareLock on its heap relation).

This new explanation talks about why an ExclusiveLock *is* necessary
when checking downlinks against child pages, per Tomas Vondra's
request. (Note that nothing changes about child/downlink verification
itself). Before now, I focused on why they were not necessary, but
Tomas was right to point out that that isn't enough information.

* No longer uses the word "comport" in an error message, per Amit
Langote. Similarly, minor typos fixed, per Tomas.

* Adds a variety of pg_regress-based tests for external sorting. (Via
various CREATE INDEX statements involving a variety of datatypes. Data
is generated pseudo-randomly.)

Notably, the regression tests *never* test external sorting today.

ISTM that amcheck is a convenient, reliable, and portable way of
testing the sorting of collatable types like text, so this is the best
way of testing external sorting. External sorting uses abbreviated
keys for runs, but discards them as runs are written out, and so the
merging of runs does not use abbreviated keys. Abbreviated comparisons
and authoritative comparisons are therefore reliably used within the
same sort. One particular goal here is that something like the recent
strxfrm() debacle might be caught by this testing on some platforms
(if we ever get back to trusting strxfrm() at all, that is). The style
of these new tests is unorthodox, because a variety of collations are
tested, which varies from system to system; we should discuss all
this. The tests can take a minute or two to run on my laptop, but that
could vary significantly. And so, whether or not that needs to be
targeted by something other than "check" and "installcheck" is a
discussion that needs to happen. I imagine a lot of buildfarm animals
have slow storage, but on the other hand I don't think hackers run the
contrib regression tests so often. I prefer to not follow the example
of test_decoding if possible; test_decoding's Makefile only runs tests
when wal_level=logical is expected (that's how the relevant Makefile
targets are scoped), but it's a complicated special case.

Review
------

As already noted after the first bullet point, there are some tricky
considerations on fixing the race when the user calls
bt_index_check(). Reviewers should pay close attention to this; could
I be wrong about that being race-safe even now? This seems to me to be
by far the best place for reviewers to look for problems in this
patch, as it's the only bt_index_check()/AccessShareLock case that
compares anything *across* pages. The measures taken to prevent false
positives described within bt_target_page_check() require expert
review, especially because this is the one case where a faulty
rationale may result in under-protection from false positives, and not
just harmless over-protection.

The rationale for why we can reliably recover from a race described
within the C function bt_right_page_check_scankey() is *complicated*.
I may have failed to make it as simple as possible despite significant
effort. It's hard to describe these things in an accessible way. Maybe
someone can try and understand what I've said there, and let me know
how I might have fallen short. I have used a new term, "cousin page"
in this explanation (cf. sibling page). This seems like useful
terminology that makes these difficult explanations a bit more
accessible.

I'm not an expert in btree locking races, but I tested the patch it on
different loads and it works as promised.
I didn't find mistakes in the code logic as well.

As a reviewer, I don't have any objections to mark it "Ready for Committer".
The only notice I'd like to add is about it's compatibility with
covering indexes.
We already discussed it in the thread
https://commitfest.postgresql.org/9/433/
Tiny attached patch fixes this issue.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

amcheck_compatible_with_covering_indexes.patchtext/x-patch; name=amcheck_compatible_with_covering_indexes.patchDownload

diff --git a/contrib/amcheck/amcheck.c b/contrib/amcheck/amcheck.c
index 66d1f4f..91b2bc5 100644
--- a/contrib/amcheck/amcheck.c
+++ b/contrib/amcheck/amcheck.c
@@ -1118,6 +1118,9 @@ invariant_key_less_than_equal_offset(BtreeCheckState *state, ScanKey key,
 	int16		natts = state->rel->rd_rel->relnatts;
 	int32		cmp;
 
+#if (PG_VERSION_NUM >= 90600)
+		natts = state->rel->rd_index->indnkeyatts;
+#endif
 	cmp = _bt_compare(state->rel, natts, key, state->target, upperbound);
 
 	return cmp <= 0;
@@ -1139,6 +1142,9 @@ invariant_key_greater_than_equal_offset(BtreeCheckState *state, ScanKey key,
 	int16		natts = state->rel->rd_rel->relnatts;
 	int32		cmp;
 
+#if (PG_VERSION_NUM >= 90600)
+		natts = state->rel->rd_index->indnkeyatts;
+#endif
 	cmp = _bt_compare(state->rel, natts, key, state->target, lowerbound);
 
 	return cmp >= 0;
@@ -1164,6 +1170,9 @@ invariant_key_less_than_equal_nontarget_offset(BtreeCheckState *state,
 	int16		natts = state->rel->rd_rel->relnatts;
 	int32		cmp;
 
+#if (PG_VERSION_NUM >= 90600)
+		natts = state->rel->rd_index->indnkeyatts;
+#endif
 	cmp = _bt_compare(state->rel, natts, key, nontarget, upperbound);
 
 	return cmp <= 0;

#34

Peter Geoghegan

pg@heroku.com

over 9 years ago

In reply to: Peter Geoghegan (#17)

Re: amcheck (B-Tree integrity checking tool)

On Sat, Mar 12, 2016 at 12:38 PM, Peter Geoghegan <pg@heroku.com> wrote:

Only insofar as it helps diagnose the underlying issue, when it is a
more subtle issue. Actually fixing the index is almost certainly a
REINDEX. Once you're into the messy business of diagnosing a
problematic opclass, you have to be an expert, and tweaking amcheck
for your requirements (i.e. rebuilding from source) becomes
reasonable. Part of the reason that the code is so heavily commented
is to make it hackable, because I do not feel optimistic that I can
get an expert-orientated interface right, but I still want to make the
tool as useful as possible to experts.

Heroku began a selective roll-out of amcheck yesterday. amcheck
already found a bug in the PostGiS Geography B-Tree opclass:

https://github.com/postgis/postgis/blob/svn-trunk/postgis/geography_btree.c#L260

The issue is not that PG_RETURN_BOOL() is used here (that's just a
problem of style). Rather, the issue is that the Geography opclass
support function 1 considers an "empty geometry" equal to all other
possible values (values not limited to other empty geometries). This
breaks the transitive law, which nbtree requires and relies on.

I'll go report this to the PostGiS people.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#35

Michael Paquier

michael.paquier@gmail.com

over 9 years ago

In reply to: Peter Geoghegan (#34)

Re: amcheck (B-Tree integrity checking tool)

On Fri, Aug 19, 2016 at 2:40 AM, Peter Geoghegan <pg@heroku.com> wrote:

Heroku began a selective roll-out of amcheck yesterday. amcheck
already found a bug in the PostGiS Geography B-Tree opclass:
[...]
I'll go report this to the PostGiS people.

Cool. I have been honestly wondering about deploying this tool as well
to allow some of the QE tests to perform live checks of btree indexes
as we use a bunch of them. By the way, I have not looked at the patch,
but this supports just btree, right? Wouldn't btree_check be a better
name, or do you think that the interface you provide is generic enough
that it could be extended in the future for gin, gist, etc.?
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#36

Peter Geoghegan

pg@heroku.com

over 9 years ago

In reply to: Michael Paquier (#35)

Re: amcheck (B-Tree integrity checking tool)

On Thu, Aug 18, 2016 at 5:06 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Cool. I have been honestly wondering about deploying this tool as well
to allow some of the QE tests to perform live checks of btree indexes
as we use a bunch of them.

I'd certainly welcome that. There are Debian packages available from
the Github version of amcheck, which is otherwise practically
identical to the most recent version of the patch posted here:

https://github.com/petergeoghegan/amcheck

(This version also targets Postgres 9.4+).

By the way, I have not looked at the patch,
but this supports just btree, right? Wouldn't btree_check be a better
name, or do you think that the interface you provide is generic enough
that it could be extended in the future for gin, gist, etc.?

Yes, the idea of calling it amcheck was support for other AMs could be
added later.

Personally, I would like to make amcheck verifying the heap through a
B-Tree index as a next step. There is also a good case for the tool
directly verifying heap relations, without involving any index, but
that can come later.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#37

Michael Paquier

michael.paquier@gmail.com

over 9 years ago

In reply to: Peter Geoghegan (#36)

Re: amcheck (B-Tree integrity checking tool)

On Fri, Aug 19, 2016 at 9:14 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Thu, Aug 18, 2016 at 5:06 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Cool. I have been honestly wondering about deploying this tool as well
to allow some of the QE tests to perform live checks of btree indexes
as we use a bunch of them.

I'd certainly welcome that. There are Debian packages available from
the Github version of amcheck, which is otherwise practically
identical to the most recent version of the patch posted here:

https://github.com/petergeoghegan/amcheck

This would be packaged from source in my case, but that's no big deal
:) At least I can see that it is added in the next CF, and that's
marked as ready for committer for a couple of months now...
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#38

Peter Geoghegan

pg@heroku.com

over 9 years ago

In reply to: Michael Paquier (#37)

Re: amcheck (B-Tree integrity checking tool)

On Thu, Aug 18, 2016 at 5:35 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

This would be packaged from source in my case, but that's no big deal
:) At least I can see that it is added in the next CF, and that's
marked as ready for committer for a couple of months now...

If you consider how the code is written, you'll see that it's very
unlikely to destabilize production systems.

Every page is copied into local memory once before being operated on,
and only one buffer lock/pin is held at once (for long enough to do
that copying). If there were bugs in amcheck, it's much more likely
that they'd be "false positive" bugs. In any case, I haven't seen any
issue with the tool itself yet, having now run the tool on thousands
of servers.

I think I'll have a lot more information in about a week, when I've
had time to work through more data.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#39

Peter Geoghegan

pg@heroku.com

over 9 years ago

In reply to: Peter Geoghegan (#36)

Re: amcheck (B-Tree integrity checking tool)

On Thu, Aug 18, 2016 at 5:14 PM, Peter Geoghegan <pg@heroku.com> wrote:

I'd certainly welcome that. There are Debian packages available from
the Github version of amcheck, which is otherwise practically
identical to the most recent version of the patch posted here:

https://github.com/petergeoghegan/amcheck

(This version also targets Postgres 9.4+).

There were some extremely minor changes made here compared to my V2
patch, which, ISTM, means that a V3 patch is necessary. I'll get to
that next week.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#40

Kevin Grittner

kgrittn@gmail.com

over 9 years ago

In reply to: Peter Geoghegan (#39)

Re: amcheck (B-Tree integrity checking tool)

On Fri, Sep 2, 2016 at 11:31 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Thu, Aug 18, 2016 at 5:14 PM, Peter Geoghegan <pg@heroku.com> wrote:

I'd certainly welcome that. There are Debian packages available from
the Github version of amcheck, which is otherwise practically
identical to the most recent version of the patch posted here:

https://github.com/petergeoghegan/amcheck

(This version also targets Postgres 9.4+).

There were some extremely minor changes made here compared to my V2
patch, which, ISTM, means that a V3 patch is necessary. I'll get to
that next week.

I changed the CF entry to "Waiting on Author" pending that. (I was
starting to review the patch for commit, so it's none to early to
mention that the last posted version is not what I should be
looking at.)

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#41

Peter Geoghegan

pg@heroku.com

over 9 years ago

In reply to: Kevin Grittner (#40)

Re: amcheck (B-Tree integrity checking tool)

On Fri, Sep 2, 2016 at 11:11 AM, Kevin Grittner <kgrittn@gmail.com> wrote:

I changed the CF entry to "Waiting on Author" pending that. (I was
starting to review the patch for commit, so it's none to early to
mention that the last posted version is not what I should be
looking at.)

There are only tiny differences, which in any case you can see in the
commit log on Github. There is no reason why code review needs to
block on this V3, IMV.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#42

Kevin Grittner

kgrittn@gmail.com

over 9 years ago

In reply to: Peter Geoghegan (#41)

Re: amcheck (B-Tree integrity checking tool)

On Fri, Sep 2, 2016 at 1:16 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Fri, Sep 2, 2016 at 11:11 AM, Kevin Grittner <kgrittn@gmail.com> wrote:

I changed the CF entry to "Waiting on Author" pending that. (I was
starting to review the patch for commit, so it's none to early to
mention that the last posted version is not what I should be
looking at.)

There are only tiny differences, which in any case you can see in the
commit log on Github. There is no reason why code review needs to
block on this V3, IMV.

IMV the process is to post a patch to this list to certify that it
is yours to contribute and free of IP encumbrances that would
prevent us from using it. I will wait for that post.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#43

Peter Geoghegan

pg@heroku.com

over 9 years ago

In reply to: Peter Geoghegan (#41)

Re: amcheck (B-Tree integrity checking tool)

On Fri, Sep 2, 2016 at 11:16 AM, Peter Geoghegan <pg@heroku.com> wrote:

There are only tiny differences, which in any case you can see in the
commit log on Github. There is no reason why code review needs to
block on this V3, IMV.

Also, I don't think that we need to include V2's tests now that there
is independent test coverage for external sorts, which is a recent
development. I imagine that the V2 test run duration might be
problematic on some buildfarm animals. I'll omit that in V3.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#44

Peter Geoghegan

pg@heroku.com

over 9 years ago

In reply to: Kevin Grittner (#42)

1 attachment(s)

Re: amcheck (B-Tree integrity checking tool)

On Fri, Sep 2, 2016 at 11:19 AM, Kevin Grittner <kgrittn@gmail.com> wrote:

IMV the process is to post a patch to this list to certify that it
is yours to contribute and free of IP encumbrances that would
prevent us from using it. I will wait for that post.

I attach my V3. There are only very minor code changes here, such as
breaking out a separate elevel constant for DEBUG2 traces that show
information of how the B-Tree is accessed (e.g. current level, block
number). Tiny tweaks to about 3 comments were also performed. These
changes are trivial.

I've also removed the tests added, since external sort regression test
coverage is in a much better place now.

Finally, the documentation was updated, to make it closer to the
Github project's documentation. I expanded what was started as the
original amcheck sgml documentation based on the experience of using
amcheck on production databases at Heroku. This isn't a big change,
but it's the biggest in V3. The documentation now emphasizes the
likelihood of amcheck finding software errors, which is all we found.
Jim Gray predicted that fault tolerant systems will tend to have
almost all problems arise from operator error and software faults in
practice, so perhaps I should have predicted that that would be the
general trend.

--
Peter Geoghegan

Attachments:

0001-Add-amcheck-extension-to-contrib.patchtext/x-patch; charset=US-ASCII; name=0001-Add-amcheck-extension-to-contrib.patchDownload

From 03046b49c893c7695bbae365c33e3d7b27a1f9a3 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 10 Jun 2014 22:20:40 -0700
Subject: [PATCH] Add amcheck extension to contrib

The new extension makes available two SQL-callable functions, each of
which accept a single argument (a regclass/nbtree index relation
argument) and return void.

The SQL-callable function bt_index_check() checks invariants of the
target index by walking the tree, performing various checks of the
sanity of the page space using insertion scankeys to compare items.  The
SQL-callable function bt_index_parent_check() performs a superset of the
checks performed by bt_index_check(): the same checks, as well as
another check verifying that each downlink is actually respected as a
lower bound on its child page.

bt_index_check() requires only an AccessShareLock on the target.
bt_index_parent_check() requires an ExclusiveLock on the target,
and ShareLock on its heap relation.
---
 contrib/Makefile                 |    1 +
 contrib/amcheck/Makefile         |   19 +
 contrib/amcheck/amcheck--1.0.sql |   20 +
 contrib/amcheck/amcheck.c        | 1267 ++++++++++++++++++++++++++++++++++++++
 contrib/amcheck/amcheck.control  |    5 +
 doc/src/sgml/amcheck.sgml        |  278 +++++++++
 doc/src/sgml/contrib.sgml        |    1 +
 doc/src/sgml/filelist.sgml       |    1 +
 8 files changed, 1592 insertions(+)
 create mode 100644 contrib/amcheck/Makefile
 create mode 100644 contrib/amcheck/amcheck--1.0.sql
 create mode 100644 contrib/amcheck/amcheck.c
 create mode 100644 contrib/amcheck/amcheck.control
 create mode 100644 doc/src/sgml/amcheck.sgml

diff --git a/contrib/Makefile b/contrib/Makefile
index 25263c0..4acd50e 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -6,6 +6,7 @@ include $(top_builddir)/src/Makefile.global
 
 SUBDIRS = \
 		adminpack	\
+		amcheck		\
 		auth_delay	\
 		auto_explain	\
 		bloom		\
diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
new file mode 100644
index 0000000..7bb29aa
--- /dev/null
+++ b/contrib/amcheck/Makefile
@@ -0,0 +1,19 @@
+# contrib/amcheck/Makefile
+
+MODULE_big	= amcheck
+OBJS		= amcheck.o $(WIN32RES)
+
+EXTENSION = amcheck
+DATA = amcheck--1.0.sql
+PGFILEDESC = "amcheck - functions to verify access method invariants"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/amcheck
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/amcheck/amcheck--1.0.sql b/contrib/amcheck/amcheck--1.0.sql
new file mode 100644
index 0000000..ebbd6ac
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0.sql
@@ -0,0 +1,20 @@
+/* contrib/amcheck/amcheck--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION amcheck" to load this file. \quit
+
+--
+-- bt_index_check()
+--
+CREATE FUNCTION bt_index_check(index regclass)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT;
+
+--
+-- bt_index_parent_check()
+--
+CREATE FUNCTION bt_index_parent_check(index regclass)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT;
diff --git a/contrib/amcheck/amcheck.c b/contrib/amcheck/amcheck.c
new file mode 100644
index 0000000..5c4d02d
--- /dev/null
+++ b/contrib/amcheck/amcheck.c
@@ -0,0 +1,1267 @@
+/*-------------------------------------------------------------------------
+ *
+ * amcheck.c
+ *		Verifies the integrity of access methods based on invariants.
+ *
+ * Provides SQL-callable functions for verifying that various logical
+ * invariants in the structure of index access methods are respected.  This
+ * includes, for example, the invariant that each page in the target B-Tree
+ * index has "real" items in logical order as reported by an insertion scankey
+ * (the insertion scankey sort-wise NULL semantics are useful for
+ * verification).
+ *
+ *
+ * Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  contrib/amcheck/amcheck.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/transam.h"
+#include "catalog/index.h"
+#include "catalog/pg_am.h"
+#include "commands/tablecmds.h"
+#include "miscadmin.h"
+#include "storage/lmgr.h"
+#include "utils/memutils.h"
+#include "utils/snapmgr.h"
+
+
+PG_MODULE_MAGIC;
+
+#define CHECK_SUPERUSER() { \
+		if (!superuser()) \
+			ereport(ERROR, \
+					(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
+					 (errmsg("must be superuser to use verification functions")))); }
+
+/*
+ * As noted in comments above _bt_compare(), there is special handling of the
+ * first data item (that is, the first item with a valid downlink -- not the
+ * high key item) on a non-leaf (internal) page.  There is clearly no point in
+ * having amcheck functions make any comparison of or against these "minus
+ * infinity" items, because they contain no actual information other than the
+ * downlink.
+ */
+#define OFFSET_IS_MINUS_INFINITY(opaque, offset) \
+	(!P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque))
+
+/*
+ * Callers to verification functions should never receive a false positive
+ * indication of corruption.  Therefore, when using amcheck functions for
+ * stress testing, it may be useful to temporally change the CORRUPTION elevel
+ * to PANIC, to immediately halt the server in the event of detecting an
+ * invariant condition violation.  This may preserve more information about the
+ * nature of the underlying problem.  Note that modifying the CORRUPTION
+ * constant to be an elevel < ERROR is not well tested.
+ */
+#ifdef NOT_USED
+#define CORRUPTION		PANIC
+#define CONCERN			WARNING
+#define POSITION		NOTICE
+#else
+#define CORRUPTION		ERROR
+#define CONCERN			DEBUG1
+#define POSITION		DEBUG2
+#endif
+
+/*
+ * A B-Tree cannot possibly have this many levels, since there must be one
+ * block per level, which is bound by the range of BlockNumber:
+ */
+#define InvalidBtreeLevel	((uint32) InvalidBlockNumber)
+
+/*
+ * State associated with verifying a B-Tree index
+ */
+typedef struct BtreeCheckState
+{
+	/*
+	 * Unchanging state, established at start of verification:
+	 *
+	 * rel:                 B-Tree Index Relation
+	 * exclusivelock:       ExclusiveLock held on rel; else AccessShareLock
+	 * checkstrategy:       Buffer access strategy
+	 * targetcontext:       Target page memory context
+	 */
+	Relation				rel;
+	bool					exclusivelock;
+	BufferAccessStrategy 	checkstrategy;
+	MemoryContext			targetcontext;
+
+	/*
+	 * Mutable state, for verification of particular page:
+	 *
+	 * target:              Main target page
+	 * targetblock:         Main target page's block number
+	 * targetlsn:           Main target page's LSN
+	 *
+	 * target is the point of reference for a verification operation.
+	 *
+	 * Note: targetlsn is always from "target".  It is stashed here out of
+	 * convenience.
+	 *
+	 * Other B-Tree pages may be allocated, but those are always auxiliary
+	 * (e.g. they are target's child pages).  Conceptually, only the target
+	 * page is checked.  Each page found by verification's left/right,
+	 * top/bottom scan becomes the target once.
+	 *
+	 * Memory is managed by resetting targetcontext after verification of some
+	 * target page finishes (possibly including target verification that
+	 * depends on non-target page state).
+	 */
+	Page					target;
+	BlockNumber				targetblock;
+	XLogRecPtr				targetlsn;
+
+} BtreeCheckState;
+
+/*
+ * Starting point for verifying an entire B-Tree index level
+ */
+typedef struct BtreeLevel
+{
+	/* Level number (0 is leaf page level). */
+	uint32			level;
+
+	/* Left most block on level.  Scan of level begins here. */
+	BlockNumber		leftmost;
+
+	/* Is this level reported as "true" root level by meta page? */
+	bool			istruerootlevel;
+} BtreeLevel;
+
+PG_FUNCTION_INFO_V1(bt_index_check);
+PG_FUNCTION_INFO_V1(bt_index_parent_check);
+
+static void btree_index_checkable(Relation rel);
+static void bt_check_every_level(Relation rel, bool exclusivelock);
+static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
+											   BtreeLevel level);
+static void bt_target_page_check(BtreeCheckState *state);
+static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
+static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
+							  ScanKey targetkey);
+static bool invariant_key_less_than_equal_offset(BtreeCheckState *state,
+												 ScanKey key,
+												 OffsetNumber upperbound);
+static bool invariant_key_greater_than_equal_offset(BtreeCheckState *state,
+													ScanKey key,
+													OffsetNumber lowerbound);
+static bool invariant_key_less_than_equal_nontarget_offset(BtreeCheckState *state,
+														   Page other,
+														   ScanKey key,
+														   OffsetNumber upperbound);
+static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
+
+/*
+ * bt_index_check(index regclass)
+ *
+ * Verify integrity of B-Tree index.
+ *
+ * Only acquires AccessShareLock on index relation.  Does not consider
+ * invariants that exist between parent/child pages.
+ */
+Datum
+bt_index_check(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	Relation	indrel;
+
+	CHECK_SUPERUSER();
+
+	indrel = relation_open(relid, AccessShareLock);
+
+	/* Relation suitable for checking as B-Tree? */
+	btree_index_checkable(indrel);
+
+	/* Check index */
+	bt_check_every_level(indrel, false);
+
+	relation_close(indrel, AccessShareLock);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * bt_index_parent_check(index regclass)
+ *
+ * Verify integrity of B-Tree index.
+ *
+ * Acquires ExclusiveLock on index relation, and ShareLock on the associated
+ * heap relation, a bit like REINDEX.  Verifies that downlinks in parent pages
+ * are valid lower bounds on child pages.
+ */
+Datum
+bt_index_parent_check(PG_FUNCTION_ARGS)
+{
+	Oid			indrelid = PG_GETARG_OID(0);
+	Oid			heapid;
+	Relation	indrel;
+	Relation	heaprel;
+
+	CHECK_SUPERUSER();
+
+	/*
+	 * We must lock table before index to avoid deadlocks.  However, if the
+	 * passed indrelid isn't an index then IndexGetRelation() will fail.
+	 * Rather than emitting a not-very-helpful error message, postpone
+	 * complaining, expecting that the is-it-an-index test below will fail.
+	 */
+	heapid = IndexGetRelation(indrelid, true);
+	if (OidIsValid(heapid))
+		heaprel = heap_open(heapid, ShareLock);
+	else
+		heaprel = NULL;
+
+	/*
+	 * Open the target index relations separately (like relation_openrv(), but
+	 * with heap relation locked first to prevent deadlocking).  In hot standby
+	 * mode this will raise an error.
+	 */
+	indrel = index_open(indrelid, ExclusiveLock);
+
+	/* Check for active uses of the index in the current transaction */
+	CheckTableNotInUse(indrel, "bt_index_parent_check");
+
+	/* Relation suitable for checking as B-Tree? */
+	btree_index_checkable(indrel);
+
+	/* Check index */
+	bt_check_every_level(indrel, true);
+
+	index_close(indrel, ExclusiveLock);
+	if (heaprel)
+		heap_close(heaprel, ShareLock);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * btree_index_checkable()
+ *
+ * Basic checks about the suitability of a relation for checking as a B-Tree
+ * index.
+ */
+static void
+btree_index_checkable(Relation rel)
+{
+	if (rel->rd_rel->relkind != RELKIND_INDEX ||
+		rel->rd_rel->relam != BTREE_AM_OID)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("only nbtree access method indexes are supported"),
+				 errdetail("Relation \"%s\" is not a B-Tree index.",
+						   RelationGetRelationName(rel))));
+
+	if (RELATION_IS_OTHER_TEMP(rel))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot access temporary tables of other sessions"),
+				 errdetail("Index \"%s\" is associated with temporary relation.",
+						   RelationGetRelationName(rel))));
+
+	if (!rel->rd_index->indisready)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot check index \"%s\"",
+						RelationGetRelationName(rel)),
+				 errdetail("Index is not yet ready for insertions")));
+
+	if (!rel->rd_index->indisvalid)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot check index \"%s\"",
+						RelationGetRelationName(rel)),
+				 errdetail("Index is not valid")));
+}
+
+/*
+ * bt_check_every_level()
+ *
+ * Main entry point for B-Tree SQL-callable functions.  Walks the B-Tree in
+ * logical order, verifying invariants as it goes.
+ *
+ * It is the caller's responsibility to acquire appropriate heavyweight lock on
+ * the index relation, and advise us if extra checks are safe when an
+ * ExclusiveLock is held.  An ExclusiveLock is generally assumed to prevent any
+ * kind of physical modification to the index structure, including
+ * modifications that VACUUM may make.
+ */
+static void
+bt_check_every_level(Relation rel, bool exclusivelock)
+{
+	BtreeCheckState	   *state;
+	Page				metapage;
+	BTMetaPageData	   *metad;
+	uint32				previouslevel;
+	BtreeLevel			current;
+
+	/*
+	 * RecentGlobalXmin assertion matches index_getnext_tid().  See note on
+	 * RecentGlobalXmin/B-Tree page deletion.
+	 */
+	Assert(TransactionIdIsValid(RecentGlobalXmin));
+
+	/*
+	 * Initialize state for entire verification operation
+	 */
+	state = palloc(sizeof(BtreeCheckState));
+	state->rel = rel;
+	state->exclusivelock = exclusivelock;
+	state->checkstrategy = GetAccessStrategy(BAS_BULKREAD);
+	/* Create context for page */
+	state->targetcontext = AllocSetContextCreate(CurrentMemoryContext,
+												 "amcheck page data",
+												 ALLOCSET_DEFAULT_MINSIZE,
+												 ALLOCSET_DEFAULT_INITSIZE,
+												 ALLOCSET_DEFAULT_MAXSIZE);
+
+	/* Get true root block from meta-page */
+	metapage = palloc_btree_page(state, BTREE_METAPAGE);
+	metad = BTPageGetMeta(metapage);
+
+	/*
+	 * Certain deletion patterns can result in "skinny" B-Tree indexes, where
+	 * the fast root and true root differ.
+	 *
+	 * Start from the true root, not the fast root, unlike conventional index
+	 * scans.  This approach is more thorough, and removes the risk of
+	 * following a stale fast root from the meta page.
+	 */
+	if (metad->btm_fastroot != metad->btm_root)
+		ereport(CONCERN,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				 errmsg("fast root mismatch in index %s",
+						RelationGetRelationName(rel)),
+				 errdetail_internal("Fast block %u (level %u) "
+									"differs from true root "
+									"block %u (level %u).",
+									metad->btm_fastroot, metad->btm_fastlevel,
+									metad->btm_root, metad->btm_level)));
+
+	/*
+	 * Starting at the root, verify every level.  Move left to right, top to
+	 * bottom.  Note that there may be no pages other than the meta page (meta
+	 * page can indicate that root is P_NONE when the index is totally empty).
+	 */
+	previouslevel = InvalidBtreeLevel;
+	current.level = metad->btm_level;
+	current.leftmost = metad->btm_root;
+	current.istruerootlevel =  true;
+	while (current.leftmost != P_NONE)
+	{
+		/*
+		 * Verify this level, and get left most page for next level down, if
+		 * not at leaf level
+		 */
+		current = bt_check_level_from_leftmost(state, current);
+
+		if (current.leftmost == InvalidBlockNumber)
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("index \"%s\" has no valid pages on level below %u or first level",
+							RelationGetRelationName(rel), previouslevel)));
+
+		previouslevel = current.level;
+	}
+
+	/* Be tidy: */
+	MemoryContextDelete(state->targetcontext);
+}
+
+/*
+ * bt_check_level_from_leftmost()
+ *
+ * Given a left-most block at some level, move right, verifying each page
+ * individually (with more verification across pages for "exclusivelock"
+ * callers).  Caller should pass the true root page as the leftmost initially,
+ * working their way down by passing what is returned for the last call here
+ * until level 0 (leaf page level) was reached.
+ *
+ * Returns state for next call, if any.  This includes left-most block number
+ * one level lower that should be passed on next level/call, or P_NONE leaf
+ * level is checked.  Level numbers follow the nbtree convention: higher levels
+ * have higher numbers, because new levels are added only due to a root page
+ * split.  Note that prior to the first root page split, the root is also a
+ * leaf page.  This means that there is always a level 0 (leaf level), and it's
+ * always the last level processed.
+ *
+ * Note on memory management:  State's per-page context is reset here, between
+ * each call to bt_target_page_check().
+ */
+static BtreeLevel
+bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
+{
+	/* State to establish early, concerning entire level */
+	BTPageOpaque	opaque;
+	MemoryContext 	oldcontext;
+	BtreeLevel		nextleveldown;
+
+	/* Variables for iterating across level using right links */
+	BlockNumber		leftcurrent = P_NONE;
+	BlockNumber		current = level.leftmost;
+
+	/* Initialize return state */
+	nextleveldown.leftmost = InvalidBlockNumber;
+	nextleveldown.level = InvalidBtreeLevel;
+	nextleveldown.istruerootlevel = false;
+
+	/* Use page-level context for duration of this call */
+	oldcontext = MemoryContextSwitchTo(state->targetcontext);
+
+	elog(POSITION, "verifying level %u%s", level.level,
+		 level.istruerootlevel?
+		 " (true root level)" : level.level == 0 ? " (leaf level)" : "");
+
+	do
+	{
+		/* Don't rely on CHECK_FOR_INTERRUPTS() calls at lower level */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Initialize state for this iteration */
+		state->targetblock = current;
+		state->target = palloc_btree_page(state, state->targetblock);
+		state->targetlsn = PageGetLSN(state->target);
+
+		opaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+
+		if (P_IGNORE(opaque))
+		{
+			if (P_RIGHTMOST(opaque))
+				ereport(CORRUPTION,
+						(errcode(ERRCODE_INDEX_CORRUPTED),
+						 errmsg("block %u fell off the end of index \"%s\"",
+								current, RelationGetRelationName(state->rel))));
+			else
+				ereport(CONCERN,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("block %u of index \"%s\" ignored",
+								current, RelationGetRelationName(state->rel))));
+			goto nextpage;
+		}
+		else if (nextleveldown.leftmost == InvalidBlockNumber)
+		{
+			/*
+			 * A concurrent page split could make the caller supplied leftmost
+			 * block no longer contain the leftmost page, or no longer be the
+			 * true root, but where that isn't possible due to heavyweight
+			 * locking, check that the first valid page meets caller's
+			 * expectations.
+			 */
+			if (state->exclusivelock)
+			{
+				if (!P_LEFTMOST(opaque))
+					ereport(CORRUPTION,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("block %u is not leftmost in index \"%s\"",
+									current, RelationGetRelationName(state->rel))));
+
+				if (level.istruerootlevel && !P_ISROOT(opaque))
+					ereport(CORRUPTION,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("block %u is not true root in index \"%s\"",
+									current, RelationGetRelationName(state->rel))));
+			}
+
+			/*
+			 * Before beginning any non-trivial examination of level, establish
+			 * next level down's leftmost block number, which next call here
+			 * will pass as its leftmost (iff this isn't leaf level).
+			 *
+			 * There should be at least one non-ignorable page per level.
+			 */
+			if (!P_ISLEAF(opaque))
+			{
+				IndexTuple		itup;
+				ItemId			itemid;
+
+				/* Internal page -- downlink gets leftmost on next level */
+				itemid = PageGetItemId(state->target, P_FIRSTDATAKEY(opaque));
+				itup = (IndexTuple) PageGetItem(state->target, itemid);
+				nextleveldown.leftmost = ItemPointerGetBlockNumber(&(itup->t_tid));
+				nextleveldown.level = opaque->btpo.level - 1;
+			}
+			else
+			{
+				/*
+				 * Leaf page -- final level caller must process.
+				 *
+				 * Note that this could also be the root page, if there has
+				 * been no root page split yet.
+				 */
+				nextleveldown.leftmost = P_NONE;
+				nextleveldown.level = InvalidBtreeLevel;
+			}
+
+			/*
+			 * Finished setting up state for this call/level.  Control will
+			 * never end up back here in any future loop iteration for this
+			 * level.
+			 */
+		}
+
+		if (state->exclusivelock && opaque->btpo_prev != leftcurrent)
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("right link/left link pair in index \"%s\" not in mutual agreement",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Deleted block=%u left block=%u link reported block=%u.",
+										current, leftcurrent, opaque->btpo_prev)));
+
+		/* Verify invariants for page -- all important checks occur here */
+		bt_target_page_check(state);
+
+nextpage:
+
+		/* Try to detect circular links */
+		if (current == leftcurrent || current == opaque->btpo_prev)
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("circular link chain found in block %u of index \"%s\"",
+							current, RelationGetRelationName(state->rel))));
+
+		leftcurrent = current;
+		current = opaque->btpo_next;
+
+		/* Free page and associated memory for this iteration */
+		MemoryContextReset(state->targetcontext);
+	}
+	while (current != P_NONE);
+
+	/* Don't change context for caller */
+	MemoryContextSwitchTo(oldcontext);
+
+	return nextleveldown;
+}
+
+/*
+ * bt_target_page_check()
+ *
+ * Function performs the following checks on target page, or pages ancillary to
+ * target page:
+ *
+ * - That every "real" data item is less than or equal to the high key, which
+ *   is an upper bound on the items on the pages (where there is a high key at
+ *   all -- pages that are rightmost lack one).
+ *
+ * - That within the page, every "real" item is less than or equal to the item
+ *   immediately to its right, if any (i.e., that the items are in order within
+ *   the page, so that the binary searches performed by index scans are sane).
+ *
+ * - That the last item stored on the page is less than or equal to the first
+ *   "real" data item on the page to the right (if such a first item is
+ *   available).
+ *
+ * Furthermore, when state passed shows ExclusiveLock held, function also
+ * checks:
+ *
+ * - That all child pages respect downlinks lower bound (internal pages only).
+ *
+ * Note:  This routine is not especially proactive in freeing memory.  High
+ * watermark memory consumption is bound to some small fixed multiple of
+ * BLCKSZ, though.  Caller should reset the current context between calls here.
+ */
+static void
+bt_target_page_check(BtreeCheckState *state)
+{
+	OffsetNumber	offset;
+	OffsetNumber	max;
+	BTPageOpaque	topaque;
+
+	topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+	max = PageGetMaxOffsetNumber(state->target);
+
+	elog(POSITION, "verifying %u items on %s block %u", max,
+		 P_ISLEAF(topaque) ? "leaf":"internal", state->targetblock);
+
+	/*
+	 * Loop over page items, but don't start from P_HIKEY (don't have iteration
+	 * directly considering high key item, if any).  That's something that is
+	 * used as part of verifying all other items, but doesn't get its own
+	 * iteration.
+	 */
+	for (offset = P_FIRSTDATAKEY(topaque);
+		 offset <= max;
+		 offset = OffsetNumberNext(offset))
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+		ScanKey		skey;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Don't try to generate scankey using "minus infinity" garbage data */
+		if (OFFSET_IS_MINUS_INFINITY(topaque, offset))
+			continue;
+
+		/* Build insertion scankey for current page offset */
+		itemid = PageGetItemId(state->target, offset);
+		itup = (IndexTuple) PageGetItem(state->target, itemid);
+		skey = _bt_mkscankey(state->rel, itup);
+
+		/*
+		 *   ********************
+		 *   * High key check   *
+		 *   ********************
+		 *
+		 * If there is a high key, which there must be for a non-rightmost
+		 * page, check that it actually is upper bound on all page items.
+		 *
+		 * We prefer to check all items, rather than checking just the first
+		 * and trusting that the operator class obeys the transitive law (which
+		 * implies that all subsequent items also respected the high key
+		 * invariant if they pass the page order check).
+		 *
+		 * Ideally, we'd compare every item in the index against every other
+		 * item in the index, and not trust opclass obedience of the transitive
+		 * law to bridge the gap between children and their grandparents (as
+		 * well as great-grandparents, and so on).  We don't go to those
+		 * lengths because that would be prohibitively expensive, and probably
+		 * not markedly more effective in practice.
+		 */
+		if (!P_RIGHTMOST(topaque) &&
+			!invariant_key_less_than_equal_offset(state, skey, P_HIKEY))
+		{
+			char	   *itid,  *htid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumber(&(itup->t_tid)),
+							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("high key invariant violated for index \"%s\"",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Index tid=%s points to %s tid=%s "
+										"page lsn=%X/%X.",
+										itid,
+										P_ISLEAF(topaque) ? "heap":"index",
+										htid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
+
+		/*
+		 *   ********************
+		 *   * Page order check *
+		 *   ********************
+		 *
+		 * Check that items are stored on page in logical order, by checking
+		 * current item is less than or equal to next item (if any).
+		 */
+		if (OffsetNumberNext(offset) <= max &&
+			!invariant_key_less_than_equal_offset(state, skey,
+												  OffsetNumberNext(offset)))
+		{
+			char	   *itid,  *htid,
+					   *nitid, *nhtid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumber(&(itup->t_tid)),
+							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+			nitid = psprintf("(%u,%u)", state->targetblock,
+							 OffsetNumberNext(offset));
+
+			/* Reuse itup to get pointed-to heap location of second item */
+			itemid = PageGetItemId(state->target, OffsetNumberNext(offset));
+			itup = (IndexTuple) PageGetItem(state->target, itemid);
+			nhtid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumber(&(itup->t_tid)),
+							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("page order invariant violated for index \"%s\"",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Lower index tid=%s (points to %s tid=%s) "
+										"higher index tid=%s (points to %s tid=%s) "
+										"page lsn=%X/%X.",
+										itid,
+										P_ISLEAF(topaque) ? "heap":"index",
+										htid,
+										nitid,
+										P_ISLEAF(topaque) ? "heap":"index",
+										nhtid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
+		/*
+		 *   ********************
+		 *   * Last item check  *
+		 *   ********************
+		 *
+		 * Check last item against next/right page's first data item's when
+		 * last item on page is reached.
+		 *
+		 * The general idea here is that checking the ordering of items on the
+		 * page should still perform some check on the last item on the page,
+		 * if at all possible.  In other words, this is roughly the same
+		 * process as the page order check that has already been performed for
+		 * every other "real" item on target page by now; we just need to reach
+		 * into the next page to get a scankey to compare against lower bound
+		 * of max.
+		 */
+		else if (offset == max)
+		{
+			ScanKey		rightkey;
+
+			/* Get item in next/right page */
+			rightkey = bt_right_page_check_scankey(state);
+
+			if (rightkey &&
+				!invariant_key_greater_than_equal_offset(state, rightkey, max))
+			{
+				/*
+				 * As discussed in bt_right_page_check_scankey(),
+				 * non-ExclusiveLock case might have had target page deleted,
+				 * in which case no error is raised
+				 */
+				if (!state->exclusivelock)
+				{
+					/* Get fresh copy of target page */
+					state->target = palloc_btree_page(state, state->targetblock);
+					/* Note that we deliberately do not update target LSN */
+					topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+
+					/*
+					 * Because of RecentGlobalXmin interlock against VACUUM's
+					 * recycling of blocks, we can safely assume that this is
+					 * substantively the same target page as before.
+					 *
+					 * Just return, because all !exclusivelock checks already
+					 * performed against target.
+					 */
+					if (P_IGNORE(topaque))
+						return;
+				}
+				ereport(CORRUPTION,
+						(errcode(ERRCODE_INDEX_CORRUPTED),
+						 errmsg("cross page order invariant violated for index \"%s\"",
+								RelationGetRelationName(state->rel)),
+						 errdetail_internal("Last item on page tid=(%u,%u) "
+											"page lsn=%X/%X.",
+											state->targetblock, offset,
+											(uint32) (state->targetlsn >> 32),
+											(uint32) state->targetlsn)));
+			}
+		}
+
+		/*
+		 *   ********************
+		 *   * Downlink check   *
+		 *   ********************
+		 *
+		 * Additional check of child items against target page (their parent),
+		 * iff this is internal page and caller holds ExclusiveLock on index
+		 * relation.  This involves a pass over each child page at the end of
+		 * each iteration (excluding the minus infinity iteration which
+		 * internal pages always have, which is immediately skipped).
+		 */
+		if (!P_ISLEAF(topaque) && state->exclusivelock)
+		{
+			BlockNumber childblock = ItemPointerGetBlockNumber(&(itup->t_tid));
+
+			bt_downlink_check(state, childblock, skey);
+		}
+	}
+}
+
+/*
+ * bt_right_page_check_scankey()
+ *
+ * Return a scankey for an item on page to right of current target (or the
+ * first non-ignorable page), sufficient to check ordering invariant on last
+ * item in current target page.  Returned scankey relies on local memory
+ * allocated for the child page, which caller cannot pfree().  Caller's memory
+ * context should be reset between calls here.
+ *
+ * This is the first data item, and so all adjacent items are checked against
+ * their immediate sibling item (which may be on a sibling page, or even a
+ * "cousin" page at parent boundaries where target's rightlink points to page
+ * with different parent page).  If no such valid item is available, return
+ * NULL instead.
+ *
+ * Note that !exclusivelock callers must reverify that target page has not been
+ * concurrently deleted.
+ */
+static ScanKey
+bt_right_page_check_scankey(BtreeCheckState *state)
+{
+	BTPageOpaque	opaque;
+	ItemId			rightitem;
+	BlockNumber		targetnext;
+	Page			rightpage;
+	OffsetNumber	nline;
+
+	/* Determine target's next block number */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+
+	/* If target is already rightmost, no right sibling; nothing to do here */
+	if (P_RIGHTMOST(opaque))
+		return NULL;
+
+	/*
+	 * General notes on concurrent page splits and page deletion:
+	 *
+	 * Concurrent page splits are not a problem for ordinary index scans, since
+	 * the key space always moves in a way that lets index scans not miss
+	 * things:  they might have to move right, but they never have to move left
+	 * (leaving aside index scans backwards, a special case).  A concurrent
+	 * page split could occur here, but just as with index scans we're
+	 * following the stale right link, which will reliably get us further along
+	 * in the key space, which is all we really need to get an item further
+	 * along in key space to check invariant in target page.
+	 *
+	 * (Note that routines like _bt_search() don't require *any* page split
+	 * interlock when descending the tree, including something very light like
+	 * a buffer pin.  That's why it's okay that we don't either.)
+	 *
+	 * A deleted page won't actually be recycled by VACUUM early enough for us
+	 * to fail to be able follow its right link (or left link, or downlink),
+	 * because it doesn't do so until it knows that no possible index scan
+	 * could land on the page with the expectation of at least being able to
+	 * move right and eventually find a non-ignorable page.  (see page
+	 * recycling/RecentGlobalXmin notes in nbtree README.)
+	 *
+	 * It's okay if we follow a rightlink and find a half-dead or dead
+	 * (ignorable) page.  Either way, there must be a sane further right link
+	 * to follow for these ignorable pages, because page deletion refuses to
+	 * merge the key space between adjacent pages that do not share a common
+	 * parent (that is, merging of the key space has to be among true sibling
+	 * pages, never cousin pages).  We should succeed in finding a page to the
+	 * right that isn't ignorable before too long.
+	 */
+	targetnext = opaque->btpo_next;
+	for (;;)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		rightpage = palloc_btree_page(state, targetnext);
+		opaque = (BTPageOpaque) PageGetSpecialPointer(rightpage);
+
+		if (!P_IGNORE(opaque) || P_RIGHTMOST(opaque))
+			break;
+
+		/* We landed on a deleted page, so step right to find a live page */
+		targetnext = opaque->btpo_next;
+		ereport(CONCERN,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("level %u leftmost page of index \"%s\" was found deleted or half dead",
+						opaque->btpo.level, RelationGetRelationName(state->rel)),
+				 errdetail_internal("Deleted page found when building scankey from right sibling.")));
+
+		/* Be slightly more pro-active in freeing this memory, just in case */
+		pfree(rightpage);
+	}
+
+	/*
+	 * No ExclusiveLock held case -- why it's safe to proceed.
+	 *
+	 * Problem:
+	 *
+	 * We must avoid false positive reports of corruption when caller treats
+	 * item returned here as an upper bound on target's last item.  In general,
+	 * false positives are disallowed.  Ensuring they don't happen in
+	 * !exclusivelock case is subtle.
+	 *
+	 * A concurrent page deletion by VACUUM of the target page can result in
+	 * the insertion of items on to this right sibling page that would
+	 * previously have been inserted on our target page.  There might have been
+	 * insertions that followed target's downlink after it was made to point to
+	 * right sibling instead of target by page deletion's first phase.  The
+	 * inserters insert items that would belong on target page.  This race is
+	 * very tight, but it's possible.  This is our only problem.
+	 *
+	 * Non-problems:
+	 *
+	 * We are not hindered by a concurrent page split of the target; we'll
+	 * never land on the second half of the page anyway.  A concurrent split of
+	 * the right page will also not matter, because the first data item remains
+	 * the same within the left half, which we'll reliably land on.  If we had
+	 * to skip over ignorable/deleted pages, it cannot matter because their key
+	 * space has already been atomically merged with the first non-ignorable
+	 * page we eventually find (doesn't matter whether the page we eventually
+	 * find is a true sibling or a cousin of target, which we go into below).
+	 *
+	 * Solution:
+	 *
+	 * Caller knows that it should reverify that target is not ignorable
+	 * (half-dead or deleted) when cross-page sibling item comparison appears
+	 * to indicate corruption (invariant fails).  This detects the single race
+	 * condition that exists for caller.  This is correct because the continued
+	 * existence of target block as non-ignorable (not half-dead or deleted)
+	 * implies that target page was not merged into from the right by deletion;
+	 * the key space at or after target never moved left.  Target's parent
+	 * either has the same downlink to target as before, or a <= downlink due
+	 * to deletion at the left of target.  Target either has the same highkey
+	 * as before, or a highkey <= before when there is a page split. (The
+	 * rightmost concurrently-split-from-target-page page will still have the
+	 * same highkey as target was originally found to have, which for our
+	 * purposes is equivalent to target's highkey itself never changing, since
+	 * we reliably skip over concurrently-split-from-target-page pages.)
+	 *
+	 * In simpler terms, we allow that the key space of the target may expand
+	 * left (the key space can move left on the left side of target only), but
+	 * the target key space cannot expand right and get ahead of us without our
+	 * detecting it.  The key space of the target cannot shrink, unless it
+	 * shrinks to zero due to the deletion of the original page, our canary
+	 * condition.  (To be very precise, we're a bit stricter than that because
+	 * it might just have been that the target page split and only the original
+	 * target page was deleted.  We can be more strict, just not more lax.)
+	 *
+	 * Top level tree walk caller moves on to next page (makes it the new
+	 * target) following recovery from this race.  (cf.  The rationale for
+	 * child/downlink verification needing an ExclusiveLock within
+	 * bt_downlink_check(), where page deletion is also the main source of
+	 * trouble.)
+	 *
+	 * Note that it doesn't matter if right sibling page here is actually a
+	 * cousin page, because in order for the key space to be readjusted in a
+	 * way that causes us issues in next level up (guiding problematic
+	 * concurrent insertions to the cousin from the grandparent rather than to
+	 * the sibling from the parent), there'd have to be page deletion of
+	 * target's parent page (affecting target's parent's downlink in target's
+	 * grandparent page).  Internal page deletion only occurs when there are no
+	 * child pages (they were all fully deleted), and caller is checking that
+	 * the target's parent has at least one non-deleted (so non-ignorable)
+	 * child: the target page.  (Note that the first phase of deletion
+	 * atomically marks the page to be deleted half-dead/ignorable at the same
+	 * time downlink in its parent is removed, so we'll definitely be able to
+	 * detect that this might have happened just from the target page.)
+	 *
+	 * This trick is inspired by the method backward scans use for dealing with
+	 * concurrent page splits; concurrent page deletion is a problem that
+	 * similarly receives special consideration sometimes (it's possible that
+	 * the backwards scan will re-read its "original" block after failing to
+	 * find a right-link to it, having already moved in the opposite direction
+	 * (right/"forwards") a few times to try to locate one).  Just like us,
+	 * that happens only to determine if there was a concurrent page deletion
+	 * of a reference page, and just like us if there was a page deletion of
+	 * that reference page it means we can move on from caring about the
+	 * reference page.  See the nbtree README for a full description of how
+	 * that works.
+	 */
+	nline = PageGetMaxOffsetNumber(rightpage);
+
+	/*
+	 * Get first data item.
+	 *
+	 * Importantly, this allows the verification of page order across target
+	 * and rightmost page when rightmost page is the target's right sibling.
+	 * Moreover, the only way logical inconsistencies can really be missed
+	 * across a given level is if the transitive law is broken by an opclass,
+	 * because contiguous pairs are always compared, even across page
+	 * boundaries. (Or, we don't have a stable snapshot of structure, as in
+	 * !exclusivelock case -- it can theoretically "just miss" corruption.)
+	 */
+	if (P_ISLEAF(opaque) && nline >= P_FIRSTDATAKEY(opaque))
+	{
+		/*
+		 * For leaf page, return first data item (if any), which cannot be a
+		 * "minus infinity" item
+		 */
+		rightitem = PageGetItemId(rightpage, P_FIRSTDATAKEY(opaque));
+	}
+	else if (!P_ISLEAF(opaque) &&
+			 nline >= OffsetNumberNext(P_FIRSTDATAKEY(opaque)))
+	{
+		/*
+		 * Return first item after the internal page's undefined "minus
+		 * infinity" item, if any.  Also skip the high key if that's necessary.
+		 */
+		rightitem = PageGetItemId(rightpage,
+								  OffsetNumberNext(P_FIRSTDATAKEY(opaque)));
+	}
+	else
+	{
+		/*
+		 * No first item.  Page is probably empty leaf page, but it's also
+		 * possible that it's an internal page with only a minus infinity item.
+		 */
+		ereport(CONCERN,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("%s block %u of index \"%s\" has no first data item",
+						P_ISLEAF(opaque) ? "leaf":"internal", targetnext,
+						RelationGetRelationName(state->rel))));
+		return NULL;
+	}
+
+	/*
+	 * Return first real item scankey.  Note that this relies on right page
+	 * memory remaining allocated.
+	 */
+	return _bt_mkscankey(state->rel,
+						 (IndexTuple) PageGetItem(rightpage, rightitem));
+}
+
+/*
+ * bt_downlink_check()
+ *
+ * Checks one of target's downlink against its child page.
+ *
+ * Conceptually, the target page continues to be what is checked here.  The
+ * target block is still blamed in the event of finding an invariant violation.
+ * The downlink insertion into the target is probably where any problem raised
+ * here arises, and there is no such thing as a parent link, so doing the
+ * verification this way around is much more practical.
+ */
+static void
+bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
+				  ScanKey targetkey)
+{
+	OffsetNumber	offset;
+	OffsetNumber	maxoffset;
+	Page			child;
+	BTPageOpaque	copaque;
+
+	/*
+	 * Caller must have ExclusiveLock on target relation, because of
+	 * considerations around page deletion by VACUUM.
+	 *
+	 * N.B.: In general, page deletion deletes the right sibling's downlink,
+	 * not the downlink of the page being deleted; the deleted page's downlink
+	 * is reused for its sibling.  The key space is thereby consolidated
+	 * between the deleted page and its right sibling.  (We cannot delete a
+	 * parent page's rightmost page unless it is the last child page, and we
+	 * intend to delete the parent itself.)
+	 *
+	 * If this verification happened without an ExclusiveLock, the following
+	 * race condition could cause false positives (which are generally
+	 * disallowed):
+	 *
+	 * Not having an ExclusiveLock would allow concurrent page deletion,
+	 * including deletion of the left sibling of the child page that is
+	 * examined here.  If such a page deletion occurred, and was then closely
+	 * followed by an insertion into the newly expanded key space of the child,
+	 * a false positive may result: our stale parent/target downlink would
+	 * legitimately not be a lower bound on all items in the page anymore,
+	 * because the key space was concurrently expanded "left" (insertion
+	 * followed the "new" downlink for the child, not our now-stale downlink,
+	 * which was concurrently physically removed in target/parent as part of
+	 * deletion's first phase).
+	 *
+	 * Note that while the cross-page-same-level check uses a trick that allows
+	 * it to perform verification for !exclusivelock callers, an analogous
+	 * trick seems very difficult here.  The trick that that other check uses
+	 * is, in essence, to lock down race conditions to those that occur due to
+	 * concurrent page deletion of the target; that's a race that can be
+	 * reliably detected before actually reporting corruption.  On the other
+	 * hand, we'd need to lock down race conditions involving deletion of
+	 * child's left page, at least for long enough to read the child page into
+	 * memory.  Any more granular locking schemes all seem to involve multiple
+	 * concurrently held buffer locks.  That's unacceptable for amcheck on
+	 * general principle, though; amcheck functions never hold more than one
+	 * buffer lock at a time.
+	 */
+	Assert(state->exclusivelock);
+
+	/*
+	 * Verify child page has the down-link key from target page (its parent) as
+	 * a lower bound.
+	 *
+	 * We prefer to check all items, rather than checking just the first and
+	 * trusting that the operator class obeys the transitive law (which implies
+	 * that all subsequent items also respected the downlink-as-lower-bound
+	 * invariant if they pass the page order check when the child later becomes
+	 * our target for verification).
+	 */
+	child = palloc_btree_page(state, childblock);
+	copaque = (BTPageOpaque) PageGetSpecialPointer(child);
+	maxoffset = PageGetMaxOffsetNumber(child);
+
+	for (offset = P_FIRSTDATAKEY(copaque);
+		 offset <= maxoffset;
+		 offset = OffsetNumberNext(offset))
+	{
+		/*
+		 * Skip comparison of target page key against "minus infinity" item, if
+		 * any.  Checking it would indicate that its not an upper bound, but
+		 * that's only because of the hard-coding within _bt_compare().
+		 */
+		if (OFFSET_IS_MINUS_INFINITY(copaque, offset))
+			continue;
+
+		if (!invariant_key_less_than_equal_nontarget_offset(state, child,
+															targetkey, offset))
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("down-link lower bound invariant violated for index \"%s\"",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Parent block=%u child index tid=(%u,%u) "
+										"parent page lsn=%X/%X.",
+										state->targetblock, childblock, offset,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+	}
+
+	pfree(child);
+}
+
+/*
+ * invariant_key_less_than_equal_offset()
+ *
+ * Does the invariant hold that the key is less than or equal to a given upper
+ * bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static bool
+invariant_key_less_than_equal_offset(BtreeCheckState *state, ScanKey key,
+									 OffsetNumber upperbound)
+{
+	int16		natts = state->rel->rd_rel->relnatts;
+	int32		cmp;
+
+	cmp = _bt_compare(state->rel, natts, key, state->target, upperbound);
+
+	return cmp <= 0;
+}
+
+/*
+ * invariant_key_greater_than_equal_offset()
+ *
+ * Does the invariant hold that the key is greater than or equal to a given
+ * lower bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static bool
+invariant_key_greater_than_equal_offset(BtreeCheckState *state, ScanKey key,
+										OffsetNumber lowerbound)
+{
+	int16		natts = state->rel->rd_rel->relnatts;
+	int32		cmp;
+
+	cmp = _bt_compare(state->rel, natts, key, state->target, lowerbound);
+
+	return cmp >= 0;
+}
+
+/*
+ * invariant_key_less_than_equal_nontarget_offset()
+ *
+ * Does the invariant hold that the key is less than or equal to a given upper
+ * bound offset item, with the offset relating to a caller-supplied page that
+ * is not the current target page? Caller's non-target page is typically a
+ * child page of the target, checked as part of checking a property of the
+ * target page (i.e. the key comes from the target).
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static bool
+invariant_key_less_than_equal_nontarget_offset(BtreeCheckState *state,
+											   Page nontarget, ScanKey key,
+											   OffsetNumber upperbound)
+{
+	int16		natts = state->rel->rd_rel->relnatts;
+	int32		cmp;
+
+	cmp = _bt_compare(state->rel, natts, key, nontarget, upperbound);
+
+	return cmp <= 0;
+}
+
+/*
+ * palloc_btree_page()
+ *
+ * Given a block number of a B-Tree page, return page in palloc()'d memory.
+ * While at it, perform some basic checks of the page.
+ *
+ * There is never an attempt to get a consistent view of multiple pages using
+ * multiple concurrent buffer locks; in general, we prefer to have only one pin
+ * and buffer lock at a time, which is often all that the nbtree code requires.
+ *
+ * Operating on a copy of the page is useful because it prevents control
+ * getting stuck in an uninterruptible state when an underlying operator class
+ * misbehaves.
+ */
+static Page
+palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
+{
+	Buffer			buffer;
+	Page			page;
+	BTPageOpaque	opaque;
+
+	page = palloc(BLCKSZ);
+
+	/*
+	 * We copy the page into local storage to avoid holding pin on the buffer
+	 * longer than we must.
+	 */
+	buffer = ReadBufferExtended(state->rel, MAIN_FORKNUM, blocknum, RBM_NORMAL,
+								state->checkstrategy);
+	LockBuffer(buffer, BT_READ);
+
+	/*
+	 * Perform the same basic sanity checking that nbtree itself performs for
+	 * every page:
+	 */
+	_bt_checkpage(state->rel, buffer);
+
+	/* Only use copy of page in palloc()'d memory */
+	memcpy(page, BufferGetPage(buffer), BLCKSZ);
+	UnlockReleaseBuffer(buffer);
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	if (opaque->btpo_flags & BTP_META && blocknum != BTREE_METAPAGE)
+		ereport(CORRUPTION,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("invalid meta page found at block %u in index \"%s\"",
+						blocknum, RelationGetRelationName(state->rel))));
+
+	/* Check page from block that ought to be meta page */
+	if (blocknum == BTREE_METAPAGE)
+	{
+		BTMetaPageData *metad = BTPageGetMeta(page);
+
+		if (!(opaque->btpo_flags & BTP_META) ||
+			metad->btm_magic != BTREE_MAGIC)
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("index \"%s\" meta page is corrupt",
+							RelationGetRelationName(state->rel))));
+
+		if (metad->btm_version != BTREE_VERSION)
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("version mismatch in index \"%s\": file version %d, code version %d",
+							RelationGetRelationName(state->rel),
+							metad->btm_version, BTREE_VERSION)));
+	}
+
+	/*
+	 * Deleted pages have no sane "level" field, so can only check non-deleted
+	 * page level
+	 */
+	if (P_ISLEAF(opaque) && !P_ISDELETED(opaque) && opaque->btpo.level != 0)
+		ereport(CORRUPTION,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("invalid leaf page level %u for block %u in index \"%s\"",
+						opaque->btpo.level, blocknum, RelationGetRelationName(state->rel))));
+
+	if (blocknum != BTREE_METAPAGE && !P_ISLEAF(opaque) &&
+		!P_ISDELETED(opaque) && opaque->btpo.level == 0)
+		ereport(CORRUPTION,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("invalid internal page level 0 for block %u in index \"%s\"",
+						opaque->btpo.level, RelationGetRelationName(state->rel))));
+
+	if (!P_ISLEAF(opaque) && P_HAS_GARBAGE(opaque))
+		ereport(CORRUPTION,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("internal page block %u in index \"%s\" has garbage items",
+						blocknum, RelationGetRelationName(state->rel))));
+
+	return page;
+}
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
new file mode 100644
index 0000000..180f150
--- /dev/null
+++ b/contrib/amcheck/amcheck.control
@@ -0,0 +1,5 @@
+# amcheck extension
+comment = 'verify access method invariants'
+default_version = '1.0'
+module_pathname = '$libdir/amcheck'
+relocatable = true
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
new file mode 100644
index 0000000..d3d6a96
--- /dev/null
+++ b/doc/src/sgml/amcheck.sgml
@@ -0,0 +1,278 @@
+<!-- doc/src/sgml/amcheck.sgml -->
+
+<sect1 id="amcheck" xreflabel="amcheck">
+ <title>amcheck</title>
+
+ <indexterm zone="amcheck">
+  <primary>amcheck</primary>
+ </indexterm>
+
+ <para>
+  The <filename>amcheck</> module provides functions that allow you to
+  verify the logical consistency of the structure of indexes.  If the
+  structure appears to be valid, no error is raised.
+ </para>
+
+ <para>
+  The functions verify various <emphasis>invariants</> in the
+  structure of the representation of particular indexes.  The
+  correctness of the access method functions behind index scans and
+  other important operations relies on these invariants always
+  holding.  For example, certain functions verify, among other things,
+  that all B-Tree pages have items in <quote>logical</> order (e.g.,
+  for B-Tree indexes on <type>text</>, index tuples should be in
+  collated lexical order).  If that particular invariant somehow fails
+  to hold, we can expect binary searches on the affected page to
+  incorrectly guide index scans, resulting in wrong answers to SQL
+  queries.  Problems like this can be very subtle.
+ </para>
+ <para>
+  Verification is performed using the same procedures as those used by
+  index scans themselves, which may be user-defined operator class
+  code.  For example, B-Tree index verification relies on comparisons
+  made with one or more B-Tree support function 1 routines.  See <xref
+  linkend="xindex-support"> for details of operator class support
+  functions.
+ </para>
+ <para>
+  <filename>amcheck</> functions may be used only by superusers.
+ </para>
+
+ <sect2>
+  <title>Functions</title>
+
+  <variablelist>
+   <varlistentry>
+    <term>
+     <function>bt_index_check(index regclass) returns void</function>
+     <indexterm>
+      <primary>bt_index_check</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <function>bt_index_check</function> tests that its target, a
+      B-Tree index, respects a variety of invariants.  Example usage:
+<screen>
+test=# SELECT bt_index_check(c.oid), c.relname, c.relpages
+FROM pg_index i
+JOIN pg_opclass op ON i.indclass[0] = op.oid
+JOIN pg_am am ON op.opcmethod = am.oid
+JOIN pg_class c ON i.indexrelid = c.oid
+JOIN pg_namespace n ON c.relnamespace = n.oid
+WHERE am.amname = 'btree' AND n.nspname = 'pg_catalog'
+-- Don't check pg_class (bt_index_parent_check() requires this):
+AND c.relname NOT LIKE 'pg_class%'
+-- Don't check temp tables, which may be from another session:
+AND c.relpersistence != 't'
+-- Function may throw an error when this is omitted:
+AND i.indisready AND i.indisvalid
+ORDER BY c.relpages DESC LIMIT 10;
+ bt_index_check |             relname             | relpages 
+----------------+---------------------------------+----------
+                | pg_depend_reference_index       |       43
+                | pg_depend_depender_index        |       40
+                | pg_proc_proname_args_nsp_index  |       31
+                | pg_description_o_c_o_index      |       21
+                | pg_attribute_relid_attnam_index |       14
+                | pg_proc_oid_index               |       10
+                | pg_attribute_relid_attnum_index |        9
+                | pg_amproc_fam_proc_index        |        5
+                | pg_amop_opr_fam_index           |        5
+                | pg_amop_fam_strat_index         |        5
+(10 rows)
+</screen>
+      This example shows a session that performs verification of every
+      catalog index in the database <quote>test</> (except those
+      associated with the <structname>pg_class</structname> catalog).
+      Details of just the 10 largest indexes verified are displayed.
+      Since no error is raised, all indexes tested appear to be
+      logically consistent.  Naturally, this query could easily be
+      changed to call <function>bt_index_check</function> for every
+      index in the database where verification is supported.
+     </para>
+     <para>
+      An <literal>AccessShareLock</> is acquired on the target index
+      by <function>bt_index_check</function>.  This lock mode is the
+      same lock mode acquired on relations by simple
+      <literal>SELECT</> statements.
+      <function>bt_index_check</function> does not verify invariants
+      that span child/parent relationships, nor does it verify that
+      the target index is consistent with its heap relation.  When a
+      routine, lightweight test for corruption is required in a live
+      production environment, using
+      <function>bt_index_check</function> often provides the best
+      trade-off between thoroughness of verification and limiting the
+      impact on application performance and availability.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
+     <function>bt_index_parent_check(index regclass) returns void</function>
+     <indexterm>
+      <primary>bt_index_parent_check</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <function>bt_index_parent_check</function> tests that its
+      target, a B-Tree index, respects a variety of invariants.  The
+      checks performed by <function>bt_index_parent_check</function>
+      are a superset of the checks performed by
+      <function>bt_index_check</function>.
+      <function>bt_index_parent_check</function> can be thought of as
+      a more thorough variant of <function>bt_index_check</function>:
+      unlike <function>bt_index_check</function>,
+      <function>bt_index_parent_check</function> also checks
+      invariants that span parent/child relationships.  However, it
+      does not verify that the target index is consistent with its
+      heap relation.  <function>bt_index_parent_check</function>
+      follows the general convention of raising an error if it finds a
+      logical inconsistency or other problem.
+     </para>
+     <para>
+      An <literal>ExclusiveLock</> is required on the target index by
+      <function>bt_index_parent_check</function> (a
+      <literal>ShareLock</> is also acquired on the heap relation).
+      These locks prevent concurrent data modification from
+      <command>INSERT</>, <command>UPDATE</>, and <command>DELETE</>
+      commands.  The locks also prevent the underlying relation from
+      being concurrently processed by <command>VACUUM</> (and certain
+      other utility commands).  Note that the function holds locks for
+      as short a duration as possible, so there is no advantage to
+      verifying each index individually in a series of transactions,
+      unless long running queries happen to be of particular concern.
+     </para>
+     <para>
+      <function>bt_index_parent_check</function>'s additional
+      verification is more likely to detect various pathological
+      cases.  These cases may involve an incorrectly implemented
+      B-Tree operator class used by the index that is checked, or,
+      hypothetically, undiscovered bugs in the underlying B-Tree index
+      access method code.  Note that
+      <function>bt_index_parent_check</function> cannot be used when
+      Hot Standby mode is enabled (i.e., on read-only physical
+      replicas), unlike <function>bt_index_check</function>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </sect2>
+
+ <sect2>
+  <title>Using <filename>amcheck</> effectively</title>
+
+ <para>
+  <filename>amcheck</> can be effective at detecting various types of
+  failure modes that <link
+  linkend="app-initdb-data-checksums"><application>data page
+  checksums</></link> will always fail to catch.  These include:
+
+  <itemizedlist>
+   <listitem>
+    <para>
+     Structural inconsistencies caused by incorrect operator class
+     implementations.
+    </para>
+    <para>
+     This includes issues caused by the comparison rules of operating
+     system collations changing. Comparisons of datums of a collatable
+     type like <type>text</> must be immutable (just as all
+     comparisons used for B-Tree index scans must be immutable), which
+     implies that operating system collation rules must never change.
+     Though rare, updates to operating system collation rules can
+     cause these issues. More commonly, an inconsistency in the
+     collation order between a master server and a standby server is
+     implicated, possibly because the <emphasis>major</> operating
+     system version in use is inconsistent.  Such inconsistencies will
+     generally only arise on standby servers, and so can generally
+     only be detected on standby servers.
+    </para>
+    <para>
+     If a problem like this arises, it may not affect each individual
+     index that is ordered using an affected collation, simply because
+     <emphasis>indexed</> values might happen to have the same
+     absolute ordering regardless of the behavioral inconsistency. See
+     <xref linkend="locale"> and <xref linkend="collation"> for
+     further details about how <productname>PostgreSQL</> uses
+     operating system locales and collations.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     Corruption caused by hypothetical undiscovered bugs in the
+     underlying <productname>PostgreSQL</> access method code or sort
+     code.
+    </para>
+    <para>
+     Automatic verification of the structural integrity of indexes
+     plays a role in the general testing of new or proposed
+     <productname>PostgreSQL</> features that could plausibly allow a
+     logical inconsistency to be introduced.  One obvious testing
+     strategy is to call <filename>amcheck</> functions continuously
+     when running the standard regression tests.  See <xref
+     linkend="regress-run"> for details on running the tests.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     Filesystem or storage subsystem faults where checksums happen to
+     simply not be enabled.
+    </para>
+    <para>
+     Note that <filename>amcheck</> examines a page as represented in some
+     shared memory buffer at the time of verification if there is only a
+     shared buffer hit when accessing the block. Consequently,
+     <filename>amcheck</> does not necessarily examine data read from the
+     filesystem at the time of verification. Note that when checksums are
+     enabled, <filename>amcheck</> may raise an error due to a checksum
+     failure when a corrupt block is read into a buffer.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     Corruption caused by faulty RAM, and the broader memory subsystem
+     and operating system.
+    </para>
+    <para>
+     <productname>PostgreSQL</> does not protect against correctable
+     memory errors and it is assumed you will operate using RAM that
+     uses industry standard Error Correcting Codes (ECC) or better
+     protection.  However, ECC memory is typically only immune to
+     single-bit errors, and should not be assumed to provide
+     <emphasis>absolute</emphasis> protection against failures that
+     result in memory corruption.
+    </para>
+   </listitem>
+  </itemizedlist>
+  In general, <filename>amcheck</> can only prove the presence of
+  corruption; it cannot prove its absence.
+ </para>
+
+ </sect2>
+ <sect2>
+  <title>Repairing corruption</title>
+ <para>
+  No error concerning corruption raised by <filename>amcheck</> should
+  ever be a false positive.  In practice, <filename>amcheck</> is more
+  likely to find software bugs than problems with hardware.
+  <filename>amcheck</> raises errors in the event of conditions that,
+  by definition, should never happen, and so careful analysis of
+  <filename>amcheck</> errors is often required.
+ </para>
+ <para>
+  There is no general method of repairing problems that
+  <filename>amcheck</> detects.  An explanation for the root cause of
+  an invariant violation should be sought.  <xref
+  linkend="pageinspect"> may play a useful role in diagnosing
+  corruption that <filename>amcheck</> detects. A <command>REINDEX</>
+  may or may not be effective in repairing corruption.
+ </para>
+
+ </sect2>
+
+</sect1>
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index c8708ec..54102d6 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -103,6 +103,7 @@ CREATE EXTENSION <replaceable>module_name</> FROM unpackaged;
  </para>
 
  &adminpack;
+ &amcheck;
  &auth-delay;
  &auto-explain;
  &bloom;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 4383711..3c7ec01 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -105,6 +105,7 @@
 <!-- contrib information -->
 <!ENTITY contrib         SYSTEM "contrib.sgml">
 <!ENTITY adminpack       SYSTEM "adminpack.sgml">
+<!ENTITY amcheck         SYSTEM "amcheck.sgml">
 <!ENTITY auth-delay      SYSTEM "auth-delay.sgml">
 <!ENTITY auto-explain    SYSTEM "auto-explain.sgml">
 <!ENTITY bloom           SYSTEM "bloom.sgml">
-- 
2.7.4

#45

Peter Geoghegan

pg@heroku.com

over 9 years ago

In reply to: Peter Geoghegan (#44)

Re: amcheck (B-Tree integrity checking tool)

On Wed, Sep 7, 2016 at 11:44 AM, Peter Geoghegan <pg@heroku.com> wrote:

I attach my V3.

I should point out that I'm leaving to go on Vacation this weekend.
I'll be away for a week, and will not be answering mail during that
period. If anyone has any questions on this for me, it might be
preferable to ask them before I leave.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#46

Michael Paquier

michael.paquier@gmail.com

over 9 years ago

In reply to: Peter Geoghegan (#44)

Re: amcheck (B-Tree integrity checking tool)

On Thu, Sep 8, 2016 at 3:44 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Fri, Sep 2, 2016 at 11:19 AM, Kevin Grittner <kgrittn@gmail.com> wrote:

IMV the process is to post a patch to this list to certify that it
is yours to contribute and free of IP encumbrances that would
prevent us from using it. I will wait for that post.

I attach my V3. There are only very minor code changes here, such as
breaking out a separate elevel constant for DEBUG2 traces that show
information of how the B-Tree is accessed (e.g. current level, block
number). Tiny tweaks to about 3 comments were also performed. These
changes are trivial.

I've also removed the tests added, since external sort regression test
coverage is in a much better place now.

Okay, moved to next CF. I may look at it finally I got some use-cases
for it, similar to yours I guess..
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#47

Peter Geoghegan

pg@heroku.com

over 9 years ago

In reply to: Michael Paquier (#46)

Re: amcheck (B-Tree integrity checking tool)

On Sun, Oct 2, 2016 at 6:48 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

Okay, moved to next CF. I may look at it finally I got some use-cases
for it, similar to yours I guess..

Let me know how that goes.

One thing I've definitely noticed is that the tool is good at finding
corner-case bugs, so even if you can only test a small fraction of the
number of databases that I've been able to test, there could still be
significant value in your performing your own exercise. Your customer
databases might feature far more use of Japanese collations, for
example, which might be an important factor. (Well, the use of Arabic
code points turned out to be an important factor in the question of
whether or not en_US.utf8 could have been affected by the Glibc +
abbreviated keys bug.)

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#48

Michael Paquier

michael.paquier@gmail.com

over 9 years ago

In reply to: Peter Geoghegan (#47)

Re: amcheck (B-Tree integrity checking tool)

On Fri, Oct 7, 2016 at 8:05 AM, Peter Geoghegan <pg@heroku.com> wrote:

Your customer
databases might feature far more use of Japanese collations, for
example, which might be an important factor.

Not really :)
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#49

Peter Geoghegan

pg@heroku.com

about 9 years ago

In reply to: Peter Geoghegan (#1)

Re: amcheck (B-Tree integrity checking tool)

On Mon, Feb 29, 2016 at 4:09 PM, Peter Geoghegan <pg@heroku.com> wrote:

To recap, the extension adds some SQL-callable functions that verify
certain invariant conditions hold within some particular B-Tree index.
These are the conditions that index scans rely on always being true.
The tool's scope may eventually cover other AMs, including heapam, but
nbtree seems like the best place to start.

Noah and I discussed possible future directions for amcheck in person
recently. I would like to get Noah's thoughts again here on how a tool
like amcheck might reasonably target other access methods for
verification. In particular, the heapam. MultiXacts were mentioned as
a structure that could receive verification in a future iteration of
this tool, but I lack expertise there.

I've placed a lot of emphasis on the importance of having a
low-overhead verification process, particularly in terms of the
strength of heavyweight lock that the verification process requires.
Ideally, it would be possible to run any new verification process in a
fairly indiscriminate way with only limited impact on live production
systems.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#50

Noah Misch

noah@leadboat.com

about 9 years ago

In reply to: Peter Geoghegan (#49)

Re: amcheck (B-Tree integrity checking tool)

On Fri, Oct 14, 2016 at 04:56:39PM -0700, Peter Geoghegan wrote:

On Mon, Feb 29, 2016 at 4:09 PM, Peter Geoghegan <pg@heroku.com> wrote:

To recap, the extension adds some SQL-callable functions that verify
certain invariant conditions hold within some particular B-Tree index.
These are the conditions that index scans rely on always being true.
The tool's scope may eventually cover other AMs, including heapam, but
nbtree seems like the best place to start.

Noah and I discussed possible future directions for amcheck in person
recently. I would like to get Noah's thoughts again here on how a tool
like amcheck might reasonably target other access methods for
verification. In particular, the heapam. MultiXacts were mentioned as
a structure that could receive verification in a future iteration of
this tool, but I lack expertise there.

Yes, a heap checker could examine plenty of things. Incomplete list:

- Detect impossible conditions in the hint bits. A tuple should not have both
HEAP_XMAX_COMMITTED and HEAP_XMAX_INVALID. Every tuple bearing
HEAP_ONLY_TUPLE should bear HEAP_UPDATED. HEAP_HASVARWIDTH should be true
if and only if the tuple has a non-NULL value in a negative-typlen column,
possibly a dropped column. A tuple should not have both HEAP_KEYS_UPDATED
and HEAP_XMAX_LOCK_ONLY.

- Report evidence of upgrades from older versions. If the tool sees
HEAP_MOVED_IN or HEAP_MOVED_OFF, it can report that the cluster was
binary-upgraded from 8.3 or 8.4. If the user did not upgrade from such a
version, the user should assume corruption.

- Check VARSIZE() of each variable-length datum. Corrupt lengths might direct
you to seek past the end of the tuple, or they might imply excess free space
at the end of the tuple.

- Verify agreement between CLOG, MULTIXACT, and hint bits. If the hint bits
include HEAP_XMAX_LOCK_ONLY, the multixact should not contain a
MultiXactStatusUpdate member. README.tuplock documents other invariants.
If the tool sees a tuple passing HEAP_LOCKED_UPGRADED, it can report that
the cluster was binary-upgraded from a version in [8.3, 9.1].

- Verify that TOAST pointers (in non-dropped columns of visible tuples) point
to valid data in the TOAST relation. This is much more expensive than the
other checks I've named, so it should be optional.

- If VM_ALL_VISIBLE() or VM_ALL_FROZEN() passes for a particular page, verify
that the visibility data stored in the page is compatible with that claim.

- Examine PageHeaderData values. If pd_checksum is non-zero in a cluster with
checksums disabled, the cluster was binary-upgraded from [8.3, 9.2].

I've placed a lot of emphasis on the importance of having a
low-overhead verification process, particularly in terms of the
strength of heavyweight lock that the verification process requires.
Ideally, it would be possible to run any new verification process in a
fairly indiscriminate way with only limited impact on live production
systems.

I suspect you could keep heap checker overhead similar to the cost of "SELECT
count(*) FROM table_name".

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#51

Michael Paquier

michael.paquier@gmail.com

about 9 years ago

In reply to: Noah Misch (#50)

Re: amcheck (B-Tree integrity checking tool)

On Mon, Oct 17, 2016 at 10:46 AM, Noah Misch <noah@leadboat.com> wrote:

- Detect impossible conditions in the hint bits. A tuple should not have both
HEAP_XMAX_COMMITTED and HEAP_XMAX_INVALID. Every tuple bearing
HEAP_ONLY_TUPLE should bear HEAP_UPDATED. HEAP_HASVARWIDTH should be true
if and only if the tuple has a non-NULL value in a negative-typlen column,
possibly a dropped column. A tuple should not have both HEAP_KEYS_UPDATED
and HEAP_XMAX_LOCK_ONLY.

- Report evidence of upgrades from older versions. If the tool sees
HEAP_MOVED_IN or HEAP_MOVED_OFF, it can report that the cluster was
binary-upgraded from 8.3 or 8.4. If the user did not upgrade from such a
version, the user should assume corruption.

- Check VARSIZE() of each variable-length datum. Corrupt lengths might direct
you to seek past the end of the tuple, or they might imply excess free space
at the end of the tuple.

- Verify agreement between CLOG, MULTIXACT, and hint bits. If the hint bits
include HEAP_XMAX_LOCK_ONLY, the multixact should not contain a
MultiXactStatusUpdate member. README.tuplock documents other invariants.
If the tool sees a tuple passing HEAP_LOCKED_UPGRADED, it can report that
the cluster was binary-upgraded from a version in [8.3, 9.1].

- Verify that TOAST pointers (in non-dropped columns of visible tuples) point
to valid data in the TOAST relation. This is much more expensive than the
other checks I've named, so it should be optional.

There is a lot of value in that actually. I got bitten lately by an
issue where this got corrupted because of an incorrect failover flow.

- If VM_ALL_VISIBLE() or VM_ALL_FROZEN() passes for a particular page, verify
that the visibility data stored in the page is compatible with that claim.

- Examine PageHeaderData values. If pd_checksum is non-zero in a cluster with
checksums disabled, the cluster was binary-upgraded from [8.3, 9.2].

Agreed basically on all those items, though the first version does not
need to do everything :)

Honestly just btree checks have enough value to validate the
integration of this module. You can get that as "I'll look at your
patch and provide a review soon because I have a hell lot of cases
where this will be useful".
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#52

Andres Freund

andres@anarazel.de

about 9 years ago

In reply to: Peter Geoghegan (#44)

Re: amcheck (B-Tree integrity checking tool)

Hi,

I think the patch could use a pgindent run.

On 2016-09-07 11:44:01 -0700, Peter Geoghegan wrote:

diff --git a/contrib/amcheck/amcheck--1.0.sql b/contrib/amcheck/amcheck--1.0.sql
new file mode 100644
index 0000000..ebbd6ac
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0.sql
@@ -0,0 +1,20 @@
+/* contrib/amcheck/amcheck--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION amcheck" to load this file. \quit
+
+--
+-- bt_index_check()
+--
+CREATE FUNCTION bt_index_check(index regclass)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT;
+
+--
+-- bt_index_parent_check()
+--
+CREATE FUNCTION bt_index_parent_check(index regclass)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT;

I'd really want a function that runs all check on a table.

+#define CHECK_SUPERUSER() { \
+		if (!superuser()) \
+			ereport(ERROR, \
+					(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
+					 (errmsg("must be superuser to use verification functions")))); }

Why is this a macro?

+/*
+ * Callers to verification functions should never receive a false positive
+ * indication of corruption.  Therefore, when using amcheck functions for
+ * stress testing, it may be useful to temporally change the CORRUPTION elevel
+ * to PANIC, to immediately halt the server in the event of detecting an
+ * invariant condition violation.  This may preserve more information about the
+ * nature of the underlying problem.  Note that modifying the CORRUPTION
+ * constant to be an elevel < ERROR is not well tested.
+ */
+#ifdef NOT_USED
+#define CORRUPTION		PANIC
+#define CONCERN			WARNING
+#define POSITION		NOTICE
+#else
+#define CORRUPTION		ERROR
+#define CONCERN			DEBUG1
+#define POSITION		DEBUG2
+#endif

Hm, if we want that - and it doesn't seem like a bad idea - I think we
should be make it available without recompiling.

+/*
+ * A B-Tree cannot possibly have this many levels, since there must be one
+ * block per level, which is bound by the range of BlockNumber:
+ */
+#define InvalidBtreeLevel	((uint32) InvalidBlockNumber)

Hm, I think it'd be, for the long term, better if we'd move the btree
check code to amcheck_nbtree.c or such.

+Datum
+bt_index_parent_check(PG_FUNCTION_ARGS)
+{
+	Oid			indrelid = PG_GETARG_OID(0);
+	Oid			heapid;
+	Relation	indrel;
+	Relation	heaprel;
+
+	CHECK_SUPERUSER();
+
+	/*
+	 * We must lock table before index to avoid deadlocks.  However, if the
+	 * passed indrelid isn't an index then IndexGetRelation() will fail.
+	 * Rather than emitting a not-very-helpful error message, postpone
+	 * complaining, expecting that the is-it-an-index test below will fail.
+	 */
+	heapid = IndexGetRelation(indrelid, true);
+	if (OidIsValid(heapid))
+		heaprel = heap_open(heapid, ShareLock);
+	else
+		heaprel = NULL;
+
+	/*
+	 * Open the target index relations separately (like relation_openrv(), but
+	 * with heap relation locked first to prevent deadlocking).  In hot standby
+	 * mode this will raise an error.
+	 */
+	indrel = index_open(indrelid, ExclusiveLock);

Theoretically we should recheck that the oids still match, theoretically
the locking here allows the index->table mapping to change. It's not a
huge window tho...

+	/* Check for active uses of the index in the current transaction */
+	CheckTableNotInUse(indrel, "bt_index_parent_check");

Why do we actually care?

+static void
+btree_index_checkable(Relation rel)
+{
+	if (rel->rd_rel->relkind != RELKIND_INDEX ||
+		rel->rd_rel->relam != BTREE_AM_OID)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("only nbtree access method indexes are supported"),
+				 errdetail("Relation \"%s\" is not a B-Tree index.",
+						   RelationGetRelationName(rel))));
+
+	if (RELATION_IS_OTHER_TEMP(rel))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot access temporary tables of other sessions"),
+				 errdetail("Index \"%s\" is associated with temporary relation.",
+						   RelationGetRelationName(rel))));
+
+	if (!rel->rd_index->indisready)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot check index \"%s\"",
+						RelationGetRelationName(rel)),
+				 errdetail("Index is not yet ready for insertions")));

Why can't we check this? Seems valuable to allow checking after a
partial CONCURRENTLY index build?

Why don't you use IndexIsReady et al? Not sure if the patch compiles
against an earlier release (there weren't valid/ready/live dead fields,
but instead it was compressed into fewer)?

+/*
+ * bt_check_every_level()
+ *
+ * Main entry point for B-Tree SQL-callable functions.  Walks the B-Tree in
+ * logical order, verifying invariants as it goes.
+ *
+ * It is the caller's responsibility to acquire appropriate heavyweight lock on
+ * the index relation, and advise us if extra checks are safe when an
+ * ExclusiveLock is held.  An ExclusiveLock is generally assumed to prevent any
+ * kind of physical modification to the index structure, including
+ * modifications that VACUUM may make.
+ */

Hm, what kind of locking does the kill_prior_tuple stuff use?

+static void
+bt_check_every_level(Relation rel, bool exclusivelock)
+{

+	/*
+	 * Certain deletion patterns can result in "skinny" B-Tree indexes, where
+	 * the fast root and true root differ.
+	 *
+	 * Start from the true root, not the fast root, unlike conventional index
+	 * scans.  This approach is more thorough, and removes the risk of
+	 * following a stale fast root from the meta page.
+	 */
+	if (metad->btm_fastroot != metad->btm_root)
+		ereport(CONCERN,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				 errmsg("fast root mismatch in index %s",
+						RelationGetRelationName(rel)),
+				 errdetail_internal("Fast block %u (level %u) "
+									"differs from true root "
+									"block %u (level %u).",
+									metad->btm_fastroot, metad->btm_fastlevel,
+									metad->btm_root, metad->btm_level)));

Hm, isn't that a potentially spurious report?

+static BtreeLevel
+bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
+{

+	do
+	{
+		/* Don't rely on CHECK_FOR_INTERRUPTS() calls at lower level */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Initialize state for this iteration */
+		state->targetblock = current;
+		state->target = palloc_btree_page(state, state->targetblock);
+		state->targetlsn = PageGetLSN(state->target);
+
+		opaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+
+		if (P_IGNORE(opaque))
+		{
+			if (P_RIGHTMOST(opaque))
+				ereport(CORRUPTION,
+						(errcode(ERRCODE_INDEX_CORRUPTED),
+						 errmsg("block %u fell off the end of index \"%s\"",
+								current, RelationGetRelationName(state->rel))));
+			else
+				ereport(CONCERN,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("block %u of index \"%s\" ignored",
+								current, RelationGetRelationName(state->rel))));

Hm, why is this worth a WARNING? Isn't it perfectly normal to have a few
of these?

+			/*
+			 * Before beginning any non-trivial examination of level, establish
+			 * next level down's leftmost block number, which next call here
+			 * will pass as its leftmost (iff this isn't leaf level).

"next level down's leftmost block number" - it'd be good to simplify
that a bit.

+/*
+ * bt_target_page_check()

Why "target"?

+static void
+bt_target_page_check(BtreeCheckState *state)
+{

+		/*
+		 *   ********************
+		 *   * Page order check *
+		 *   ********************

Isn't that item order, rather than page order?

+		}
+		/*
+		 *   ********************
+		 *   * Last item check  *
+		 *   ********************

Newline before block wouldn't hurt.

+		 * Check last item against next/right page's first data item's when
+		 * last item on page is reached.

I think this need some polishing.

+		/*
+		 *   ********************
+		 *   * Downlink check   *
+		 *   ********************
+		 *
+		 * Additional check of child items against target page (their parent),
+		 * iff this is internal page and caller holds ExclusiveLock on index

s/is/is an/ /holds/holds an/

+static ScanKey
+bt_right_page_check_scankey(BtreeCheckState *state)
+{

+	 * A deleted page won't actually be recycled by VACUUM early enough for us
+	 * to fail to be able follow its right link (or left link, or downlink),

"for us to fail to be able" sounds odd.

+	/*
+	 * No ExclusiveLock held case -- why it's safe to proceed.
+	 *
+	 * Problem:
+	 *
+	 * We must avoid false positive reports of corruption when caller treats
+	 * item returned here as an upper bound on target's last item.  In general,
+	 * false positives are disallowed.  Ensuring they don't happen in
+	 * !exclusivelock case is subtle.

s/item/an item/?

This paragraph is too complicated to follow in the noisy environment
where we both currently are ;)

+/*
+ * bt_downlink_check()
+ *
+ * Checks one of target's downlink against its child page.

downlinks?

+ * Conceptually, the target page continues to be what is checked here.  The
+ * target block is still blamed in the event of finding an invariant violation.
+ * The downlink insertion into the target is probably where any problem raised
+ * here arises, and there is no such thing as a parent link, so doing the
+ * verification this way around is much more practical.
+ */
+static void
+bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
+				  ScanKey targetkey)
+{

+	 * between the deleted page and its right sibling.  (We cannot delete a
+	 * parent page's rightmost page unless it is the last child page, and we
+	 * intend to delete the parent itself.)

"parent page's rightmost page"?

I like this. Some of the more complex pieces towards the end of the
field need some attention, there's a fair amount of word-smithing
needed, and I do think we want to make the structural changes outlined
above, but besides these, imo fairly simple adaptions, I do think this
is useful and not that far from being committable.

- Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#53

Thomas Munro

thomas.munro@enterprisedb.com

about 9 years ago

In reply to: Peter Geoghegan (#44)

Re: amcheck (B-Tree integrity checking tool)

On Thu, Sep 8, 2016 at 6:44 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Fri, Sep 2, 2016 at 11:19 AM, Kevin Grittner <kgrittn@gmail.com> wrote:

IMV the process is to post a patch to this list to certify that it
is yours to contribute and free of IP encumbrances that would
prevent us from using it. I will wait for that post.

I attach my V3

+ * Ideally, we'd compare every item in the index against every other
+ * item in the index, and not trust opclass obedience of the transitive
+ * law to bridge the gap between children and their grandparents (as
+ * well as great-grandparents, and so on).  We don't go to those
+ * lengths because that would be prohibitively expensive, and probably
+ * not markedly more effective in practice.
+ */

I skimmed the Modern B-Tree Techniques paper recently, and there was a
section on "the cousin problem" when verifying btrees, which this
comment reminded me of. I tried to come up with an example of a pair
of characters being switched in a collation that would introduce
undetectable corruption:

T1. Order = a < b < c < d

Btree = [a|c]
/ \
/ \
/ \
[a]-------[c]
| |
| |
[b]-------[d]

T2. Order = a < c < b < d

Now c and b have been switched around. Won't amcheck still pass since
we only verify immediate downlinks and siblings? Yet searches for b
will take a wrong turn at the root. Do I have this right? I wonder
if there is a way to verify that each page's high key < the 'next' key
for all ancestors.

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#54

Peter Geoghegan

pg@heroku.com

about 9 years ago

In reply to: Andres Freund (#52)

Re: amcheck (B-Tree integrity checking tool)

On Wed, Nov 16, 2016 at 1:03 PM, Andres Freund <andres@anarazel.de> wrote:

I think the patch could use a pgindent run.

Okay.

I'd really want a function that runs all check on a table.

Why not just use a custom SQL query? The docs have an example of one
such query, that verifies all user indexes, but there could be another
example.

+#define CHECK_SUPERUSER() { \
+             if (!superuser()) \
+                     ereport(ERROR, \
+                                     (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
+                                      (errmsg("must be superuser to use verification functions")))); }

Why is this a macro?

This was just copied from pageinspect, TBH. What would you prefer?

+#ifdef NOT_USED
+#define CORRUPTION           PANIC
+#define CONCERN                      WARNING
+#define POSITION             NOTICE
+#else
+#define CORRUPTION           ERROR
+#define CONCERN                      DEBUG1
+#define POSITION             DEBUG2
+#endif
Hm, if we want that - and it doesn't seem like a bad idea - I think we
should be make it available without recompiling.

I suppose, provided it doesn't let CORRUPTION elevel be < ERROR. That
might be broken if it was allowed.

Hm, I think it'd be, for the long term, better if we'd move the btree
check code to amcheck_nbtree.c or such.

Agreed.

+Datum
+bt_index_parent_check(PG_FUNCTION_ARGS)
+{
+     Oid                     indrelid = PG_GETARG_OID(0);

Theoretically we should recheck that the oids still match, theoretically
the locking here allows the index->table mapping to change. It's not a
huge window tho...

+     /* Check for active uses of the index in the current transaction */
+     CheckTableNotInUse(indrel, "bt_index_parent_check");

Why do we actually care?

This was left-over from duplicating REINDEX code. Indeed, we have no
reason to care at all.

+     if (!rel->rd_index->indisready)
+             ereport(ERROR,
+                             (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                              errmsg("cannot check index \"%s\"",
+                                             RelationGetRelationName(rel)),
+                              errdetail("Index is not yet ready for insertions")));

Why can't we check this? Seems valuable to allow checking after a
partial CONCURRENTLY index build?

Okay.

Why don't you use IndexIsReady et al? Not sure if the patch compiles
against an earlier release (there weren't valid/ready/live dead fields,
but instead it was compressed into fewer)?

This version is almost identical to the Github version of amcheck that
we've been using with 9.4+. Agreed on IndexIsReady().

+/*
+ * bt_check_every_level()
+ *
+ * Main entry point for B-Tree SQL-callable functions.  Walks the B-Tree in
+ * logical order, verifying invariants as it goes.
+ *
+ * It is the caller's responsibility to acquire appropriate heavyweight lock on
+ * the index relation, and advise us if extra checks are safe when an
+ * ExclusiveLock is held.  An ExclusiveLock is generally assumed to prevent any
+ * kind of physical modification to the index structure, including
+ * modifications that VACUUM may make.
+ */

Hm, what kind of locking does the kill_prior_tuple stuff use?

The killing itself, or the setting of LP_DEAD bit? Ordinary index
scans could concurrently set LP_DEAD with only an AccessShareLock.
But, that's just metadata that amcheck doesn't care about. Anything
that needs to actually recyle space based on finding an LP_DEAD bit
set is a write operation, that is precluded from running when
bt_index_parent_check() is called by its ExclusiveLock.
_bt_vacuum_one_page() is only called by one nbtinsert.c code path,
when it looks like a page split might be needed.

+static void
+bt_check_every_level(Relation rel, bool exclusivelock)
+{

+     /*
+      * Certain deletion patterns can result in "skinny" B-Tree indexes, where
+      * the fast root and true root differ.
+      *
+      * Start from the true root, not the fast root, unlike conventional index
+      * scans.  This approach is more thorough, and removes the risk of
+      * following a stale fast root from the meta page.
+      */
+     if (metad->btm_fastroot != metad->btm_root)
+             ereport(CONCERN,
+                             (errcode(ERRCODE_DUPLICATE_OBJECT),
+                              errmsg("fast root mismatch in index %s",
+                                             RelationGetRelationName(rel)),
+                              errdetail_internal("Fast block %u (level %u) "
+                                                                     "differs from true root "
+                                                                     "block %u (level %u).",
+                                                                     metad->btm_fastroot, metad->btm_fastlevel,
+                                                                     metad->btm_root, metad->btm_level)));

Hm, isn't that a potentially spurious report?

Well, it's not an error, so no. I think that it would be better with a
lower elevel, though, to indicate the low severity.

+                             ereport(CONCERN,
+                                             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                                              errmsg("block %u of index \"%s\" ignored",
+                                                             current, RelationGetRelationName(state->rel))));

Hm, why is this worth a WARNING? Isn't it perfectly normal to have a few
of these?

It is, but you shouldn't get very many of them, since pages that are
ignorable don't have sibling links at the second phase of deletion. I
guess WARNING might be too strong here, too.

"next level down's leftmost block number" - it'd be good to simplify
that a bit.

Okay.

+/*
+ * bt_target_page_check()

Why "target"?

Conceptually, the verification scan vists from true root to leaf, left
to right, and makes each page it finds along the way the target
exactly once. It's represented as such in the state that we pass
around -- there is some metadata stashed about the target there (e.g.,
LSN is kept there for easy retrieval by possible ereport(ERROR)
calls). I think that that makes things quite a lot clearer,
particularly when we need to discuss checks that happen against the
sibling or child of target (conceptually, we're still checking the
target). The description of what happens and why that is okay is a lot
clearer because we constantly refer to the current target page, IMV --
that's always our frame of reference.

+             /*
+              *   ********************
+              *   * Page order check *
+              *   ********************

Isn't that item order, rather than page order?

I don't know...it checks that the items on the page are in the right order.

(Various minor niggles about style)

Okay.

This paragraph is too complicated to follow in the noisy environment
where we both currently are ;)

As I pointed out yesterday, that paragraph is 95%+ of what makes the
patch hard to understand. It's worth it, though, to detect problems
with things like transposed pages while holding only an
AccessShareLock.

downlinks?

Downlinks are an accepted terminology for the index tuples in internal
pages, that point to child pages (rather than having TID pointer to
table).

"parent page's rightmost page"?

I guess that should be "parent page's rightmost child". This is
covered in the nbtree README:

"""
To preserve consistency on the parent level, we cannot merge the key space
of a page into its right sibling unless the right sibling is a child of
the same parent --- otherwise, the parent's key space assignment changes
too, meaning we'd have to make bounding-key updates in its parent, and
perhaps all the way up the tree. Since we can't possibly do that
atomically, we forbid this case. That means that the rightmost child of a
parent node can't be deleted unless it's the only remaining child, in which
case we will delete the parent too (see below).
''""

I like this. Some of the more complex pieces towards the end of the
field need some attention, there's a fair amount of word-smithing
needed, and I do think we want to make the structural changes outlined
above, but besides these, imo fairly simple adaptions, I do think this
is useful and not that far from being committable.

Cool. I'll try to get a revision out soon. I'm happy to do that much.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#55

Peter Geoghegan

pg@heroku.com

about 9 years ago

In reply to: Thomas Munro (#53)

Re: amcheck (B-Tree integrity checking tool)

On Wed, Nov 16, 2016 at 5:06 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

+ * Ideally, we'd compare every item in the index against every other
+ * item in the index, and not trust opclass obedience of the transitive
+ * law to bridge the gap between children and their grandparents (as
+ * well as great-grandparents, and so on).  We don't go to those
+ * lengths because that would be prohibitively expensive, and probably
+ * not markedly more effective in practice.
+ */

I skimmed the Modern B-Tree Techniques paper recently, and there was a
section on "the cousin problem" when verifying btrees, which this
comment reminded me of.

I'm impressed that you made the connection. This comment reminded you
of that section because it was directly influenced by it. I actually
use the term "cousin page" elsewhere, because it's really useful to
distinguish cousins from "true siblings" when talking about page
deletion in particular...and I have a lot to say about page deletion.

I tried to come up with an example of a pair
of characters being switched in a collation that would introduce
undetectable corruption:

T1. Order = a < b < c < d

Btree = [a|c]
/ \
/ \
/ \
[a]-------[c]
| |
| |
[b]-------[d]

T2. Order = a < c < b < d

So, the Btree you show is supposed to be good? You've left it up to me
to draw T2, the corrupt version? If so, here is a drawing of T2 for
the sake of clarity:

Btree = [a|b]
/ \
/ \
/ \
[a]-------[b]
| |
| |
[c]-------[d]

Now c and b have been switched around. Won't amcheck still pass since
we only verify immediate downlinks and siblings? Yet searches for b
will take a wrong turn at the root. Do I have this right? I wonder
if there is a way to verify that each page's high key < the 'next' key
for all ancestors.

I don't think you have it right, because your argument is predicated
on classic L&Y, not the Postgres variant. The high key is literally
guaranteed to be *equal* to the initial first value of the right half
of a split page post-split (at least initially), which is where the
new downlink for the new right half comes from in turn, so it will be
exactly equal as well. There is a comment which says all this within
_bt_insert_parent(), actually.

In general, I think it's pretty bad that we're so loose about
representing the key space in downlinks compared to classic L&Y,
because it probably makes index bloat a lot worse in internal pages,
messing up the key space long term. As long as we're doing that,
though, we clearly cannot usefully "verify that each page's high key <
the 'next' key for all ancestors", because we deviate from L&Y in a
way that makes that check always fail. It's okay that it always fails
for us (it doesn't actually indicate corruption), because of our
duplicate-aware index scan logic (change to invariant).

In summary, I'm pretty sure that our "Ki <= v <= Ki+1" thing makes the
cousin problem a non-issue, because we don't follow downlinks that are
equal to our scankey -- we go to their immediate left and start
*there*, in order to support duplicates in leaf pages. Index scans for
'b' work for both T1 and T2 in Postgres, because we refuse to follow
the 'b' downlinks when doing an index scan for 'b' (relevant to T2). I
would actually like to change things to make the invariant the classic
L&Y "Ki < v <= Ki+1", to avoid bloat in the internal pages and to make
suffix truncation in internal pages work.

So, we don't have the cousin problem, but since I wish for us to adopt
the stricter classic L&Y invariant at some point, you could say that I
also wished that we had the cousin problem.

Confused yet? :-)

[1]: https://archive.org/stream/symmetricconcurr00lani#page/6/mode/2up -- Peter Geoghegan
--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#56

Peter Geoghegan

pg@heroku.com

about 9 years ago

In reply to: Peter Geoghegan (#54)

Re: amcheck (B-Tree integrity checking tool)

On Thu, Nov 17, 2016 at 12:04 PM, Peter Geoghegan <pg@heroku.com> wrote:

Hm, if we want that - and it doesn't seem like a bad idea - I think we
should be make it available without recompiling.

I suppose, provided it doesn't let CORRUPTION elevel be < ERROR. That
might be broken if it was allowed.

What do you think about new argument with default vs. GUC? I guess
that the GUC might be a lot less of a foot-gun. We might even give it
a suitably scary name, to indicate that it will make the server PANIC.
(I gather that you don't care about other aspects of verbosity -- just
about the ability to make amcheck PANIC in the event of an invariant
violation without recompiling it.)

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#57

Robert Haas

robertmhaas@gmail.com

about 9 years ago

In reply to: Peter Geoghegan (#56)

Re: amcheck (B-Tree integrity checking tool)

On Fri, Nov 18, 2016 at 6:51 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Thu, Nov 17, 2016 at 12:04 PM, Peter Geoghegan <pg@heroku.com> wrote:

Hm, if we want that - and it doesn't seem like a bad idea - I think we
should be make it available without recompiling.

I suppose, provided it doesn't let CORRUPTION elevel be < ERROR. That
might be broken if it was allowed.

What do you think about new argument with default vs. GUC? I guess
that the GUC might be a lot less of a foot-gun. We might even give it
a suitably scary name, to indicate that it will make the server PANIC.
(I gather that you don't care about other aspects of verbosity -- just
about the ability to make amcheck PANIC in the event of an invariant
violation without recompiling it.)

Yikes. I don't think I want to expose any kind of API that lets the
user PANIC the server. A value < ERROR sounds far more reasonable
than a value > ERROR.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#58

Peter Geoghegan

pg@heroku.com

about 9 years ago

In reply to: Robert Haas (#57)

Re: amcheck (B-Tree integrity checking tool)

On Sat, Nov 19, 2016 at 6:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:

What do you think about new argument with default vs. GUC? I guess
that the GUC might be a lot less of a foot-gun. We might even give it
a suitably scary name, to indicate that it will make the server PANIC.
(I gather that you don't care about other aspects of verbosity -- just
about the ability to make amcheck PANIC in the event of an invariant
violation without recompiling it.)

Yikes. I don't think I want to expose any kind of API that lets the
user PANIC the server. A value < ERROR sounds far more reasonable
than a value > ERROR.

In general, I don't want to get into the business of reasoning about
how well we can limp along when there is a would-be error condition
within amcheck. Once "the impossible" has actually occurred, it's very
difficult to reason about what still works. Also, I actually agree
that making it possible for the tool to force a PANIC through a
user-visible interface is a bad idea.

Maybe we should just leave it as it is -- experts can recompile the
tool after modifying it to use an elevel that is != ERROR (the thing I
mention about elevel < ERROR is already documented in code comments).
If that breaks, they get to keep both halves.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#59

Robert Haas

robertmhaas@gmail.com

about 9 years ago

In reply to: Peter Geoghegan (#58)

Re: amcheck (B-Tree integrity checking tool)

On Sat, Nov 19, 2016 at 11:38 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Sat, Nov 19, 2016 at 6:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:

What do you think about new argument with default vs. GUC? I guess
that the GUC might be a lot less of a foot-gun. We might even give it
a suitably scary name, to indicate that it will make the server PANIC.
(I gather that you don't care about other aspects of verbosity -- just
about the ability to make amcheck PANIC in the event of an invariant
violation without recompiling it.)

Yikes. I don't think I want to expose any kind of API that lets the
user PANIC the server. A value < ERROR sounds far more reasonable
than a value > ERROR.

In general, I don't want to get into the business of reasoning about
how well we can limp along when there is a would-be error condition
within amcheck. Once "the impossible" has actually occurred, it's very
difficult to reason about what still works. Also, I actually agree
that making it possible for the tool to force a PANIC through a
user-visible interface is a bad idea.

Maybe we should just leave it as it is -- experts can recompile the
tool after modifying it to use an elevel that is != ERROR (the thing I
mention about elevel < ERROR is already documented in code comments).
If that breaks, they get to keep both halves.

OK. If it's not reasonable to continue checking after an ERROR, then
I think ERROR is the way to go. If somebody really doesn't like that
lack of flexibility (in either direction), they can propose a change
later for separate consideration. That limitation is not, in my view,
a sufficient reason to hold up the patch on the table.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#60

Peter Geoghegan

pg@heroku.com

about 9 years ago

In reply to: Robert Haas (#59)

1 attachment(s)

Re: amcheck (B-Tree integrity checking tool)

On Sun, Nov 20, 2016 at 3:42 PM, Robert Haas <robertmhaas@gmail.com> wrote:

OK. If it's not reasonable to continue checking after an ERROR, then
I think ERROR is the way to go. If somebody really doesn't like that
lack of flexibility (in either direction), they can propose a change
later for separate consideration. That limitation is not, in my view,
a sufficient reason to hold up the patch on the table.

I attach V4 of amcheck. I decided to leave things as they are, as far
as forcing a different elevel goes (you still need to modify a macro).
I didn't change the elevel macro from CONCERN in a couple of cases, as
Andres suggested, because in fact that's generally the same as DEBUG1,
not WARNING (I agreed with Andres that it was WARNING before, but it
isn't unless you #define NOT_USED).

Andres said: "Theoretically we should recheck that the oids still
match, theoretically the locking here allows the index->table mapping
to change". I don't know how to act on this feedback, since comparable
index + heap locking code should have the same problem, but doesn't
consider it. How is this any different to reindex_index()?

Other than that, all feedback items were worked through. I made the
functions PARALLEL SAFE, too, since I noticed that that wasn't the
case in passing.

--
Peter Geoghegan

Attachments:

0001-Add-amcheck-extension-to-contrib.patchtext/x-patch; charset=US-ASCII; name=0001-Add-amcheck-extension-to-contrib.patchDownload

From 21c843ae0193ed17b2e5234d67f2e73f7015b3cd Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Tue, 10 Jun 2014 22:20:40 -0700
Subject: [PATCH] Add amcheck extension to contrib

The new extension makes available two SQL-callable functions, each of
which accept a single argument (a regclass/nbtree index relation
argument) and return void.

The SQL-callable function bt_index_check() checks invariants of the
target index by walking the tree, performing various checks of the
sanity of the page space using insertion scankeys to compare items.  The
SQL-callable function bt_index_parent_check() performs a superset of the
checks performed by bt_index_check(): the same checks, as well as
another check verifying that each downlink is actually respected as a
lower bound on its child page.

bt_index_check() requires only an AccessShareLock on the target.
bt_index_parent_check() requires an ExclusiveLock on the target,
and ShareLock on its heap relation.
---
 contrib/Makefile                 |    1 +
 contrib/amcheck/Makefile         |   19 +
 contrib/amcheck/amcheck--1.0.sql |   20 +
 contrib/amcheck/amcheck.control  |    5 +
 contrib/amcheck/verify_nbtree.c  | 1227 ++++++++++++++++++++++++++++++++++++++
 doc/src/sgml/amcheck.sgml        |  275 +++++++++
 doc/src/sgml/contrib.sgml        |    1 +
 doc/src/sgml/filelist.sgml       |    1 +
 8 files changed, 1549 insertions(+)
 create mode 100644 contrib/amcheck/Makefile
 create mode 100644 contrib/amcheck/amcheck--1.0.sql
 create mode 100644 contrib/amcheck/amcheck.control
 create mode 100644 contrib/amcheck/verify_nbtree.c
 create mode 100644 doc/src/sgml/amcheck.sgml

diff --git a/contrib/Makefile b/contrib/Makefile
index 25263c0..4acd50e 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -6,6 +6,7 @@ include $(top_builddir)/src/Makefile.global
 
 SUBDIRS = \
 		adminpack	\
+		amcheck		\
 		auth_delay	\
 		auto_explain	\
 		bloom		\
diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
new file mode 100644
index 0000000..f14c7aa
--- /dev/null
+++ b/contrib/amcheck/Makefile
@@ -0,0 +1,19 @@
+# contrib/amcheck/Makefile
+
+MODULE_big	= amcheck
+OBJS		= verify_nbtree.o $(WIN32RES)
+
+EXTENSION = amcheck
+DATA = amcheck--1.0.sql
+PGFILEDESC = "amcheck - functions to verify access method invariants"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/amcheck
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/amcheck/amcheck--1.0.sql b/contrib/amcheck/amcheck--1.0.sql
new file mode 100644
index 0000000..b415f0e
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0.sql
@@ -0,0 +1,20 @@
+/* contrib/amcheck/amcheck--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION amcheck" to load this file. \quit
+
+--
+-- bt_index_check()
+--
+CREATE FUNCTION bt_index_check(index regclass)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT PARALLEL SAFE;
+
+--
+-- bt_index_parent_check()
+--
+CREATE FUNCTION bt_index_parent_check(index regclass)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
new file mode 100644
index 0000000..180f150
--- /dev/null
+++ b/contrib/amcheck/amcheck.control
@@ -0,0 +1,5 @@
+# amcheck extension
+comment = 'verify access method invariants'
+default_version = '1.0'
+module_pathname = '$libdir/amcheck'
+relocatable = true
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
new file mode 100644
index 0000000..295ccb8
--- /dev/null
+++ b/contrib/amcheck/verify_nbtree.c
@@ -0,0 +1,1227 @@
+/*-------------------------------------------------------------------------
+ *
+ * verify_nbtree.c
+ *		Verifies the integrity of nbtree indexes based on invariants.
+ *
+ * For B-Tree indexes, verification includes the invariant that each page in
+ * the target index has items in logical order as reported by an insertion
+ * scankey (the insertion scankey sort-wise NULL semantics are needed for
+ * verification).
+ *
+ *
+ * Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  contrib/amcheck/verify_nbtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/transam.h"
+#include "catalog/index.h"
+#include "catalog/pg_am.h"
+#include "commands/tablecmds.h"
+#include "miscadmin.h"
+#include "storage/lmgr.h"
+#include "utils/memutils.h"
+#include "utils/snapmgr.h"
+
+
+PG_MODULE_MAGIC;
+
+#define CHECK_SUPERUSER() { \
+		if (!superuser()) \
+			ereport(ERROR, \
+					(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
+					 (errmsg("must be superuser to use verification functions")))); }
+
+/*
+ * As noted in comments above _bt_compare(), there is special handling of the
+ * first data item (that is, the first item with a valid downlink -- not the
+ * high key item) on a non-leaf (internal) page.  There is clearly no point in
+ * having amcheck functions make any comparison of or against these "minus
+ * infinity" items, because they contain no actual key to compare. (Only the
+ * child block pointer is meaningful).
+ */
+#define OFFSET_IS_MINUS_INFINITY(opaque, offset) \
+	(!P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque))
+
+/*
+ * Callers to verification functions should never receive a false positive
+ * indication of corruption.  Therefore, when using amcheck functions for
+ * stress testing, it may be useful to temporally change the CORRUPTION elevel
+ * to PANIC, to immediately halt the server in the event of detecting an
+ * invariant condition violation.  This may preserve more information about the
+ * nature of the underlying problem.  Note that modifying the CORRUPTION
+ * constant to be an elevel < ERROR is not well tested.
+ */
+#ifdef NOT_USED
+#define CORRUPTION		PANIC
+#define CONCERN			WARNING
+#define POSITION		NOTICE
+#else
+#define CORRUPTION		ERROR
+#define CONCERN			DEBUG1
+#define POSITION		DEBUG2
+#endif
+
+/*
+ * A B-Tree cannot possibly have this many levels, since there must be one
+ * block per level, which is bound by the range of BlockNumber:
+ */
+#define InvalidBtreeLevel	((uint32) InvalidBlockNumber)
+
+/*
+ * State associated with verifying a B-Tree index
+ */
+typedef struct BtreeCheckState
+{
+	/*
+	 * Unchanging state, established at start of verification:
+	 *
+	 * rel:                 B-Tree Index Relation
+	 * exclusivelock:       ExclusiveLock held on rel; else AccessShareLock
+	 * checkstrategy:       Buffer access strategy
+	 * targetcontext:       Target page memory context
+	 */
+	Relation				rel;
+	bool					exclusivelock;
+	BufferAccessStrategy 	checkstrategy;
+	MemoryContext			targetcontext;
+
+	/*
+	 * Mutable state, for verification of particular page:
+	 *
+	 * target:              Main target page
+	 * targetblock:         Main target page's block number
+	 * targetlsn:           Main target page's LSN
+	 *
+	 * target is the point of reference for a verification operation.
+	 *
+	 * Note: targetlsn is always from "target".  It is stashed here out of
+	 * convenience.
+	 *
+	 * Other B-Tree pages may be allocated, but those are always auxiliary
+	 * (e.g. they are target's child pages).  Conceptually, only the target
+	 * page is checked.  Each page found by verification's left/right,
+	 * top/bottom scan becomes the target exactly once.
+	 *
+	 * Memory is managed by resetting targetcontext after verification of some
+	 * target page finishes (possibly including target verification that
+	 * depends on non-target page state).
+	 */
+	Page					target;
+	BlockNumber				targetblock;
+	XLogRecPtr				targetlsn;
+
+} BtreeCheckState;
+
+/*
+ * Starting point for verifying an entire B-Tree index level
+ */
+typedef struct BtreeLevel
+{
+	/* Level number (0 is leaf page level). */
+	uint32			level;
+
+	/* Left most block on level.  Scan of level begins here. */
+	BlockNumber		leftmost;
+
+	/* Is this level reported as "true" root level by meta page? */
+	bool			istruerootlevel;
+} BtreeLevel;
+
+PG_FUNCTION_INFO_V1(bt_index_check);
+PG_FUNCTION_INFO_V1(bt_index_parent_check);
+
+static void btree_index_checkable(Relation rel);
+static void bt_check_every_level(Relation rel, bool exclusivelock);
+static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
+											   BtreeLevel level);
+static void bt_target_page_check(BtreeCheckState *state);
+static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
+static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
+							  ScanKey targetkey);
+static bool invariant_key_less_than_equal_offset(BtreeCheckState *state,
+												 ScanKey key,
+												 OffsetNumber upperbound);
+static bool invariant_key_greater_than_equal_offset(BtreeCheckState *state,
+													ScanKey key,
+													OffsetNumber lowerbound);
+static bool invariant_key_less_than_equal_nontarget_offset(BtreeCheckState *state,
+														   Page other,
+														   ScanKey key,
+														   OffsetNumber upperbound);
+static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
+
+/*
+ * bt_index_check(index regclass)
+ *
+ * Verify integrity of B-Tree index.
+ *
+ * Only acquires AccessShareLock on index relation.  Does not consider
+ * invariants that exist between parent/child pages.
+ */
+Datum
+bt_index_check(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	Relation	indrel;
+
+	CHECK_SUPERUSER();
+
+	indrel = relation_open(relid, AccessShareLock);
+
+	/* Relation suitable for checking as B-Tree? */
+	btree_index_checkable(indrel);
+
+	/* Check index */
+	bt_check_every_level(indrel, false);
+
+	relation_close(indrel, AccessShareLock);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * bt_index_parent_check(index regclass)
+ *
+ * Verify integrity of B-Tree index.
+ *
+ * Acquires ExclusiveLock on index relation, and ShareLock on the associated
+ * heap relation, a bit like REINDEX.  Verifies that downlinks in parent pages
+ * are valid lower bounds on child pages.
+ */
+Datum
+bt_index_parent_check(PG_FUNCTION_ARGS)
+{
+	Oid			indrelid = PG_GETARG_OID(0);
+	Oid			heapid;
+	Relation	indrel;
+	Relation	heaprel;
+
+	CHECK_SUPERUSER();
+
+	/*
+	 * We must lock table before index to avoid deadlocks.  However, if the
+	 * passed indrelid isn't an index then IndexGetRelation() will fail.
+	 * Rather than emitting a not-very-helpful error message, postpone
+	 * complaining, expecting that the is-it-an-index test below will fail.
+	 */
+	heapid = IndexGetRelation(indrelid, true);
+	if (OidIsValid(heapid))
+		heaprel = heap_open(heapid, ShareLock);
+	else
+		heaprel = NULL;
+
+	/*
+	 * Open the target index relations separately (like relation_openrv(), but
+	 * with heap relation locked first to prevent deadlocking).  In hot standby
+	 * mode this will raise an error.
+	 */
+	indrel = index_open(indrelid, ExclusiveLock);
+
+	/* Relation suitable for checking as B-Tree? */
+	btree_index_checkable(indrel);
+
+	/* Check index */
+	bt_check_every_level(indrel, true);
+
+	index_close(indrel, ExclusiveLock);
+	if (heaprel)
+		heap_close(heaprel, ShareLock);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * btree_index_checkable()
+ *
+ * Basic checks about the suitability of a relation for checking as a B-Tree
+ * index.
+ */
+static void
+btree_index_checkable(Relation rel)
+{
+	if (rel->rd_rel->relkind != RELKIND_INDEX ||
+		rel->rd_rel->relam != BTREE_AM_OID)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("only nbtree access method indexes are supported"),
+				 errdetail("Relation \"%s\" is not a B-Tree index.",
+						   RelationGetRelationName(rel))));
+
+	if (RELATION_IS_OTHER_TEMP(rel))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot access temporary tables of other sessions"),
+				 errdetail("Index \"%s\" is associated with temporary relation.",
+						   RelationGetRelationName(rel))));
+
+	if (!IndexIsValid(rel->rd_index))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot check index \"%s\"",
+						RelationGetRelationName(rel)),
+				 errdetail("Index is not valid")));
+}
+
+/*
+ * bt_check_every_level()
+ *
+ * Main entry point for B-Tree SQL-callable functions.  Walks the B-Tree in
+ * logical order, verifying invariants as it goes.
+ *
+ * It is the caller's responsibility to acquire appropriate heavyweight lock on
+ * the index relation, and advise us if extra checks are safe when an
+ * ExclusiveLock is held.  An ExclusiveLock is generally assumed to prevent any
+ * kind of physical modification to the index structure, including
+ * modifications that VACUUM may make.
+ */
+static void
+bt_check_every_level(Relation rel, bool exclusivelock)
+{
+	BtreeCheckState	   *state;
+	Page				metapage;
+	BTMetaPageData	   *metad;
+	uint32				previouslevel;
+	BtreeLevel			current;
+
+	/*
+	 * RecentGlobalXmin assertion matches index_getnext_tid().  See note on
+	 * RecentGlobalXmin/B-Tree page deletion.
+	 */
+	Assert(TransactionIdIsValid(RecentGlobalXmin));
+
+	/*
+	 * Initialize state for entire verification operation
+	 */
+	state = palloc(sizeof(BtreeCheckState));
+	state->rel = rel;
+	state->exclusivelock = exclusivelock;
+	state->checkstrategy = GetAccessStrategy(BAS_BULKREAD);
+	/* Create context for page */
+	state->targetcontext = AllocSetContextCreate(CurrentMemoryContext,
+												 "amcheck page data",
+												 ALLOCSET_DEFAULT_MINSIZE,
+												 ALLOCSET_DEFAULT_INITSIZE,
+												 ALLOCSET_DEFAULT_MAXSIZE);
+
+	/* Get true root block from meta-page */
+	metapage = palloc_btree_page(state, BTREE_METAPAGE);
+	metad = BTPageGetMeta(metapage);
+
+	/*
+	 * Certain deletion patterns can result in "skinny" B-Tree indexes, where
+	 * the fast root and true root differ.
+	 *
+	 * Start from the true root, not the fast root, unlike conventional index
+	 * scans.  This approach is more thorough, and removes the risk of
+	 * following a stale fast root from the meta page.
+	 */
+	if (metad->btm_fastroot != metad->btm_root)
+		ereport(CONCERN,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				 errmsg("fast root mismatch in index %s",
+						RelationGetRelationName(rel)),
+				 errdetail_internal("Fast block %u (level %u) "
+									"differs from true root "
+									"block %u (level %u).",
+									metad->btm_fastroot, metad->btm_fastlevel,
+									metad->btm_root, metad->btm_level)));
+
+	/*
+	 * Starting at the root, verify every level.  Move left to right, top to
+	 * bottom.  Note that there may be no pages other than the meta page (meta
+	 * page can indicate that root is P_NONE when the index is totally empty).
+	 */
+	previouslevel = InvalidBtreeLevel;
+	current.level = metad->btm_level;
+	current.leftmost = metad->btm_root;
+	current.istruerootlevel =  true;
+	while (current.leftmost != P_NONE)
+	{
+		/*
+		 * Verify this level, and get left most page for next level down, if
+		 * not at leaf level
+		 */
+		current = bt_check_level_from_leftmost(state, current);
+
+		if (current.leftmost == InvalidBlockNumber)
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("index \"%s\" has no valid pages on level below %u or first level",
+							RelationGetRelationName(rel), previouslevel)));
+
+		previouslevel = current.level;
+	}
+
+	/* Be tidy: */
+	MemoryContextDelete(state->targetcontext);
+}
+
+/*
+ * bt_check_level_from_leftmost()
+ *
+ * Given a left-most block at some level, move right, verifying each page
+ * individually (with more verification across pages for "exclusivelock"
+ * callers).  Caller should pass the true root page as the leftmost initially,
+ * working their way down by passing what is returned for the last call here
+ * until level 0 (leaf page level) was reached.
+ *
+ * Returns state for next call, if any.  This includes left-most block number
+ * one level lower that should be passed on next level/call, which is set to
+ * P_NONE on last call here (when leaf level is verified).  Level numbers
+ * follow the nbtree convention: higher levels have higher numbers, because new
+ * levels are added only due to a root page split.  Note that prior to the
+ * first root page split, the root is also a leaf page, so there is always a
+ * level 0 (leaf level), and it's always the last level processed.
+ *
+ * Note on memory management:  State's per-page context is reset here, between
+ * each call to bt_target_page_check().
+ */
+static BtreeLevel
+bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
+{
+	/* State to establish early, concerning entire level */
+	BTPageOpaque	opaque;
+	MemoryContext 	oldcontext;
+	BtreeLevel		nextleveldown;
+
+	/* Variables for iterating across level using right links */
+	BlockNumber		leftcurrent = P_NONE;
+	BlockNumber		current = level.leftmost;
+
+	/* Initialize return state */
+	nextleveldown.leftmost = InvalidBlockNumber;
+	nextleveldown.level = InvalidBtreeLevel;
+	nextleveldown.istruerootlevel = false;
+
+	/* Use page-level context for duration of this call */
+	oldcontext = MemoryContextSwitchTo(state->targetcontext);
+
+	elog(POSITION, "verifying level %u%s", level.level,
+		 level.istruerootlevel?
+		 " (true root level)" : level.level == 0 ? " (leaf level)" : "");
+
+	do
+	{
+		/* Don't rely on CHECK_FOR_INTERRUPTS() calls at lower level */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Initialize state for this iteration */
+		state->targetblock = current;
+		state->target = palloc_btree_page(state, state->targetblock);
+		state->targetlsn = PageGetLSN(state->target);
+
+		opaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+
+		if (P_IGNORE(opaque))
+		{
+			if (P_RIGHTMOST(opaque))
+				ereport(CORRUPTION,
+						(errcode(ERRCODE_INDEX_CORRUPTED),
+						 errmsg("block %u fell off the end of index \"%s\"",
+								current, RelationGetRelationName(state->rel))));
+			else
+				ereport(CONCERN,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("block %u of index \"%s\" ignored",
+								current, RelationGetRelationName(state->rel))));
+			goto nextpage;
+		}
+		else if (nextleveldown.leftmost == InvalidBlockNumber)
+		{
+			/*
+			 * A concurrent page split could make the caller supplied leftmost
+			 * block no longer contain the leftmost page, or no longer be the
+			 * true root, but where that isn't possible due to heavyweight
+			 * locking, check that the first valid page meets caller's
+			 * expectations.
+			 */
+			if (state->exclusivelock)
+			{
+				if (!P_LEFTMOST(opaque))
+					ereport(CORRUPTION,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("block %u is not leftmost in index \"%s\"",
+									current, RelationGetRelationName(state->rel))));
+
+				if (level.istruerootlevel && !P_ISROOT(opaque))
+					ereport(CORRUPTION,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("block %u is not true root in index \"%s\"",
+									current, RelationGetRelationName(state->rel))));
+			}
+
+			/*
+			 * Before beginning any non-trivial examination of level, establish
+			 * state to be passed by caller for next call here, if this isn't
+			 * last call/leaf level.
+			 *
+			 * There should be at least one non-ignorable page per level.
+			 */
+			if (!P_ISLEAF(opaque))
+			{
+				IndexTuple		itup;
+				ItemId			itemid;
+
+				/* Internal page -- downlink gets leftmost on next level */
+				itemid = PageGetItemId(state->target, P_FIRSTDATAKEY(opaque));
+				itup = (IndexTuple) PageGetItem(state->target, itemid);
+				nextleveldown.leftmost = ItemPointerGetBlockNumber(&(itup->t_tid));
+				nextleveldown.level = opaque->btpo.level - 1;
+			}
+			else
+			{
+				/*
+				 * Leaf page -- final level caller must process.
+				 *
+				 * Note that this could also be the root page, if there has
+				 * been no root page split yet.
+				 */
+				nextleveldown.leftmost = P_NONE;
+				nextleveldown.level = InvalidBtreeLevel;
+			}
+
+			/*
+			 * Finished setting up state for this call/level.  Control will
+			 * never end up back here in any future loop iteration for this
+			 * level.
+			 */
+		}
+
+		if (state->exclusivelock && opaque->btpo_prev != leftcurrent)
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("right link/left link pair in index \"%s\" not in mutual agreement",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Deleted block=%u left block=%u link reported block=%u.",
+										current, leftcurrent, opaque->btpo_prev)));
+
+		/* Verify invariants for page -- all important checks occur here */
+		bt_target_page_check(state);
+
+nextpage:
+
+		/* Try to detect circular links */
+		if (current == leftcurrent || current == opaque->btpo_prev)
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("circular link chain found in block %u of index \"%s\"",
+							current, RelationGetRelationName(state->rel))));
+
+		leftcurrent = current;
+		current = opaque->btpo_next;
+
+		/* Free page and associated memory for this iteration */
+		MemoryContextReset(state->targetcontext);
+	}
+	while (current != P_NONE);
+
+	/* Don't change context for caller */
+	MemoryContextSwitchTo(oldcontext);
+
+	return nextleveldown;
+}
+
+/*
+ * bt_target_page_check()
+ *
+ * Function performs the following checks on target page, or pages ancillary to
+ * target page:
+ *
+ * - That every "real" data item is less than or equal to the high key, which
+ *   is an upper bound on the items on the pages (where there is a high key at
+ *   all -- pages that are rightmost lack one).
+ *
+ * - That within the page, every "real" item is less than or equal to the item
+ *   immediately to its right, if any (i.e., that the items are in order within
+ *   the page, so that the binary searches performed by index scans are sane).
+ *
+ * - That the last item stored on the page is less than or equal to the first
+ *   "real" data item on the page to the right (if such a first item is
+ *   available).
+ *
+ * Furthermore, when state passed shows ExclusiveLock held, and target page is
+ * internal page, function also checks:
+ *
+ * - That all child pages respect downlinks lower bound.
+ *
+ * Note:  This routine is not especially proactive in freeing memory.  High
+ * watermark memory consumption is bound to some small fixed multiple of
+ * BLCKSZ, though.  Caller should reset the current context between calls here.
+ */
+static void
+bt_target_page_check(BtreeCheckState *state)
+{
+	OffsetNumber	offset;
+	OffsetNumber	max;
+	BTPageOpaque	topaque;
+
+	topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+	max = PageGetMaxOffsetNumber(state->target);
+
+	elog(POSITION, "verifying %u items on %s block %u", max,
+		 P_ISLEAF(topaque) ? "leaf":"internal", state->targetblock);
+
+	/*
+	 * Loop over page items, but don't start from P_HIKEY (don't have iteration
+	 * directly considering high key item, if any).  That's something that is
+	 * used as part of verifying all other items, but doesn't get its own
+	 * iteration.
+	 */
+	for (offset = P_FIRSTDATAKEY(topaque);
+		 offset <= max;
+		 offset = OffsetNumberNext(offset))
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+		ScanKey		skey;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Don't try to generate scankey using "minus infinity" garbage data */
+		if (OFFSET_IS_MINUS_INFINITY(topaque, offset))
+			continue;
+
+		/* Build insertion scankey for current page offset */
+		itemid = PageGetItemId(state->target, offset);
+		itup = (IndexTuple) PageGetItem(state->target, itemid);
+		skey = _bt_mkscankey(state->rel, itup);
+
+		/*
+		 *   ********************
+		 *   * High key check   *
+		 *   ********************
+		 *
+		 * If there is a high key, which there must be for a non-rightmost
+		 * page, check that it actually is upper bound on all page items.
+		 *
+		 * We prefer to check all items, rather than checking just the first
+		 * and trusting that the operator class obeys the transitive law (which
+		 * implies that all subsequent items also respected the high key
+		 * invariant if they pass the item order check).
+		 *
+		 * Ideally, we'd compare every item in the index against every other
+		 * item in the index, and not trust opclass obedience of the transitive
+		 * law to bridge the gap between children and their grandparents (as
+		 * well as great-grandparents, and so on).  We don't go to those
+		 * lengths because that would be prohibitively expensive, and probably
+		 * not markedly more effective in practice.
+		 */
+		if (!P_RIGHTMOST(topaque) &&
+			!invariant_key_less_than_equal_offset(state, skey, P_HIKEY))
+		{
+			char	   *itid,  *htid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumber(&(itup->t_tid)),
+							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("high key invariant violated for index \"%s\"",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Index tid=%s points to %s tid=%s "
+										"page lsn=%X/%X.",
+										itid,
+										P_ISLEAF(topaque) ? "heap":"index",
+										htid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
+
+		/*
+		 *   ********************
+		 *   * Item order check *
+		 *   ********************
+		 *
+		 * Check that items are stored on page in logical order, by checking
+		 * current item is less than or equal to next item (if any).
+		 */
+		if (OffsetNumberNext(offset) <= max &&
+			!invariant_key_less_than_equal_offset(state, skey,
+												  OffsetNumberNext(offset)))
+		{
+			char	   *itid,  *htid,
+					   *nitid, *nhtid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumber(&(itup->t_tid)),
+							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+			nitid = psprintf("(%u,%u)", state->targetblock,
+							 OffsetNumberNext(offset));
+
+			/* Reuse itup to get pointed-to heap location of second item */
+			itemid = PageGetItemId(state->target, OffsetNumberNext(offset));
+			itup = (IndexTuple) PageGetItem(state->target, itemid);
+			nhtid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumber(&(itup->t_tid)),
+							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("item order invariant violated for index \"%s\"",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Lower index tid=%s (points to %s tid=%s) "
+										"higher index tid=%s (points to %s tid=%s) "
+										"page lsn=%X/%X.",
+										itid,
+										P_ISLEAF(topaque) ? "heap":"index",
+										htid,
+										nitid,
+										P_ISLEAF(topaque) ? "heap":"index",
+										nhtid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
+
+		/*
+		 *   ********************
+		 *   * Last item check  *
+		 *   ********************
+		 *
+		 * Check last item against next/right page's first data item's when
+		 * last item on page is reached.  This additional check can detect
+		 * transposed pages.
+		 *
+		 * This check is similar to the item order check that will have already
+		 * been performed for every other "real" item on target page when last
+		 * item is checked.  The difference is that the next item (the item
+		 * that is compared to target's last item) needs to come from the
+		 * next/sibling page.  There may not be such an item available from
+		 * sibling for various reasons, though (e.g., target is the rightmost
+		 * page on level).
+		 */
+		else if (offset == max)
+		{
+			ScanKey		rightkey;
+
+			/* Get item in next/right page */
+			rightkey = bt_right_page_check_scankey(state);
+
+			if (rightkey &&
+				!invariant_key_greater_than_equal_offset(state, rightkey, max))
+			{
+				/*
+				 * As explained at length in bt_right_page_check_scankey(),
+				 * there is a known !exclusivelock race that could account for
+				 * apparent violation of invariant, which we must check for
+				 * before actually proceeding with raising error.  Our canary
+				 * condition is that target page was deleted.
+				 */
+				if (!state->exclusivelock)
+				{
+					/* Get fresh copy of target page */
+					state->target = palloc_btree_page(state, state->targetblock);
+					/* Note that we deliberately do not update target LSN */
+					topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+
+					/* All !exclusivelock checks now performed; just return */
+					if (P_IGNORE(topaque))
+						return;
+				}
+
+				ereport(CORRUPTION,
+						(errcode(ERRCODE_INDEX_CORRUPTED),
+						 errmsg("cross page item order invariant violated for index \"%s\"",
+								RelationGetRelationName(state->rel)),
+						 errdetail_internal("Last item on page tid=(%u,%u) "
+											"page lsn=%X/%X.",
+											state->targetblock, offset,
+											(uint32) (state->targetlsn >> 32),
+											(uint32) state->targetlsn)));
+			}
+		}
+
+		/*
+		 *   ********************
+		 *   * Downlink check   *
+		 *   ********************
+		 *
+		 * Additional check of child items iff this is an internal page and
+		 * caller holds an ExclusiveLock.  This happens for every downlink
+		 * (item) in target excluding the minus-infinity downlink (again, this
+		 * is because it has no useful value to compare).
+		 */
+		if (!P_ISLEAF(topaque) && state->exclusivelock)
+		{
+			BlockNumber childblock = ItemPointerGetBlockNumber(&(itup->t_tid));
+
+			bt_downlink_check(state, childblock, skey);
+		}
+	}
+}
+
+/*
+ * bt_right_page_check_scankey()
+ *
+ * Return a scankey for an item on page to right of current target (or the
+ * first non-ignorable page), sufficient to check ordering invariant on last
+ * item in current target page.  Returned scankey relies on local memory
+ * allocated for the child page, which caller cannot pfree().  Caller's memory
+ * context should be reset between calls here.
+ *
+ * This is the first data item, and so all adjacent items are checked against
+ * their immediate sibling item (which may be on a sibling page, or even a
+ * "cousin" page at parent boundaries where target's rightlink points to page
+ * with different parent page).  If no such valid item is available, return
+ * NULL instead.
+ *
+ * Note that !exclusivelock callers must reverify that target page has not been
+ * concurrently deleted.
+ */
+static ScanKey
+bt_right_page_check_scankey(BtreeCheckState *state)
+{
+	BTPageOpaque	opaque;
+	ItemId			rightitem;
+	BlockNumber		targetnext;
+	Page			rightpage;
+	OffsetNumber	nline;
+
+	/* Determine target's next block number */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+
+	/* If target is already rightmost, no right sibling; nothing to do here */
+	if (P_RIGHTMOST(opaque))
+		return NULL;
+
+	/*
+	 * General notes on concurrent page splits and page deletion:
+	 *
+	 * Routines like _bt_search() don't require *any* page split interlock when
+	 * descending the tree, including something very light like a buffer pin.
+	 * That's why it's okay that we don't either.  This avoidance of any need
+	 * to "couple" buffer locks is the raison d' etre of the Lehman & Yao
+	 * algorithm, in fact.
+	 *
+	 * That leaves deletion.  A deleted page won't actually be recycled by
+	 * VACUUM early enough for us to fail to at least follow its right link (or
+	 * left link, or downlink) and find its sibling, because recycling does not
+	 * occur until no possible index scan could land on the page.  Index scans
+	 * can follow links with nothing more than their snapshot as an interlock
+	 * and be sure of at least that much.  (See page recycling/RecentGlobalXmin
+	 * notes in nbtree README.)
+	 *
+	 * Furthermore, it's okay if we follow a rightlink and find a half-dead or
+	 * dead (ignorable) page one or more times.  There will either be a further
+	 * right link to follow that leads to a live page before too long (before
+	 * passing by parent's rightmost child), or we will find the end of the
+	 * entire level instead (possible when parent page is itself the rightmost
+	 * on its level).
+	 */
+	targetnext = opaque->btpo_next;
+	for (;;)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		rightpage = palloc_btree_page(state, targetnext);
+		opaque = (BTPageOpaque) PageGetSpecialPointer(rightpage);
+
+		if (!P_IGNORE(opaque) || P_RIGHTMOST(opaque))
+			break;
+
+		/* We landed on a deleted page, so step right to find a live page */
+		targetnext = opaque->btpo_next;
+		ereport(CONCERN,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("level %u leftmost page of index \"%s\" was found deleted or half dead",
+						opaque->btpo.level, RelationGetRelationName(state->rel)),
+				 errdetail_internal("Deleted page found when building scankey from right sibling.")));
+
+		/* Be slightly more pro-active in freeing this memory, just in case */
+		pfree(rightpage);
+	}
+
+	/*
+	 * No ExclusiveLock held case -- why it's safe to proceed.
+	 *
+	 * Problem:
+	 *
+	 * We must avoid false positive reports of corruption when caller treats
+	 * item returned here as an upper bound on target's last item.  In general,
+	 * false positives are disallowed.  Ensuring they don't happen in
+	 * !exclusivelock case is subtle.
+	 *
+	 * A concurrent page deletion by VACUUM of the target page can result in
+	 * the insertion of items on to this right sibling page that would
+	 * previously have been inserted on our target page.  There might have been
+	 * insertions that followed target's downlink after it was made to point to
+	 * right sibling instead of target by page deletion's first phase.  The
+	 * inserters insert items that would belong on target page.  This race is
+	 * very tight, but it's possible.  This is our only problem.
+	 *
+	 * Non-problems:
+	 *
+	 * We are not hindered by a concurrent page split of the target; we'll
+	 * never land on the second half of the page anyway.  A concurrent split of
+	 * the right page will also not matter, because the first data item remains
+	 * the same within the left half, which we'll reliably land on.  If we had
+	 * to skip over ignorable/deleted pages, it cannot matter because their key
+	 * space has already been atomically merged with the first non-ignorable
+	 * page we eventually find (doesn't matter whether the page we eventually
+	 * find is a true sibling or a cousin of target, which we go into below).
+	 *
+	 * Solution:
+	 *
+	 * Caller knows that it should reverify that target is not ignorable
+	 * (half-dead or deleted) when cross-page sibling item comparison appears
+	 * to indicate corruption (invariant fails).  This detects the single race
+	 * condition that exists for caller.  This is correct because the continued
+	 * existence of target block as non-ignorable (not half-dead or deleted)
+	 * implies that target page was not merged into from the right by deletion;
+	 * the key space at or after target never moved left.  Target's parent
+	 * either has the same downlink to target as before, or a <= downlink due
+	 * to deletion at the left of target.  Target either has the same highkey
+	 * as before, or a highkey <= before when there is a page split. (The
+	 * rightmost concurrently-split-from-target-page page will still have the
+	 * same highkey as target was originally found to have, which for our
+	 * purposes is equivalent to target's highkey itself never changing, since
+	 * we reliably skip over concurrently-split-from-target-page pages.)
+	 *
+	 * In simpler terms, we allow that the key space of the target may expand
+	 * left (the key space can move left on the left side of target only), but
+	 * the target key space cannot expand right and get ahead of us without our
+	 * detecting it.  The key space of the target cannot shrink, unless it
+	 * shrinks to zero due to the deletion of the original page, our canary
+	 * condition.  (To be very precise, we're a bit stricter than that because
+	 * it might just have been that the target page split and only the original
+	 * target page was deleted.  We can be more strict, just not more lax.)
+	 *
+	 * Top level tree walk caller moves on to next page (makes it the new
+	 * target) following recovery from this race.  (cf.  The rationale for
+	 * child/downlink verification needing an ExclusiveLock within
+	 * bt_downlink_check(), where page deletion is also the main source of
+	 * trouble.)
+	 *
+	 * Note that it doesn't matter if right sibling page here is actually a
+	 * cousin page, because in order for the key space to be readjusted in a
+	 * way that causes us issues in next level up (guiding problematic
+	 * concurrent insertions to the cousin from the grandparent rather than to
+	 * the sibling from the parent), there'd have to be page deletion of
+	 * target's parent page (affecting target's parent's downlink in target's
+	 * grandparent page).  Internal page deletion only occurs when there are no
+	 * child pages (they were all fully deleted), and caller is checking that
+	 * the target's parent has at least one non-deleted (so non-ignorable)
+	 * child: the target page.  (Note that the first phase of deletion
+	 * atomically marks the page to be deleted half-dead/ignorable at the same
+	 * time downlink in its parent is removed, so caller will definitely not
+	 * fail to detect that this happened.)
+	 *
+	 * This trick is inspired by the method backward scans use for dealing with
+	 * concurrent page splits; concurrent page deletion is a problem that
+	 * similarly receives special consideration sometimes (it's possible that
+	 * the backwards scan will re-read its "original" block after failing to
+	 * find a right-link to it, having already moved in the opposite direction
+	 * (right/"forwards") a few times to try to locate one).  Just like us,
+	 * that happens only to determine if there was a concurrent page deletion
+	 * of a reference page, and just like us if there was a page deletion of
+	 * that reference page it means we can move on from caring about the
+	 * reference page.  See the nbtree README for a full description of how
+	 * that works.
+	 */
+	nline = PageGetMaxOffsetNumber(rightpage);
+
+	/*
+	 * Get first data item, if any
+	 */
+	if (P_ISLEAF(opaque) && nline >= P_FIRSTDATAKEY(opaque))
+	{
+		/* Return first data item (if any) */
+		rightitem = PageGetItemId(rightpage, P_FIRSTDATAKEY(opaque));
+	}
+	else if (!P_ISLEAF(opaque) &&
+			 nline >= OffsetNumberNext(P_FIRSTDATAKEY(opaque)))
+	{
+		/* Return first item after the internal page's "minus infinity" item */
+		rightitem = PageGetItemId(rightpage,
+								  OffsetNumberNext(P_FIRSTDATAKEY(opaque)));
+	}
+	else
+	{
+		/*
+		 * No first item.  Page is probably empty leaf page, but it's also
+		 * possible that it's an internal page with only a minus infinity item.
+		 */
+		ereport(CONCERN,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("%s block %u of index \"%s\" has no first data item",
+						P_ISLEAF(opaque) ? "leaf":"internal", targetnext,
+						RelationGetRelationName(state->rel))));
+		return NULL;
+	}
+
+	/*
+	 * Return first real item scankey.  Note that this relies on right page
+	 * memory remaining allocated.
+	 */
+	return _bt_mkscankey(state->rel,
+						 (IndexTuple) PageGetItem(rightpage, rightitem));
+}
+
+/*
+ * bt_downlink_check()
+ *
+ * Checks one of target's downlink against its child page.
+ *
+ * Conceptually, the target page continues to be what is checked here.  The
+ * target block is still blamed in the event of finding an invariant violation.
+ * The downlink insertion into the target is probably where any problem raised
+ * here arises, and there is no such thing as a parent link, so doing the
+ * verification this way around is much more practical.
+ */
+static void
+bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
+				  ScanKey targetkey)
+{
+	OffsetNumber	offset;
+	OffsetNumber	maxoffset;
+	Page			child;
+	BTPageOpaque	copaque;
+
+	/*
+	 * Caller must have ExclusiveLock on target relation, because of
+	 * considerations around page deletion by VACUUM.
+	 *
+	 * N.B.: In general, page deletion deletes the right sibling's downlink,
+	 * not the downlink of the page being deleted; the deleted page's downlink
+	 * is reused for its sibling.  The key space is thereby consolidated
+	 * between the deleted page and its right sibling.  (We cannot delete a
+	 * parent page's rightmost child unless it is the last child page, and we
+	 * intend to also delete the parent itself.)
+	 *
+	 * If this verification happened without an ExclusiveLock, the following
+	 * race condition could cause false positives:
+	 *
+	 * In general, concurrent page deletion might occur, including deletion of
+	 * the left sibling of the child page that is examined here.  If such a
+	 * page deletion were to occur, closely followed by an insertion into the
+	 * newly expanded key space of the child, a window for the false positive
+	 * opens up: the stale parent/target downlink originally followed to get to
+	 * the child legitimately ceases to be a lower bound on all items in the
+	 * page, since the key space was concurrently expanded "left". (Insertion
+	 * followed the "new" downlink for the child, not our now-stale downlink,
+	 * which was concurrently physically removed in target/parent as part of
+	 * deletion's first phase.)
+	 *
+	 * Note that while the cross-page-same-level last item check uses a trick
+	 * that allows it to perform verification for !exclusivelock callers, a
+	 * similar trick seems difficult here.  The trick that that other check
+	 * uses is, in essence, to lock down race conditions to those that occur
+	 * due to concurrent page deletion of the target; that's a race that can be
+	 * reliably detected before actually reporting corruption.
+	 *
+	 * On the other hand, we'd need to lock down race conditions involving
+	 * deletion of child's left page, for long enough to read the child page
+	 * into memory (in other words, a scheme with concurrently held buffer
+	 * locks on both child and left-of-child pages).  That's unacceptable for
+	 * amcheck functions on general principle, though.
+	 */
+	Assert(state->exclusivelock);
+
+	/*
+	 * Verify child page has the downlink key from target page (its parent) as
+	 * a lower bound.
+	 *
+	 * We prefer to check all items, rather than checking just the first and
+	 * trusting that the operator class obeys the transitive law.
+	 */
+	child = palloc_btree_page(state, childblock);
+	copaque = (BTPageOpaque) PageGetSpecialPointer(child);
+	maxoffset = PageGetMaxOffsetNumber(child);
+
+	for (offset = P_FIRSTDATAKEY(copaque);
+		 offset <= maxoffset;
+		 offset = OffsetNumberNext(offset))
+	{
+		/*
+		 * Skip comparison of target page key against "minus infinity" item, if
+		 * any.  Checking it would indicate that it's not an upper bound, but
+		 * that's only because of the hard-coding within _bt_compare().
+		 */
+		if (OFFSET_IS_MINUS_INFINITY(copaque, offset))
+			continue;
+
+		if (!invariant_key_less_than_equal_nontarget_offset(state, child,
+															targetkey, offset))
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("down-link lower bound invariant violated for index \"%s\"",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Parent block=%u child index tid=(%u,%u) "
+										"parent page lsn=%X/%X.",
+										state->targetblock, childblock, offset,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+	}
+
+	pfree(child);
+}
+
+/*
+ * invariant_key_less_than_equal_offset()
+ *
+ * Does the invariant hold that the key is less than or equal to a given upper
+ * bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static bool
+invariant_key_less_than_equal_offset(BtreeCheckState *state, ScanKey key,
+									 OffsetNumber upperbound)
+{
+	int16		natts = state->rel->rd_rel->relnatts;
+	int32		cmp;
+
+	cmp = _bt_compare(state->rel, natts, key, state->target, upperbound);
+
+	return cmp <= 0;
+}
+
+/*
+ * invariant_key_greater_than_equal_offset()
+ *
+ * Does the invariant hold that the key is greater than or equal to a given
+ * lower bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static bool
+invariant_key_greater_than_equal_offset(BtreeCheckState *state, ScanKey key,
+										OffsetNumber lowerbound)
+{
+	int16		natts = state->rel->rd_rel->relnatts;
+	int32		cmp;
+
+	cmp = _bt_compare(state->rel, natts, key, state->target, lowerbound);
+
+	return cmp >= 0;
+}
+
+/*
+ * invariant_key_less_than_equal_nontarget_offset()
+ *
+ * Does the invariant hold that the key is less than or equal to a given upper
+ * bound offset item, with the offset relating to a caller-supplied page that
+ * is not the current target page? Caller's non-target page is typically a
+ * child page of the target, checked as part of checking a property of the
+ * target page (i.e. the key comes from the target).
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static bool
+invariant_key_less_than_equal_nontarget_offset(BtreeCheckState *state,
+											   Page nontarget, ScanKey key,
+											   OffsetNumber upperbound)
+{
+	int16		natts = state->rel->rd_rel->relnatts;
+	int32		cmp;
+
+	cmp = _bt_compare(state->rel, natts, key, nontarget, upperbound);
+
+	return cmp <= 0;
+}
+
+/*
+ * palloc_btree_page()
+ *
+ * Given a block number of a B-Tree page, return page in palloc()'d memory.
+ * While at it, perform some basic checks of the page.
+ *
+ * There is never an attempt to get a consistent view of multiple pages using
+ * multiple concurrent buffer locks; in general, we prefer to have only one pin
+ * and buffer lock at a time, which is often all that the nbtree code requires.
+ *
+ * Operating on a copy of the page is useful because it prevents control
+ * getting stuck in an uninterruptible state when an underlying operator class
+ * misbehaves.
+ */
+static Page
+palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
+{
+	Buffer			buffer;
+	Page			page;
+	BTPageOpaque	opaque;
+
+	page = palloc(BLCKSZ);
+
+	/*
+	 * We copy the page into local storage to avoid holding pin on the buffer
+	 * longer than we must.
+	 */
+	buffer = ReadBufferExtended(state->rel, MAIN_FORKNUM, blocknum, RBM_NORMAL,
+								state->checkstrategy);
+	LockBuffer(buffer, BT_READ);
+
+	/*
+	 * Perform the same basic sanity checking that nbtree itself performs for
+	 * every page:
+	 */
+	_bt_checkpage(state->rel, buffer);
+
+	/* Only use copy of page in palloc()'d memory */
+	memcpy(page, BufferGetPage(buffer), BLCKSZ);
+	UnlockReleaseBuffer(buffer);
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	if (opaque->btpo_flags & BTP_META && blocknum != BTREE_METAPAGE)
+		ereport(CORRUPTION,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("invalid meta page found at block %u in index \"%s\"",
+						blocknum, RelationGetRelationName(state->rel))));
+
+	/* Check page from block that ought to be meta page */
+	if (blocknum == BTREE_METAPAGE)
+	{
+		BTMetaPageData *metad = BTPageGetMeta(page);
+
+		if (!(opaque->btpo_flags & BTP_META) ||
+			metad->btm_magic != BTREE_MAGIC)
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("index \"%s\" meta page is corrupt",
+							RelationGetRelationName(state->rel))));
+
+		if (metad->btm_version != BTREE_VERSION)
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("version mismatch in index \"%s\": file version %d, code version %d",
+							RelationGetRelationName(state->rel),
+							metad->btm_version, BTREE_VERSION)));
+	}
+
+	/*
+	 * Deleted pages have no sane "level" field, so can only check non-deleted
+	 * page level
+	 */
+	if (P_ISLEAF(opaque) && !P_ISDELETED(opaque) && opaque->btpo.level != 0)
+		ereport(CORRUPTION,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("invalid leaf page level %u for block %u in index \"%s\"",
+						opaque->btpo.level, blocknum, RelationGetRelationName(state->rel))));
+
+	if (blocknum != BTREE_METAPAGE && !P_ISLEAF(opaque) &&
+		!P_ISDELETED(opaque) && opaque->btpo.level == 0)
+		ereport(CORRUPTION,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("invalid internal page level 0 for block %u in index \"%s\"",
+						opaque->btpo.level, RelationGetRelationName(state->rel))));
+
+	if (!P_ISLEAF(opaque) && P_HAS_GARBAGE(opaque))
+		ereport(CORRUPTION,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("internal page block %u in index \"%s\" has garbage items",
+						blocknum, RelationGetRelationName(state->rel))));
+
+	return page;
+}
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
new file mode 100644
index 0000000..58bd0bd
--- /dev/null
+++ b/doc/src/sgml/amcheck.sgml
@@ -0,0 +1,275 @@
+<!-- doc/src/sgml/amcheck.sgml -->
+
+<sect1 id="amcheck" xreflabel="amcheck">
+ <title>amcheck</title>
+
+ <indexterm zone="amcheck">
+  <primary>amcheck</primary>
+ </indexterm>
+
+ <para>
+  The <filename>amcheck</> module provides functions that allow you to
+  verify the logical consistency of the structure of indexes.  If the
+  structure appears to be valid, no error is raised.
+ </para>
+
+ <para>
+  The functions verify various <emphasis>invariants</> in the
+  structure of the representation of particular indexes.  The
+  correctness of the access method functions behind index scans and
+  other important operations relies on these invariants always
+  holding.  For example, certain functions verify, among other things,
+  that all B-Tree pages have items in <quote>logical</> order (e.g.,
+  for B-Tree indexes on <type>text</>, index tuples should be in
+  collated lexical order).  If that particular invariant somehow fails
+  to hold, we can expect binary searches on the affected page to
+  incorrectly guide index scans, resulting in wrong answers to SQL
+  queries.  Problems like this can be very subtle.
+ </para>
+ <para>
+  Verification is performed using the same procedures as those used by
+  index scans themselves, which may be user-defined operator class
+  code.  For example, B-Tree index verification relies on comparisons
+  made with one or more B-Tree support function 1 routines.  See <xref
+  linkend="xindex-support"> for details of operator class support
+  functions.
+ </para>
+ <para>
+  <filename>amcheck</> functions may be used only by superusers.
+ </para>
+
+ <sect2>
+  <title>Functions</title>
+
+  <variablelist>
+   <varlistentry>
+    <term>
+     <function>bt_index_check(index regclass) returns void</function>
+     <indexterm>
+      <primary>bt_index_check</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <function>bt_index_check</function> tests that its target, a
+      B-Tree index, respects a variety of invariants.  Example usage:
+<screen>
+test=# SELECT bt_index_check(c.oid), c.relname, c.relpages
+FROM pg_index i
+JOIN pg_opclass op ON i.indclass[0] = op.oid
+JOIN pg_am am ON op.opcmethod = am.oid
+JOIN pg_class c ON i.indexrelid = c.oid
+JOIN pg_namespace n ON c.relnamespace = n.oid
+WHERE am.amname = 'btree' AND n.nspname = 'pg_catalog'
+-- Don't check temp tables, which may be from another session:
+AND c.relpersistence != 't'
+-- Function may throw an error when this is omitted:
+AND i.indisready AND i.indisvalid
+ORDER BY c.relpages DESC LIMIT 10;
+ bt_index_check |             relname             | relpages 
+----------------+---------------------------------+----------
+                | pg_depend_reference_index       |       43
+                | pg_depend_depender_index        |       40
+                | pg_proc_proname_args_nsp_index  |       31
+                | pg_description_o_c_o_index      |       21
+                | pg_attribute_relid_attnam_index |       14
+                | pg_proc_oid_index               |       10
+                | pg_attribute_relid_attnum_index |        9
+                | pg_amproc_fam_proc_index        |        5
+                | pg_amop_opr_fam_index           |        5
+                | pg_amop_fam_strat_index         |        5
+(10 rows)
+</screen>
+      This example shows a session that performs verification of every
+      catalog index in the database <quote>test</>.  Details of just
+      the 10 largest indexes verified are displayed.  Since no error
+      is raised, all indexes tested appear to be logically consistent.
+      Naturally, this query could easily be changed to call
+      <function>bt_index_check</function> for every index in the
+      database where verification is supported.
+     </para>
+     <para>
+      An <literal>AccessShareLock</> is acquired on the target index
+      by <function>bt_index_check</function>.  This lock mode is the
+      same lock mode acquired on relations by simple
+      <literal>SELECT</> statements.
+      <function>bt_index_check</function> does not verify invariants
+      that span child/parent relationships, nor does it verify that
+      the target index is consistent with its heap relation.  When a
+      routine, lightweight test for corruption is required in a live
+      production environment, using
+      <function>bt_index_check</function> often provides the best
+      trade-off between thoroughness of verification and limiting the
+      impact on application performance and availability.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
+     <function>bt_index_parent_check(index regclass) returns void</function>
+     <indexterm>
+      <primary>bt_index_parent_check</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <function>bt_index_parent_check</function> tests that its
+      target, a B-Tree index, respects a variety of invariants.  The
+      checks performed by <function>bt_index_parent_check</function>
+      are a superset of the checks performed by
+      <function>bt_index_check</function>.
+      <function>bt_index_parent_check</function> can be thought of as
+      a more thorough variant of <function>bt_index_check</function>:
+      unlike <function>bt_index_check</function>,
+      <function>bt_index_parent_check</function> also checks
+      invariants that span parent/child relationships.  However, it
+      does not verify that the target index is consistent with its
+      heap relation.  <function>bt_index_parent_check</function>
+      follows the general convention of raising an error if it finds a
+      logical inconsistency or other problem.
+     </para>
+     <para>
+      An <literal>ExclusiveLock</> is required on the target index by
+      <function>bt_index_parent_check</function> (a
+      <literal>ShareLock</> is also acquired on the heap relation).
+      These locks prevent concurrent data modification from
+      <command>INSERT</>, <command>UPDATE</>, and <command>DELETE</>
+      commands.  The locks also prevent the underlying relation from
+      being concurrently processed by <command>VACUUM</> (and certain
+      other utility commands).  Note that the function holds locks for
+      as short a duration as possible, so there is no advantage to
+      verifying each index individually in a series of transactions,
+      unless long running queries happen to be of particular concern.
+     </para>
+     <para>
+      <function>bt_index_parent_check</function>'s additional
+      verification is more likely to detect various pathological
+      cases.  These cases may involve an incorrectly implemented
+      B-Tree operator class used by the index that is checked, or,
+      hypothetically, undiscovered bugs in the underlying B-Tree index
+      access method code.  Note that
+      <function>bt_index_parent_check</function> cannot be used when
+      Hot Standby mode is enabled (i.e., on read-only physical
+      replicas), unlike <function>bt_index_check</function>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </sect2>
+
+ <sect2>
+  <title>Using <filename>amcheck</> effectively</title>
+
+ <para>
+  <filename>amcheck</> can be effective at detecting various types of
+  failure modes that <link
+  linkend="app-initdb-data-checksums"><application>data page
+  checksums</></link> will always fail to catch.  These include:
+
+  <itemizedlist>
+   <listitem>
+    <para>
+     Structural inconsistencies caused by incorrect operator class
+     implementations.
+    </para>
+    <para>
+     This includes issues caused by the comparison rules of operating
+     system collations changing. Comparisons of datums of a collatable
+     type like <type>text</> must be immutable (just as all
+     comparisons used for B-Tree index scans must be immutable), which
+     implies that operating system collation rules must never change.
+     Though rare, updates to operating system collation rules can
+     cause these issues. More commonly, an inconsistency in the
+     collation order between a master server and a standby server is
+     implicated, possibly because the <emphasis>major</> operating
+     system version in use is inconsistent.  Such inconsistencies will
+     generally only arise on standby servers, and so can generally
+     only be detected on standby servers.
+    </para>
+    <para>
+     If a problem like this arises, it may not affect each individual
+     index that is ordered using an affected collation, simply because
+     <emphasis>indexed</> values might happen to have the same
+     absolute ordering regardless of the behavioral inconsistency. See
+     <xref linkend="locale"> and <xref linkend="collation"> for
+     further details about how <productname>PostgreSQL</> uses
+     operating system locales and collations.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     Corruption caused by hypothetical undiscovered bugs in the
+     underlying <productname>PostgreSQL</> access method code or sort
+     code.
+    </para>
+    <para>
+     Automatic verification of the structural integrity of indexes
+     plays a role in the general testing of new or proposed
+     <productname>PostgreSQL</> features that could plausibly allow a
+     logical inconsistency to be introduced.  One obvious testing
+     strategy is to call <filename>amcheck</> functions continuously
+     when running the standard regression tests.  See <xref
+     linkend="regress-run"> for details on running the tests.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     Filesystem or storage subsystem faults where checksums happen to
+     simply not be enabled.
+    </para>
+    <para>
+     Note that <filename>amcheck</> examines a page as represented in some
+     shared memory buffer at the time of verification if there is only a
+     shared buffer hit when accessing the block. Consequently,
+     <filename>amcheck</> does not necessarily examine data read from the
+     filesystem at the time of verification. Note that when checksums are
+     enabled, <filename>amcheck</> may raise an error due to a checksum
+     failure when a corrupt block is read into a buffer.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     Corruption caused by faulty RAM, and the broader memory subsystem
+     and operating system.
+    </para>
+    <para>
+     <productname>PostgreSQL</> does not protect against correctable
+     memory errors and it is assumed you will operate using RAM that
+     uses industry standard Error Correcting Codes (ECC) or better
+     protection.  However, ECC memory is typically only immune to
+     single-bit errors, and should not be assumed to provide
+     <emphasis>absolute</emphasis> protection against failures that
+     result in memory corruption.
+    </para>
+   </listitem>
+  </itemizedlist>
+  In general, <filename>amcheck</> can only prove the presence of
+  corruption; it cannot prove its absence.
+ </para>
+
+ </sect2>
+ <sect2>
+  <title>Repairing corruption</title>
+ <para>
+  No error concerning corruption raised by <filename>amcheck</> should
+  ever be a false positive.  In practice, <filename>amcheck</> is more
+  likely to find software bugs than problems with hardware.
+  <filename>amcheck</> raises errors in the event of conditions that,
+  by definition, should never happen, and so careful analysis of
+  <filename>amcheck</> errors is often required.
+ </para>
+ <para>
+  There is no general method of repairing problems that
+  <filename>amcheck</> detects.  An explanation for the root cause of
+  an invariant violation should be sought.  <xref
+  linkend="pageinspect"> may play a useful role in diagnosing
+  corruption that <filename>amcheck</> detects.  A <command>REINDEX</>
+  may not be effective in repairing corruption.
+ </para>
+
+ </sect2>
+
+</sect1>
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index c8708ec..54102d6 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -103,6 +103,7 @@ CREATE EXTENSION <replaceable>module_name</> FROM unpackaged;
  </para>
 
  &adminpack;
+ &amcheck;
  &auth-delay;
  &auto-explain;
  &bloom;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 69649a7..39827be 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -106,6 +106,7 @@
 <!-- contrib information -->
 <!ENTITY contrib         SYSTEM "contrib.sgml">
 <!ENTITY adminpack       SYSTEM "adminpack.sgml">
+<!ENTITY amcheck         SYSTEM "amcheck.sgml">
 <!ENTITY auth-delay      SYSTEM "auth-delay.sgml">
 <!ENTITY auto-explain    SYSTEM "auto-explain.sgml">
 <!ENTITY bloom           SYSTEM "bloom.sgml">
-- 
2.7.4

#61

Thomas Munro

thomas.munro@enterprisedb.com

about 9 years ago

In reply to: Peter Geoghegan (#55)

1 attachment(s)

Re: amcheck (B-Tree integrity checking tool)

On Fri, Nov 18, 2016 at 12:17 PM, Peter Geoghegan <pg@heroku.com> wrote:

So, the Btree you show is supposed to be good? You've left it up to me
to draw T2, the corrupt version? If so, here is a drawing of T2 for
the sake of clarity:

Btree = [a|b]
/ \
/ \
/ \
[a]-------[b]
| |
| |
[c]-------[d]

Actually I meant that at T2 the btree is exactly same as it was at T1,
but the order of the values has changed according to the comparator
(for example, because a collation definition changed when you updated
your operating system), so that the btree is now corrupted. Based on
Goetz Graefe's paper I was wondering if amcheck would be unable to
detect this due to the cousin problem.

Now c and b have been switched around. Won't amcheck still pass since
we only verify immediate downlinks and siblings? Yet searches for b
will take a wrong turn at the root. Do I have this right? I wonder
if there is a way to verify that each page's high key < the 'next' key
for all ancestors.

I don't think you have it right, because your argument is predicated
on classic L&Y, not the Postgres variant. The high key is literally
guaranteed to be *equal* to the initial first value of the right half
of a split page post-split (at least initially), which is where the
new downlink for the new right half comes from in turn, so it will be
exactly equal as well. There is a comment which says all this within
_bt_insert_parent(), actually.

In general, I think it's pretty bad that we're so loose about
representing the key space in downlinks compared to classic L&Y,
because it probably makes index bloat a lot worse in internal pages,
messing up the key space long term. As long as we're doing that,
though, we clearly cannot usefully "verify that each page's high key <
the 'next' key for all ancestors", because we deviate from L&Y in a
way that makes that check always fail. It's okay that it always fails
for us (it doesn't actually indicate corruption), because of our
duplicate-aware index scan logic (change to invariant).

In summary, I'm pretty sure that our "Ki <= v <= Ki+1" thing makes the
cousin problem a non-issue, because we don't follow downlinks that are
equal to our scankey -- we go to their immediate left and start
*there*, in order to support duplicates in leaf pages. Index scans for
'b' work for both T1 and T2 in Postgres, because we refuse to follow
the 'b' downlinks when doing an index scan for 'b' (relevant to T2). I
would actually like to change things to make the invariant the classic
L&Y "Ki < v <= Ki+1", to avoid bloat in the internal pages and to make
suffix truncation in internal pages work.

So, we don't have the cousin problem, but since I wish for us to adopt
the stricter classic L&Y invariant at some point, you could say that I
also wished that we had the cousin problem.

Thank you for your patient explanation! Please see my humble attempt
to test this scenario and learn something about btrees in the process,
attached. amcheck does detect the violation of the high key
invariant.

Confused yet? :-)

Yes. But it's getting better slowly :-)

--
Thomas Munro
http://www.enterprisedb.com

#62

Peter Geoghegan

pg@heroku.com

about 9 years ago

In reply to: Thomas Munro (#61)

Re: amcheck (B-Tree integrity checking tool)

On Wed, Nov 23, 2016 at 2:47 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Actually I meant that at T2 the btree is exactly same as it was at T1,
but the order of the values has changed according to the comparator
(for example, because a collation definition changed when you updated
your operating system), so that the btree is now corrupted. Based on
Goetz Graefe's paper I was wondering if amcheck would be unable to
detect this due to the cousin problem.

I love that paper (it's more like a book, really). I'm glad that at
least one other person in the community has read some of it, because I
think it has a lot of clues as to how we can begin to address some of
our problems around bloat without novel redesign. Most of the
techniques are quite complementary.

I wrote a prototype of suffix truncation for text in two days last
week. The regression tests pass, and the technique works well for
certain cases, particularly with many column indexes without
correlation (the index-only-scan case). Let's see if that becomes a
real patch.

Thank you for your patient explanation!

Your questions were excellent, and so deserved my careful consideration.

Please see my humble attempt
to test this scenario and learn something about btrees in the process,
attached. amcheck does detect the violation of the high key
invariant.

This is a nice pattern for testing out if the keyspace is sane, and
how things work. I might incorporate it into my own testing in the
future.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#63

Haribabu Kommi

kommi.haribabu@gmail.com

about 9 years ago

In reply to: Peter Geoghegan (#62)

Re: amcheck (B-Tree integrity checking tool)

On Wed, Nov 30, 2016 at 9:06 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Nov 23, 2016 at 2:47 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Actually I meant that at T2 the btree is exactly same as it was at T1,
but the order of the values has changed according to the comparator
(for example, because a collation definition changed when you updated
your operating system), so that the btree is now corrupted. Based on
Goetz Graefe's paper I was wondering if amcheck would be unable to
detect this due to the cousin problem.

I love that paper (it's more like a book, really). I'm glad that at
least one other person in the community has read some of it, because I
think it has a lot of clues as to how we can begin to address some of
our problems around bloat without novel redesign. Most of the
techniques are quite complementary.

I wrote a prototype of suffix truncation for text in two days last
week. The regression tests pass, and the technique works well for
certain cases, particularly with many column indexes without
correlation (the index-only-scan case). Let's see if that becomes a
real patch.

Thank you for your patient explanation!

Your questions were excellent, and so deserved my careful consideration.

Please see my humble attempt
to test this scenario and learn something about btrees in the process,
attached. amcheck does detect the violation of the high key
invariant.

This is a nice pattern for testing out if the keyspace is sane, and
how things work. I might incorporate it into my own testing in the
future.

Moved to next CF with " needs review" status.

Regards,
Hari Babu
Fujitsu Australia

#64

Michael Paquier

michael.paquier@gmail.com

almost 9 years ago

In reply to: Haribabu Kommi (#63)

Re: amcheck (B-Tree integrity checking tool)

On Mon, Dec 5, 2016 at 2:15 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

Moved to next CF with " needs review" status.

Same, to CF 2017-03 this time.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#65

Andres Freund

andres@anarazel.de

almost 9 years ago

In reply to: Peter Geoghegan (#60)

Re: amcheck (B-Tree integrity checking tool)

Hi Heikki,

On 2016-11-22 10:56:07 -0800, Peter Geoghegan wrote:

I attach V4 of amcheck.

Is there any chance you could look at the concurrency related parts of
amcheck? I'd like to push this to be mergeable, but that area is a bit
outside of my area of expertise...

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#66

Andres Freund

andres@anarazel.de

almost 9 years ago

In reply to: Peter Geoghegan (#60)

Re: amcheck (B-Tree integrity checking tool)

Hi,

On 2016-11-22 10:56:07 -0800, Peter Geoghegan wrote:

Andres said: "Theoretically we should recheck that the oids still
match, theoretically the locking here allows the index->table mapping
to change". I don't know how to act on this feedback, since comparable
index + heap locking code should have the same problem, but doesn't
consider it. How is this any different to reindex_index()?

reindex_index isn't necessarily comparable, because
RangeVarCallbackForReindexIndex() is, on first blush at least, careful
enough about the locking, and reindex_relation only gets the index list
after building the index:
/*
* Find and lock index, and check permissions on table; use callback to
* obtain lock on table first, to avoid deadlock hazard. The lock level
* used here must match the index lock obtained in reindex_index().
*/

I think you ought to do something similar.

I'd really like to have something like relation_check(), index_check()
which calls the correct functions for the relevan index types (and
possibly warns about !nbtree indexes not being checked?).

@@ -0,0 +1,1227 @@
+/*-------------------------------------------------------------------------
+ *
+ * verify_nbtree.c
+ *		Verifies the integrity of nbtree indexes based on invariants.
+ *
+ * For B-Tree indexes, verification includes the invariant that each page in
+ * the target index has items in logical order as reported by an insertion
+ * scankey (the insertion scankey sort-wise NULL semantics are needed for
+ * verification).

I'd appreciate some rephrasing of this paragraph. "includes the
invariant that" isn't that easy to understand for an ESL person.

+ *
+ * Copyright (c) 2016, PostgreSQL Global Development Group

... ;)

+#define CHECK_SUPERUSER() { \
+		if (!superuser()) \
+			ereport(ERROR, \
+					(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
+					 (errmsg("must be superuser to use verification functions")))); }

Please make this either a function or just copy - the macro doesn't buy
us anything except weirder looking syntax.

+/*
+ * Callers to verification functions should never receive a false positive
+ * indication of corruption.  Therefore, when using amcheck functions for
+ * stress testing, it may be useful to temporally change the CORRUPTION elevel
+ * to PANIC, to immediately halt the server in the event of detecting an
+ * invariant condition violation.  This may preserve more information about the
+ * nature of the underlying problem.  Note that modifying the CORRUPTION
+ * constant to be an elevel < ERROR is not well tested.
+ */
+#ifdef NOT_USED
+#define CORRUPTION		PANIC
+#define CONCERN			WARNING
+#define POSITION		NOTICE
+#else
+#define CORRUPTION		ERROR
+#define CONCERN			DEBUG1
+#define POSITION		DEBUG2
+#endif

Please remove this block and use the normal elevels.

+/*
+ * State associated with verifying a B-Tree index
+ */
+typedef struct BtreeCheckState
+{
+	/*
+	 * Unchanging state, established at start of verification:
+	 *
+	 * rel:                 B-Tree Index Relation
+	 * exclusivelock:       ExclusiveLock held on rel; else AccessShareLock
+	 * checkstrategy:       Buffer access strategy
+	 * targetcontext:       Target page memory context
+	 */
+	Relation				rel;
+	bool					exclusivelock;
+	BufferAccessStrategy 	checkstrategy;
+	MemoryContext			targetcontext;

Can you move the field comments to the individual fields? Separating
them makes it more likely to get out of date (obviously keep the
header). I'd also rename 'exclusivelock' to 'exclusivelylocked' or
such.

+	/*
+	 * Mutable state, for verification of particular page:
+	 *
+	 * target:              Main target page
+	 * targetblock:         Main target page's block number
+	 * targetlsn:           Main target page's LSN

Same here.

+PG_FUNCTION_INFO_V1(bt_index_check);
+PG_FUNCTION_INFO_V1(bt_index_parent_check);

I think it'd be good to avoid adding a lot of functions for each
additional type of check we can think of. I think adding a named 'bool
check_parents = false' argument or such would be better. Then we can
simply add more and more checks subsequently, and we can have
'bool all_checks = false' so people can trivially check everything,
instead of having to add new checks in each new function.

+/*
+ * bt_index_check(index regclass)
+ *
+ * Verify integrity of B-Tree index.
+ *
+ * Only acquires AccessShareLock on index relation.  Does not consider
+ * invariants that exist between parent/child pages.
+ */
+Datum
+bt_index_check(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	Relation	indrel;
+
+	CHECK_SUPERUSER();
+
+	indrel = relation_open(relid, AccessShareLock);

+	/* Relation suitable for checking as B-Tree? */
+	btree_index_checkable(indrel);
+
+	/* Check index */
+	bt_check_every_level(indrel, false);
+
+	relation_close(indrel, AccessShareLock);

I'm *VERY* strongly against releasing locks on user relations early.

+/*
+ * bt_index_parent_check(index regclass)
+ *
+ * Verify integrity of B-Tree index.
+ *
+ * Acquires ExclusiveLock on index relation, and ShareLock on the associated
+ * heap relation, a bit like REINDEX.  Verifies that downlinks in parent pages
+ * are valid lower bounds on child pages.
+ */
+Datum
+bt_index_parent_check(PG_FUNCTION_ARGS)
+{
+	Oid			indrelid = PG_GETARG_OID(0);
+	Oid			heapid;
+	Relation	indrel;
+	Relation	heaprel;
+
+	CHECK_SUPERUSER();
+
+	/*
+	 * We must lock table before index to avoid deadlocks.  However, if the
+	 * passed indrelid isn't an index then IndexGetRelation() will fail.
+	 * Rather than emitting a not-very-helpful error message, postpone
+	 * complaining, expecting that the is-it-an-index test below will fail.
+	 */
+	heapid = IndexGetRelation(indrelid, true);
+	if (OidIsValid(heapid))
+		heaprel = heap_open(heapid, ShareLock);
+	else
+		heaprel = NULL;
+
+	/*
+	 * Open the target index relations separately (like relation_openrv(), but
+	 * with heap relation locked first to prevent deadlocking).  In hot standby
+	 * mode this will raise an error.
+	 */
+	indrel = index_open(indrelid, ExclusiveLock);

So yea, this is what I was complaining about re locking / index & table
relation IIRC. Note you're also deadlock prone by violating the rule
about always locking the relation first.

+	index_close(indrel, ExclusiveLock);
+	if (heaprel)
+		heap_close(heaprel, ShareLock);

Same thing about releasing locks early.

+/*
+ * btree_index_checkable()
+ *
+ * Basic checks about the suitability of a relation for checking as a B-Tree
+ * index.
+ */
+static void
+btree_index_checkable(Relation rel)
+{
+	if (rel->rd_rel->relkind != RELKIND_INDEX ||
+		rel->rd_rel->relam != BTREE_AM_OID)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("only nbtree access method indexes are supported"),
+				 errdetail("Relation \"%s\" is not a B-Tree index.",
+						   RelationGetRelationName(rel))));

I think this message could stand some rephrasing. There's a significant
overlap between msg and detail, and they use different phrasing for
btree.

+	if (!IndexIsValid(rel->rd_index))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot check index \"%s\"",
+						RelationGetRelationName(rel)),
+				 errdetail("Index is not valid")));

I'm perfectly fine with not changing that, but IIUC it'd be fine to
weaken this to IndexIsLive() - right?

+ * It is the caller's responsibility to acquire appropriate heavyweight lock on
+ * the index relation, and advise us if extra checks are safe when an
+ * ExclusiveLock is held.  An ExclusiveLock is generally assumed to prevent any
+ * kind of physical modification to the index structure, including
+ * modifications that VACUUM may make.

It might make sense to be a bit more explicit about the structure bit
(which I assume refers to the fact that modifications to individual
pages are possible (re killitems)?).

+ */
+static void
+bt_check_every_level(Relation rel, bool exclusivelock)

Hm, is every level accurate - this is more like checking every
"reachable" page, right?

+	/*
+	 * Certain deletion patterns can result in "skinny" B-Tree indexes, where
+	 * the fast root and true root differ.
+	 *
+	 * Start from the true root, not the fast root, unlike conventional index
+	 * scans.  This approach is more thorough, and removes the risk of
+	 * following a stale fast root from the meta page.
+	 */
+	if (metad->btm_fastroot != metad->btm_root)
+		ereport(CONCERN,
+				(errcode(ERRCODE_DUPLICATE_OBJECT),
+				 errmsg("fast root mismatch in index %s",
+						RelationGetRelationName(rel)),
+				 errdetail_internal("Fast block %u (level %u) "
+									"differs from true root "
+									"block %u (level %u).",
+									metad->btm_fastroot, metad->btm_fastlevel,
+									metad->btm_root, metad->btm_level)));

I'd weaken the message to something that sounds less like an actual
problem... - which this is not, right?

+/*
+ * bt_check_level_from_leftmost()
+ *
+ * Given a left-most block at some level, move right, verifying each page
+ * individually (with more verification across pages for "exclusivelock"
+ * callers).  Caller should pass the true root page as the leftmost initially,
+ * working their way down by passing what is returned for the last call here
+ * until level 0 (leaf page level) was reached.
+ *
+ * Returns state for next call, if any.

Wouldn't "'sets state for the next call..." be more accurate than
returning?

+static BtreeLevel
+bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
+{

+		if (P_IGNORE(opaque))
+		{
+			if (P_RIGHTMOST(opaque))
+				ereport(CORRUPTION,
+						(errcode(ERRCODE_INDEX_CORRUPTED),
+						 errmsg("block %u fell off the end of index \"%s\"",
+								current, RelationGetRelationName(state->rel))));

I think this could stand a bit more formal language.

+			else
+				ereport(CONCERN,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("block %u of index \"%s\" ignored",
+								current, RelationGetRelationName(state->rel))));

missing verb. Also, is this actually worth logging (even at such a low level)?

+			/*
+			 * Before beginning any non-trivial examination of level, establish
+			 * state to be passed by caller for next call here, if this isn't
+			 * last call/leaf level.

"state to be passed by caller for next call here" is hard to
understand. "Prepare state for next bt_check_level_from_leftmost
invocation for the next level"?

+			 * There should be at least one non-ignorable page per level.
+			 */

I might have missed something, but you're not checking that, are you?

+		if (state->exclusivelock && opaque->btpo_prev != leftcurrent)
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("right link/left link pair in index \"%s\" not in mutual agreement",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Deleted block=%u left block=%u link reported block=%u.",
+										current, leftcurrent, opaque->btpo_prev)));

"mutual" doesn't seem to add much here.

+		/* Try to detect circular links */
+		if (current == leftcurrent || current == opaque->btpo_prev)
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("circular link chain found in block %u of index \"%s\"",
+							current, RelationGetRelationName(state->rel))));

That's not really reliable though, is it? I've seen corrupted link
chains (leading to endless vacuums and such) a number of times, so maybe
explicitly keeping track of already visited pages within a level would
be worthwhile?

+/*
+ * bt_target_page_check()
+ *
+ * Function performs the following checks on target page, or pages ancillary to
+ * target page:
+ *
+ * - That every "real" data item is less than or equal to the high key, which
+ *   is an upper bound on the items on the pages (where there is a high key at
+ *   all -- pages that are rightmost lack one).
+ *
+ * - That within the page, every "real" item is less than or equal to the item
+ *   immediately to its right, if any (i.e., that the items are in order within
+ *   the page, so that the binary searches performed by index scans are sane).
+ *
+ * - That the last item stored on the page is less than or equal to the first
+ *   "real" data item on the page to the right (if such a first item is
+ *   available).

Hm. I think it'd be useful if we also made sure (or in the caller) that
the page actually is at the correct level. I seem to recall seing a
case where downlinks were "circular" due to corruption, leading to stuck
scans.

+ * Note:  This routine is not especially proactive in freeing memory.  High
+ * watermark memory consumption is bound to some small fixed multiple of
+ * BLCKSZ, though.  Caller should reset the current context between calls here.

I'd shorten this to "Memory allocated in this routine is expected to be
released by the caller resetting the ->xxx/current context".

+static void
+bt_target_page_check(BtreeCheckState *state)
+{
+	OffsetNumber	offset;
+	OffsetNumber	max;
+	BTPageOpaque	topaque;
+
+	topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+	max = PageGetMaxOffsetNumber(state->target);
+
+	elog(POSITION, "verifying %u items on %s block %u", max,
+		 P_ISLEAF(topaque) ? "leaf":"internal", state->targetblock);

I think it's fine if you don't, but this could theoretically integrate
with the progress stuff added in 9.6 (as in
pgstat_progress_start_command()). That'd allow for a lot more
convenient display of this than thousands of debug messages or nothing.

+	/*
+	 * Loop over page items, but don't start from P_HIKEY (don't have iteration
+	 * directly considering high key item, if any).

I don't understand the parenthetical comment as phrased.

+		/*
+		 *   ********************
+		 *   * High key check   *
+		 *   ********************
+		 *
+		 * If there is a high key, which there must be for a non-rightmost
+		 * page, check that it actually is upper bound on all page items.

"which there must be for" sounds weird.

+		 * We prefer to check all items, rather than checking just the first
+		 * and trusting that the operator class obeys the transitive law (which
+		 * implies that all subsequent items also respected the high key
+		 * invariant if they pass the item order check).

s/prefer to//

+		/*
+		 *   ********************
+		 *   * Item order check *
+		 *   ********************
+		 *
+		 * Check that items are stored on page in logical order, by checking
+		 * current item is less than or equal to next item (if any).
+		 */

Does that kind of comment look reasonable after pgindent? Actually,
could you just pindent this patch (including typedefs.list additions)
and repair the bad looking damage?

+		if (OffsetNumberNext(offset) <= max &&
+			!invariant_key_less_than_equal_offset(state, skey,
+												  OffsetNumberNext(offset)))
+		{
+			char	   *itid,  *htid,
+					   *nitid, *nhtid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumber(&(itup->t_tid)),
+							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+			nitid = psprintf("(%u,%u)", state->targetblock,
+							 OffsetNumberNext(offset));

Do all these separate psprintfs have much of a point? It seems just as
easy to includ ethem in the actual ereport?

+		 * next/sibling page.  There may not be such an item available from
+		 * sibling for various reasons, though (e.g., target is the rightmost
+		 * page on level).
+		 */

Is the comma after e.g. correct? </mega-pedanticism-from-esl-person>

+				ereport(CORRUPTION,
+						(errcode(ERRCODE_INDEX_CORRUPTED),
+						 errmsg("cross page item order invariant violated for index \"%s\"",
+								RelationGetRelationName(state->rel)),
+						 errdetail_internal("Last item on page tid=(%u,%u) "
+											"page lsn=%X/%X.",
+											state->targetblock, offset,
+											(uint32) (state->targetlsn >> 32),
+											(uint32) state->targetlsn)));

I'm am *wondering* (as in not sure), if we could get rid of all the "for
index ..." bits and use an errcontext for that instead, which'd be
included in all messages.

+	/*
+	 * General notes on concurrent page splits and page deletion:
+	 *
+	 * Routines like _bt_search() don't require *any* page split interlock when
+	 * descending the tree, including something very light like a buffer pin.
+	 * That's why it's okay that we don't either.  This avoidance of any need
+	 * to "couple" buffer locks is the raison d' etre of the Lehman & Yao
+	 * algorithm, in fact.

raison d'etre comment seems more like something belonging to
nbtree/README than this, but I don't care much.

+	/*
+	 * No ExclusiveLock held case -- why it's safe to proceed.
+	 *
+	 * Problem:
+	 *
+	 * We must avoid false positive reports of corruption when caller treats
+	 * item returned here as an upper bound on target's last item.  In general,
+	 * false positives are disallowed.  Ensuring they don't happen in
+	 * !exclusivelock case is subtle.

Maybe s/Ensuring/The reasons why they don't happen in ...'?

+	 * A concurrent page deletion by VACUUM of the target page can result in
+	 * the insertion of items on to this right sibling page that would

"on to this right sibling page"?

+	 * previously have been inserted on our target page.  There might have been
+	 * insertions that followed target's downlink after it was made to point to

missing trailing 'the'? I'm really not a native speaker (and thus might
be off the mark), but making these sentences a bit clearer really
wouldn't hurt.

+static void
+bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
+				  ScanKey targetkey)
+{
+	OffsetNumber	offset;
+	OffsetNumber	maxoffset;
+	Page			child;
+	BTPageOpaque	copaque;
+
+	/*
+	 * Caller must have ExclusiveLock on target relation, because of
+	 * considerations around page deletion by VACUUM.
+	 *
+	 * N.B.: In general, page deletion deletes the right sibling's downlink,

s/N.B./NB/ by precedence ;)

+	 * We prefer to check all items, rather than checking just the first and
+	 * trusting that the operator class obeys the transitive law.
+	 */

s/We prefer//

+	child = palloc_btree_page(state, childblock);
+	copaque = (BTPageOpaque) PageGetSpecialPointer(child);
+	maxoffset = PageGetMaxOffsetNumber(child);
+
+	for (offset = P_FIRSTDATAKEY(copaque);
+		 offset <= maxoffset;
+		 offset = OffsetNumberNext(offset))
+	{
+		/*
+		 * Skip comparison of target page key against "minus infinity" item, if
+		 * any.  Checking it would indicate that it's not an upper bound, but
+		 * that's only because of the hard-coding within _bt_compare().
+		 */
+		if (OFFSET_IS_MINUS_INFINITY(copaque, offset))
+			continue;

FWIW, I'd even convert this to an inline function rather than a macro.
Macros are useful when you have to do stuff you can't do otherwise, but
shouldn't be used much otherwise.

+		if (!invariant_key_less_than_equal_nontarget_offset(state, child,
+															targetkey, offset))
+			ereport(CORRUPTION,
+					(errcode(ERRCODE_INDEX_CORRUPTED),

What's your reasoning for differing sometimes using
ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE and sometimes ERRCODE_INDEX_CORRUPTED?

+ errmsg("down-link lower bound invaariant violated for index \"%s\"",

s/invaariant/invariant/

+/*
+ * invariant_key_greater_than_equal_offset()

I'd personally just remove the function name in these, but that's personal
taste and i'll leave you the decision.

- Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#67

Robert Haas

robertmhaas@gmail.com

almost 9 years ago

In reply to: Andres Freund (#66)

Re: amcheck (B-Tree integrity checking tool)

On Thu, Feb 9, 2017 at 3:05 AM, Andres Freund <andres@anarazel.de> wrote:

+#define CHECK_SUPERUSER() { \
+             if (!superuser()) \
+                     ereport(ERROR, \
+                                     (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
+                                      (errmsg("must be superuser to use verification functions")))); }

Please make this either a function or just copy - the macro doesn't buy
us anything except weirder looking syntax.

Better yet, remove the check altogether and just revoke access from
public. Then the default policy is superuser-only, but the DBA can
grant access to who they wish.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#68

Andres Freund

andres@anarazel.de

almost 9 years ago

In reply to: Robert Haas (#67)

Re: amcheck (B-Tree integrity checking tool)

On 2017-02-09 14:48:18 -0500, Robert Haas wrote:

On Thu, Feb 9, 2017 at 3:05 AM, Andres Freund <andres@anarazel.de> wrote:
+#define CHECK_SUPERUSER() { \
+             if (!superuser()) \
+                     ereport(ERROR, \
+                                     (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
+                                      (errmsg("must be superuser to use verification functions")))); }
Please make this either a function or just copy - the macro doesn't buy
us anything except weirder looking syntax.
Better yet, remove the check altogether and just revoke access from
public. Then the default policy is superuser-only, but the DBA can
grant access to who they wish.

That works for me as well.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#69

Peter Geoghegan

pg@bowt.ie

almost 9 years ago

In reply to: Andres Freund (#66)

Re: amcheck (B-Tree integrity checking tool)

On Thu, Feb 9, 2017 at 12:05 AM, Andres Freund <andres@anarazel.de> wrote:

reindex_index isn't necessarily comparable, because
RangeVarCallbackForReindexIndex() is, on first blush at least, careful
enough about the locking, and reindex_relation only gets the index list
after building the index:
/*
* Find and lock index, and check permissions on table; use callback to
* obtain lock on table first, to avoid deadlock hazard. The lock level
* used here must match the index lock obtained in reindex_index().
*/

I think you ought to do something similar.

Okay. Looking into it, but see my response to your later remarks below.

I'd really like to have something like relation_check(), index_check()
which calls the correct functions for the relevan index types (and
possibly warns about !nbtree indexes not being checked?).

I feel that that's something for an in-core facility that exposes this
through DDL. One good reason to not do that just yet is that I'm not
completely comfortable with what a uniform interface looks like here.
And frankly, I really just want to get this out of the way, to prove
the general concept of verification tooling. amcheck v1 is a good
start, but much work remains. I think that no one will take the idea
completely seriously until it is usable as a core extension. Let's get
the ball rolling.

+#define CHECK_SUPERUSER() { \
+             if (!superuser()) \
+                     ereport(ERROR, \
+                                     (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
+                                      (errmsg("must be superuser to use verification functions")))); }

Please make this either a function or just copy - the macro doesn't buy
us anything except weirder looking syntax.

It's actually from pageinspect, but okay.

+#define CORRUPTION           ERROR
+#define CONCERN                      DEBUG1
+#define POSITION             DEBUG2
+#endif

Please remove this block and use the normal elevels.

Okay.

+/*
+ * State associated with verifying a B-Tree index
+ */
+typedef struct BtreeCheckState
+{
+     /*
+      * Unchanging state, established at start of verification:
+      *
+      * rel:                 B-Tree Index Relation
+      * exclusivelock:       ExclusiveLock held on rel; else AccessShareLock
+      * checkstrategy:       Buffer access strategy
+      * targetcontext:       Target page memory context
+      */
+     Relation                                rel;
+     bool                                    exclusivelock;
+     BufferAccessStrategy    checkstrategy;
+     MemoryContext                   targetcontext;

Okay.

+     /*
+      * Mutable state, for verification of particular page:
+      *
+      * target:              Main target page
+      * targetblock:         Main target page's block number
+      * targetlsn:           Main target page's LSN

Same here.

Okay.

+PG_FUNCTION_INFO_V1(bt_index_check);
+PG_FUNCTION_INFO_V1(bt_index_parent_check);
I think it'd be good to avoid adding a lot of functions for each
additional type of check we can think of. I think adding a named 'bool
check_parents = false' argument or such would be better. Then we can
simply add more and more checks subsequently, and we can have
'bool all_checks = false' so people can trivially check everything,
instead of having to add new checks in each new function.

I strongly disagree. I would agree if the lock strength were the same
in both cases, but it isn't. Varying the strength of heavyweight lock
taken based on an argument to a function is a massive foot-gun IMV. I
do think that that's the logical way to split out functionality. I
don't think that there is likely to be a problem with adding stuff in
the future. It'll either be something that we can do in passing
anyway, in which case we can just add it to the existing checks
harmlessly. Or, it will be something like a tool that verifies the
heap is consistent with a given index, which is vastly more expensive
in terms of CPU and I/O costs, and yet is probably no worse than
bt_index_check() in terms of lock strength (AccessShareLock).

My thinking here is that users really ought to use bt_index_check() in
almost all cases, not bt_index_parent_check(), which is more of a tool
for hackers that are developing new B-Tree features. The distinction
is blurry, in fact, but the lock strength requirements of
bt_index_parent_check() make it vastly less useful. Especially
considering that it seems fairly unlikely that it will ever catch a
problem that bt_index_check() would not have. bt_index_parent_check()
will become a lot more important when we implement suffix truncation
in internal pages. (I have a very rough patch that does this, actually
-- it passes the regression tests, and verification by amcheck, but
isn't anywhere near ready to post.)

+/*
+ * bt_index_check(index regclass)
+ *
+ * Verify integrity of B-Tree index.
+ *
+ * Only acquires AccessShareLock on index relation.  Does not consider
+ * invariants that exist between parent/child pages.
+ */

I'm *VERY* strongly against releasing locks on user relations early.

Why? pageinspect does this. I think it could be really handy for users
that want to write SQL queries when using bt_index_parent_check().
It's a bit weird, in the sense that it's a special exception that has
to be explained sometimes, but I think it is worth it for something
like this.

+     /*
+      * We must lock table before index to avoid deadlocks.  However, if the
+      * passed indrelid isn't an index then IndexGetRelation() will fail.
+      * Rather than emitting a not-very-helpful error message, postpone
+      * complaining, expecting that the is-it-an-index test below will fail.
+      */
+     heapid = IndexGetRelation(indrelid, true);
+     if (OidIsValid(heapid))
+             heaprel = heap_open(heapid, ShareLock);
+     else
+             heaprel = NULL;
+
+     /*
+      * Open the target index relations separately (like relation_openrv(), but
+      * with heap relation locked first to prevent deadlocking).  In hot standby
+      * mode this will raise an error.
+      */
+     indrel = index_open(indrelid, ExclusiveLock);

So yea, this is what I was complaining about re locking / index & table
relation IIRC. Note you're also deadlock prone by violating the rule
about always locking the relation first.

This is based on brin_summarize_new_values(). I'm not doing a recheck,
though, because I don't actually need the heap relation for anything
(I suppose I imagined that that was okay, possibly failing to consider
some subtlety beyond "consistency is good"). If I added a recheck for
the heap relation in the style of brin_summarize_new_values(), would
you consider this item closed, and all heavyweight lock concerns
addressed?

The comments here are almost a direct lift -- is
brin_summarize_new_values() wrong to itself claim to be
deadlock-safe?.

(BTW, brin_summarize_new_values(), an SQL-callable function, also
releases heavyweight locks early.)

+/*
+ * btree_index_checkable()
+ *
+ * Basic checks about the suitability of a relation for checking as a B-Tree
+ * index.
+ */
+static void
+btree_index_checkable(Relation rel)
+{
+     if (rel->rd_rel->relkind != RELKIND_INDEX ||
+             rel->rd_rel->relam != BTREE_AM_OID)
+             ereport(ERROR,
+                             (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                              errmsg("only nbtree access method indexes are supported"),
+                              errdetail("Relation \"%s\" is not a B-Tree index.",
+                                                RelationGetRelationName(rel))));

I think this message could stand some rephrasing. There's a significant
overlap between msg and detail, and they use different phrasing for
btree.

Okay.

+     if (!IndexIsValid(rel->rd_index))
+             ereport(ERROR,
+                             (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                              errmsg("cannot check index \"%s\"",
+                                             RelationGetRelationName(rel)),
+                              errdetail("Index is not valid")));

I'm perfectly fine with not changing that, but IIUC it'd be fine to
weaken this to IndexIsLive() - right?

I believe so, but prefer to be more restrictive in the absence of a
good reason not to be.

+ * It is the caller's responsibility to acquire appropriate heavyweight lock on
+ * the index relation, and advise us if extra checks are safe when an
+ * ExclusiveLock is held.  An ExclusiveLock is generally assumed to prevent any
+ * kind of physical modification to the index structure, including
+ * modifications that VACUUM may make.
It might make sense to be a bit more explicit about the structure bit
(which I assume refers to the fact that modifications to individual
pages are possible (re killitems)?).

It's not really about killitems, which I explained won't matter here,
but I will add a bit about that.

+ */
+static void
+bt_check_every_level(Relation rel, bool exclusivelock)
Hm, is every level accurate - this is more like checking every
"reachable" page, right?

Every level should have at least one page that is reachable (not
half-dead or fully dead), unless perhaps VACUUM marks the last leaf
page in the entire index as half-dead and we land on that. I think
that that's such a rare case that it shouldn't alter how we name this
function at all, which really does do what it says almost always. Note
that there are pretty severe restrictions on when vacuum can do
page-level deletion of an internal page. This is what leads to index
bloat with sparse deletion patterns (one problem I want to address by
adding suffix truncation to internal pages at a later date).

+     /*
+      * Certain deletion patterns can result in "skinny" B-Tree indexes, where
+      * the fast root and true root differ.
+      *
+      * Start from the true root, not the fast root, unlike conventional index
+      * scans.  This approach is more thorough, and removes the risk of
+      * following a stale fast root from the meta page.
+      */
+     if (metad->btm_fastroot != metad->btm_root)
+             ereport(CONCERN,
+                             (errcode(ERRCODE_DUPLICATE_OBJECT),
+                              errmsg("fast root mismatch in index %s",
+                                             RelationGetRelationName(rel)),
+                              errdetail_internal("Fast block %u (level %u) "
+                                                                     "differs from true root "
+                                                                     "block %u (level %u).",
+                                                                     metad->btm_fastroot, metad->btm_fastlevel,
+                                                                     metad->btm_root, metad->btm_level)));

I'd weaken the message to something that sounds less like an actual
problem... - which this is not, right?

No, it isn't a real problem. I'll do something about it.

+/*
+ * bt_check_level_from_leftmost()
+ *
+ * Given a left-most block at some level, move right, verifying each page
+ * individually (with more verification across pages for "exclusivelock"
+ * callers).  Caller should pass the true root page as the leftmost initially,
+ * working their way down by passing what is returned for the last call here
+ * until level 0 (leaf page level) was reached.
+ *
+ * Returns state for next call, if any.

Wouldn't "'sets state for the next call..." be more accurate than
returning?

Okay.

+static BtreeLevel
+bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
+{

+             if (P_IGNORE(opaque))
+             {
+                     if (P_RIGHTMOST(opaque))
+                             ereport(CORRUPTION,
+                                             (errcode(ERRCODE_INDEX_CORRUPTED),
+                                              errmsg("block %u fell off the end of index \"%s\"",
+                                                             current, RelationGetRelationName(state->rel))));

I think this could stand a bit more formal language.

I disagree, because it's a direct lift from nbtree itself.

+                     else
+                             ereport(CONCERN,
+                                             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                                              errmsg("block %u of index \"%s\" ignored",
+                                                             current, RelationGetRelationName(state->rel))));

missing verb. Also, is this actually worth logging (even at such a low level)?

I'm not following -- "ignore" is a verb.

It may not be worth logging. I just don't know, and prefer to err on
the side of including it.

+                     /*
+                      * Before beginning any non-trivial examination of level, establish
+                      * state to be passed by caller for next call here, if this isn't
+                      * last call/leaf level.
"state to be passed by caller for next call here" is hard to
understand. "Prepare state for next bt_check_level_from_leftmost
invocation for the next level"?

Okay.

+                      * There should be at least one non-ignorable page per level.
+                      */
I might have missed something, but you're not checking that, are you?

No. As I said just now, I think it might not be true iff every single
item (every item at the leaf level) has been deleted and is being
vacuumed. If there was even one live item in one leaf page, it would
be guaranteed to have a live parent, which would itself be guaranteed
to if it wasn't the true root, and so on, until the true root. It's
really hard to "lose" a level by having the true root and fast root
differ, in general (of course, we can never truly lose a level -- the
pages can never be reclaimed entirely from a deleted level's last
page). I guess that this fast root stuff was added only for the
benefit of cases where there is continuous bulk deletion and
insertion.

"mutual" doesn't seem to add much here.

Okay.

+             /* Try to detect circular links */
+             if (current == leftcurrent || current == opaque->btpo_prev)
+                     ereport(CORRUPTION,
+                                     (errcode(ERRCODE_INDEX_CORRUPTED),
+                                      errmsg("circular link chain found in block %u of index \"%s\"",
+                                                     current, RelationGetRelationName(state->rel))));
That's not really reliable though, is it? I've seen corrupted link
chains (leading to endless vacuums and such) a number of times, so maybe
explicitly keeping track of already visited pages within a level would
be worthwhile?

It's not reliable -- it's one of the "shot in the dark, but why not?"
checks, which are the majority -- no reason to not be defensive.
However, it's highly likely that cases like the one you describe would
be caught by the cross-page check, or even high-key check, because
they're probably explained by an underlying problem with transposed
pages, maybe not even 8KiB aligned pages (e.g., buggy firmware
controllers have been known to do this with their 4KiB pages, but you
wouldn't necessarily notice that easily with an int4 index or
similar).

I feel relatively confident that we don't need to do better with that
one, since it will never make a difference; the only case where that
might be wrong that I can think of is where you have an awful lot of
duplicates, which is unfortunately not that uncommon with Postgres.
Even if it could help, that sounds like something that won't pay for
itself.

Hm. I think it'd be useful if we also made sure (or in the caller) that
the page actually is at the correct level. I seem to recall seing a
case where downlinks were "circular" due to corruption, leading to stuck
scans.

That's a good idea, because it is, if nothing else, an easy,
low-overhead additional check.

+ * Note:  This routine is not especially proactive in freeing memory.  High
+ * watermark memory consumption is bound to some small fixed multiple of
+ * BLCKSZ, though.  Caller should reset the current context between calls here.
I'd shorten this to "Memory allocated in this routine is expected to be
released by the caller resetting the ->xxx/current context".

Okay.

+static void
+bt_target_page_check(BtreeCheckState *state)
+{
+     OffsetNumber    offset;
+     OffsetNumber    max;
+     BTPageOpaque    topaque;
+
+     topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+     max = PageGetMaxOffsetNumber(state->target);
+
+     elog(POSITION, "verifying %u items on %s block %u", max,
+              P_ISLEAF(topaque) ? "leaf":"internal", state->targetblock);
I think it's fine if you don't, but this could theoretically integrate
with the progress stuff added in 9.6 (as in
pgstat_progress_start_command()). That'd allow for a lot more
convenient display of this than thousands of debug messages or nothing.

Never enough time. :-)

+     /*
+      * Loop over page items, but don't start from P_HIKEY (don't have iteration
+      * directly considering high key item, if any).

I don't understand the parenthetical comment as phrased.

It means that the high key item, which is physically the first item on
the leaf level, and the second item on internal levels. (That have an
explicit "minus infinity" downlink item), is skipped in the sense that
it doesn't get its own iteration of the loop (note that there is only
a high key on non-rightmost-at-level pages -- I like to say that the
rightmost pages have an implicit/imaginary "positive infinity" high
key, to illustrate the symmetry between high keys and downlinks.
Although, we don't say that in any code.)

"which there must be for" sounds weird.

s/prefer to//

Okay.

Does that kind of comment look reasonable after pgindent? Actually,
could you just pindent this patch (including typedefs.list additions)
and repair the bad looking damage?

I'll check.

Do all these separate psprintfs have much of a point? It seems just as
easy to includ ethem in the actual ereport?

I thought it looked very convoluted that way.

+              * next/sibling page.  There may not be such an item available from
+              * sibling for various reasons, though (e.g., target is the rightmost
+              * page on level).
+              */

Is the comma after e.g. correct? </mega-pedanticism-from-esl-person>

This is a matter of style. "e.g.," is generally considered American style.

+                             ereport(CORRUPTION,
+                                             (errcode(ERRCODE_INDEX_CORRUPTED),
+                                              errmsg("cross page item order invariant violated for index \"%s\"",
+                                                             RelationGetRelationName(state->rel)),
+                                              errdetail_internal("Last item on page tid=(%u,%u) "
+                                                                                     "page lsn=%X/%X.",
+                                                                                     state->targetblock, offset,
+                                                                                     (uint32) (state->targetlsn >> 32),
+                                                                                     (uint32) state->targetlsn)));

I'm am *wondering* (as in not sure), if we could get rid of all the "for
index ..." bits and use an errcontext for that instead, which'd be
included in all messages.

Doesn't seem worth it to me.

+     /*
+      * General notes on concurrent page splits and page deletion:
+      *
+      * Routines like _bt_search() don't require *any* page split interlock when
+      * descending the tree, including something very light like a buffer pin.
+      * That's why it's okay that we don't either.  This avoidance of any need
+      * to "couple" buffer locks is the raison d' etre of the Lehman & Yao
+      * algorithm, in fact.

raison d'etre comment seems more like something belonging to
nbtree/README than this, but I don't care much.

I think it adds value. This is a learning tool for nbtree. That's a
secondary goal, at least. We're *way* too shy at pointing out simple
facts like this, that are of very significant consequence. It's only
one sentence.

Maybe s/Ensuring/The reasons why they don't happen in ...'?

"on to this right sibling page"?

missing trailing 'the'? I'm really not a native speaker (and thus might
be off the mark), but making these sentences a bit clearer really
wouldn't hurt.

Okay. The sentences are very complicated, but that's justified by the
complexity of the underlying issue. I rewrote that text several times.

s/N.B./NB/ by precedence ;)

s/We prefer//

Okay.

+             if (OFFSET_IS_MINUS_INFINITY(copaque, offset))
+                     continue;
FWIW, I'd even convert this to an inline function rather than a macro.
Macros are useful when you have to do stuff you can't do otherwise, but
shouldn't be used much otherwise.

Okay.

+             if (!invariant_key_less_than_equal_nontarget_offset(state, child,
+                                                                                                                     targetkey, offset))
+                     ereport(CORRUPTION,
+                                     (errcode(ERRCODE_INDEX_CORRUPTED),

What's your reasoning for differing sometimes using
ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE and sometimes ERRCODE_INDEX_CORRUPTED?

The ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE ereports are not errors,
and do not indicate corruption.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#70

Andres Freund

andres@anarazel.de

almost 9 years ago

In reply to: Peter Geoghegan (#69)

Re: amcheck (B-Tree integrity checking tool)

On 2017-02-09 13:29:42 -0800, Peter Geoghegan wrote:

I'd really like to have something like relation_check(), index_check()
which calls the correct functions for the relevan index types (and
possibly warns about !nbtree indexes not being checked?).

I feel that that's something for an in-core facility that exposes this
through DDL. One good reason to not do that just yet is that I'm not
completely comfortable with what a uniform interface looks like here.
And frankly, I really just want to get this out of the way, to prove
the general concept of verification tooling. amcheck v1 is a good
start, but much work remains. I think that no one will take the idea
completely seriously until it is usable as a core extension. Let's get
the ball rolling.

I'm not convinced at all by that argument. If we want something in
core, having this in contrib doesn't help much. Nor do I see much
point in explicit syntax.

+#define CHECK_SUPERUSER() { \
+             if (!superuser()) \
+                     ereport(ERROR, \
+                                     (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
+                                      (errmsg("must be superuser to use verification functions")))); }
Please make this either a function or just copy - the macro doesn't buy
us anything except weirder looking syntax.
It's actually from pageinspect, but okay.

We have a lot of dodgy stuff in places ;)

+PG_FUNCTION_INFO_V1(bt_index_check);
+PG_FUNCTION_INFO_V1(bt_index_parent_check);
I think it'd be good to avoid adding a lot of functions for each
additional type of check we can think of. I think adding a named 'bool
check_parents = false' argument or such would be better. Then we can
simply add more and more checks subsequently, and we can have
'bool all_checks = false' so people can trivially check everything,
instead of having to add new checks in each new function.
I strongly disagree. I would agree if the lock strength were the same
in both cases, but it isn't. Varying the strength of heavyweight lock
taken based on an argument to a function is a massive foot-gun IMV.

Meh. I don't think that's a serious problem, nor is without precedent
(say same toplevel DDL with differing lock levels depending on
options). Nor do the function names confer that any more efficiently
than parameters.

I do think that that's the logical way to split out functionality. I
don't think that there is likely to be a problem with adding stuff in
the future. It'll either be something that we can do in passing
anyway, in which case we can just add it to the existing checks
harmlessly. Or, it will be something like a tool that verifies the
heap is consistent with a given index, which is vastly more expensive
in terms of CPU and I/O costs, and yet is probably no worse than
bt_index_check() in terms of lock strength (AccessShareLock).

But you quite possibly want to do all three in one pass - and that'll be
ugly with your API. With your API I'll not be able to do the parent
check and the heap check at once. Lest we add fourth function that does
all of it.

My thinking here is that users really ought to use bt_index_check() in
almost all cases, not bt_index_parent_check(), which is more of a tool
for hackers that are developing new B-Tree features. The distinction
is blurry, in fact, but the lock strength requirements of
bt_index_parent_check() make it vastly less useful.

A quite reasonable patter is to do this verification on standbys that
you can either bring up separately and/or just temporarily pause. Or
just backups before declaring them valid.

+/*
+ * bt_index_check(index regclass)
+ *
+ * Verify integrity of B-Tree index.
+ *
+ * Only acquires AccessShareLock on index relation.  Does not consider
+ * invariants that exist between parent/child pages.
+ */
I'm *VERY* strongly against releasing locks on user relations early.

Why? pageinspect does this.

Again, some parts of the code doing something bad isn't a good argument
for doing again. Releasing locks early is a bad pattern, because backend
code isn't generally safe against it, we have special code-paths for
catalog tables to support it for those.

+     /*
+      * We must lock table before index to avoid deadlocks.  However, if the
+      * passed indrelid isn't an index then IndexGetRelation() will fail.
+      * Rather than emitting a not-very-helpful error message, postpone
+      * complaining, expecting that the is-it-an-index test below will fail.
+      */
+     heapid = IndexGetRelation(indrelid, true);
+     if (OidIsValid(heapid))
+             heaprel = heap_open(heapid, ShareLock);
+     else
+             heaprel = NULL;
+
+     /*
+      * Open the target index relations separately (like relation_openrv(), but
+      * with heap relation locked first to prevent deadlocking).  In hot standby
+      * mode this will raise an error.
+      */
+     indrel = index_open(indrelid, ExclusiveLock);
So yea, this is what I was complaining about re locking / index & table
relation IIRC. Note you're also deadlock prone by violating the rule
about always locking the relation first.
This is based on brin_summarize_new_values().
I'm not doing a recheck,
though, because I don't actually need the heap relation for anything
(I suppose I imagined that that was okay, possibly failing to consider
some subtlety beyond "consistency is good").

We iirc rely on indexes not being locked without the heap also being
locked... I see no reason to be careful here.

The deadlock part of my complaint is bogus.

The comments here are almost a direct lift -- is
brin_summarize_new_values() wrong to itself claim to be
deadlock-safe?.

No, that seems ok.

(BTW, brin_summarize_new_values(), an SQL-callable function, also
releases heavyweight locks early.)

And that's afaics actually broken and a bug.

+     if (!IndexIsValid(rel->rd_index))
+             ereport(ERROR,
+                             (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                              errmsg("cannot check index \"%s\"",
+                                             RelationGetRelationName(rel)),
+                              errdetail("Index is not valid")));
I'm perfectly fine with not changing that, but IIUC it'd be fine to
weaken this to IndexIsLive() - right?
I believe so, but prefer to be more restrictive in the absence of a
good reason not to be.

Seems useful to be able to verify an index during CONCURRENTLY, after
the initial build, but before the update pass, it's not like we didn't
have bugs in it :(

+static void
+bt_check_every_level(Relation rel, bool exclusivelock)
Hm, is every level accurate - this is more like checking every
"reachable" page, right?
Every level should have at least one page that is reachable (not
half-dead or fully dead), unless perhaps VACUUM marks the last leaf
page in the entire index as half-dead and we land on that. I think
that that's such a rare case that it shouldn't alter how we name this
function at all, which really does do what it says almost always. Note
that there are pretty severe restrictions on when vacuum can do
page-level deletion of an internal page. This is what leads to index
bloat with sparse deletion patterns (one problem I want to address by
adding suffix truncation to internal pages at a later date).

I think my point was different - it's not that we're not checking every
level, it's that we're checking *more* than just "levels".

What's your reasoning for differing sometimes using
ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE and sometimes ERRCODE_INDEX_CORRUPTED?

The ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE ereports are not errors,
and do not indicate corruption.

Don't think errcodes make much sense for debug stuff, nor do I think
that ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE would be a good fit for
something that's not an error.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#71

Robert Haas

robertmhaas@gmail.com

almost 9 years ago

In reply to: Andres Freund (#70)

Re: amcheck (B-Tree integrity checking tool)

On Thu, Feb 9, 2017 at 4:57 PM, Andres Freund <andres@anarazel.de> wrote:

On 2017-02-09 13:29:42 -0800, Peter Geoghegan wrote:

I'd really like to have something like relation_check(), index_check()
which calls the correct functions for the relevan index types (and
possibly warns about !nbtree indexes not being checked?).

I feel that that's something for an in-core facility that exposes this
through DDL. One good reason to not do that just yet is that I'm not
completely comfortable with what a uniform interface looks like here.
And frankly, I really just want to get this out of the way, to prove
the general concept of verification tooling. amcheck v1 is a good
start, but much work remains. I think that no one will take the idea
completely seriously until it is usable as a core extension. Let's get
the ball rolling.

I'm not convinced at all by that argument. If we want something in
core, having this in contrib doesn't help much. Nor do I see much
point in explicit syntax.

I don't see any need for this to be in core, nor do I think we need
explicit syntax. But given that, I also don't think that we need the
generalized infrastructure you proposed in the quoted section above.

I strongly disagree. I would agree if the lock strength were the same
in both cases, but it isn't. Varying the strength of heavyweight lock
taken based on an argument to a function is a massive foot-gun IMV.

Meh. I don't think that's a serious problem, nor is without precedent
(say same toplevel DDL with differing lock levels depending on
options). Nor do the function names confer that any more efficiently
than parameters.

Really? check_btree(blah, true) vs. check_btree(blah, false) is no
more or less clear than check_btree_for_plausible_errors(blah) vs.
check_btree_in_unreasonable_detail(blah)?

I do think that that's the logical way to split out functionality. I
don't think that there is likely to be a problem with adding stuff in
the future. It'll either be something that we can do in passing
anyway, in which case we can just add it to the existing checks
harmlessly. Or, it will be something like a tool that verifies the
heap is consistent with a given index, which is vastly more expensive
in terms of CPU and I/O costs, and yet is probably no worse than
bt_index_check() in terms of lock strength (AccessShareLock).

But you quite possibly want to do all three in one pass - and that'll be
ugly with your API. With your API I'll not be able to do the parent
check and the heap check at once. Lest we add fourth function that does
all of it.

This however seems like a good point.

Again, some parts of the code doing something bad isn't a good argument
for doing again. Releasing locks early is a bad pattern, because backend
code isn't generally safe against it, we have special code-paths for
catalog tables to support it for those.

I don't agree with that. The reason we keep relation locks until the
end of the transaction is to avoid surprising application code, not
because the system can't tolerate it. But here it's more likely that
retaining the lock will surprise the user.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#72

Andres Freund

andres@anarazel.de

almost 9 years ago

In reply to: Robert Haas (#71)

Re: amcheck (B-Tree integrity checking tool)

On 2017-02-09 17:12:58 -0500, Robert Haas wrote:

On Thu, Feb 9, 2017 at 4:57 PM, Andres Freund <andres@anarazel.de> wrote:

Meh. I don't think that's a serious problem, nor is without precedent
(say same toplevel DDL with differing lock levels depending on
options). Nor do the function names confer that any more efficiently
than parameters.

Really? check_btree(blah, true) vs. check_btree(blah, false) is no
more or less clear than check_btree_for_plausible_errors(blah) vs.
check_btree_in_unreasonable_detail(blah)?

Except that the proposed names aren't remotely like that... ;).

And I proposed documenting named parameters, and
check_btree(performing_check_requiring_exclusive_locks => true) is just
about as expressive.

Again, some parts of the code doing something bad isn't a good argument
for doing again. Releasing locks early is a bad pattern, because backend
code isn't generally safe against it, we have special code-paths for
catalog tables to support it for those.

I don't agree with that. The reason we keep relation locks until the
end of the transaction is to avoid surprising application code, not
because the system can't tolerate it. But here it's more likely that
retaining the lock will surprise the user.

It's both. A bunch of code paths rely on early release ony happening on
catalogs. E.g., and that's just the first thing that comes to mind,
cache invalidations which really only are guaranteed to be delivered and
visibile for other cache accesses, if the lock is only released after
the end of the transaction. Object accesses like relation_open() accept
invalidations *after* acquiring the lock. Combined with lock releases
only happening *after* transaction commits that guarantees an up2date
view of the cache; but if others release the lock earlier and have cache
invalidations, that doesn't work anymore. Now you can argue that it's
possibly harmless here because there's presumably no way this can ever
trigger cache invals - and you're probably right - but this isn't the
only place with such assumption and you'd definitely have to argue *why*
it's right.

There were other issues than this, but right of the top of my head I
only remember:
double
IndexBuildHeapRangeScan(Relation heapRelation,
..
/*
* Since caller should hold ShareLock or better, normally
* the only way to see this is if it was inserted earlier
* in our own transaction. However, it can happen in
* system catalogs, since we tend to release write lock
* before commit there. Give a warning if neither case
* applies.
*/
xwait = HeapTupleHeaderGetXmin(heapTuple->t_data);
if (!TransactionIdIsCurrentTransactionId(xwait))
{
if (!is_system_catalog)
elog(WARNING, "concurrent insert in progress within table \"%s\"",
RelationGetRelationName(heapRelation));

which isn't an issue here, but reinforces my point about the (badly
documented) assumption that we don't release locks on user relations
early.

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#73

Peter Geoghegan

pg@bowt.ie

almost 9 years ago

In reply to: Andres Freund (#72)

Re: amcheck (B-Tree integrity checking tool)

On Thu, Feb 9, 2017 at 2:32 PM, Andres Freund <andres@anarazel.de> wrote:

Except that the proposed names aren't remotely like that... ;).

And I proposed documenting named parameters, and
check_btree(performing_check_requiring_exclusive_locks => true) is just
about as expressive.

I cannot think of an example, offhand, of where it isn't self-evident
what relation level locks will be acquired when a statement is
prepared. I guess you could say that VACUUM truncation is a bit like
that, but it's not the same IMV.

Maybe there is an easy compromise here. We can add an argument with a
default, "level". It's an "enum" of the style of pg_prewarm (i.e.,
it's actually a text argument). The default is 'medium' or something
like that. Maybe we don't add any other flags for now, or maybe we add
some generic ones that don't actually changed the behavior. Existing
names are retained.

Doesn't that address your concern about an inflexible API? I really
dislike the idea of having the lock strength be determined on the
basis of a function arguments value.

I don't agree with that. The reason we keep relation locks until the
end of the transaction is to avoid surprising application code, not
because the system can't tolerate it. But here it's more likely that
retaining the lock will surprise the user.

There were other issues than this, but right of the top of my head I
only remember:
double
IndexBuildHeapRangeScan(Relation heapRelation,
..
/*
* Since caller should hold ShareLock or better, normally
* the only way to see this is if it was inserted earlier
* in our own transaction. However, it can happen in
* system catalogs, since we tend to release write lock
* before commit there. Give a warning if neither case
* applies.
*/
xwait = HeapTupleHeaderGetXmin(heapTuple->t_data);
if (!TransactionIdIsCurrentTransactionId(xwait))
{
if (!is_system_catalog)
elog(WARNING, "concurrent insert in progress within table \"%s\"",
RelationGetRelationName(heapRelation));

which isn't an issue here, but reinforces my point about the (badly
documented) assumption that we don't release locks on user relations
early.

You are right about the substantive issue, I assume, but I have a hard
time agreeing with the idea that it's even badly documented when there
are at least 10 counter-examples of blithely doing this. In any case,
I find it very confusing when you talk about things as established
practice/coding standards when they're very clearly not consistently
adhered to at all. You should at least say "let's not make a bad
situation any worse", or something, so that I don't need to spend 10
minutes pulling my hair out.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#74

Peter Geoghegan

pg@bowt.ie

almost 9 years ago

In reply to: Peter Geoghegan (#73)

Re: amcheck (B-Tree integrity checking tool)

On Thu, Feb 9, 2017 at 2:47 PM, Peter Geoghegan <pg@bowt.ie> wrote:

which isn't an issue here, but reinforces my point about the (badly
documented) assumption that we don't release locks on user relations
early.

You are right about the substantive issue, I assume, but I have a hard
time agreeing with the idea that it's even badly documented when there
are at least 10 counter-examples of blithely doing this. In any case,
I find it very confusing when you talk about things as established
practice/coding standards when they're very clearly not consistently
adhered to at all. You should at least say "let's not make a bad
situation any worse", or something, so that I don't need to spend 10
minutes pulling my hair out.

BTW, aren't there cases where a PARALLEL SAFE function that needs to
acquire locks on some arbitrary relation not referenced in the query
can get locks on its own, which may only last as long as the parallel
worker's lifetime? This could happen *despite* the fact that the
author of the function may have imagined that callers could not
release relation level locks early (i.e., by not releasing them
directly, and so correctly following this "badly documented
assumption").

It seems like the existence of PARALLEL RESTRICTED is primarily
motivated by making stuff that definitely needs the locks to last
until xact end being sure that that happens -- the docs say so. This
seems wholly inconsistent with the idea that you're not supposed to
let that happen under any circumstances. I find all this highly
confusing. Have I missed some further subtlety that applies with
PARALLEL RESTRICTED?

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#75

Peter Geoghegan

pg@bowt.ie

almost 9 years ago

In reply to: Andres Freund (#72)

1 attachment(s)

Re: amcheck (B-Tree integrity checking tool)

On Thu, Feb 9, 2017 at 2:32 PM, Andres Freund <andres@anarazel.de> wrote:

Except that the proposed names aren't remotely like that... ;).

Revision attached -- V5. We now REVOKE ALL on both functions, as
Robert suggested, instead of the previous approach of having a
hard-coded superuser check with enforcement.

And I proposed documenting named parameters, and
check_btree(performing_check_requiring_exclusive_locks => true) is just
about as expressive.

I have not done this, nor have I renamed the functions. I still think
that this is something that we can fix by adding a boolean argument to
each function in the future, or something along those lines. I
*really* hate the idea of having one function with non-obvious,
variable requirements on locking, with locking implications that are
not knowable when we PREPARE an SQL statement calling the function. It
also removes a useful way of have superusers discriminate against the
stronger locking variant bt_index_parent_check() by not granting
execute on it (as an anti-footgun measure). I don't think that it's
too much to ask for you to concede this one point to me (and to
Robert).

I have acted on every other item you raised, without exception,
including retaining the locks for the duration of the transaction. I
also made the functions PARALLEL RESTRICTED, which seemed like a good
idea.

Other notes:

* A nice thing about the additional level check that you suggested is
that it's a cross-level check that is perfectly safe to perform with
only an AccessShareLock, unlike the extra bt_index_parent_check()
check, since the number of levels only ever increases, no matter what.
There is no possible reason why we should ever find that a child page
does not agree with its self-described parent about the level it's on.
I'm glad you thought of this.

* I kept ereport() DEBUG1 calls, but now use non-error errcode for
those. I thought that preserving the various ereport() fields was
worth it, since these traces are rather information dense in a way
that is harder to deal with when using raw elog()s. If you hate this,
you can still change it.

* I have not added code to check the relation permissions, which might
matter if and when the superuser grants execute privs on amcheck
functions. Anyone want to argue that I should have added
relation-level permissions checks? I tend to think it's pointless,
since the superuser almost certainly won't think to consider that
additional factor. I cannot imagine how either function could be used
to leak information, in any case. No error message dumps the contents
of pages (only item pointers).

* pgindent was run against this revision. You were right to want to
get that out of the way now, since it needed to be fixed up by hand to
look reasonable. typedef list also updated.

--
Peter Geoghegan

#76

Amit Kapila

amit.kapila16@gmail.com

almost 9 years ago

In reply to: Peter Geoghegan (#74)

Re: amcheck (B-Tree integrity checking tool)

On Fri, Feb 10, 2017 at 6:45 AM, Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Feb 9, 2017 at 2:47 PM, Peter Geoghegan <pg@bowt.ie> wrote:

which isn't an issue here, but reinforces my point about the (badly
documented) assumption that we don't release locks on user relations
early.

I think even if we don't need to retain locks till transaction end in
the present case, but Andres's point has a merit that it is good to
follow an undocumented rule of releasing locks at transaction end for
user relations unless there is some benefit in releasing the locks
early in some situation. I think that can avoid unnecessary worry
that whether it safe to do in some situation.

You are right about the substantive issue, I assume, but I have a hard
time agreeing with the idea that it's even badly documented when there
are at least 10 counter-examples of blithely doing this. In any case,
I find it very confusing when you talk about things as established
practice/coding standards when they're very clearly not consistently
adhered to at all. You should at least say "let's not make a bad
situation any worse", or something, so that I don't need to spend 10
minutes pulling my hair out.

BTW, aren't there cases where a PARALLEL SAFE function that needs to
acquire locks on some arbitrary relation not referenced in the query
can get locks on its own, which may only last as long as the parallel
worker's lifetime? This could happen *despite* the fact that the
author of the function may have imagined that callers could not
release relation level locks early (i.e., by not releasing them
directly, and so correctly following this "badly documented
assumption").

It seems like the existence of PARALLEL RESTRICTED is primarily
motivated by making stuff that definitely needs the locks to last
until xact end being sure that that happens -- the docs say so.

I don't think so. I think one of the primary reason of releasing
locks in workers is that we don't have any way to transfer them to the
leader which can release them at transaction end.

This
seems wholly inconsistent with the idea that you're not supposed to
let that happen under any circumstances. I find all this highly
confusing. Have I missed some further subtlety that applies with
PARALLEL RESTRICTED?

The basic idea of parallel restricted is that any expression will be
considered parallel restricted if that can't be executed in the
worker, for example, a reference to initplans in a query.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#77

Robert Haas

robertmhaas@gmail.com

almost 9 years ago

In reply to: Peter Geoghegan (#74)

Re: amcheck (B-Tree integrity checking tool)

On Thu, Feb 9, 2017 at 8:15 PM, Peter Geoghegan <pg@bowt.ie> wrote:

BTW, aren't there cases where a PARALLEL SAFE function that needs to
acquire locks on some arbitrary relation not referenced in the query
can get locks on its own, which may only last as long as the parallel
worker's lifetime? This could happen *despite* the fact that the
author of the function may have imagined that callers could not
release relation level locks early (i.e., by not releasing them
directly, and so correctly following this "badly documented
assumption").

Yes.

It seems like the existence of PARALLEL RESTRICTED is primarily
motivated by making stuff that definitely needs the locks to last
until xact end being sure that that happens -- the docs say so. This
seems wholly inconsistent with the idea that you're not supposed to
let that happen under any circumstances. I find all this highly
confusing. Have I missed some further subtlety that applies with
PARALLEL RESTRICTED?

That's by no means the only thing that could cause you to mark
something PARALLEL RESTRICTED. I think you might have missed this bit
of the documentation, a few paragraphs up:

Similarly, functions must be marked <literal>PARALLEL
RESTRICTED</> if they access temporary tables, client connection state,
cursors, prepared statements, or miscellaneous backend-local state which
the system cannot synchronize across workers. For example,
<literal>setseed</> and <literal>random</> are parallel restricted for
this last reason.

That stuff is the main motivation. The early lock release stuff could
come up for user-written functions, but the things listed in the
paragraph I've quoted here come up for plenty of built-in functions.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#78

Robert Haas

robertmhaas@gmail.com

almost 9 years ago

In reply to: Andres Freund (#72)

Re: amcheck (B-Tree integrity checking tool)

On Thu, Feb 9, 2017 at 5:32 PM, Andres Freund <andres@anarazel.de> wrote:

Again, some parts of the code doing something bad isn't a good argument
for doing again. Releasing locks early is a bad pattern, because backend
code isn't generally safe against it, we have special code-paths for
catalog tables to support it for those.

I don't agree with that. The reason we keep relation locks until the
end of the transaction is to avoid surprising application code, not
because the system can't tolerate it. But here it's more likely that
retaining the lock will surprise the user.

It's both. A bunch of code paths rely on early release ony happening on
catalogs. E.g., and that's just the first thing that comes to mind,
cache invalidations which really only are guaranteed to be delivered and
visibile for other cache accesses, if the lock is only released after
the end of the transaction. Object accesses like relation_open() accept
invalidations *after* acquiring the lock. Combined with lock releases
only happening *after* transaction commits that guarantees an up2date
view of the cache; but if others release the lock earlier and have cache
invalidations, that doesn't work anymore. Now you can argue that it's
possibly harmless here because there's presumably no way this can ever
trigger cache invals - and you're probably right - but this isn't the
only place with such assumption and you'd definitely have to argue *why*
it's right.

I agree that code which issues cache invalidations needs to hold a
blocking lock until commit IF it's critical for other backends to see
the update. That's not always the case; for example, if an ALTER
TABLE statement only changes the fillfactor the worst thing that
happens if some other backend fails to see the invalidations is that
some transactions continue to use the old value for a while. That's
why we now allow fillfactor and some other things using only
ShareUpdateExclusiveLock. That's not quite the same thing but the
effect is the same: we issue invalidations that other backends aren't
guaranteed to see in a timely fashion, and it works OK. Yeah, some
other session might not see the change for a while, but we don't view
that as a big problem.

But I think in this case that argument is basically a straw man. A
utility that's called amcheck is obviously not issuing any cache
invalidations. It's probably not even changing anything at all; if it
is, maybe it should be called amrepair. So the fact that CLUSTER has
to hold AEL until commit time really doesn't have anything to do with
whether amcheck needs to hold AES until commit time.

I don't really understand your demand for "an argument why it's
right". My argument is simple: I can't think of a single thing that
it would break, and I don't believe there is any such thing.
Furthermore, as Peter points out, we do this kind of thing all the
time in other places and it indeed does not break things. I think
you're trying to invent a coding rule that doesn't exist, but I can't
prove a negative. Your argument so far is that "yeah, well, maybe
inval doesn't emit sinval traffic (duh?) but it might do something
else that breaks, prove to me that there is no such thing". But
that's just like asking me to prove that it's impossible to exceed the
speed of light. On information and belief, nobody currently knows how
to do that and there is good scientific evidence that it cannot be
done, but there's no way to prove that someone won't in the future
discover a refinement in the physical laws of the universe as we
understand them today which sheds a different light on the situation.

From my point of view, it's likely to be common to want to run amcheck
in series on a bunch of indexes, like by writing a SELECT query
against pg_class or pg_index and passing the OIDs to amcheck. With
the retain-locks design, that query accumulates locks on a large
number of objects, possibly blowing out the lock table or blocking
DDL. I think that readily-forseeable problem ought to trump concerns
about purely hypothetical issues caused by releasing locks early. If
you can come up with an actual issue with releasing locks early that
applies to this particular case, then that's different, of course.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#79

Robert Haas

robertmhaas@gmail.com

almost 9 years ago

In reply to: Peter Geoghegan (#75)

Re: amcheck (B-Tree integrity checking tool)

On Fri, Feb 10, 2017 at 8:39 PM, Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Feb 9, 2017 at 2:32 PM, Andres Freund <andres@anarazel.de> wrote:

Except that the proposed names aren't remotely like that... ;).

Revision attached -- V5. We now REVOKE ALL on both functions, as
Robert suggested, instead of the previous approach of having a
hard-coded superuser check with enforcement.

And I proposed documenting named parameters, and
check_btree(performing_check_requiring_exclusive_locks => true) is just
about as expressive.

I have not done this, nor have I renamed the functions. I still think
that this is something that we can fix by adding a boolean argument to
each function in the future, or something along those lines. I
*really* hate the idea of having one function with non-obvious,
variable requirements on locking, with locking implications that are
not knowable when we PREPARE an SQL statement calling the function. It
also removes a useful way of have superusers discriminate against the
stronger locking variant bt_index_parent_check() by not granting
execute on it (as an anti-footgun measure).

I think Andres is more or less correct that
"performing_check_requiring_exclusive_locks => true" is just about as
expressive as calling a different function, but I think that your
point that the superuser might want to grant access to one function
but not the other is a good one. On the other hand, I think Andres
has a concern that we might have more modes in the future and we don't
want to end up with 2^n entrypoints. That also seems valid. Hmm.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#80

Andres Freund

andres@anarazel.de

almost 9 years ago

In reply to: Robert Haas (#78)

Re: amcheck (B-Tree integrity checking tool)

Hi,

On 2017-02-13 11:52:53 -0500, Robert Haas wrote:

On Thu, Feb 9, 2017 at 5:32 PM, Andres Freund <andres@anarazel.de> wrote:

Again, some parts of the code doing something bad isn't a good argument
for doing again. Releasing locks early is a bad pattern, because backend
code isn't generally safe against it, we have special code-paths for
catalog tables to support it for those.

I don't agree with that. The reason we keep relation locks until the
end of the transaction is to avoid surprising application code, not
because the system can't tolerate it. But here it's more likely that
retaining the lock will surprise the user.

It's both. A bunch of code paths rely on early release ony happening on
catalogs. E.g., and that's just the first thing that comes to mind,
cache invalidations which really only are guaranteed to be delivered and
visibile for other cache accesses, if the lock is only released after
the end of the transaction. Object accesses like relation_open() accept
invalidations *after* acquiring the lock. Combined with lock releases
only happening *after* transaction commits that guarantees an up2date
view of the cache; but if others release the lock earlier and have cache
invalidations, that doesn't work anymore. Now you can argue that it's
possibly harmless here because there's presumably no way this can ever
trigger cache invals - and you're probably right - but this isn't the
only place with such assumption and you'd definitely have to argue *why*
it's right.

I agree that code which issues cache invalidations needs to hold a
blocking lock until commit IF it's critical for other backends to see
the update.

Right. And I think that needs to be the default assumption, because
otherwise you need to be careful about all the relevant code paths. I'm
fine with adding something along the lines of
/*
* Release lock on relation early, this function doesn't generate any
* cache invalidations, and it's convenient for callers to be able to
* check multiple functions in one go.
*/
(after checking that's true of course) and then skipping the lock
release. What I'm absolutely not ok with is doing so without
consideration and documentation.

I don't really understand your demand for "an argument why it's
right". My argument is simple: I can't think of a single thing that
it would break, and I don't believe there is any such thing.

That's pretty much an argument. I just want documentation that calls
attention that we're doing something that requires a modicum of care. I
fail to see why that's unreasonable. I don't think it's that unlikely
will come around adding something like "hey we can update the free page
statistic here, because we just read the whole damn thing", suddenly
invalidating the logic. Or we add a heap check that then might trigger
pruning, which quite conceivably could trigger an invalidation. Or ...

I think you're trying to invent a coding rule that doesn't exist, but
I can't prove a negative.

We had bugs around this in the past, so ...

Your argument so far is that "yeah, well, maybe inval doesn't emit
sinval traffic (duh?) but it might do something else that breaks,
prove to me that there is no such thing".

I have no idea what you're trying to say with this. The rest of that
paragraph just seems like bunch of odd hyperbole ("impossible to exceed
the speed of light...").

From my point of view, it's likely to be common to want to run amcheck
in series on a bunch of indexes, like by writing a SELECT query
against pg_class or pg_index and passing the OIDs to amcheck. With
the retain-locks design, that query accumulates locks on a large
number of objects, possibly blowing out the lock table or blocking
DDL.

On the other hand, retaining the locks actually increases the chance to
run a check to completion. If you e.g. write a query checking the
indexes for a table, you suddenly can't be sure that subsequent indexes
for the same table can be checked, because the relation lock has been
released inbetween. Which then'd lead to erroring out after doing part
of the expensive work.

I don't think that refutes your argument, but I also don't think it's
as clear cut as you make it out to be.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#81

Andres Freund

andres@anarazel.de

almost 9 years ago

In reply to: Robert Haas (#79)

Re: amcheck (B-Tree integrity checking tool)

On 2017-02-13 12:05:21 -0500, Robert Haas wrote:

On Fri, Feb 10, 2017 at 8:39 PM, Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Feb 9, 2017 at 2:32 PM, Andres Freund <andres@anarazel.de> wrote:

Except that the proposed names aren't remotely like that... ;).

Revision attached -- V5. We now REVOKE ALL on both functions, as
Robert suggested, instead of the previous approach of having a
hard-coded superuser check with enforcement.

And I proposed documenting named parameters, and
check_btree(performing_check_requiring_exclusive_locks => true) is just
about as expressive.

I have not done this, nor have I renamed the functions. I still think
that this is something that we can fix by adding a boolean argument to
each function in the future, or something along those lines. I
*really* hate the idea of having one function with non-obvious,
variable requirements on locking, with locking implications that are
not knowable when we PREPARE an SQL statement calling the function. It
also removes a useful way of have superusers discriminate against the
stronger locking variant bt_index_parent_check() by not granting
execute on it (as an anti-footgun measure).

I think Andres is more or less correct that
"performing_check_requiring_exclusive_locks => true" is just about as
expressive as calling a different function, but I think that your
point that the superuser might want to grant access to one function
but not the other is a good one. On the other hand, I think Andres
has a concern that we might have more modes in the future and we don't
want to end up with 2^n entrypoints. That also seems valid. Hmm.

That's part of my concern. The second part is that I really want to be
able to have a check_relation() (and check_database())function that I
can pass a bunch of arguments determining how expensive checks are going
to be performed.

E.g. I'd like to be able to do something like

SELECT *
FROM check_relation(
'my_table'::regclass,
test_btree => 'true',
test_btree_heap_interlock => 'true',
test_gin => 'true');

SELECT *
FROM check_current_database(
test_heap_update_chains => 'true',
test_heap_clog_interlock => 'true',
test_btree => 'true',
test_gin => 'false');

etc.

You can't really trivially replace these with a larger query and/or
function, because of the locking considerations (consider what happens
if somebody concurrently drops a table/index - your whole query errors
out, wasting hours of work).

I'm ok with not immediately doing so, but I think Peter's design isn't
in line with achieving something like this.

Regards,

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#82

Peter Geoghegan

pg@bowt.ie

almost 9 years ago

In reply to: Andres Freund (#81)

Re: amcheck (B-Tree integrity checking tool)

On Mon, Mar 6, 2017 at 3:57 PM, Andres Freund <andres@anarazel.de> wrote:

I'm ok with not immediately doing so, but I think Peter's design isn't
in line with achieving something like this.

I would be okay with doing this if we had a grab-bag of expensive
checks, that all pretty much require some combination of
ExclusiveLocks and/or ShareLocks anyway, and are amenable to being
combined. I could see that happening, but we're a long way off from
that. When it does happen, we might have other things that are a bit
like bt_index_parent_check(), in terms of locking requirements and
degree of thoroughness, that might themselves require other parameters
specific to the verification that they perform that we cannot
anticipate. For example, maybe bt_index_parent_check() could have a
parameter that is a probability that any given IndexTuple will be
verified against its heap tuple. Then you could have something like a
PRNG seed argument, so you can check different tuples every day. There
are many possibilities, and I hope that eventually amcheck has this
kind of very rich functionality.

When you use the interface you describe to "stack" several checks at
once, less important parameters, like the ones I suggest are going to
be an awkward fit, and so will probably be excluded. I'm not opposed
to having what amounts to a second interface to what bt_index_check()
does, to make this kind of thing work where it makes sense. Really,
bt_index_parent_check() is almost an alternative interface to the same
functionality today. There can be another one in the future, if it
serves a purpose, and the locking requirements are roughly the same
for all the checks. I'd be fine with that. Let's just get the basic
feature in for now, though.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#83

Andres Freund

andres@anarazel.de

almost 9 years ago

In reply to: Peter Geoghegan (#82)

2 attachment(s)

Re: amcheck (B-Tree integrity checking tool)

On 2017-03-06 20:40:59 -0800, Peter Geoghegan wrote:

On Mon, Mar 6, 2017 at 3:57 PM, Andres Freund <andres@anarazel.de> wrote:

I'm ok with not immediately doing so, but I think Peter's design isn't
in line with achieving something like this.

I would be okay with doing this if we had a grab-bag of expensive
checks, that all pretty much require some combination of
ExclusiveLocks and/or ShareLocks anyway, and are amenable to being
combined. I could see that happening, but we're a long way off from
that. When it does happen, we might have other things that are a bit
like bt_index_parent_check(), in terms of locking requirements and
degree of thoroughness, that might themselves require other parameters
specific to the verification that they perform that we cannot
anticipate. For example, maybe bt_index_parent_check() could have a
parameter that is a probability that any given IndexTuple will be
verified against its heap tuple. Then you could have something like a
PRNG seed argument, so you can check different tuples every day. There
are many possibilities, and I hope that eventually amcheck has this
kind of very rich functionality.

When you use the interface you describe to "stack" several checks at
once, less important parameters, like the ones I suggest are going to
be an awkward fit, and so will probably be excluded. I'm not opposed
to having what amounts to a second interface to what bt_index_check()
does, to make this kind of thing work where it makes sense. Really,
bt_index_parent_check() is almost an alternative interface to the same
functionality today. There can be another one in the future, if it
serves a purpose, and the locking requirements are roughly the same
for all the checks. I'd be fine with that. Let's just get the basic
feature in for now, though.

I disagree quite strongly with pretty much all of that.

But I also think it's more important to get some initial verification
functionality in, than resolving this conflict. I do, also quite
strongly, think we'll be better served with having something like what
you're proposing than nothing, and I don't have time/energy to turn your
patch into what I'm envisioning, nor necessarily the buyin. Hence I'm
planning to commit the current interface.

Attached is your original patch, and a patch editorializing things. I do
plan to merge those before pushing.

Who would you like to see added to Reviewed-By?

Regards,

Andres

Attachments:

0001-Add-amcheck-extension-to-contrib.patchtext/x-patch; charset=us-asciiDownload

From 60c670d251dd4244a09215e4f6c22e444d2bff7c Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 9 Mar 2017 15:50:40 -0800
Subject: [PATCH 1/2] Add amcheck extension to contrib.

This is the beginning of a collection of SQL-callable functions to
verify the integrity of data files.  For now it only contains code to
verify B-Tree indexes.

This adds two SQL-callable functions, validating B-Tree consistency to
a varying degree.  Check the, extensive, docs for details.

The goal is to later extend the coverage of the module to further
access methods, possibly including the heap.  Once checks for
additional access methods exist, we'll likely add some "dispatch"
functions that cover multiple access methods.

Author: Peter Geoghegan, editorialized by Andres Freund
Reviewed-By: Andres Freund, Thomas Vondra
Discussion: CAM3SWZQzLMhMwmBqjzK+pRKXrNUZ4w90wYMUWfkeV8mZ3Debvw@mail.gmail.com
---
 contrib/Makefile                 |    1 +
 contrib/amcheck/Makefile         |   19 +
 contrib/amcheck/amcheck--1.0.sql |   24 +
 contrib/amcheck/amcheck.control  |    5 +
 contrib/amcheck/verify_nbtree.c  | 1238 ++++++++++++++++++++++++++++++++++++++
 doc/src/sgml/amcheck.sgml        |  275 +++++++++
 doc/src/sgml/contrib.sgml        |    1 +
 doc/src/sgml/filelist.sgml       |    1 +
 src/tools/pgindent/typedefs.list |    2 +
 9 files changed, 1566 insertions(+)
 create mode 100644 contrib/amcheck/Makefile
 create mode 100644 contrib/amcheck/amcheck--1.0.sql
 create mode 100644 contrib/amcheck/amcheck.control
 create mode 100644 contrib/amcheck/verify_nbtree.c
 create mode 100644 doc/src/sgml/amcheck.sgml

diff --git a/contrib/Makefile b/contrib/Makefile
index 9a74e1b99f..e84eb67008 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -6,6 +6,7 @@ include $(top_builddir)/src/Makefile.global
 
 SUBDIRS = \
 		adminpack	\
+		amcheck		\
 		auth_delay	\
 		auto_explain	\
 		bloom		\
diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
new file mode 100644
index 0000000000..f14c7aa58b
--- /dev/null
+++ b/contrib/amcheck/Makefile
@@ -0,0 +1,19 @@
+# contrib/amcheck/Makefile
+
+MODULE_big	= amcheck
+OBJS		= verify_nbtree.o $(WIN32RES)
+
+EXTENSION = amcheck
+DATA = amcheck--1.0.sql
+PGFILEDESC = "amcheck - functions to verify access method invariants"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/amcheck
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/amcheck/amcheck--1.0.sql b/contrib/amcheck/amcheck--1.0.sql
new file mode 100644
index 0000000000..a6612d130c
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0.sql
@@ -0,0 +1,24 @@
+/* contrib/amcheck/amcheck--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION amcheck" to load this file. \quit
+
+--
+-- bt_index_check()
+--
+CREATE FUNCTION bt_index_check(index regclass)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+--
+-- bt_index_parent_check()
+--
+CREATE FUNCTION bt_index_parent_check(index regclass)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+-- Don't want these to be available to public
+REVOKE ALL ON FUNCTION bt_index_check(regclass) FROM PUBLIC;
+REVOKE ALL ON FUNCTION bt_index_parent_check(regclass) FROM PUBLIC;
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
new file mode 100644
index 0000000000..180f150e7f
--- /dev/null
+++ b/contrib/amcheck/amcheck.control
@@ -0,0 +1,5 @@
+# amcheck extension
+comment = 'verify access method invariants'
+default_version = '1.0'
+module_pathname = '$libdir/amcheck'
+relocatable = true
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
new file mode 100644
index 0000000000..4cef4c44ff
--- /dev/null
+++ b/contrib/amcheck/verify_nbtree.c
@@ -0,0 +1,1238 @@
+/*-------------------------------------------------------------------------
+ *
+ * verify_nbtree.c
+ *		Verifies the integrity of nbtree indexes based on invariants.
+ *
+ * For B-Tree indexes, verification includes checking that each page in the
+ * target index has items in logical order as reported by an insertion scankey
+ * (the insertion scankey sort-wise NULL semantics are needed for
+ * verification).
+ *
+ *
+ * Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  contrib/amcheck/verify_nbtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/transam.h"
+#include "catalog/index.h"
+#include "catalog/pg_am.h"
+#include "commands/tablecmds.h"
+#include "miscadmin.h"
+#include "storage/lmgr.h"
+#include "utils/memutils.h"
+#include "utils/snapmgr.h"
+
+
+PG_MODULE_MAGIC;
+
+/*
+ * A B-Tree cannot possibly have this many levels, since there must be one
+ * block per level, which is bound by the range of BlockNumber:
+ */
+#define InvalidBtreeLevel	((uint32) InvalidBlockNumber)
+
+/*
+ * State associated with verifying a B-Tree index
+ *
+ * target is the point of reference for a verification operation.
+ *
+ * Other B-Tree pages may be allocated, but those are always auxiliary (e.g.,
+ * they are current target's child pages).  Conceptually, problems are only
+ * ever found in the current target page.	Each page found by verification's
+ * left/right, top/bottom scan becomes the target exactly once.
+ */
+typedef struct BtreeCheckState
+{
+	/*
+	 * Unchanging state, established at start of verification:
+	 */
+
+	/* B-Tree Index Relation */
+	Relation	rel;
+	/* ExclusiveLock held? */
+	bool		exclusivelylocked;
+	/* Per-page context */
+	MemoryContext targetcontext;
+	/* Buffer access strategy */
+	BufferAccessStrategy checkstrategy;
+
+	/*
+	 * Mutable state, for verification of particular page:
+	 */
+
+	/* Current target page */
+	Page		target;
+	/* Target block number */
+	BlockNumber targetblock;
+	/* Target page's LSN */
+	XLogRecPtr	targetlsn;
+} BtreeCheckState;
+
+/*
+ * Starting point for verifying an entire B-Tree index level
+ */
+typedef struct BtreeLevel
+{
+	/* Level number (0 is leaf page level). */
+	uint32		level;
+
+	/* Left most block on level.  Scan of level begins here. */
+	BlockNumber leftmost;
+
+	/* Is this level reported as "true" root level by meta page? */
+	bool		istruerootlevel;
+} BtreeLevel;
+
+PG_FUNCTION_INFO_V1(bt_index_check);
+PG_FUNCTION_INFO_V1(bt_index_parent_check);
+
+static inline void btree_index_checkable(Relation rel);
+static void bt_check_every_level(Relation rel, bool exclusivelylocked);
+static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
+							 BtreeLevel level);
+static void bt_target_page_check(BtreeCheckState *state);
+static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
+static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
+				  ScanKey targetkey);
+static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
+							OffsetNumber offset);
+static inline bool invariant_leq_offset(BtreeCheckState *state,
+					 ScanKey key,
+					 OffsetNumber upperbound);
+static inline bool invariant_geq_offset(BtreeCheckState *state,
+					 ScanKey key,
+					 OffsetNumber lowerbound);
+static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
+							   Page other,
+							   ScanKey key,
+							   OffsetNumber upperbound);
+static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
+
+/*
+ * bt_index_check(index regclass)
+ *
+ * Verify integrity of B-Tree index.
+ *
+ * Only acquires AccessShareLock on index relation.  Does not consider
+ * invariants that exist between parent/child pages.
+ */
+Datum
+bt_index_check(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	Relation	indrel;
+
+	indrel = relation_open(relid, AccessShareLock);
+
+	/* Relation suitable for checking as B-Tree? */
+	btree_index_checkable(indrel);
+
+	/* Check index */
+	bt_check_every_level(indrel, false);
+
+	/* Close, but keep lock */
+	relation_close(indrel, NoLock);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * bt_index_parent_check(index regclass)
+ *
+ * Verify integrity of B-Tree index.
+ *
+ * Acquires ExclusiveLock on index relation, and ShareLock on the associated
+ * heap relation, a bit like REINDEX.  Verifies that downlinks in parent pages
+ * are valid lower bounds on child pages.
+ */
+Datum
+bt_index_parent_check(PG_FUNCTION_ARGS)
+{
+	Oid			indrelid = PG_GETARG_OID(0);
+	Oid			heapid;
+	Relation	indrel;
+	Relation	heaprel;
+
+	/*
+	 * We must lock table before index to avoid deadlocks.	However, if the
+	 * passed indrelid isn't an index then IndexGetRelation() will fail.
+	 * Rather than emitting a not-very-helpful error message, postpone
+	 * complaining, expecting that the is-it-an-index test below will fail.
+	 */
+	heapid = IndexGetRelation(indrelid, true);
+	if (OidIsValid(heapid))
+		heaprel = heap_open(heapid, ShareLock);
+	else
+		heaprel = NULL;
+
+	/*
+	 * Open the target index relations separately (like relation_openrv(), but
+	 * with heap relation locked first to prevent deadlocking).  In hot
+	 * standby mode this will raise an error.
+	 */
+	indrel = index_open(indrelid, ExclusiveLock);
+
+	/*
+	 * Since we did the IndexGetRelation call above without any lock, it's
+	 * barely possible that a race against an index drop/recreation could have
+	 * netted us the wrong table.  Although the table itself won't actually be
+	 * examined during verification currently, a recheck still seems like a
+	 * good idea.
+	 */
+	if (heaprel == NULL || heapid != IndexGetRelation(indrelid, false))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_TABLE),
+				 errmsg("could not open parent table of index %s",
+						RelationGetRelationName(indrel))));
+
+	/* Relation suitable for checking as B-Tree? */
+	btree_index_checkable(indrel);
+
+	/* Check index */
+	bt_check_every_level(indrel, true);
+
+	/* Close, but keep locks */
+	index_close(indrel, NoLock);
+	if (heaprel)
+		heap_close(heaprel, NoLock);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Basic checks about the suitability of a relation for checking as a B-Tree
+ * index.
+ */
+static inline void
+btree_index_checkable(Relation rel)
+{
+	if (rel->rd_rel->relkind != RELKIND_INDEX ||
+		rel->rd_rel->relam != BTREE_AM_OID)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("only B-Tree indexes are supported as targets for verification"),
+				 errdetail("Relation \"%s\" is not a B-Tree index.",
+						   RelationGetRelationName(rel))));
+
+	if (RELATION_IS_OTHER_TEMP(rel))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot access temporary tables of other sessions"),
+			 errdetail("Index \"%s\" is associated with temporary relation.",
+					   RelationGetRelationName(rel))));
+
+	if (!IndexIsValid(rel->rd_index))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot check index \"%s\"",
+						RelationGetRelationName(rel)),
+				 errdetail("Index is not valid")));
+}
+
+/*
+ * Main entry point for B-Tree SQL-callable functions.	Walks the B-Tree in
+ * logical order, verifying invariants as it goes.
+ *
+ * It is the caller's responsibility to acquire appropriate heavyweight lock on
+ * the index relation, and advise us if extra checks are safe when an
+ * ExclusiveLock is held.
+ *
+ * An ExclusiveLock is generally assumed to prevent any kind of physical
+ * modification to the index structure, including modifications that VACUUM may
+ * make.  This does not include setting of the LP_DEAD bit by concurrent index
+ * scans, although that is just metadata that is not able to directly affect
+ * any check performed here.  Any concurrent process that might act on the
+ * LP_DEAD bit being set (recycle space) requires a heavyweight lock that
+ * cannot be held while we hold an ExclusiveLock.  (Besides, even if that could
+ * happen, the ad-hoc recycling when a page might otherwise split is performed
+ * per-page, and requires an exclusive buffer lock, which wouldn't cause us
+ * trouble.  _bt_delitems_vacuum() may only delete leaf items, and so the extra
+ * parent/child check cannot be affected.)
+ */
+static void
+bt_check_every_level(Relation rel, bool exclusivelylocked)
+{
+	BtreeCheckState *state;
+	Page		metapage;
+	BTMetaPageData *metad;
+	uint32		previouslevel;
+	BtreeLevel	current;
+
+	/*
+	 * RecentGlobalXmin assertion matches index_getnext_tid().	See note on
+	 * RecentGlobalXmin/B-Tree page deletion.
+	 */
+	Assert(TransactionIdIsValid(RecentGlobalXmin));
+
+	/*
+	 * Initialize state for entire verification operation
+	 */
+	state = palloc(sizeof(BtreeCheckState));
+	state->rel = rel;
+	state->exclusivelylocked = exclusivelylocked;
+	/* Create context for page */
+	state->targetcontext = AllocSetContextCreate(CurrentMemoryContext,
+												 "amcheck context",
+												 ALLOCSET_DEFAULT_MINSIZE,
+												 ALLOCSET_DEFAULT_INITSIZE,
+												 ALLOCSET_DEFAULT_MAXSIZE);
+	state->checkstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	/* Get true root block from meta-page */
+	metapage = palloc_btree_page(state, BTREE_METAPAGE);
+	metad = BTPageGetMeta(metapage);
+
+	/*
+	 * Certain deletion patterns can result in "skinny" B-Tree indexes, where
+	 * the fast root and true root differ.
+	 *
+	 * Start from the true root, not the fast root, unlike conventional index
+	 * scans.  This approach is more thorough, and removes the risk of
+	 * following a stale fast root from the meta page.
+	 */
+	if (metad->btm_fastroot != metad->btm_root)
+		ereport(DEBUG1,
+				(errcode(ERRCODE_NO_DATA),
+				 errmsg("harmless fast root mismatch in index %s",
+						RelationGetRelationName(rel)),
+				 errdetail_internal("Fast block %u (level %u) "
+									"differs from true root "
+									"block %u (level %u).",
+									metad->btm_fastroot, metad->btm_fastlevel,
+									metad->btm_root, metad->btm_level)));
+
+	/*
+	 * Starting at the root, verify every level.  Move left to right, top to
+	 * bottom.	Note that there may be no pages other than the meta page (meta
+	 * page can indicate that root is P_NONE when the index is totally empty).
+	 */
+	previouslevel = InvalidBtreeLevel;
+	current.level = metad->btm_level;
+	current.leftmost = metad->btm_root;
+	current.istruerootlevel = true;
+	while (current.leftmost != P_NONE)
+	{
+		/*
+		 * Verify this level, and get left most page for next level down, if
+		 * not at leaf level
+		 */
+		current = bt_check_level_from_leftmost(state, current);
+
+		if (current.leftmost == InvalidBlockNumber)
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("index \"%s\" has no valid pages on level below %u or first level",
+							RelationGetRelationName(rel), previouslevel)));
+
+		previouslevel = current.level;
+	}
+
+	/* Be tidy: */
+	MemoryContextDelete(state->targetcontext);
+}
+
+/*
+ * Given a left-most block at some level, move right, verifying each page
+ * individually (with more verification across pages for "exclusivelylocked"
+ * callers).  Caller should pass the true root page as the leftmost initially,
+ * working their way down by passing what is returned for the last call here
+ * until level 0 (leaf page level) was reached.
+ *
+ * Returns state for next call, if any.  This includes left-most block number
+ * one level lower that should be passed on next level/call, which is set to
+ * P_NONE on last call here (when leaf level is verified).	Level numbers
+ * follow the nbtree convention: higher levels have higher numbers, because new
+ * levels are added only due to a root page split.	Note that prior to the
+ * first root page split, the root is also a leaf page, so there is always a
+ * level 0 (leaf level), and it's always the last level processed.
+ *
+ * Note on memory management:  State's per-page context is reset here, between
+ * each call to bt_target_page_check().
+ */
+static BtreeLevel
+bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
+{
+	/* State to establish early, concerning entire level */
+	BTPageOpaque opaque;
+	MemoryContext oldcontext;
+	BtreeLevel	nextleveldown;
+
+	/* Variables for iterating across level using right links */
+	BlockNumber leftcurrent = P_NONE;
+	BlockNumber current = level.leftmost;
+
+	/* Initialize return state */
+	nextleveldown.leftmost = InvalidBlockNumber;
+	nextleveldown.level = InvalidBtreeLevel;
+	nextleveldown.istruerootlevel = false;
+
+	/* Use page-level context for duration of this call */
+	oldcontext = MemoryContextSwitchTo(state->targetcontext);
+
+	elog(DEBUG2, "verifying level %u%s", level.level,
+		 level.istruerootlevel ?
+		 " (true root level)" : level.level == 0 ? " (leaf level)" : "");
+
+	do
+	{
+		/* Don't rely on CHECK_FOR_INTERRUPTS() calls at lower level */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Initialize state for this iteration */
+		state->targetblock = current;
+		state->target = palloc_btree_page(state, state->targetblock);
+		state->targetlsn = PageGetLSN(state->target);
+
+		opaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+
+		if (P_IGNORE(opaque))
+		{
+			if (P_RIGHTMOST(opaque))
+				ereport(ERROR,
+						(errcode(ERRCODE_INDEX_CORRUPTED),
+						 errmsg("block %u fell off the end of index \"%s\"",
+							 current, RelationGetRelationName(state->rel))));
+			else
+				ereport(DEBUG1,
+						(errcode(ERRCODE_NO_DATA),
+						 errmsg("block %u of index \"%s\" ignored",
+							 current, RelationGetRelationName(state->rel))));
+			goto nextpage;
+		}
+		else if (nextleveldown.leftmost == InvalidBlockNumber)
+		{
+			/*
+			 * A concurrent page split could make the caller supplied leftmost
+			 * block no longer contain the leftmost page, or no longer be the
+			 * true root, but where that isn't possible due to heavyweight
+			 * locking, check that the first valid page meets caller's
+			 * expectations.
+			 */
+			if (state->exclusivelylocked)
+			{
+				if (!P_LEFTMOST(opaque))
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+						   errmsg("block %u is not leftmost in index \"%s\"",
+							 current, RelationGetRelationName(state->rel))));
+
+				if (level.istruerootlevel && !P_ISROOT(opaque))
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+						  errmsg("block %u is not true root in index \"%s\"",
+							 current, RelationGetRelationName(state->rel))));
+			}
+
+			/*
+			 * Before beginning any non-trivial examination of level, prepare
+			 * state for next bt_check_level_from_leftmost() invocation for
+			 * the next level for the next level down (if any).
+			 *
+			 * There should be at least one non-ignorable page per level,
+			 * unless this is the leaf level, which is assumed by caller to be
+			 * final level.
+			 */
+			if (!P_ISLEAF(opaque))
+			{
+				IndexTuple	itup;
+				ItemId		itemid;
+
+				/* Internal page -- downlink gets leftmost on next level */
+				itemid = PageGetItemId(state->target, P_FIRSTDATAKEY(opaque));
+				itup = (IndexTuple) PageGetItem(state->target, itemid);
+				nextleveldown.leftmost = ItemPointerGetBlockNumber(&(itup->t_tid));
+				nextleveldown.level = opaque->btpo.level - 1;
+			}
+			else
+			{
+				/*
+				 * Leaf page -- final level caller must process.
+				 *
+				 * Note that this could also be the root page, if there has
+				 * been no root page split yet.
+				 */
+				nextleveldown.leftmost = P_NONE;
+				nextleveldown.level = InvalidBtreeLevel;
+			}
+
+			/*
+			 * Finished setting up state for this call/level.  Control will
+			 * never end up back here in any future loop iteration for this
+			 * level.
+			 */
+		}
+
+		if (state->exclusivelylocked && opaque->btpo_prev != leftcurrent)
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("left link/right link pair in index \"%s\" not in agreement",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Block=%u left block=%u left link from block=%u.",
+								  current, leftcurrent, opaque->btpo_prev)));
+
+		/* Check level, which must be valid for non-ignorable page */
+		if (level.level != opaque->btpo.level)
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("leftmost down link for level points to block in index \"%s\" whose level is not one level down",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Block pointed to=%u expected level=%u level in pointed to block=%u.",
+								 current, level.level, opaque->btpo.level)));
+
+		/* Verify invariants for page -- all important checks occur here */
+		bt_target_page_check(state);
+
+nextpage:
+
+		/* Try to detect circular links */
+		if (current == leftcurrent || current == opaque->btpo_prev)
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+			  errmsg("circular link chain found in block %u of index \"%s\"",
+					 current, RelationGetRelationName(state->rel))));
+
+		leftcurrent = current;
+		current = opaque->btpo_next;
+
+		/* Free page and associated memory for this iteration */
+		MemoryContextReset(state->targetcontext);
+	}
+	while (current != P_NONE);
+
+	/* Don't change context for caller */
+	MemoryContextSwitchTo(oldcontext);
+
+	return nextleveldown;
+}
+
+/*
+ * Function performs the following checks on target page, or pages ancillary to
+ * target page:
+ *
+ * - That every "real" data item is less than or equal to the high key, which
+ *	 is an upper bound on the items on the pages (where there is a high key at
+ *	 all -- pages that are rightmost lack one).
+ *
+ * - That within the page, every "real" item is less than or equal to the item
+ *	 immediately to its right, if any (i.e., that the items are in order within
+ *	 the page, so that the binary searches performed by index scans are sane).
+ *
+ * - That the last item stored on the page is less than or equal to the first
+ *	 "real" data item on the page to the right (if such a first item is
+ *	 available).
+ *
+ * Furthermore, when state passed shows ExclusiveLock held, and target page is
+ * internal page, function also checks:
+ *
+ * - That all child pages respect downlinks lower bound.
+ *
+ * Note:  Memory allocated in this routine is expected to be released by caller
+ * resetting state->targetcontext.
+ */
+static void
+bt_target_page_check(BtreeCheckState *state)
+{
+	OffsetNumber offset;
+	OffsetNumber max;
+	BTPageOpaque topaque;
+
+	topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+	max = PageGetMaxOffsetNumber(state->target);
+
+	elog(DEBUG2, "verifying %u items on %s block %u", max,
+		 P_ISLEAF(topaque) ? "leaf" : "internal", state->targetblock);
+
+	/*
+	 * Loop over page items, starting from first non-highkey item, not high
+	 * key (if any).  Also, immediately skip "negative infinity" real item (if
+	 * any).
+	 */
+	for (offset = P_FIRSTDATAKEY(topaque);
+		 offset <= max;
+		 offset = OffsetNumberNext(offset))
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+		ScanKey		skey;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Don't try to generate scankey using "negative infinity" garbage
+		 * data
+		 */
+		if (offset_is_negative_infinity(topaque, offset))
+			continue;
+
+		/* Build insertion scankey for current page offset */
+		itemid = PageGetItemId(state->target, offset);
+		itup = (IndexTuple) PageGetItem(state->target, itemid);
+		skey = _bt_mkscankey(state->rel, itup);
+
+		/*
+		 * * High key check *
+		 *
+		 * If there is a high key (if this is not the rightmost page on its
+		 * entire level), check that high key actually is upper bound on all
+		 * page items.
+		 *
+		 * We prefer to check all items against high key rather than checking
+		 * just the first and trusting that the operator class obeys the
+		 * transitive law (which implies that all subsequent items also
+		 * respected the high key invariant if they pass the item order
+		 * check).
+		 *
+		 * Ideally, we'd compare every item in the index against every other
+		 * item in the index, and not trust opclass obedience of the
+		 * transitive law to bridge the gap between children and their
+		 * grandparents (as well as great-grandparents, and so on).  We don't
+		 * go to those lengths because that would be prohibitively expensive,
+		 * and probably not markedly more effective in practice.
+		 */
+		if (!P_RIGHTMOST(topaque) &&
+			!invariant_leq_offset(state, skey, P_HIKEY))
+		{
+			char	   *itid,
+					   *htid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumber(&(itup->t_tid)),
+							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("high key invariant violated for index \"%s\"",
+							RelationGetRelationName(state->rel)),
+					 errdetail_internal("Index tid=%s points to %s tid=%s "
+										"page lsn=%X/%X.",
+										itid,
+										P_ISLEAF(topaque) ? "heap" : "index",
+										htid,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
+		}
+
+		/*
+		 * * Item order check *
+		 *
+		 * Check that items are stored on page in logical order, by checking
+		 * current item is less than or equal to next item (if any).
+		 */
+		if (OffsetNumberNext(offset) <= max &&
+			!invariant_leq_offset(state, skey,
+								  OffsetNumberNext(offset)))
+		{
+			char	   *itid,
+					   *htid,
+					   *nitid,
+					   *nhtid;
+
+			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			htid = psprintf("(%u,%u)",
+							ItemPointerGetBlockNumber(&(itup->t_tid)),
+							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+			nitid = psprintf("(%u,%u)", state->targetblock,
+							 OffsetNumberNext(offset));
+
+			/* Reuse itup to get pointed-to heap location of second item */
+			itemid = PageGetItemId(state->target, OffsetNumberNext(offset));
+			itup = (IndexTuple) PageGetItem(state->target, itemid);
+			nhtid = psprintf("(%u,%u)",
+							 ItemPointerGetBlockNumber(&(itup->t_tid)),
+							 ItemPointerGetOffsetNumber(&(itup->t_tid)));
+
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("item order invariant violated for index \"%s\"",
+							RelationGetRelationName(state->rel)),
+			   errdetail_internal("Lower index tid=%s (points to %s tid=%s) "
+								  "higher index tid=%s (points to %s tid=%s) "
+								  "page lsn=%X/%X.",
+								  itid,
+								  P_ISLEAF(topaque) ? "heap" : "index",
+								  htid,
+								  nitid,
+								  P_ISLEAF(topaque) ? "heap" : "index",
+								  nhtid,
+								  (uint32) (state->targetlsn >> 32),
+								  (uint32) state->targetlsn)));
+		}
+
+		/*
+		 * * Last item check *
+		 *
+		 * Check last item against next/right page's first data item's when
+		 * last item on page is reached.  This additional check can detect
+		 * transposed pages.
+		 *
+		 * This check is similar to the item order check that will have
+		 * already been performed for every other "real" item on target page
+		 * when last item is checked.  The difference is that the next item
+		 * (the item that is compared to target's last item) needs to come
+		 * from the next/sibling page.	There may not be such an item
+		 * available from sibling for various reasons, though (e.g., target is
+		 * the rightmost page on level).
+		 */
+		else if (offset == max)
+		{
+			ScanKey		rightkey;
+
+			/* Get item in next/right page */
+			rightkey = bt_right_page_check_scankey(state);
+
+			if (rightkey &&
+				!invariant_geq_offset(state, rightkey, max))
+			{
+				/*
+				 * As explained at length in bt_right_page_check_scankey(),
+				 * there is a known !exclusivelylocked race that could account
+				 * for apparent violation of invariant, which we must check
+				 * for before actually proceeding with raising error.  Our
+				 * canary condition is that target page was deleted.
+				 */
+				if (!state->exclusivelylocked)
+				{
+					/* Get fresh copy of target page */
+					state->target = palloc_btree_page(state, state->targetblock);
+					/* Note that we deliberately do not update target LSN */
+					topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+
+					/*
+					 * All !exclusivelylocked checks now performed; just
+					 * return
+					 */
+					if (P_IGNORE(topaque))
+						return;
+				}
+
+				ereport(ERROR,
+						(errcode(ERRCODE_INDEX_CORRUPTED),
+						 errmsg("cross page item order invariant violated for index \"%s\"",
+								RelationGetRelationName(state->rel)),
+						 errdetail_internal("Last item on page tid=(%u,%u) "
+											"page lsn=%X/%X.",
+											state->targetblock, offset,
+											(uint32) (state->targetlsn >> 32),
+											(uint32) state->targetlsn)));
+			}
+		}
+
+		/*
+		 * * Downlink check *
+		 *
+		 * Additional check of child items iff this is an internal page and
+		 * caller holds an ExclusiveLock.  This happens for every downlink
+		 * (item) in target excluding the negative-infinity downlink (again,
+		 * this is because it has no useful value to compare).
+		 */
+		if (!P_ISLEAF(topaque) && state->exclusivelylocked)
+		{
+			BlockNumber childblock = ItemPointerGetBlockNumber(&(itup->t_tid));
+
+			bt_downlink_check(state, childblock, skey);
+		}
+	}
+}
+
+/*
+ * Return a scankey for an item on page to right of current target (or the
+ * first non-ignorable page), sufficient to check ordering invariant on last
+ * item in current target page.  Returned scankey relies on local memory
+ * allocated for the child page, which caller cannot pfree().  Caller's memory
+ * context should be reset between calls here.
+ *
+ * This is the first data item, and so all adjacent items are checked against
+ * their immediate sibling item (which may be on a sibling page, or even a
+ * "cousin" page at parent boundaries where target's rightlink points to page
+ * with different parent page).  If no such valid item is available, return
+ * NULL instead.
+ *
+ * Note that !exclusivelylocked callers must reverify that target page has not
+ * been concurrently deleted.
+ */
+static ScanKey
+bt_right_page_check_scankey(BtreeCheckState *state)
+{
+	BTPageOpaque opaque;
+	ItemId		rightitem;
+	BlockNumber targetnext;
+	Page		rightpage;
+	OffsetNumber nline;
+
+	/* Determine target's next block number */
+	opaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
+
+	/* If target is already rightmost, no right sibling; nothing to do here */
+	if (P_RIGHTMOST(opaque))
+		return NULL;
+
+	/*
+	 * General notes on concurrent page splits and page deletion:
+	 *
+	 * Routines like _bt_search() don't require *any* page split interlock
+	 * when descending the tree, including something very light like a buffer
+	 * pin. That's why it's okay that we don't either.  This avoidance of any
+	 * need to "couple" buffer locks is the raison d' etre of the Lehman & Yao
+	 * algorithm, in fact.
+	 *
+	 * That leaves deletion.  A deleted page won't actually be recycled by
+	 * VACUUM early enough for us to fail to at least follow its right link
+	 * (or left link, or downlink) and find its sibling, because recycling
+	 * does not occur until no possible index scan could land on the page.
+	 * Index scans can follow links with nothing more than their snapshot as
+	 * an interlock and be sure of at least that much.	(See page
+	 * recycling/RecentGlobalXmin notes in nbtree README.)
+	 *
+	 * Furthermore, it's okay if we follow a rightlink and find a half-dead or
+	 * dead (ignorable) page one or more times.  There will either be a
+	 * further right link to follow that leads to a live page before too long
+	 * (before passing by parent's rightmost child), or we will find the end
+	 * of the entire level instead (possible when parent page is itself the
+	 * rightmost on its level).
+	 */
+	targetnext = opaque->btpo_next;
+	for (;;)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		rightpage = palloc_btree_page(state, targetnext);
+		opaque = (BTPageOpaque) PageGetSpecialPointer(rightpage);
+
+		if (!P_IGNORE(opaque) || P_RIGHTMOST(opaque))
+			break;
+
+		/* We landed on a deleted page, so step right to find a live page */
+		targetnext = opaque->btpo_next;
+		ereport(DEBUG1,
+				(errcode(ERRCODE_NO_DATA),
+				 errmsg("level %u leftmost page of index \"%s\" was found deleted or half dead",
+					opaque->btpo.level, RelationGetRelationName(state->rel)),
+				 errdetail_internal("Deleted page found when building scankey from right sibling.")));
+
+		/* Be slightly more pro-active in freeing this memory, just in case */
+		pfree(rightpage);
+	}
+
+	/*
+	 * No ExclusiveLock held case -- why it's safe to proceed.
+	 *
+	 * Problem:
+	 *
+	 * We must avoid false positive reports of corruption when caller treats
+	 * item returned here as an upper bound on target's last item.  In
+	 * general, false positives are disallowed.  Avoiding them here when
+	 * caller is !exclusivelylocked is subtle.
+	 *
+	 * A concurrent page deletion by VACUUM of the target page can result in
+	 * the insertion of items on to this right sibling page that would
+	 * previously have been inserted on our target page.  There might have
+	 * been insertions that followed the target's downlink after it was made
+	 * to point to right sibling instead of target by page deletion's first
+	 * phase. The inserters insert items that would belong on target page.
+	 * This race is very tight, but it's possible.  This is our only problem.
+	 *
+	 * Non-problems:
+	 *
+	 * We are not hindered by a concurrent page split of the target; we'll
+	 * never land on the second half of the page anyway.  A concurrent split
+	 * of the right page will also not matter, because the first data item
+	 * remains the same within the left half, which we'll reliably land on. If
+	 * we had to skip over ignorable/deleted pages, it cannot matter because
+	 * their key space has already been atomically merged with the first
+	 * non-ignorable page we eventually find (doesn't matter whether the page
+	 * we eventually find is a true sibling or a cousin of target, which we go
+	 * into below).
+	 *
+	 * Solution:
+	 *
+	 * Caller knows that it should reverify that target is not ignorable
+	 * (half-dead or deleted) when cross-page sibling item comparison appears
+	 * to indicate corruption (invariant fails).  This detects the single race
+	 * condition that exists for caller.  This is correct because the
+	 * continued existence of target block as non-ignorable (not half-dead or
+	 * deleted) implies that target page was not merged into from the right by
+	 * deletion; the key space at or after target never moved left.  Target's
+	 * parent either has the same downlink to target as before, or a <=
+	 * downlink due to deletion at the left of target.	Target either has the
+	 * same highkey as before, or a highkey <= before when there is a page
+	 * split. (The rightmost concurrently-split-from-target-page page will
+	 * still have the same highkey as target was originally found to have,
+	 * which for our purposes is equivalent to target's highkey itself never
+	 * changing, since we reliably skip over
+	 * concurrently-split-from-target-page pages.)
+	 *
+	 * In simpler terms, we allow that the key space of the target may expand
+	 * left (the key space can move left on the left side of target only), but
+	 * the target key space cannot expand right and get ahead of us without
+	 * our detecting it.  The key space of the target cannot shrink, unless it
+	 * shrinks to zero due to the deletion of the original page, our canary
+	 * condition.  (To be very precise, we're a bit stricter than that because
+	 * it might just have been that the target page split and only the
+	 * original target page was deleted.  We can be more strict, just not more
+	 * lax.)
+	 *
+	 * Top level tree walk caller moves on to next page (makes it the new
+	 * target) following recovery from this race.  (cf.  The rationale for
+	 * child/downlink verification needing an ExclusiveLock within
+	 * bt_downlink_check(), where page deletion is also the main source of
+	 * trouble.)
+	 *
+	 * Note that it doesn't matter if right sibling page here is actually a
+	 * cousin page, because in order for the key space to be readjusted in a
+	 * way that causes us issues in next level up (guiding problematic
+	 * concurrent insertions to the cousin from the grandparent rather than to
+	 * the sibling from the parent), there'd have to be page deletion of
+	 * target's parent page (affecting target's parent's downlink in target's
+	 * grandparent page).  Internal page deletion only occurs when there are
+	 * no child pages (they were all fully deleted), and caller is checking
+	 * that the target's parent has at least one non-deleted (so
+	 * non-ignorable) child: the target page.  (Note that the first phase of
+	 * deletion atomically marks the page to be deleted half-dead/ignorable at
+	 * the same time downlink in its parent is removed, so caller will
+	 * definitely not fail to detect that this happened.)
+	 *
+	 * This trick is inspired by the method backward scans use for dealing
+	 * with concurrent page splits; concurrent page deletion is a problem that
+	 * similarly receives special consideration sometimes (it's possible that
+	 * the backwards scan will re-read its "original" block after failing to
+	 * find a right-link to it, having already moved in the opposite direction
+	 * (right/"forwards") a few times to try to locate one).  Just like us,
+	 * that happens only to determine if there was a concurrent page deletion
+	 * of a reference page, and just like us if there was a page deletion of
+	 * that reference page it means we can move on from caring about the
+	 * reference page.	See the nbtree README for a full description of how
+	 * that works.
+	 */
+	nline = PageGetMaxOffsetNumber(rightpage);
+
+	/*
+	 * Get first data item, if any
+	 */
+	if (P_ISLEAF(opaque) && nline >= P_FIRSTDATAKEY(opaque))
+	{
+		/* Return first data item (if any) */
+		rightitem = PageGetItemId(rightpage, P_FIRSTDATAKEY(opaque));
+	}
+	else if (!P_ISLEAF(opaque) &&
+			 nline >= OffsetNumberNext(P_FIRSTDATAKEY(opaque)))
+	{
+		/*
+		 * Return first item after the internal page's "negative infinity"
+		 * item
+		 */
+		rightitem = PageGetItemId(rightpage,
+								  OffsetNumberNext(P_FIRSTDATAKEY(opaque)));
+	}
+	else
+	{
+		/*
+		 * No first item.  Page is probably empty leaf page, but it's also
+		 * possible that it's an internal page with only a negative infinity
+		 * item.
+		 */
+		ereport(DEBUG1,
+				(errcode(ERRCODE_NO_DATA),
+				 errmsg("%s block %u of index \"%s\" has no first data item",
+						P_ISLEAF(opaque) ? "leaf" : "internal", targetnext,
+						RelationGetRelationName(state->rel))));
+		return NULL;
+	}
+
+	/*
+	 * Return first real item scankey.	Note that this relies on right page
+	 * memory remaining allocated.
+	 */
+	return _bt_mkscankey(state->rel,
+						 (IndexTuple) PageGetItem(rightpage, rightitem));
+}
+
+/*
+ * Checks one of target's downlink against its child page.
+ *
+ * Conceptually, the target page continues to be what is checked here.	The
+ * target block is still blamed in the event of finding an invariant violation.
+ * The downlink insertion into the target is probably where any problem raised
+ * here arises, and there is no such thing as a parent link, so doing the
+ * verification this way around is much more practical.
+ */
+static void
+bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
+				  ScanKey targetkey)
+{
+	OffsetNumber offset;
+	OffsetNumber maxoffset;
+	Page		child;
+	BTPageOpaque copaque;
+
+	/*
+	 * Caller must have ExclusiveLock on target relation, because of
+	 * considerations around page deletion by VACUUM.
+	 *
+	 * NB: In general, page deletion deletes the right sibling's downlink, not
+	 * the downlink of the page being deleted; the deleted page's downlink is
+	 * reused for its sibling.	The key space is thereby consolidated between
+	 * the deleted page and its right sibling.	(We cannot delete a parent
+	 * page's rightmost child unless it is the last child page, and we intend
+	 * to also delete the parent itself.)
+	 *
+	 * If this verification happened without an ExclusiveLock, the following
+	 * race condition could cause false positives:
+	 *
+	 * In general, concurrent page deletion might occur, including deletion of
+	 * the left sibling of the child page that is examined here.  If such a
+	 * page deletion were to occur, closely followed by an insertion into the
+	 * newly expanded key space of the child, a window for the false positive
+	 * opens up: the stale parent/target downlink originally followed to get
+	 * to the child legitimately ceases to be a lower bound on all items in
+	 * the page, since the key space was concurrently expanded "left".
+	 * (Insertion followed the "new" downlink for the child, not our now-stale
+	 * downlink, which was concurrently physically removed in target/parent as
+	 * part of deletion's first phase.)
+	 *
+	 * Note that while the cross-page-same-level last item check uses a trick
+	 * that allows it to perform verification for !exclusivelylocked callers,
+	 * a similar trick seems difficult here.  The trick that that other check
+	 * uses is, in essence, to lock down race conditions to those that occur
+	 * due to concurrent page deletion of the target; that's a race that can
+	 * be reliably detected before actually reporting corruption.
+	 *
+	 * On the other hand, we'd need to lock down race conditions involving
+	 * deletion of child's left page, for long enough to read the child page
+	 * into memory (in other words, a scheme with concurrently held buffer
+	 * locks on both child and left-of-child pages).  That's unacceptable for
+	 * amcheck functions on general principle, though.
+	 */
+	Assert(state->exclusivelylocked);
+
+	/*
+	 * Verify child page has the downlink key from target page (its parent) as
+	 * a lower bound.
+	 *
+	 * Check all items, rather than checking just the first and trusting that
+	 * the operator class obeys the transitive law.
+	 */
+	child = palloc_btree_page(state, childblock);
+	copaque = (BTPageOpaque) PageGetSpecialPointer(child);
+	maxoffset = PageGetMaxOffsetNumber(child);
+
+	for (offset = P_FIRSTDATAKEY(copaque);
+		 offset <= maxoffset;
+		 offset = OffsetNumberNext(offset))
+	{
+		/*
+		 * Skip comparison of target page key against "negative infinity"
+		 * item, if any.  Checking it would indicate that it's not an upper
+		 * bound, but that's only because of the hard-coding within
+		 * _bt_compare().
+		 */
+		if (offset_is_negative_infinity(copaque, offset))
+			continue;
+
+		if (!invariant_leq_nontarget_offset(state, child,
+											targetkey, offset))
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("down-link lower bound invariant violated for index \"%s\"",
+							RelationGetRelationName(state->rel)),
+				errdetail_internal("Parent block=%u child index tid=(%u,%u) "
+								   "parent page lsn=%X/%X.",
+								   state->targetblock, childblock, offset,
+								   (uint32) (state->targetlsn >> 32),
+								   (uint32) state->targetlsn)));
+	}
+
+	pfree(child);
+}
+
+/*
+ * Is particular offset within page (whose special state is passed by caller)
+ * the page negative-infinity item?
+ *
+ * As noted in comments above _bt_compare(), there is special handling of the
+ * first data item as a "negative infinity" item.  The hard-coding within
+ * _bt_compare() makes comparing this item for the purposes of verification
+ * pointless at best, since the IndexTuple only contains a valid TID (a
+ * reference TID to child page).
+ */
+static inline bool
+offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
+{
+	/*
+	 * For internal pages only, the first item after high key, if any, is
+	 * negative infinity item.	Internal pages always have a negative infinity
+	 * item, whereas leaf pages never have one.  This implies that negative
+	 * infinity item is either first or second line item, or there is none
+	 * within page.
+	 *
+	 * Right-most pages don't have a high key, but could be said to
+	 * conceptually have a "positive infinity" high key.  Thus, there is a
+	 * symmetry between down link items in parent pages, and high keys in
+	 * children.  Together, they represent the part of the key space that
+	 * belongs to each page in the index.  For example, all children of the
+	 * root page will have negative infinity as a lower bound from root
+	 * negative infinity downlink, and positive infinity as an upper bound
+	 * (implicitly, from "imaginary" positive infinity high key in root).
+	 */
+	return !P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque);
+}
+
+/*
+ * Does the invariant hold that the key is less than or equal to a given upper
+ * bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static inline bool
+invariant_leq_offset(BtreeCheckState *state, ScanKey key,
+					 OffsetNumber upperbound)
+{
+	int16		natts = state->rel->rd_rel->relnatts;
+	int32		cmp;
+
+	cmp = _bt_compare(state->rel, natts, key, state->target, upperbound);
+
+	return cmp <= 0;
+}
+
+/*
+ * Does the invariant hold that the key is greater than or equal to a given
+ * lower bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static inline bool
+invariant_geq_offset(BtreeCheckState *state, ScanKey key,
+					 OffsetNumber lowerbound)
+{
+	int16		natts = state->rel->rd_rel->relnatts;
+	int32		cmp;
+
+	cmp = _bt_compare(state->rel, natts, key, state->target, lowerbound);
+
+	return cmp >= 0;
+}
+
+/*
+ * Does the invariant hold that the key is less than or equal to a given upper
+ * bound offset item, with the offset relating to a caller-supplied page that
+ * is not the current target page? Caller's non-target page is typically a
+ * child page of the target, checked as part of checking a property of the
+ * target page (i.e. the key comes from the target).
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static inline bool
+invariant_leq_nontarget_offset(BtreeCheckState *state,
+							   Page nontarget, ScanKey key,
+							   OffsetNumber upperbound)
+{
+	int16		natts = state->rel->rd_rel->relnatts;
+	int32		cmp;
+
+	cmp = _bt_compare(state->rel, natts, key, nontarget, upperbound);
+
+	return cmp <= 0;
+}
+
+/*
+ * Given a block number of a B-Tree page, return page in palloc()'d memory.
+ * While at it, perform some basic checks of the page.
+ *
+ * There is never an attempt to get a consistent view of multiple pages using
+ * multiple concurrent buffer locks; in general, we only acquire a single pin
+ * and buffer lock at a time, which is often all that the nbtree code requires.
+ *
+ * Operating on a copy of the page is useful because it prevents control
+ * getting stuck in an uninterruptible state when an underlying operator class
+ * misbehaves.
+ */
+static Page
+palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
+{
+	Buffer		buffer;
+	Page		page;
+	BTPageOpaque opaque;
+
+	page = palloc(BLCKSZ);
+
+	/*
+	 * We copy the page into local storage to avoid holding pin on the buffer
+	 * longer than we must.
+	 */
+	buffer = ReadBufferExtended(state->rel, MAIN_FORKNUM, blocknum, RBM_NORMAL,
+								state->checkstrategy);
+	LockBuffer(buffer, BT_READ);
+
+	/*
+	 * Perform the same basic sanity checking that nbtree itself performs for
+	 * every page:
+	 */
+	_bt_checkpage(state->rel, buffer);
+
+	/* Only use copy of page in palloc()'d memory */
+	memcpy(page, BufferGetPage(buffer), BLCKSZ);
+	UnlockReleaseBuffer(buffer);
+
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	if (opaque->btpo_flags & BTP_META && blocknum != BTREE_METAPAGE)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+				 errmsg("invalid meta page found at block %u in index \"%s\"",
+						blocknum, RelationGetRelationName(state->rel))));
+
+	/* Check page from block that ought to be meta page */
+	if (blocknum == BTREE_METAPAGE)
+	{
+		BTMetaPageData *metad = BTPageGetMeta(page);
+
+		if (!(opaque->btpo_flags & BTP_META) ||
+			metad->btm_magic != BTREE_MAGIC)
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("index \"%s\" meta page is corrupt",
+							RelationGetRelationName(state->rel))));
+
+		if (metad->btm_version != BTREE_VERSION)
+			ereport(ERROR,
+					(errcode(ERRCODE_INDEX_CORRUPTED),
+					 errmsg("version mismatch in index \"%s\": file version %d, code version %d",
+							RelationGetRelationName(state->rel),
+							metad->btm_version, BTREE_VERSION)));
+	}
+
+	/*
+	 * Deleted pages have no sane "level" field, so can only check non-deleted
+	 * page level
+	 */
+	if (P_ISLEAF(opaque) && !P_ISDELETED(opaque) && opaque->btpo.level != 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+			errmsg("invalid leaf page level %u for block %u in index \"%s\"",
+		opaque->btpo.level, blocknum, RelationGetRelationName(state->rel))));
+
+	if (blocknum != BTREE_METAPAGE && !P_ISLEAF(opaque) &&
+		!P_ISDELETED(opaque) && opaque->btpo.level == 0)
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+		 errmsg("invalid internal page level 0 for block %u in index \"%s\"",
+				opaque->btpo.level, RelationGetRelationName(state->rel))));
+
+	if (!P_ISLEAF(opaque) && P_HAS_GARBAGE(opaque))
+		ereport(ERROR,
+				(errcode(ERRCODE_INDEX_CORRUPTED),
+		   errmsg("internal page block %u in index \"%s\" has garbage items",
+				  blocknum, RelationGetRelationName(state->rel))));
+
+	return page;
+}
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
new file mode 100644
index 0000000000..168d3f4397
--- /dev/null
+++ b/doc/src/sgml/amcheck.sgml
@@ -0,0 +1,275 @@
+<!-- doc/src/sgml/amcheck.sgml -->
+
+<sect1 id="amcheck" xreflabel="amcheck">
+ <title>amcheck</title>
+
+ <indexterm zone="amcheck">
+  <primary>amcheck</primary>
+ </indexterm>
+
+ <para>
+  The <filename>amcheck</> module provides functions that allow you to
+  verify the logical consistency of the structure of indexes.  If the
+  structure appears to be valid, no error is raised.
+ </para>
+
+ <para>
+  The functions verify various <emphasis>invariants</> in the
+  structure of the representation of particular indexes.  The
+  correctness of the access method functions behind index scans and
+  other important operations relies on these invariants always
+  holding.  For example, certain functions verify, among other things,
+  that all B-Tree pages have items in <quote>logical</> order (e.g.,
+  for B-Tree indexes on <type>text</>, index tuples should be in
+  collated lexical order).  If that particular invariant somehow fails
+  to hold, we can expect binary searches on the affected page to
+  incorrectly guide index scans, resulting in wrong answers to SQL
+  queries.  Problems like this can be very subtle.
+ </para>
+ <para>
+  Verification is performed using the same procedures as those used by
+  index scans themselves, which may be user-defined operator class
+  code.  For example, B-Tree index verification relies on comparisons
+  made with one or more B-Tree support function 1 routines.  See <xref
+  linkend="xindex-support"> for details of operator class support
+  functions.
+ </para>
+ <para>
+  <filename>amcheck</> functions may be used only by superusers.
+ </para>
+
+ <sect2>
+  <title>Functions</title>
+
+  <variablelist>
+   <varlistentry>
+    <term>
+     <function>bt_index_check(index regclass) returns void</function>
+     <indexterm>
+      <primary>bt_index_check</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <function>bt_index_check</function> tests that its target, a
+      B-Tree index, respects a variety of invariants.  Example usage:
+<screen>
+test=# SELECT bt_index_check(c.oid), c.relname, c.relpages
+FROM pg_index i
+JOIN pg_opclass op ON i.indclass[0] = op.oid
+JOIN pg_am am ON op.opcmethod = am.oid
+JOIN pg_class c ON i.indexrelid = c.oid
+JOIN pg_namespace n ON c.relnamespace = n.oid
+WHERE am.amname = 'btree' AND n.nspname = 'pg_catalog'
+-- Don't check temp tables, which may be from another session:
+AND c.relpersistence != 't'
+-- Function may throw an error when this is omitted:
+AND i.indisready AND i.indisvalid
+ORDER BY c.relpages DESC LIMIT 10;
+ bt_index_check |             relname             | relpages 
+----------------+---------------------------------+----------
+                | pg_depend_reference_index       |       43
+                | pg_depend_depender_index        |       40
+                | pg_proc_proname_args_nsp_index  |       31
+                | pg_description_o_c_o_index      |       21
+                | pg_attribute_relid_attnam_index |       14
+                | pg_proc_oid_index               |       10
+                | pg_attribute_relid_attnum_index |        9
+                | pg_amproc_fam_proc_index        |        5
+                | pg_amop_opr_fam_index           |        5
+                | pg_amop_fam_strat_index         |        5
+(10 rows)
+</screen>
+      This example shows a session that performs verification of every
+      catalog index in the database <quote>test</>.  Details of just
+      the 10 largest indexes verified are displayed.  Since no error
+      is raised, all indexes tested appear to be logically consistent.
+      Naturally, this query could easily be changed to call
+      <function>bt_index_check</function> for every index in the
+      database where verification is supported.
+     </para>
+     <para>
+      An <literal>AccessShareLock</> is acquired on the target index
+      by <function>bt_index_check</function>.  This lock mode is the
+      same lock mode acquired on relations by simple
+      <literal>SELECT</> statements.
+      <function>bt_index_check</function> does not verify invariants
+      that span child/parent relationships, nor does it verify that
+      the target index is consistent with its heap relation.  When a
+      routine, lightweight test for corruption is required in a live
+      production environment, using
+      <function>bt_index_check</function> often provides the best
+      trade-off between thoroughness of verification and limiting the
+      impact on application performance and availability.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
+     <function>bt_index_parent_check(index regclass) returns void</function>
+     <indexterm>
+      <primary>bt_index_parent_check</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <function>bt_index_parent_check</function> tests that its
+      target, a B-Tree index, respects a variety of invariants.  The
+      checks performed by <function>bt_index_parent_check</function>
+      are a superset of the checks performed by
+      <function>bt_index_check</function>.
+      <function>bt_index_parent_check</function> can be thought of as
+      a more thorough variant of <function>bt_index_check</function>:
+      unlike <function>bt_index_check</function>,
+      <function>bt_index_parent_check</function> also checks
+      invariants that span parent/child relationships.  However, it
+      does not verify that the target index is consistent with its
+      heap relation.  <function>bt_index_parent_check</function>
+      follows the general convention of raising an error if it finds a
+      logical inconsistency or other problem.
+     </para>
+     <para>
+      An <literal>ExclusiveLock</> is required on the target index by
+      <function>bt_index_parent_check</function> (a
+      <literal>ShareLock</> is also acquired on the heap relation).
+      These locks prevent concurrent data modification from
+      <command>INSERT</>, <command>UPDATE</>, and <command>DELETE</>
+      commands.  The locks also prevent the underlying relation from
+      being concurrently processed by <command>VACUUM</>, as well as
+      all other utility commands.  Note that the function holds locks
+      for the entire duration of the transaction, so there can be a
+      significant advantage to verifying each index individually, each
+      in its own dedicated transaction.
+     </para>
+     <para>
+      <function>bt_index_parent_check</function>'s additional
+      verification is more likely to detect various pathological
+      cases.  These cases may involve an incorrectly implemented
+      B-Tree operator class used by the index that is checked, or,
+      hypothetically, undiscovered bugs in the underlying B-Tree index
+      access method code.  Note that
+      <function>bt_index_parent_check</function> cannot be used when
+      Hot Standby mode is enabled (i.e., on read-only physical
+      replicas), unlike <function>bt_index_check</function>.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </sect2>
+
+ <sect2>
+  <title>Using <filename>amcheck</> effectively</title>
+
+ <para>
+  <filename>amcheck</> can be effective at detecting various types of
+  failure modes that <link
+  linkend="app-initdb-data-checksums"><application>data page
+  checksums</></link> will always fail to catch.  These include:
+
+  <itemizedlist>
+   <listitem>
+    <para>
+     Structural inconsistencies caused by incorrect operator class
+     implementations.
+    </para>
+    <para>
+     This includes issues caused by the comparison rules of operating
+     system collations changing. Comparisons of datums of a collatable
+     type like <type>text</> must be immutable (just as all
+     comparisons used for B-Tree index scans must be immutable), which
+     implies that operating system collation rules must never change.
+     Though rare, updates to operating system collation rules can
+     cause these issues. More commonly, an inconsistency in the
+     collation order between a master server and a standby server is
+     implicated, possibly because the <emphasis>major</> operating
+     system version in use is inconsistent.  Such inconsistencies will
+     generally only arise on standby servers, and so can generally
+     only be detected on standby servers.
+    </para>
+    <para>
+     If a problem like this arises, it may not affect each individual
+     index that is ordered using an affected collation, simply because
+     <emphasis>indexed</> values might happen to have the same
+     absolute ordering regardless of the behavioral inconsistency. See
+     <xref linkend="locale"> and <xref linkend="collation"> for
+     further details about how <productname>PostgreSQL</> uses
+     operating system locales and collations.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     Corruption caused by hypothetical undiscovered bugs in the
+     underlying <productname>PostgreSQL</> access method code or sort
+     code.
+    </para>
+    <para>
+     Automatic verification of the structural integrity of indexes
+     plays a role in the general testing of new or proposed
+     <productname>PostgreSQL</> features that could plausibly allow a
+     logical inconsistency to be introduced.  One obvious testing
+     strategy is to call <filename>amcheck</> functions continuously
+     when running the standard regression tests.  See <xref
+     linkend="regress-run"> for details on running the tests.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     Filesystem or storage subsystem faults where checksums happen to
+     simply not be enabled.
+    </para>
+    <para>
+     Note that <filename>amcheck</> examines a page as represented in some
+     shared memory buffer at the time of verification if there is only a
+     shared buffer hit when accessing the block. Consequently,
+     <filename>amcheck</> does not necessarily examine data read from the
+     filesystem at the time of verification. Note that when checksums are
+     enabled, <filename>amcheck</> may raise an error due to a checksum
+     failure when a corrupt block is read into a buffer.
+    </para>
+   </listitem>
+   <listitem>
+    <para>
+     Corruption caused by faulty RAM, and the broader memory subsystem
+     and operating system.
+    </para>
+    <para>
+     <productname>PostgreSQL</> does not protect against correctable
+     memory errors and it is assumed you will operate using RAM that
+     uses industry standard Error Correcting Codes (ECC) or better
+     protection.  However, ECC memory is typically only immune to
+     single-bit errors, and should not be assumed to provide
+     <emphasis>absolute</emphasis> protection against failures that
+     result in memory corruption.
+    </para>
+   </listitem>
+  </itemizedlist>
+  In general, <filename>amcheck</> can only prove the presence of
+  corruption; it cannot prove its absence.
+ </para>
+
+ </sect2>
+ <sect2>
+  <title>Repairing corruption</title>
+ <para>
+  No error concerning corruption raised by <filename>amcheck</> should
+  ever be a false positive.  In practice, <filename>amcheck</> is more
+  likely to find software bugs than problems with hardware.
+  <filename>amcheck</> raises errors in the event of conditions that,
+  by definition, should never happen, and so careful analysis of
+  <filename>amcheck</> errors is often required.
+ </para>
+ <para>
+  There is no general method of repairing problems that
+  <filename>amcheck</> detects.  An explanation for the root cause of
+  an invariant violation should be sought.  <xref
+  linkend="pageinspect"> may play a useful role in diagnosing
+  corruption that <filename>amcheck</> detects.  A <command>REINDEX</>
+  may not be effective in repairing corruption.
+ </para>
+
+ </sect2>
+
+</sect1>
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index 03e5889839..eaaa36cb87 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -103,6 +103,7 @@ CREATE EXTENSION <replaceable>module_name</> FROM unpackaged;
  </para>
 
  &adminpack;
+ &amcheck;
  &auth-delay;
  &auto-explain;
  &bloom;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index e7aa92f914..6782f07aea 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -107,6 +107,7 @@
 <!-- contrib information -->
 <!ENTITY contrib         SYSTEM "contrib.sgml">
 <!ENTITY adminpack       SYSTEM "adminpack.sgml">
+<!ENTITY amcheck         SYSTEM "amcheck.sgml">
 <!ENTITY auth-delay      SYSTEM "auth-delay.sgml">
 <!ENTITY auto-explain    SYSTEM "auto-explain.sgml">
 <!ENTITY bloom           SYSTEM "bloom.sgml">
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 296552e394..3487f7becb 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -244,6 +244,8 @@ BrinRevmap
 BrinSpecialSpace
 BrinTuple
 BrinValues
+BtreeCheckState
+BtreeLevel
 Bucket
 BufFile
 Buffer
-- 
2.11.0.22.g8d7a455.dirty

0002-Cleanups-tests.patchtext/x-patch; charset=us-asciiDownload

From ecb145091ad8ebc48fe25d55c76a2663a9c1ef52 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 9 Mar 2017 15:50:27 -0800
Subject: [PATCH 2/2] Cleanups & tests.

---
 contrib/amcheck/Makefile                 |   4 +-
 contrib/amcheck/amcheck.control          |   2 +-
 contrib/amcheck/expected/check.out       |   1 +
 contrib/amcheck/expected/check_btree.out |  87 ++++++++++++++++++++++
 contrib/amcheck/sql/check.sql            |   1 +
 contrib/amcheck/sql/check_btree.sql      |  59 +++++++++++++++
 contrib/amcheck/verify_nbtree.c          | 119 +++++++++++++++++--------------
 doc/src/sgml/amcheck.sgml                |  14 ++--
 8 files changed, 223 insertions(+), 64 deletions(-)
 create mode 100644 contrib/amcheck/expected/check.out
 create mode 100644 contrib/amcheck/expected/check_btree.out
 create mode 100644 contrib/amcheck/sql/check.sql
 create mode 100644 contrib/amcheck/sql/check_btree.sql

diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index f14c7aa58b..43bed919ae 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -5,7 +5,9 @@ OBJS		= verify_nbtree.o $(WIN32RES)
 
 EXTENSION = amcheck
 DATA = amcheck--1.0.sql
-PGFILEDESC = "amcheck - functions to verify access method invariants"
+PGFILEDESC = "amcheck - function for verifying relation integrity"
+
+REGRESS = check check_btree
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
index 180f150e7f..05e2861d7a 100644
--- a/contrib/amcheck/amcheck.control
+++ b/contrib/amcheck/amcheck.control
@@ -1,5 +1,5 @@
 # amcheck extension
-comment = 'verify access method invariants'
+comment = 'functions for verifying relation integrity'
 default_version = '1.0'
 module_pathname = '$libdir/amcheck'
 relocatable = true
diff --git a/contrib/amcheck/expected/check.out b/contrib/amcheck/expected/check.out
new file mode 100644
index 0000000000..5b3e6d530c
--- /dev/null
+++ b/contrib/amcheck/expected/check.out
@@ -0,0 +1 @@
+CREATE EXTENSION amcheck;
diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
new file mode 100644
index 0000000000..851b5cdcc1
--- /dev/null
+++ b/contrib/amcheck/expected/check_btree.out
@@ -0,0 +1,87 @@
+-- minimal test, basically just verifying that amcheck
+CREATE TABLE bttest_a(id int8);
+CREATE TABLE bttest_b(id int8);
+INSERT INTO bttest_a SELECT * FROM generate_series(1, 100000);
+INSERT INTO bttest_b SELECT * FROM generate_series(100000, 1, -1);
+CREATE INDEX bttest_a_idx ON bttest_a USING btree (id);
+CREATE INDEX bttest_b_idx ON bttest_b USING btree (id);
+CREATE ROLE bttest_role;
+-- verify permissions are checked (error due to function not callable)
+SET ROLE bttest_role;
+SELECT bt_index_check('bttest_a_idx'::regclass);
+ERROR:  permission denied for function bt_index_check
+SELECT bt_index_parent_check('bttest_a_idx'::regclass);
+ERROR:  permission denied for function bt_index_parent_check
+RESET ROLE;
+-- we, intentionally, don't check relation permissions - it's useful
+-- to run this cluster-wide with a restricted account, and as tested
+-- above explicit permission has to be granted for that.
+GRANT EXECUTE ON FUNCTION bt_index_check(regclass) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_parentcheck(regclass) TO bttest_role;
+ERROR:  function bt_index_parentcheck(regclass) does not exist
+SET ROLE bttest_role;
+SELECT bt_index_check('bttest_a_idx');
+ bt_index_check 
+----------------
+ 
+(1 row)
+
+SELECT bt_index_parent_check('bttest_a_idx');
+ERROR:  permission denied for function bt_index_parent_check
+RESET ROLE;
+-- verify plain tables are rejected (error)
+SELECT bt_index_check('bttest_a');
+ERROR:  "bttest_a" is not an index
+SELECT bt_index_parent_check('bttest_a');
+ERROR:  "bttest_a" is not an index
+-- verify non-existing indexes are rejected (error)
+SELECT bt_index_check(17);
+ERROR:  could not open relation with OID 17
+SELECT bt_index_parent_check(17);
+ERROR:  could not open relation with OID 17
+-- verify wrong index types are rejected (error)
+BEGIN;
+CREATE INDEX bttest_a_brin_idx ON bttest_a USING brin(id);
+SELECT bt_index_parent_check('bttest_a_brin_idx');
+ERROR:  only B-Tree indexes are supported as targets for verification
+DETAIL:  Relation "bttest_a_brin_idx" is not a B-Tree index.
+ROLLBACK;
+-- normal check outside of xact
+SELECT bt_index_check('bttest_a_idx');
+ bt_index_check 
+----------------
+ 
+(1 row)
+
+-- more expansive test
+SELECT bt_index_parent_check('bttest_b_idx');
+ bt_index_parent_check 
+-----------------------
+ 
+(1 row)
+
+BEGIN;
+SELECT bt_index_check('bttest_a_idx');
+ bt_index_check 
+----------------
+ 
+(1 row)
+
+SELECT bt_index_parent_check('bttest_b_idx');
+ bt_index_parent_check 
+-----------------------
+ 
+(1 row)
+
+-- make sure we don't have any leftover locks
+SELECT * FROM pg_locks WHERE relation IN ('bttest_a_idx'::regclass, 'bttest_b_idx'::regclass);
+ locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath 
+----------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-----+------+---------+----------
+(0 rows)
+
+COMMIT;
+-- cleanup
+DROP TABLE bttest_a;
+DROP TABLE bttest_b;
+DROP OWNED BY bttest_role; -- permissions
+DROP ROLE bttest_role;
diff --git a/contrib/amcheck/sql/check.sql b/contrib/amcheck/sql/check.sql
new file mode 100644
index 0000000000..5b3e6d530c
--- /dev/null
+++ b/contrib/amcheck/sql/check.sql
@@ -0,0 +1 @@
+CREATE EXTENSION amcheck;
diff --git a/contrib/amcheck/sql/check_btree.sql b/contrib/amcheck/sql/check_btree.sql
new file mode 100644
index 0000000000..53377f6da9
--- /dev/null
+++ b/contrib/amcheck/sql/check_btree.sql
@@ -0,0 +1,59 @@
+-- minimal test, basically just verifying that amcheck
+CREATE TABLE bttest_a(id int8);
+CREATE TABLE bttest_b(id int8);
+
+INSERT INTO bttest_a SELECT * FROM generate_series(1, 100000);
+INSERT INTO bttest_b SELECT * FROM generate_series(100000, 1, -1);
+
+CREATE INDEX bttest_a_idx ON bttest_a USING btree (id);
+CREATE INDEX bttest_b_idx ON bttest_b USING btree (id);
+
+CREATE ROLE bttest_role;
+
+-- verify permissions are checked (error due to function not callable)
+SET ROLE bttest_role;
+SELECT bt_index_check('bttest_a_idx'::regclass);
+SELECT bt_index_parent_check('bttest_a_idx'::regclass);
+RESET ROLE;
+
+-- we, intentionally, don't check relation permissions - it's useful
+-- to run this cluster-wide with a restricted account, and as tested
+-- above explicit permission has to be granted for that.
+GRANT EXECUTE ON FUNCTION bt_index_check(regclass) TO bttest_role;
+GRANT EXECUTE ON FUNCTION bt_index_parentcheck(regclass) TO bttest_role;
+SET ROLE bttest_role;
+SELECT bt_index_check('bttest_a_idx');
+SELECT bt_index_parent_check('bttest_a_idx');
+RESET ROLE;
+
+-- verify plain tables are rejected (error)
+SELECT bt_index_check('bttest_a');
+SELECT bt_index_parent_check('bttest_a');
+
+-- verify non-existing indexes are rejected (error)
+SELECT bt_index_check(17);
+SELECT bt_index_parent_check(17);
+
+-- verify wrong index types are rejected (error)
+BEGIN;
+CREATE INDEX bttest_a_brin_idx ON bttest_a USING brin(id);
+SELECT bt_index_parent_check('bttest_a_brin_idx');
+ROLLBACK;
+
+-- normal check outside of xact
+SELECT bt_index_check('bttest_a_idx');
+-- more expansive test
+SELECT bt_index_parent_check('bttest_b_idx');
+
+BEGIN;
+SELECT bt_index_check('bttest_a_idx');
+SELECT bt_index_parent_check('bttest_b_idx');
+-- make sure we don't have any leftover locks
+SELECT * FROM pg_locks WHERE relation IN ('bttest_a_idx'::regclass, 'bttest_b_idx'::regclass);
+COMMIT;
+
+-- cleanup
+DROP TABLE bttest_a;
+DROP TABLE bttest_b;
+DROP OWNED BY bttest_role; -- permissions
+DROP ROLE bttest_role;
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 4cef4c44ff..ad8e15ea67 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -43,8 +43,8 @@ PG_MODULE_MAGIC;
  * target is the point of reference for a verification operation.
  *
  * Other B-Tree pages may be allocated, but those are always auxiliary (e.g.,
- * they are current target's child pages).  Conceptually, problems are only
- * ever found in the current target page.	Each page found by verification's
+ * they are current target's child pages). Conceptually, problems are only
+ * ever found in the current target page. Each page found by verification's
  * left/right, top/bottom scan becomes the target exactly once.
  */
 typedef struct BtreeCheckState
@@ -92,6 +92,7 @@ typedef struct BtreeLevel
 PG_FUNCTION_INFO_V1(bt_index_check);
 PG_FUNCTION_INFO_V1(bt_index_parent_check);
 
+static void bt_index_check_internal(Oid indrelid, bool parentcheck);
 static inline void btree_index_checkable(Relation rel);
 static void bt_check_every_level(Relation rel, bool exclusivelylocked);
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
@@ -119,25 +120,15 @@ static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
  *
  * Verify integrity of B-Tree index.
  *
- * Only acquires AccessShareLock on index relation.  Does not consider
+ * Acquires AccessShareLock on heap & index relations.  Does not consider
  * invariants that exist between parent/child pages.
  */
 Datum
 bt_index_check(PG_FUNCTION_ARGS)
 {
-	Oid			relid = PG_GETARG_OID(0);
-	Relation	indrel;
+	Oid			indrelid = PG_GETARG_OID(0);
 
-	indrel = relation_open(relid, AccessShareLock);
-
-	/* Relation suitable for checking as B-Tree? */
-	btree_index_checkable(indrel);
-
-	/* Check index */
-	bt_check_every_level(indrel, false);
-
-	/* Close, but keep lock */
-	relation_close(indrel, NoLock);
+	bt_index_check_internal(indrelid, false);
 
 	PG_RETURN_VOID();
 }
@@ -147,36 +138,55 @@ bt_index_check(PG_FUNCTION_ARGS)
  *
  * Verify integrity of B-Tree index.
  *
- * Acquires ExclusiveLock on index relation, and ShareLock on the associated
- * heap relation, a bit like REINDEX.  Verifies that downlinks in parent pages
- * are valid lower bounds on child pages.
+ * Acquires ShareLock on heap & index relations.  Verifies that downlinks in
+ * parent pages are valid lower bounds on child pages.
  */
 Datum
 bt_index_parent_check(PG_FUNCTION_ARGS)
 {
 	Oid			indrelid = PG_GETARG_OID(0);
+
+	bt_index_check_internal(indrelid, true);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Helper for bt_index_[parent_]check, coordinating the bulk of the work.
+ */
+static void
+bt_index_check_internal(Oid indrelid, bool parentcheck)
+{
 	Oid			heapid;
 	Relation	indrel;
 	Relation	heaprel;
+	LOCKMODE	lockmode;
+
+	if (parentcheck)
+		lockmode = ShareLock;
+	else
+		lockmode = AccessShareLock;
 
 	/*
-	 * We must lock table before index to avoid deadlocks.	However, if the
+	 * We must lock table before index to avoid deadlocks.  However, if the
 	 * passed indrelid isn't an index then IndexGetRelation() will fail.
 	 * Rather than emitting a not-very-helpful error message, postpone
 	 * complaining, expecting that the is-it-an-index test below will fail.
+	 *
+	 * In hot standby mode this will raise an error when parentcheck is true.
 	 */
 	heapid = IndexGetRelation(indrelid, true);
 	if (OidIsValid(heapid))
-		heaprel = heap_open(heapid, ShareLock);
+		heaprel = heap_open(heapid, lockmode);
 	else
 		heaprel = NULL;
 
 	/*
 	 * Open the target index relations separately (like relation_openrv(), but
 	 * with heap relation locked first to prevent deadlocking).  In hot
-	 * standby mode this will raise an error.
+	 * standby mode this will raise an error when parentcheck is true.
 	 */
-	indrel = index_open(indrelid, ExclusiveLock);
+	indrel = index_open(indrelid, lockmode);
 
 	/*
 	 * Since we did the IndexGetRelation call above without any lock, it's
@@ -195,19 +205,25 @@ bt_index_parent_check(PG_FUNCTION_ARGS)
 	btree_index_checkable(indrel);
 
 	/* Check index */
-	bt_check_every_level(indrel, true);
+	bt_check_every_level(indrel, parentcheck);
 
-	/* Close, but keep locks */
-	index_close(indrel, NoLock);
+	/*
+	 * Release locks early. That's ok here because nothing in the called
+	 * routines will trigger shared cache invalidations to be sent, so we can
+	 * relax the usual pattern of only releasing locks after commit.
+	 */
+	index_close(indrel, lockmode);
 	if (heaprel)
-		heap_close(heaprel, NoLock);
-
-	PG_RETURN_VOID();
+		heap_close(heaprel, lockmode);
 }
 
 /*
  * Basic checks about the suitability of a relation for checking as a B-Tree
  * index.
+ *
+ * NB: Intentionally not checking permissions, the function is normally not
+ * callable by non-superusers. If granted, it's useful to be able to check a
+ * whole cluster.
  */
 static inline void
 btree_index_checkable(Relation rel)
@@ -236,7 +252,7 @@ btree_index_checkable(Relation rel)
 }
 
 /*
- * Main entry point for B-Tree SQL-callable functions.	Walks the B-Tree in
+ * Main entry point for B-Tree SQL-callable functions. Walks the B-Tree in
  * logical order, verifying invariants as it goes.
  *
  * It is the caller's responsibility to acquire appropriate heavyweight lock on
@@ -265,7 +281,7 @@ bt_check_every_level(Relation rel, bool exclusivelylocked)
 	BtreeLevel	current;
 
 	/*
-	 * RecentGlobalXmin assertion matches index_getnext_tid().	See note on
+	 * RecentGlobalXmin assertion matches index_getnext_tid().  See note on
 	 * RecentGlobalXmin/B-Tree page deletion.
 	 */
 	Assert(TransactionIdIsValid(RecentGlobalXmin));
@@ -301,15 +317,13 @@ bt_check_every_level(Relation rel, bool exclusivelylocked)
 				(errcode(ERRCODE_NO_DATA),
 				 errmsg("harmless fast root mismatch in index %s",
 						RelationGetRelationName(rel)),
-				 errdetail_internal("Fast block %u (level %u) "
-									"differs from true root "
-									"block %u (level %u).",
+				 errdetail_internal("Fast block %u (level %u) differs from true root block %u (level %u).",
 									metad->btm_fastroot, metad->btm_fastlevel,
 									metad->btm_root, metad->btm_level)));
 
 	/*
 	 * Starting at the root, verify every level.  Move left to right, top to
-	 * bottom.	Note that there may be no pages other than the meta page (meta
+	 * bottom.  Note that there may be no pages other than the meta page (meta
 	 * page can indicate that root is P_NONE when the index is totally empty).
 	 */
 	previouslevel = InvalidBtreeLevel;
@@ -346,9 +360,9 @@ bt_check_every_level(Relation rel, bool exclusivelylocked)
  *
  * Returns state for next call, if any.  This includes left-most block number
  * one level lower that should be passed on next level/call, which is set to
- * P_NONE on last call here (when leaf level is verified).	Level numbers
+ * P_NONE on last call here (when leaf level is verified).  Level numbers
  * follow the nbtree convention: higher levels have higher numbers, because new
- * levels are added only due to a root page split.	Note that prior to the
+ * levels are added only due to a root page split.  Note that prior to the
  * first root page split, the root is also a leaf page, so there is always a
  * level 0 (leaf level), and it's always the last level processed.
  *
@@ -610,8 +624,7 @@ bt_target_page_check(BtreeCheckState *state)
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("high key invariant violated for index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to %s tid=%s "
-										"page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s points to %s tid=%s page lsn=%X/%X.",
 										itid,
 										P_ISLEAF(topaque) ? "heap" : "index",
 										htid,
@@ -676,7 +689,7 @@ bt_target_page_check(BtreeCheckState *state)
 		 * already been performed for every other "real" item on target page
 		 * when last item is checked.  The difference is that the next item
 		 * (the item that is compared to target's last item) needs to come
-		 * from the next/sibling page.	There may not be such an item
+		 * from the next/sibling page.  There may not be such an item
 		 * available from sibling for various reasons, though (e.g., target is
 		 * the rightmost page on level).
 		 */
@@ -716,8 +729,7 @@ bt_target_page_check(BtreeCheckState *state)
 						(errcode(ERRCODE_INDEX_CORRUPTED),
 						 errmsg("cross page item order invariant violated for index \"%s\"",
 								RelationGetRelationName(state->rel)),
-						 errdetail_internal("Last item on page tid=(%u,%u) "
-											"page lsn=%X/%X.",
+						 errdetail_internal("Last item on page tid=(%u,%u) page lsn=%X/%X.",
 											state->targetblock, offset,
 											(uint32) (state->targetlsn >> 32),
 											(uint32) state->targetlsn)));
@@ -787,7 +799,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * (or left link, or downlink) and find its sibling, because recycling
 	 * does not occur until no possible index scan could land on the page.
 	 * Index scans can follow links with nothing more than their snapshot as
-	 * an interlock and be sure of at least that much.	(See page
+	 * an interlock and be sure of at least that much.  (See page
 	 * recycling/RecentGlobalXmin notes in nbtree README.)
 	 *
 	 * Furthermore, it's okay if we follow a rightlink and find a half-dead or
@@ -860,7 +872,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * deleted) implies that target page was not merged into from the right by
 	 * deletion; the key space at or after target never moved left.  Target's
 	 * parent either has the same downlink to target as before, or a <=
-	 * downlink due to deletion at the left of target.	Target either has the
+	 * downlink due to deletion at the left of target.  Target either has the
 	 * same highkey as before, or a highkey <= before when there is a page
 	 * split. (The rightmost concurrently-split-from-target-page page will
 	 * still have the same highkey as target was originally found to have,
@@ -907,7 +919,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 * that happens only to determine if there was a concurrent page deletion
 	 * of a reference page, and just like us if there was a page deletion of
 	 * that reference page it means we can move on from caring about the
-	 * reference page.	See the nbtree README for a full description of how
+	 * reference page.  See the nbtree README for a full description of how
 	 * that works.
 	 */
 	nline = PageGetMaxOffsetNumber(rightpage);
@@ -946,7 +958,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	}
 
 	/*
-	 * Return first real item scankey.	Note that this relies on right page
+	 * Return first real item scankey.  Note that this relies on right page
 	 * memory remaining allocated.
 	 */
 	return _bt_mkscankey(state->rel,
@@ -956,7 +968,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 /*
  * Checks one of target's downlink against its child page.
  *
- * Conceptually, the target page continues to be what is checked here.	The
+ * Conceptually, the target page continues to be what is checked here.  The
  * target block is still blamed in the event of finding an invariant violation.
  * The downlink insertion into the target is probably where any problem raised
  * here arises, and there is no such thing as a parent link, so doing the
@@ -977,8 +989,8 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 	 *
 	 * NB: In general, page deletion deletes the right sibling's downlink, not
 	 * the downlink of the page being deleted; the deleted page's downlink is
-	 * reused for its sibling.	The key space is thereby consolidated between
-	 * the deleted page and its right sibling.	(We cannot delete a parent
+	 * reused for its sibling.  The key space is thereby consolidated between
+	 * the deleted page and its right sibling.  (We cannot delete a parent
 	 * page's rightmost child unless it is the last child page, and we intend
 	 * to also delete the parent itself.)
 	 *
@@ -1041,11 +1053,10 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
 							RelationGetRelationName(state->rel)),
-				errdetail_internal("Parent block=%u child index tid=(%u,%u) "
-								   "parent page lsn=%X/%X.",
-								   state->targetblock, childblock, offset,
-								   (uint32) (state->targetlsn >> 32),
-								   (uint32) state->targetlsn)));
+					 errdetail_internal("Parent block=%u child index tid=(%u,%u) parent page lsn=%X/%X.",
+									  state->targetblock, childblock, offset,
+										(uint32) (state->targetlsn >> 32),
+										(uint32) state->targetlsn)));
 	}
 
 	pfree(child);
@@ -1066,7 +1077,7 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
 {
 	/*
 	 * For internal pages only, the first item after high key, if any, is
-	 * negative infinity item.	Internal pages always have a negative infinity
+	 * negative infinity item.  Internal pages always have a negative infinity
 	 * item, whereas leaf pages never have one.  This implies that negative
 	 * infinity item is either first or second line item, or there is none
 	 * within page.
diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml
index 168d3f4397..893a5b41d9 100644
--- a/doc/src/sgml/amcheck.sgml
+++ b/doc/src/sgml/amcheck.sgml
@@ -24,7 +24,7 @@
   collated lexical order).  If that particular invariant somehow fails
   to hold, we can expect binary searches on the affected page to
   incorrectly guide index scans, resulting in wrong answers to SQL
-  queries.  Problems like this can be very subtle.
+  queries.
  </para>
  <para>
   Verification is performed using the same procedures as those used by
@@ -90,9 +90,9 @@ ORDER BY c.relpages DESC LIMIT 10;
       database where verification is supported.
      </para>
      <para>
-      An <literal>AccessShareLock</> is acquired on the target index
-      by <function>bt_index_check</function>.  This lock mode is the
-      same lock mode acquired on relations by simple
+      <function>bt_index_check</function> acquires an <literal>AccessShareLock</>
+      on the target index and the heap relation it belongs to. This lock mode
+      is the same lock mode acquired on relations by simple
       <literal>SELECT</> statements.
       <function>bt_index_check</function> does not verify invariants
       that span child/parent relationships, nor does it verify that
@@ -132,7 +132,7 @@ ORDER BY c.relpages DESC LIMIT 10;
       logical inconsistency or other problem.
      </para>
      <para>
-      An <literal>ExclusiveLock</> is required on the target index by
+      A <literal>ShareLock</> is required on the target index by
       <function>bt_index_parent_check</function> (a
       <literal>ShareLock</> is also acquired on the heap relation).
       These locks prevent concurrent data modification from
@@ -140,9 +140,7 @@ ORDER BY c.relpages DESC LIMIT 10;
       commands.  The locks also prevent the underlying relation from
       being concurrently processed by <command>VACUUM</>, as well as
       all other utility commands.  Note that the function holds locks
-      for the entire duration of the transaction, so there can be a
-      significant advantage to verifying each index individually, each
-      in its own dedicated transaction.
+      only while running, not for the entire transaction.
      </para>
      <para>
       <function>bt_index_parent_check</function>'s additional
-- 
2.11.0.22.g8d7a455.dirty

#84

Peter Geoghegan

pg@bowt.ie

almost 9 years ago

In reply to: Andres Freund (#83)

Re: amcheck (B-Tree integrity checking tool)

On Thu, Mar 9, 2017 at 3:52 PM, Andres Freund <andres@anarazel.de> wrote:

But I also think it's more important to get some initial verification
functionality in, than resolving this conflict. I do, also quite
strongly, think we'll be better served with having something like what
you're proposing than nothing, and I don't have time/energy to turn your
patch into what I'm envisioning, nor necessarily the buyin. Hence I'm
planning to commit the current interface.

Attached is your original patch, and a patch editorializing things. I do
plan to merge those before pushing.

Your revisions all seem fine. No problems with any of it.

Who would you like to see added to Reviewed-By?

I am generally in favor of more inclusive Reviewed-By lists. I suggest
that we expand it to:

"Reviewed-By: Andres Freund, Thomas Vondra, Thomas Munro, Anastasia
Lubennikova, Robert Haas, Amit Langote"

Thanks for your help with this!

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#85

Andres Freund

andres@anarazel.de

almost 9 years ago

In reply to: Peter Geoghegan (#84)

Re: amcheck (B-Tree integrity checking tool)

On 2017-03-09 16:29:24 -0800, Peter Geoghegan wrote:

I am generally in favor of more inclusive Reviewed-By lists. I suggest
that we expand it to:

"Reviewed-By: Andres Freund, Thomas Vondra, Thomas Munro, Anastasia
Lubennikova, Robert Haas, Amit Langote"

And pushed with these.

I don't really expect buildfarm fallout, but ..

- Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#86

Peter Geoghegan

pg@bowt.ie

almost 9 years ago

In reply to: Andres Freund (#85)

Re: amcheck (B-Tree integrity checking tool)

On Thu, Mar 9, 2017 at 4:41 PM, Andres Freund <andres@anarazel.de> wrote:

And pushed with these.

Thanks.

I don't really expect buildfarm fallout, but ..

Unfortunately, there is some on Windows:

verify_nbtree.obj : error LNK2001: unresolved external symbol
RecentGlobalXmin
[C:\buildfarm\buildenv\HEAD\pgsql.build\amcheck.vcxproj]

Rather than marking RecentGlobalXmin as PGDLLIMPORT, I'd rather just
remove the documenting assertion and leave that comment as-is. I'll
work on a patch for this soon.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#87

Andres Freund

andres@anarazel.de

almost 9 years ago

In reply to: Peter Geoghegan (#86)

Re: amcheck (B-Tree integrity checking tool)

Hi,

On 2017-03-09 19:04:46 -0800, Peter Geoghegan wrote:

On Thu, Mar 9, 2017 at 4:41 PM, Andres Freund <andres@anarazel.de> wrote:

I don't really expect buildfarm fallout, but ..

Unfortunately, there is some on Windows:

Thanks for monitoring.

verify_nbtree.obj : error LNK2001: unresolved external symbol
RecentGlobalXmin
[C:\buildfarm\buildenv\HEAD\pgsql.build\amcheck.vcxproj]

Rather than marking RecentGlobalXmin as PGDLLIMPORT, I'd rather just
remove the documenting assertion and leave that comment as-is. I'll
work on a patch for this soon.

Hm - I think it's fair to export RecentGlobalXmin, given that we
obviously use it across modules in core code. I see very little reason
not to export it.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#88

Peter Geoghegan

pg@bowt.ie

almost 9 years ago

In reply to: Andres Freund (#87)

Re: amcheck (B-Tree integrity checking tool)

On Thu, Mar 9, 2017 at 7:10 PM, Andres Freund <andres@anarazel.de> wrote:

verify_nbtree.obj : error LNK2001: unresolved external symbol
RecentGlobalXmin
[C:\buildfarm\buildenv\HEAD\pgsql.build\amcheck.vcxproj]

Rather than marking RecentGlobalXmin as PGDLLIMPORT, I'd rather just
remove the documenting assertion and leave that comment as-is. I'll
work on a patch for this soon.

Hm - I think it's fair to export RecentGlobalXmin, given that we
obviously use it across modules in core code. I see very little reason
not to export it.

Well, the assertion is completely useless as anything but documentation...

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#89

Peter Geoghegan

pg@bowt.ie

almost 9 years ago

In reply to: Peter Geoghegan (#88)

1 attachment(s)

Re: amcheck (B-Tree integrity checking tool)

On Thu, Mar 9, 2017 at 7:12 PM, Peter Geoghegan <pg@bowt.ie> wrote:

Hm - I think it's fair to export RecentGlobalXmin, given that we
obviously use it across modules in core code. I see very little reason
not to export it.

Well, the assertion is completely useless as anything but documentation...

I came up with the attached.

--
Peter Geoghegan

Attachments:

0001-Fix-amcheck-build-on-Windows.patchtext/x-patch; charset=US-ASCII; name=0001-Fix-amcheck-build-on-Windows.patchDownload

From 05a8c0de7da7cd9ee4d8436abdf8368c1cf8ac19 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Thu, 9 Mar 2017 19:17:07 -0800
Subject: [PATCH] Fix amcheck build on Windows.

Remove documenting assertion around RecentGlobalXmin, rather than
marking it PGDLLIMPORT.

In passing, update some comments that did not reflect the state of
amcheck as-committed.
---
 contrib/amcheck/verify_nbtree.c | 74 +++++++++++++++++++----------------------
 1 file changed, 35 insertions(+), 39 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 5724aa6..d21b209 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -26,7 +26,6 @@
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "utils/memutils.h"
-#include "utils/snapmgr.h"
 
 
 PG_MODULE_MAGIC;
@@ -55,8 +54,8 @@ typedef struct BtreeCheckState
 
 	/* B-Tree Index Relation */
 	Relation	rel;
-	/* ExclusiveLock held? */
-	bool		exclusivelylocked;
+	/* ShareLock held on heap/index, rather than AccessShareLock? */
+	bool		strongerlock;
 	/* Per-page context */
 	MemoryContext targetcontext;
 	/* Buffer access strategy */
@@ -94,7 +93,7 @@ PG_FUNCTION_INFO_V1(bt_index_parent_check);
 
 static void bt_index_check_internal(Oid indrelid, bool parentcheck);
 static inline void btree_index_checkable(Relation rel);
-static void bt_check_every_level(Relation rel, bool exclusivelylocked);
+static void bt_check_every_level(Relation rel, bool strongerlock);
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
@@ -256,23 +255,26 @@ btree_index_checkable(Relation rel)
  * logical order, verifying invariants as it goes.
  *
  * It is the caller's responsibility to acquire appropriate heavyweight lock on
- * the index relation, and advise us if extra checks are safe when an
- * ExclusiveLock is held.
+ * the index relation, and advise us if extra checks are safe when a ShareLock
+ * is held.
  *
- * An ExclusiveLock is generally assumed to prevent any kind of physical
+ * A ShareLock is generally assumed to prevent any kind of physical
  * modification to the index structure, including modifications that VACUUM may
  * make.  This does not include setting of the LP_DEAD bit by concurrent index
  * scans, although that is just metadata that is not able to directly affect
  * any check performed here.  Any concurrent process that might act on the
  * LP_DEAD bit being set (recycle space) requires a heavyweight lock that
- * cannot be held while we hold an ExclusiveLock.  (Besides, even if that could
+ * cannot be held while we hold a ShareLock.  (Besides, even if that could
  * happen, the ad-hoc recycling when a page might otherwise split is performed
  * per-page, and requires an exclusive buffer lock, which wouldn't cause us
  * trouble.  _bt_delitems_vacuum() may only delete leaf items, and so the extra
  * parent/child check cannot be affected.)
+ *
+ * Note that we rely on RecentGlobalXmin interlock against recycling, just like
+ * standard index scans.  See note on RecentGlobalXmin/B-Tree page deletion.
  */
 static void
-bt_check_every_level(Relation rel, bool exclusivelylocked)
+bt_check_every_level(Relation rel, bool strongerlock)
 {
 	BtreeCheckState *state;
 	Page		metapage;
@@ -281,17 +283,11 @@ bt_check_every_level(Relation rel, bool exclusivelylocked)
 	BtreeLevel	current;
 
 	/*
-	 * RecentGlobalXmin assertion matches index_getnext_tid().  See note on
-	 * RecentGlobalXmin/B-Tree page deletion.
-	 */
-	Assert(TransactionIdIsValid(RecentGlobalXmin));
-
-	/*
 	 * Initialize state for entire verification operation
 	 */
 	state = palloc(sizeof(BtreeCheckState));
 	state->rel = rel;
-	state->exclusivelylocked = exclusivelylocked;
+	state->strongerlock = strongerlock;
 	/* Create context for page */
 	state->targetcontext = AllocSetContextCreate(CurrentMemoryContext,
 												 "amcheck context",
@@ -353,7 +349,7 @@ bt_check_every_level(Relation rel, bool exclusivelylocked)
 
 /*
  * Given a left-most block at some level, move right, verifying each page
- * individually (with more verification across pages for "exclusivelylocked"
+ * individually (with more verification across pages for "strongerlock"
  * callers).  Caller should pass the true root page as the leftmost initially,
  * working their way down by passing what is returned for the last call here
  * until level 0 (leaf page level) was reached.
@@ -428,7 +424,7 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
 			 * locking, check that the first valid page meets caller's
 			 * expectations.
 			 */
-			if (state->exclusivelylocked)
+			if (state->strongerlock)
 			{
 				if (!P_LEFTMOST(opaque))
 					ereport(ERROR,
@@ -482,7 +478,7 @@ bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
 			 */
 		}
 
-		if (state->exclusivelylocked && opaque->btpo_prev != leftcurrent)
+		if (state->strongerlock && opaque->btpo_prev != leftcurrent)
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("left link/right link pair in index \"%s\" not in agreement",
@@ -541,7 +537,7 @@ nextpage:
  *	 "real" data item on the page to the right (if such a first item is
  *	 available).
  *
- * Furthermore, when state passed shows ExclusiveLock held, and target page is
+ * Furthermore, when state passed shows ShareLock held, and target page is
  * internal page, function also checks:
  *
  * - That all child pages respect downlinks lower bound.
@@ -597,8 +593,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 * page items.
 		 *
 		 * We prefer to check all items against high key rather than checking
-		 * just the first and trusting that the operator class obeys the
-		 * transitive law (which implies that all subsequent items also
+		 * just the last and trusting that the operator class obeys the
+		 * transitive law (which implies that all previous items also
 		 * respected the high key invariant if they pass the item order
 		 * check).
 		 *
@@ -705,12 +701,12 @@ bt_target_page_check(BtreeCheckState *state)
 			{
 				/*
 				 * As explained at length in bt_right_page_check_scankey(),
-				 * there is a known !exclusivelylocked race that could account
+				 * there is a known !strongerlock race that could account
 				 * for apparent violation of invariant, which we must check
 				 * for before actually proceeding with raising error.  Our
 				 * canary condition is that target page was deleted.
 				 */
-				if (!state->exclusivelylocked)
+				if (!state->strongerlock)
 				{
 					/* Get fresh copy of target page */
 					state->target = palloc_btree_page(state, state->targetblock);
@@ -718,7 +714,7 @@ bt_target_page_check(BtreeCheckState *state)
 					topaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
 
 					/*
-					 * All !exclusivelylocked checks now performed; just
+					 * All !strongerlock checks now performed; just
 					 * return
 					 */
 					if (P_IGNORE(topaque))
@@ -740,11 +736,11 @@ bt_target_page_check(BtreeCheckState *state)
 		 * * Downlink check *
 		 *
 		 * Additional check of child items iff this is an internal page and
-		 * caller holds an ExclusiveLock.  This happens for every downlink
-		 * (item) in target excluding the negative-infinity downlink (again,
-		 * this is because it has no useful value to compare).
+		 * caller holds a ShareLock.  This happens for every downlink (item) in
+		 * target excluding the negative-infinity downlink (again, this is
+		 * because it has no useful value to compare).
 		 */
-		if (!P_ISLEAF(topaque) && state->exclusivelylocked)
+		if (!P_ISLEAF(topaque) && state->strongerlock)
 		{
 			BlockNumber childblock = ItemPointerGetBlockNumber(&(itup->t_tid));
 
@@ -766,7 +762,7 @@ bt_target_page_check(BtreeCheckState *state)
  * with different parent page).  If no such valid item is available, return
  * NULL instead.
  *
- * Note that !exclusivelylocked callers must reverify that target page has not
+ * Note that !strongerlock callers must reverify that target page has not
  * been concurrently deleted.
  */
 static ScanKey
@@ -833,14 +829,14 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	}
 
 	/*
-	 * No ExclusiveLock held case -- why it's safe to proceed.
+	 * No ShareLock held case -- why it's safe to proceed.
 	 *
 	 * Problem:
 	 *
 	 * We must avoid false positive reports of corruption when caller treats
 	 * item returned here as an upper bound on target's last item.  In
 	 * general, false positives are disallowed.  Avoiding them here when
-	 * caller is !exclusivelylocked is subtle.
+	 * caller is !strongerlock is subtle.
 	 *
 	 * A concurrent page deletion by VACUUM of the target page can result in
 	 * the insertion of items on to this right sibling page that would
@@ -892,7 +888,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	 *
 	 * Top level tree walk caller moves on to next page (makes it the new
 	 * target) following recovery from this race.  (cf.  The rationale for
-	 * child/downlink verification needing an ExclusiveLock within
+	 * child/downlink verification needing a ShareLock within
 	 * bt_downlink_check(), where page deletion is also the main source of
 	 * trouble.)
 	 *
@@ -984,8 +980,8 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 	BTPageOpaque copaque;
 
 	/*
-	 * Caller must have ExclusiveLock on target relation, because of
-	 * considerations around page deletion by VACUUM.
+	 * Caller must have ShareLock on target relation, because of considerations
+	 * around page deletion by VACUUM.
 	 *
 	 * NB: In general, page deletion deletes the right sibling's downlink, not
 	 * the downlink of the page being deleted; the deleted page's downlink is
@@ -994,8 +990,8 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 	 * page's rightmost child unless it is the last child page, and we intend
 	 * to also delete the parent itself.)
 	 *
-	 * If this verification happened without an ExclusiveLock, the following
-	 * race condition could cause false positives:
+	 * If this verification happened without a ShareLock, the following race
+	 * condition could cause false positives:
 	 *
 	 * In general, concurrent page deletion might occur, including deletion of
 	 * the left sibling of the child page that is examined here.  If such a
@@ -1009,7 +1005,7 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 	 * part of deletion's first phase.)
 	 *
 	 * Note that while the cross-page-same-level last item check uses a trick
-	 * that allows it to perform verification for !exclusivelylocked callers,
+	 * that allows it to perform verification for !strongerlock callers,
 	 * a similar trick seems difficult here.  The trick that that other check
 	 * uses is, in essence, to lock down race conditions to those that occur
 	 * due to concurrent page deletion of the target; that's a race that can
@@ -1021,7 +1017,7 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 	 * locks on both child and left-of-child pages).  That's unacceptable for
 	 * amcheck functions on general principle, though.
 	 */
-	Assert(state->exclusivelylocked);
+	Assert(state->strongerlock);
 
 	/*
 	 * Verify child page has the downlink key from target page (its parent) as
-- 
2.7.4