pg_amcheck contrib application

Started by Mark Dilger almost 5 years ago, 161 messages
#1 Mark Dilger
mark.dilger@enterprisedb.com
3 attachment(s)

New thread, was "Re: new heapcheck contrib module"

On Mar 2, 2021, at 10:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 2, 2021 at 12:10 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

On further reflection, I decided to implement these changes and not worry about the behavioral change.

Thanks.

I skipped this part. The initcmd argument is only handed to ParallelSlotsGetIdle(). Doing as you suggest would not really be simpler, it would just move that argument to ParallelSlotsSetup(). But I don't feel strongly about it, so I can move this, too, if you like.

I didn't do this either, and for the same reason. It's just a parameter to ParallelSlotsGetIdle(), so nothing is really gained by moving it to ParallelSlotsSetup().

OK. I thought it was more natural to pass a bunch of arguments at
setup time rather than passing a bunch of arguments at get-idle time,
but I don't feel strongly enough about it to insist, and somebody else
can always change it later if they decide I had the right idea.

When you originally proposed the idea, I thought that it would work out as a simpler interface to have it your way, but in terms of the interface it came out about the same. Internally it is still simpler to do it your way, so since you seem to still like your way better, this next version has it that way.

Rather than having the slots user tweak the slot's ConnParams, ParallelSlotsGetIdle() now takes a dbname argument and uses it as ConnParams->override_dbname.

OK, but you forgot to update the comments. ParallelSlotsGetIdle()
still talks about a cparams argument that it no longer has.

Fixed.

The usual idiom for sizing a memory allocation involving
FLEXIBLE_ARRAY_MEMBER is something like offsetof(ParallelSlotArray,
slots) + numslots * sizeof(ParallelSlot). Your version uses sizeof();
don't.

Fixed.
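For reference, the sizing idiom Robert describes can be shown with a self-contained sketch. The struct here is a stand-in, not the real ParallelSlotArray (in PostgreSQL, FLEXIBLE_ARRAY_MEMBER expands to an empty `[]` declarator or similar depending on compiler support):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct Slot
{
	int			fd;
} Slot;

typedef struct SlotArray
{
	int			numslots;
	Slot		slots[];		/* C99 flexible array member */
} SlotArray;

static SlotArray *
slot_array_create(int numslots)
{
	/*
	 * Size the allocation as offsetof(struct, flexible_member) plus the
	 * array payload.  Using sizeof(SlotArray) instead can over-allocate
	 * because of trailing padding, and obscures the intent.
	 */
	SlotArray  *sa = calloc(1, offsetof(SlotArray, slots) +
							numslots * sizeof(Slot));

	sa->numslots = numslots;
	return sa;
}
```

The offsetof() form makes it explicit that the header and the variable-length tail are being sized separately.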

Other than that 0001 looks to me to be in pretty good shape now.

And your other review email, also moved to this new thread....

On Mar 2, 2021, at 12:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 2, 2021 at 1:24 PM Robert Haas <robertmhaas@gmail.com> wrote:

Other than that 0001 looks to me to be in pretty good shape now.

Incidentally, we might want to move this to a new thread with a better
subject line, since the current subject line really doesn't describe
the uncommitted portion of the work. And create a new CF entry, too.

Moved here.

Moving onto 0002:

The index checking options should really be called btree index
checking options. I think I'd put the table options first, and the
btree options second. Other kinds of indexes could follow some day. I
would personally omit the short forms of --heapallindexed and
--parent-check; I think we'll run out of option names too quickly if
people add more kinds of checks.

Done.

While doing this, I also renamed some of the variables to more closely match the option name. I think the code is clearer now.

Perhaps VerifyBtreeSlotHandler should emit a warning of some kind if
PQntuples(res) != 0.

The functions bt_index_check and bt_index_parent_check are defined to return VOID, which results in PQntuples(res) == 1. I added code to verify this condition, but it only serves to alert the user if the amcheck version is behaving in an unexpected way, perhaps due to an amcheck/pg_amcheck version mismatch.

+               /*
+                * Test that this function works, but for now we're
not using the list
+                * 'relations' that it builds.
+                */
+               conn = connectDatabase(&cparams, progname, opts.echo,
false, true);

This comment appears to have nothing to do with the code, since
connectDatabase() does not build a list of 'relations'.

True. Removed.

amcheck_sql seems to include paranoia, but do we need that if we're
using a secure search path? Similarly for other SQL queries, e.g. in
prepare_table_command.

I removed the OPERATOR(pg_catalog.=) paranoia.

It might not be strictly necessary for the static functions in
pg_amcheck.c to use_three completelyDifferent NamingConventions for
its static functions.

The idea is that the functions that interoperate with parallel slots would follow its NamingConvention; those interoperating with patternToSQLRegex and PQExpBuffers would follow their namingConvention; and those not so interoperating would follow a less obnoxious naming_convention. To my eye, that color-codes the function names in a useful way. To your eye, it just looks awful. I've changed it to use just one naming_convention.

should_processing_continue() is one semicolon over budget.

That's not the first time I've done that recently. Removed.

The initializer for opts puts a comma even after the last member
initializer. Is that going to be portable to all compilers?

I don't know. I learned to put commas at the end of lists back when I did mostly perl programming, as you get cleaner diffs when you add more stuff to the list later. Whether I can get away with that in C using initializers I don't know. I don't have a multiplicity of compilers to check.

I have removed the extra comma.
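For what it's worth, the trailing comma has been legal in C initializer lists since C89 (the grammar allows `{ initializer-list , }`), and C99 designated initializers accept it as well, so removing it is a style call rather than a portability fix. A minimal sketch (the Options struct is invented for illustration):

```c
#include <assert.h>

typedef struct Options
{
	int			jobs;
	int			echo;
} Options;

/*
 * The comma after the last initializer is permitted by the C grammar;
 * it exists so that adding a member later produces a one-line diff.
 */
static const Options default_opts = {
	.jobs = 1,
	.echo = 0,
};
```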

+ for (failed = false, cell = opts.include.head; cell; cell = cell->next)

I think failed has to be false here, because it gets initialized at
the top of the function. If we need to reinitialize it for some
reason, I would prefer you do that on the previous line, separate from
the for loop stuff.

It does have to be false there. There is no need to reinitialize it.

+   char       *dbrgx;          /* Database regexp parsed from pattern, or
+                                * NULL */
+   char       *nsprgx;         /* Schema regexp parsed from pattern, or NULL */
+   char       *relrgx;         /* Relation regexp parsed from pattern, or
+                                * NULL */
+   bool        tblonly;        /* true if relrgx should only match tables */
+   bool        idxonly;        /* true if relrgx should only match indexes */

Maybe: db_regex, nsp_regex, rel_regex, table_only, index_only?

Just because it seems theoretically possible that someone will see
nsprgx and not immediately understand what it's supposed to mean, even
if they know that nsp is a common abbreviation for namespace in
PostgreSQL code, and even if they also know what a regular expression
is.

Changed. Along the way, I noticed that "tbl" and "idx" were being used in C/SQL both to mean ("table_only", "index_only") in some contexts and ("is_table", "is_index") in others, so I replaced all instances of "tbl" and "idx" with the unambiguous labels.

Your four messages about there being nothing to check seem like they
could be consolidated down to one: "nothing to check for pattern
\"%s\"".

I anticipated your review comment, but I'm worried about the case that somebody runs

pg_amcheck -t "foo" -i "foo"

and one of those matches and the other does not. The message 'nothing to check for pattern "foo"' will be wrong (because there was something to check for it) and unhelpful (because it doesn't say which pattern failed to match).

I would favor changing things so that once argument parsing is
complete, we switch to reporting all errors that way. So in other
words here, and everything that follows:

+ fprintf(stderr, "%s: no databases to check\n", progname);

Same concern about the output for

pg_amcheck -t "foo" -i "foo" -d "foo"

You might think I'm being silly here, as database names, table names, and index names should in normal usage not be hard for the user to distinguish. But consider

pg_amcheck "mydb.myschema.mytable"

If it says, 'nothing to check for pattern "mydb.myschema.mytable"', you don't know if the database doesn't exist or if the table doesn't exist.
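To illustrate the point, here is a hypothetical sketch of splitting a three-part pattern so that each component can be reported separately; split_pattern and its fixed-size buffers are inventions for this example, and pg_amcheck's real parser (patternToSQLRegex) additionally handles quoting and regex metacharacters:

```c
#include <string.h>

/*
 * Split a "db.schema.rel" pattern into up to three dot-separated parts,
 * so a failure message can name the component that matched nothing
 * rather than echoing the whole pattern.  Returns the number of parts.
 */
static int
split_pattern(const char *pattern, char parts[3][64])
{
	int			n = 0;
	const char *start = pattern;
	const char *p;

	for (p = pattern; n < 3; p++)
	{
		if (*p == '.' || *p == '\0')
		{
			size_t		len = (size_t) (p - start);

			if (len >= 64)
				len = 63;		/* truncate over-long components */
			memcpy(parts[n], start, len);
			parts[n][len] = '\0';
			n++;
			if (*p == '\0')
				break;
			start = p + 1;
		}
	}
	return n;
}
```

With the parts separated, the error report can say which of "mydb", "myschema", or "mytable" failed to match, which is the behavior being argued for above.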

+ * ParallelSlots based event loop follows.

"Main event loop."

Changed.

To me it would read slightly better to change each reference to
"relations list" to "list of relations", but perhaps that is too
nitpicky.

No harm picking those nits. Changed.

I think the two instances of goto finish could be avoided with not
much work. At most a few things need to happen only if !failed, and
maybe not even that, if you just said "break;" instead.

Good point. Changed.

+ * Note: Heap relation corruption is returned by verify_heapam() without the
+ * use of raising errors, but running verify_heapam() on a corrupted table may

How about "Heap relation corruption() is reported by verify_heapam()
via the result set, rather than an ERROR, ..."

Changed, though I assumed your parens for corruption() were not intended.

Ok, so now you've moved on to reviewing the regression tests....

It seems mighty inefficient to have a whole bunch of consecutive calls
to remove_relation_file() or corrupt_first_page() when every such call
stops and restarts the database. I would guess these tests will run
noticeably faster if you don't do that. Either the functions need to
take a list of arguments, or the stop/start needs to be pulled up and
done in the caller.

Changed.

corrupt_first_page() could use a comment explaining what exactly we're
overwriting, and in particular noting that we don't want to just
clobber the LSN, but rather something where we can detect a wrong
value.

Added comments that we're skipping past the PageHeader and overwriting garbage starting in the line pointers.

There's a long list of calls to command_checks_all() in 003_check.pl
that don't actually check anything but that the command failed, but
run it with a bunch of different options. I don't understand the value
of that, and suggest reducing the number of cases tested. If you want,
you can have tests elsewhere that focus -- perhaps by using verbose
mode -- on checking that the right tables are being checked.

This should be better in this next patch series.

This is not yet a full review of everything in this patch -- I haven't
sorted through all of the tests yet, or all of the new query
construction logic -- but to me this looks pretty close to
committable.

Thanks for the review!

Attachments:

v42-0001-Reworking-ParallelSlots-for-mutliple-DB-use.patch (application/octet-stream)
From 79c357606f06c6422ed2039b6395a59050a4cff1 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Wed, 3 Mar 2021 07:16:55 -0800
Subject: [PATCH v42 1/3] Reworking ParallelSlots for mutliple DB use

The existing implementation of ParallelSlots is used by reindexdb
and vacuumdb to process tables in parallel in only one database at
a time.  The ParallelSlots interface reflects this usage pattern.
The function to set up the slots assumes all slots should be
connected to the same database, and the function for getting the
next idle slot pays no attention to which database the slot may be
connected to.

In anticipation of pg_amcheck using parallel slots to process
multiple databases in parallel, reworking the interface while
trying to remain reasonably simple for reindexdb and vacuumdb to
use:

ParallelSlotsSetup() no longer creates or receives database
connections.  It takes arguments that it stores for use in
subsequent operations when a connection needs to be formed.

Callers who already have a connection and want to reuse it can give
it to the parallel slots using a new function,
ParallelSlotsAdoptConn().  Both reindexdb and vacuumdb use this.

ParallelSlotsGetIdle() is extended to take a dbname argument
indicating the database to which a connection is desired, and to
manage a heterogeneous set of slots potentially connected to varying
databases and some perhaps not yet connected.  The function will
reuse an existing connection or form a new connection as necessary.

The logic for determining whether a slot's connection is suitable
for reuse is based on the database the slot's connection is
connected to, and whether that matches the database desired.  Other
connection parameters (user, host, port, etc.) are assumed not to
change from slot to slot.
---
 src/bin/scripts/reindexdb.c          |  17 +-
 src/bin/scripts/vacuumdb.c           |  46 +--
 src/fe_utils/parallel_slot.c         | 411 +++++++++++++++++++--------
 src/include/fe_utils/parallel_slot.h |  27 +-
 src/tools/pgindent/typedefs.list     |   2 +
 5 files changed, 342 insertions(+), 161 deletions(-)

diff --git a/src/bin/scripts/reindexdb.c b/src/bin/scripts/reindexdb.c
index cf28176243..fc0681538a 100644
--- a/src/bin/scripts/reindexdb.c
+++ b/src/bin/scripts/reindexdb.c
@@ -36,7 +36,7 @@ static SimpleStringList *get_parallel_object_list(PGconn *conn,
 												  ReindexType type,
 												  SimpleStringList *user_list,
 												  bool echo);
-static void reindex_one_database(const ConnParams *cparams, ReindexType type,
+static void reindex_one_database(ConnParams *cparams, ReindexType type,
 								 SimpleStringList *user_list,
 								 const char *progname,
 								 bool echo, bool verbose, bool concurrently,
@@ -330,7 +330,7 @@ main(int argc, char *argv[])
 }
 
 static void
-reindex_one_database(const ConnParams *cparams, ReindexType type,
+reindex_one_database(ConnParams *cparams, ReindexType type,
 					 SimpleStringList *user_list,
 					 const char *progname, bool echo,
 					 bool verbose, bool concurrently, int concurrentCons,
@@ -341,7 +341,7 @@ reindex_one_database(const ConnParams *cparams, ReindexType type,
 	bool		parallel = concurrentCons > 1;
 	SimpleStringList *process_list = user_list;
 	ReindexType process_type = type;
-	ParallelSlot *slots;
+	ParallelSlotArray *sa;
 	bool		failed = false;
 	int			items_count = 0;
 
@@ -461,7 +461,8 @@ reindex_one_database(const ConnParams *cparams, ReindexType type,
 
 	Assert(process_list != NULL);
 
-	slots = ParallelSlotsSetup(cparams, progname, echo, conn, concurrentCons);
+	sa = ParallelSlotsSetup(concurrentCons, cparams, progname, echo, NULL);
+	ParallelSlotsAdoptConn(sa, conn);
 
 	cell = process_list->head;
 	do
@@ -475,7 +476,7 @@ reindex_one_database(const ConnParams *cparams, ReindexType type,
 			goto finish;
 		}
 
-		free_slot = ParallelSlotsGetIdle(slots, concurrentCons);
+		free_slot = ParallelSlotsGetIdle(sa, NULL);
 		if (!free_slot)
 		{
 			failed = true;
@@ -489,7 +490,7 @@ reindex_one_database(const ConnParams *cparams, ReindexType type,
 		cell = cell->next;
 	} while (cell != NULL);
 
-	if (!ParallelSlotsWaitCompletion(slots, concurrentCons))
+	if (!ParallelSlotsWaitCompletion(sa))
 		failed = true;
 
 finish:
@@ -499,8 +500,8 @@ finish:
 		pg_free(process_list);
 	}
 
-	ParallelSlotsTerminate(slots, concurrentCons);
-	pfree(slots);
+	ParallelSlotsTerminate(sa);
+	pfree(sa);
 
 	if (failed)
 		exit(1);
diff --git a/src/bin/scripts/vacuumdb.c b/src/bin/scripts/vacuumdb.c
index 602fd45c42..7901c41f16 100644
--- a/src/bin/scripts/vacuumdb.c
+++ b/src/bin/scripts/vacuumdb.c
@@ -45,7 +45,7 @@ typedef struct vacuumingOptions
 } vacuumingOptions;
 
 
-static void vacuum_one_database(const ConnParams *cparams,
+static void vacuum_one_database(ConnParams *cparams,
 								vacuumingOptions *vacopts,
 								int stage,
 								SimpleStringList *tables,
@@ -408,7 +408,7 @@ main(int argc, char *argv[])
  * a list of tables from the database.
  */
 static void
-vacuum_one_database(const ConnParams *cparams,
+vacuum_one_database(ConnParams *cparams,
 					vacuumingOptions *vacopts,
 					int stage,
 					SimpleStringList *tables,
@@ -421,13 +421,14 @@ vacuum_one_database(const ConnParams *cparams,
 	PGresult   *res;
 	PGconn	   *conn;
 	SimpleStringListCell *cell;
-	ParallelSlot *slots;
+	ParallelSlotArray *sa;
 	SimpleStringList dbtables = {NULL, NULL};
 	int			i;
 	int			ntups;
 	bool		failed = false;
 	bool		tables_listed = false;
 	bool		has_where = false;
+	const char *initcmd;
 	const char *stage_commands[] = {
 		"SET default_statistics_target=1; SET vacuum_cost_delay=0;",
 		"SET default_statistics_target=10; RESET vacuum_cost_delay;",
@@ -684,26 +685,25 @@ vacuum_one_database(const ConnParams *cparams,
 		concurrentCons = 1;
 
 	/*
-	 * Setup the database connections. We reuse the connection we already have
-	 * for the first slot.  If not in parallel mode, the first slot in the
-	 * array contains the connection.
+	 * All slots need to be prepared to run the appropriate analyze stage, if
+	 * caller requested that mode.  We have to prepare the initial connection
+	 * ourselves before setting up the slots.
 	 */
-	slots = ParallelSlotsSetup(cparams, progname, echo, conn, concurrentCons);
+	if (stage == ANALYZE_NO_STAGE)
+		initcmd = NULL;
+	else
+	{
+		initcmd = stage_commands[stage];
+		executeCommand(conn, initcmd, echo);
+	}
 
 	/*
-	 * Prepare all the connections to run the appropriate analyze stage, if
-	 * caller requested that mode.
+	 * Setup the database connections. We reuse the connection we already have
+	 * for the first slot.  If not in parallel mode, the first slot in the
+	 * array contains the connection.
 	 */
-	if (stage != ANALYZE_NO_STAGE)
-	{
-		int			j;
-
-		/* We already emitted the message above */
-
-		for (j = 0; j < concurrentCons; j++)
-			executeCommand((slots + j)->connection,
-						   stage_commands[stage], echo);
-	}
+	sa = ParallelSlotsSetup(concurrentCons, cparams, progname, echo, initcmd);
+	ParallelSlotsAdoptConn(sa, conn);
 
 	initPQExpBuffer(&sql);
 
@@ -719,7 +719,7 @@ vacuum_one_database(const ConnParams *cparams,
 			goto finish;
 		}
 
-		free_slot = ParallelSlotsGetIdle(slots, concurrentCons);
+		free_slot = ParallelSlotsGetIdle(sa, NULL);
 		if (!free_slot)
 		{
 			failed = true;
@@ -740,12 +740,12 @@ vacuum_one_database(const ConnParams *cparams,
 		cell = cell->next;
 	} while (cell != NULL);
 
-	if (!ParallelSlotsWaitCompletion(slots, concurrentCons))
+	if (!ParallelSlotsWaitCompletion(sa))
 		failed = true;
 
 finish:
-	ParallelSlotsTerminate(slots, concurrentCons);
-	pg_free(slots);
+	ParallelSlotsTerminate(sa);
+	pg_free(sa);
 
 	termPQExpBuffer(&sql);
 
diff --git a/src/fe_utils/parallel_slot.c b/src/fe_utils/parallel_slot.c
index b625deb254..a09e5460e5 100644
--- a/src/fe_utils/parallel_slot.c
+++ b/src/fe_utils/parallel_slot.c
@@ -25,31 +25,23 @@
 #include "common/logging.h"
 #include "fe_utils/cancel.h"
 #include "fe_utils/parallel_slot.h"
+#include "fe_utils/query_utils.h"
 
 #define ERRCODE_UNDEFINED_TABLE  "42P01"
 
-static void init_slot(ParallelSlot *slot, PGconn *conn);
 static int	select_loop(int maxFd, fd_set *workerset);
 static bool processQueryResult(ParallelSlot *slot, PGresult *result);
 
-static void
-init_slot(ParallelSlot *slot, PGconn *conn)
-{
-	slot->connection = conn;
-	/* Initially assume connection is idle */
-	slot->isFree = true;
-	ParallelSlotClearHandler(slot);
-}
-
 /*
  * Process (and delete) a query result.  Returns true if there's no problem,
- * false otherwise. It's up to the handler to decide what cosntitutes a
+ * false otherwise. It's up to the handler to decide what constitutes a
  * problem.
  */
 static bool
 processQueryResult(ParallelSlot *slot, PGresult *result)
 {
 	Assert(slot->handler != NULL);
+	Assert(slot->connection != NULL);
 
 	/* On failure, the handler should return NULL after freeing the result */
 	if (!slot->handler(result, slot->connection, slot->handler_context))
@@ -71,6 +63,9 @@ consumeQueryResult(ParallelSlot *slot)
 	bool		ok = true;
 	PGresult   *result;
 
+	Assert(slot != NULL);
+	Assert(slot->connection != NULL);
+
 	SetCancelConn(slot->connection);
 	while ((result = PQgetResult(slot->connection)) != NULL)
 	{
@@ -137,151 +132,316 @@ select_loop(int maxFd, fd_set *workerset)
 }
 
 /*
- * ParallelSlotsGetIdle
- *		Return a connection slot that is ready to execute a command.
- *
- * This returns the first slot we find that is marked isFree, if one is;
- * otherwise, we loop on select() until one socket becomes available.  When
- * this happens, we read the whole set and mark as free all sockets that
- * become available.  If an error occurs, NULL is returned.
+ * Return the offset of a suitable idle slot, or -1 if none are available.  If
+ * the given dbname is not null, only idle slots connected to the given
+ * database are considered suitable, otherwise all idle connected slots are
+ * considered suitable.
  */
-ParallelSlot *
-ParallelSlotsGetIdle(ParallelSlot *slots, int numslots)
+static int
+find_matching_idle_slot(const ParallelSlotArray *sa, const char *dbname)
 {
 	int			i;
-	int			firstFree = -1;
 
-	/*
-	 * Look for any connection currently free.  If there is one, mark it as
-	 * taken and let the caller know the slot to use.
-	 */
-	for (i = 0; i < numslots; i++)
+	for (i = 0; i < sa->numslots; i++)
 	{
-		if (slots[i].isFree)
-		{
-			slots[i].isFree = false;
-			return slots + i;
-		}
+		if (sa->slots[i].inUse)
+			continue;
+
+		if (sa->slots[i].connection == NULL)
+			continue;
+
+		if (dbname == NULL ||
+			strcmp(PQdb(sa->slots[i].connection), dbname) == 0)
+			return i;
+	}
+	return -1;
+}
+
+/*
+ * Return the offset of the first slot without a database connection, or -1 if
+ * all slots are connected.
+ */
+static int
+find_unconnected_slot(const ParallelSlotArray *sa)
+{
+	int			i;
+
+	for (i = 0; i < sa->numslots; i++)
+	{
+		if (sa->slots[i].inUse)
+			continue;
+
+		if (sa->slots[i].connection == NULL)
+			return i;
+	}
+
+	return -1;
+}
+
+/*
+ * Return the offset of the first idle slot, or -1 if all slots are busy.
+ */
+static int
+find_any_idle_slot(const ParallelSlotArray *sa)
+{
+	int			i;
+
+	for (i = 0; i < sa->numslots; i++)
+		if (!sa->slots[i].inUse)
+			return i;
+
+	return -1;
+}
+
+/*
+ * Wait for any slot's connection to have query results, consume the results,
+ * and update the slot's status as appropriate.  Returns true on success,
+ * false on cancellation, on error, or if no slots are connected.
+ */
+static bool
+wait_on_slots(ParallelSlotArray *sa)
+{
+	int			i;
+	fd_set		slotset;
+	int			maxFd = 0;
+	PGconn	   *cancelconn = NULL;
+
+	/* We must reconstruct the fd_set for each call to select_loop */
+	FD_ZERO(&slotset);
+
+	for (i = 0; i < sa->numslots; i++)
+	{
+		int			sock;
+
+		/* We shouldn't get here if we still have slots without connections */
+		Assert(sa->slots[i].connection != NULL);
+
+		sock = PQsocket(sa->slots[i].connection);
+
+		/*
+		 * We don't really expect any connections to lose their sockets after
+		 * startup, but just in case, cope by ignoring them.
+		 */
+		if (sock < 0)
+			continue;
+
+		/* Keep track of the first valid connection we see. */
+		if (cancelconn == NULL)
+			cancelconn = sa->slots[i].connection;
+
+		FD_SET(sock, &slotset);
+		if (sock > maxFd)
+			maxFd = sock;
 	}
 
 	/*
-	 * No free slot found, so wait until one of the connections has finished
-	 * its task and return the available slot.
+	 * If we get this far with no valid connections, processing cannot
+	 * continue.
 	 */
-	while (firstFree < 0)
+	if (cancelconn == NULL)
+		return false;
+
+	SetCancelConn(sa->slots->connection);
+	i = select_loop(maxFd, &slotset);
+	ResetCancelConn();
+
+	/* failure? */
+	if (i < 0)
+		return false;
+
+	for (i = 0; i < sa->numslots; i++)
 	{
-		fd_set		slotset;
-		int			maxFd = 0;
+		int			sock;
 
-		/* We must reconstruct the fd_set for each call to select_loop */
-		FD_ZERO(&slotset);
+		sock = PQsocket(sa->slots[i].connection);
 
-		for (i = 0; i < numslots; i++)
+		if (sock >= 0 && FD_ISSET(sock, &slotset))
 		{
-			int			sock = PQsocket(slots[i].connection);
-
-			/*
-			 * We don't really expect any connections to lose their sockets
-			 * after startup, but just in case, cope by ignoring them.
-			 */
-			if (sock < 0)
-				continue;
-
-			FD_SET(sock, &slotset);
-			if (sock > maxFd)
-				maxFd = sock;
+			/* select() says input is available, so consume it */
+			PQconsumeInput(sa->slots[i].connection);
 		}
 
-		SetCancelConn(slots->connection);
-		i = select_loop(maxFd, &slotset);
-		ResetCancelConn();
-
-		/* failure? */
-		if (i < 0)
-			return NULL;
-
-		for (i = 0; i < numslots; i++)
+		/* Collect result(s) as long as any are available */
+		while (!PQisBusy(sa->slots[i].connection))
 		{
-			int			sock = PQsocket(slots[i].connection);
+			PGresult   *result = PQgetResult(sa->slots[i].connection);
 
-			if (sock >= 0 && FD_ISSET(sock, &slotset))
+			if (result != NULL)
 			{
-				/* select() says input is available, so consume it */
-				PQconsumeInput(slots[i].connection);
+				/* Handle and discard the command result */
+				if (!processQueryResult(&sa->slots[i], result))
+					return false;
 			}
-
-			/* Collect result(s) as long as any are available */
-			while (!PQisBusy(slots[i].connection))
+			else
 			{
-				PGresult   *result = PQgetResult(slots[i].connection);
-
-				if (result != NULL)
-				{
-					/* Handle and discard the command result */
-					if (!processQueryResult(slots + i, result))
-						return NULL;
-				}
-				else
-				{
-					/* This connection has become idle */
-					slots[i].isFree = true;
-					ParallelSlotClearHandler(slots + i);
-					if (firstFree < 0)
-						firstFree = i;
-					break;
-				}
+				/* This connection has become idle */
+				sa->slots[i].inUse = false;
+				ParallelSlotClearHandler(&sa->slots[i]);
+				break;
 			}
 		}
 	}
+	return true;
+}
 
-	slots[firstFree].isFree = false;
-	return slots + firstFree;
+/*
+ * Open a new database connection using the stored connection parameters and
+ * optionally a given dbname if not null, execute the stored initial command if
+ * any, and associate the new connection with the given slot.
+ */
+static void
+connect_slot(ParallelSlotArray *sa, int slotno, const char *dbname)
+{
+	const char *old_override;
+	ParallelSlot *slot = &sa->slots[slotno];
+
+	old_override = sa->cparams->override_dbname;
+	if (dbname)
+		sa->cparams->override_dbname = dbname;
+	slot->connection = connectDatabase(sa->cparams, sa->progname, sa->echo, false, true);
+	sa->cparams->override_dbname = old_override;
+
+	if (PQsocket(slot->connection) >= FD_SETSIZE)
+	{
+		pg_log_fatal("too many jobs for this platform");
+		exit(1);
+	}
+
+	/* Setup the connection using the supplied command, if any. */
+	if (sa->initcmd)
+		executeCommand(slot->connection, sa->initcmd, sa->echo);
 }
 
 /*
- * ParallelSlotsSetup
- *		Prepare a set of parallel slots to use on a given database.
+ * ParallelSlotsGetIdle
+ *		Return a connection slot that is ready to execute a command.
+ *
+ * The slot returned is chosen as follows:
+ *
+ * If any idle slot already has an open connection, and if either dbname is
+ * null or the existing connection is to the given database, that slot will be
+ * returned allowing the connection to be reused.
+ *
+ * Otherwise, if any idle slot is not yet connected to any database, the slot
+ * will be returned with its connection opened using the stored cparams and
+ * optionally the given dbname if not null.
+ *
+ * Otherwise, if any idle slot exists, an idle slot will be chosen and returned
+ * after having its connection disconnected and reconnected using the stored
+ * cparams and optionally the given dbname if not null.
  *
- * This creates and initializes a set of connections to the database
- * using the information given by the caller, marking all parallel slots
- * as free and ready to use.  "conn" is an initial connection set up
- * by the caller and is associated with the first slot in the parallel
- * set.
+ * Otherwise, if any slots have connections that are busy, we loop on select()
+ * until one socket becomes available.  When this happens, we read the whole
+ * set and mark as free all sockets that become available.  We then select a
+ * slot using the same rules as above.
+ *
+ * Otherwise, we cannot return a slot, which is an error, and NULL is returned.
+ *
+ * For any connection created, if the stored initcmd is not null, it will be
+ * executed as a command on the newly formed connection before the slot is
+ * returned.
+ *
+ * If an error occurs, NULL is returned.
  */
 ParallelSlot *
-ParallelSlotsSetup(const ConnParams *cparams,
-				   const char *progname, bool echo,
-				   PGconn *conn, int numslots)
+ParallelSlotsGetIdle(ParallelSlotArray *sa, const char *dbname)
 {
-	ParallelSlot *slots;
-	int			i;
+	int			offset;
 
-	Assert(conn != NULL);
+	Assert(sa);
+	Assert(sa->numslots > 0);
 
-	slots = (ParallelSlot *) pg_malloc(sizeof(ParallelSlot) * numslots);
-	init_slot(slots, conn);
-	if (numslots > 1)
+	while (1)
 	{
-		for (i = 1; i < numslots; i++)
+		/* First choice: a slot already connected to the desired database. */
+		offset = find_matching_idle_slot(sa, dbname);
+		if (offset >= 0)
 		{
-			conn = connectDatabase(cparams, progname, echo, false, true);
-
-			/*
-			 * Fail and exit immediately if trying to use a socket in an
-			 * unsupported range.  POSIX requires open(2) to use the lowest
-			 * unused file descriptor and the hint given relies on that.
-			 */
-			if (PQsocket(conn) >= FD_SETSIZE)
-			{
-				pg_log_fatal("too many jobs for this platform -- try %d", i);
-				exit(1);
-			}
+			sa->slots[offset].inUse = true;
+			return &sa->slots[offset];
+		}
+
+		/* Second choice: a slot not connected to any database. */
+		offset = find_unconnected_slot(sa);
+		if (offset >= 0)
+		{
+			connect_slot(sa, offset, dbname);
+			sa->slots[offset].inUse = true;
+			return &sa->slots[offset];
+		}
 
-			init_slot(slots + i, conn);
+		/* Third choice: a slot connected to the wrong database. */
+		offset = find_any_idle_slot(sa);
+		if (offset >= 0)
+		{
+			disconnectDatabase(sa->slots[offset].connection);
+			sa->slots[offset].connection = NULL;
+			connect_slot(sa, offset, dbname);
+			sa->slots[offset].inUse = true;
+			return &sa->slots[offset];
 		}
+
+		/*
+		 * Fourth choice: block until one or more slots become available. If
+		 * any slot's hit a fatal error, we'll find out about that here and
+		 * return NULL.
+		 */
+		if (!wait_on_slots(sa))
+			return NULL;
 	}
+}
+
+/*
+ * ParallelSlotsSetup
+ *		Prepare a set of parallel slots but do not connect to any database.
+ *
+ * This creates and initializes a set of slots, marking all parallel slots as
+ * free and ready to use.  Establishing connections is delayed until requesting
+ * a free slot.  The cparams, progname, echo, and initcmd are stored for later
+ * use and must remain valid for the lifetime of the returned array.
+ */
+ParallelSlotArray *
+ParallelSlotsSetup(int numslots, ConnParams *cparams, const char *progname,
+				   bool echo, const char *initcmd)
+{
+	ParallelSlotArray *sa;
 
-	return slots;
+	Assert(numslots > 0);
+	Assert(cparams != NULL);
+	Assert(progname != NULL);
+
+	sa = (ParallelSlotArray *) palloc0(offsetof(ParallelSlotArray, slots) +
+									   numslots * sizeof(ParallelSlot));
+
+	sa->numslots = numslots;
+	sa->cparams = cparams;
+	sa->progname = progname;
+	sa->echo = echo;
+	sa->initcmd = initcmd;
+
+	return sa;
+}
+
+/*
+ * ParallelSlotsAdoptConn
+ *		Assign an open connection to the slots array for reuse.
+ *
+ * This turns over ownership of an open connection to a slots array.  The
+ * caller should not further use or close the connection.  All the connection's
+ * parameters (user, host, port, etc.) except possibly dbname should match
+ * those of the slots array's cparams, as given in ParallelSlotsSetup.  If
+ * these parameters differ, subsequent behavior is undefined.
+ */
+void
+ParallelSlotsAdoptConn(ParallelSlotArray *sa, PGconn *conn)
+{
+	int		offset;
+
+	offset = find_unconnected_slot(sa);
+	if (offset >= 0)
+		sa->slots[offset].connection = conn;
+	else
+		disconnectDatabase(conn);
 }
 
 /*
@@ -292,13 +452,13 @@ ParallelSlotsSetup(const ConnParams *cparams,
  * terminate all connections.
  */
 void
-ParallelSlotsTerminate(ParallelSlot *slots, int numslots)
+ParallelSlotsTerminate(ParallelSlotArray *sa)
 {
 	int			i;
 
-	for (i = 0; i < numslots; i++)
+	for (i = 0; i < sa->numslots; i++)
 	{
-		PGconn	   *conn = slots[i].connection;
+		PGconn	   *conn = sa->slots[i].connection;
 
 		if (conn == NULL)
 			continue;
@@ -314,13 +474,15 @@ ParallelSlotsTerminate(ParallelSlot *slots, int numslots)
  * error has been found on the way.
  */
 bool
-ParallelSlotsWaitCompletion(ParallelSlot *slots, int numslots)
+ParallelSlotsWaitCompletion(ParallelSlotArray *sa)
 {
 	int			i;
 
-	for (i = 0; i < numslots; i++)
+	for (i = 0; i < sa->numslots; i++)
 	{
-		if (!consumeQueryResult(slots + i))
+		if (sa->slots[i].connection == NULL)
+			continue;
+		if (!consumeQueryResult(&sa->slots[i]))
 			return false;
 	}
 
@@ -350,6 +512,9 @@ ParallelSlotsWaitCompletion(ParallelSlot *slots, int numslots)
 bool
 TableCommandResultHandler(PGresult *res, PGconn *conn, void *context)
 {
+	Assert(res != NULL);
+	Assert(conn != NULL);
+
 	/*
 	 * If it's an error, report it.  Errors about a missing table are harmless
 	 * so we continue processing; but die for other errors.
diff --git a/src/include/fe_utils/parallel_slot.h b/src/include/fe_utils/parallel_slot.h
index 8902f8d4f4..b7e2b0a29b 100644
--- a/src/include/fe_utils/parallel_slot.h
+++ b/src/include/fe_utils/parallel_slot.h
@@ -21,7 +21,7 @@ typedef bool (*ParallelSlotResultHandler) (PGresult *res, PGconn *conn,
 typedef struct ParallelSlot
 {
 	PGconn	   *connection;		/* One connection */
-	bool		isFree;			/* Is it known to be idle? */
+	bool		inUse;			/* Is the slot being used? */
 
 	/*
 	 * Prior to issuing a command or query on 'connection', a handler callback
@@ -33,6 +33,16 @@ typedef struct ParallelSlot
 	void	   *handler_context;
 } ParallelSlot;
 
+typedef struct ParallelSlotArray
+{
+	int			numslots;
+	ConnParams *cparams;
+	const char *progname;
+	bool		echo;
+	const char *initcmd;
+	ParallelSlot slots[FLEXIBLE_ARRAY_MEMBER];
+} ParallelSlotArray;
+
 static inline void
 ParallelSlotSetHandler(ParallelSlot *slot, ParallelSlotResultHandler handler,
 					   void *context)
@@ -48,15 +58,18 @@ ParallelSlotClearHandler(ParallelSlot *slot)
 	slot->handler_context = NULL;
 }
 
-extern ParallelSlot *ParallelSlotsGetIdle(ParallelSlot *slots, int numslots);
+extern ParallelSlot *ParallelSlotsGetIdle(ParallelSlotArray *slots,
+										  const char *dbname);
+
+extern ParallelSlotArray *ParallelSlotsSetup(int numslots, ConnParams *cparams,
+											 const char *progname, bool echo,
+											 const char *initcmd);
 
-extern ParallelSlot *ParallelSlotsSetup(const ConnParams *cparams,
-										const char *progname, bool echo,
-										PGconn *conn, int numslots);
+extern void ParallelSlotsAdoptConn(ParallelSlotArray *sa, PGconn *conn);
 
-extern void ParallelSlotsTerminate(ParallelSlot *slots, int numslots);
+extern void ParallelSlotsTerminate(ParallelSlotArray *sa);
 
-extern bool ParallelSlotsWaitCompletion(ParallelSlot *slots, int numslots);
+extern bool ParallelSlotsWaitCompletion(ParallelSlotArray *sa);
 
 extern bool TableCommandResultHandler(PGresult *res, PGconn *conn,
 									  void *context);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 8bd95aefa1..b1dec43f9d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -403,6 +403,7 @@ ConfigData
 ConfigVariable
 ConnCacheEntry
 ConnCacheKey
+ConnParams
 ConnStatusType
 ConnType
 ConnectionStateEnum
@@ -1729,6 +1730,7 @@ ParallelHashJoinState
 ParallelIndexScanDesc
 ParallelReadyList
 ParallelSlot
+ParallelSlotArray
 ParallelState
 ParallelTableScanDesc
 ParallelTableScanDescData
-- 
2.21.1 (Apple Git-122.3)

Attachment: v42-0002-Adding-contrib-module-pg_amcheck.patch (application/octet-stream)
From b59f43cc27ae69c73d8c1209346166b9c13ef68e Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Tue, 2 Mar 2021 08:34:40 -0800
Subject: [PATCH v42 2/3] Adding contrib module pg_amcheck

Adding new contrib module pg_amcheck, which is a command line
interface for running amcheck's verifications against tables and
indexes.
---
 contrib/Makefile                           |    1 +
 contrib/pg_amcheck/.gitignore              |    3 +
 contrib/pg_amcheck/Makefile                |   29 +
 contrib/pg_amcheck/pg_amcheck.c            | 1939 ++++++++++++++++++++
 contrib/pg_amcheck/t/001_basic.pl          |    9 +
 contrib/pg_amcheck/t/002_nonesuch.pl       |  213 +++
 contrib/pg_amcheck/t/003_check.pl          |  497 +++++
 contrib/pg_amcheck/t/004_verify_heapam.pl  |  487 +++++
 contrib/pg_amcheck/t/005_opclass_damage.pl |   54 +
 doc/src/sgml/contrib.sgml                  |    1 +
 doc/src/sgml/filelist.sgml                 |    1 +
 doc/src/sgml/pgamcheck.sgml                |  668 +++++++
 src/tools/msvc/Install.pm                  |    2 +-
 src/tools/msvc/Mkvcbuild.pm                |    6 +-
 src/tools/pgindent/typedefs.list           |    3 +
 15 files changed, 3909 insertions(+), 4 deletions(-)
 create mode 100644 contrib/pg_amcheck/.gitignore
 create mode 100644 contrib/pg_amcheck/Makefile
 create mode 100644 contrib/pg_amcheck/pg_amcheck.c
 create mode 100644 contrib/pg_amcheck/t/001_basic.pl
 create mode 100644 contrib/pg_amcheck/t/002_nonesuch.pl
 create mode 100644 contrib/pg_amcheck/t/003_check.pl
 create mode 100644 contrib/pg_amcheck/t/004_verify_heapam.pl
 create mode 100644 contrib/pg_amcheck/t/005_opclass_damage.pl
 create mode 100644 doc/src/sgml/pgamcheck.sgml

diff --git a/contrib/Makefile b/contrib/Makefile
index f27e458482..a72dcf7304 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -30,6 +30,7 @@ SUBDIRS = \
 		old_snapshot	\
 		pageinspect	\
 		passwordcheck	\
+		pg_amcheck	\
 		pg_buffercache	\
 		pg_freespacemap \
 		pg_prewarm	\
diff --git a/contrib/pg_amcheck/.gitignore b/contrib/pg_amcheck/.gitignore
new file mode 100644
index 0000000000..c21a14de31
--- /dev/null
+++ b/contrib/pg_amcheck/.gitignore
@@ -0,0 +1,3 @@
+pg_amcheck
+
+/tmp_check/
diff --git a/contrib/pg_amcheck/Makefile b/contrib/pg_amcheck/Makefile
new file mode 100644
index 0000000000..bc61ee7970
--- /dev/null
+++ b/contrib/pg_amcheck/Makefile
@@ -0,0 +1,29 @@
+# contrib/pg_amcheck/Makefile
+
+PGFILEDESC = "pg_amcheck - detects corruption within database relations"
+PGAPPICON = win32
+
+PROGRAM = pg_amcheck
+OBJS = \
+	$(WIN32RES) \
+	pg_amcheck.o
+
+REGRESS_OPTS += --load-extension=amcheck --load-extension=pageinspect
+EXTRA_INSTALL += contrib/amcheck contrib/pageinspect
+
+TAP_TESTS = 1
+
+PG_CPPFLAGS = -I$(libpq_srcdir)
+PG_LIBS_INTERNAL = -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+SHLIB_PREREQS = submake-libpq
+subdir = contrib/pg_amcheck
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_amcheck/pg_amcheck.c b/contrib/pg_amcheck/pg_amcheck.c
new file mode 100644
index 0000000000..63982fd66b
--- /dev/null
+++ b/contrib/pg_amcheck/pg_amcheck.c
@@ -0,0 +1,1939 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_amcheck.c
+ *		Detects corruption within database relations.
+ *
+ * Copyright (c) 2017-2021, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  contrib/pg_amcheck/pg_amcheck.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <time.h>
+
+#include "catalog/pg_am_d.h"
+#include "catalog/pg_namespace_d.h"
+#include "common/logging.h"
+#include "common/username.h"
+#include "fe_utils/cancel.h"
+#include "fe_utils/option_utils.h"
+#include "fe_utils/parallel_slot.h"
+#include "fe_utils/query_utils.h"
+#include "fe_utils/simple_list.h"
+#include "fe_utils/string_utils.h"
+#include "getopt_long.h"		/* pgrminclude ignore */
+#include "pgtime.h"
+#include "storage/block.h"
+
+/* pg_amcheck command line options controlled by user flags */
+typedef struct AmcheckOptions
+{
+	bool		alldb;
+	bool		echo;
+	bool		quiet;
+	bool		verbose;
+	bool		strict_names;
+	bool		show_progress;
+	int			jobs;
+
+	/*
+	 * Relations to check or not to check (both heap and btree together), as
+	 * lists of PatternInfo structs.
+	 */
+	SimplePtrList include;
+	SimplePtrList exclude;
+
+	/*
+	 * As an optimization, if any pattern in the exclude list applies to heap
+	 * tables, or similarly if any such pattern applies to btree indexes, then
+	 * these will be true, otherwise false.  These should always agree with
+	 * what you'd conclude by grep'ing through the exclude list.
+	 */
+	bool		excludetbl;
+	bool		excludeidx;
+
+	/*
+	 * If any schema or relation inclusion pattern exists, then we should
+	 * only be checking matching relations rather than all relations, so
+	 * this is true iff no such pattern has been given.
+	 */
+	bool		allrel;
+
+	/* heap table checking options */
+	bool		no_toast_expansion;
+	bool		reconcile_toast;
+	bool		on_error_stop;
+	long		startblock;
+	long		endblock;
+	const char *skip;
+
+	/* btree index checking options */
+	bool		parent_check;
+	bool		rootdescend;
+	bool		heapallindexed;
+
+	/* heap and btree hybrid option */
+	bool		no_index_expansion;
+} AmcheckOptions;
+
+static AmcheckOptions opts = {
+	.alldb = false,
+	.echo = false,
+	.quiet = false,
+	.verbose = false,
+	.strict_names = true,
+	.show_progress = false,
+	.jobs = 1,
+	.include = {NULL, NULL},
+	.exclude = {NULL, NULL},
+	.excludetbl = false,
+	.excludeidx = false,
+	.allrel = true,
+	.no_toast_expansion = false,
+	.reconcile_toast = true,
+	.on_error_stop = false,
+	.startblock = -1,
+	.endblock = -1,
+	.skip = "none",
+	.parent_check = false,
+	.rootdescend = false,
+	.heapallindexed = false,
+	.no_index_expansion = false
+};
+
+static const char *progname = NULL;
+
+typedef struct PatternInfo
+{
+	int			pattern_id;		/* Unique ID of this pattern */
+	const char *pattern;		/* Unaltered pattern from the command line */
+	char	   *db_regex;		/* Database regexp parsed from pattern, or
+								 * NULL */
+	char	   *nsp_regex;		/* Schema regexp parsed from pattern, or NULL */
+	char	   *rel_regex;		/* Relation regexp parsed from pattern, or
+								 * NULL */
+	bool		table_only;		/* true if rel_regex should only match tables */
+	bool		index_only;		/* true if rel_regex should only match indexes */
+	bool		matched;		/* true if the pattern matched in any database */
+}			PatternInfo;
+
+/* Unique pattern id counter */
+static int	next_id = 1;
+
+/* Whether all relations have so far passed their corruption checks */
+static bool all_checks_pass = true;
+
+/* Time last progress report was displayed */
+static pg_time_t last_progress_report = 0;
+
+typedef struct DatabaseInfo
+{
+	char	   *datname;
+	char	   *amcheck_schema; /* escaped, quoted literal */
+} DatabaseInfo;
+
+typedef struct RelationInfo
+{
+	const DatabaseInfo *datinfo;	/* shared by other relinfos */
+	Oid			reloid;
+	bool		is_table;		/* true if heap, false if btree */
+} RelationInfo;
+
+/*
+ * Query for determining if contrib's amcheck is installed.  If so, selects the
+ * namespace name where amcheck's functions can be found.
+ */
+static const char *amcheck_sql =
+"SELECT n.nspname, x.extversion"
+"\nFROM pg_catalog.pg_extension x"
+"\nJOIN pg_catalog.pg_namespace n"
+"\nON x.extnamespace = n.oid"
+"\nWHERE x.extname = 'amcheck'";
+
+static void prepare_table_command(PQExpBuffer sql, Oid reloid,
+								  const char *nspname);
+static void prepare_btree_command(PQExpBuffer sql, Oid reloid,
+								  const char *nspname);
+static void run_command(ParallelSlot *slot, const char *sql,
+						ConnParams *cparams);
+static bool verify_heapam_slot_handler(PGresult *res, PGconn *conn,
+									   void *context);
+static bool verify_btree_slot_handler(PGresult *res, PGconn *conn, void *context);
+static void help(const char *progname);
+static void progress_report(uint64 relations_total, uint64 relations_checked,
+							const char *datname, bool force, bool finished);
+
+static void append_database_pattern(SimplePtrList *list, const char *pattern,
+									int encoding);
+static void append_schema_pattern(SimplePtrList *list, const char *pattern,
+								  int encoding);
+static void append_relation_pattern(SimplePtrList *list, const char *pattern,
+									int encoding);
+static void append_table_pattern(SimplePtrList *list, const char *pattern,
+								 int encoding);
+static void append_index_pattern(SimplePtrList *list, const char *pattern,
+								 int encoding);
+static void compile_database_list(PGconn *conn, SimplePtrList *databases);
+static void compile_relation_list_one_db(PGconn *conn, SimplePtrList *relations,
+										 const DatabaseInfo *datinfo);
+
+int
+main(int argc, char *argv[])
+{
+	PGconn	   *conn;
+	SimplePtrListCell *cell;
+	SimplePtrList databases = {NULL, NULL};
+	SimplePtrList relations = {NULL, NULL};
+	bool		failed = false;
+	const char *latest_datname;
+	int			parallel_workers;
+	ParallelSlotArray *sa;
+	PQExpBufferData sql;
+	long long int reltotal;
+	long long int relprogress;
+
+	static struct option long_options[] = {
+		/* Connection options */
+		{"host", required_argument, NULL, 'h'},
+		{"port", required_argument, NULL, 'p'},
+		{"username", required_argument, NULL, 'U'},
+		{"no-password", no_argument, NULL, 'w'},
+		{"password", no_argument, NULL, 'W'},
+		{"maintenance-db", required_argument, NULL, 1},
+
+		/* check options */
+		{"all", no_argument, NULL, 'a'},
+		{"dbname", required_argument, NULL, 'd'},
+		{"exclude-dbname", required_argument, NULL, 'D'},
+		{"echo", no_argument, NULL, 'e'},
+		{"index", required_argument, NULL, 'i'},
+		{"exclude-index", required_argument, NULL, 'I'},
+		{"jobs", required_argument, NULL, 'j'},
+		{"quiet", no_argument, NULL, 'q'},
+		{"relation", required_argument, NULL, 'r'},
+		{"exclude-relation", required_argument, NULL, 'R'},
+		{"schema", required_argument, NULL, 's'},
+		{"exclude-schema", required_argument, NULL, 'S'},
+		{"table", required_argument, NULL, 't'},
+		{"exclude-table", required_argument, NULL, 'T'},
+		{"verbose", no_argument, NULL, 'v'},
+		{"no-index-expansion", no_argument, NULL, 2},
+		{"no-toast-expansion", no_argument, NULL, 3},
+		{"exclude-toast-pointers", no_argument, NULL, 4},
+		{"on-error-stop", no_argument, NULL, 5},
+		{"skip", required_argument, NULL, 6},
+		{"startblock", required_argument, NULL, 7},
+		{"endblock", required_argument, NULL, 8},
+		{"rootdescend", no_argument, NULL, 9},
+		{"no-strict-names", no_argument, NULL, 10},
+		{"progress", no_argument, NULL, 11},
+		{"heapallindexed", no_argument, NULL, 12},
+		{"parent-check", no_argument, NULL, 13},
+
+		{NULL, 0, NULL, 0}
+	};
+
+	int			optindex;
+	int			c;
+
+	/*
+	 * If a maintenance database is specified, that will be used for the
+	 * initial connection.  Failing that, the first plain argument (without a
+	 * flag) will be used.  If neither of those are given, the first database
+	 * specified with -d.
+	 */
+	const char *primary_db = NULL;
+	const char *secondary_db = NULL;
+	const char *tertiary_db = NULL;
+
+	const char *host = NULL;
+	const char *port = NULL;
+	const char *username = NULL;
+	enum trivalue prompt_password = TRI_DEFAULT;
+	int			encoding = pg_get_encoding_from_locale(NULL, false);
+	ConnParams	cparams;
+
+	pg_logging_init(argv[0]);
+	progname = get_progname(argv[0]);
+	set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("contrib"));
+
+	handle_help_version_opts(argc, argv, progname, help);
+
+	/* process command-line options */
+	while ((c = getopt_long(argc, argv, "ad:D:eh:Hi:I:j:p:Pqr:R:s:S:t:T:U:wWv",
+							long_options, &optindex)) != -1)
+	{
+		char	   *endptr;
+
+		switch (c)
+		{
+			case 'a':
+				opts.alldb = true;
+				break;
+			case 'd':
+				if (tertiary_db == NULL)
+					tertiary_db = optarg;
+				append_database_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'D':
+				append_database_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 'e':
+				opts.echo = true;
+				break;
+			case 'h':
+				host = pg_strdup(optarg);
+				break;
+			case 'i':
+				opts.allrel = false;
+				append_index_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'I':
+				opts.excludeidx = true;
+				append_index_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 'j':
+				opts.jobs = atoi(optarg);
+				if (opts.jobs < 1)
+				{
+					pg_log_error("number of parallel jobs must be at least 1");
+					exit(1);
+				}
+				break;
+			case 'p':
+				port = pg_strdup(optarg);
+				break;
+			case 'q':
+				opts.quiet = true;
+				break;
+			case 'r':
+				opts.allrel = false;
+				append_relation_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'R':
+				opts.excludeidx = true;
+				opts.excludetbl = true;
+				append_relation_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 's':
+				opts.allrel = false;
+				append_schema_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'S':
+				append_schema_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 't':
+				opts.allrel = false;
+				append_table_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'T':
+				opts.excludetbl = true;
+				append_table_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 'U':
+				username = pg_strdup(optarg);
+				break;
+			case 'w':
+				prompt_password = TRI_NO;
+				break;
+			case 'W':
+				prompt_password = TRI_YES;
+				break;
+			case 'v':
+				opts.verbose = true;
+				pg_logging_increase_verbosity();
+				break;
+			case 1:
+				primary_db = pg_strdup(optarg);
+				break;
+			case 2:
+				opts.no_index_expansion = true;
+				break;
+			case 3:
+				opts.no_toast_expansion = true;
+				break;
+			case 4:
+				opts.reconcile_toast = false;
+				break;
+			case 5:
+				opts.on_error_stop = true;
+				break;
+			case 6:
+				if (pg_strcasecmp(optarg, "all-visible") == 0)
+					opts.skip = "all visible";
+				else if (pg_strcasecmp(optarg, "all-frozen") == 0)
+					opts.skip = "all frozen";
+				else
+				{
+					fprintf(stderr, "invalid skip option\n");
+					exit(1);
+				}
+				break;
+			case 7:
+				opts.startblock = strtol(optarg, &endptr, 10);
+				if (*endptr != '\0')
+				{
+					fprintf(stderr,
+							"relation starting block argument contains garbage characters\n");
+					exit(1);
+				}
+				if (opts.startblock > (long) MaxBlockNumber)
+				{
+					fprintf(stderr,
+							"relation starting block argument out of bounds\n");
+					exit(1);
+				}
+				break;
+			case 8:
+				opts.endblock = strtol(optarg, &endptr, 10);
+				if (*endptr != '\0')
+				{
+					fprintf(stderr,
+							"relation ending block argument contains garbage characters\n");
+					exit(1);
+				}
+				if (opts.endblock > (long) MaxBlockNumber)
+				{
+					fprintf(stderr,
+							"relation ending block argument out of bounds\n");
+					exit(1);
+				}
+				break;
+			case 9:
+				opts.rootdescend = true;
+				opts.parent_check = true;
+				break;
+			case 10:
+				opts.strict_names = false;
+				break;
+			case 11:
+				opts.show_progress = true;
+				break;
+			case 12:
+				opts.heapallindexed = true;
+				break;
+			case 13:
+				opts.parent_check = true;
+				break;
+			default:
+				fprintf(stderr,
+						"Try \"%s --help\" for more information.\n",
+						progname);
+				exit(1);
+		}
+	}
+
+	if (opts.endblock >= 0 && opts.endblock < opts.startblock)
+	{
+		pg_log_error("relation ending block argument precedes starting block argument");
+		exit(1);
+	}
+
+	/* non-option arguments specify database names */
+	while (optind < argc)
+	{
+		if (secondary_db == NULL)
+			secondary_db = argv[optind];
+		append_database_pattern(&opts.include, argv[optind], encoding);
+		optind++;
+	}
+
+	/* fill cparams except for dbname, which is set below */
+	cparams.pghost = host;
+	cparams.pgport = port;
+	cparams.pguser = username;
+	cparams.prompt_password = prompt_password;
+	cparams.override_dbname = NULL;
+
+	setup_cancel_handler(NULL);
+
+	/* choose the database for our initial connection */
+	if (primary_db)
+		cparams.dbname = primary_db;
+	else if (secondary_db != NULL)
+		cparams.dbname = secondary_db;
+	else if (tertiary_db != NULL)
+		cparams.dbname = tertiary_db;
+	else
+	{
+		const char *default_db;
+
+		if (getenv("PGDATABASE"))
+			default_db = getenv("PGDATABASE");
+		else if (getenv("PGUSER"))
+			default_db = getenv("PGUSER");
+		else
+			default_db = get_user_name_or_exit(progname);
+
+		/*
+		 * Users expect the database name inferred from the environment to get
+		 * checked, not just get used for the initial connection.
+		 */
+		append_database_pattern(&opts.include, default_db, encoding);
+
+		cparams.dbname = default_db;
+	}
+
+	conn = connectMaintenanceDatabase(&cparams, progname, opts.echo);
+	compile_database_list(conn, &databases);
+	disconnectDatabase(conn);
+
+	if (databases.head == NULL)
+	{
+		fprintf(stderr, "%s: no databases to check\n", progname);
+		exit(0);
+	}
+
+	/*
+	 * Compile a list of all relations spanning all databases to be checked.
+	 */
+	for (cell = databases.head; cell; cell = cell->next)
+	{
+		PGresult   *result;
+		int			ntups;
+		const char *amcheck_schema = NULL;
+		DatabaseInfo *dat = (DatabaseInfo *) cell->ptr;
+
+		cparams.override_dbname = dat->datname;
+		conn = connectDatabase(&cparams, progname, opts.echo, false, true);
+
+		/*
+		 * Verify that amcheck is installed in this database.  User error
+		 * could result in a database that should have amcheck not having
+		 * it, but we also could be iterating over multiple databases
+		 * where not all of them have amcheck installed (for example,
+		 * 'template1').
+		 */
+		result = executeQuery(conn, amcheck_sql, opts.echo);
+		if (PQresultStatus(result) != PGRES_TUPLES_OK)
+		{
+			/* Querying the catalog failed. */
+			pg_log_error("database \"%s\": %s",
+						 PQdb(conn), PQerrorMessage(conn));
+			pg_log_error("query was: %s", amcheck_sql);
+			PQclear(result);
+			disconnectDatabase(conn);
+			exit(1);
+		}
+		ntups = PQntuples(result);
+		if (ntups == 0)
+		{
+			/* Querying the catalog succeeded, but amcheck is missing. */
+			fprintf(stderr,
+					"%s: skipping database \"%s\": amcheck is not installed\n",
+					progname, PQdb(conn));
+			disconnectDatabase(conn);
+			continue;
+		}
+		amcheck_schema = PQgetvalue(result, 0, 0);
+		if (opts.verbose)
+			fprintf(stderr,
+					"%s: in database \"%s\": using amcheck version \"%s\" in schema \"%s\"\n",
+					progname, PQdb(conn), PQgetvalue(result, 0, 1),
+					amcheck_schema);
+		dat->amcheck_schema = PQescapeIdentifier(conn, amcheck_schema,
+												 strlen(amcheck_schema));
+		PQclear(result);
+
+		compile_relation_list_one_db(conn, &relations, dat);
+		disconnectDatabase(conn);
+	}
+
+	/*
+	 * Check that all inclusion patterns matched at least one schema or
+	 * relation that we can check.
+	 */
+	for (cell = opts.include.head; cell; cell = cell->next)
+	{
+		PatternInfo *pat = (PatternInfo *) cell->ptr;
+
+		if (!pat->matched && (pat->nsp_regex != NULL || pat->rel_regex != NULL))
+		{
+			failed = opts.strict_names;
+
+			if (!opts.quiet)
+			{
+				if (pat->table_only)
+					fprintf(stderr, "%s: no tables to check for \"%s\"\n",
+							progname, pat->pattern);
+				else if (pat->index_only)
+					fprintf(stderr, "%s: no btree indexes to check for \"%s\"\n",
+							progname, pat->pattern);
+				else if (pat->rel_regex == NULL)
+					fprintf(stderr, "%s: no relations to check in schemas for \"%s\"\n",
+							progname, pat->pattern);
+				else
+					fprintf(stderr, "%s: no relations to check for \"%s\"\n",
+							progname, pat->pattern);
+			}
+		}
+	}
+
+	if (failed)
+		exit(1);
+
+	/*
+	 * Set parallel_workers to the lesser of opts.jobs and the number of
+	 * relations.
+	 */
+	reltotal = 0;
+	parallel_workers = 0;
+	for (cell = relations.head; cell; cell = cell->next)
+	{
+		reltotal++;
+		if (parallel_workers < opts.jobs)
+			parallel_workers++;
+	}
+
+	if (reltotal == 0)
+	{
+		fprintf(stderr, "%s: no relations to check\n", progname);
+		exit(1);
+	}
+	progress_report(reltotal, 0, NULL, true, false);
+
+	/*
+	 * Main event loop.
+	 *
+	 * We use server-side parallelism to check up to parallel_workers
+	 * relations in parallel.  The list of relations was computed in database
+	 * order, which minimizes the number of connects and disconnects as we
+	 * process the list.
+	 */
+	latest_datname = NULL;
+	sa = ParallelSlotsSetup(parallel_workers, &cparams, progname, opts.echo,
+							NULL);
+
+	initPQExpBuffer(&sql);
+	for (relprogress = 0, cell = relations.head; cell; cell = cell->next)
+	{
+		ParallelSlot *free_slot;
+		RelationInfo *rel;
+
+		rel = (RelationInfo *) cell->ptr;
+
+		if (CancelRequested)
+		{
+			failed = true;
+			break;
+		}
+
+		/*
+		 * The list of relations is in database sorted order.  If this next
+		 * relation is in a different database than the last one seen, we are
+		 * about to start checking this database.  Note that other slots may
+		 * still be working on relations from prior databases.
+		 */
+		latest_datname = rel->datinfo->datname;
+
+		progress_report(reltotal, relprogress, latest_datname, false, false);
+
+		relprogress++;
+
+		/*
+		 * Get a parallel slot for the next amcheck command, blocking if
+		 * necessary until one is available, or until a previously issued slot
+		 * command fails, indicating that we should abort checking the
+		 * remaining objects.
+		 */
+		free_slot = ParallelSlotsGetIdle(sa, rel->datinfo->datname);
+		if (!free_slot)
+		{
+			/*
+			 * Something failed.  We don't need to know what it was, because
+			 * the handler should already have emitted the necessary error
+			 * messages.
+			 */
+			failed = true;
+			break;
+		}
+
+		/*
+		 * Execute the appropriate amcheck command for this relation using our
+		 * slot's database connection.  We do not wait for the command to
+		 * complete, nor do we perform any error checking, as that is done by
+		 * the parallel slots and our handler callback functions.
+		 */
+		if (rel->is_table)
+		{
+			prepare_table_command(&sql, rel->reloid,
+								  rel->datinfo->amcheck_schema);
+			ParallelSlotSetHandler(free_slot, verify_heapam_slot_handler,
+								   sql.data);
+			run_command(free_slot, sql.data, &cparams);
+		}
+		else
+		{
+			prepare_btree_command(&sql, rel->reloid,
+								  rel->datinfo->amcheck_schema);
+			ParallelSlotSetHandler(free_slot, verify_btree_slot_handler, NULL);
+			run_command(free_slot, sql.data, &cparams);
+		}
+	}
+	termPQExpBuffer(&sql);
+
+	if (!failed)
+	{
+
+		/*
+		 * Wait for all slots to complete, or for one to indicate that an error
+		 * occurred.  Like above, we rely on the handler emitting the necessary
+		 * error messages.
+		 */
+		if (sa && !ParallelSlotsWaitCompletion(sa))
+			failed = true;
+
+		progress_report(reltotal, relprogress, NULL, true, true);
+	}
+
+	if (sa)
+	{
+		ParallelSlotsTerminate(sa);
+		pg_free(sa);
+	}
+
+	if (failed)
+		exit(1);
+
+	if (!all_checks_pass)
+		exit(2);
+}
+
+/*
+ * prepare_table_command
+ *
+ * Creates a SQL command for running amcheck checking on the given heap
+ * relation.  The command is phrased as a SQL query, with column order and
+ * names matching the expectations of verify_heapam_slot_handler, which will
+ * receive and handle each row returned from the verify_heapam() function.
+ *
+ * sql: buffer into which the table checking command will be written
+ * reloid: oid of the table to be checked
+ * amcheck_schema: escaped and quoted name of schema in which amcheck contrib
+ * module is installed
+ */
+static void
+prepare_table_command(PQExpBuffer sql, Oid reloid, const char *amcheck_schema)
+{
+	resetPQExpBuffer(sql);
+	appendPQExpBuffer(sql,
+					  "SELECT n.nspname, c.relname, v.blkno, v.offnum, "
+					  "v.attnum, v.msg"
+					  "\nFROM %s.verify_heapam("
+					  "\nrelation := %u,"
+					  "\non_error_stop := %s,"
+					  "\ncheck_toast := %s,"
+					  "\nskip := '%s'",
+					  amcheck_schema,
+					  reloid,
+					  opts.on_error_stop ? "true" : "false",
+					  opts.reconcile_toast ? "true" : "false",
+					  opts.skip);
+	if (opts.startblock >= 0)
+		appendPQExpBuffer(sql, ",\nstartblock := %ld", opts.startblock);
+	if (opts.endblock >= 0)
+		appendPQExpBuffer(sql, ",\nendblock := %ld", opts.endblock);
+	appendPQExpBuffer(sql, "\n) v,"
+					  "\npg_catalog.pg_class c"
+					  "\nJOIN pg_catalog.pg_namespace n"
+					  "\nON c.relnamespace = n.oid"
+					  "\nWHERE c.oid = %u",
+					  reloid);
+}
+
+/*
+ * prepare_btree_command
+ *
+ * Creates a SQL command for running amcheck checking on the given btree index
+ * relation.  The command does not select any columns, as btree checking
+ * functions do not return any, but rather return corruption information by
+ * raising errors, which verify_btree_slot_handler expects.
+ *
+ * sql: buffer into which the index checking command will be written
+ * reloid: oid of the btree index to be checked
+ * amcheck_schema: escaped and quoted name of schema in which amcheck contrib
+ * module is installed
+ */
+static void
+prepare_btree_command(PQExpBuffer sql, Oid reloid, const char *amcheck_schema)
+{
+	resetPQExpBuffer(sql);
+	if (opts.parent_check)
+		appendPQExpBuffer(sql,
+						  "SELECT %s.bt_index_parent_check("
+						  "\nindex := '%u'::regclass,"
+						  "\nheapallindexed := %s,"
+						  "\nrootdescend := %s)",
+						  amcheck_schema,
+						  reloid,
+						  (opts.heapallindexed ? "true" : "false"),
+						  (opts.rootdescend ? "true" : "false"));
+	else
+		appendPQExpBuffer(sql,
+						  "SELECT %s.bt_index_check("
+						  "\nindex := '%u'::regclass,"
+						  "\nheapallindexed := %s)",
+						  amcheck_schema,
+						  reloid,
+						  (opts.heapallindexed ? "true" : "false"));
+}
+
+/*
+ * run_command
+ *
+ * Sends a command to the server without waiting for the command to complete.
+ * Logs an error if the command cannot be sent, but otherwise any errors are
+ * expected to be handled by a ParallelSlotHandler.
+ *
+ * If reconnecting to the database is necessary, the cparams argument may be
+ * modified.
+ *
+ * slot: slot with connection to the server we should use for the command
+ * sql: query to send
+ * cparams: connection parameters in case the slot needs to be reconnected
+ */
+static void
+run_command(ParallelSlot *slot, const char *sql, ConnParams *cparams)
+{
+	if (opts.echo)
+		printf("%s\n", sql);
+
+	if (PQsendQuery(slot->connection, sql) == 0)
+	{
+		pg_log_error("error sending command to database \"%s\": %s",
+					 PQdb(slot->connection),
+					 PQerrorMessage(slot->connection));
+		pg_log_error("command was: %s", sql);
+		exit(1);
+	}
+}
+
+/*
+ * should_processing_continue
+ *
+ * Checks a query result returned from a query (presumably issued on a slot's
+ * connection) to determine if parallel slots should continue issuing further
+ * commands.
+ *
+ * Note: Heap relation corruption is reported by verify_heapam() via the result
+ * set, rather than an ERROR, but running verify_heapam() on a corrupted table
+ * may still result in an error being returned from the server due to missing
+ * relation files, bad checksums, etc.  The btree corruption checking functions
+ * always use errors to communicate corruption messages.  We can't just abort
+ * processing because we got a mere ERROR.
+ *
+ * res: result from an executed sql query
+ */
+static bool
+should_processing_continue(PGresult *res)
+{
+	const char *severity;
+
+	switch (PQresultStatus(res))
+	{
+			/* These are expected and ok */
+		case PGRES_COMMAND_OK:
+		case PGRES_TUPLES_OK:
+		case PGRES_NONFATAL_ERROR:
+			break;
+
+			/* This is expected but requires closer scrutiny */
+		case PGRES_FATAL_ERROR:
+			severity = PQresultErrorField(res, PG_DIAG_SEVERITY_NONLOCALIZED);
+			if (severity != NULL && strcmp(severity, "FATAL") == 0)
+				return false;
+			if (severity != NULL && strcmp(severity, "PANIC") == 0)
+				return false;
+			break;
+
+			/* These are unexpected */
+		case PGRES_BAD_RESPONSE:
+		case PGRES_EMPTY_QUERY:
+		case PGRES_COPY_OUT:
+		case PGRES_COPY_IN:
+		case PGRES_COPY_BOTH:
+		case PGRES_SINGLE_TUPLE:
+			return false;
+	}
+	return true;
+}
+
+/*
+ * verify_heapam_slot_handler
+ *
+ * ParallelSlotHandler that receives results from a table checking command
+ * created by prepare_table_command and outputs the results for the user.
+ *
+ * res: result from an executed sql query
+ * conn: connection on which the sql query was executed
+ * context: the sql query being handled, as a cstring
+ */
+static bool
+verify_heapam_slot_handler(PGresult *res, PGconn *conn, void *context)
+{
+	if (PQresultStatus(res) == PGRES_TUPLES_OK)
+	{
+		int			i;
+		int			ntups = PQntuples(res);
+
+		if (ntups > 0)
+			all_checks_pass = false;
+
+		for (i = 0; i < ntups; i++)
+		{
+			if (!PQgetisnull(res, i, 4))
+				printf("relation %s.%s.%s, block %s, offset %s, attribute %s\n    %s\n",
+					   PQdb(conn),
+					   PQgetvalue(res, i, 0),	/* schema */
+					   PQgetvalue(res, i, 1),	/* relname */
+					   PQgetvalue(res, i, 2),	/* blkno */
+					   PQgetvalue(res, i, 3),	/* offnum */
+					   PQgetvalue(res, i, 4),	/* attnum */
+					   PQgetvalue(res, i, 5));	/* msg */
+
+			else if (!PQgetisnull(res, i, 3))
+				printf("relation %s.%s.%s, block %s, offset %s\n    %s\n",
+					   PQdb(conn),
+					   PQgetvalue(res, i, 0),	/* schema */
+					   PQgetvalue(res, i, 1),	/* relname */
+					   PQgetvalue(res, i, 2),	/* blkno */
+					   PQgetvalue(res, i, 3),	/* offnum */
+				/* attnum is null: 4 */
+					   PQgetvalue(res, i, 5));	/* msg */
+
+			else if (!PQgetisnull(res, i, 2))
+				printf("relation %s.%s.%s, block %s\n    %s\n",
+					   PQdb(conn),
+					   PQgetvalue(res, i, 0),	/* schema */
+					   PQgetvalue(res, i, 1),	/* relname */
+					   PQgetvalue(res, i, 2),	/* blkno */
+				/* offnum is null: 3 */
+				/* attnum is null: 4 */
+					   PQgetvalue(res, i, 5));	/* msg */
+
+			else if (!PQgetisnull(res, i, 1))
+				printf("relation %s.%s.%s\n    %s\n",
+					   PQdb(conn),
+					   PQgetvalue(res, i, 0),	/* schema */
+					   PQgetvalue(res, i, 1),	/* relname */
+				/* blkno is null:  2 */
+				/* offnum is null: 3 */
+				/* attnum is null: 4 */
+					   PQgetvalue(res, i, 5));	/* msg */
+
+			else
+				printf("%s\n    %s\n",
+					   PQdb(conn),
+					   PQgetvalue(res, i, 5));	/* msg */
+		}
+	}
+	else
+	{
+		all_checks_pass = false;
+		printf("%s: %s\n", PQdb(conn), PQerrorMessage(conn));
+		printf("%s: query was: %s\n", PQdb(conn), (const char *) context);
+	}
+
+	return should_processing_continue(res);
+}
+
+/*
+ * verify_btree_slot_handler
+ *
+ * ParallelSlotHandler that receives results from a btree checking command
+ * created by prepare_btree_command and outputs them for the user.  The result
+ * set from the btree checking command is expected to be empty; when the
+ * command instead fails, the useful information about the corruption is
+ * expected in the connection's error message.
+ *
+ * res: result from an executed sql query
+ * conn: connection on which the sql query was executed
+ * context: unused
+ */
+static bool
+verify_btree_slot_handler(PGresult *res, PGconn *conn, void *context)
+{
+	if (PQresultStatus(res) == PGRES_TUPLES_OK)
+	{
+		int			ntups = PQntuples(res);
+
+		if (ntups != 1)
+		{
+			/*
+			 * We expect the btree checking functions to return one void row
+			 * each, so we should output some sort of warning if we get
+			 * anything else, not because it indicates corruption, but because
+			 * it suggests a mismatch between amcheck and pg_amcheck versions.
+			 *
+			 * In conjunction with --progress, anything written to stderr at
+			 * this time would present strangely to the user without an extra
+			 * newline, so we print one.  If we were multithreaded, we'd have
+			 * to avoid splitting this across multiple calls, but we're in an
+			 * event loop, so it doesn't matter.
+			 */
+			if (opts.show_progress)
+				fprintf(stderr, "\n");
+			fprintf(stderr, "%s: btree checking function returned unexpected number of rows: %d\n",
+					progname, ntups);
+			fprintf(stderr, "%s: are %s's and amcheck's versions compatible?\n",
+					progname, progname);
+		}
+	}
+	else
+	{
+		all_checks_pass = false;
+		printf("%s: %s\n", PQdb(conn), PQerrorMessage(conn));
+	}
+
+	return should_processing_continue(res);
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_amcheck"
+ */
+static void
+help(const char *progname)
+{
+	printf("%s checks objects in a PostgreSQL database for corruption.\n\n", progname);
+	printf("Usage:\n");
+	printf("  %s [OPTION]... [DBNAME]\n", progname);
+	printf("\nTarget Options:\n");
+	printf("  -a, --all                 check all databases\n");
+	printf("  -d, --dbname=DBNAME       check specific database(s)\n");
+	printf("  -D, --exclude-dbname=DBNAME do NOT check specific database(s)\n");
+	printf("  -i, --index=INDEX         check specific index(es)\n");
+	printf("  -I, --exclude-index=INDEX do NOT check specific index(es)\n");
+	printf("  -r, --relation=RELNAME    check specific relation(s)\n");
+	printf("  -R, --exclude-relation=RELNAME do NOT check specific relation(s)\n");
+	printf("  -s, --schema=SCHEMA       check specific schema(s)\n");
+	printf("  -S, --exclude-schema=SCHEMA do NOT check specific schema(s)\n");
+	printf("  -t, --table=TABLE         check specific table(s)\n");
+	printf("  -T, --exclude-table=TABLE do NOT check specific table(s)\n");
+	printf("      --no-index-expansion  do NOT expand list of relations to include indexes\n");
+	printf("      --no-toast-expansion  do NOT expand list of relations to include toast\n");
+	printf("      --no-strict-names     do NOT require patterns to match objects\n");
+	printf("\nTable Checking Options:\n");
+	printf("      --exclude-toast-pointers do NOT follow relation toast pointers\n");
+	printf("      --on-error-stop       stop checking at end of first corrupt page\n");
+	printf("      --skip=OPTION         do NOT check \"all-frozen\" or \"all-visible\" blocks\n");
+	printf("      --startblock=BLOCK    begin checking table(s) at the given block number\n");
+	printf("      --endblock=BLOCK      check table(s) only up to the given block number\n");
+	printf("\nBtree Index Checking Options:\n");
+	printf("      --heapallindexed      check all heap tuples are found within indexes\n");
+	printf("      --parent-check        check index parent/child relationships\n");
+	printf("      --rootdescend         search from root page to refind tuples\n");
+	printf("\nConnection options:\n");
+	printf("  -h, --host=HOSTNAME       database server host or socket directory\n");
+	printf("  -p, --port=PORT           database server port\n");
+	printf("  -U, --username=USERNAME   user name to connect as\n");
+	printf("  -w, --no-password         never prompt for password\n");
+	printf("  -W, --password            force password prompt\n");
+	printf("  --maintenance-db=DBNAME   alternate maintenance database\n");
+	printf("\nOther Options:\n");
+	printf("  -e, --echo                show the commands being sent to the server\n");
+	printf("  -j, --jobs=NUM            use this many concurrent connections to the server\n");
+	printf("  -q, --quiet               don't write any messages\n");
+	printf("  -v, --verbose             write a lot of output\n");
+	printf("  -V, --version             output version information, then exit\n");
+	printf("      --progress            show progress information\n");
+	printf("  -?, --help                show this help, then exit\n");
+
+	printf("\nRead the description of the amcheck contrib module for details.\n");
+	printf("\nReport bugs to <%s>.\n", PACKAGE_BUGREPORT);
+	printf("%s home page: <%s>\n", PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Print a progress report based on the global variables.  If verbose output
+ * is enabled, also print the name of the database currently being checked.
+ *
+ * The progress report is written at most once per second, unless the force
+ * parameter is set to true.
+ *
+ * If finished is set to true, this is the last progress report. The cursor
+ * is moved to the next line.
+ */
+static void
+progress_report(uint64 relations_total, uint64 relations_checked,
+				const char *datname, bool force, bool finished)
+{
+	int			percent = 0;
+	char		checked_str[32];
+	char		total_str[32];
+	pg_time_t	now;
+
+	if (!opts.show_progress)
+		return;
+
+	now = time(NULL);
+	if (now == last_progress_report && !force && !finished)
+		return;					/* Max once per second */
+
+	last_progress_report = now;
+	if (relations_total)
+		percent = (int) (relations_checked * 100 / relations_total);
+
+	/*
+	 * Separate step to keep platform-dependent format code out of fprintf
+	 * calls.  We only test for INT64_FORMAT availability in snprintf, not
+	 * fprintf.
+	 */
+	snprintf(checked_str, sizeof(checked_str), INT64_FORMAT, relations_checked);
+	snprintf(total_str, sizeof(total_str), INT64_FORMAT, relations_total);
+
+#define VERBOSE_DATNAME_LENGTH 35
+	if (opts.verbose)
+	{
+		if (!datname)
+
+			/*
+			 * No datname given, so clear the status line (used for first and
+			 * last call)
+			 */
+			fprintf(stderr,
+					"%*s/%s (%d%%) %*s",
+					(int) strlen(total_str),
+					checked_str, total_str, percent,
+					VERBOSE_DATNAME_LENGTH + 2, "");
+		else
+		{
+			bool		truncate = (strlen(datname) > VERBOSE_DATNAME_LENGTH);
+
+			fprintf(stderr,
+					"%*s/%s (%d%%), (%s%-*.*s)",
+					(int) strlen(total_str),
+					checked_str, total_str, percent,
+			/* Prefix with "..." if we do leading truncation */
+					truncate ? "..." : "",
+					truncate ? VERBOSE_DATNAME_LENGTH - 3 : VERBOSE_DATNAME_LENGTH,
+					truncate ? VERBOSE_DATNAME_LENGTH - 3 : VERBOSE_DATNAME_LENGTH,
+			/* Truncate datname at beginning if it's too long */
+					truncate ? datname + strlen(datname) - VERBOSE_DATNAME_LENGTH + 3 : datname);
+		}
+	}
+	else
+		fprintf(stderr,
+				"%*s/%s (%d%%)",
+				(int) strlen(total_str),
+				checked_str, total_str, percent);
+
+	/*
+	 * Stay on the same line if reporting to a terminal and we're not done
+	 * yet.
+	 */
+	fputc((!finished && isatty(fileno(stderr))) ? '\r' : '\n', stderr);
+}
+
+/*
+ * append_database_pattern
+ *
+ * Adds to a list the given pattern interpreted as a database name pattern.
+ *
+ * list: the list to be appended
+ * pattern: the database name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_database_pattern(SimplePtrList *list, const char *pattern, int encoding)
+{
+	PQExpBufferData buf;
+	PatternInfo *info = (PatternInfo *) palloc0(sizeof(PatternInfo));
+
+	info->pattern_id = next_id++;
+
+	initPQExpBuffer(&buf);
+	patternToSQLRegex(encoding, NULL, NULL, &buf, pattern, false);
+	info->pattern = pattern;
+	info->db_regex = pstrdup(buf.data);
+
+	termPQExpBuffer(&buf);
+
+	simple_ptr_list_append(list, info);
+}
+
+/*
+ * append_schema_pattern
+ *
+ * Adds to a list the given pattern interpreted as a schema name pattern.
+ *
+ * list: the list to be appended
+ * pattern: the schema name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_schema_pattern(SimplePtrList *list, const char *pattern, int encoding)
+{
+	PQExpBufferData buf;
+	PatternInfo *info = (PatternInfo *) palloc0(sizeof(PatternInfo));
+
+	info->pattern_id = next_id++;
+
+	initPQExpBuffer(&buf);
+	patternToSQLRegex(encoding, NULL, NULL, &buf, pattern, false);
+	info->pattern = pattern;
+	info->nsp_regex = pstrdup(buf.data);
+	termPQExpBuffer(&buf);
+
+	simple_ptr_list_append(list, info);
+}
+
+/*
+ * append_relation_pattern_helper
+ *
+ * Adds to a list the given pattern interpreted as a relation pattern.
+ *
+ * list: the list to be appended
+ * pattern: the relation name pattern
+ * encoding: client encoding for parsing the pattern
+ * table_only: whether the pattern should only be matched against heap tables
+ * index_only: whether the pattern should only be matched against btree indexes
+ */
+static void
+append_relation_pattern_helper(SimplePtrList *list, const char *pattern,
+							   int encoding, bool table_only, bool index_only)
+{
+	PQExpBufferData dbbuf;
+	PQExpBufferData nspbuf;
+	PQExpBufferData relbuf;
+	PatternInfo *info = (PatternInfo *) palloc0(sizeof(PatternInfo));
+
+	info->pattern_id = next_id++;
+
+	initPQExpBuffer(&dbbuf);
+	initPQExpBuffer(&nspbuf);
+	initPQExpBuffer(&relbuf);
+
+	patternToSQLRegex(encoding, &dbbuf, &nspbuf, &relbuf, pattern, false);
+	info->pattern = pattern;
+	if (dbbuf.data[0])
+		info->db_regex = pstrdup(dbbuf.data);
+	if (nspbuf.data[0])
+		info->nsp_regex = pstrdup(nspbuf.data);
+	if (relbuf.data[0])
+		info->rel_regex = pstrdup(relbuf.data);
+
+	termPQExpBuffer(&dbbuf);
+	termPQExpBuffer(&nspbuf);
+	termPQExpBuffer(&relbuf);
+
+	info->table_only = table_only;
+	info->index_only = index_only;
+
+	simple_ptr_list_append(list, info);
+}
+
+/*
+ * append_relation_pattern
+ *
+ * Adds to a list the given pattern interpreted as a relation pattern, to be
+ * matched against both tables and indexes.
+ *
+ * list: the list to be appended
+ * pattern: the relation name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_relation_pattern(SimplePtrList *list, const char *pattern, int encoding)
+{
+	append_relation_pattern_helper(list, pattern, encoding, false, false);
+}
+
+/*
+ * append_table_pattern
+ *
+ * Adds to a list the given pattern interpreted as a relation pattern, to be
+ * matched only against tables.
+ *
+ * list: the list to be appended
+ * pattern: the relation name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_table_pattern(SimplePtrList *list, const char *pattern, int encoding)
+{
+	append_relation_pattern_helper(list, pattern, encoding, true, false);
+}
+
+/*
+ * append_index_pattern
+ *
+ * Adds to a list the given pattern interpreted as a relation pattern, to be
+ * matched only against indexes.
+ *
+ * list: the list to be appended
+ * pattern: the relation name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_index_pattern(SimplePtrList *list, const char *pattern, int encoding)
+{
+	append_relation_pattern_helper(list, pattern, encoding, false, true);
+}
+
+/*
+ * append_db_pattern_cte
+ *
+ * Appends to the buffer the body of a Common Table Expression (CTE) containing
+ * the database portions filtered from the list of patterns expressed as three
+ * columns:
+ *
+ *     id: the unique pattern ID
+ *     pat: the full user specified pattern from the command line
+ *     rgx: the database regular expression parsed from the pattern
+ *
+ * Patterns without a database portion are skipped.  Patterns that also have
+ * schema and/or relation portions are skipped unless argument 'inclusive'
+ * is true.
+ *
+ * buf: the buffer to be appended
+ * patterns: the list of patterns to be inserted into the CTE
+ * conn: the database connection
+ * inclusive: whether to include patterns with schema and/or relation parts
+ */
+static void
+append_db_pattern_cte(PQExpBuffer buf, const SimplePtrList *patterns,
+					  PGconn *conn, bool inclusive)
+{
+	SimplePtrListCell *cell;
+	const char *comma;
+	bool		have_values;
+
+	comma = "";
+	have_values = false;
+	for (cell = patterns->head; cell; cell = cell->next)
+	{
+		PatternInfo *info = (PatternInfo *) cell->ptr;
+
+		if (info->db_regex != NULL &&
+			(inclusive || (info->nsp_regex == NULL && info->rel_regex == NULL)))
+		{
+			if (!have_values)
+				appendPQExpBufferStr(buf, "\nVALUES");
+			have_values = true;
+			appendPQExpBuffer(buf, "%s\n(%d, ", comma, info->pattern_id);
+			appendStringLiteralConn(buf, info->pattern, conn);
+			appendPQExpBufferStr(buf, ", ");
+			appendStringLiteralConn(buf, info->db_regex, conn);
+			appendPQExpBufferStr(buf, ")");
+			comma = ",";
+		}
+	}
+
+	if (!have_values)
+		appendPQExpBufferStr(buf, "\nSELECT NULL, NULL, NULL WHERE false");
+}
+
+/*
+ * compile_database_list
+ *
+ * Compiles a list of databases to check based on the user supplied options,
+ * sorted to preserve the order they were specified on the command line.  In
+ * the event that multiple databases match a single command line pattern, they
+ * are secondarily sorted by name.
+ *
+ * conn: connection to the initial database
+ * databases: the list onto which databases should be appended
+ */
+static void
+compile_database_list(PGconn *conn, SimplePtrList *databases)
+{
+	PGresult   *res;
+	PQExpBufferData sql;
+	int			ntups;
+	int			i;
+	bool		fatal;
+
+	initPQExpBuffer(&sql);
+
+	/* Append the include patterns CTE. */
+	appendPQExpBufferStr(&sql, "WITH include_raw (id, pat, rgx) AS (");
+	append_db_pattern_cte(&sql, &opts.include, conn, true);
+
+	/* Append the exclude patterns CTE. */
+	appendPQExpBufferStr(&sql, "\n),\nexclude_raw (id, pat, rgx) AS (");
+	append_db_pattern_cte(&sql, &opts.exclude, conn, false);
+	appendPQExpBufferStr(&sql, "\n),");
+
+	/*
+	 * Append the database CTE, which selects only connectable databases and
+	 * joins against exclude_raw to filter out any databases matching an
+	 * exclusion pattern.
+	 */
+	appendPQExpBufferStr(&sql,
+						 "\ndatabase (datname) AS ("
+						 "\nSELECT d.datname"
+						 "\nFROM pg_catalog.pg_database d"
+						 "\nLEFT OUTER JOIN exclude_raw e"
+						 "\nON d.datname ~ e.rgx"
+						 "\nWHERE d.datallowconn"
+						 "\nAND e.id IS NULL"
+						 "\n),"
+
+	/*
+	 * Append the include_pat CTE, which joins the include_raw CTE against the
+	 * databases CTE to determine if all the inclusion patterns had matches,
+	 * and whether each matched pattern had the misfortune of only matching
+	 * excluded or unconnectable databases.
+	 */
+						 "\ninclude_pat (id, pat, checkable) AS ("
+						 "\nSELECT i.id, i.pat,"
+						 "\nCOUNT(*) FILTER ("
+						 "\nWHERE d IS NOT NULL"
+						 "\n) AS checkable"
+						 "\nFROM include_raw i"
+						 "\nLEFT OUTER JOIN database d"
+						 "\nON d.datname ~ i.rgx"
+						 "\nGROUP BY i.id, i.pat"
+						 "\n),"
+
+	/*
+	 * Append the filtered_databases CTE, which selects from the database CTE
+	 * optionally joined against the include_raw CTE to only select databases
+	 * that match an inclusion pattern.  This appears to duplicate what the
+	 * include_pat CTE already did above, but here we want only databases,
+	 * whereas there we wanted patterns.
+	 */
+						 "\nfiltered_databases (datname) AS ("
+						 "\nSELECT DISTINCT d.datname"
+						 "\nFROM database d");
+	if (!opts.alldb)
+		appendPQExpBufferStr(&sql,
+							 "\nINNER JOIN include_raw i"
+							 "\nON d.datname ~ i.rgx");
+	appendPQExpBufferStr(&sql,
+						 "\n)"
+
+	/*
+	 * Select the checkable databases and the unmatched inclusion patterns.
+	 */
+						 "\nSELECT pat, datname"
+						 "\nFROM ("
+						 "\nSELECT id, pat, NULL::TEXT AS datname"
+						 "\nFROM include_pat"
+						 "\nWHERE checkable = 0"
+						 "\nUNION ALL"
+						 "\nSELECT NULL, NULL, datname"
+						 "\nFROM filtered_databases"
+						 "\n) AS combined_records"
+						 "\nORDER BY id NULLS LAST, datname");
+
+	res = executeQuery(conn, sql.data, opts.echo);
+	if (PQresultStatus(res) != PGRES_TUPLES_OK)
+	{
+		pg_log_error("query failed: %s", PQerrorMessage(conn));
+		pg_log_error("query was: %s", sql.data);
+		disconnectDatabase(conn);
+		exit(1);
+	}
+	termPQExpBuffer(&sql);
+
+	ntups = PQntuples(res);
+	for (fatal = false, i = 0; i < ntups; i++)
+	{
+		const char *pat = NULL;
+		const char *datname = NULL;
+
+		if (!PQgetisnull(res, i, 0))
+			pat = PQgetvalue(res, i, 0);
+		if (!PQgetisnull(res, i, 1))
+			datname = PQgetvalue(res, i, 1);
+
+		if (pat != NULL)
+		{
+			/*
+			 * Current record pertains to an inclusion pattern that matched no
+			 * checkable databases.
+			 */
+			fatal = opts.strict_names;
+			fprintf(stderr, "%s: no checkable database: \"%s\"\n",
+					progname, pat);
+		}
+		else
+		{
+			DatabaseInfo *dat = (DatabaseInfo *) palloc0(sizeof(DatabaseInfo));
+
+			/* Current record pertains to a database */
+			Assert(datname != NULL);
+
+			/* This database is included.  Add to list */
+			if (opts.verbose)
+				fprintf(stderr, "%s: including database: \"%s\"\n", progname,
+						datname);
+
+			dat->datname = pstrdup(datname);
+			simple_ptr_list_append(databases, dat);
+		}
+	}
+	PQclear(res);
+
+	if (fatal)
+	{
+		disconnectDatabase(conn);
+		exit(1);
+	}
+}
+
+/*
+ * append_rel_pattern_raw_cte
+ *
+ * Appends to the buffer the body of a Common Table Expression (CTE) containing
+ * the patterns from the given list as seven columns:
+ *
+ *     id: the unique pattern ID
+ *     pat: the full user specified pattern from the command line
+ *     db_regex: the database regexp parsed from the pattern, or NULL if the
+ *               pattern had no database part
+ *     nsp_regex: the namespace regexp parsed from the pattern, or NULL if the
+ *                pattern had no namespace part
+ *     rel_regex: the relname regexp parsed from the pattern, or NULL if the
+ *                pattern had no relname part
+ *     table_only: true if the pattern applies only to tables (not indexes)
+ *     index_only: true if the pattern applies only to indexes (not tables)
+ *
+ * buf: the buffer to be appended
+ * patterns: the list of patterns to be inserted into the CTE
+ * conn: the database connection
+ */
+static void
+append_rel_pattern_raw_cte(PQExpBuffer buf, const SimplePtrList *patterns,
+						   PGconn *conn)
+{
+	SimplePtrListCell *cell;
+	const char *comma;
+	bool		have_values;
+
+	comma = "";
+	have_values = false;
+	for (cell = patterns->head; cell; cell = cell->next)
+	{
+		PatternInfo *info = (PatternInfo *) cell->ptr;
+
+		if (!have_values)
+			appendPQExpBufferStr(buf, "\nVALUES");
+		have_values = true;
+		appendPQExpBuffer(buf, "%s\n(%d::INTEGER, ", comma, info->pattern_id);
+		appendStringLiteralConn(buf, info->pattern, conn);
+		appendPQExpBufferStr(buf, "::TEXT, ");
+		if (info->db_regex == NULL)
+			appendPQExpBufferStr(buf, "NULL");
+		else
+			appendStringLiteralConn(buf, info->db_regex, conn);
+		appendPQExpBufferStr(buf, "::TEXT, ");
+		if (info->nsp_regex == NULL)
+			appendPQExpBufferStr(buf, "NULL");
+		else
+			appendStringLiteralConn(buf, info->nsp_regex, conn);
+		appendPQExpBufferStr(buf, "::TEXT, ");
+		if (info->rel_regex == NULL)
+			appendPQExpBufferStr(buf, "NULL");
+		else
+			appendStringLiteralConn(buf, info->rel_regex, conn);
+		if (info->table_only)
+			appendPQExpBufferStr(buf, "::TEXT, true::BOOLEAN");
+		else
+			appendPQExpBufferStr(buf, "::TEXT, false::BOOLEAN");
+		if (info->index_only)
+			appendPQExpBufferStr(buf, ", true::BOOLEAN");
+		else
+			appendPQExpBufferStr(buf, ", false::BOOLEAN");
+		appendPQExpBufferStr(buf, ")");
+		comma = ",";
+	}
+
+	if (!have_values)
+		appendPQExpBufferStr(buf,
+							 "\nSELECT NULL::INTEGER, NULL::TEXT, NULL::TEXT,"
+							 "\nNULL::TEXT, NULL::TEXT, NULL::BOOLEAN,"
+							 "\nNULL::BOOLEAN"
+							 "\nWHERE false");
+}
+
+/*
+ * append_rel_pattern_filtered_cte
+ *
+ * Appends to the buffer a Common Table Expression (CTE) which selects
+ * all patterns from the named raw CTE, filtered by database.  All patterns
+ * which have no database portion or whose database portion matches our
+ * connection's database name are selected, with other patterns excluded.
+ *
+ * The basic idea here is that if we're connected to database "foo" and we have
+ * patterns "foo.bar.baz", "alpha.beta" and "one.two.three", we only want to
+ * use the first two while processing relations in this database, as the third
+ * one is not relevant.
+ *
+ * buf: the buffer to be appended
+ * raw: the name of the CTE to select from
+ * filtered: the name of the CTE to create
+ * conn: the database connection
+ */
+static void
+append_rel_pattern_filtered_cte(PQExpBuffer buf, const char *raw,
+								const char *filtered, PGconn *conn)
+{
+	appendPQExpBuffer(buf,
+					  "\n%s (id, pat, nsp_regex, rel_regex, table_only, index_only) AS ("
+					  "\nSELECT id, pat, nsp_regex, rel_regex, table_only, index_only"
+					  "\nFROM %s r"
+					  "\nWHERE (r.db_regex IS NULL"
+					  "\nOR ",
+					  filtered, raw);
+	appendStringLiteralConn(buf, PQdb(conn), conn);
+	appendPQExpBufferStr(buf, " ~ r.db_regex)");
+	appendPQExpBufferStr(buf,
+						 "\nAND (r.nsp_regex IS NOT NULL"
+						 "\nOR r.rel_regex IS NOT NULL)"
+						 "\n),");
+}
+
+/*
+ * compile_relation_list_one_db
+ *
+ * Compiles a list of relations to check within the currently connected
+ * database based on the user supplied options, sorted by descending size,
+ * and appends them to the given list of relations.
+ *
+ * The cells of the constructed list contain all information about the relation
+ * necessary to connect to the database and check the object, including which
+ * database to connect to, where contrib/amcheck is installed, and the OID and
+ * type of object (table vs. index).  Rather than duplicating the database
+ * details per relation, the relation structs use references to the same
+ * database object, provided by the caller.
+ *
+ * conn: connection to the target database, which must match the one in 'dat'
+ * relations: list onto which the relations information should be appended
+ * dat: the database info struct for use by each relation
+ */
+static void
+compile_relation_list_one_db(PGconn *conn, SimplePtrList *relations,
+							 const DatabaseInfo *dat)
+{
+	PGresult   *res;
+	PQExpBufferData sql;
+	int			ntups;
+	int			i;
+	const char *datname;
+
+	initPQExpBuffer(&sql);
+	appendPQExpBufferStr(&sql, "WITH");
+
+	/* Append CTEs for the relation inclusion patterns, if any */
+	if (!opts.allrel)
+	{
+		appendPQExpBufferStr(&sql,
+							 "\ninclude_raw (id, pat, db_regex, nsp_regex, rel_regex, table_only, index_only) AS (");
+		append_rel_pattern_raw_cte(&sql, &opts.include, conn);
+		appendPQExpBufferStr(&sql, "\n),");
+		append_rel_pattern_filtered_cte(&sql, "include_raw", "include_pat", conn);
+	}
+
+	/* Append CTEs for the relation exclusion patterns, if any */
+	if (opts.excludetbl || opts.excludeidx)
+	{
+		appendPQExpBufferStr(&sql,
+							 "\nexclude_raw (id, pat, db_regex, nsp_regex, rel_regex, table_only, index_only) AS (");
+		append_rel_pattern_raw_cte(&sql, &opts.exclude, conn);
+		appendPQExpBufferStr(&sql, "\n),");
+		append_rel_pattern_filtered_cte(&sql, "exclude_raw", "exclude_pat", conn);
+	}
+
+	/* Append the relation CTE. */
+	appendPQExpBufferStr(&sql,
+						 "\nrelation (id, pat, oid, reltoastrelid, relpages, is_table, is_index) AS ("
+						 "\nSELECT DISTINCT ON (c.oid");
+	if (!opts.allrel)
+		appendPQExpBufferStr(&sql, ", ip.id) ip.id, ip.pat,");
+	else
+		appendPQExpBufferStr(&sql, ") NULL::INTEGER AS id, NULL::TEXT AS pat,");
+	appendPQExpBuffer(&sql,
+					  "\nc.oid, c.reltoastrelid, c.relpages,"
+					  "\nc.relam = %u AS is_table,"
+					  "\nc.relam = %u AS is_index"
+					  "\nFROM pg_catalog.pg_class c"
+					  "\nINNER JOIN pg_catalog.pg_namespace n"
+					  "\nON c.relnamespace = n.oid",
+					  HEAP_TABLE_AM_OID, BTREE_AM_OID);
+	if (!opts.allrel)
+		appendPQExpBuffer(&sql,
+						  "\nINNER JOIN include_pat ip"
+						  "\nON (n.nspname ~ ip.nsp_regex OR ip.nsp_regex IS NULL)"
+						  "\nAND (c.relname ~ ip.rel_regex OR ip.rel_regex IS NULL)"
+						  "\nAND (c.relam = %u OR NOT ip.table_only)"
+						  "\nAND (c.relam = %u OR NOT ip.index_only)",
+						  HEAP_TABLE_AM_OID, BTREE_AM_OID);
+	if (opts.excludetbl || opts.excludeidx)
+		appendPQExpBuffer(&sql,
+						  "\nLEFT OUTER JOIN exclude_pat ep"
+						  "\nON (n.nspname ~ ep.nsp_regex OR ep.nsp_regex IS NULL)"
+						  "\nAND (c.relname ~ ep.rel_regex OR ep.rel_regex IS NULL)"
+						  "\nAND (c.relam = %u OR NOT ep.table_only)"
+						  "\nAND (c.relam = %u OR NOT ep.index_only)",
+						  HEAP_TABLE_AM_OID, BTREE_AM_OID);
+
+	if (opts.excludetbl || opts.excludeidx)
+		appendPQExpBufferStr(&sql, "\nWHERE ep.pat IS NULL");
+	else
+		appendPQExpBufferStr(&sql, "\nWHERE true");
+
+	/*
+	 * We need to be careful not to break the --no-toast-expansion and
+	 * --no-index-expansion options.  By default, the indexes, toast tables,
+	 * and toast table indexes associated with primary tables are included,
+	 * using their own CTEs below.  We implement the --exclude-* options by
+	 * not creating those CTEs, but that's no use if we've already selected
+	 * the toast and indexes here.  On the other hand, we want inclusion
+	 * patterns that match indexes or toast tables to be honored.  So, if
+	 * inclusion patterns were given, we want to select all tables, toast
+	 * tables, or indexes that match the patterns.  But if no inclusion
+	 * patterns were given, and we're simply matching all relations, then we
+	 * only want to match the primary tables here.
+	 */
+	if (opts.allrel)
+		appendPQExpBuffer(&sql,
+						  "\nAND c.relam = %u"
+						  "\nAND c.relkind IN ('r', 'm', 't')"
+						  "\nAND c.relnamespace != %u",
+						  HEAP_TABLE_AM_OID, PG_TOAST_NAMESPACE);
+	else
+		appendPQExpBuffer(&sql,
+						  "\nAND c.relam IN (%u, %u)"
+						  "\nAND c.relkind IN ('r', 'm', 't', 'i')"
+						  "\nAND ((c.relam = %u AND c.relkind IN ('r', 'm', 't')) OR"
+						  "\n(c.relam = %u AND c.relkind = 'i'))",
+						  HEAP_TABLE_AM_OID, BTREE_AM_OID,
+						  HEAP_TABLE_AM_OID, BTREE_AM_OID);
+
+	appendPQExpBufferStr(&sql,
+						 "\nORDER BY c.oid"
+						 "\n)");
+
+	if (!opts.no_toast_expansion)
+	{
+		/*
+		 * Include a CTE for toast tables associated with primary tables
+		 * selected above, filtering by exclusion patterns (if any) that match
+		 * toast table names.
+		 */
+		appendPQExpBufferStr(&sql,
+							 ",\ntoast (oid, relpages) AS ("
+							 "\nSELECT t.oid, t.relpages"
+							 "\nFROM pg_catalog.pg_class t"
+							 "\nINNER JOIN relation r"
+							 "\nON r.reltoastrelid = t.oid");
+		if (opts.excludetbl)
+			appendPQExpBufferStr(&sql,
+								 "\nLEFT OUTER JOIN exclude_pat ep"
+								 "\nON ('pg_toast' ~ ep.nsp_regex OR ep.nsp_regex IS NULL)"
+								 "\nAND (t.relname ~ ep.rel_regex OR ep.rel_regex IS NULL)"
+								 "\nAND ep.table_only"
+								 "\nWHERE ep.id IS NULL");
+		appendPQExpBufferStr(&sql,
+							 "\n)");
+	}
+	if (!opts.no_index_expansion)
+	{
+		/*
+		 * Include a CTE for btree indexes associated with primary tables
+		 * selected above, filtering by exclusion patterns (if any) that match
+		 * btree index names.
+		 */
+		appendPQExpBufferStr(&sql,
+							 ",\nindex (oid, relpages) AS ("
+							 "\nSELECT c.oid, c.relpages"
+							 "\nFROM relation r"
+							 "\nINNER JOIN pg_catalog.pg_index i"
+							 "\nON r.oid = i.indrelid"
+							 "\nINNER JOIN pg_catalog.pg_class c"
+							 "\nON i.indexrelid = c.oid");
+		if (opts.excludeidx)
+			appendPQExpBufferStr(&sql,
+								 "\nINNER JOIN pg_catalog.pg_namespace n"
+								 "\nON c.relnamespace = n.oid"
+								 "\nLEFT OUTER JOIN exclude_pat ep"
+								 "\nON (n.nspname ~ ep.nsp_regex OR ep.nsp_regex IS NULL)"
+								 "\nAND (c.relname ~ ep.rel_regex OR ep.rel_regex IS NULL)"
+								 "\nAND ep.index_only"
+								 "\nWHERE ep.id IS NULL");
+		else
+			appendPQExpBufferStr(&sql,
+								 "\nWHERE true");
+		appendPQExpBuffer(&sql,
+						  "\nAND c.relam = %u"
+						  "\nAND c.relkind = 'i'",
+						  BTREE_AM_OID);
+		if (opts.no_toast_expansion)
+			appendPQExpBuffer(&sql,
+							  "\nAND c.relnamespace != %u",
+							  PG_TOAST_NAMESPACE);
+		appendPQExpBufferStr(&sql, "\n)");
+	}
+
+	if (!opts.no_toast_expansion && !opts.no_index_expansion)
+	{
+		/*
+		 * Include a CTE for btree indexes associated with toast tables of
+		 * primary tables selected above, filtering by exclusion patterns (if
+		 * any) that match the toast index names.
+		 */
+		appendPQExpBufferStr(&sql,
+							 ",\ntoast_index (oid, relpages) AS ("
+							 "\nSELECT c.oid, c.relpages"
+							 "\nFROM toast t"
+							 "\nINNER JOIN pg_catalog.pg_index i"
+							 "\nON t.oid = i.indrelid"
+							 "\nINNER JOIN pg_catalog.pg_class c"
+							 "\nON i.indexrelid = c.oid");
+		if (opts.excludeidx)
+			appendPQExpBufferStr(&sql,
+								 "\nLEFT OUTER JOIN exclude_pat ep"
+								 "\nON ('pg_toast' ~ ep.nsp_regex OR ep.nsp_regex IS NULL)"
+								 "\nAND (c.relname ~ ep.rel_regex OR ep.rel_regex IS NULL)"
+								 "\nAND ep.index_only"
+								 "\nWHERE ep.id IS NULL");
+		else
+			appendPQExpBufferStr(&sql,
+								 "\nWHERE true");
+		appendPQExpBuffer(&sql,
+						  "\nAND c.relam = %u"
+						  "\nAND c.relkind = 'i'"
+						  "\n)",
+						  BTREE_AM_OID);
+	}
+
+	/*
+	 * Roll up distinct rows from the CTEs.
+	 *
+	 * Relations that match more than one pattern may occur more than once in
+	 * the list, and indexes and toast for primary relations may also have
+	 * matched in their own right, so we rely on UNION to deduplicate the
+	 * list.
+	 */
+	appendPQExpBufferStr(&sql,
+						 "\nSELECT id, is_table, is_index, oid"
+						 "\nFROM (");
+	appendPQExpBufferStr(&sql,
+	/* Inclusion patterns that failed to match */
+						 "\nSELECT id, is_table, is_index,"
+						 "\nNULL::OID AS oid,"
+						 "\nNULL::INTEGER AS relpages"
+						 "\nFROM relation"
+						 "\nWHERE id IS NOT NULL"
+						 "\nUNION"
+	/* Primary relations */
+						 "\nSELECT NULL::INTEGER AS id,"
+						 "\nis_table, is_index,"
+						 "\noid, relpages"
+						 "\nFROM relation");
+	if (!opts.no_toast_expansion)
+		appendPQExpBufferStr(&sql,
+							 "\nUNION"
+		/* Toast tables for primary relations */
+							 "\nSELECT NULL::INTEGER AS id, TRUE AS is_table,"
+							 "\nFALSE AS is_index, oid, relpages"
+							 "\nFROM toast");
+	if (!opts.no_index_expansion)
+		appendPQExpBufferStr(&sql,
+							 "\nUNION"
+		/* Indexes for primary relations */
+							 "\nSELECT NULL::INTEGER AS id, FALSE AS is_table,"
+							 "\nTRUE AS is_index, oid, relpages"
+							 "\nFROM index");
+	if (!opts.no_toast_expansion && !opts.no_index_expansion)
+		appendPQExpBufferStr(&sql,
+							 "\nUNION"
+		/* Indexes for toast relations */
+							 "\nSELECT NULL::INTEGER AS id, FALSE AS is_table,"
+							 "\nTRUE AS is_index, oid, relpages"
+							 "\nFROM toast_index");
+	appendPQExpBufferStr(&sql,
+						 "\n) AS combined_records"
+						 "\nORDER BY relpages DESC NULLS FIRST, oid");
+
+	res = executeQuery(conn, sql.data, opts.echo);
+	if (PQresultStatus(res) != PGRES_TUPLES_OK)
+	{
+		pg_log_error("query failed: %s", PQerrorMessage(conn));
+		pg_log_error("query was: %s", sql.data);
+		disconnectDatabase(conn);
+		exit(1);
+	}
+	termPQExpBuffer(&sql);
+
+	/*
+	 * Allocate a single copy of the database name to be shared by all nodes
+	 * in the object list, constructed below.
+	 */
+	datname = pstrdup(PQdb(conn));
+
+	ntups = PQntuples(res);
+	for (i = 0; i < ntups; i++)
+	{
+		int			pattern_id = 0;
+		bool		is_table = false;
+		bool		is_index = false;
+		Oid			oid = InvalidOid;
+
+		if (!PQgetisnull(res, i, 0))
+			pattern_id = atoi(PQgetvalue(res, i, 0));
+		if (!PQgetisnull(res, i, 1))
+			is_table = (PQgetvalue(res, i, 1)[0] == 't');
+		if (!PQgetisnull(res, i, 2))
+			is_index = (PQgetvalue(res, i, 2)[0] == 't');
+		if (!PQgetisnull(res, i, 3))
+			oid = atooid(PQgetvalue(res, i, 3));
+
+		if (pattern_id > 0)
+		{
+			/*
+			 * Current record pertains to an inclusion pattern.  Find the
+			 * pattern in the list and record that it matched.  If we expected
+			 * a large number of command-line inclusion pattern arguments, the
+			 * data structure here might need to be more efficient, but we
+			 * expect the list to be short.
+			 */
+
+			SimplePtrListCell *cell;
+			bool		found;
+
+			for (found = false, cell = opts.include.head; cell; cell = cell->next)
+			{
+				PatternInfo *info = (PatternInfo *) cell->ptr;
+
+				if (info->pattern_id == pattern_id)
+				{
+					info->matched = true;
+					found = true;
+					break;
+				}
+			}
+			if (!found)
+			{
+				pg_log_error("internal error: received unexpected pattern_id %d",
+							 pattern_id);
+				exit(1);
+			}
+		}
+		else
+		{
+			/* Current record pertains to a relation */
+
+			RelationInfo *rel = (RelationInfo *) palloc0(sizeof(RelationInfo));
+
+			Assert(OidIsValid(oid));
+			Assert(is_table ^ is_index);
+
+			rel->datinfo = dat;
+			rel->reloid = oid;
+			rel->is_table = is_table;
+
+			simple_ptr_list_append(relations, rel);
+		}
+	}
+	PQclear(res);
+}
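As an aside on the matched-pattern bookkeeping near the end of the function above: the linear scan over the inclusion-pattern list is deliberate, since (as the comment notes) the list comes from command-line arguments and stays short.  A rough sketch of the same logic in Python, with hypothetical names, purely illustrative and not part of the patch:

```python
def mark_matched(patterns, pattern_id):
    """Mark the inclusion pattern with the given id as matched.

    `patterns` is a short list of dicts with 'pattern_id' and 'matched'
    keys, mirroring the patch's PatternInfo list.  A linear scan is fine
    because the list comes from command-line arguments and stays small.
    """
    for info in patterns:
        if info["pattern_id"] == pattern_id:
            info["matched"] = True
            return True
    return False  # the caller treats this as an internal error


patterns = [{"pattern_id": 1, "matched": False},
            {"pattern_id": 2, "matched": False}]
mark_matched(patterns, 2)
```

If pg_amcheck ever grew to accept very many patterns, a mapping keyed by pattern_id would make the lookup O(1), which is what the patch's comment hints at.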
diff --git a/contrib/pg_amcheck/t/001_basic.pl b/contrib/pg_amcheck/t/001_basic.pl
new file mode 100644
index 0000000000..dfa0ae9e06
--- /dev/null
+++ b/contrib/pg_amcheck/t/001_basic.pl
@@ -0,0 +1,9 @@
+use strict;
+use warnings;
+
+use TestLib;
+use Test::More tests => 8;
+
+program_help_ok('pg_amcheck');
+program_version_ok('pg_amcheck');
+program_options_handling_ok('pg_amcheck');
diff --git a/contrib/pg_amcheck/t/002_nonesuch.pl b/contrib/pg_amcheck/t/002_nonesuch.pl
new file mode 100644
index 0000000000..8c6e267ee9
--- /dev/null
+++ b/contrib/pg_amcheck/t/002_nonesuch.pl
@@ -0,0 +1,213 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 60;
+
+# Test set-up
+my ($node, $port);
+$node = get_new_node('test');
+$node->init;
+$node->start;
+$port = $node->port;
+
+# Load the amcheck extension, upon which pg_amcheck depends
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+
+#########################################
+# Test connecting to a non-existent database
+
+# Failing to connect to the initial database is an error.
+command_fails_like(
+	[ 'pg_amcheck', '-p', $port, 'qqq' ],
+	qr/database "qqq" does not exist/,
+	'checking a non-existent database');
+
+# Failing to resolve a secondary database name is also an error, though since
+# the string is treated as a pattern, the error message looks different.
+command_fails_like(
+	[ 'pg_amcheck', '-p', $port, 'postgres', 'qqq' ],
+	qr/pg_amcheck: no checkable database: "qqq"/,
+	'checking a non-existent database');
+
+# Failing to connect to the initial database is still an error when using
+# --no-strict-names.
+command_fails_like(
+	[ 'pg_amcheck', '--no-strict-names', '-p', $port, 'qqq' ],
+	qr/database "qqq" does not exist/,
+	'checking a non-existent database with --no-strict-names');
+
+# But failing to resolve secondary database names is not an error when using
+# --no-strict-names.  We should still see the message, but as a non-fatal
+# warning.
+$node->command_checks_all(
+	[ 'pg_amcheck', '--no-strict-names', '-p', $port, '-d', 'no_such_database', 'postgres', 'qqq' ],
+	0,
+	[ ],
+	[ qr/no checkable database: "qqq"/ ],
+	'checking a non-existent secondary database with --no-strict-names');
+
+# Check that a substring of an existent database name does not get interpreted
+# as a matching pattern.
+command_fails_like(
+	[ 'pg_amcheck', '-p', $port, 'post' ],
+	qr/database "post" does not exist/,
+	'checking a non-existent primary database (substring of existent database)');
+
+# And again, but testing the secondary database name rather than the primary
+command_fails_like(
+	[ 'pg_amcheck', '-p', $port, 'postgres', 'post' ],
+	qr/pg_amcheck: no checkable database: "post"/,
+	'checking a non-existent secondary database (substring of existent database)');
+
+# Likewise, check that a superstring of an existent database name does not get
+# interpreted as a matching pattern.
+command_fails_like(
+	[ 'pg_amcheck', '-p', $port, 'postresql' ],
+	qr/database "postresql" does not exist/,
+	'checking a non-existent primary database (superstring of existent database)');
+
+# And again, but testing the secondary database name rather than the primary
+command_fails_like(
+	[ 'pg_amcheck', '-p', $port, 'postgres', 'postgresql' ],
+	qr/pg_amcheck: no checkable database: "postgresql"/,
+	'checking a non-existent secondary database (superstring of existent database)');
+
+#########################################
+# Test connecting with a non-existent user
+
+# Failing to connect to the initial database due to bad username is an error.
+command_fails_like(
+	[ 'pg_amcheck', '-p', $port, '-U=no_such_user', 'postgres' ],
+	qr/role "=no_such_user" does not exist/,
+	'checking with a non-existent user');
+
+# Failing to connect to the initial database due to bad username is still an
+# error when using --no-strict-names.
+command_fails_like(
+	[ 'pg_amcheck', '--no-strict-names', '-p', $port, '-U=no_such_user', 'postgres' ],
+	qr/role "=no_such_user" does not exist/,
+	'checking with a non-existent user, --no-strict-names');
+
+#########################################
+# Test checking databases without amcheck installed
+
+# Attempting to check a database by name where amcheck is not installed should
+# raise a warning.  If all databases are skipped, having no relations to check
+# raises an error.
+$node->command_checks_all(
+	[ 'pg_amcheck', '-p', $port, 'template1' ],
+	1,
+	[],
+	[ qr/pg_amcheck: skipping database "template1": amcheck is not installed/,
+	  qr/pg_amcheck: no relations to check/ ],
+	'checking a database by name without amcheck installed');
+
+# Likewise, but by database pattern rather than by name, such that some
+# databases with amcheck installed are included, and so checking occurs and
+# only a warning is raised.
+$node->command_checks_all(
+	[ 'pg_amcheck', '-p', $port, '-d', '*', 'postgres' ],
+	0,
+	[],
+	[ qr/pg_amcheck: skipping database "template1": amcheck is not installed/ ],
+	'checking a database by dbname implication without amcheck installed');
+
+# And again, but by checking all databases.
+$node->command_checks_all(
+	[ 'pg_amcheck', '-p', $port, '--all', 'postgres' ],
+	0,
+	[],
+	[ qr/pg_amcheck: skipping database "template1": amcheck is not installed/ ],
+	'checking a database by --all implication without amcheck installed');
+
+#########################################
+# Test unreasonable patterns
+
+# Check three-part unreasonable pattern that has zero-length names
+$node->command_checks_all(
+	[ 'pg_amcheck', '-p', $port, 'postgres', '-t', '..' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: no checkable database: "\.\."/ ],
+	'checking table pattern ".."');
+
+# Again, but with non-trivial schema and relation parts
+$node->command_checks_all(
+	[ 'pg_amcheck', '-p', $port, 'postgres', '-t', '.foo.bar' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: no checkable database: "\.foo\.bar"/ ],
+	'checking table pattern ".foo.bar"');
+
+# Check two-part unreasonable pattern that has zero-length names
+$node->command_checks_all(
+	[ 'pg_amcheck', '-p', $port, 'postgres', '-t', '.' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: no tables to check for "\."/ ],
+	'checking table pattern "."');
+
+#########################################
+# Test checking non-existent schemas, tables, and indexes
+
+command_fails_like(
+	[ 'pg_amcheck', '-p', $port, '-s', 'no_such_schema' ],
+	qr/pg_amcheck: no relations to check in schemas for "no_such_schema"/,
+	'checking a non-existent schema');
+
+command_fails_like(
+	[ 'pg_amcheck', '--no-strict-names', '-v', '-p', $port, '-s', 'no_such_schema' ],
+	qr/pg_amcheck: no relations to check/,
+	'checking a non-existent schema with --no-strict-names -v');
+
+command_fails_like(
+	[ 'pg_amcheck', '-p', $port, '-t', 'no_such_table' ],
+	qr/pg_amcheck: no tables to check for "no_such_table"/,
+	'checking a non-existent table');
+
+command_fails_like(
+	[ 'pg_amcheck', '--no-strict-names', '-v', '-p', $port, '-t', 'no_such_table' ],
+	qr/pg_amcheck: no relations to check/,
+	'checking a non-existent table with --no-strict-names -v');
+
+command_fails_like(
+	[ 'pg_amcheck', '-p', $port, '-i', 'no_such_index' ],
+	qr/pg_amcheck: no btree indexes to check for "no_such_index"/,
+	'checking a non-existent index');
+
+command_fails_like(
+	[ 'pg_amcheck', '--no-strict-names', '-v', '-p', $port, '-i', 'no_such_index' ],
+	qr/pg_amcheck: no relations to check/,
+	'checking a non-existent index with --no-strict-names -v');
+
+command_fails_like(
+	[ 'pg_amcheck', '-p', $port, '-s', 'no*such*schema*' ],
+	qr/pg_amcheck: no relations to check in schemas for "no\*such\*schema\*"/,
+	'no matching schemas');
+
+command_fails_like(
+	[ 'pg_amcheck', '--no-strict-names', '-v', '-p', $port, '-s', 'no*such*schema*' ],
+	qr/pg_amcheck: no relations to check/,
+	'no matching schemas with --no-strict-names -v');
+
+command_fails_like(
+	[ 'pg_amcheck', '-p', $port, '-t', 'no*such*table*' ],
+	qr/pg_amcheck: no tables to check for "no\*such\*table\*"/,
+	'no matching tables');
+
+command_fails_like(
+	[ 'pg_amcheck', '--no-strict-names', '-v', '-p', $port, '-t', 'no*such*table*' ],
+	qr/pg_amcheck: no relations to check/,
+	'no matching tables with --no-strict-names -v');
+
+command_fails_like(
+	[ 'pg_amcheck', '-p', $port, '-i', 'no*such*index*' ],
+	qr/pg_amcheck: no btree indexes to check for "no\*such\*index\*"/,
+	'no matching indexes');
+
+command_fails_like(
+	[ 'pg_amcheck', '--no-strict-names', '-v', '-p', $port, '-i', 'no*such*index*' ],
+	qr/pg_amcheck: no relations to check/,
+	'no matching indexes with --no-strict-names -v');
diff --git a/contrib/pg_amcheck/t/003_check.pl b/contrib/pg_amcheck/t/003_check.pl
new file mode 100644
index 0000000000..502b599fcd
--- /dev/null
+++ b/contrib/pg_amcheck/t/003_check.pl
@@ -0,0 +1,497 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 57;
+
+my ($node, $port, %corrupt_page, %remove_relation);
+
+# Returns the filesystem path for the named relation.
+#
+# Assumes the test node is running
+sub relation_filepath($$)
+{
+	my ($dbname, $relname) = @_;
+
+	my $pgdata = $node->data_dir;
+	my $rel = $node->safe_psql($dbname,
+							   qq(SELECT pg_relation_filepath('$relname')));
+	die "path not found for relation $relname" unless defined $rel;
+	return "$pgdata/$rel";
+}
+
+# Returns the name of the toast relation associated with the named relation.
+#
+# Assumes the test node is running
+sub relation_toast($$)
+{
+	my ($dbname, $relname) = @_;
+
+	my $rel = $node->safe_psql($dbname, qq(
+		SELECT ct.relname
+			FROM pg_catalog.pg_class cr, pg_catalog.pg_class ct
+			WHERE cr.oid = '$relname'::regclass
+			  AND cr.reltoastrelid = ct.oid
+			));
+	return undef unless defined $rel;
+	return "pg_toast.$rel";
+}
+
+# Adds the relation file for the given (dbname, relname) to the list
+# to be corrupted by means of overwriting junk in the first page.
+#
+# Assumes the test node is running.
+sub plan_to_corrupt_first_page($$)
+{
+	my ($dbname, $relname) = @_;
+	my $relpath = relation_filepath($dbname, $relname);
+	$corrupt_page{$relpath} = 1;
+}
+
+# Adds the relation file for the given (dbname, relname) to the list
+# to be corrupted by means of removing the file.
+#
+# Assumes the test node is running
+sub plan_to_remove_relation_file($$)
+{
+	my ($dbname, $relname) = @_;
+	my $relpath = relation_filepath($dbname, $relname);
+	$remove_relation{$relpath} = 1;
+}
+
+# For the given (dbname, relname), if a corresponding toast table
+# exists, adds that toast table's relation file to the list to be
+# corrupted by means of removing the file.
+#
+# Assumes the test node is running.
+sub plan_to_remove_toast_file($$)
+{
+	my ($dbname, $relname) = @_;
+	my $toastname = relation_toast($dbname, $relname);
+	plan_to_remove_relation_file($dbname, $toastname) if ($toastname);
+}
+
+# Corrupts the first page of the given file path
+sub corrupt_first_page($)
+{
+	my ($relpath) = @_;
+
+	my $fh;
+	open($fh, '+<', $relpath)
+	  or BAIL_OUT("open failed: $!");
+	binmode $fh;
+
+	# Corrupt some line pointers.  The values are chosen to hit the
+	# various line-pointer-corruption checks in verify_heapam.c
+	# on both little-endian and big-endian architectures.
+	seek($fh, 32, 0)
+	  or BAIL_OUT("seek failed: $!");
+	syswrite(
+		$fh,
+		pack("L*",
+			0xAAA15550, 0xAAA0D550, 0x00010000,
+			0x00008000, 0x0000800F, 0x001e8000,
+			0xFFFFFFFF)
+	) or BAIL_OUT("syswrite failed: $!");
+	close($fh)
+	  or BAIL_OUT("close failed: $!");
+}
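For reference, the seek to offset 32 above lands inside the page's line pointer array: the fixed page header is 24 bytes, so offset 32 is 8 bytes past the header, i.e. starting at the third 4-byte line pointer.  A quick cross-check of the byte count being written, using Python's struct in place of Perl's pack (illustrative only, not part of the test):

```python
import struct

# The same seven unsigned 32-bit values the test writes at offset 32.
values = (0xAAA15550, 0xAAA0D550, 0x00010000,
          0x00008000, 0x0000800F, 0x001e8000,
          0xFFFFFFFF)

# Perl's pack("L*", ...) emits native-endian 32-bit words; "=7L" asks
# struct for the same: native byte order with the standard 4-byte size.
buf = struct.pack("=7L", *values)

# Seven line pointers' worth of bytes get overwritten.
assert len(buf) == 28
```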
+
+# Stops the node, performs all the corruptions previously planned, and
+# starts the node again.
+#
+sub perform_all_corruptions()
+{
+	$node->stop();
+	for my $relpath (keys %corrupt_page)
+	{
+		corrupt_first_page($relpath);
+	}
+	for my $relpath (keys %remove_relation)
+	{
+		unlink($relpath);
+	}
+	$node->start;
+}
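The plan-then-perform split above exists because pg_relation_filepath() can only be queried while the server is running, whereas the files can only be safely mangled while it is stopped; batching every mutation around a single stop/start cycle also keeps the test fast.  A minimal sketch of the same pattern with hypothetical callables, not the test's actual Perl:

```python
# Paths are planned while the server is up, then mutated in one batch
# while it is down.
corrupt_page = set()
remove_relation = set()

def perform_all_corruptions(stop, start, corrupt_first_page, unlink):
    """Apply every planned corruption inside a single stop/start cycle."""
    stop()
    for path in sorted(corrupt_page):
        corrupt_first_page(path)
    for path in sorted(remove_relation):
        unlink(path)
    start()

# Example with recording stand-ins for the real operations:
calls = []
corrupt_page.add("base/1/16384")
remove_relation.add("base/1/16385")
perform_all_corruptions(
    lambda: calls.append("stop"),
    lambda: calls.append("start"),
    lambda p: calls.append(("corrupt", p)),
    lambda p: calls.append(("unlink", p)))
```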
+
+# Test set-up
+$node = get_new_node('test');
+$node->init;
+$node->start;
+$port = $node->port;
+
+for my $dbname (qw(db1 db2 db3))
+{
+	# Create the database
+	$node->safe_psql('postgres', qq(CREATE DATABASE $dbname));
+
+	# Load the amcheck extension, upon which pg_amcheck depends.  Put the
+	# extension in an unexpected location to test that pg_amcheck finds it
+	# correctly.  Create tables with names that look like pg_catalog names to
+	# check that pg_amcheck does not get confused by them.  Create functions in
+	# schema public that look like amcheck functions to check that pg_amcheck
+	# does not use them.
+	$node->safe_psql($dbname, q(
+		CREATE SCHEMA amcheck_schema;
+		CREATE EXTENSION amcheck WITH SCHEMA amcheck_schema;
+		CREATE TABLE amcheck_schema.pg_database (junk text);
+		CREATE TABLE amcheck_schema.pg_namespace (junk text);
+		CREATE TABLE amcheck_schema.pg_class (junk text);
+		CREATE TABLE amcheck_schema.pg_operator (junk text);
+		CREATE TABLE amcheck_schema.pg_proc (junk text);
+		CREATE TABLE amcheck_schema.pg_tablespace (junk text);
+
+		CREATE FUNCTION public.bt_index_check(index regclass,
+											  heapallindexed boolean default false)
+		RETURNS VOID AS $$
+		BEGIN
+			RAISE EXCEPTION 'Invoked wrong bt_index_check!';
+		END;
+		$$ LANGUAGE plpgsql;
+
+		CREATE FUNCTION public.bt_index_parent_check(index regclass,
+													 heapallindexed boolean default false,
+													 rootdescend boolean default false)
+		RETURNS VOID AS $$
+		BEGIN
+			RAISE EXCEPTION 'Invoked wrong bt_index_parent_check!';
+		END;
+		$$ LANGUAGE plpgsql;
+
+		CREATE FUNCTION public.verify_heapam(relation regclass,
+											 on_error_stop boolean default false,
+											 check_toast boolean default false,
+											 skip text default 'none',
+											 startblock bigint default null,
+											 endblock bigint default null,
+											 blkno OUT bigint,
+											 offnum OUT integer,
+											 attnum OUT integer,
+											 msg OUT text)
+		RETURNS SETOF record AS $$
+		BEGIN
+			RAISE EXCEPTION 'Invoked wrong verify_heapam!';
+		END;
+		$$ LANGUAGE plpgsql;
+	));
+
+	# Create schemas, tables and indexes in five separate
+	# schemas.  The schemas are all identical to start, but
+	# we will corrupt them differently later.
+	#
+	for my $schema (qw(s1 s2 s3 s4 s5))
+	{
+		$node->safe_psql($dbname, qq(
+			CREATE SCHEMA $schema;
+			CREATE SEQUENCE $schema.seq1;
+			CREATE SEQUENCE $schema.seq2;
+			CREATE TABLE $schema.t1 (
+				i INTEGER,
+				b BOX,
+				ia int4[],
+				ir int4range,
+				t TEXT
+			);
+			CREATE TABLE $schema.t2 (
+				i INTEGER,
+				b BOX,
+				ia int4[],
+				ir int4range,
+				t TEXT
+			);
+			CREATE VIEW $schema.t2_view AS (
+				SELECT i*2, t FROM $schema.t2
+			);
+			ALTER TABLE $schema.t2
+				ALTER COLUMN t
+				SET STORAGE EXTERNAL;
+
+			INSERT INTO $schema.t1 (i, b, ia, ir, t)
+				(SELECT gs::INTEGER AS i,
+						box(point(gs,gs+5),point(gs*2,gs*3)) AS b,
+						array[gs, gs + 1]::int4[] AS ia,
+						int4range(gs, gs+100) AS ir,
+						repeat('foo', gs) AS t
+					 FROM generate_series(1,10000,3000) AS gs);
+
+			INSERT INTO $schema.t2 (i, b, ia, ir, t)
+				(SELECT gs::INTEGER AS i,
+						box(point(gs,gs+5),point(gs*2,gs*3)) AS b,
+						array[gs, gs + 1]::int4[] AS ia,
+						int4range(gs, gs+100) AS ir,
+						repeat('foo', gs) AS t
+					 FROM generate_series(1,10000,3000) AS gs);
+
+			CREATE MATERIALIZED VIEW $schema.t1_mv AS SELECT * FROM $schema.t1;
+			CREATE MATERIALIZED VIEW $schema.t2_mv AS SELECT * FROM $schema.t2;
+
+			create table $schema.p1 (a int, b int) PARTITION BY list (a);
+			create table $schema.p2 (a int, b int) PARTITION BY list (a);
+
+			create table $schema.p1_1 partition of $schema.p1 for values in (1, 2, 3);
+			create table $schema.p1_2 partition of $schema.p1 for values in (4, 5, 6);
+			create table $schema.p2_1 partition of $schema.p2 for values in (1, 2, 3);
+			create table $schema.p2_2 partition of $schema.p2 for values in (4, 5, 6);
+
+			CREATE INDEX t1_btree ON $schema.t1 USING BTREE (i);
+			CREATE INDEX t2_btree ON $schema.t2 USING BTREE (i);
+
+			CREATE INDEX t1_hash ON $schema.t1 USING HASH (i);
+			CREATE INDEX t2_hash ON $schema.t2 USING HASH (i);
+
+			CREATE INDEX t1_brin ON $schema.t1 USING BRIN (i);
+			CREATE INDEX t2_brin ON $schema.t2 USING BRIN (i);
+
+			CREATE INDEX t1_gist ON $schema.t1 USING GIST (b);
+			CREATE INDEX t2_gist ON $schema.t2 USING GIST (b);
+
+			CREATE INDEX t1_gin ON $schema.t1 USING GIN (ia);
+			CREATE INDEX t2_gin ON $schema.t2 USING GIN (ia);
+
+			CREATE INDEX t1_spgist ON $schema.t1 USING SPGIST (ir);
+			CREATE INDEX t2_spgist ON $schema.t2 USING SPGIST (ir);
+		));
+	}
+}
+
+# Database 'db1' corruptions
+#
+
+# Corrupt indexes in schema "s1"
+plan_to_remove_relation_file('db1', 's1.t1_btree');
+plan_to_corrupt_first_page('db1', 's1.t2_btree');
+
+# Corrupt tables in schema "s2"
+plan_to_remove_relation_file('db1', 's2.t1');
+plan_to_corrupt_first_page('db1', 's2.t2');
+
+# Corrupt tables, partitions, matviews, and btrees in schema "s3"
+plan_to_remove_relation_file('db1', 's3.t1');
+plan_to_corrupt_first_page('db1', 's3.t2');
+
+plan_to_remove_relation_file('db1', 's3.t1_mv');
+plan_to_remove_relation_file('db1', 's3.p1_1');
+
+plan_to_corrupt_first_page('db1', 's3.t2_mv');
+plan_to_corrupt_first_page('db1', 's3.p2_1');
+
+plan_to_remove_relation_file('db1', 's3.t1_btree');
+plan_to_corrupt_first_page('db1', 's3.t2_btree');
+
+# Corrupt the toast table belonging to table t2 in schema "s4"
+plan_to_remove_toast_file('db1', 's4.t2');
+
+# Corrupt all other object types in schema "s5".  We don't have amcheck support
+# for these types, but we check that their corruption does not trigger any
+# errors in pg_amcheck
+plan_to_remove_relation_file('db1', 's5.seq1');
+plan_to_remove_relation_file('db1', 's5.t1_hash');
+plan_to_remove_relation_file('db1', 's5.t1_gist');
+plan_to_remove_relation_file('db1', 's5.t1_gin');
+plan_to_remove_relation_file('db1', 's5.t1_brin');
+plan_to_remove_relation_file('db1', 's5.t1_spgist');
+
+plan_to_corrupt_first_page('db1', 's5.seq2');
+plan_to_corrupt_first_page('db1', 's5.t2_hash');
+plan_to_corrupt_first_page('db1', 's5.t2_gist');
+plan_to_corrupt_first_page('db1', 's5.t2_gin');
+plan_to_corrupt_first_page('db1', 's5.t2_brin');
+plan_to_corrupt_first_page('db1', 's5.t2_spgist');
+
+
+# Database 'db2' corruptions
+#
+plan_to_remove_relation_file('db2', 's1.t1');
+plan_to_remove_relation_file('db2', 's1.t1_btree');
+
+
+# Leave 'db3' uncorrupted
+#
+
+# Perform the corruptions we planned above using only a single database restart.
+#
+perform_all_corruptions();
+
+
+# Standard first arguments to TestLib functions
+my @cmd = ('pg_amcheck', '--quiet', '-p', $port);
+
+# Regular expressions to match various expected output
+my $no_output_re = qr/^$/;
+my $line_pointer_corruption_re = qr/line pointer/;
+my $missing_file_re = qr/could not open file ".*": No such file or directory/;
+my $index_missing_relation_fork_re = qr/index ".*" lacks a main relation fork/;
+
+# When checking databases that have amcheck installed and contain corrupt
+# relations, pg_amcheck should exit with status 2, meaning that corruption was
+# found, not status 1, which would mean the pg_amcheck command itself failed.
+# Corruption messages should go to stdout, and nothing to stderr.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1' ],
+	2,
+	[ $index_missing_relation_fork_re,
+	  $line_pointer_corruption_re,
+	  $missing_file_re,
+	],
+	[ $no_output_re ],
+	'pg_amcheck all schemas, tables and indexes in database db1');
+
+$node->command_checks_all(
+	[ @cmd, 'db1', 'db2', 'db3' ],
+	2,
+	[ $index_missing_relation_fork_re,
+	  $line_pointer_corruption_re,
+	  $missing_file_re,
+	],
+	[ $no_output_re ],
+	'pg_amcheck all schemas, tables and indexes in databases db1, db2, and db3');
+
+# Scans of indexes in s1 should detect the specific corruption that we created
+# above.  For missing relation forks, we know what the error message looks
+# like.  For corrupted index pages, the error might vary depending on how the
+# page was formatted on disk, including variations due to alignment differences
+# between platforms, so we accept any non-empty error message.
+#
+# If we don't limit the check to databases with amcheck installed, we expect
+# a complaint on stderr, but otherwise stderr should be quiet.
+#
+$node->command_checks_all(
+	[ @cmd, '--all', '-s', 's1', '-i', 't1_btree' ],
+	2,
+	[ $index_missing_relation_fork_re ],
+	[ qr/pg_amcheck: skipping database "postgres": amcheck is not installed/ ],
+	'pg_amcheck index s1.t1_btree reports missing main relation fork');
+
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's1', '-i', 't2_btree' ],
+	2,
+	[ qr/.+/ ],			# Any non-empty error message is acceptable
+	[ $no_output_re ],
+	'pg_amcheck index s1.t2_btree reports index corruption');
+
+# Checking db1.s1 with indexes excluded should show no corruptions because we
+# did not corrupt any tables in db1.s1.  Verify that both stdout and stderr
+# are quiet.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-t', 's1.*', '--no-index-expansion' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck of db1.s1 excluding indexes');
+
+# Checking db2.s1 should show table corruptions if indexes are excluded
+#
+$node->command_checks_all(
+	[ @cmd, 'db2', '-t', 's1.*', '--no-index-expansion' ],
+	2,
+	[ $missing_file_re ],
+	[ $no_output_re ],
+	'pg_amcheck of db2.s1 excluding indexes');
+
+# In schema db1.s3, the tables and indexes are both corrupt.  We should see
+# corruption messages on stdout, and nothing on stderr.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's3' ],
+	2,
+	[ $index_missing_relation_fork_re,
+	  $line_pointer_corruption_re,
+	  $missing_file_re,
+	],
+	[ $no_output_re ],
+	'pg_amcheck schema s3 reports table and index errors');
+
+# In schema db1.s4, only toast tables are corrupt.  Check that under default
+# options the toast corruption is reported, but when excluding toast we get no
+# error reports.
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's4' ],
+	2,
+	[ $missing_file_re ],
+	[ $no_output_re ],
+	'pg_amcheck in schema s4 reports toast corruption');
+
+$node->command_checks_all(
+	[ @cmd, '--no-toast-expansion', '--exclude-toast-pointers', 'db1', '-s', 's4' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck in schema s4 excluding toast reports no corruption');
+
+# Check that no corruption is reported in schema db1.s5
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's5' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck over schema s5 reports no corruption');
+
+# In schema db1.s1, only indexes are corrupt.  Verify that when we exclude
+# the indexes, no corruption is reported about the schema.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's1', '-I', 't1_btree', '-I', 't2_btree' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck over schema s1 with corrupt indexes excluded reports no corruption');
+
+# In schema db1.s1, only indexes are corrupt.  Verify that when we provide only
+# table inclusions, and disable index expansion, no corruption is reported
+# about the schema.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-t', 's1.*', '--no-index-expansion' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck over schema s1 with all indexes excluded reports no corruption');
+
+# In schema db1.s2, only tables are corrupt.  Verify that when we exclude those
+# tables, no corruption is reported.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's2', '-T', 't1', '-T', 't2' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck over schema s2 with corrupt tables excluded reports no corruption');
+
+# Check errors about bad block range command line arguments.  We use schema s5
+# to avoid getting messages about corrupt tables or indexes.
+#
+command_fails_like(
+	[ @cmd, 'db1', '-s', 's5', '--startblock', 'junk' ],
+	qr/relation starting block argument contains garbage characters/,
+	'pg_amcheck rejects garbage startblock');
+
+command_fails_like(
+	[ @cmd, 'db1', '-s', 's5', '--endblock', '1234junk' ],
+	qr/relation ending block argument contains garbage characters/,
+	'pg_amcheck rejects garbage endblock');
+
+command_fails_like(
+	[ @cmd, 'db1', '-s', 's5', '--startblock', '5', '--endblock', '4' ],
+	qr/relation ending block argument precedes starting block argument/,
+	'pg_amcheck rejects invalid block range');
+
+# Check bt_index_parent_check alternates.  We don't create any index corruption
+# that would behave differently under these modes, so just smoke test that the
+# arguments are handled sensibly.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's1', '-i', 't1_btree', '--parent-check' ],
+	2,
+	[ $index_missing_relation_fork_re ],
+	[ $no_output_re ],
+	'pg_amcheck smoke test --parent-check');
+
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's1', '-i', 't1_btree', '--heapallindexed', '--rootdescend' ],
+	2,
+	[ $index_missing_relation_fork_re ],
+	[ $no_output_re ],
+	'pg_amcheck smoke test --heapallindexed --rootdescend');
diff --git a/contrib/pg_amcheck/t/004_verify_heapam.pl b/contrib/pg_amcheck/t/004_verify_heapam.pl
new file mode 100644
index 0000000000..d5537a5b37
--- /dev/null
+++ b/contrib/pg_amcheck/t/004_verify_heapam.pl
@@ -0,0 +1,487 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+
+use Test::More tests => 20;
+
+# This regression test demonstrates that the pg_amcheck binary supplied with
+# the pg_amcheck contrib module correctly identifies specific kinds of
+# corruption within pages.  To test this, we need a mechanism to create corrupt
+# pages with predictable, repeatable corruption.  The postgres backend cannot
+# be expected to help us with this, as its design is not consistent with the
+# goal of intentionally corrupting pages.
+#
+# Instead, we create a table to corrupt, and with careful consideration of how
+# PostgreSQL lays out heap pages, we seek to offsets within the page and
+# overwrite deliberately chosen bytes with specific values calculated to
+# corrupt the page in expected ways.  We then verify that pg_amcheck reports
+# the corruption, and that it runs without crashing.  Note that the backend
+# cannot simply be started to run queries against the corrupt table, as the
+# backend will crash, at least for some of the corruption types we generate.
+#
+# Autovacuum potentially touching the table in the background makes the exact
+# behavior of this test harder to reason about.  We turn it off to keep things
+# simpler.  We use a "belt and suspenders" approach, turning it off for the
+# system generally in postgresql.conf, and turning it off specifically for the
+# test table.
+#
+# This test depends on the table being written to the heap file exactly as we
+# expect it to be, so we take care to arrange the columns of the table, and to
+# insert rows into the table, in ways that give predictable sizes and locations
+# within the table page.
+#
+# The HeapTupleHeaderData has 23 bytes of fixed size fields before the variable
+# length t_bits[] array.  We have exactly 3 columns in the table, so natts = 3,
+# t_bits is 1 byte long, and t_hoff = MAXALIGN(23 + 1) = 24.
+#
+# We're not too fussy about which datatypes we use for the test, but we do care
+# about some specific properties.  We'd like to test both fixed size and
+# varlena types.  We'd like some varlena data inline and some toasted.  And
+# we'd like the layout of the table such that the datums land at predictable
+# offsets within the tuple.  We choose a structure without padding on all
+# supported architectures:
+#
+# 	a BIGINT
+#	b TEXT
+#	c TEXT
+#
+# We always insert a 7-ascii character string into field 'b', which with a
+# 1-byte varlena header gives an 8 byte inline value.  We always insert a long
+# text string in field 'c', long enough to force toast storage.
+#
+# We choose to read and write binary copies of our table's tuples, using perl's
+# pack() and unpack() functions.  Perl uses a packing code system in which:
+#
+#	L = "Unsigned 32-bit Long",
+#	S = "Unsigned 16-bit Short",
+#	C = "Unsigned 8-bit Octet",
+#	c = "signed 8-bit octet",
+#	q = "signed 64-bit quadword"
+#
+# Each tuple in our table has a layout as follows:
+#
+#    xx xx xx xx            t_xmin: xxxx		offset = 0		L
+#    xx xx xx xx            t_xmax: xxxx		offset = 4		L
+#    xx xx xx xx          t_field3: xxxx		offset = 8		L
+#    xx xx                   bi_hi: xx			offset = 12		S
+#    xx xx                   bi_lo: xx			offset = 14		S
+#    xx xx                ip_posid: xx			offset = 16		S
+#    xx xx             t_infomask2: xx			offset = 18		S
+#    xx xx              t_infomask: xx			offset = 20		S
+#    xx                     t_hoff: x			offset = 22		C
+#    xx                     t_bits: x			offset = 23		C
+#    xx xx xx xx xx xx xx xx   'a': xxxxxxxx	offset = 24		q
+#    xx xx xx xx xx xx xx xx   'b': xxxxxxxx	offset = 32		Cccccccc
+#    xx xx xx xx xx xx xx xx   'c': xxxxxxxx	offset = 40		SSSS
+#    xx xx xx xx xx xx xx xx      : xxxxxxxx	 ...continued	SSSS
+#    xx xx                        : xx      	 ...continued	S
+#
+# We could choose to read and write columns 'b' and 'c' in other ways, but
+# it is convenient enough to do it this way.  We define packing code
+# constants here, where they can be compared easily against the layout.
+
+use constant HEAPTUPLE_PACK_CODE => 'LLLSSSSSCCqCcccccccSSSSSSSSS';
+use constant HEAPTUPLE_PACK_LENGTH => 58;     # Total size
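As a sanity check on the layout table above, the same 58-byte format can be reproduced with Python's struct module (an illustrative aside, not part of the patch; struct's `<` prefix disables alignment padding, which matches perl's pack() behavior for these codes):

```python
import struct

# Equivalent of the perl pack code 'LLLSSSSSCCqCcccccccSSSSSSSSS':
# perl L -> I (u32), S -> H (u16), C -> B (u8), c -> b (i8), q -> q (i64).
HEAPTUPLE_FMT = '<IIIHHHHHBBqB7b9H'

assert struct.calcsize(HEAPTUPLE_FMT) == 58      # HEAPTUPLE_PACK_LENGTH

# Offsets of the user columns, computed from the preceding fields.
assert struct.calcsize('<IIIHHHHHBB') == 24      # t_hoff: column 'a' at 24
assert struct.calcsize('<IIIHHHHHBBq') == 32     # column 'b' at 32
assert struct.calcsize('<IIIHHHHHBBqB7b') == 40  # column 'c' at 40
```

The offsets 24, 32 and 40 agree with the 'a', 'b' and 'c' rows of the tuple layout comment.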
+
+# Read a tuple of our table from a heap page.
+#
+# Takes an open filehandle to the heap file, and the offset of the tuple.
+#
+# Rather than returning the binary data from the file, unpacks the data into a
+# perl hash with named fields.  These fields exactly match the ones understood
+# by write_tuple(), below.  Returns a reference to this hash.
+#
+sub read_tuple ($$)
+{
+	my ($fh, $offset) = @_;
+	my ($buffer, %tup);
+	seek($fh, $offset, 0);
+	sysread($fh, $buffer, HEAPTUPLE_PACK_LENGTH);
+
+	@_ = unpack(HEAPTUPLE_PACK_CODE, $buffer);
+	%tup = (t_xmin => shift,
+			t_xmax => shift,
+			t_field3 => shift,
+			bi_hi => shift,
+			bi_lo => shift,
+			ip_posid => shift,
+			t_infomask2 => shift,
+			t_infomask => shift,
+			t_hoff => shift,
+			t_bits => shift,
+			a => shift,
+			b_header => shift,
+			b_body1 => shift,
+			b_body2 => shift,
+			b_body3 => shift,
+			b_body4 => shift,
+			b_body5 => shift,
+			b_body6 => shift,
+			b_body7 => shift,
+			c1 => shift,
+			c2 => shift,
+			c3 => shift,
+			c4 => shift,
+			c5 => shift,
+			c6 => shift,
+			c7 => shift,
+			c8 => shift,
+			c9 => shift);
+	# Stitch together the text for column 'b'
+	$tup{b} = join('', map { chr($tup{"b_body$_"}) } (1..7));
+	return \%tup;
+}
+
+# Write a tuple of our table to a heap page.
+#
+# Takes an open filehandle to the heap file, the offset of the tuple, and a
+# reference to a hash with the tuple values, as returned by read_tuple().
+# Writes the tuple fields from the hash into the heap file.
+#
+# The purpose of this function is to write a tuple back to disk with some
+# subset of fields modified.  The function does no error checking.  Use
+# cautiously.
+#
+sub write_tuple($$$)
+{
+	my ($fh, $offset, $tup) = @_;
+	my $buffer = pack(HEAPTUPLE_PACK_CODE,
+					$tup->{t_xmin},
+					$tup->{t_xmax},
+					$tup->{t_field3},
+					$tup->{bi_hi},
+					$tup->{bi_lo},
+					$tup->{ip_posid},
+					$tup->{t_infomask2},
+					$tup->{t_infomask},
+					$tup->{t_hoff},
+					$tup->{t_bits},
+					$tup->{a},
+					$tup->{b_header},
+					$tup->{b_body1},
+					$tup->{b_body2},
+					$tup->{b_body3},
+					$tup->{b_body4},
+					$tup->{b_body5},
+					$tup->{b_body6},
+					$tup->{b_body7},
+					$tup->{c1},
+					$tup->{c2},
+					$tup->{c3},
+					$tup->{c4},
+					$tup->{c5},
+					$tup->{c6},
+					$tup->{c7},
+					$tup->{c8},
+					$tup->{c9});
+	seek($fh, $offset, 0);
+	syswrite($fh, $buffer, HEAPTUPLE_PACK_LENGTH);
+	return;
+}
+
+# Set umask so test directories and files are created with default permissions
+umask(0077);
+
+# Set up the node.  Once we create and corrupt the table,
+# autovacuum workers visiting the table could crash the backend.
+# Disable autovacuum so that won't happen.
+my $node = get_new_node('test');
+$node->init;
+$node->append_conf('postgresql.conf', 'autovacuum=off');
+
+# Start the node and load the extensions.  We depend on both
+# amcheck and pageinspect for this test.
+$node->start;
+my $port = $node->port;
+my $pgdata = $node->data_dir;
+$node->safe_psql('postgres', "CREATE EXTENSION amcheck");
+$node->safe_psql('postgres', "CREATE EXTENSION pageinspect");
+
+# Get a non-zero datfrozenxid
+$node->safe_psql('postgres', qq(VACUUM FREEZE));
+
+# Create the test table with precisely the schema that our corruption function
+# expects.
+$node->safe_psql(
+	'postgres', qq(
+		CREATE TABLE public.test (a BIGINT, b TEXT, c TEXT);
+		ALTER TABLE public.test SET (autovacuum_enabled=false);
+		ALTER TABLE public.test ALTER COLUMN c SET STORAGE EXTERNAL;
+		CREATE INDEX test_idx ON public.test(a, b);
+	));
+
+# We want (0 < datfrozenxid < test.relfrozenxid).  To achieve this, we freeze
+# an otherwise unused table, public.junk, prior to inserting data and freezing
+# public.test
+$node->safe_psql(
+	'postgres', qq(
+		CREATE TABLE public.junk AS SELECT 'junk'::TEXT AS junk_column;
+		ALTER TABLE public.junk SET (autovacuum_enabled=false);
+		VACUUM FREEZE public.junk
+	));
+
+my $rel = $node->safe_psql('postgres', qq(SELECT pg_relation_filepath('public.test')));
+my $relpath = "$pgdata/$rel";
+
+# Insert data and freeze public.test
+use constant ROWCOUNT => 16;
+$node->safe_psql('postgres', qq(
+	INSERT INTO public.test (a, b, c)
+		VALUES (
+			12345678,
+			'abcdefg',
+			repeat('w', 10000)
+		);
+	VACUUM FREEZE public.test
+	)) for (1..ROWCOUNT);
+
+my $relfrozenxid = $node->safe_psql('postgres',
+	q(select relfrozenxid from pg_class where relname = 'test'));
+my $datfrozenxid = $node->safe_psql('postgres',
+	q(select datfrozenxid from pg_database where datname = 'postgres'));
+
+# Find where each of the tuples is located on the page.
+my @lp_off;
+for my $tup (0..ROWCOUNT-1)
+{
+	push (@lp_off, $node->safe_psql('postgres', qq(
+select lp_off from heap_page_items(get_raw_page('test', 'main', 0))
+	offset $tup limit 1)));
+}
+
+# Check that pg_amcheck runs against the uncorrupted table without error.
+$node->command_ok(['pg_amcheck', '-p', $port, 'postgres'],
+				  'pg_amcheck test table, prior to corruption');
+
+# Check that pg_amcheck runs against the uncorrupted table and index without error.
+$node->command_ok(['pg_amcheck', '-p', $port, 'postgres'],
+				  'pg_amcheck test table and index, prior to corruption');
+
+$node->stop;
+
+# Sanity check that our 'test' table has a relfrozenxid newer than the
+# datfrozenxid for the database, and that the datfrozenxid is greater than the
+# first normal xid.  We rely on these invariants in some of our tests.
+if ($datfrozenxid <= 3 || $datfrozenxid >= $relfrozenxid)
+{
+	fail('Xid thresholds not as expected');
+	$node->clean_node;
+	exit;
+}
+
+# Some #define constants from access/htup_details.h for use while corrupting.
+use constant HEAP_HASNULL            => 0x0001;
+use constant HEAP_XMAX_LOCK_ONLY     => 0x0080;
+use constant HEAP_XMIN_COMMITTED     => 0x0100;
+use constant HEAP_XMIN_INVALID       => 0x0200;
+use constant HEAP_XMAX_COMMITTED     => 0x0400;
+use constant HEAP_XMAX_INVALID       => 0x0800;
+use constant HEAP_NATTS_MASK         => 0x07FF;
+use constant HEAP_XMAX_IS_MULTI      => 0x1000;
+use constant HEAP_KEYS_UPDATED       => 0x2000;
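The corruption cases below toggle these flag bits with bitwise AND/OR. A small Python sketch of that arithmetic (illustrative only; the starting infomask value is hypothetical):

```python
HEAP_XMIN_COMMITTED = 0x0100
HEAP_XMIN_INVALID   = 0x0200
HEAP_XMAX_IS_MULTI  = 0x1000
HEAP_NATTS_MASK     = 0x07FF

# A hypothetical t_infomask with both xmin hint bits set.
t_infomask = HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID

# Clearing the hint bits (as the loop does before corrupting t_xmin)
# forces verify_heapam to validate xmin rather than trust the hints.
t_infomask &= ~HEAP_XMIN_COMMITTED
t_infomask &= ~HEAP_XMIN_INVALID
assert t_infomask == 0

# Setting a bit, as done for the multixact corruption cases:
t_infomask |= HEAP_XMAX_IS_MULTI
assert t_infomask & HEAP_XMAX_IS_MULTI

# The natts trick in the offnum-11 case: OR bit 0x40 into t_infomask2,
# turning natts = 3 into 3 | 64 = 67, hence "67 attributes" in the report.
t_infomask2 = 3 | (HEAP_NATTS_MASK & 0x40)
assert (t_infomask2 & HEAP_NATTS_MASK) == 67
```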
+
+# Helper function to generate a regular expression matching the header we
+# expect verify_heapam() to return given which fields we expect to be non-null.
+sub header
+{
+	my ($blkno, $offnum, $attnum) = @_;
+	return qr/relation postgres\.public\.test, block $blkno, offset $offnum, attribute $attnum\s+/ms
+		if (defined $attnum);
+	return qr/relation postgres\.public\.test, block $blkno, offset $offnum\s+/ms
+		if (defined $offnum);
+	return qr/relation postgres\.public\.test\s+/ms
+		if (defined $blkno);
+	return qr/relation postgres\.public\.test\s+/ms;
+}
+
+# Corrupt the tuples, one type of corruption per tuple.  Some types of
+# corruption cause verify_heapam to skip to the next tuple without
+# performing any remaining checks, so we can't exercise the system properly if
+# we focus all our corruption on a single tuple.
+#
+my @expected;
+my $file;
+open($file, '+<', $relpath)
+	or BAIL_OUT("open failed: $!");
+binmode $file;
+
+for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
+{
+	my $offnum = $tupidx + 1;  # offnum is 1-based, not zero-based
+	my $offset = $lp_off[$tupidx];
+	my $tup = read_tuple($file, $offset);
+
+	# Sanity-check that the data appears on the page where we expect.
+	if ($tup->{a} ne '12345678' || $tup->{b} ne 'abcdefg')
+	{
+		fail('Page layout differs from our expectations');
+		$node->clean_node;
+		exit;
+	}
+
+	my $header = header(0, $offnum, undef);
+	if ($offnum == 1)
+	{
+		# Corruptly set xmin < relfrozenxid
+		my $xmin = $relfrozenxid - 1;
+		$tup->{t_xmin} = $xmin;
+		$tup->{t_infomask} &= ~HEAP_XMIN_COMMITTED;
+		$tup->{t_infomask} &= ~HEAP_XMIN_INVALID;
+
+		# Expected corruption report
+		push @expected,
+			qr/${header}xmin $xmin precedes relation freeze threshold 0:\d+/;
+	}
+	elsif ($offnum == 2)
+	{
+		# Corruptly set xmin < datfrozenxid
+		my $xmin = 3;
+		$tup->{t_xmin} = $xmin;
+		$tup->{t_infomask} &= ~HEAP_XMIN_COMMITTED;
+		$tup->{t_infomask} &= ~HEAP_XMIN_INVALID;
+
+		push @expected,
+			qr/${header}xmin $xmin precedes oldest valid transaction ID 0:\d+/;
+	}
+	elsif ($offnum == 3)
+	{
+		# Corruptly set xmin < datfrozenxid, further back, noting circularity
+		# of xid comparison.  For a new cluster with epoch = 0, the corrupt
+		# xmin will be interpreted as in the future
+		$tup->{t_xmin} = 4026531839;
+		$tup->{t_infomask} &= ~HEAP_XMIN_COMMITTED;
+		$tup->{t_infomask} &= ~HEAP_XMIN_INVALID;
+
+		push @expected,
+			qr/${header}xmin 4026531839 equals or exceeds next valid transaction ID 0:\d+/;
+	}
+	elsif ($offnum == 4)
+	{
+		# Corruptly set xmax to a transaction ID beyond the next valid xid
+		$tup->{t_xmax} = 4026531839;
+		$tup->{t_infomask} &= ~HEAP_XMAX_INVALID;
+
+		push @expected,
+			qr/${header}xmax 4026531839 equals or exceeds next valid transaction ID 0:\d+/;
+	}
+	elsif ($offnum == 5)
+	{
+		# Corrupt the tuple t_hoff, but keep it aligned properly
+		$tup->{t_hoff} += 128;
+
+		push @expected,
+			qr/${header}data begins at offset 152 beyond the tuple length 58/,
+			qr/${header}tuple data should begin at byte 24, but actually begins at byte 152 \(3 attributes, no nulls\)/;
+	}
+	elsif ($offnum == 6)
+	{
+		# Corrupt the tuple t_hoff, wrong alignment
+		$tup->{t_hoff} += 3;
+
+		push @expected,
+			qr/${header}tuple data should begin at byte 24, but actually begins at byte 27 \(3 attributes, no nulls\)/;
+	}
+	elsif ($offnum == 7)
+	{
+		# Corrupt the tuple t_hoff, underflow but correct alignment
+		$tup->{t_hoff} -= 8;
+
+		push @expected,
+			qr/${header}tuple data should begin at byte 24, but actually begins at byte 16 \(3 attributes, no nulls\)/;
+	}
+	elsif ($offnum == 8)
+	{
+		# Corrupt the tuple t_hoff, underflow and wrong alignment
+		$tup->{t_hoff} -= 3;
+
+		push @expected,
+			qr/${header}tuple data should begin at byte 24, but actually begins at byte 21 \(3 attributes, no nulls\)/;
+	}
+	elsif ($offnum == 9)
+	{
+		# Corrupt the tuple to look like it has lots of attributes, not just 3
+		$tup->{t_infomask2} |= HEAP_NATTS_MASK;
+
+		push @expected,
+			qr/${header}number of attributes 2047 exceeds maximum expected for table 3/;
+	}
+	elsif ($offnum == 10)
+	{
+		# Corrupt the tuple to look like it has lots of attributes, some of
+		# them null.  This falsely creates the impression that the t_bits
+		# array is longer than just one byte, but t_hoff still says otherwise.
+		$tup->{t_infomask} |= HEAP_HASNULL;
+		$tup->{t_infomask2} |= HEAP_NATTS_MASK;
+		$tup->{t_bits} = 0xAA;
+
+		push @expected,
+			qr/${header}tuple data should begin at byte 280, but actually begins at byte 24 \(2047 attributes, has nulls\)/;
+	}
+	elsif ($offnum == 11)
+	{
+		# Same as above, but this time t_hoff plays along
+		$tup->{t_infomask} |= HEAP_HASNULL;
+		$tup->{t_infomask2} |= (HEAP_NATTS_MASK & 0x40);
+		$tup->{t_bits} = 0xAA;
+		$tup->{t_hoff} = 32;
+
+		push @expected,
+			qr/${header}number of attributes 67 exceeds maximum expected for table 3/;
+	}
+	elsif ($offnum == 12)
+	{
+		# Corrupt the bits in column 'b' 1-byte varlena header
+		$tup->{b_header} = 0x80;
+
+		$header = header(0, $offnum, 1);
+		push @expected,
+			qr/${header}attribute 1 with length 4294967295 ends at offset 416848000 beyond total tuple length 58/;
+	}
+	elsif ($offnum == 13)
+	{
+		# Corrupt the bits in column 'c' toast pointer
+		$tup->{c6} = 41;
+		$tup->{c7} = 41;
+
+		$header = header(0, $offnum, 2);
+		push @expected,
+			qr/${header}final toast chunk number 0 differs from expected value 6/,
+			qr/${header}toasted value for attribute 2 missing from toast table/;
+	}
+	elsif ($offnum == 14)
+	{
+		# Set both HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI
+		$tup->{t_infomask} |= HEAP_XMAX_COMMITTED;
+		$tup->{t_infomask} |= HEAP_XMAX_IS_MULTI;
+		$tup->{t_xmax} = 4;
+
+		push @expected,
+			qr/${header}multitransaction ID 4 equals or exceeds next valid multitransaction ID 1/;
+	}
+	elsif ($offnum == 15)	# Last corrupted offnum; must not exceed ROWCOUNT
+	{
+		# Set both HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI
+		$tup->{t_infomask} |= HEAP_XMAX_COMMITTED;
+		$tup->{t_infomask} |= HEAP_XMAX_IS_MULTI;
+		$tup->{t_xmax} = 4000000000;
+
+		push @expected,
+			qr/${header}multitransaction ID 4000000000 precedes relation minimum multitransaction ID threshold 1/;
+	}
+	write_tuple($file, $offset, $tup);
+}
+close($file);
+$node->start;
+
+# Run pg_amcheck against the corrupt table with epoch=0, comparing actual
+# corruption messages against the expected messages
+$node->command_checks_all(
+	['pg_amcheck', '--no-index-expansion', '-p', $port, 'postgres'],
+	2,
+	[ @expected ],
+	[ ],
+	'Expected corruption message output');
+
+$node->teardown_node;
+$node->clean_node;
diff --git a/contrib/pg_amcheck/t/005_opclass_damage.pl b/contrib/pg_amcheck/t/005_opclass_damage.pl
new file mode 100644
index 0000000000..eba8ea9cae
--- /dev/null
+++ b/contrib/pg_amcheck/t/005_opclass_damage.pl
@@ -0,0 +1,54 @@
+# This regression test checks the behavior of the btree validation in the
+# presence of breaking sort order changes.
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 5;
+
+my $node = get_new_node('test');
+$node->init;
+$node->start;
+
+# Create a custom operator class and an index which uses it.
+$node->safe_psql('postgres', q(
+	CREATE EXTENSION amcheck;
+
+	CREATE FUNCTION int4_asc_cmp (a int4, b int4) RETURNS int LANGUAGE sql AS $$
+		SELECT CASE WHEN $1 = $2 THEN 0 WHEN $1 > $2 THEN 1 ELSE -1 END; $$;
+
+	CREATE OPERATOR CLASS int4_fickle_ops FOR TYPE int4 USING btree AS
+	    OPERATOR 1 < (int4, int4), OPERATOR 2 <= (int4, int4),
+	    OPERATOR 3 = (int4, int4), OPERATOR 4 >= (int4, int4),
+	    OPERATOR 5 > (int4, int4), FUNCTION 1 int4_asc_cmp(int4, int4);
+
+	CREATE TABLE int4tbl (i int4);
+	INSERT INTO int4tbl (SELECT * FROM generate_series(1,1000) gs);
+	CREATE INDEX fickleidx ON int4tbl USING btree (i int4_fickle_ops);
+));
+
+# We have not yet broken the index, so we should get no corruption
+$node->command_like(
+	[ 'pg_amcheck', '--quiet', '-p', $node->port, 'postgres' ],
+	qr/^$/,
+	'pg_amcheck all schemas, tables and indexes reports no corruption');
+
+# Change the operator class to use a function which sorts in a different
+# order to corrupt the btree index
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION int4_desc_cmp (int4, int4) RETURNS int LANGUAGE sql AS $$
+		SELECT CASE WHEN $1 = $2 THEN 0 WHEN $1 > $2 THEN -1 ELSE 1 END; $$;
+	UPDATE pg_catalog.pg_amproc
+		SET amproc = 'int4_desc_cmp'::regproc
+		WHERE amproc = 'int4_asc_cmp'::regproc
+));
+
+# Index corruption should now be reported
+$node->command_checks_all(
+	[ 'pg_amcheck', '-p', $node->port, 'postgres' ],
+	2,
+	[ qr/item order invariant violated for index "fickleidx"/ ],
+	[ ],
+	'pg_amcheck all schemas, tables and indexes reports fickleidx corruption'
+);
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index d3ca4b6932..7e101f7c11 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -185,6 +185,7 @@ pages.
   </para>
 
  &oid2name;
+ &pgamcheck;
  &vacuumlo;
  </sect1>
 
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index db1d369743..5115cb03d0 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -133,6 +133,7 @@
 <!ENTITY oldsnapshot     SYSTEM "oldsnapshot.sgml">
 <!ENTITY pageinspect     SYSTEM "pageinspect.sgml">
 <!ENTITY passwordcheck   SYSTEM "passwordcheck.sgml">
+<!ENTITY pgamcheck       SYSTEM "pgamcheck.sgml">
 <!ENTITY pgbuffercache   SYSTEM "pgbuffercache.sgml">
 <!ENTITY pgcrypto        SYSTEM "pgcrypto.sgml">
 <!ENTITY pgfreespacemap  SYSTEM "pgfreespacemap.sgml">
diff --git a/doc/src/sgml/pgamcheck.sgml b/doc/src/sgml/pgamcheck.sgml
new file mode 100644
index 0000000000..9bee92c30a
--- /dev/null
+++ b/doc/src/sgml/pgamcheck.sgml
@@ -0,0 +1,668 @@
+<!-- doc/src/sgml/pgamcheck.sgml -->
+
+<refentry id="pgamcheck">
+ <indexterm zone="pgamcheck">
+  <primary>pg_amcheck</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle><application>pg_amcheck</application></refentrytitle>
+  <manvolnum>1</manvolnum>
+  <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>pg_amcheck</refname>
+  <refpurpose>checks for corruption in one or more
+  <productname>PostgreSQL</productname> databases</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+  <cmdsynopsis>
+   <command>pg_amcheck</command>
+   <arg rep="repeat"><replaceable>option</replaceable></arg>
+   <arg rep="repeat"><replaceable>dbname</replaceable></arg>
+  </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+   <application>pg_amcheck</application> supports running
+   <xref linkend="amcheck"/>'s corruption checking functions against one or
+   more databases, with options to select which schemas, tables and indexes to
+   check, which kinds of checking to perform, and whether to perform the checks
+   in parallel, and if so, the number of parallel connections to establish and
+   use.
+  </para>
+
+  <para>
+   Only table relations and btree indexes are currently supported.  Other
+   relation types are silently skipped.
+  </para>
+
+ </refsect1>
+
+ <refsect1>
+  <title>Options</title>
+
+  <para>
+   <application>pg_amcheck</application> accepts the following command-line arguments:
+
+   <variablelist>
+    <varlistentry>
+     <term><option>--all</option></term>
+       <listitem>
+      <para>
+       Perform checking in all databases.
+      </para>
+      <para>
+       In the absence of any other options, selects all objects across all
+       schemas and databases.
+      </para>
+      <para>
+       Option <option>-D</option> <option>--exclude-dbname</option> takes
+       precedence over <option>--all</option>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-d</option></term>
+     <term><option>--dbname</option></term>
+     <listitem>
+      <para>
+       Perform checking in the specified database.
+      </para>
+      <para>
+       This option may be specified multiple times to list more than one
+       database (or database pattern) for checking.  By default, all objects
+       in the matching database(s) will be checked.
+      </para>
+      <para>
+       If no <option>--maintenance-db</option> argument is given and no
+       database name is given as a command line argument, the first argument
+       specified with <option>-d</option> <option>--dbname</option> will be
+       used for the initial connection.  If that argument is not a literal
+       database name, the attempt to connect will fail.
+      </para>
+      <para>
+       If <option>--all</option> is also specified, <option>-d</option>
+       <option>--dbname</option> does not affect which databases are checked,
+       but may be used to specify the database for the initial connection.
+      </para>
+      <para>
+       Option <option>-D</option> <option>--exclude-dbname</option> takes
+       precedence over <option>-d</option> <option>--dbname</option>.
+      </para>
+      <para>
+       Examples:
+       <simplelist>
+        <member><literal>--dbname=africa</literal></member>
+        <member><literal>--dbname="a*"</literal></member>
+        <member><literal>--dbname="africa|asia|europe"</literal></member>
+       </simplelist>
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-D</option></term>
+     <term><option>--exclude-dbname</option></term>
+     <listitem>
+      <para>
+       Do not perform checking in the specified database.
+      </para>
+      <para>
+       This option may be specified multiple times to list more than one
+       database (or database pattern) for exclusion.
+      </para>
+      <para>
+       If a database which is included using <option>--all</option> or
+       <option>-d</option> <option>--dbname</option> is also excluded using
+       <option>-D</option> <option>--exclude-dbname</option>, the database will
+       be excluded.
+      </para>
+      <para>
+       Examples:
+       <simplelist>
+        <member><literal>--exclude-dbname=america</literal></member>
+        <member><literal>--exclude-dbname="*pacific*"</literal></member>
+       </simplelist>
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-e</option></term>
+     <term><option>--echo</option></term>
+     <listitem>
+      <para>
+       Print to stdout all commands and queries being executed against the
+       server.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--endblock=BLOCK</option></term>
+     <listitem>
+      <para>
+       Skip (do not check) all pages after the given ending block.
+      </para>
+      <para>
+       By default, no pages are skipped.  This option will be applied to all
+       table relations that are checked, including toast tables, but note that
+       unless <option>--exclude-toast-pointers</option> is given, toast
+       pointers found in the main table will be followed into the toast table
+       without regard for the location in the toast table.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--exclude-toast-pointers</option></term>
+     <listitem>
+      <para>
+       When checking main relations, do not look up entries in toast tables
+       corresponding to toast pointers in the main relation.
+      </para>
+      <para>
+       The default behavior checks each toast pointer encountered in the main
+       table to verify, as much as possible, that the pointer points at
+       something in the toast table that is reasonable.  Toast pointers which
+       point beyond the end of the toast table, or to the middle (rather than
+       the beginning) of a toast entry, are identified as corrupt.
+      </para>
+      <para>
+       The process by which <xref linkend="amcheck"/>'s
+       <function>verify_heapam</function> function checks each toast pointer is
+       slow and may be improved in a future release.  Some users may wish to
+       disable this check to save time.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--heapallindexed</option></term>
+     <listitem>
+      <para>
+       For each index checked, verify the presence of all heap tuples as index
+       tuples in the index using <xref linkend="amcheck"/>'s
+       <option>heapallindexed</option> option.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-?</option></term>
+     <term><option>--help</option></term>
+     <listitem>
+      <para>
+       Show help about <application>pg_amcheck</application> command line
+       arguments, and exit.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-h</option></term>
+     <term><option>--host=HOSTNAME</option></term>
+     <listitem>
+      <para>
+       Specifies the host name of the machine on which the server is running.
+       If the value begins with a slash, it is used as the directory for the
+       Unix domain socket.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-i</option></term>
+     <term><option>--index</option></term>
+     <listitem>
+      <para>
+       Perform checks on the specified index(es).  This is an alias for the
+       <option>-r</option> <option>--relation</option> option, except that it
+       applies only to indexes, not tables.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-I</option></term>
+     <term><option>--exclude-index</option></term>
+     <listitem>
+      <para>
+       Exclude checks on the specified index(es).  This is an alias for the
+       <option>-R</option> <option>--exclude-relation</option> option, except
+       that it applies only to indexes, not tables.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-j</option></term>
+     <term><option>--jobs=NUM</option></term>
+     <listitem>
+      <para>
+       Use the specified number of concurrent connections to the server, or
+       one per object to be checked, whichever number is smaller.
+      </para>
+      <para>
+       The default is to use a single connection.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--maintenance-db=DBNAME</option></term>
+     <listitem>
+      <para>
+       Specifies the name of the database to connect to when querying the
+       list of all databases.  If not specified, the
+       <literal>postgres</literal> database will be used; if that does not
+       exist <literal>template1</literal> will be used.  This can be a
+       <link linkend="libpq-connstring">connection string</link>.  If so,
+       connection string parameters will override any conflicting command
+       line options.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--no-index-expansion</option></term>
+     <listitem>
+      <para>
+       When including a table relation in the list of relations to check, do
+       not automatically include btree indexes associated with the table.
+      </para>
+      <para>
+       By default, all tables to be checked will also have checks performed on
+       their associated btree indexes, if any.  If this option is given, only
+       those indexes which match a <option>--relation</option> or
+       <option>--index</option> pattern will be checked.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--no-strict-names</option></term>
+     <listitem>
+      <para>
+       When calculating the list of databases to check, and the objects within
+       those databases to be checked, do not raise an error for database,
+       schema, relation, table, or index inclusion patterns which match no
+       corresponding objects.
+      </para>
+      <para>
+       Exclusion patterns are never required to match any objects, but by
+       default an unmatched inclusion pattern raises an error.  This includes
+       the case where a pattern fails to match an existing object only
+       because an exclusion pattern excluded it, and the case where a pattern
+       fails to match a database because the database is unconnectable
+       (datallowconn is false).
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--no-toast-expansion</option></term>
+     <listitem>
+      <para>
+       When including a table relation in the list of relations to check, do
+       not automatically include toast tables associated with the table.
+      </para>
+      <para>
+       By default, all tables to be checked will also have checks performed on
+       their associated toast tables, if any.  If this option is given, only
+       those toast tables which match a <option>--relation</option> or
+       <option>--table</option> pattern will be checked.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--on-error-stop</option></term>
+     <listitem>
+      <para>
+       After reporting all corruptions on the first page of a table where
+       corruptions are found, stop processing that table relation and move on
+       to the next table or index.
+      </para>
+      <para>
+       Note that index checking always stops after the first corrupt page.
+       This option therefore affects only table relations.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--parent-check</option></term>
+     <listitem>
+      <para>
+       For each btree index checked, use <xref linkend="amcheck"/>'s
+       <function>bt_index_parent_check</function> function, which performs
+       additional checks of parent/child relationships during index checking.
+      </para>
+      <para>
+       The default is to use <application>amcheck</application>'s
+       <function>bt_index_check</function> function, but note that use of the
+       <option>--rootdescend</option> option implicitly selects
+       <function>bt_index_parent_check</function>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-p</option></term>
+     <term><option>--port=PORT</option></term>
+     <listitem>
+      <para>
+       Specifies the TCP port or local Unix domain socket file extension on
+       which the server is listening for connections.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--progress</option></term>
+     <listitem>
+      <para>
+       Show progress information about how many relations have been checked.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-q</option></term>
+     <term><option>--quiet</option></term>
+     <listitem>
+      <para>
+       Do not write additional messages beyond those about corruption.
+      </para>
+      <para>
+       This option does not quiet any output specifically due to the use of
+       the <option>-e</option> <option>--echo</option> option.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-r</option></term>
+     <term><option>--relation</option></term>
+     <listitem>
+      <para>
+       Perform checking on the specified relation(s).
+      </para>
+      <para>
+       This option may be specified multiple times to list more than one
+       relation (or relation pattern) for checking.
+      </para>
+      <para>
+       Option <option>-R</option> <option>--exclude-relation</option> takes
+       precedence over <option>-r</option> <option>--relation</option>.
+      </para>
+      <para>
+       Examples:
+       <simplelist>
+        <member><literal>--relation=accounts_table</literal></member>
+        <member><literal>--relation=accounting_department.accounts_table</literal></member>
+        <member><literal>--relation=corporate_database.accounting_department.*_table</literal></member>
+       </simplelist>
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-R</option></term>
+     <term><option>--exclude-relation</option></term>
+     <listitem>
+      <para>
+       Exclude checks on the specified relation(s).
+      </para>
+      <para>
+       Option <option>-R</option> <option>--exclude-relation</option> takes
+       precedence over <option>-r</option> <option>--relation</option>,
+       <option>-t</option> <option>--table</option> and <option>-i</option>
+       <option>--index</option>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--rootdescend</option></term>
+     <listitem>
+      <para>
+       For each index checked, re-find tuples on the leaf level by performing a
+       new search from the root page for each tuple using
+       <xref linkend="amcheck"/>'s <option>rootdescend</option> option.
+      </para>
+      <para>
+       Use of this option implicitly also selects the
+       <option>--parent-check</option> option.
+      </para>
+      <para>
+       This form of verification was originally written to help in the
+       development of btree index features.  It may be of limited use or even
+       of no use in helping detect the kinds of corruption that occur in
+       practice.  It may also cause corruption checking to take considerably
+       longer and consume considerably more resources on the server.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-s</option></term>
+     <term><option>--schema</option></term>
+     <listitem>
+      <para>
+       Perform checking in the specified schema(s).
+      </para>
+      <para>
+       This option may be specified multiple times to list more than one
+       schema (or schema pattern) for checking.  By default, all objects in
+       the matching schema(s) will be checked.
+      </para>
+      <para>
+       Option <option>-S</option> <option>--exclude-schema</option> takes
+       precedence over <option>-s</option> <option>--schema</option>.
+      </para>
+      <para>
+       Examples:
+       <simplelist>
+        <member><literal>--schema=corp</literal></member>
+        <member><literal>--schema="corp|llc|npo"</literal></member>
+       </simplelist>
+      </para>
+      <para>
+       Note that both tables and indexes are included using this option, which
+       might not be what you want if you are also using
+       <option>--no-index-expansion</option>.  To specify all tables in a schema
+       without also specifying all indexes, <option>--table</option> can be
+       used with a pattern that specifies the schema.  For example, to check
+       all tables in schema <literal>corp</literal>, the option
+       <literal>--table="corp.*"</literal> may be used.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-S</option></term>
+     <term><option>--exclude-schema</option></term>
+     <listitem>
+      <para>
+       Do not perform checking in the specified schema(s).
+      </para>
+      <para>
+       This option may be specified multiple times to list more than one
+       schema (or schema pattern) for exclusion.
+      </para>
+      <para>
+       If a schema which is included using
+       <option>-s</option> <option>--schema</option> is also excluded using
+       <option>-S</option> <option>--exclude-schema</option>, the schema will
+       be excluded.
+      </para>
+      <para>
+       Examples:
+       <simplelist>
+        <member><literal>-S corp -S llc</literal></member>
+        <member><literal>--exclude-schema="*c*"</literal></member>
+       </simplelist>
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--skip=OPTION</option></term>
+     <listitem>
+      <para>
+       If <literal>all-frozen</literal> is given, table corruption checks
+       will skip over pages in all tables that are marked as all frozen.
+      </para>
+      <para>
+       If <literal>all-visible</literal> is given, table corruption checks
+       will skip over pages in all tables that are marked as all visible.
+      </para>
+      <para>
+       By default, no pages are skipped.  This can be specified explicitly as
+       <literal>none</literal>, but since this is the default, it need not be
+       mentioned.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--startblock=BLOCK</option></term>
+     <listitem>
+      <para>
+       Skip (do not check) pages prior to the given starting block.
+      </para>
+      <para>
+       By default, no pages are skipped.  This option will be applied to all
+       table relations that are checked, including toast tables, but note
+       that unless <option>--exclude-toast-pointers</option> is given, toast
+       pointers found in the main table will be followed into the toast table
+       without regard for the location in the toast table.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-t</option></term>
+     <term><option>--table</option></term>
+     <listitem>
+      <para>
+       Perform checks on the specified table(s).  This is an alias for the
+       <option>-r</option> <option>--relation</option> option, except that it
+       applies only to tables, not indexes.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-T</option></term>
+     <term><option>--exclude-table</option></term>
+     <listitem>
+      <para>
+       Exclude checks on the specified table(s).  This is an alias for the
+       <option>-R</option> <option>--exclude-relation</option> option, except
+       that it applies only to tables, not indexes.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-U</option></term>
+     <term><option>--username=USERNAME</option></term>
+     <listitem>
+      <para>
+       User name to connect as.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-v</option></term>
+     <term><option>--verbose</option></term>
+     <listitem>
+      <para>
+       Increases the log level verbosity.  This option may be given more than
+       once.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-V</option></term>
+     <term><option>--version</option></term>
+     <listitem>
+      <para>
+       Print the <application>pg_amcheck</application> version and exit.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-w</option></term>
+     <term><option>--no-password</option></term>
+     <listitem>
+      <para>
+       Never issue a password prompt.  If the server requires password
+       authentication and a password is not available by other means such as
+       a <filename>.pgpass</filename> file, the connection attempt will fail.
+       This option can be useful in batch jobs and scripts where no user is
+       present to enter a password.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-W</option></term>
+     <term><option>--password</option></term>
+     <listitem>
+      <para>
+       Force <application>pg_amcheck</application> to prompt for a password
+       before connecting to a database.
+      </para>
+      <para>
+       This option is never essential, since
+       <application>pg_amcheck</application> will automatically prompt for a
+       password if the server demands password authentication.  However,
+       <application>pg_amcheck</application> will waste a connection attempt
+       finding out that the server wants a password.  In some cases it is
+       worth typing <option>-W</option> to avoid the extra connection attempt.
+      </para>
+     </listitem>
+    </varlistentry>
+
+   </variablelist>
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+   <application>pg_amcheck</application> is designed to work with
+   <productname>PostgreSQL</productname> 14.0 and later.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Author</title>
+
+  <para>
+   Mark Dilger <email>mark.dilger@enterprisedb.com</email>
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>See Also</title>
+
+  <simplelist type="inline">
+   <member><xref linkend="amcheck"/></member>
+  </simplelist>
+ </refsect1>
+</refentry>
diff --git a/src/tools/msvc/Install.pm b/src/tools/msvc/Install.pm
index ea3af48777..49ad558b74 100644
--- a/src/tools/msvc/Install.pm
+++ b/src/tools/msvc/Install.pm
@@ -18,7 +18,7 @@ our (@ISA, @EXPORT_OK);
 @EXPORT_OK = qw(Install);
 
 my $insttype;
-my @client_contribs = ('oid2name', 'pgbench', 'vacuumlo');
+my @client_contribs = ('oid2name', 'pg_amcheck', 'pgbench', 'vacuumlo');
 my @client_program_files = (
 	'clusterdb',      'createdb',   'createuser',    'dropdb',
 	'dropuser',       'ecpg',       'libecpg',       'libecpg_compat',
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 49614106dc..f680544e07 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -33,9 +33,9 @@ my @unlink_on_exit;
 
 # Set of variables for modules in contrib/ and src/test/modules/
 my $contrib_defines = { 'refint' => 'REFINT_VERBOSE' };
-my @contrib_uselibpq = ('dblink', 'oid2name', 'postgres_fdw', 'vacuumlo');
-my @contrib_uselibpgport   = ('oid2name', 'vacuumlo');
-my @contrib_uselibpgcommon = ('oid2name', 'vacuumlo');
+my @contrib_uselibpq = ('dblink', 'oid2name', 'pg_amcheck', 'postgres_fdw', 'vacuumlo');
+my @contrib_uselibpgport   = ('oid2name', 'pg_amcheck', 'vacuumlo');
+my @contrib_uselibpgcommon = ('oid2name', 'pg_amcheck', 'vacuumlo');
 my $contrib_extralibs      = undef;
 my $contrib_extraincludes = { 'dblink' => ['src/backend'] };
 my $contrib_extrasource = {
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b1dec43f9d..a0dfe164cd 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -101,6 +101,7 @@ AlterUserMappingStmt
 AlteredTableInfo
 AlternativeSubPlan
 AlternativeSubPlanState
+AmcheckOptions
 AnalyzeAttrComputeStatsFunc
 AnalyzeAttrFetchFunc
 AnalyzeForeignTable_function
@@ -499,6 +500,7 @@ DSA
 DWORD
 DataDumperPtr
 DataPageDeleteStack
+DatabaseInfo
 DateADT
 Datum
 DatumTupleFields
@@ -2084,6 +2086,7 @@ RelToCluster
 RelabelType
 Relation
 RelationData
+RelationInfo
 RelationPtr
 RelationSyncEntry
 RelcacheCallbackFunction
-- 
2.21.1 (Apple Git-122.3)

v42-0003-Extending-PostgresNode-to-test-corruption.patch (application/octet-stream)
From cd0e6068ce7e3d523050467fdf9bc0f21c02c7d8 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Tue, 2 Feb 2021 12:37:58 -0800
Subject: [PATCH v42 3/3] Extending PostgresNode to test corruption.

PostgresNode now has functions for overwriting relation files
with full or partial prior versions of those files, creating
corruption beyond merely twiddling the bits of a heap relation
file.

Adding a regression test for pg_amcheck based on this new
functionality.
---
 contrib/pg_amcheck/t/006_relfile_damage.pl    | 145 ++++++++++
 src/test/modules/Makefile                     |   1 +
 src/test/modules/corruption/Makefile          |  16 ++
 .../modules/corruption/t/001_corruption.pl    |  83 ++++++
 src/test/perl/PostgresNode.pm                 | 261 ++++++++++++++++++
 5 files changed, 506 insertions(+)
 create mode 100644 contrib/pg_amcheck/t/006_relfile_damage.pl
 create mode 100644 src/test/modules/corruption/Makefile
 create mode 100644 src/test/modules/corruption/t/001_corruption.pl

diff --git a/contrib/pg_amcheck/t/006_relfile_damage.pl b/contrib/pg_amcheck/t/006_relfile_damage.pl
new file mode 100644
index 0000000000..45ad223531
--- /dev/null
+++ b/contrib/pg_amcheck/t/006_relfile_damage.pl
@@ -0,0 +1,145 @@
+use strict;
+use warnings;
+
+use TestLib;
+use Test::More tests => 22;
+use PostgresNode;
+
+my ($node, $port);
+
+# Returns the name of the toast relation associated with the named relation.
+#
+# Assumes the test node is running
+sub relation_toast($$)
+{
+	my ($dbname, $relname) = @_;
+
+	my $rel = $node->safe_psql($dbname, qq(
+		SELECT ct.relname
+			FROM pg_catalog.pg_class cr, pg_catalog.pg_class ct
+			WHERE cr.oid = '$relname'::regclass
+			  AND cr.reltoastrelid = ct.oid
+			));
+	return undef unless defined $rel;
+	return "pg_toast.$rel";
+}
+
+# Test set-up
+$node = get_new_node('test');
+$node->init;
+$node->start;
+$port = $node->port;
+
+# Load the amcheck extension, upon which pg_amcheck depends
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+
+# Create a table with a btree index.  Use a fillfactor for the table and index
+# that will allow some fraction of updates to be on the original pages and some
+# on new pages.
+#
+$node->safe_psql('postgres', qq(
+create schema t;
+create table t.t1 (id integer, t text) with (fillfactor=75);
+alter table t.t1 alter column t set storage external;
+insert into t.t1 select gs, repeat('x',gs) from generate_series(9990,10000) gs;
+create index t1_idx on t.t1 (id) with (fillfactor=75);
+));
+
+my $toastrel = relation_toast('postgres', 't.t1');
+
+# Flush relation files to disk and take snapshots of the toast and index
+#
+$node->restart;
+$node->take_relfile_snapshot_minimal('postgres', 'idx', 't.t1_idx');
+$node->take_relfile_snapshot_minimal('postgres', 'toast', $toastrel);
+
+# Insert new data into the table and index
+#
+$node->safe_psql('postgres', qq(
+insert into t.t1 select gs, repeat('y',gs) from generate_series(10001,10100) gs;
+));
+
+# Revert index.  The reverted snapshot file is not corrupt, but it also
+# does not match the current contents of the table.
+#
+$node->stop;
+$node->revert_to_snapshot('idx');
+
+# Restart the node and check table and index with varying options.
+#
+$node->start;
+
+# Checks which do not reconcile the index and table via --heapallindexed will
+# not notice any problems
+#
+$node->command_like(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*' ],
+	qr/^$/,
+	'pg_amcheck reverted index at default checking level');
+
+$node->command_like(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*' ],
+	qr/^$/,
+	'pg_amcheck reverted index at default checking level');
+
+$node->command_like(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*', '--parent-check' ],
+	qr/^$/,
+	'pg_amcheck reverted index with --parent-check');
+
+$node->command_like(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*', '--rootdescend' ],
+	qr/^$/,
+	'pg_amcheck reverted index with --rootdescend');
+
+# Checks which do reconcile the index and table via --heapallindexed will
+# notice the mismatch in their contents
+#
+$node->command_checks_all(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*', '--heapallindexed' ],
+	2,
+	[ qr/heap tuple .* from table "t1" lacks matching index tuple within index "t1_idx"/ ],
+	[ ],
+	'pg_amcheck reverted index with --heapallindexed');
+
+$node->command_checks_all(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*', '--heapallindexed', '--rootdescend' ],
+	2,
+	[ qr/heap tuple .* from table "t1" lacks matching index tuple within index "t1_idx"/ ],
+	[ ],
+	'pg_amcheck reverted index with --heapallindexed --rootdescend');
+
+# Revert the toast.  The reverted toast table is not corrupt, but it does not
+# have entries for all toast pointers in the main table
+#
+$node->stop;
+$node->revert_to_snapshot('toast');
+
+# Restart the node and check table and toast with varying options.  When
+# checking the toast pointers, we may get errors produced by verify_heapam, but
+# we may also get errors from failure to read toast blocks that are beyond the
+# end of the toast table, of the form /ERROR:  could not read block/.  To avoid
+# having a brittle test, we accept any error message.
+#
+$node->start;
+
+$node->command_checks_all(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', $toastrel ],
+	0,
+	[ qr/^$/ ],
+	[ ],
+	'pg_amcheck reverted toast table');
+
+$node->command_checks_all(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*', '--exclude-toast-pointers' ],
+	0,
+	[ qr/^$/ ],
+	[ ],
+	'pg_amcheck with reverted toast using --exclude-toast-pointers');
+
+$node->command_checks_all(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*' ],
+	2,
+	[ qr/.+/ ],			# Any non-empty error message is acceptable
+	[ ],
+	'pg_amcheck with reverted toast and default checking');
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 5391f461a2..c92d1702b4 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -7,6 +7,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = \
 		  brin \
 		  commit_ts \
+		  corruption \
 		  delay_execution \
 		  dummy_index_am \
 		  dummy_seclabel \
diff --git a/src/test/modules/corruption/Makefile b/src/test/modules/corruption/Makefile
new file mode 100644
index 0000000000..ba461c645d
--- /dev/null
+++ b/src/test/modules/corruption/Makefile
@@ -0,0 +1,16 @@
+# src/test/modules/corruption/Makefile
+
+# EXTRA_INSTALL = contrib/pg_amcheck
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/corruption
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/corruption/t/001_corruption.pl b/src/test/modules/corruption/t/001_corruption.pl
new file mode 100644
index 0000000000..ae4a262e06
--- /dev/null
+++ b/src/test/modules/corruption/t/001_corruption.pl
@@ -0,0 +1,83 @@
+use strict;
+use warnings;
+
+use TestLib;
+use Test::More tests => 10;
+use PostgresNode;
+
+my $node = get_new_node('test');
+$node->init;
+$node->start;
+
+# Create something non-trivial for the first snapshot
+$node->safe_psql('postgres', qq(
+create table t1 (id integer, short_text text, long_text text);
+insert into t1 (id, short_text, long_text)
+	(select gs, 'foo', repeat('x', gs)
+		from generate_series(1,10000) gs);
+create unique index idx1 on t1 (id, short_text);
+vacuum freeze;
+));
+
+# Flush relation files to disk and take snapshot of them
+$node->restart;
+$node->take_relfile_snapshot('postgres', 'snap1', 'public.t1');
+
+# Update data in the table, toast table, and index
+$node->safe_psql('postgres', qq(
+update t1 set
+	short_text = 'bar',
+	long_text = repeat('y', id);
+));
+
+# Flush relation files to disk and take second snapshot
+$node->restart;
+$node->take_relfile_snapshot('postgres', 'snap2', 'public.t1');
+
+# Revert the first page of t1 using a torn snapshot.  This should be a partial
+# and corrupt reverting of the update.
+$node->stop;
+$node->revert_to_torn_relfile_snapshot('snap1', 8192);
+
+# Restart the node and count the number of rows in t1 with the original
+# (pre-update) values.  It should not be zero, but nor will it be the full
+# 10000.
+$node->start;
+my ($old, $new, $oldtoast, $newtoast) = counts();
+ok($old > 0 && $old < 10000, "Torn snapshot reverts some of the main updates");
+ok($new > 0 && $new <= 10000, "Torn snapshot retains some of the main updates");
+
+# Revert t1 fully to the first snapshot.  This should fully restore the
+# original (pre-update) values.
+$node->stop;
+$node->revert_to_snapshot('snap1');
+
+# Restart the node and verify only old values remain
+$node->start;
+($old, $new, $oldtoast, $newtoast) = counts();
+is($old, 10000, "Full snapshot restores all the old main values");
+is($oldtoast, 10000, "Full snapshot restores all the old toast values");
+is($new, 0, "Full snapshot reverts all the new main values");
+is($newtoast, 0, "Full snapshot reverts all the new toast values");
+
+# Restore t1 fully to the second snapshot.  This should fully restore the
+# new (post-update) values.
+$node->stop;
+$node->revert_to_snapshot('snap2');
+
+# Restart the node and verify only new values remain
+$node->start;
+($old, $new, $oldtoast, $newtoast) = counts();
+is($old, 0, "Full snapshot reverts all the old main values");
+is($oldtoast, 0, "Full snapshot reverts all the old toast values");
+is($new, 10000, "Full snapshot restores all the new main values");
+is($newtoast, 10000, "Full snapshot restores all the new toast values");
+
+sub counts {
+	return map {
+		$node->safe_psql('postgres', qq(select count(*) from t1 where $_))
+	} ("short_text = 'foo'",
+	   "short_text = 'bar'",
+	   "long_text ~ 'x'",
+	   "long_text ~ 'y'");
+}
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 9667f7667e..d470af93c5 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -2225,6 +2225,267 @@ sub pg_recvlogical_upto
 
 =back
 
+=head1 DATABASE CORRUPTION METHODS
+
+=over
+
+=item $node->relfile_snapshot_repository()
+
+The path to the parent directory of all directories storing snapshots of
+relation backing files.
+
+=cut
+
+sub relfile_snapshot_repository
+{
+	my ($self) = @_;
+	my $snaprepo = join('/', $self->basedir, 'snapshot');
+	unless (-d $snaprepo)
+	{
+		mkdir $snaprepo
+			or $!{EEXIST}
+			or BAIL_OUT("could not create snapshot repository directory \"$snaprepo\": $!");
+	}
+	return $snaprepo;
+}
+
+=pod
+
+=item $node->relfile_snapshot_directory(snapname)
+
+The path to the directory for storing the named snapshot.
+
+=cut
+
+sub relfile_snapshot_directory
+{
+	my ($self, $snapname) = @_;
+
+	join("/", $self->relfile_snapshot_repository(), $snapname);
+}
+
+=pod
+
+=item $node->take_relfile_snapshot($self, $dbname, $snapname, @relnames)
+
+Makes a copy of the files backing the relations B<@relnames>, the associated
+toast relations (if any), and all associated indexes (if any).  No attempt is
+made to flush these files to disk, meaning the snapshot taken could be stale
+unless the caller ensures these files have been flushed prior to calling.
+
+Dies on failure to invoke psql.
+
+Dies on missing relations.
+
+Dies if the given B<$snapname> is already in use.
+
+=cut
+
+=pod
+
+=item $node->take_relfile_snapshot_minimal($self, $dbname, $snapname, @relnames)
+
+Makes a copy of the files backing the relations B<@relnames>.  No attempt is made
+to flush these files to disk, meaning the snapshot taken could be stale unless the
+caller ensures these files have been flushed prior to calling.
+
+Dies on failure to invoke psql.
+
+Dies on missing relation.
+
+Dies if the given B<$snapname> is already in use.
+
+=cut
+
+sub take_relfile_snapshot
+{
+	my ($self, $dbname, $snapname, @relnames) = @_;
+	$self->take_relfile_snapshot_helper($dbname, $snapname, 1, @relnames);
+}
+
+sub take_relfile_snapshot_minimal
+{
+	my ($self, $dbname, $snapname, @relnames) = @_;
+	$self->take_relfile_snapshot_helper($dbname, $snapname, 0, @relnames);
+}
+
+sub take_relfile_snapshot_helper
+{
+	my ($self, $dbname, $snapname, $extended, @relnames) = @_;
+
+	croak "dbname must be specified" unless defined $dbname;
+	croak "relnames must be defined" unless scalar(grep { defined $_ } @relnames);
+	croak "snapname must be specified" unless defined $snapname;
+	croak "snapname must be unique" if exists $self->{snapshot}->{$snapname};
+
+	my $pgdata = $self->data_dir;
+	my $snapdir = $self->relfile_snapshot_directory($snapname);
+	croak "snapname directory name already in use: $snapdir" if (-e $snapdir);
+	mkdir $snapdir
+		or BAIL_OUT("could not create snapshot directory \"$snapdir\": $!");
+
+	my @relpaths = map {
+		$self->safe_psql($dbname,
+			qq(SELECT pg_relation_filepath('$_')));
+	} @relnames;
+
+	my (@toastpaths, @idxpaths);
+	if ($extended)
+	{
+		for my $relname (@relnames)
+		{
+			push (@toastpaths, grep /\w/, split(/(?:\s*\r?\n\s*)+/, $self->safe_psql($dbname,
+				qq(SELECT pg_relation_filepath(c.reltoastrelid)
+					FROM pg_catalog.pg_class c
+					WHERE c.oid = '$relname'::regclass
+					AND c.reltoastrelid != 0::oid))));
+			push (@idxpaths, grep /\w/, split(/(?:\s*\r?\n\s*)+/, $self->safe_psql($dbname,
+				qq(SELECT pg_relation_filepath(i.indexrelid)
+					FROM pg_catalog.pg_index i
+					WHERE i.indrelid = '$relname'::regclass))));
+		}
+	}
+
+	$self->{snapshot}->{$snapname} = {};
+	for my $path (@relpaths, grep { defined($_) } @toastpaths, @idxpaths)
+	{
+		croak "file backing relation is missing: $pgdata/$path" unless -f "$pgdata/$path";
+		copy_file($snapdir, $pgdata, 0, $path);
+		$self->{snapshot}->{$snapname}->{$path} = 1;
+	}
+}
+
+=pod
+
+=item $node->revert_to_snapshot($self, $snapname)
+
+Overwrites the database's relation files with files previously saved in
+B<$snapname>.
+
+Dies if the given B<$snapname> does not exist.
+
+=cut
+
+=pod
+
+=item $node->revert_to_torn_relfile_snapshot($self, $snapname, $bytes)
+
+Partially overwrites the database's relation files using prefixes of the given
+number of bytes from the files saved in B<$snapname>.  If B<$bytes> is
+negative, uses suffixes of the given byte length rather than prefixes.
+
+If B<$bytes> is undef, replaces the database's relation files wholesale with
+the files saved in B<$snapname>; unlike with a defined B<$bytes>, a file may
+become shorter if the saved file is shorter than the current file.
+
+=cut
+
+sub revert_to_snapshot
+{
+	my ($self, $snapname) = @_;
+	$self->revert_to_torn_relfile_snapshot($snapname, undef);
+}
+
+sub revert_to_torn_relfile_snapshot
+{
+	my ($self, $snapname, $bytes) = @_;
+
+	croak "no such snapshot" unless exists $self->{snapshot}->{$snapname};
+
+	my $pgdata = $self->data_dir;
+	my $snaprepo = join('/', $self->relfile_snapshot_repository, $snapname);
+	croak "snapname directory missing: $snaprepo" unless (-d $snaprepo);
+
+	if (defined $bytes)
+	{
+		tear_file($pgdata, $snaprepo, $bytes, $_)
+			for (keys %{$self->{snapshot}->{$snapname}});
+	}
+	else
+	{
+		copy_file($pgdata, $snaprepo, 1, $_)
+			for (keys %{$self->{snapshot}->{$snapname}});
+	}
+}
+
+sub copy_file
+{
+	my ($dstdir, $srcdir, $overwrite, $path) = @_;
+
+	croak "No such directory: $dstdir" unless -d $dstdir;
+	croak "No such directory: $srcdir" unless -d $srcdir;
+
+	foreach my $part (split(m{/}, $path))
+	{
+		my $srcpart = "$srcdir/$part";
+		my $dstpart = "$dstdir/$part";
+
+		if (-d $srcpart)
+		{
+			$srcdir = $srcpart;
+			$dstdir = $dstpart;
+			die "$dstdir is in the way" if (-e $dstdir && ! -d $dstdir);
+			unless (-d $dstdir)
+			{
+				mkdir $dstdir
+					or BAIL_OUT("could not create directory \"$dstdir\": $!");
+			}
+		}
+		elsif (-f $srcpart)
+		{
+			die "$dstdir/$part is in the way" if (!$overwrite && -e "$dstdir/$part");
+
+			File::Copy::copy($srcpart, "$dstdir/$part");
+		}
+	}
+}
+
+sub tear_file
+{
+	my ($dstdir, $srcdir, $bytes, $path) = @_;
+
+	croak "No such directory: $dstdir" unless -d $dstdir;
+	croak "No such directory: $srcdir" unless -d $srcdir;
+
+	my $srcfile = "$srcdir/$path";
+	my $dstfile = "$dstdir/$path";
+
+	croak "No such file: $srcfile" unless -f $srcfile;
+	croak "No such file: $dstfile" unless -f $dstfile;
+
+	my ($srcfh, $dstfh);
+	open($srcfh, '<', $srcfile) or die "Cannot read $srcfile: $!";
+	open($dstfh, '+<', $dstfile) or die "Cannot modify $dstfile: $!";
+	binmode($srcfh);
+	binmode($dstfh);
+
+	my $buffer;
+	if ($bytes < 0)
+	{
+		$bytes *= -1;		# Easier to use positive value
+		my $srcsize = (stat($srcfh))[7];
+		my $offset = $srcsize - $bytes;
+		sysseek($srcfh, $offset, 0);
+		sysseek($dstfh, $offset, 0);
+		sysread($srcfh, $buffer, $bytes);
+		syswrite($dstfh, $buffer, $bytes);
+	}
+	else
+	{
+		sysseek($srcfh, 0, 0);
+		sysseek($dstfh, 0, 0);
+		sysread($srcfh, $buffer, $bytes);
+		syswrite($dstfh, $buffer, $bytes);
+	}
+
+	close($srcfh);
+	close($dstfh);
+}
+
+=pod
+
+=back
+
 =cut
 
 1;
-- 
2.21.1 (Apple Git-122.3)
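The tear_file helper in the patch above overwrites either a prefix (positive byte count) or a suffix (negative byte count) of a relation file with the corresponding bytes from a snapshot, leaving the rest of the file intact. A minimal Python sketch of the same technique, purely illustrative and not part of the patch (the function name and signature are assumptions):

```python
import os

def tear_file(dst_path, src_path, nbytes):
    # Overwrite a prefix (nbytes > 0) or suffix (nbytes < 0) of dst_path
    # with the corresponding bytes from src_path; the rest of dst_path
    # is left untouched, producing a "torn" (partially reverted) file.
    with open(src_path, 'rb') as src, open(dst_path, 'r+b') as dst:
        if nbytes < 0:
            nbytes = -nbytes                       # easier to use a positive value
            offset = os.path.getsize(src_path) - nbytes
            src.seek(offset)
            dst.seek(offset)
        dst.write(src.read(nbytes))
```

The suffix case positions both file handles at the same offset from the start so the copied bytes land in the matching region of the destination.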

#2Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#1)
Re: pg_amcheck contrib application

On Wed, Mar 3, 2021 at 10:22 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

Your four messages about there being nothing to check seem like they
could be consolidated down to one: "nothing to check for pattern
\"%s\"".

I anticipated your review comment, but I'm worried about the case that somebody runs

pg_amcheck -t "foo" -i "foo"

and one of those matches and the other does not. The message 'nothing to check for pattern "foo"' will be wrong (because there was something to check for it) and unhelpful (because it doesn't say which failed to match.)

Fair point.

Changed, though I assumed your parens for corruption() were not intended.

Uh, yeah.

Thanks for the review!

+ fprintf(stderr, "%s: no relations to check", progname);

Missing newline.

Generally, I would favor using pg_log_whatever as a way of reporting
messages starting when option parsing is complete. In other words,
starting here:

+ fprintf(stderr, "%s: no databases to check\n", progname);

I see no real advantage in having a bunch of these using
fprintf(stderr, ...), which to me seems most appropriate only for very
early failures.

Perhaps amcheck_sql could be spread across fewer lines, now that it
doesn't have so many decorations?

pg_basebackup uses -P as a short form for --progress, so maybe we
should match that here.

When I do "pg_amcheck --progress", it just says "259/259 (100%)" which
I don't find too clear. The corresponding pg_basebackup output is
"32332/32332 kB (100%), 1/1 tablespace" which has the advantage of
including units. I think if you just add the word "relations" to your
message it will be nicer.

When I do "pg_amcheck -s public" it tells me that there are no
relations to check in schemas for "public". I think "schemas matching"
would read better than "schemas for." Similar with the other messages.
When I try "pg_amcheck -t nessie" it tells me that there are no tables
to check for "nessie" but saying that there are no tables to check
matching "nessie" to me sounds more natural.

The code doesn't seem real clear on the difference between a database
name and a pattern. Consider:

createdb rhaas
createdb 'rh*s'
PGDATABASE='rh*s' pg_amcheck

It checks the rhaas database, which I venture to say is just plain wrong.
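A rough Python analogy of the ambiguity (using shell globbing rather than pg_amcheck's psql-style patterns, which behave similarly for `*`; names here are illustrative):

```python
import fnmatch

databases = ["rhaas", "rh*s"]
name = "rh*s"   # the user meant this literal database name

# Interpreted as a pattern, it also matches "rhaas",
# which the user never asked for:
as_pattern = [db for db in databases if fnmatch.fnmatchcase(db, name)]

# Interpreted literally, it matches only the intended database:
as_literal = [db for db in databases if db == name]
```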

The error message when I exclude the only checkable database is not
very clear. "pg_amcheck -D rhaas" says pg_amcheck: no checkable
database: "rhaas". Well, I get that there's no checkable database. But
as a user I have no idea what "rhaas" is. I can even get it to issue
this complaint more than once:

createdb q
createdb qq
pg_amcheck -D 'q*' q qq

Now it issues the "no checkable database" complaint twice, once for q
and once for qq. But if there's no checkable database, I only need to
know that once. Either the message is wrongly-worded, or it should
only be issued once and doesn't need to include the pattern. I think
it's the second one, but I could be wrong.

Using a pattern as the only or first argument doesn't work; i.e.
"pg_amcheck rhaas" works but "pg_amcheck rhaa?" fails because there is
no database with that exact literal name. This seems like another
instance of confusion between a literal database name and a database
name pattern. I'm not quite sure what the right solution is here. We
could give up on having database patterns altogether -- the comparable
issue does not arise for database and schema name patterns -- or the
maintenance database could default to something that's not going to be
a pattern, like "postgres," rather than being taken from a
command-line argument that is intended to be a pattern. Or some hybrid
approach e.g. -d options are patterns, but don't set the maintenance
database, while extra command line arguments are literal database
names, and thus are presumably OK to use as the maintenance DB. But
it's too weird IMHO to support patterns here and then have supplying
one inevitably fail unless you also specify --maintenance-db.

It's sorta annoying that there doesn't seem to be an easy way to find
out exactly what relations got checked as a result of whatever I did.
Perhaps pg_amcheck -v should print a line for each relation saying
that it's checking that relation; it's not actually that verbose as
things stand. If we thought that was overdoing it, we could set things
up so that multiple -v options keep increasing the verbosity level, so
that you can get this via pg_amcheck -vv. I submit that pg_amcheck -e
is not useful for this purpose because the queries, besides being
long, use the relation OIDs rather than the names, so it's not easy to
see what happened.

I think that something's not working in terms of schema exclusion. If
I create a brand-new database and then run "pg_amcheck -S pg_catalog
-S information_schema -S pg_toast" it still checks stuff. In fact it
seems to check the exact same amount of stuff that it checks if I run
it with no command-line options at all. In fact, if I run "pg_amcheck
-S '*'" that still checks everything. Unless I'm misunderstanding what
this option is supposed to do, the fact that a version of this patch
where this seemingly doesn't work at all escaped to the list suggests
that your testing has got some gaps.

I like the semantics of --no-toast-expansion and --no-index-expansion
as you now have them, but I find I don't really like the names. Could
I suggest --no-dependent-indexes and --no-dependent-toast?

I tried pg_amcheck --startblock=tsgsdg and got an error message
without a trailing newline. I tried --startblock=-525523 and got no
error. I tried --startblock=99999999999999999999999999 and got a
complaint that the value was out of bounds, but without a trailing
newline. Maybe there's an argument that the bounds don't need to be
checked, but surely there's no argument for checking one and not the
other. I haven't tried the corresponding cases with --endblock but you
should. I tried --startblock=2 --endblock=1 and got a complaint that
the ending block precedes the starting block, which is totally
reasonable (though I might say "start block" and "end block" rather
than using the -ing forms) but this message is prefixed with
"pg_amcheck: " whereas the messages about an altogether invalid
starting block were not so prefixed. Is there a reason not to make
this consistent?
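The validation gaps described above (non-numeric input accepted without a proper message, negatives silently accepted, and out-of-range values only partially caught) all come down to option parsing. A minimal sketch of the kind of checks being requested might look like the following; parse_block_number() is a hypothetical helper, not pg_amcheck's actual code, though the 0xFFFFFFFE limit is PostgreSQL's real MaxBlockNumber for 32-bit block numbers:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical parser illustrating the checks discussed above: reject
 * non-numeric input, trailing garbage, negatives, and values beyond
 * the maximum block number (0xFFFFFFFE for PostgreSQL relations).
 */
static bool
parse_block_number(const char *optval, long long *result)
{
	char	   *endptr;
	long long	val;

	errno = 0;
	val = strtoll(optval, &endptr, 10);
	if (endptr == optval || *endptr != '\0' || errno == ERANGE)
		return false;			/* not a number, or overflow */
	if (val < 0 || val > 0xFFFFFFFELL)
		return false;			/* out of range for a block number */
	*result = val;
	return true;
}
```

With this shape, "tsgsdg", "-525523", and "99999999999999999999999999" are all rejected the same way, so the caller can emit one consistently formatted, newline-terminated error message.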

I also tried using a random positive integer for startblock, and for
every relation I am told "ERROR: starting block number must be
between 0 and <whatever>". That makes sense, because I used a big
number for the start block and I don't have any big relations, but it
makes for an absolute ton of output, because every verify_heapam query
is 11 lines long. This suggests a couple of possible improvements.
First, maybe we should only display the query that produced the error
in verbose mode. Second, maybe the verify_heapam() query should be
tightened up so that it doesn't stretch across quite so many lines. I
think the call to verify_heapam() could be spread across like 2 lines
rather than 7, which would improve readability. On a related note, I
wonder why we need every verify_heapam() call to join to pg_class and
pg_namespace just to fetch the schema and table name which,
presumably, we should or at least could already have. This kinda
relates to my comment earlier about making -v print a message per
relation so that we can see, in human-readable format, which relations
are getting checked. Right now, if you got an error checking just one
relation, how would you know which relation you got it from? Unless
the server happens to report that information in the message, you're
just in the dark, because pg_amcheck won't tell you.

The line "Read the description of the amcheck contrib module for
details" seems like it could be omitted. Perhaps the first line of the
help message could be changed to read "pg_amcheck uses amcheck to find
corruption in a PostgreSQL database." or something like that, instead.

--
Robert Haas
EDB: http://www.enterprisedb.com

#3Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#2)
3 attachment(s)
Re: pg_amcheck contrib application

On Mar 3, 2021, at 9:15 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 3, 2021 at 10:22 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

Your four messages about there being nothing to check seem like they
could be consolidated down to one: "nothing to check for pattern
\"%s\"".

I anticipated your review comment, but I'm worried about the case that somebody runs

pg_amcheck -t "foo" -i "foo"

and one of those matches and the other does not. The message 'nothing to check for pattern "foo"' will be wrong (because there was something to check for it) and unhelpful (because it doesn't say which failed to match.)

Fair point.

Changed, though I assumed your parens for corruption() were not intended.

Uh, yeah.

Thanks for the review!

+ fprintf(stderr, "%s: no relations to check", progname);

Missing newline.

Generally, I would favor using pg_log_whatever as a way of reporting
messages starting when option parsing is complete. In other words,
starting here:

+ fprintf(stderr, "%s: no databases to check\n", progname);

I see no real advantage in having a bunch of these using
fprintf(stderr, ...), which to me seems most appropriate only for very
early failures.

Ok, the newline issues should be fixed, and pg_log_{error,warning,info} is now used more consistently.

Perhaps amcheck_sql could be spread across fewer lines, now that it
doesn't have so many decorations?

Done.

pg_basebackup uses -P as a short form for --progress, so maybe we
should match that here.

Done.

When I do "pg_amcheck --progress", it just says "259/259 (100%)" which
I don't find too clear. The corresponding pg_basebackup output is
"32332/32332 kB (100%), 1/1 tablespace" which has the advantage of
including units. I think if you just add the word "relations" to your
message it will be nicer.

Done. It now shows:

% pg_amcheck -P
259/259 relations (100%) 870/870 pages (100%)

As you go along, the percent of relations processed may not be equal to the percent of pages, though at the end they are both 100%. The value of printing both can only be seen while things are underway.

When I do "pg_amcheck -s public" it tells me that there are no
relations to check in schemas for "public". I think "schemas matching"
would read better than "schemas for." Similar with the other messages.
When I try "pg_amcheck -t nessie" it tells me that there are no tables
to check for "nessie" but saying that there are no tables to check
matching "nessie" to me sounds more natural.

Done.

% pg_amcheck -s public
pg_amcheck: error: no relations to check in schemas matching "public"

The code doesn't seem real clear on the difference between a database
name and a pattern. Consider:

createdb rhaas
createdb 'rh*s'
PGDATABASE='rh*s' pg_amcheck

It checks the rhaas database, which I venture to say is just plain wrong.

This next version treats any arguments supplied with -d and -D as database patterns, and all others as database names. Exclusion patterns (-D) only override inclusion patterns, not names.

The error message when I exclude the only checkable database is not
very clear. "pg_amcheck -D rhaas" says pg_amcheck: no checkable
database: "rhaas". Well, I get that there's no checkable database. But
as a user I have no idea what "rhaas" is. I can even get it to issue
this complaint more than once:

createdb q
createdb qq
pg_amcheck -D 'q*' q qq

Now it issues the "no checkable database" complaint twice, once for q
and once for qq. But if there's no checkable database, I only need to
know that once. Either the message is wrongly-worded, or it should
only be issued once and doesn't need to include the pattern. I think
it's the second one, but I could be wrong.

I think this whole problem goes away with the change to how -D/-d work and don't interact with database names. At least, I don't get any problems like the one you mention:

% PGDATABASE=postgres pg_amcheck -D postgres
pg_amcheck: warning: skipping database "postgres": amcheck is not installed
pg_amcheck: error: no relations to check

% PGDATABASE=mark.dilger pg_amcheck -D mark.dilger --progress
259/259 relations (100%) 870/870 pages (100%)

Using a pattern as the only or first argument doesn't work; i.e.
"pg_amcheck rhaas" works but "pg_amcheck rhaa?" fails because there is
no database with that exact literal name. This seems like another
instance of confusion between a literal database name and a database
name pattern. I'm not quite sure what the right solution is here. We
could give up on having database patterns altogether -- the comparable
issue does not arise for table and schema name patterns -- or the
maintenance database could default to something that's not going to be
a pattern, like "postgres," rather than being taken from a
command-line argument that is intended to be a pattern. Or some hybrid
approach e.g. -d options are patterns, but don't set the maintenance
database, while extra command line arguments are literal database
names, and thus are presumably OK to use as the maintenance DB. But
it's too weird IMHO to support patterns here and then have supplying
one inevitably fail unless you also specify --maintenance-db.

Right. I think the changes in this next version address all your concerns as stated, but here are some examples:

% pg_amcheck "mark.d*" --progress
pg_amcheck: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL: database "mark.d*" does not exist

% PGDATABASE=postgres pg_amcheck "mark.d*" --progress
pg_amcheck: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL: database "mark.d*" does not exist

% PGDATABASE=postgres pg_amcheck -d "mark.d*" --progress
520/520 relations (100%) 1815/1815 pages (100%)

% pg_amcheck --all --maintenance-db="mark.d*" --progress
pg_amcheck: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL: database "mark.d*" does not exist

% pg_amcheck --all -D="mark.d*" --progress
pg_amcheck: warning: skipping database "template1": amcheck is not installed
520/520 relations (100%) 1815/1815 pages (100%)
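The rule the examples above demonstrate can be stated compactly: -d/-D arguments are patterns, bare arguments and environment-supplied names are literals, and exclusion patterns override inclusion patterns but never literal names. The following is only an illustrative model of that rule, with a toy trailing-asterisk matcher standing in for the real SQL-pattern machinery (pg_amcheck actually builds its matching into server-side queries via processSQLNamePattern):

```c
#include <stdbool.h>
#include <string.h>

/* Toy matcher: a trailing '*' matches any suffix, otherwise exact match. */
static bool
pattern_matches(const char *pattern, const char *name)
{
	size_t		plen = strlen(pattern);

	if (plen > 0 && pattern[plen - 1] == '*')
		return strncmp(pattern, name, plen - 1) == 0;
	return strcmp(pattern, name) == 0;
}

/*
 * Illustrative model of the -d/-D resolution rule described above:
 * bare arguments are literal names and are always checked; -D
 * exclusion patterns override -d inclusion patterns only.
 */
static bool
database_is_checked(const char *name, bool is_literal,
					const char *include_pat, const char *exclude_pat)
{
	if (is_literal)
		return true;
	if (exclude_pat && pattern_matches(exclude_pat, name))
		return false;
	return include_pat == NULL || pattern_matches(include_pat, name);
}
```

Under this model, "pg_amcheck -D 'q*' q qq" still checks q and qq because they were named literally, which matches the behavior shown in the transcripts.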

It's sorta annoying that there doesn't seem to be an easy way to find
out exactly what relations got checked as a result of whatever I did.
Perhaps pg_amcheck -v should print a line for each relation saying
that it's checking that relation; it's not actually that verbose as
things stand. If we thought that was overdoing it, we could set things
up so that multiple -v options keep increasing the verbosity level, so
that you can get this via pg_amcheck -vv. I submit that pg_amcheck -e
is not useful for this purpose because the queries, besides being
long, use the relation OIDs rather than the names, so it's not easy to
see what happened.

I added that, as shown here:

% pg_amcheck mark.dilger --table=pg_subscription --table=pg_publication -v
pg_amcheck: in database "mark.dilger": using amcheck version "1.3" in schema "public"
pg_amcheck: checking btree index "mark.dilger"."pg_toast"."pg_toast_6100_index" (oid 4184) (1/1 page)
pg_amcheck: checking btree index "mark.dilger"."pg_catalog"."pg_publication_oid_index" (oid 6110) (1/1 page)
pg_amcheck: checking btree index "mark.dilger"."pg_catalog"."pg_publication_pubname_index" (oid 6111) (1/1 page)
pg_amcheck: checking btree index "mark.dilger"."pg_catalog"."pg_subscription_oid_index" (oid 6114) (1/1 page)
pg_amcheck: checking btree index "mark.dilger"."pg_catalog"."pg_subscription_subname_index" (oid 6115) (1/1 page)
pg_amcheck: checking table "mark.dilger"."pg_toast"."pg_toast_6100" (oid 4183) (0/0 pages)
pg_amcheck: checking table "mark.dilger"."pg_catalog"."pg_subscription" (oid 6100) (0/0 pages)
pg_amcheck: checking table "mark.dilger"."pg_catalog"."pg_publication" (oid 6104) (0/0 pages)

I think that something's not working in terms of schema exclusion. If
I create a brand-new database and then run "pg_amcheck -S pg_catalog
-S information_schema -S pg_toast" it still checks stuff. In fact it
seems to check the exact same amount of stuff that it checks if I run
it with no command-line options at all. In fact, if I run "pg_amcheck
-S '*'" that still checks everything. Unless I'm misunderstanding what
this option is supposed to do, the fact that a version of this patch
where this seemingly doesn't work at all escaped to the list suggests
that your testing has got some gaps.

Good catch. That works now, but beware that -S doesn't exclude things brought in by toast or index expansion, so:

% pg_amcheck mark.dilger -S pg_catalog -S pg_toast --progress -v
pg_amcheck: in database "mark.dilger": using amcheck version "1.3" in schema "public"
0/14 relations (0%) 0/90 pages (0%)
pg_amcheck: checking table "mark.dilger"."public"."foo" (oid 16385) (45/45 pages)
pg_amcheck: checking btree index "mark.dilger"."public"."foo_idx" (oid 16388) (30/30 pages)
pg_amcheck: checking table "mark.dilger"."information_schema"."sql_features" (oid 13051) (8/8 pages)
pg_amcheck: checking btree index "mark.dilger"."pg_toast"."pg_toast_13051_index" (oid 13055) (1/1 page)
pg_amcheck: checking table "mark.dilger"."information_schema"."sql_implementation_info" (oid 13056) (1/1 page)
pg_amcheck: checking btree index "mark.dilger"."pg_toast"."pg_toast_13056_index" (oid 13060) (1/1 page)
pg_amcheck: checking table "mark.dilger"."information_schema"."sql_parts" (oid 13061) (1/1 page)
pg_amcheck: checking btree index "mark.dilger"."pg_toast"."pg_toast_13061_index" (oid 13065) (1/1 page)
pg_amcheck: checking table "mark.dilger"."information_schema"."sql_sizing" (oid 13066) (1/1 page)
pg_amcheck: checking btree index "mark.dilger"."pg_toast"."pg_toast_13066_index" (oid 13070) (1/1 page)
pg_amcheck: checking table "mark.dilger"."pg_toast"."pg_toast_13051" (oid 13054) (0/0 pages)
pg_amcheck: checking table "mark.dilger"."pg_toast"."pg_toast_13056" (oid 13059) (0/0 pages)
pg_amcheck: checking table "mark.dilger"."pg_toast"."pg_toast_13061" (oid 13064) (0/0 pages)
pg_amcheck: checking table "mark.dilger"."pg_toast"."pg_toast_13066" (oid 13069) (0/0 pages)
14/14 relations (100%) 90/90 pages (100%)

but

% pg_amcheck mark.dilger -S pg_catalog -S pg_toast -S information_schema --progress -v
pg_amcheck: in database "mark.dilger": using amcheck version "1.3" in schema "public"
0/2 relations (0%) 0/75 pages (0%)
pg_amcheck: checking table "mark.dilger"."public"."foo" (oid 16385) (45/45 pages)
pg_amcheck: checking btree index "mark.dilger"."public"."foo_idx" (oid 16388) (30/30 pages)
2/2 relations (100%) 75/75 pages (100%)

The first one checks so much because the toast and indexes for tables in the "information_schema" are not excluded by -S, but:

% pg_amcheck mark.dilger -S pg_catalog -S pg_toast --progress --no-dependent-indexes --no-dependent-toast -v
pg_amcheck: in database "mark.dilger": using amcheck version "1.3" in schema "public"
0/5 relations (0%) 0/56 pages (0%)
pg_amcheck: checking table "mark.dilger"."public"."foo" (oid 16385) (45/45 pages)
pg_amcheck: checking table "mark.dilger"."information_schema"."sql_features" (oid 13051) (8/8 pages)
pg_amcheck: checking table "mark.dilger"."information_schema"."sql_implementation_info" (oid 13056) (1/1 page)
pg_amcheck: checking table "mark.dilger"."information_schema"."sql_parts" (oid 13061) (1/1 page)
pg_amcheck: checking table "mark.dilger"."information_schema"."sql_sizing" (oid 13066) (1/1 page)
5/5 relations (100%) 56/56 pages (100%)

works as you might expect.
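The selection order the transcripts above illustrate can be modeled simply: -S applies when the primary target list is built, and toast/index expansion happens afterward, constrained only by the --no-dependent-* switches. This is a sketch under those assumptions, not pg_amcheck's internal representation:

```c
#include <stdbool.h>
#include <string.h>

struct rel
{
	const char *schema;
	bool		is_dependent;	/* brought in by toast/index expansion */
};

/*
 * Model of the rule described above: schema exclusion (-S) filters
 * only primary targets; dependent relations added by expansion are
 * suppressed only by --no-dependent-toast / --no-dependent-indexes.
 */
static bool
rel_is_checked(const struct rel *r, const char *excluded_schema,
			   bool no_dependent)
{
	if (r->is_dependent)
		return !no_dependent;	/* -S does not apply here */
	return strcmp(r->schema, excluded_schema) != 0;
}
```

This is why "-S pg_toast" alone still checks pg_toast relations that were pulled in as dependents of information_schema tables, while adding --no-dependent-toast and --no-dependent-indexes trims the run down to the primary targets.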

I like the semantics of --no-toast-expansion and --no-index-expansion
as you now have them, but I find I don't really like the names. Could
I suggest --no-dependent-indexes and --no-dependent-toast?

Changed.

I tried pg_amcheck --startblock=tsgsdg and got an error message
without a trailing newline.

Fixed.

I tried --startblock=-525523 and got no
error.

Fixed.

I tried --startblock=99999999999999999999999999 and got a
complaint that the value was out of bounds, but without a trailing
newline.

Fixed.

Maybe there's an argument that the bounds don't need to be
checked, but surely there's no argument for checking one and not the
other.

It checks both now, and also for --endblock.

I haven't tried the corresponding cases with --endblock but you
should. I tried --startblock=2 --endblock=1 and got a complaint that
the ending block precedes the starting block, which is totally
reasonable (though I might say "start block" and "end block" rather
than using the -ing forms)

I think this is fixed up now. There is an interaction with amcheck's verify_heapam(), which raises an error if the startblock or endblock arguments are out of bounds for the relation in question. Rather than aborting the entire pg_amcheck run, pg_amcheck now avoids passing inappropriate block ranges to verify_heapam() and emits a warning instead, so:

% pg_amcheck mark.dilger -t foo -t pg_class --progress -v --startblock=35 --endblock=77
pg_amcheck: in database "mark.dilger": using amcheck version "1.3" in schema "public"
0/6 relations (0%) 0/55 pages (0%)
pg_amcheck: checking table "mark.dilger"."public"."foo" (oid 16385) (10/45 pages)
pg_amcheck: warning: ignoring endblock option 77 beyond end of table "mark.dilger"."public"."foo"
pg_amcheck: checking btree index "mark.dilger"."public"."foo_idx" (oid 16388) (30/30 pages)
pg_amcheck: checking table "mark.dilger"."pg_catalog"."pg_class" (oid 1259) (0/13 pages)
pg_amcheck: warning: ignoring startblock option 35 beyond end of table "mark.dilger"."pg_catalog"."pg_class"
pg_amcheck: warning: ignoring endblock option 77 beyond end of table "mark.dilger"."pg_catalog"."pg_class"
pg_amcheck: checking btree index "mark.dilger"."pg_catalog"."pg_class_relname_nsp_index" (oid 2663) (6/6 pages)
pg_amcheck: checking btree index "mark.dilger"."pg_catalog"."pg_class_tblspc_relfilenode_index" (oid 3455) (5/5 pages)
pg_amcheck: checking btree index "mark.dilger"."pg_catalog"."pg_class_oid_index" (oid 2662) (4/4 pages)
6/6 relations (100%) 55/55 pages (100%)

The way the (x/y pages) is printed takes into account that the [startblock..endblock] range may reduce the number of pages to check (x) to something less than the number of pages in the relation (y). The reporting is a bit of a lie when the startblock is beyond the end of the table, though: the option doesn't get passed to verify_heapam(), so the number of blocks actually checked may be more than the zero blocks reported. I think I might need to fix this up tomorrow, but I want to get what I have in this patch set posted tonight, so it's not fixed here. There are also multiple ways of addressing this, and I'm having trouble deciding which is best: I can exclude the relation from being checked at all, or realize earlier that I'm not going to honor the startblock argument and compute the blocks to check correctly. Thoughts?
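The warning-instead-of-error behavior shown in the transcript above amounts to clamping each option against the relation's size before building the query. A minimal sketch, with clamp_block_range() as an assumed helper rather than the actual pg_amcheck function:

```c
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int BlockNumber;

/*
 * Sketch of the behavior described above: a start or end block option
 * beyond the end of the table is dropped with a warning rather than
 * passed to verify_heapam(), which would raise an ERROR for it.
 * A negative value means the option was not given.
 */
static void
clamp_block_range(BlockNumber nblocks, long startblock, long endblock,
				  bool *use_start, bool *use_end)
{
	*use_start = (startblock >= 0 && startblock < (long) nblocks);
	*use_end = (endblock >= 0 && endblock < (long) nblocks);
	if (startblock >= 0 && !*use_start)
		fprintf(stderr, "warning: ignoring startblock option %ld beyond end of table\n",
				startblock);
	if (endblock >= 0 && !*use_end)
		fprintf(stderr, "warning: ignoring endblock option %ld beyond end of table\n",
				endblock);
}
```

Against the transcript: for the 45-page table "foo", startblock=35 is honored and endblock=77 is ignored with a warning; for the 13-page pg_class, both are ignored, which is exactly the case where the (0/13 pages) report understates what actually gets checked.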

but this message is prefixed with
"pg_amcheck: " whereas the messages about an altogether invalid
starting block were not so prefixed. Is there a reason not to make
this consistent?

That was a stray usage of pg_log_error where fprintf should have been used. Fixed.

I also tried using a random positive integer for startblock, and for
every relation I am told "ERROR: starting block number must be
between 0 and <whatever>". That makes sense, because I used a big
number for the start block and I don't have any big relations, but it
makes for an absolute ton of output, because every verify_heapam query
is 11 lines long.

This happens because the range was being passed down to verify_heapam. It won't do that now.

This suggests a couple of possible improvements.
First, maybe we should only display the query that produced the error
in verbose mode.

No longer relevant.

Second, maybe the verify_heapam() query should be
tightened up so that it doesn't stretch across quite so many lines.

Not a bad idea, but no longer relevant to the startblock/endblock issues. Done.

I
think the call to verify_heapam() could be spread across like 2 lines
rather than 7, which would improve readability.

Done.

On a related note, I
wonder why we need every verify_heapam() call to join to pg_class and
pg_namespace just to fetch the schema and table name which,
presumably, we should or at least could already have.

We didn't have it, but we do now, so that join is removed.

This kinda
relates to my comment earlier about making -v print a message per
relation so that we can see, in human-readable format, which relations
are getting checked.

Done.

Right now, if you got an error checking just one
relation, how would you know which relation you got it from? Unless
the server happens to report that information in the message, you're
just in the dark, because pg_amcheck won't tell you.

That information is now included in the query text, so you can see it in the error message along with the oid.

The line "Read the description of the amcheck contrib module for
details" seems like it could be omitted. Perhaps the first line of the
help message could be changed to read "pg_amcheck uses amcheck to find
corruption in a PostgreSQL database." or something like that, instead.

Done.

Attachments:

v43-0001-Reworking-ParallelSlots-for-mutliple-DB-use.patch (application/octet-stream)
From efdbcf09297ac12d3440ad7cc9d7d805bf5e0b51 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Wed, 3 Mar 2021 07:16:55 -0800
Subject: [PATCH v43 1/3] Reworking ParallelSlots for mutliple DB use

The existing implementation of ParallelSlots is used by reindexdb
and vacuumdb to process tables in parallel in only one database at
a time.  The ParallelSlots interface reflects this usage pattern.
The function to set up the slots assumes all slots should be
connected to the same database, and the function for getting the
next idle slot pays no attention to which database the slot may be
connected to.

In anticipation of pg_amcheck using parallel slots to process
multiple databases in parallel, reworking the interface while
trying to remain reasonably simple for reindexdb and vacuumdb to
use:

ParallelSlotsSetup() no longer creates or receives database
connections.  It takes arguments that it stores for use in
subsequent operations when a connection needs to be formed.

Callers who already have a connection and want to reuse it can give
it to the parallel slots using a new function,
ParallelSlotsAdoptConn().  Both reindexdb and vacuumdb use this.

ParallelSlotsGetIdle() is extended to take a dbname argument
indicating the database to which a connection is desired, and to
manage a heterogeneous set of slots potentially connected to varying
databases and some perhaps not yet connected.  The function will
reuse an existing connection or form a new connection as necessary.

The logic for determining whether a slot's connection is suitable
for reuse is based on the database the slot's connection is
connected to, and whether that matches the database desired.  Other
connection parameters (user, host, port, etc.) are assumed not to
change from slot to slot.
---
 src/bin/scripts/reindexdb.c          |  17 +-
 src/bin/scripts/vacuumdb.c           |  46 +--
 src/fe_utils/parallel_slot.c         | 411 +++++++++++++++++++--------
 src/include/fe_utils/parallel_slot.h |  27 +-
 src/tools/pgindent/typedefs.list     |   2 +
 5 files changed, 342 insertions(+), 161 deletions(-)

diff --git a/src/bin/scripts/reindexdb.c b/src/bin/scripts/reindexdb.c
index cf28176243..fc0681538a 100644
--- a/src/bin/scripts/reindexdb.c
+++ b/src/bin/scripts/reindexdb.c
@@ -36,7 +36,7 @@ static SimpleStringList *get_parallel_object_list(PGconn *conn,
 												  ReindexType type,
 												  SimpleStringList *user_list,
 												  bool echo);
-static void reindex_one_database(const ConnParams *cparams, ReindexType type,
+static void reindex_one_database(ConnParams *cparams, ReindexType type,
 								 SimpleStringList *user_list,
 								 const char *progname,
 								 bool echo, bool verbose, bool concurrently,
@@ -330,7 +330,7 @@ main(int argc, char *argv[])
 }
 
 static void
-reindex_one_database(const ConnParams *cparams, ReindexType type,
+reindex_one_database(ConnParams *cparams, ReindexType type,
 					 SimpleStringList *user_list,
 					 const char *progname, bool echo,
 					 bool verbose, bool concurrently, int concurrentCons,
@@ -341,7 +341,7 @@ reindex_one_database(const ConnParams *cparams, ReindexType type,
 	bool		parallel = concurrentCons > 1;
 	SimpleStringList *process_list = user_list;
 	ReindexType process_type = type;
-	ParallelSlot *slots;
+	ParallelSlotArray *sa;
 	bool		failed = false;
 	int			items_count = 0;
 
@@ -461,7 +461,8 @@ reindex_one_database(const ConnParams *cparams, ReindexType type,
 
 	Assert(process_list != NULL);
 
-	slots = ParallelSlotsSetup(cparams, progname, echo, conn, concurrentCons);
+	sa = ParallelSlotsSetup(concurrentCons, cparams, progname, echo, NULL);
+	ParallelSlotsAdoptConn(sa, conn);
 
 	cell = process_list->head;
 	do
@@ -475,7 +476,7 @@ reindex_one_database(const ConnParams *cparams, ReindexType type,
 			goto finish;
 		}
 
-		free_slot = ParallelSlotsGetIdle(slots, concurrentCons);
+		free_slot = ParallelSlotsGetIdle(sa, NULL);
 		if (!free_slot)
 		{
 			failed = true;
@@ -489,7 +490,7 @@ reindex_one_database(const ConnParams *cparams, ReindexType type,
 		cell = cell->next;
 	} while (cell != NULL);
 
-	if (!ParallelSlotsWaitCompletion(slots, concurrentCons))
+	if (!ParallelSlotsWaitCompletion(sa))
 		failed = true;
 
 finish:
@@ -499,8 +500,8 @@ finish:
 		pg_free(process_list);
 	}
 
-	ParallelSlotsTerminate(slots, concurrentCons);
-	pfree(slots);
+	ParallelSlotsTerminate(sa);
+	pfree(sa);
 
 	if (failed)
 		exit(1);
diff --git a/src/bin/scripts/vacuumdb.c b/src/bin/scripts/vacuumdb.c
index 602fd45c42..7901c41f16 100644
--- a/src/bin/scripts/vacuumdb.c
+++ b/src/bin/scripts/vacuumdb.c
@@ -45,7 +45,7 @@ typedef struct vacuumingOptions
 } vacuumingOptions;
 
 
-static void vacuum_one_database(const ConnParams *cparams,
+static void vacuum_one_database(ConnParams *cparams,
 								vacuumingOptions *vacopts,
 								int stage,
 								SimpleStringList *tables,
@@ -408,7 +408,7 @@ main(int argc, char *argv[])
  * a list of tables from the database.
  */
 static void
-vacuum_one_database(const ConnParams *cparams,
+vacuum_one_database(ConnParams *cparams,
 					vacuumingOptions *vacopts,
 					int stage,
 					SimpleStringList *tables,
@@ -421,13 +421,14 @@ vacuum_one_database(const ConnParams *cparams,
 	PGresult   *res;
 	PGconn	   *conn;
 	SimpleStringListCell *cell;
-	ParallelSlot *slots;
+	ParallelSlotArray *sa;
 	SimpleStringList dbtables = {NULL, NULL};
 	int			i;
 	int			ntups;
 	bool		failed = false;
 	bool		tables_listed = false;
 	bool		has_where = false;
+	const char *initcmd;
 	const char *stage_commands[] = {
 		"SET default_statistics_target=1; SET vacuum_cost_delay=0;",
 		"SET default_statistics_target=10; RESET vacuum_cost_delay;",
@@ -684,26 +685,25 @@ vacuum_one_database(const ConnParams *cparams,
 		concurrentCons = 1;
 
 	/*
-	 * Setup the database connections. We reuse the connection we already have
-	 * for the first slot.  If not in parallel mode, the first slot in the
-	 * array contains the connection.
+	 * All slots need to be prepared to run the appropriate analyze stage, if
+	 * caller requested that mode.  We have to prepare the initial connection
+	 * ourselves before setting up the slots.
 	 */
-	slots = ParallelSlotsSetup(cparams, progname, echo, conn, concurrentCons);
+	if (stage == ANALYZE_NO_STAGE)
+		initcmd = NULL;
+	else
+	{
+		initcmd = stage_commands[stage];
+		executeCommand(conn, initcmd, echo);
+	}
 
 	/*
-	 * Prepare all the connections to run the appropriate analyze stage, if
-	 * caller requested that mode.
+	 * Setup the database connections. We reuse the connection we already have
+	 * for the first slot.  If not in parallel mode, the first slot in the
+	 * array contains the connection.
 	 */
-	if (stage != ANALYZE_NO_STAGE)
-	{
-		int			j;
-
-		/* We already emitted the message above */
-
-		for (j = 0; j < concurrentCons; j++)
-			executeCommand((slots + j)->connection,
-						   stage_commands[stage], echo);
-	}
+	sa = ParallelSlotsSetup(concurrentCons, cparams, progname, echo, initcmd);
+	ParallelSlotsAdoptConn(sa, conn);
 
 	initPQExpBuffer(&sql);
 
@@ -719,7 +719,7 @@ vacuum_one_database(const ConnParams *cparams,
 			goto finish;
 		}
 
-		free_slot = ParallelSlotsGetIdle(slots, concurrentCons);
+		free_slot = ParallelSlotsGetIdle(sa, NULL);
 		if (!free_slot)
 		{
 			failed = true;
@@ -740,12 +740,12 @@ vacuum_one_database(const ConnParams *cparams,
 		cell = cell->next;
 	} while (cell != NULL);
 
-	if (!ParallelSlotsWaitCompletion(slots, concurrentCons))
+	if (!ParallelSlotsWaitCompletion(sa))
 		failed = true;
 
 finish:
-	ParallelSlotsTerminate(slots, concurrentCons);
-	pg_free(slots);
+	ParallelSlotsTerminate(sa);
+	pg_free(sa);
 
 	termPQExpBuffer(&sql);
 
diff --git a/src/fe_utils/parallel_slot.c b/src/fe_utils/parallel_slot.c
index b625deb254..a09e5460e5 100644
--- a/src/fe_utils/parallel_slot.c
+++ b/src/fe_utils/parallel_slot.c
@@ -25,31 +25,23 @@
 #include "common/logging.h"
 #include "fe_utils/cancel.h"
 #include "fe_utils/parallel_slot.h"
+#include "fe_utils/query_utils.h"
 
 #define ERRCODE_UNDEFINED_TABLE  "42P01"
 
-static void init_slot(ParallelSlot *slot, PGconn *conn);
 static int	select_loop(int maxFd, fd_set *workerset);
 static bool processQueryResult(ParallelSlot *slot, PGresult *result);
 
-static void
-init_slot(ParallelSlot *slot, PGconn *conn)
-{
-	slot->connection = conn;
-	/* Initially assume connection is idle */
-	slot->isFree = true;
-	ParallelSlotClearHandler(slot);
-}
-
 /*
  * Process (and delete) a query result.  Returns true if there's no problem,
- * false otherwise. It's up to the handler to decide what cosntitutes a
+ * false otherwise. It's up to the handler to decide what constitutes a
  * problem.
  */
 static bool
 processQueryResult(ParallelSlot *slot, PGresult *result)
 {
 	Assert(slot->handler != NULL);
+	Assert(slot->connection != NULL);
 
 	/* On failure, the handler should return NULL after freeing the result */
 	if (!slot->handler(result, slot->connection, slot->handler_context))
@@ -71,6 +63,9 @@ consumeQueryResult(ParallelSlot *slot)
 	bool		ok = true;
 	PGresult   *result;
 
+	Assert(slot != NULL);
+	Assert(slot->connection != NULL);
+
 	SetCancelConn(slot->connection);
 	while ((result = PQgetResult(slot->connection)) != NULL)
 	{
@@ -137,151 +132,316 @@ select_loop(int maxFd, fd_set *workerset)
 }
 
 /*
- * ParallelSlotsGetIdle
- *		Return a connection slot that is ready to execute a command.
- *
- * This returns the first slot we find that is marked isFree, if one is;
- * otherwise, we loop on select() until one socket becomes available.  When
- * this happens, we read the whole set and mark as free all sockets that
- * become available.  If an error occurs, NULL is returned.
+ * Return the offset of a suitable idle slot, or -1 if none are available.  If
+ * the given dbname is not null, only idle slots connected to the given
+ * database are considered suitable, otherwise all idle connected slots are
+ * considered suitable.
  */
-ParallelSlot *
-ParallelSlotsGetIdle(ParallelSlot *slots, int numslots)
+static int
+find_matching_idle_slot(const ParallelSlotArray *sa, const char *dbname)
 {
 	int			i;
-	int			firstFree = -1;
 
-	/*
-	 * Look for any connection currently free.  If there is one, mark it as
-	 * taken and let the caller know the slot to use.
-	 */
-	for (i = 0; i < numslots; i++)
+	for (i = 0; i < sa->numslots; i++)
 	{
-		if (slots[i].isFree)
-		{
-			slots[i].isFree = false;
-			return slots + i;
-		}
+		if (sa->slots[i].inUse)
+			continue;
+
+		if (sa->slots[i].connection == NULL)
+			continue;
+
+		if (dbname == NULL ||
+			strcmp(PQdb(sa->slots[i].connection), dbname) == 0)
+			return i;
+	}
+	return -1;
+}
+
+/*
+ * Return the offset of the first slot without a database connection, or -1 if
+ * all slots are connected.
+ */
+static int
+find_unconnected_slot(const ParallelSlotArray *sa)
+{
+	int			i;
+
+	for (i = 0; i < sa->numslots; i++)
+	{
+		if (sa->slots[i].inUse)
+			continue;
+
+		if (sa->slots[i].connection == NULL)
+			return i;
+	}
+
+	return -1;
+}
+
+/*
+ * Return the offset of the first idle slot, or -1 if all slots are busy.
+ */
+static int
+find_any_idle_slot(const ParallelSlotArray *sa)
+{
+	int			i;
+
+	for (i = 0; i < sa->numslots; i++)
+		if (!sa->slots[i].inUse)
+			return i;
+
+	return -1;
+}
+
+/*
+ * Wait for any slot's connection to have query results, consume the results,
+ * and update the slot's status as appropriate.  Returns true on success,
+ * false on cancellation, on error, or if no slots are connected.
+ */
+static bool
+wait_on_slots(ParallelSlotArray *sa)
+{
+	int			i;
+	fd_set		slotset;
+	int			maxFd = 0;
+	PGconn	   *cancelconn = NULL;
+
+	/* We must reconstruct the fd_set for each call to select_loop */
+	FD_ZERO(&slotset);
+
+	for (i = 0; i < sa->numslots; i++)
+	{
+		int			sock;
+
+		/* We shouldn't get here if we still have slots without connections */
+		Assert(sa->slots[i].connection != NULL);
+
+		sock = PQsocket(sa->slots[i].connection);
+
+		/*
+		 * We don't really expect any connections to lose their sockets after
+		 * startup, but just in case, cope by ignoring them.
+		 */
+		if (sock < 0)
+			continue;
+
+		/* Keep track of the first valid connection we see. */
+		if (cancelconn == NULL)
+			cancelconn = sa->slots[i].connection;
+
+		FD_SET(sock, &slotset);
+		if (sock > maxFd)
+			maxFd = sock;
 	}
 
 	/*
-	 * No free slot found, so wait until one of the connections has finished
-	 * its task and return the available slot.
+	 * If we get this far with no valid connections, processing cannot
+	 * continue.
 	 */
-	while (firstFree < 0)
+	if (cancelconn == NULL)
+		return false;
+
+	SetCancelConn(cancelconn);
+	i = select_loop(maxFd, &slotset);
+	ResetCancelConn();
+
+	/* failure? */
+	if (i < 0)
+		return false;
+
+	for (i = 0; i < sa->numslots; i++)
 	{
-		fd_set		slotset;
-		int			maxFd = 0;
+		int			sock;
 
-		/* We must reconstruct the fd_set for each call to select_loop */
-		FD_ZERO(&slotset);
+		sock = PQsocket(sa->slots[i].connection);
 
-		for (i = 0; i < numslots; i++)
+		if (sock >= 0 && FD_ISSET(sock, &slotset))
 		{
-			int			sock = PQsocket(slots[i].connection);
-
-			/*
-			 * We don't really expect any connections to lose their sockets
-			 * after startup, but just in case, cope by ignoring them.
-			 */
-			if (sock < 0)
-				continue;
-
-			FD_SET(sock, &slotset);
-			if (sock > maxFd)
-				maxFd = sock;
+			/* select() says input is available, so consume it */
+			PQconsumeInput(sa->slots[i].connection);
 		}
 
-		SetCancelConn(slots->connection);
-		i = select_loop(maxFd, &slotset);
-		ResetCancelConn();
-
-		/* failure? */
-		if (i < 0)
-			return NULL;
-
-		for (i = 0; i < numslots; i++)
+		/* Collect result(s) as long as any are available */
+		while (!PQisBusy(sa->slots[i].connection))
 		{
-			int			sock = PQsocket(slots[i].connection);
+			PGresult   *result = PQgetResult(sa->slots[i].connection);
 
-			if (sock >= 0 && FD_ISSET(sock, &slotset))
+			if (result != NULL)
 			{
-				/* select() says input is available, so consume it */
-				PQconsumeInput(slots[i].connection);
+				/* Handle and discard the command result */
+				if (!processQueryResult(&sa->slots[i], result))
+					return false;
 			}
-
-			/* Collect result(s) as long as any are available */
-			while (!PQisBusy(slots[i].connection))
+			else
 			{
-				PGresult   *result = PQgetResult(slots[i].connection);
-
-				if (result != NULL)
-				{
-					/* Handle and discard the command result */
-					if (!processQueryResult(slots + i, result))
-						return NULL;
-				}
-				else
-				{
-					/* This connection has become idle */
-					slots[i].isFree = true;
-					ParallelSlotClearHandler(slots + i);
-					if (firstFree < 0)
-						firstFree = i;
-					break;
-				}
+				/* This connection has become idle */
+				sa->slots[i].inUse = false;
+				ParallelSlotClearHandler(&sa->slots[i]);
+				break;
 			}
 		}
 	}
+	return true;
+}
 
-	slots[firstFree].isFree = false;
-	return slots + firstFree;
+/*
+ * Open a new database connection using the stored connection parameters and
+ * optionally a given dbname if not null, execute the stored initial command if
+ * any, and associate the new connection with the given slot.
+ */
+static void
+connect_slot(ParallelSlotArray *sa, int slotno, const char *dbname)
+{
+	const char *old_override;
+	ParallelSlot *slot = &sa->slots[slotno];
+
+	old_override = sa->cparams->override_dbname;
+	if (dbname)
+		sa->cparams->override_dbname = dbname;
+	slot->connection = connectDatabase(sa->cparams, sa->progname, sa->echo, false, true);
+	sa->cparams->override_dbname = old_override;
+
+	if (PQsocket(slot->connection) >= FD_SETSIZE)
+	{
+		pg_log_fatal("too many jobs for this platform");
+		exit(1);
+	}
+
+	/* Set up the connection using the supplied initial command, if any. */
+	if (sa->initcmd)
+		executeCommand(slot->connection, sa->initcmd, sa->echo);
 }
 
 /*
- * ParallelSlotsSetup
- *		Prepare a set of parallel slots to use on a given database.
+ * ParallelSlotsGetIdle
+ *		Return a connection slot that is ready to execute a command.
+ *
+ * The slot returned is chosen as follows:
+ *
+ * If any idle slot already has an open connection, and if either dbname is
+ * null or the existing connection is to the given database, that slot will be
+ * returned allowing the connection to be reused.
+ *
+ * Otherwise, if any idle slot is not yet connected to any database, the slot
+ * will be returned with its connection opened using the stored cparams and
+ * optionally the given dbname if not null.
+ *
+ * Otherwise, if any idle slot exists, an idle slot will be chosen and returned
+ * after having its connection disconnected and reconnected using the stored
+ * cparams and optionally the given dbname if not null.
  *
- * This creates and initializes a set of connections to the database
- * using the information given by the caller, marking all parallel slots
- * as free and ready to use.  "conn" is an initial connection set up
- * by the caller and is associated with the first slot in the parallel
- * set.
+ * Otherwise, if any slots have connections that are busy, we loop on select()
+ * until one socket becomes available.  When this happens, we read the whole
+ * set and mark as free all sockets that become available.  We then select a
+ * slot using the same rules as above.
+ *
+ * Otherwise, we cannot return a slot, which is an error, and NULL is returned.
+ *
+ * For any connection created, if the stored initcmd is not null, it will be
+ * executed as a command on the newly formed connection before the slot is
+ * returned.
+ *
+ * If an error occurs, NULL is returned.
  */
 ParallelSlot *
-ParallelSlotsSetup(const ConnParams *cparams,
-				   const char *progname, bool echo,
-				   PGconn *conn, int numslots)
+ParallelSlotsGetIdle(ParallelSlotArray *sa, const char *dbname)
 {
-	ParallelSlot *slots;
-	int			i;
+	int			offset;
 
-	Assert(conn != NULL);
+	Assert(sa);
+	Assert(sa->numslots > 0);
 
-	slots = (ParallelSlot *) pg_malloc(sizeof(ParallelSlot) * numslots);
-	init_slot(slots, conn);
-	if (numslots > 1)
+	while (1)
 	{
-		for (i = 1; i < numslots; i++)
+		/* First choice: a slot already connected to the desired database. */
+		offset = find_matching_idle_slot(sa, dbname);
+		if (offset >= 0)
 		{
-			conn = connectDatabase(cparams, progname, echo, false, true);
-
-			/*
-			 * Fail and exit immediately if trying to use a socket in an
-			 * unsupported range.  POSIX requires open(2) to use the lowest
-			 * unused file descriptor and the hint given relies on that.
-			 */
-			if (PQsocket(conn) >= FD_SETSIZE)
-			{
-				pg_log_fatal("too many jobs for this platform -- try %d", i);
-				exit(1);
-			}
+			sa->slots[offset].inUse = true;
+			return &sa->slots[offset];
+		}
+
+		/* Second choice: a slot not connected to any database. */
+		offset = find_unconnected_slot(sa);
+		if (offset >= 0)
+		{
+			connect_slot(sa, offset, dbname);
+			sa->slots[offset].inUse = true;
+			return &sa->slots[offset];
+		}
 
-			init_slot(slots + i, conn);
+		/* Third choice: a slot connected to the wrong database. */
+		offset = find_any_idle_slot(sa);
+		if (offset >= 0)
+		{
+			disconnectDatabase(sa->slots[offset].connection);
+			sa->slots[offset].connection = NULL;
+			connect_slot(sa, offset, dbname);
+			sa->slots[offset].inUse = true;
+			return &sa->slots[offset];
 		}
+
+		/*
+		 * Fourth choice: block until one or more slots become available. If
+		 * any slot has hit a fatal error, we'll find out about that here and
+		 * return NULL.
+		 */
+		if (!wait_on_slots(sa))
+			return NULL;
 	}
+}
+
+/*
+ * ParallelSlotsSetup
+ *		Prepare a set of parallel slots but do not connect to any database.
+ *
+ * This creates and initializes a set of slots, marking all parallel slots as
+ * free and ready to use.  Establishing connections is delayed until requesting
+ * a free slot.  The cparams, progname, echo, and initcmd are stored for later
+ * use and must remain valid for the lifetime of the returned array.
+ */
+ParallelSlotArray *
+ParallelSlotsSetup(int numslots, ConnParams *cparams, const char *progname,
+				   bool echo, const char *initcmd)
+{
+	ParallelSlotArray *sa;
 
-	return slots;
+	Assert(numslots > 0);
+	Assert(cparams != NULL);
+	Assert(progname != NULL);
+
+	sa = (ParallelSlotArray *) palloc0(offsetof(ParallelSlotArray, slots) +
+									   numslots * sizeof(ParallelSlot));
+
+	sa->numslots = numslots;
+	sa->cparams = cparams;
+	sa->progname = progname;
+	sa->echo = echo;
+	sa->initcmd = initcmd;
+
+	return sa;
+}
+
+/*
+ * ParallelSlotsAdoptConn
+ *		Assign an open connection to the slots array for reuse.
+ *
+ * This turns over ownership of an open connection to a slots array.  The
+ * caller should not further use or close the connection.  All the connection's
+ * parameters (user, host, port, etc.) except possibly dbname should match
+ * those of the slots array's cparams, as given in ParallelSlotsSetup.  If
+ * these parameters differ, subsequent behavior is undefined.
+ */
+void
+ParallelSlotsAdoptConn(ParallelSlotArray *sa, PGconn *conn)
+{
+	int		offset;
+
+	offset = find_unconnected_slot(sa);
+	if (offset >= 0)
+		sa->slots[offset].connection = conn;
+	else
+		disconnectDatabase(conn);
 }
 
 /*
@@ -292,13 +452,13 @@ ParallelSlotsSetup(const ConnParams *cparams,
  * terminate all connections.
  */
 void
-ParallelSlotsTerminate(ParallelSlot *slots, int numslots)
+ParallelSlotsTerminate(ParallelSlotArray *sa)
 {
 	int			i;
 
-	for (i = 0; i < numslots; i++)
+	for (i = 0; i < sa->numslots; i++)
 	{
-		PGconn	   *conn = slots[i].connection;
+		PGconn	   *conn = sa->slots[i].connection;
 
 		if (conn == NULL)
 			continue;
@@ -314,13 +474,15 @@ ParallelSlotsTerminate(ParallelSlot *slots, int numslots)
  * error has been found on the way.
  */
 bool
-ParallelSlotsWaitCompletion(ParallelSlot *slots, int numslots)
+ParallelSlotsWaitCompletion(ParallelSlotArray *sa)
 {
 	int			i;
 
-	for (i = 0; i < numslots; i++)
+	for (i = 0; i < sa->numslots; i++)
 	{
-		if (!consumeQueryResult(slots + i))
+		if (sa->slots[i].connection == NULL)
+			continue;
+		if (!consumeQueryResult(&sa->slots[i]))
 			return false;
 	}
 
@@ -350,6 +512,9 @@ ParallelSlotsWaitCompletion(ParallelSlot *slots, int numslots)
 bool
 TableCommandResultHandler(PGresult *res, PGconn *conn, void *context)
 {
+	Assert(res != NULL);
+	Assert(conn != NULL);
+
 	/*
 	 * If it's an error, report it.  Errors about a missing table are harmless
 	 * so we continue processing; but die for other errors.
diff --git a/src/include/fe_utils/parallel_slot.h b/src/include/fe_utils/parallel_slot.h
index 8902f8d4f4..b7e2b0a29b 100644
--- a/src/include/fe_utils/parallel_slot.h
+++ b/src/include/fe_utils/parallel_slot.h
@@ -21,7 +21,7 @@ typedef bool (*ParallelSlotResultHandler) (PGresult *res, PGconn *conn,
 typedef struct ParallelSlot
 {
 	PGconn	   *connection;		/* One connection */
-	bool		isFree;			/* Is it known to be idle? */
+	bool		inUse;			/* Is the slot being used? */
 
 	/*
 	 * Prior to issuing a command or query on 'connection', a handler callback
@@ -33,6 +33,16 @@ typedef struct ParallelSlot
 	void	   *handler_context;
 } ParallelSlot;
 
+typedef struct ParallelSlotArray
+{
+	int			numslots;
+	ConnParams *cparams;
+	const char *progname;
+	bool		echo;
+	const char *initcmd;
+	ParallelSlot slots[FLEXIBLE_ARRAY_MEMBER];
+} ParallelSlotArray;
+
 static inline void
 ParallelSlotSetHandler(ParallelSlot *slot, ParallelSlotResultHandler handler,
 					   void *context)
@@ -48,15 +58,18 @@ ParallelSlotClearHandler(ParallelSlot *slot)
 	slot->handler_context = NULL;
 }
 
-extern ParallelSlot *ParallelSlotsGetIdle(ParallelSlot *slots, int numslots);
+extern ParallelSlot *ParallelSlotsGetIdle(ParallelSlotArray *slots,
+										  const char *dbname);
+
+extern ParallelSlotArray *ParallelSlotsSetup(int numslots, ConnParams *cparams,
+											 const char *progname, bool echo,
+											 const char *initcmd);
 
-extern ParallelSlot *ParallelSlotsSetup(const ConnParams *cparams,
-										const char *progname, bool echo,
-										PGconn *conn, int numslots);
+extern void ParallelSlotsAdoptConn(ParallelSlotArray *sa, PGconn *conn);
 
-extern void ParallelSlotsTerminate(ParallelSlot *slots, int numslots);
+extern void ParallelSlotsTerminate(ParallelSlotArray *sa);
 
-extern bool ParallelSlotsWaitCompletion(ParallelSlot *slots, int numslots);
+extern bool ParallelSlotsWaitCompletion(ParallelSlotArray *sa);
 
 extern bool TableCommandResultHandler(PGresult *res, PGconn *conn,
 									  void *context);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 8bd95aefa1..b1dec43f9d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -403,6 +403,7 @@ ConfigData
 ConfigVariable
 ConnCacheEntry
 ConnCacheKey
+ConnParams
 ConnStatusType
 ConnType
 ConnectionStateEnum
@@ -1729,6 +1730,7 @@ ParallelHashJoinState
 ParallelIndexScanDesc
 ParallelReadyList
 ParallelSlot
+ParallelSlotArray
 ParallelState
 ParallelTableScanDesc
 ParallelTableScanDescData
-- 
2.21.1 (Apple Git-122.3)

Attachment: v43-0002-Adding-contrib-module-pg_amcheck.patch (application/octet-stream)
From 143419d86d5ad30b4720cb70adbf028bd7e0158c Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Tue, 2 Mar 2021 08:34:40 -0800
Subject: [PATCH v43 2/3] Adding contrib module pg_amcheck

Adding new contrib module pg_amcheck, which is a command line
interface for running amcheck's verifications against tables and
indexes.
---
 contrib/Makefile                           |    1 +
 contrib/pg_amcheck/.gitignore              |    3 +
 contrib/pg_amcheck/Makefile                |   29 +
 contrib/pg_amcheck/pg_amcheck.c            | 2150 ++++++++++++++++++++
 contrib/pg_amcheck/t/001_basic.pl          |    9 +
 contrib/pg_amcheck/t/002_nonesuch.pl       |  264 +++
 contrib/pg_amcheck/t/003_check.pl          |  497 +++++
 contrib/pg_amcheck/t/004_verify_heapam.pl  |  487 +++++
 contrib/pg_amcheck/t/005_opclass_damage.pl |   54 +
 doc/src/sgml/contrib.sgml                  |    1 +
 doc/src/sgml/filelist.sgml                 |    1 +
 doc/src/sgml/pgamcheck.sgml                |  682 +++++++
 src/tools/msvc/Install.pm                  |    2 +-
 src/tools/msvc/Mkvcbuild.pm                |    6 +-
 src/tools/pgindent/typedefs.list           |    3 +
 15 files changed, 4185 insertions(+), 4 deletions(-)
 create mode 100644 contrib/pg_amcheck/.gitignore
 create mode 100644 contrib/pg_amcheck/Makefile
 create mode 100644 contrib/pg_amcheck/pg_amcheck.c
 create mode 100644 contrib/pg_amcheck/t/001_basic.pl
 create mode 100644 contrib/pg_amcheck/t/002_nonesuch.pl
 create mode 100644 contrib/pg_amcheck/t/003_check.pl
 create mode 100644 contrib/pg_amcheck/t/004_verify_heapam.pl
 create mode 100644 contrib/pg_amcheck/t/005_opclass_damage.pl
 create mode 100644 doc/src/sgml/pgamcheck.sgml

diff --git a/contrib/Makefile b/contrib/Makefile
index f27e458482..a72dcf7304 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -30,6 +30,7 @@ SUBDIRS = \
 		old_snapshot	\
 		pageinspect	\
 		passwordcheck	\
+		pg_amcheck	\
 		pg_buffercache	\
 		pg_freespacemap \
 		pg_prewarm	\
diff --git a/contrib/pg_amcheck/.gitignore b/contrib/pg_amcheck/.gitignore
new file mode 100644
index 0000000000..c21a14de31
--- /dev/null
+++ b/contrib/pg_amcheck/.gitignore
@@ -0,0 +1,3 @@
+pg_amcheck
+
+/tmp_check/
diff --git a/contrib/pg_amcheck/Makefile b/contrib/pg_amcheck/Makefile
new file mode 100644
index 0000000000..bc61ee7970
--- /dev/null
+++ b/contrib/pg_amcheck/Makefile
@@ -0,0 +1,29 @@
+# contrib/pg_amcheck/Makefile
+
+PGFILEDESC = "pg_amcheck - detects corruption within database relations"
+PGAPPICON = win32
+
+PROGRAM = pg_amcheck
+OBJS = \
+	$(WIN32RES) \
+	pg_amcheck.o
+
+REGRESS_OPTS += --load-extension=amcheck --load-extension=pageinspect
+EXTRA_INSTALL += contrib/amcheck contrib/pageinspect
+
+TAP_TESTS = 1
+
+PG_CPPFLAGS = -I$(libpq_srcdir)
+PG_LIBS_INTERNAL = -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+SHLIB_PREREQS = submake-libpq
+subdir = contrib/pg_amcheck
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_amcheck/pg_amcheck.c b/contrib/pg_amcheck/pg_amcheck.c
new file mode 100644
index 0000000000..829e4081cc
--- /dev/null
+++ b/contrib/pg_amcheck/pg_amcheck.c
@@ -0,0 +1,2150 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_amcheck.c
+ *		Detects corruption within database relations.
+ *
+ * Copyright (c) 2017-2021, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  contrib/pg_amcheck/pg_amcheck.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <time.h>
+
+#include "catalog/pg_am_d.h"
+#include "catalog/pg_namespace_d.h"
+#include "common/logging.h"
+#include "common/username.h"
+#include "fe_utils/cancel.h"
+#include "fe_utils/option_utils.h"
+#include "fe_utils/parallel_slot.h"
+#include "fe_utils/query_utils.h"
+#include "fe_utils/simple_list.h"
+#include "fe_utils/string_utils.h"
+#include "getopt_long.h"		/* pgrminclude ignore */
+#include "pgtime.h"
+#include "storage/block.h"
+
+/* pg_amcheck command line options controlled by user flags */
+typedef struct AmcheckOptions
+{
+	bool		alldb;
+	bool		echo;
+	bool		quiet;
+	bool		verbose;
+	bool		strict_names;
+	bool		show_progress;
+	int			jobs;
+
+	/* Objects to check or not to check, as lists of PatternInfo structs. */
+	SimplePtrList include;
+	SimplePtrList exclude;
+
+	/*
+	 * Databases to process, as literal names.
+	 *
+	 * This list may not be exhaustive, as database portions of the "include"
+	 * list above may match additional databases.
+	 */
+	SimpleStringList dbnames;
+
+	/*
+	 * As an optimization, these flags are true if and only if some pattern in
+	 * the exclude list applies to heap tables, btree indexes, or schemas,
+	 * respectively.  They should always agree with what you'd conclude by
+	 * grep'ing through the exclude list.
+	 */
+	bool		excludetbl;
+	bool		excludeidx;
+	bool		excludensp;
+
+	/*
+	 * If any inclusion pattern exists, then we should only be checking
+	 * matching relations rather than all relations, so this is true iff
+	 * include is empty.
+	 */
+	bool		allrel;
+
+	/* heap table checking options */
+	bool		no_toast_expansion;
+	bool		reconcile_toast;
+	bool		on_error_stop;
+	long		startblock;
+	long		endblock;
+	const char *skip;
+
+	/* btree index checking options */
+	bool		parent_check;
+	bool		rootdescend;
+	bool		heapallindexed;
+
+	/* heap and btree hybrid option */
+	bool		no_index_expansion;
+} AmcheckOptions;
+
+static AmcheckOptions opts = {
+	.alldb = false,
+	.echo = false,
+	.quiet = false,
+	.verbose = false,
+	.strict_names = true,
+	.show_progress = false,
+	.jobs = 1,
+	.include = {NULL, NULL},
+	.exclude = {NULL, NULL},
+	.dbnames = {NULL, NULL},
+	.excludetbl = false,
+	.excludeidx = false,
+	.excludensp = false,
+	.allrel = true,
+	.no_toast_expansion = false,
+	.reconcile_toast = true,
+	.on_error_stop = false,
+	.startblock = -1,
+	.endblock = -1,
+	.skip = "none",
+	.parent_check = false,
+	.rootdescend = false,
+	.heapallindexed = false,
+	.no_index_expansion = false
+};
+
+static const char *progname = NULL;
+
+typedef struct PatternInfo
+{
+	int			pattern_id;		/* Unique ID of this pattern */
+	const char *pattern;		/* Unaltered pattern from the command line */
+	char	   *db_regex;		/* Database regexp parsed from pattern, or
+								 * NULL */
+	char	   *nsp_regex;		/* Schema regexp parsed from pattern, or NULL */
+	char	   *rel_regex;		/* Relation regexp parsed from pattern, or
+								 * NULL */
+	bool		table_only;		/* true if rel_regex should only match tables */
+	bool		index_only;		/* true if rel_regex should only match indexes */
+	bool		matched;		/* true if the pattern matched in any database */
+}			PatternInfo;
+
+/* Unique pattern id counter */
+static int	next_id = 1;
+
+/* Whether all relations have so far passed their corruption checks */
+static bool all_checks_pass = true;
+
+/* Time last progress report was displayed */
+static pg_time_t last_progress_report = 0;
+static bool progress_since_last_stderr = false;
+
+typedef struct DatabaseInfo
+{
+	char	   *datname;
+	char	   *amcheck_schema; /* escaped, quoted literal */
+} DatabaseInfo;
+
+typedef struct RelationInfo
+{
+	const DatabaseInfo *datinfo;	/* shared by other relinfos */
+	Oid			reloid;
+	bool		is_table;		/* true if heap, false if btree */
+	char	   *nspname;
+	char	   *relname;
+	int			relpages;
+	int			blocks_to_check;
+} RelationInfo;
+
+/*
+ * Query for determining if contrib's amcheck is installed.  If so, selects the
+ * namespace name where amcheck's functions can be found.
+ */
+static const char *amcheck_sql =
+"SELECT n.nspname, x.extversion FROM pg_catalog.pg_extension x"
+"\nJOIN pg_catalog.pg_namespace n ON x.extnamespace = n.oid"
+"\nWHERE x.extname = 'amcheck'";
+
+static void prepare_table_command(PQExpBuffer sql, RelationInfo *rel,
+								  PGconn *conn);
+static void prepare_btree_command(PQExpBuffer sql, RelationInfo *rel,
+								  PGconn *conn);
+static void run_command(ParallelSlot *slot, const char *sql,
+						ConnParams *cparams);
+static bool verify_heapam_slot_handler(PGresult *res, PGconn *conn,
+									   void *context);
+static bool verify_btree_slot_handler(PGresult *res, PGconn *conn, void *context);
+static void help(const char *progname);
+static void progress_report(uint64 relations_total, uint64 relations_checked,
+							uint64 relpages_total, uint64 relpages_checked,
+							const char *datname, bool force, bool finished);
+
+static void append_database_pattern(SimplePtrList *list, const char *pattern,
+									int encoding);
+static void append_schema_pattern(SimplePtrList *list, const char *pattern,
+								  int encoding);
+static void append_relation_pattern(SimplePtrList *list, const char *pattern,
+									int encoding);
+static void append_table_pattern(SimplePtrList *list, const char *pattern,
+								 int encoding);
+static void append_index_pattern(SimplePtrList *list, const char *pattern,
+								 int encoding);
+static void compile_database_list(PGconn *conn, SimplePtrList *databases);
+static void compile_database_list_from_all(PGconn *conn,
+										   SimplePtrList *databases);
+static void compile_database_list_from_dbnames(SimplePtrList *databases);
+
+static void compile_relation_list_one_db(PGconn *conn, SimplePtrList *relations,
+										 const DatabaseInfo *datinfo,
+										 long long unsigned int *pagecount);
+
+#define log_no_match(...) do { \
+		if (opts.strict_names) \
+			pg_log_generic(PG_LOG_ERROR, __VA_ARGS__); \
+		else \
+			pg_log_generic(PG_LOG_WARNING, __VA_ARGS__); \
+	} while(0)
+
+int
+main(int argc, char *argv[])
+{
+	PGconn	   *conn;
+	SimplePtrListCell *cell;
+	SimplePtrList databases = {NULL, NULL};
+	SimplePtrList relations = {NULL, NULL};
+	bool		failed = false;
+	const char *latest_datname;
+	int			parallel_workers;
+	ParallelSlotArray *sa;
+	PQExpBufferData sql;
+	long long unsigned int reltotal = 0;
+	long long unsigned int pageschecked = 0;
+	long long unsigned int pagestotal = 0;
+	long long unsigned int relprogress = 0;
+
+	static struct option long_options[] = {
+		/* Connection options */
+		{"host", required_argument, NULL, 'h'},
+		{"port", required_argument, NULL, 'p'},
+		{"username", required_argument, NULL, 'U'},
+		{"no-password", no_argument, NULL, 'w'},
+		{"password", no_argument, NULL, 'W'},
+		{"maintenance-db", required_argument, NULL, 1},
+
+		/* check options */
+		{"all", no_argument, NULL, 'a'},
+		{"database", required_argument, NULL, 'd'},
+		{"exclude-database", required_argument, NULL, 'D'},
+		{"echo", no_argument, NULL, 'e'},
+		{"index", required_argument, NULL, 'i'},
+		{"exclude-index", required_argument, NULL, 'I'},
+		{"jobs", required_argument, NULL, 'j'},
+		{"progress", no_argument, NULL, 'P'},
+		{"quiet", no_argument, NULL, 'q'},
+		{"relation", required_argument, NULL, 'r'},
+		{"exclude-relation", required_argument, NULL, 'R'},
+		{"schema", required_argument, NULL, 's'},
+		{"exclude-schema", required_argument, NULL, 'S'},
+		{"table", required_argument, NULL, 't'},
+		{"exclude-table", required_argument, NULL, 'T'},
+		{"verbose", no_argument, NULL, 'v'},
+		{"no-dependent-indexes", no_argument, NULL, 2},
+		{"no-dependent-toast", no_argument, NULL, 3},
+		{"exclude-toast-pointers", no_argument, NULL, 4},
+		{"on-error-stop", no_argument, NULL, 5},
+		{"skip", required_argument, NULL, 6},
+		{"startblock", required_argument, NULL, 7},
+		{"endblock", required_argument, NULL, 8},
+		{"rootdescend", no_argument, NULL, 9},
+		{"no-strict-names", no_argument, NULL, 10},
+		{"heapallindexed", no_argument, NULL, 11},
+		{"parent-check", no_argument, NULL, 12},
+
+		{NULL, 0, NULL, 0}
+	};
+
+	int			optindex;
+	int			c;
+
+	const char *maintenance_db = NULL;
+
+	const char *host = NULL;
+	const char *port = NULL;
+	const char *username = NULL;
+	enum trivalue prompt_password = TRI_DEFAULT;
+	int			encoding = pg_get_encoding_from_locale(NULL, false);
+	ConnParams	cparams;
+
+	pg_logging_init(argv[0]);
+	progname = get_progname(argv[0]);
+	set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("contrib"));
+
+	handle_help_version_opts(argc, argv, progname, help);
+
+	/* process command-line options */
+	while ((c = getopt_long(argc, argv, "ad:D:eh:Hi:I:j:p:Pqr:R:s:S:t:T:U:wWv",
+							long_options, &optindex)) != -1)
+	{
+		char	   *endptr;
+
+		switch (c)
+		{
+			case 'a':
+				opts.alldb = true;
+				break;
+			case 'd':
+				append_database_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'D':
+				append_database_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 'e':
+				opts.echo = true;
+				break;
+			case 'h':
+				host = pg_strdup(optarg);
+				break;
+			case 'i':
+				opts.allrel = false;
+				append_index_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'I':
+				opts.excludeidx = true;
+				append_index_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 'j':
+				opts.jobs = atoi(optarg);
+				if (opts.jobs < 1)
+				{
+					fprintf(stderr,
+							"number of parallel jobs must be at least 1\n");
+					exit(1);
+				}
+				break;
+			case 'p':
+				port = pg_strdup(optarg);
+				break;
+			case 'P':
+				opts.show_progress = true;
+				break;
+			case 'q':
+				opts.quiet = true;
+				break;
+			case 'r':
+				opts.allrel = false;
+				append_relation_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'R':
+				opts.excludeidx = true;
+				opts.excludetbl = true;
+				append_relation_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 's':
+				opts.allrel = false;
+				append_schema_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'S':
+				opts.excludensp = true;
+				append_schema_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 't':
+				opts.allrel = false;
+				append_table_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'T':
+				opts.excludetbl = true;
+				append_table_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 'U':
+				username = pg_strdup(optarg);
+				break;
+			case 'w':
+				prompt_password = TRI_NO;
+				break;
+			case 'W':
+				prompt_password = TRI_YES;
+				break;
+			case 'v':
+				opts.verbose = true;
+				pg_logging_increase_verbosity();
+				break;
+			case 1:
+				maintenance_db = pg_strdup(optarg);
+				break;
+			case 2:
+				opts.no_index_expansion = true;
+				break;
+			case 3:
+				opts.no_toast_expansion = true;
+				break;
+			case 4:
+				opts.reconcile_toast = false;
+				break;
+			case 5:
+				opts.on_error_stop = true;
+				break;
+			case 6:
+				if (pg_strcasecmp(optarg, "all-visible") == 0)
+					opts.skip = "all visible";
+				else if (pg_strcasecmp(optarg, "all-frozen") == 0)
+					opts.skip = "all frozen";
+				else
+				{
+					fprintf(stderr, "invalid skip option\n");
+					exit(1);
+				}
+				break;
+			case 7:
+				opts.startblock = strtol(optarg, &endptr, 10);
+				if (*endptr != '\0')
+				{
+					fprintf(stderr,
+							"relation start block argument contains garbage characters\n");
+					exit(1);
+				}
+				if (opts.startblock > (long) MaxBlockNumber || opts.startblock < (long) 0)
+				{
+					fprintf(stderr,
+							"relation start block argument out of bounds\n");
+					exit(1);
+				}
+				break;
+			case 8:
+				opts.endblock = strtol(optarg, &endptr, 10);
+				if (*endptr != '\0')
+				{
+					fprintf(stderr,
+							"relation end block argument contains garbage characters\n");
+					exit(1);
+				}
+				if (opts.endblock > (long) MaxBlockNumber || opts.endblock < (long) 0)
+				{
+					fprintf(stderr,
+							"relation end block argument out of bounds\n");
+					exit(1);
+				}
+				break;
+			case 9:
+				opts.rootdescend = true;
+				opts.parent_check = true;
+				break;
+			case 10:
+				opts.strict_names = false;
+				break;
+			case 11:
+				opts.heapallindexed = true;
+				break;
+			case 12:
+				opts.parent_check = true;
+				break;
+			default:
+				fprintf(stderr,
+						"Try \"%s --help\" for more information.\n",
+						progname);
+				exit(1);
+		}
+	}
+
+	if (opts.endblock >= 0 && opts.endblock < opts.startblock)
+	{
+		fprintf(stderr,
+				"relation end block argument precedes start block argument\n");
+		exit(1);
+	}
+
+	/* non-option arguments specify database names (not patterns) */
+	while (optind < argc)
+	{
+		simple_string_list_append(&opts.dbnames, argv[optind]);
+		optind++;
+	}
+
+	/* fill cparams except for dbname, which is set below */
+	cparams.pghost = host;
+	cparams.pgport = port;
+	cparams.pguser = username;
+	cparams.prompt_password = prompt_password;
+	cparams.override_dbname = NULL;
+
+	setup_cancel_handler(NULL);
+
+	/* choose the database for our initial connection */
+	if (opts.alldb)
+		cparams.dbname = maintenance_db;
+	else if (opts.dbnames.head != NULL)
+		cparams.dbname = opts.dbnames.head->val;
+	else
+	{
+		const char *default_db;
+
+		if (getenv("PGDATABASE"))
+			default_db = getenv("PGDATABASE");
+		else if (getenv("PGUSER"))
+			default_db = getenv("PGUSER");
+		else
+			default_db = get_user_name_or_exit(progname);
+
+		/*
+		 * Users expect the database name inferred from the environment to get
+		 * checked, not just get used for the initial connection.
+		 */
+		simple_string_list_append(&opts.dbnames, default_db);
+
+		cparams.dbname = default_db;
+	}
+
+	conn = connectMaintenanceDatabase(&cparams, progname, opts.echo);
+	compile_database_list(conn, &databases);
+	disconnectDatabase(conn);
+
+	if (databases.head == NULL)
+	{
+		pg_log_error("no databases to check");
+		exit(0);
+	}
+
+	/*
+	 * Compile a list of all relations spanning all databases to be checked.
+	 */
+	for (cell = databases.head; cell; cell = cell->next)
+	{
+		PGresult   *result;
+		int			ntups;
+		const char *amcheck_schema = NULL;
+		DatabaseInfo *dat = (DatabaseInfo *) cell->ptr;
+
+		cparams.override_dbname = dat->datname;
+		conn = connectDatabase(&cparams, progname, opts.echo, false, true);
+
+		/*
+		 * Verify that amcheck is installed in this database.  User error
+		 * could leave a database without amcheck that ought to have it, but
+		 * we may also simply be iterating over multiple databases where not
+		 * all of them have amcheck installed (for example, 'template1').
+		 */
+		result = executeQuery(conn, amcheck_sql, opts.echo);
+		if (PQresultStatus(result) != PGRES_TUPLES_OK)
+		{
+			/* Querying the catalog failed. */
+			pg_log_error("database \"%s\": %s",
+						 PQdb(conn), PQerrorMessage(conn));
+			pg_log_error("query was: %s", amcheck_sql);
+			PQclear(result);
+			disconnectDatabase(conn);
+			exit(1);
+		}
+		ntups = PQntuples(result);
+		if (ntups == 0)
+		{
+			/* Querying the catalog succeeded, but amcheck is missing. */
+			pg_log_warning("skipping database \"%s\": amcheck is not installed",
+						   PQdb(conn));
+			disconnectDatabase(conn);
+			continue;
+		}
+		amcheck_schema = PQgetvalue(result, 0, 0);
+		if (opts.verbose)
+			pg_log_info("in database \"%s\": using amcheck version \"%s\" in schema \"%s\"",
+						PQdb(conn), PQgetvalue(result, 0, 1), amcheck_schema);
+		dat->amcheck_schema = PQescapeIdentifier(conn, amcheck_schema,
+												 strlen(amcheck_schema));
+		PQclear(result);
+
+		compile_relation_list_one_db(conn, &relations, dat, &pagestotal);
+		disconnectDatabase(conn);
+	}
+
+	/*
+	 * Check that all inclusion patterns matched at least one schema or
+	 * relation that we can check.
+	 */
+	for (cell = opts.include.head; cell; cell = cell->next)
+	{
+		PatternInfo *pat = (PatternInfo *) cell->ptr;
+
+		if (!pat->matched && (pat->nsp_regex != NULL || pat->rel_regex != NULL))
+		{
+			failed = opts.strict_names;
+
+			if (!opts.quiet || failed)
+			{
+				if (pat->table_only)
+					log_no_match("no tables to check matching \"%s\"",
+								 pat->pattern);
+				else if (pat->index_only)
+					log_no_match("no btree indexes to check matching \"%s\"",
+								 pat->pattern);
+				else if (pat->rel_regex == NULL)
+					log_no_match("no relations to check in schemas matching \"%s\"",
+								 pat->pattern);
+				else
+					log_no_match("no relations to check matching \"%s\"",
+								 pat->pattern);
+			}
+		}
+	}
+
+	if (failed)
+		exit(1);
+
+	/*
+	 * Set parallel_workers to the lesser of opts.jobs and the number of
+	 * relations.
+	 */
+	parallel_workers = 0;
+	for (cell = relations.head; cell; cell = cell->next)
+	{
+		reltotal++;
+		if (parallel_workers < opts.jobs)
+			parallel_workers++;
+	}
+
+	if (reltotal == 0)
+	{
+		pg_log_error("no relations to check");
+		exit(1);
+	}
+	progress_report(reltotal, relprogress, pagestotal, pageschecked, NULL, true, false);
+
+	/*
+	 * Main event loop.
+	 *
+	 * We use server-side parallelism to check up to parallel_workers
+	 * relations in parallel.  The list of relations was computed in database
+	 * order, which minimizes the number of connects and disconnects as we
+	 * process the list.
+	 */
+	latest_datname = NULL;
+	sa = ParallelSlotsSetup(parallel_workers, &cparams, progname, opts.echo,
+							NULL);
+
+	initPQExpBuffer(&sql);
+	for (relprogress = 0, cell = relations.head; cell; cell = cell->next)
+	{
+		ParallelSlot *free_slot;
+		RelationInfo *rel;
+
+		rel = (RelationInfo *) cell->ptr;
+
+		if (CancelRequested)
+		{
+			failed = true;
+			break;
+		}
+
+		/*
+		 * The list of relations is in database sorted order.  If this next
+		 * relation is in a different database than the last one seen, we are
+		 * about to start checking this database.  Note that other slots may
+		 * still be working on relations from prior databases.
+		 */
+		latest_datname = rel->datinfo->datname;
+
+		progress_report(reltotal, relprogress, pagestotal, pageschecked, latest_datname, false, false);
+
+		relprogress++;
+		pageschecked += rel->blocks_to_check;
+
+		/*
+		 * Get a parallel slot for the next amcheck command, blocking if
+		 * necessary until one is available, or until a previously issued slot
+		 * command fails, indicating that we should abort checking the
+		 * remaining objects.
+		 */
+		free_slot = ParallelSlotsGetIdle(sa, rel->datinfo->datname);
+		if (!free_slot)
+		{
+			/*
+			 * Something failed.  We don't need to know what it was, because
+			 * the handler should already have emitted the necessary error
+			 * messages.
+			 */
+			failed = true;
+			break;
+		}
+
+		/*
+		 * Execute the appropriate amcheck command for this relation using our
+		 * slot's database connection.  We do not wait for the command to
+		 * complete, nor do we perform any error checking, as that is done by
+		 * the parallel slots and our handler callback functions.
+		 */
+		if (rel->is_table)
+		{
+			if (opts.verbose)
+			{
+				if (opts.show_progress && progress_since_last_stderr)
+					fprintf(stderr, "\n");
+				pg_log_info(ngettext("checking table \"%s\".\"%s\".\"%s\" (oid %u) (%u/%u page)",
+									 "checking table \"%s\".\"%s\".\"%s\" (oid %u) (%u/%u pages)",
+									 rel->relpages),
+							rel->datinfo->datname, rel->nspname, rel->relname,
+							rel->reloid, rel->blocks_to_check, rel->relpages);
+				progress_since_last_stderr = false;
+			}
+			prepare_table_command(&sql, rel, free_slot->connection);
+			ParallelSlotSetHandler(free_slot, verify_heapam_slot_handler,
+								   sql.data);
+			run_command(free_slot, sql.data, &cparams);
+		}
+		else
+		{
+			if (opts.verbose)
+			{
+				if (opts.show_progress && progress_since_last_stderr)
+					fprintf(stderr, "\n");
+
+				pg_log_info(ngettext("checking btree index \"%s\".\"%s\".\"%s\" (oid %u) (%u/%u page)",
+									 "checking btree index \"%s\".\"%s\".\"%s\" (oid %u) (%u/%u pages)",
+									 rel->relpages),
+							rel->datinfo->datname, rel->nspname, rel->relname,
+							rel->reloid, rel->blocks_to_check, rel->relpages);
+				progress_since_last_stderr = false;
+			}
+			prepare_btree_command(&sql, rel, free_slot->connection);
+			ParallelSlotSetHandler(free_slot, verify_btree_slot_handler, NULL);
+			run_command(free_slot, sql.data, &cparams);
+		}
+	}
+	termPQExpBuffer(&sql);
+
+	if (!failed)
+	{
+		/*
+		 * Wait for all slots to complete, or for one to indicate that an error
+		 * occurred.  Like above, we rely on the handler emitting the necessary
+		 * error messages.
+		 */
+		if (sa && !ParallelSlotsWaitCompletion(sa))
+			failed = true;
+
+		progress_report(reltotal, relprogress, pagestotal, pageschecked, NULL, true, true);
+	}
+
+	if (sa)
+	{
+		ParallelSlotsTerminate(sa);
+		pg_free(sa);
+	}
+
+	if (failed)
+		exit(1);
+
+	if (!all_checks_pass)
+		exit(2);
+}
+
+/*
+ * prepare_table_command
+ *
+ * Creates a SQL command for running amcheck checking on the given heap
+ * relation.  The command is phrased as a SQL query, with column order and
+ * names matching the expectations of verify_heapam_slot_handler, which will
+ * receive and handle each row returned from the verify_heapam() function.
+ *
+ * sql: buffer into which the table checking command will be written
+ * rel: relation information for the table to be checked
+ * conn: the connection to be used, for string escaping purposes
+ */
+static void
+prepare_table_command(PQExpBuffer sql, RelationInfo *rel, PGconn *conn)
+{
+	resetPQExpBuffer(sql);
+	appendPQExpBufferStr(sql, "SELECT ");
+	appendStringLiteralConn(sql, rel->nspname, conn);
+	appendPQExpBufferStr(sql, " AS nspname, ");
+	appendStringLiteralConn(sql, rel->relname, conn);
+	appendPQExpBufferStr(sql, " AS relname, ");
+	appendPQExpBuffer(sql,
+					  "blkno, offnum, attnum, msg"
+					  "\nFROM %s.verify_heapam("
+					  "relation := %u, on_error_stop := %s,"
+					  "check_toast := %s, skip := '%s'",
+					  rel->datinfo->amcheck_schema,
+					  rel->reloid,
+					  opts.on_error_stop ? "true" : "false",
+					  opts.reconcile_toast ? "true" : "false",
+					  opts.skip);
+
+	if (opts.startblock >= 0)
+	{
+		if (opts.startblock < rel->relpages && rel->relpages > 0)
+		{
+			appendPQExpBuffer(sql, ", startblock := %ld", opts.startblock);
+		}
+		else if (!opts.quiet)
+		{
+			pg_log_warning("ignoring startblock option %ld beyond end of table \"%s\".\"%s\".\"%s\"",
+							opts.startblock, rel->datinfo->datname,
+							rel->nspname, rel->relname);
+			progress_since_last_stderr = false;
+		}
+	}
+	if (opts.endblock >= 0)
+	{
+		if (opts.endblock < rel->relpages)
+			appendPQExpBuffer(sql, ", endblock := %ld", opts.endblock);
+		else if (!opts.quiet)
+		{
+			pg_log_warning("ignoring endblock option %ld beyond end of table \"%s\".\"%s\".\"%s\"",
+						   opts.endblock, rel->datinfo->datname, rel->nspname,
+						   rel->relname);
+			progress_since_last_stderr = false;
+		}
+	}
+
+	appendPQExpBufferStr(sql, ")");
+}
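
For illustration, with default options the query assembled here comes out roughly like the following (the schema `public`, relation `t`, OID 16384, the amcheck schema name, and the option values are all made-up examples, not output of the patch):

```sql
SELECT 'public' AS nspname, 't' AS relname, blkno, offnum, attnum, msg
FROM amcheck.verify_heapam(relation := 16384, on_error_stop := false,
                           check_toast := false, skip := 'none')
```

The `nspname`/`relname` columns exist only so that verify_heapam_slot_handler can label each corruption row without a second catalog lookup.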
+
+/*
+ * prepare_btree_command
+ *
+ * Creates a SQL command for running amcheck checking on the given btree index
+ * relation.  The command does not select any columns, as btree checking
+ * functions do not return any, but rather return corruption information by
+ * raising errors, which verify_btree_slot_handler expects.
+ *
+ * sql: buffer into which the index checking command will be written
+ * rel: relation information for the index to be checked
+ * conn: the connection to be used, for string escaping purposes
+ */
+static void
+prepare_btree_command(PQExpBuffer sql, RelationInfo *rel, PGconn *conn)
+{
+	resetPQExpBuffer(sql);
+
+	/*
+	 * Embed the database, schema, and relation name in the query, so if
+	 * the check throws an error, the user knows which relation the error
+	 * came from.
+	 */
+	appendPQExpBuffer(sql, "SELECT %u AS oid, ", rel->reloid);
+	appendStringLiteralConn(sql, rel->datinfo->datname, conn);
+	appendPQExpBufferStr(sql, " AS datname, ");
+	appendStringLiteralConn(sql, rel->nspname, conn);
+	appendPQExpBufferStr(sql, " AS nspname, ");
+	appendStringLiteralConn(sql, rel->relname, conn);
+	appendPQExpBufferStr(sql, " AS relname");
+
+	if (opts.parent_check)
+		appendPQExpBuffer(sql,
+						  "\nFROM %s.bt_index_parent_check("
+						  "\nindex := '%u'::regclass, heapallindexed := %s,"
+						  "\nrootdescend := %s)",
+						  rel->datinfo->amcheck_schema,
+						  rel->reloid,
+						  (opts.heapallindexed ? "true" : "false"),
+						  (opts.rootdescend ? "true" : "false"));
+	else
+		appendPQExpBuffer(sql,
+						  "\nFROM %s.bt_index_check("
+						  "\nindex := '%u'::regclass,"
+						  "\nheapallindexed := %s)",
+						  rel->datinfo->amcheck_schema,
+						  rel->reloid,
+						  (opts.heapallindexed ? "true" : "false"));
+}
+
+/*
+ * run_command
+ *
+ * Sends a command to the server without waiting for the command to complete.
+ * Logs an error if the command cannot be sent, but otherwise any errors are
+ * expected to be handled by a ParallelSlotHandler.
+ *
+ * If reconnecting to the database is necessary, the cparams argument may be
+ * modified.
+ *
+ * slot: slot with connection to the server we should use for the command
+ * sql: query to send
+ * cparams: connection parameters in case the slot needs to be reconnected
+ */
+static void
+run_command(ParallelSlot *slot, const char *sql, ConnParams *cparams)
+{
+	if (opts.echo)
+		printf("%s\n", sql);
+
+	if (PQsendQuery(slot->connection, sql) == 0)
+	{
+		pg_log_error("error sending command to database \"%s\": %s",
+					 PQdb(slot->connection),
+					 PQerrorMessage(slot->connection));
+		pg_log_error("command was: %s", sql);
+		exit(1);
+	}
+}
+
+/*
+ * should_processing_continue
+ *
+ * Checks a query result returned from a query (presumably issued on a slot's
+ * connection) to determine if parallel slots should continue issuing further
+ * commands.
+ *
+ * Note: Heap relation corruption is reported by verify_heapam() via the result
+ * set, rather than an ERROR, but running verify_heapam() on a corrupted table
+ * may still result in an error being returned from the server due to missing
+ * relation files, bad checksums, etc.  The btree corruption checking functions
+ * always use errors to communicate corruption messages.  We can't just abort
+ * processing because we got a mere ERROR.
+ *
+ * res: result from an executed sql query
+ */
+static bool
+should_processing_continue(PGresult *res)
+{
+	const char *severity;
+
+	switch (PQresultStatus(res))
+	{
+			/* These are expected and ok */
+		case PGRES_COMMAND_OK:
+		case PGRES_TUPLES_OK:
+		case PGRES_NONFATAL_ERROR:
+			break;
+
+			/* This is expected but requires closer scrutiny */
+		case PGRES_FATAL_ERROR:
+			severity = PQresultErrorField(res, PG_DIAG_SEVERITY_NONLOCALIZED);
+			if (severity == NULL)
+				return false;	/* libpq-generated error; no severity field */
+			if (strcmp(severity, "FATAL") == 0)
+				return false;
+			if (strcmp(severity, "PANIC") == 0)
+				return false;
+			break;
+
+			/* These are unexpected */
+		case PGRES_BAD_RESPONSE:
+		case PGRES_EMPTY_QUERY:
+		case PGRES_COPY_OUT:
+		case PGRES_COPY_IN:
+		case PGRES_COPY_BOTH:
+		case PGRES_SINGLE_TUPLE:
+			return false;
+	}
+	return true;
+}
+
+/*
+ * verify_heapam_slot_handler
+ *
+ * ParallelSlotHandler that receives results from a table checking command
+ * created by prepare_table_command and outputs the results for the user.
+ *
+ * res: result from an executed sql query
+ * conn: connection on which the sql query was executed
+ * context: the sql query being handled, as a cstring
+ */
+static bool
+verify_heapam_slot_handler(PGresult *res, PGconn *conn, void *context)
+{
+	if (PQresultStatus(res) == PGRES_TUPLES_OK)
+	{
+		int			i;
+		int			ntups = PQntuples(res);
+
+		if (ntups > 0)
+			all_checks_pass = false;
+
+		for (i = 0; i < ntups; i++)
+		{
+			if (!PQgetisnull(res, i, 4))
+				printf("relation %s.%s.%s, block %s, offset %s, attribute %s\n    %s\n",
+					   PQdb(conn),
+					   PQgetvalue(res, i, 0),	/* schema */
+					   PQgetvalue(res, i, 1),	/* relname */
+					   PQgetvalue(res, i, 2),	/* blkno */
+					   PQgetvalue(res, i, 3),	/* offnum */
+					   PQgetvalue(res, i, 4),	/* attnum */
+					   PQgetvalue(res, i, 5));	/* msg */
+
+			else if (!PQgetisnull(res, i, 3))
+				printf("relation %s.%s.%s, block %s, offset %s\n    %s\n",
+					   PQdb(conn),
+					   PQgetvalue(res, i, 0),	/* schema */
+					   PQgetvalue(res, i, 1),	/* relname */
+					   PQgetvalue(res, i, 2),	/* blkno */
+					   PQgetvalue(res, i, 3),	/* offnum */
+				/* attnum is null: 4 */
+					   PQgetvalue(res, i, 5));	/* msg */
+
+			else if (!PQgetisnull(res, i, 2))
+				printf("relation %s.%s.%s, block %s\n    %s\n",
+					   PQdb(conn),
+					   PQgetvalue(res, i, 0),	/* schema */
+					   PQgetvalue(res, i, 1),	/* relname */
+					   PQgetvalue(res, i, 2),	/* blkno */
+				/* offnum is null: 3 */
+				/* attnum is null: 4 */
+					   PQgetvalue(res, i, 5));	/* msg */
+
+			else if (!PQgetisnull(res, i, 1))
+				printf("relation %s.%s.%s\n    %s\n",
+					   PQdb(conn),
+					   PQgetvalue(res, i, 0),	/* schema */
+					   PQgetvalue(res, i, 1),	/* relname */
+				/* blkno is null:  2 */
+				/* offnum is null: 3 */
+				/* attnum is null: 4 */
+					   PQgetvalue(res, i, 5));	/* msg */
+
+			else
+				printf("%s.%s\n",
+					   PQdb(conn),
+					   PQgetvalue(res, i, 5));	/* msg */
+		}
+	}
+	else
+	{
+		all_checks_pass = false;
+		printf("%s: %s\n", PQdb(conn), PQerrorMessage(conn));
+		printf("%s: query was: %s\n", PQdb(conn), (const char *) context);
+	}
+
+	return should_processing_continue(res);
+}
+
+/*
+ * verify_btree_slot_handler
+ *
+ * ParallelSlotHandler that receives results from a btree checking command
+ * created by prepare_btree_command and outputs them for the user.  The result
+ * set from a btree checking command is expected to be empty; if the command
+ * fails with an error instead, the details of any corruption are reported in
+ * the connection's error message.
+ *
+ * res: result from an executed sql query
+ * conn: connection on which the sql query was executed
+ * context: unused
+ */
+static bool
+verify_btree_slot_handler(PGresult *res, PGconn *conn, void *context)
+{
+	if (PQresultStatus(res) == PGRES_TUPLES_OK)
+	{
+		int			ntups = PQntuples(res);
+
+		if (ntups != 1)
+		{
+			/*
+			 * We expect the btree checking functions to return one void row
+			 * each, so we should output some sort of warning if we get
+			 * anything else, not because it indicates corruption, but because
+			 * it suggests a mismatch between amcheck and pg_amcheck versions.
+			 *
+			 * In conjunction with --progress, anything written to stderr at
+			 * this time would present strangely to the user without an extra
+			 * newline, so we print one.  If we were multithreaded, we'd have
+			 * to avoid splitting this across multiple calls, but we're in an
+			 * event loop, so it doesn't matter.
+			 */
+			if (opts.show_progress && progress_since_last_stderr)
+				fprintf(stderr, "\n");
+			pg_log_warning("btree checking function returned unexpected number of rows: %d",
+						   ntups);
+			pg_log_warning("are %s's and amcheck's versions compatible?",
+						   progname);
+			progress_since_last_stderr = false;
+		}
+	}
+	else
+	{
+		all_checks_pass = false;
+		printf("%s: %s\n", PQdb(conn), PQerrorMessage(conn));
+	}
+
+	return should_processing_continue(res);
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_amcheck"
+ */
+static void
+help(const char *progname)
+{
+	printf("%s uses the amcheck module to check objects in a PostgreSQL database for corruption.\n\n", progname);
+	printf("Usage:\n");
+	printf("  %s [OPTION]... [DBNAME]\n", progname);
+	printf("\nTarget Options:\n");
+	printf("  -a, --all                      check all databases\n");
+	printf("  -d, --database=PATTERN         check matching database(s)\n");
+	printf("  -D, --exclude-database=PATTERN do NOT check matching database(s)\n");
+	printf("  -i, --index=PATTERN            check matching index(es)\n");
+	printf("  -I, --exclude-index=PATTERN    do NOT check matching index(es)\n");
+	printf("  -r, --relation=PATTERN         check matching relation(s)\n");
+	printf("  -R, --exclude-relation=PATTERN do NOT check matching relation(s)\n");
+	printf("  -s, --schema=PATTERN           check matching schema(s)\n");
+	printf("  -S, --exclude-schema=PATTERN   do NOT check matching schema(s)\n");
+	printf("  -t, --table=PATTERN            check matching table(s)\n");
+	printf("  -T, --exclude-table=PATTERN    do NOT check matching table(s)\n");
+	printf("      --no-dependent-indexes     do NOT expand list of relations to include indexes\n");
+	printf("      --no-dependent-toast       do NOT expand list of relations to include toast\n");
+	printf("      --no-strict-names          do NOT require patterns to match objects\n");
+	printf("\nTable Checking Options:\n");
+	printf("      --exclude-toast-pointers   do NOT follow relation toast pointers\n");
+	printf("      --on-error-stop            stop checking at end of first corrupt page\n");
+	printf("      --skip=OPTION              do NOT check \"all-frozen\" or \"all-visible\" blocks\n");
+	printf("      --startblock=BLOCK         begin checking table(s) at the given block number\n");
+	printf("      --endblock=BLOCK           check table(s) only up to the given block number\n");
+	printf("\nBtree Index Checking Options:\n");
+	printf("      --heapallindexed           check all heap tuples are found within indexes\n");
+	printf("      --parent-check             check index parent/child relationships\n");
+	printf("      --rootdescend              search from root page to refind tuples\n");
+	printf("\nConnection options:\n");
+	printf("  -h, --host=HOSTNAME            database server host or socket directory\n");
+	printf("  -p, --port=PORT                database server port\n");
+	printf("  -U, --username=USERNAME        user name to connect as\n");
+	printf("  -w, --no-password              never prompt for password\n");
+	printf("  -W, --password                 force password prompt\n");
+	printf("      --maintenance-db=DBNAME    alternate maintenance database\n");
+	printf("\nOther Options:\n");
+	printf("  -e, --echo                     show the commands being sent to the server\n");
+	printf("  -j, --jobs=NUM                 use this many concurrent connections to the server\n");
+	printf("  -q, --quiet                    don't write any messages\n");
+	printf("  -v, --verbose                  write a lot of output\n");
+	printf("  -V, --version                  output version information, then exit\n");
+	printf("  -P, --progress                 show progress information\n");
+	printf("  -?, --help                     show this help, then exit\n");
+
+	printf("\nReport bugs to <%s>.\n", PACKAGE_BUGREPORT);
+	printf("%s home page: <%s>\n", PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Print a progress report based on the global variables.
+ *
+ * The progress report is written at most once per second, unless the force
+ * parameter is set to true.
+ *
+ * If finished is set to true, this is the last progress report. The cursor
+ * is moved to the next line.
+ */
+static void
+progress_report(uint64 relations_total, uint64 relations_checked,
+				uint64 relpages_total, uint64 relpages_checked,
+				const char *datname, bool force, bool finished)
+{
+	int			percent_rel = 0;
+	int			percent_pages = 0;
+	char		checked_rel[32];
+	char		total_rel[32];
+	char		checked_pages[32];
+	char		total_pages[32];
+	pg_time_t	now;
+
+	if (!opts.show_progress)
+		return;
+
+	now = time(NULL);
+	if (now == last_progress_report && !force && !finished)
+		return;					/* Max once per second */
+
+	last_progress_report = now;
+	if (relations_total)
+		percent_rel = (int) (relations_checked * 100 / relations_total);
+	if (relpages_total)
+		percent_pages = (int) (relpages_checked * 100 / relpages_total);
+
+	/*
+	 * Separate step to keep platform-dependent format code out of fprintf
+	 * calls.  We only test for INT64_FORMAT availability in snprintf, not
+	 * fprintf.
+	 */
+	snprintf(checked_rel, sizeof(checked_rel), INT64_FORMAT, relations_checked);
+	snprintf(total_rel, sizeof(total_rel), INT64_FORMAT, relations_total);
+	snprintf(checked_pages, sizeof(checked_pages), INT64_FORMAT, relpages_checked);
+	snprintf(total_pages, sizeof(total_pages), INT64_FORMAT, relpages_total);
+
+#define VERBOSE_DATNAME_LENGTH 35
+	if (opts.verbose)
+	{
+		if (!datname)
+
+			/*
+			 * No datname given, so clear the status line (used for first and
+			 * last call)
+			 */
+			fprintf(stderr,
+					"%*s/%s relations (%d%%) %*s/%s pages (%d%%) %*s",
+					(int) strlen(total_rel),
+					checked_rel, total_rel, percent_rel,
+					(int) strlen(total_pages),
+					checked_pages, total_pages, percent_pages,
+					VERBOSE_DATNAME_LENGTH + 2, "");
+		else
+		{
+			bool		truncate = (strlen(datname) > VERBOSE_DATNAME_LENGTH);
+
+			fprintf(stderr,
+					"%*s/%s relations (%d%%) %*s/%s pages (%d%%), (%s%-*.*s)",
+					(int) strlen(total_rel),
+					checked_rel, total_rel, percent_rel,
+					(int) strlen(total_pages),
+					checked_pages, total_pages, percent_pages,
+			/* Prefix with "..." if we do leading truncation */
+					truncate ? "..." : "",
+					truncate ? VERBOSE_DATNAME_LENGTH - 3 : VERBOSE_DATNAME_LENGTH,
+					truncate ? VERBOSE_DATNAME_LENGTH - 3 : VERBOSE_DATNAME_LENGTH,
+			/* Truncate datname at beginning if it's too long */
+					truncate ? datname + strlen(datname) - VERBOSE_DATNAME_LENGTH + 3 : datname);
+		}
+	}
+	else
+		fprintf(stderr,
+				"%*s/%s relations (%d%%) %*s/%s pages (%d%%)",
+				(int) strlen(total_rel),
+				checked_rel, total_rel, percent_rel,
+				(int) strlen(total_pages),
+				checked_pages, total_pages, percent_pages);
+
+	/*
+	 * Stay on the same line if reporting to a terminal and we're not done
+	 * yet.
+	 */
+	if (!finished && isatty(fileno(stderr)))
+	{
+		fputc('\r', stderr);
+		progress_since_last_stderr = true;
+	}
+	else
+		fputc('\n', stderr);
+}
+
+/*
+ * append_database_pattern
+ *
+ * Adds to a list the given pattern interpreted as a database name pattern.
+ *
+ * list: the list to be appended
+ * pattern: the database name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_database_pattern(SimplePtrList *list, const char *pattern, int encoding)
+{
+	PQExpBufferData buf;
+	PatternInfo *info = (PatternInfo *) palloc0(sizeof(PatternInfo));
+
+	info->pattern_id = next_id++;
+
+	initPQExpBuffer(&buf);
+	patternToSQLRegex(encoding, NULL, NULL, &buf, pattern, false);
+	info->pattern = pattern;
+	info->db_regex = pstrdup(buf.data);
+
+	termPQExpBuffer(&buf);
+
+	simple_ptr_list_append(list, info);
+}
+
+/*
+ * append_schema_pattern
+ *
+ * Adds to a list the given pattern interpreted as a schema name pattern.
+ *
+ * list: the list to be appended
+ * pattern: the schema name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_schema_pattern(SimplePtrList *list, const char *pattern, int encoding)
+{
+	PQExpBufferData buf;
+	PatternInfo *info = (PatternInfo *) palloc0(sizeof(PatternInfo));
+
+	info->pattern_id = next_id++;
+
+	initPQExpBuffer(&buf);
+	patternToSQLRegex(encoding, NULL, NULL, &buf, pattern, false);
+	info->pattern = pattern;
+	info->nsp_regex = pstrdup(buf.data);
+	termPQExpBuffer(&buf);
+
+	simple_ptr_list_append(list, info);
+}
+
+/*
+ * append_relation_pattern_helper
+ *
+ * Adds to a list the given pattern interpreted as a relation pattern.
+ *
+ * list: the list to be appended
+ * pattern: the relation name pattern
+ * encoding: client encoding for parsing the pattern
+ * table_only: whether the pattern should only be matched against heap tables
+ * index_only: whether the pattern should only be matched against btree indexes
+ */
+static void
+append_relation_pattern_helper(SimplePtrList *list, const char *pattern,
+							   int encoding, bool table_only, bool index_only)
+{
+	PQExpBufferData dbbuf;
+	PQExpBufferData nspbuf;
+	PQExpBufferData relbuf;
+	PatternInfo *info = (PatternInfo *) palloc0(sizeof(PatternInfo));
+
+	info->pattern_id = next_id++;
+
+	initPQExpBuffer(&dbbuf);
+	initPQExpBuffer(&nspbuf);
+	initPQExpBuffer(&relbuf);
+
+	patternToSQLRegex(encoding, &dbbuf, &nspbuf, &relbuf, pattern, false);
+	info->pattern = pattern;
+	if (dbbuf.data[0])
+		info->db_regex = pstrdup(dbbuf.data);
+	if (nspbuf.data[0])
+		info->nsp_regex = pstrdup(nspbuf.data);
+	if (relbuf.data[0])
+		info->rel_regex = pstrdup(relbuf.data);
+
+	termPQExpBuffer(&dbbuf);
+	termPQExpBuffer(&nspbuf);
+	termPQExpBuffer(&relbuf);
+
+	info->table_only = table_only;
+	info->index_only = index_only;
+
+	simple_ptr_list_append(list, info);
+}
+
+/*
+ * append_relation_pattern
+ *
+ * Adds to a list the given pattern interpreted as a relation pattern, to be
+ * matched against both tables and indexes.
+ *
+ * list: the list to be appended
+ * pattern: the relation name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_relation_pattern(SimplePtrList *list, const char *pattern, int encoding)
+{
+	append_relation_pattern_helper(list, pattern, encoding, false, false);
+}
+
+/*
+ * append_table_pattern
+ *
+ * Adds to a list the given pattern interpreted as a relation pattern, to be
+ * matched only against tables.
+ *
+ * list: the list to be appended
+ * pattern: the relation name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_table_pattern(SimplePtrList *list, const char *pattern, int encoding)
+{
+	append_relation_pattern_helper(list, pattern, encoding, true, false);
+}
+
+/*
+ * append_index_pattern
+ *
+ * Adds to a list the given pattern interpreted as a relation pattern, to be
+ * matched only against indexes.
+ *
+ * list: the list to be appended
+ * pattern: the relation name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_index_pattern(SimplePtrList *list, const char *pattern, int encoding)
+{
+	append_relation_pattern_helper(list, pattern, encoding, false, true);
+}
+
+/*
+ * append_db_pattern_cte
+ *
+ * Appends to the buffer the body of a Common Table Expression (CTE) containing
+ * the database portions filtered from the list of patterns, expressed as three
+ * columns:
+ *
+ *     id: the unique pattern ID
+ *     pat: the full user specified pattern from the command line
+ *     rgx: the database regular expression parsed from the pattern
+ *
+ * Patterns without a database portion are skipped.  Patterns with more than
+ * just a database portion are optionally skipped, depending on argument
+ * 'inclusive'.
+ *
+ * buf: the buffer to be appended
+ * patterns: the list of patterns to be inserted into the CTE
+ * conn: the database connection
+ * inclusive: whether to include patterns with schema and/or relation parts
+ *
+ * Returns whether any database patterns were appended.
+ */
+static bool
+append_db_pattern_cte(PQExpBuffer buf, const SimplePtrList *patterns,
+					  PGconn *conn, bool inclusive)
+{
+	SimplePtrListCell *cell;
+	const char *comma;
+	bool		have_values;
+
+	comma = "";
+	have_values = false;
+	for (cell = patterns->head; cell; cell = cell->next)
+	{
+		PatternInfo *info = (PatternInfo *) cell->ptr;
+
+		if (info->db_regex != NULL &&
+			(inclusive || (info->nsp_regex == NULL && info->rel_regex == NULL)))
+		{
+			if (!have_values)
+				appendPQExpBufferStr(buf, "\nVALUES");
+			have_values = true;
+			appendPQExpBuffer(buf, "%s\n(%d, ", comma, info->pattern_id);
+			appendStringLiteralConn(buf, info->pattern, conn);
+			appendPQExpBufferStr(buf, ", ");
+			appendStringLiteralConn(buf, info->db_regex, conn);
+			appendPQExpBufferStr(buf, ")");
+			comma = ",";
+		}
+	}
+
+	if (!have_values)
+		appendPQExpBufferStr(buf, "\nSELECT NULL, NULL, NULL WHERE false");
+
+	return have_values;
+}
+
+/*
+ * compile_database_list
+ *
+ * Compiles a list of all databases to check from the string list of database
+ * names in opts.dbnames plus the database portions of all patterns in
+ * opts.include
+ */
+static void
+compile_database_list(PGconn *conn, SimplePtrList *databases)
+{
+	compile_database_list_from_all(conn, databases);
+	if (!databases->head)
+		compile_database_list_from_dbnames(databases);
+}
+
+/*
+ * compile_database_list_from_dbnames
+ *
+ * Transforms the string list of database names in opts.dbnames into a pointer
+ * list of DatabaseInfo structs.  This behavior is chosen to match the output
+ * of compile_database_list_from_all().
+ */
+static void
+compile_database_list_from_dbnames(SimplePtrList *databases)
+{
+	SimpleStringListCell *outer;
+
+	for (outer = opts.dbnames.head; outer; outer = outer->next)
+	{
+		DatabaseInfo *dat;
+		SimpleStringListCell *inner;
+		bool	duplicate = false;
+
+		/* Check if we've already seen this database name */
+		for (inner = opts.dbnames.head; inner != outer; inner = inner->next)
+			if (strcmp(outer->val, inner->val) == 0)
+			{
+				duplicate = true;
+				break;
+			}
+
+		/* Skip appending duplicates to the list */
+		if (duplicate)
+			continue;
+
+		dat = (DatabaseInfo *) palloc0(sizeof(DatabaseInfo));
+		dat->datname = pstrdup(outer->val);
+		simple_ptr_list_append(databases, dat);
+	}
+}
+
+/*
+ * compile_database_list_from_all
+ *
+ * If any database patterns exist, or if --all was given, compiles a distinct
+ * list of databases to check using a SQL query based on the patterns plus any
+ * literal database names.  If no database patterns exist and --all was not
+ * given, the query is not necessary, and no list is compiled.
+ *
+ * conn: connection to the initial database
+ * databases: the list onto which databases should be appended
+ */
+static void
+compile_database_list_from_all(PGconn *conn, SimplePtrList *databases)
+{
+	PGresult   *res;
+	PQExpBufferData sql;
+	int			ntups;
+	int			i;
+	bool		fatal;
+
+	initPQExpBuffer(&sql);
+
+	/* Append the include patterns CTE. */
+	appendPQExpBufferStr(&sql, "WITH include_raw (id, pat, rgx) AS (");
+	if (!append_db_pattern_cte(&sql, &opts.include, conn, true) &&
+		!opts.alldb)
+	{
+		/*
+		 * None of the inclusion patterns (if any) contain database portions,
+		 * so there is no need to query the database to resolve database
+		 * patterns.
+		 *
+		 * Since we're also not operating under --all, we don't need to query
+		 * the exhaustive list of connectable databases, either.
+		 */
+		termPQExpBuffer(&sql);
+		return;
+	}
+
+	/* Append the exclude patterns CTE. */
+	appendPQExpBufferStr(&sql, "\n),\nexclude_raw (id, pat, rgx) AS (");
+	append_db_pattern_cte(&sql, &opts.exclude, conn, false);
+	appendPQExpBufferStr(&sql, "\n),");
+
+	/*
+	 * Append the database CTE, which selects only databases that allow
+	 * connections, joining against exclude_raw to filter out databases
+	 * matching any exclusion pattern.
+	 */
+	appendPQExpBufferStr(&sql,
+						 "\ndatabase (datname) AS ("
+						 "\nSELECT d.datname"
+						 "\nFROM pg_catalog.pg_database d"
+						 "\nLEFT OUTER JOIN exclude_raw e"
+						 "\nON d.datname ~ e.rgx"
+						 "\nWHERE d.datallowconn"
+						 "\nAND e.id IS NULL"
+						 "\n),"
+
+	/*
+	 * Append the include_pat CTE, which joins the include_raw CTE against the
+	 * database CTE to determine whether each inclusion pattern matched at
+	 * least one connectable, non-excluded database.
+	 */
+						 "\ninclude_pat (id, pat, checkable) AS ("
+						 "\nSELECT i.id, i.pat,"
+						 "\nCOUNT(*) FILTER ("
+						 "\nWHERE d IS NOT NULL"
+						 "\n) AS checkable"
+						 "\nFROM include_raw i"
+						 "\nLEFT OUTER JOIN database d"
+						 "\nON d.datname ~ i.rgx"
+						 "\nGROUP BY i.id, i.pat"
+						 "\n),"
+
+	/*
+	 * Append the filtered_databases CTE, which selects from the database CTE
+	 * optionally joined against the include_raw CTE to only select databases
+	 * that match an inclusion pattern.  This appears to duplicate what the
+	 * include_pat CTE already did above, but here we want only databases, and
+	 * there we wanted patterns.  Also, here we include literal database names
+	 * from the command line, whereas there we only used database patterns.
+	 */
+						 "\nfiltered_databases (datname) AS ("
+						 "\nSELECT DISTINCT d.datname"
+						 "\nFROM database d");
+	if (!opts.alldb)
+		appendPQExpBufferStr(&sql,
+							 "\nINNER JOIN include_raw i"
+							 "\nON d.datname ~ i.rgx");
+	if (opts.dbnames.head != NULL)
+	{
+		SimpleStringListCell *cell;
+		bool		need_comma = false;
+
+		/* Append the database name literals */
+		appendPQExpBufferStr(&sql,
+							 "\nUNION"
+							 "\nVALUES");
+		for (cell = opts.dbnames.head; cell; cell = cell->next)
+		{
+			if (need_comma)
+				appendPQExpBufferStr(&sql, ", (");
+			else
+				appendPQExpBufferStr(&sql, " (");
+			appendStringLiteralConn(&sql, cell->val, conn);
+			appendPQExpBufferStr(&sql, ")");
+			need_comma = true;
+		}
+	}
+	appendPQExpBufferStr(&sql,
+						 "\n)"
+
+	/*
+	 * Select the checkable databases and the unmatched inclusion patterns.
+	 */
+						 "\nSELECT pat, datname"
+						 "\nFROM ("
+						 "\nSELECT id, pat, NULL::TEXT AS datname"
+						 "\nFROM include_pat"
+						 "\nWHERE checkable = 0"
+						 "\nUNION ALL"
+						 "\nSELECT NULL, NULL, datname"
+						 "\nFROM filtered_databases"
+						 "\n) AS combined_records"
+						 "\nORDER BY id NULLS LAST, datname");
+
+	res = executeQuery(conn, sql.data, opts.echo);
+	if (PQresultStatus(res) != PGRES_TUPLES_OK)
+	{
+		pg_log_error("query failed: %s", PQerrorMessage(conn));
+		pg_log_error("query was: %s", sql.data);
+		disconnectDatabase(conn);
+		exit(1);
+	}
+	termPQExpBuffer(&sql);
+
+	ntups = PQntuples(res);
+	for (fatal = false, i = 0; i < ntups; i++)
+	{
+		const char *pat = NULL;
+		const char *datname = NULL;
+
+		if (!PQgetisnull(res, i, 0))
+			pat = PQgetvalue(res, i, 0);
+		if (!PQgetisnull(res, i, 1))
+			datname = PQgetvalue(res, i, 1);
+
+		if (pat != NULL)
+		{
+			/*
+			 * Current record pertains to an inclusion pattern that matched no
+			 * checkable databases.
+			 */
+			fatal = opts.strict_names;
+			log_no_match("no connectable databases to check matching \"%s\"",
+						 pat);
+		}
+		else
+		{
+			/* Current record pertains to a database */
+			Assert(datname != NULL);
+
+			DatabaseInfo *dat = (DatabaseInfo *) palloc0(sizeof(DatabaseInfo));
+
+			/* This database is included.  Add to list */
+			if (opts.verbose)
+				pg_log_warning("including database: \"%s\"", datname);
+
+			dat->datname = pstrdup(datname);
+			simple_ptr_list_append(databases, dat);
+		}
+	}
+	PQclear(res);
+
+	if (fatal)
+	{
+		disconnectDatabase(conn);
+		exit(1);
+	}
+}
+
+/*
+ * append_rel_pattern_raw_cte
+ *
+ * Appends to the buffer the body of a Common Table Expression (CTE) containing
+ * the patterns from the given list as seven columns:
+ *
+ *     id: the unique pattern ID
+ *     pat: the full user specified pattern from the command line
+ *     db_regex: the database regexp parsed from the pattern, or NULL if the
+ *               pattern had no database part
+ *     nsp_regex: the namespace regexp parsed from the pattern, or NULL if the
+ *                pattern had no namespace part
+ *     rel_regex: the relname regexp parsed from the pattern, or NULL if the
+ *                pattern had no relname part
+ *     table_only: true if the pattern applies only to tables (not indexes)
+ *     index_only: true if the pattern applies only to indexes (not tables)
+ *
+ * buf: the buffer to be appended
+ * patterns: the list of patterns to be inserted into the CTE
+ * conn: the database connection
+ */
+static void
+append_rel_pattern_raw_cte(PQExpBuffer buf, const SimplePtrList *patterns,
+						   PGconn *conn)
+{
+	SimplePtrListCell *cell;
+	const char *comma;
+	bool		have_values;
+
+	comma = "";
+	have_values = false;
+	for (cell = patterns->head; cell; cell = cell->next)
+	{
+		PatternInfo *info = (PatternInfo *) cell->ptr;
+
+		if (!have_values)
+			appendPQExpBufferStr(buf, "\nVALUES");
+		have_values = true;
+		appendPQExpBuffer(buf, "%s\n(%d::INTEGER, ", comma, info->pattern_id);
+		appendStringLiteralConn(buf, info->pattern, conn);
+		appendPQExpBufferStr(buf, "::TEXT, ");
+		if (info->db_regex == NULL)
+			appendPQExpBufferStr(buf, "NULL");
+		else
+			appendStringLiteralConn(buf, info->db_regex, conn);
+		appendPQExpBufferStr(buf, "::TEXT, ");
+		if (info->nsp_regex == NULL)
+			appendPQExpBufferStr(buf, "NULL");
+		else
+			appendStringLiteralConn(buf, info->nsp_regex, conn);
+		appendPQExpBufferStr(buf, "::TEXT, ");
+		if (info->rel_regex == NULL)
+			appendPQExpBufferStr(buf, "NULL");
+		else
+			appendStringLiteralConn(buf, info->rel_regex, conn);
+		if (info->table_only)
+			appendPQExpBufferStr(buf, "::TEXT, true::BOOLEAN");
+		else
+			appendPQExpBufferStr(buf, "::TEXT, false::BOOLEAN");
+		if (info->index_only)
+			appendPQExpBufferStr(buf, ", true::BOOLEAN");
+		else
+			appendPQExpBufferStr(buf, ", false::BOOLEAN");
+		appendPQExpBufferStr(buf, ")");
+		comma = ",";
+	}
+
+	if (!have_values)
+		appendPQExpBufferStr(buf,
+							 "\nSELECT NULL::INTEGER, NULL::TEXT, NULL::TEXT,"
+							 "\nNULL::TEXT, NULL::TEXT, NULL::BOOLEAN,"
+							 "\nNULL::BOOLEAN"
+							 "\nWHERE false");
+}
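append_rel_pattern_raw_cte() builds its VALUES list with the usual empty-then-comma separator idiom, so no leading or trailing comma is ever emitted. A minimal standalone sketch, with snprintf into a fixed buffer standing in for PQExpBuffer (hypothetical helper, not part of the patch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Join items[] as "(a), (b), (c)" into buf.  The separator starts out
 * empty and becomes ", " after the first item is written, the same
 * pattern the patch uses with its 'comma' variable.
 */
static void
join_values(const char **items, int n, char *buf, size_t buflen)
{
	const char *comma = "";
	size_t		used = 0;

	buf[0] = '\0';
	for (int i = 0; i < n && used < buflen; i++)
	{
		used += snprintf(buf + used, buflen - used, "%s(%s)",
						 comma, items[i]);
		comma = ", ";
	}
}
```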
+
+/*
+ * append_rel_pattern_filtered_cte
+ *
+ * Appends to the buffer a Common Table Expression (CTE) which selects
+ * all patterns from the named raw CTE, filtered by database.  All patterns
+ * which have no database portion or whose database portion matches our
+ * connection's database name are selected, with other patterns excluded.
+ *
+ * The basic idea here is that if we're connected to database "foo" and we have
+ * patterns "foo.bar.baz", "alpha.beta" and "one.two.three", we only want to
+ * use the first two while processing relations in this database, as the third
+ * one is not relevant.
+ *
+ * buf: the buffer to be appended
+ * raw: the name of the CTE to select from
+ * filtered: the name of the CTE to create
+ * conn: the database connection
+ */
+static void
+append_rel_pattern_filtered_cte(PQExpBuffer buf, const char *raw,
+								const char *filtered, PGconn *conn)
+{
+	appendPQExpBuffer(buf,
+					  "\n%s (id, pat, nsp_regex, rel_regex, table_only, index_only) AS ("
+					  "\nSELECT id, pat, nsp_regex, rel_regex, table_only, index_only"
+					  "\nFROM %s r"
+					  "\nWHERE (r.db_regex IS NULL"
+					  "\nOR ",
+					  filtered, raw);
+	appendStringLiteralConn(buf, PQdb(conn), conn);
+	appendPQExpBufferStr(buf, " ~ r.db_regex)");
+	appendPQExpBufferStr(buf,
+						 "\nAND (r.nsp_regex IS NOT NULL"
+						 "\nOR r.rel_regex IS NOT NULL)"
+						 "\n),");
+}
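The SQL filter appended here boils down to a simple predicate: a pattern is relevant to the current connection when it has no database part, or when its database part matches the connected database. A sketch with plain string equality standing in for the server-side regex match (hypothetical helper, not part of the patch):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Decide whether a relation pattern applies to the database we are
 * connected to.  db_part is NULL when the pattern had no database
 * portion; the real code compares with a regex on the server side.
 */
static int
pattern_applies(const char *db_part, const char *conndb)
{
	return db_part == NULL || strcmp(db_part, conndb) == 0;
}
```

With this rule, connected to "foo", the patterns "foo.bar.baz" and "alpha.beta" (no database part) apply, while "one.two.three" does not.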
+
+/*
+ * compile_relation_list_one_db
+ *
+ * Compiles a list of relations to check within the currently connected
+ * database based on the user supplied options, sorted by descending size,
+ * and appends them to the given list of relations.
+ *
+ * The cells of the constructed list contain all information about the relation
+ * necessary to connect to the database and check the object, including which
+ * database to connect to, where contrib/amcheck is installed, and the Oid and
+ * type of object (table vs. index).  Rather than duplicating the database
+ * details per relation, the relation structs use references to the same
+ * database object, provided by the caller.
+ *
+ * conn: connection to this next database, which should be the same as in 'dat'
+ * relations: list onto which the relations information should be appended
+ * dat: the database info struct for use by each relation
+ * pagecount: gets incremented by the number of blocks to check in all
+ * relations added
+ */
+static void
+compile_relation_list_one_db(PGconn *conn, SimplePtrList *relations,
+							 const DatabaseInfo *dat,
+							 long long unsigned int *pagecount)
+{
+	PGresult   *res;
+	PQExpBufferData sql;
+	int			ntups;
+	int			i;
+	const char *datname;
+
+	initPQExpBuffer(&sql);
+	appendPQExpBufferStr(&sql, "WITH");
+
+	/* Append CTEs for the relation inclusion patterns, if any */
+	if (!opts.allrel)
+	{
+		appendPQExpBufferStr(&sql,
+							 "\ninclude_raw (id, pat, db_regex, nsp_regex, rel_regex, table_only, index_only) AS (");
+		append_rel_pattern_raw_cte(&sql, &opts.include, conn);
+		appendPQExpBufferStr(&sql, "\n),");
+		append_rel_pattern_filtered_cte(&sql, "include_raw", "include_pat", conn);
+	}
+
+	/* Append CTEs for the relation exclusion patterns, if any */
+	if (opts.excludetbl || opts.excludeidx || opts.excludensp)
+	{
+		appendPQExpBufferStr(&sql,
+							 "\nexclude_raw (id, pat, db_regex, nsp_regex, rel_regex, table_only, index_only) AS (");
+		append_rel_pattern_raw_cte(&sql, &opts.exclude, conn);
+		appendPQExpBufferStr(&sql, "\n),");
+		append_rel_pattern_filtered_cte(&sql, "exclude_raw", "exclude_pat", conn);
+	}
+
+	/* Append the relation CTE. */
+	appendPQExpBufferStr(&sql,
+						 "\nrelation (id, pat, oid, nspname, relname, reltoastrelid, relpages, is_table, is_index) AS ("
+						 "\nSELECT DISTINCT ON (c.oid");
+	if (!opts.allrel)
+		appendPQExpBufferStr(&sql, ", ip.id) ip.id, ip.pat,");
+	else
+		appendPQExpBufferStr(&sql, ") NULL::INTEGER AS id, NULL::TEXT AS pat,");
+	appendPQExpBuffer(&sql,
+					  "\nc.oid, n.nspname, c.relname, c.reltoastrelid, c.relpages,"
+					  "\nc.relam = %u AS is_table,"
+					  "\nc.relam = %u AS is_index"
+					  "\nFROM pg_catalog.pg_class c"
+					  "\nINNER JOIN pg_catalog.pg_namespace n"
+					  "\nON c.relnamespace = n.oid",
+					  HEAP_TABLE_AM_OID, BTREE_AM_OID);
+	if (!opts.allrel)
+		appendPQExpBuffer(&sql,
+						  "\nINNER JOIN include_pat ip"
+						  "\nON (n.nspname ~ ip.nsp_regex OR ip.nsp_regex IS NULL)"
+						  "\nAND (c.relname ~ ip.rel_regex OR ip.rel_regex IS NULL)"
+						  "\nAND (c.relam = %u OR NOT ip.table_only)"
+						  "\nAND (c.relam = %u OR NOT ip.index_only)",
+						  HEAP_TABLE_AM_OID, BTREE_AM_OID);
+	if (opts.excludetbl || opts.excludeidx || opts.excludensp)
+		appendPQExpBuffer(&sql,
+						  "\nLEFT OUTER JOIN exclude_pat ep"
+						  "\nON (n.nspname ~ ep.nsp_regex OR ep.nsp_regex IS NULL)"
+						  "\nAND (c.relname ~ ep.rel_regex OR ep.rel_regex IS NULL)"
+						  "\nAND (c.relam = %u OR NOT ep.table_only)"
+						  "\nAND (c.relam = %u OR NOT ep.index_only)",
+						  HEAP_TABLE_AM_OID, BTREE_AM_OID);
+
+	if (opts.excludetbl || opts.excludeidx || opts.excludensp)
+		appendPQExpBufferStr(&sql, "\nWHERE ep.pat IS NULL");
+	else
+		appendPQExpBufferStr(&sql, "\nWHERE true");
+
+	/*
+	 * We need to be careful not to break the --no-dependent-toast and
+	 * --no-dependent-indexes options.  By default, the indexes, toast tables,
+	 * and toast table indexes associated with primary tables are included,
+	 * using their own CTEs below.  We implement the --exclude-* options by
+	 * not creating those CTEs, but that's no use if we've already selected
+	 * the toast and indexes here.  On the other hand, we want inclusion
+	 * patterns that match indexes or toast tables to be honored.  So, if
+	 * inclusion patterns were given, we want to select all tables, toast
+	 * tables, or indexes that match the patterns.  But if no inclusion
+	 * patterns were given, and we're simply matching all relations, then we
+	 * only want to match the primary tables here.
+	 */
+	if (opts.allrel)
+		appendPQExpBuffer(&sql,
+						  "\nAND c.relam = %u"
+						  "\nAND c.relkind IN ('r', 'm', 't')"
+						  "\nAND c.relnamespace != %u",
+						  HEAP_TABLE_AM_OID, PG_TOAST_NAMESPACE);
+	else
+		appendPQExpBuffer(&sql,
+						  "\nAND c.relam IN (%u, %u)"
+						  "\nAND c.relkind IN ('r', 'm', 't', 'i')"
+						  "\nAND ((c.relam = %u AND c.relkind IN ('r', 'm', 't')) OR"
+						  "\n(c.relam = %u AND c.relkind = 'i'))",
+						  HEAP_TABLE_AM_OID, BTREE_AM_OID,
+						  HEAP_TABLE_AM_OID, BTREE_AM_OID);
+
+	appendPQExpBufferStr(&sql,
+						 "\nORDER BY c.oid"
+						 "\n)");
+
+	if (!opts.no_toast_expansion)
+	{
+		/*
+		 * Include a CTE for toast tables associated with primary tables
+		 * selected above, filtering by exclusion patterns (if any) that match
+		 * toast table names.
+		 */
+		appendPQExpBufferStr(&sql,
+							 ",\ntoast (oid, nspname, relname, relpages) AS ("
+							 "\nSELECT t.oid, 'pg_toast', t.relname, t.relpages"
+							 "\nFROM pg_catalog.pg_class t"
+							 "\nINNER JOIN relation r"
+							 "\nON r.reltoastrelid = t.oid");
+		if (opts.excludetbl || opts.excludensp)
+			appendPQExpBufferStr(&sql,
+								 "\nLEFT OUTER JOIN exclude_pat ep"
+								 "\nON ('pg_toast' ~ ep.nsp_regex OR ep.nsp_regex IS NULL)"
+								 "\nAND (t.relname ~ ep.rel_regex OR ep.rel_regex IS NULL)"
+								 "\nAND ep.table_only"
+								 "\nWHERE ep.id IS NULL");
+		appendPQExpBufferStr(&sql,
+							 "\n)");
+	}
+	if (!opts.no_index_expansion)
+	{
+		/*
+		 * Include a CTE for btree indexes associated with primary tables
+		 * selected above, filtering by exclusion patterns (if any) that match
+		 * btree index names.
+		 */
+		appendPQExpBufferStr(&sql,
+							 ",\nindex (oid, nspname, relname, relpages) AS ("
+							 "\nSELECT c.oid, r.nspname, c.relname, c.relpages"
+							 "\nFROM relation r"
+							 "\nINNER JOIN pg_catalog.pg_index i"
+							 "\nON r.oid = i.indrelid"
+							 "\nINNER JOIN pg_catalog.pg_class c"
+							 "\nON i.indexrelid = c.oid");
+		if (opts.excludeidx || opts.excludensp)
+			appendPQExpBufferStr(&sql,
+								 "\nINNER JOIN pg_catalog.pg_namespace n"
+								 "\nON c.relnamespace = n.oid"
+								 "\nLEFT OUTER JOIN exclude_pat ep"
+								 "\nON (n.nspname ~ ep.nsp_regex OR ep.nsp_regex IS NULL)"
+								 "\nAND (c.relname ~ ep.rel_regex OR ep.rel_regex IS NULL)"
+								 "\nAND ep.index_only"
+								 "\nWHERE ep.id IS NULL");
+		else
+			appendPQExpBufferStr(&sql,
+								 "\nWHERE true");
+		appendPQExpBuffer(&sql,
+						  "\nAND c.relam = %u"
+						  "\nAND c.relkind = 'i'",
+						  BTREE_AM_OID);
+		if (opts.no_toast_expansion)
+			appendPQExpBuffer(&sql,
+							  "\nAND c.relnamespace != %u",
+							  PG_TOAST_NAMESPACE);
+		appendPQExpBufferStr(&sql, "\n)");
+	}
+
+	if (!opts.no_toast_expansion && !opts.no_index_expansion)
+	{
+		/*
+		 * Include a CTE for btree indexes associated with toast tables of
+		 * primary tables selected above, filtering by exclusion patterns (if
+		 * any) that match the toast index names.
+		 */
+		appendPQExpBufferStr(&sql,
+							 ",\ntoast_index (oid, nspname, relname, relpages) AS ("
+							 "\nSELECT c.oid, 'pg_toast', c.relname, c.relpages"
+							 "\nFROM toast t"
+							 "\nINNER JOIN pg_catalog.pg_index i"
+							 "\nON t.oid = i.indrelid"
+							 "\nINNER JOIN pg_catalog.pg_class c"
+							 "\nON i.indexrelid = c.oid");
+		if (opts.excludeidx)
+			appendPQExpBufferStr(&sql,
+								 "\nLEFT OUTER JOIN exclude_pat ep"
+								 "\nON ('pg_toast' ~ ep.nsp_regex OR ep.nsp_regex IS NULL)"
+								 "\nAND (c.relname ~ ep.rel_regex OR ep.rel_regex IS NULL)"
+								 "\nAND ep.index_only"
+								 "\nWHERE ep.id IS NULL");
+		else
+			appendPQExpBufferStr(&sql,
+								 "\nWHERE true");
+		appendPQExpBuffer(&sql,
+						  "\nAND c.relam = %u"
+						  "\nAND c.relkind = 'i'"
+						  "\n)",
+						  BTREE_AM_OID);
+	}
+
+	/*
+	 * Roll-up distinct rows from CTEs.
+	 *
+	 * Relations that match more than one pattern may occur more than once in
+	 * the list, and indexes and toast for primary relations may also have
+	 * matched in their own right, so we rely on UNION to deduplicate the
+	 * list.
+	 */
+	appendPQExpBufferStr(&sql,
+						 "\nSELECT id, is_table, is_index, oid, nspname, relname, relpages"
+						 "\nFROM (");
+	appendPQExpBufferStr(&sql,
+	/* Inclusion patterns that failed to match */
+						 "\nSELECT id, is_table, is_index,"
+						 "\nNULL::OID AS oid,"
+						 "\nNULL::TEXT AS nspname,"
+						 "\nNULL::TEXT AS relname,"
+						 "\nNULL::INTEGER AS relpages"
+						 "\nFROM relation"
+						 "\nWHERE id IS NOT NULL"
+						 "\nUNION"
+	/* Primary relations */
+						 "\nSELECT NULL::INTEGER AS id,"
+						 "\nis_table, is_index, oid, nspname, relname, relpages"
+						 "\nFROM relation");
+	if (!opts.no_toast_expansion)
+		appendPQExpBufferStr(&sql,
+							 "\nUNION"
+		/* Toast tables for primary relations */
+							 "\nSELECT NULL::INTEGER AS id, TRUE AS is_table,"
+							 "\nFALSE AS is_index, oid, nspname, relname, relpages"
+							 "\nFROM toast");
+	if (!opts.no_index_expansion)
+		appendPQExpBufferStr(&sql,
+							 "\nUNION"
+		/* Indexes for primary relations */
+							 "\nSELECT NULL::INTEGER AS id, FALSE AS is_table,"
+							 "\nTRUE AS is_index, oid, nspname, relname, relpages"
+							 "\nFROM index");
+	if (!opts.no_toast_expansion && !opts.no_index_expansion)
+		appendPQExpBufferStr(&sql,
+							 "\nUNION"
+		/* Indexes for toast relations */
+							 "\nSELECT NULL::INTEGER AS id, FALSE AS is_table,"
+							 "\nTRUE AS is_index, oid, nspname, relname, relpages"
+							 "\nFROM toast_index");
+	appendPQExpBufferStr(&sql,
+						 "\n) AS combined_records"
+						 "\nORDER BY relpages DESC NULLS FIRST, oid");
+
+	res = executeQuery(conn, sql.data, opts.echo);
+	if (PQresultStatus(res) != PGRES_TUPLES_OK)
+	{
+		pg_log_error("query failed: %s", PQerrorMessage(conn));
+		pg_log_error("query was: %s", sql.data);
+		disconnectDatabase(conn);
+		exit(1);
+	}
+	termPQExpBuffer(&sql);
+
+	/*
+	 * Allocate a single copy of the database name to be shared by all nodes
+	 * in the object list, constructed below.
+	 */
+	datname = pstrdup(PQdb(conn));
+
+	ntups = PQntuples(res);
+	for (i = 0; i < ntups; i++)
+	{
+		int			pattern_id = 0;
+		bool		is_table = false;
+		bool		is_index = false;
+		Oid			oid = InvalidOid;
+		const char *nspname = NULL;
+		const char *relname = NULL;
+		int			relpages = 0;
+
+		if (!PQgetisnull(res, i, 0))
+			pattern_id = atoi(PQgetvalue(res, i, 0));
+		if (!PQgetisnull(res, i, 1))
+			is_table = (PQgetvalue(res, i, 1)[0] == 't');
+		if (!PQgetisnull(res, i, 2))
+			is_index = (PQgetvalue(res, i, 2)[0] == 't');
+		if (!PQgetisnull(res, i, 3))
+			oid = atooid(PQgetvalue(res, i, 3));
+		if (!PQgetisnull(res, i, 4))
+			nspname = PQgetvalue(res, i, 4);
+		if (!PQgetisnull(res, i, 5))
+			relname = PQgetvalue(res, i, 5);
+		if (!PQgetisnull(res, i, 6))
+			relpages = atoi(PQgetvalue(res, i, 6));
+
+		if (pattern_id > 0)
+		{
+			/*
+			 * Current record pertains to an inclusion pattern.  Find the
+			 * pattern in the list and record that it matched.  If we expected
+			 * a large number of command-line inclusion pattern arguments, the
+			 * data structure here might need to be more efficient, but we
+			 * expect the list to be short.
+			 */
+
+			SimplePtrListCell *cell;
+			bool		found;
+
+			for (found = false, cell = opts.include.head; cell; cell = cell->next)
+			{
+				PatternInfo *info = (PatternInfo *) cell->ptr;
+
+				if (info->pattern_id == pattern_id)
+				{
+					info->matched = true;
+					found = true;
+					break;
+				}
+			}
+			if (!found)
+			{
+				pg_log_error("internal error: received unexpected pattern_id %d",
+							 pattern_id);
+				exit(1);
+			}
+		}
+		else
+		{
+			/* Current record pertains to a relation */
+
+			RelationInfo *rel = (RelationInfo *) palloc0(sizeof(RelationInfo));
+
+			Assert(OidIsValid(oid));
+			Assert(is_table ^ is_index);
+
+			rel->datinfo = dat;
+			rel->reloid = oid;
+			rel->is_table = is_table;
+			rel->nspname = pstrdup(nspname);
+			rel->relname = pstrdup(relname);
+			rel->relpages = relpages;
+			rel->blocks_to_check = relpages;
+			if (is_table && (opts.startblock >= 0 || opts.endblock >= 0))
+			{
+				/*
+				 * We apply --startblock and --endblock to tables, but not
+				 * indexes, and for progress purposes we need to track how many
+				 * blocks we will actually check.
+				 */
+				if (opts.endblock >= 0 && rel->blocks_to_check > opts.endblock)
+					rel->blocks_to_check = opts.endblock + 1;
+				if (opts.startblock >= 0)
+				{
+					if (rel->blocks_to_check > opts.startblock)
+						rel->blocks_to_check -= opts.startblock;
+					else
+						rel->blocks_to_check = 0;
+				}
+			}
+			*pagecount += rel->blocks_to_check;
+
+			simple_ptr_list_append(relations, rel);
+		}
+	}
+	PQclear(res);
+}
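The --startblock/--endblock clamping near the end of compile_relation_list_one_db() can be isolated as a pure function; this sketch mirrors that arithmetic (hypothetical helper, not part of the patch):

```c
#include <assert.h>

/*
 * Compute how many blocks of a table will actually be checked.
 * relpages is the table's page count; startblock and endblock are -1
 * when the corresponding option was not given.  endblock is inclusive,
 * so checking blocks [0, endblock] covers endblock + 1 blocks.
 */
static long long
blocks_to_check(long long relpages, long long startblock, long long endblock)
{
	long long	blocks = relpages;

	if (endblock >= 0 && blocks > endblock)
		blocks = endblock + 1;
	if (startblock >= 0)
	{
		if (blocks > startblock)
			blocks -= startblock;
		else
			blocks = 0;
	}
	return blocks;
}
```

For example, a 100-page table restricted to blocks 10 through 49 contributes 40 blocks to the progress total.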
diff --git a/contrib/pg_amcheck/t/001_basic.pl b/contrib/pg_amcheck/t/001_basic.pl
new file mode 100644
index 0000000000..dfa0ae9e06
--- /dev/null
+++ b/contrib/pg_amcheck/t/001_basic.pl
@@ -0,0 +1,9 @@
+use strict;
+use warnings;
+
+use TestLib;
+use Test::More tests => 8;
+
+program_help_ok('pg_amcheck');
+program_version_ok('pg_amcheck');
+program_options_handling_ok('pg_amcheck');
diff --git a/contrib/pg_amcheck/t/002_nonesuch.pl b/contrib/pg_amcheck/t/002_nonesuch.pl
new file mode 100644
index 0000000000..1e5842aead
--- /dev/null
+++ b/contrib/pg_amcheck/t/002_nonesuch.pl
@@ -0,0 +1,264 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 82;
+
+# Test set-up
+my ($node, $port);
+$node = get_new_node('test');
+$node->init;
+$node->start;
+$port = $node->port;
+
+# Load the amcheck extension, upon which pg_amcheck depends
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+
+#########################################
+# Test non-existent databases
+
+# Failing to connect to the initial database is an error.
+$node->command_checks_all(
+	[ 'pg_amcheck', 'qqq' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/FATAL:  database "qqq" does not exist/ ],
+	'checking a non-existent database');
+
+# Failing to connect to an additional database is also an error.
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', 'qqq' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/FATAL:  database "qqq" does not exist/ ],
+	'checking a non-existent additional database');
+
+# Failing to resolve a database pattern is an error by default.
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', '-d', 'qqq' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no connectable databases to check matching "qqq"/ ],
+	'checking an unresolvable database pattern');
+
+# But only a warning under --no-strict-names
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', '--no-strict-names', '-d', 'qqq' ],
+	0,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: no connectable databases to check matching "qqq"/ ],
+	'checking an unresolvable database pattern under --no-strict-names');
+
+# Check that a substring of an existent database name does not get interpreted
+# as a matching pattern.
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', '-d', 'post' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no connectable databases to check matching "post"/ ],
+	'checking an unresolvable database pattern (substring of existent database)');
+
+# Check that a superstring of an existent database name does not get interpreted
+# as a matching pattern.
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', '-d', 'postgresql' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no connectable databases to check matching "postgresql"/ ],
+	'checking an unresolvable database pattern (superstring of existent database)');
+
+#########################################
+# Test connecting with a non-existent user
+
+# Failing to connect to the initial database due to bad username is an error.
+$node->command_checks_all(
+	[ 'pg_amcheck', '-U', 'no_such_user', 'postgres' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/role "no_such_user" does not exist/ ],
+	'checking with a non-existent user');
+
+# Failing to connect to the initial database due to bad username is still an
+# error under --no-strict-names.
+$node->command_checks_all(
+	[ 'pg_amcheck', '--no-strict-names', '-U', 'no_such_user', 'postgres' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/role "no_such_user" does not exist/ ],
+	'checking with a non-existent user under --no-strict-names');
+
+#########################################
+# Test checking databases without amcheck installed
+
+# Attempting to check a database by name where amcheck is not installed should
+# raise a warning.  If all databases are skipped, having no relations to check
+# raises an error.
+$node->command_checks_all(
+	[ 'pg_amcheck', 'template1' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/,
+	  qr/pg_amcheck: error: no relations to check/ ],
+	'checking a database by name without amcheck installed, no other databases');
+
+# Again, but this time with another database to check, so no error is raised.
+$node->command_checks_all(
+	[ 'pg_amcheck', 'template1', 'postgres' ],
+	0,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/ ],
+	'checking a database by name without amcheck installed, with other databases');
+
+# Again, but by way of database pattern rather than name
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', '-d', 'template*' ],
+	0,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/ ],
+	'checking a database by pattern without amcheck installed, with other databases');
+
+# Again, but by way of checking all databases
+$node->command_checks_all(
+	[ 'pg_amcheck', '--all' ],
+	0,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/ ],
+	'checking all databases without amcheck installed');
+
+#########################################
+# Test unreasonable patterns
+
+# Check three-part unreasonable pattern that has zero-length names
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', '-t', '..' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no connectable databases to check matching "\.\."/ ],
+	'checking table pattern ".."');
+
+# Again, but with non-trivial schema and relation parts
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', '-t', '.foo.bar' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no connectable databases to check matching "\.foo\.bar"/ ],
+	'checking table pattern ".foo.bar"');
+
+# Check two-part unreasonable pattern that has zero-length names
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', '-t', '.' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no tables to check matching "\."/ ],
+	'checking table pattern "."');
+
+#########################################
+# Test checking non-existent databases, schemas, tables, and indexes
+
+# Use --no-strict-names and a single existent table so we only get warnings
+# about the failed pattern matches
+$node->command_checks_all(
+	[ 'pg_amcheck', '--no-strict-names',
+		'-t', 'no_such_table',
+		'-t', 'no*such*table',
+		'-i', 'no_such_index',
+		'-i', 'no*such*index',
+		'-r', 'no_such_relation',
+		'-r', 'no*such*relation',
+		'-d', 'no_such_database',
+		'-d', 'no*such*database',
+		'-r', 'none.none',
+		'-r', 'none.none.none',
+		'-r', 'this.is.a.really.long.dotted.string',
+		'-r', 'postgres.none.none',
+		'-r', 'postgres.long.dotted.string',
+		'-r', 'postgres.pg_catalog.none',
+		'-r', 'postgres.none.pg_class',
+		'-t', 'postgres.pg_catalog.pg_class',	# This exists
+	],
+	0,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: no tables to check matching "no_such_table"/,
+	  qr/pg_amcheck: warning: no tables to check matching "no\*such\*table"/,
+	  qr/pg_amcheck: warning: no btree indexes to check matching "no_such_index"/,
+	  qr/pg_amcheck: warning: no btree indexes to check matching "no\*such\*index"/,
+	  qr/pg_amcheck: warning: no relations to check matching "no_such_relation"/,
+	  qr/pg_amcheck: warning: no relations to check matching "no\*such\*relation"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "no_such_database"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "no\*such\*database"/,
+	  qr/pg_amcheck: warning: no relations to check matching "none\.none"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "none\.none\.none"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "this\.is\.a\.really\.long\.dotted\.string"/,
+	  qr/pg_amcheck: warning: no relations to check matching "postgres\.none\.none"/,
+	  qr/pg_amcheck: warning: no relations to check matching "postgres\.long\.dotted\.string"/,
+	  qr/pg_amcheck: warning: no relations to check matching "postgres\.pg_catalog\.none"/,
+	  qr/pg_amcheck: warning: no relations to check matching "postgres\.none\.pg_class"/,
+	],
+	'many unmatched patterns and one matched pattern under --no-strict-names');
+
+#########################################
+# Test checking otherwise existent objects but in databases where they do not exist
+
+$node->safe_psql('postgres', q(
+	CREATE TABLE public.foo (f integer);
+	CREATE INDEX foo_idx ON foo(f);
+));
+$node->safe_psql('postgres', q(CREATE DATABASE another_db));
+
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', '--no-strict-names',
+		'-t', 'template1.public.foo',
+		'-t', 'another_db.public.foo',
+		'-t', 'no_such_database.public.foo',
+		'-i', 'template1.public.foo_idx',
+		'-i', 'another_db.public.foo_idx',
+		'-i', 'no_such_database.public.foo_idx',
+	],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/,
+	  qr/pg_amcheck: warning: no tables to check matching "template1\.public\.foo"/,
+	  qr/pg_amcheck: warning: no tables to check matching "another_db\.public\.foo"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "no_such_database\.public\.foo"/,
+	  qr/pg_amcheck: warning: no btree indexes to check matching "template1\.public\.foo_idx"/,
+	  qr/pg_amcheck: warning: no btree indexes to check matching "another_db\.public\.foo_idx"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "no_such_database\.public\.foo_idx"/,
+	  qr/pg_amcheck: error: no relations to check/,
+	],
+	'checking otherwise existent objects in the wrong databases');
+
+
+#########################################
+# Test schema exclusion patterns
+
+# Check with only schema exclusion patterns
+$node->command_checks_all(
+	[ 'pg_amcheck', '--all', '--no-strict-names',
+		'-S', 'public',
+		'-S', 'pg_catalog',
+		'-S', 'pg_toast',
+		'-S', 'information_schema',
+	],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/,
+	  qr/pg_amcheck: error: no relations to check/ ],
+	'schema exclusion patterns exclude all relations');
+
+# Check with schema exclusion patterns overriding relation and schema inclusion patterns
+$node->command_checks_all(
+	[ 'pg_amcheck', '--all', '--no-strict-names',
+		'-s', 'public',
+		'-s', 'pg_catalog',
+		'-s', 'pg_toast',
+		'-s', 'information_schema',
+		'-t', 'pg_catalog.pg_class',
+		'-S', '*'
+	],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/,
+	  qr/pg_amcheck: error: no relations to check/ ],
+	'schema exclusion pattern overrides all inclusion patterns');
diff --git a/contrib/pg_amcheck/t/003_check.pl b/contrib/pg_amcheck/t/003_check.pl
new file mode 100644
index 0000000000..c26267417a
--- /dev/null
+++ b/contrib/pg_amcheck/t/003_check.pl
@@ -0,0 +1,497 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 57;
+
+my ($node, $port, %corrupt_page, %remove_relation);
+
+# Returns the filesystem path for the named relation.
+#
+# Assumes the test node is running
+sub relation_filepath($$)
+{
+	my ($dbname, $relname) = @_;
+
+	my $pgdata = $node->data_dir;
+	my $rel = $node->safe_psql($dbname,
+							   qq(SELECT pg_relation_filepath('$relname')));
+	die "path not found for relation $relname" unless defined $rel;
+	return "$pgdata/$rel";
+}
+
+# Returns the name of the toast relation associated with the named relation.
+#
+# Assumes the test node is running
+sub relation_toast($$)
+{
+	my ($dbname, $relname) = @_;
+
+	my $rel = $node->safe_psql($dbname, qq(
+		SELECT ct.relname
+			FROM pg_catalog.pg_class cr, pg_catalog.pg_class ct
+			WHERE cr.oid = '$relname'::regclass
+			  AND cr.reltoastrelid = ct.oid
+			));
+	return undef unless defined $rel;
+	return "pg_toast.$rel";
+}
+
+# Adds the relation file for the given (dbname, relname) to the list
+# to be corrupted by means of overwriting junk in the first page.
+#
+# Assumes the test node is running.
+sub plan_to_corrupt_first_page($$)
+{
+	my ($dbname, $relname) = @_;
+	my $relpath = relation_filepath($dbname, $relname);
+	$corrupt_page{$relpath} = 1;
+}
+
+# Adds the relation file for the given (dbname, relname) to the list
+# to be corrupted by means of removing the file.
+#
+# Assumes the test node is running
+sub plan_to_remove_relation_file($$)
+{
+	my ($dbname, $relname) = @_;
+	my $relpath = relation_filepath($dbname, $relname);
+	$remove_relation{$relpath} = 1;
+}
+
+# For the given (dbname, relname), if a corresponding toast table
+# exists, adds that toast table's relation file to the list to be
+# corrupted by means of removing the file.
+#
+# Assumes the test node is running.
+sub plan_to_remove_toast_file($$)
+{
+	my ($dbname, $relname) = @_;
+	my $toastname = relation_toast($dbname, $relname);
+	plan_to_remove_relation_file($dbname, $toastname) if ($toastname);
+}
+
+# Corrupts the first page of the given file path
+sub corrupt_first_page($)
+{
+	my ($relpath) = @_;
+
+	my $fh;
+	open($fh, '+<', $relpath)
+	  or BAIL_OUT("open failed: $!");
+	binmode $fh;
+
+	# Corrupt some line pointers.  The values are chosen to hit the
+	# various line-pointer-corruption checks in verify_heapam.c
+	# on both little-endian and big-endian architectures.
+	seek($fh, 32, 0)
+	  or BAIL_OUT("seek failed: $!");
+	syswrite(
+		$fh,
+		pack("L*",
+			0xAAA15550, 0xAAA0D550, 0x00010000,
+			0x00008000, 0x0000800F, 0x001e8000,
+			0xFFFFFFFF)
+	) or BAIL_OUT("syswrite failed: $!");
+	close($fh)
+	  or BAIL_OUT("close failed: $!");
+}
+
+# Stops the node, performs all the corruptions previously planned, and
+# starts the node again.
+#
+sub perform_all_corruptions()
+{
+	$node->stop();
+	for my $relpath (keys %corrupt_page)
+	{
+		corrupt_first_page($relpath);
+	}
+	for my $relpath (keys %remove_relation)
+	{
+		unlink($relpath);
+	}
+	$node->start;
+}
+
+# Test set-up
+$node = get_new_node('test');
+$node->init;
+$node->start;
+$port = $node->port;
+
+for my $dbname (qw(db1 db2 db3))
+{
+	# Create the database
+	$node->safe_psql('postgres', qq(CREATE DATABASE $dbname));
+
+	# Load the amcheck extension, upon which pg_amcheck depends.  Put the
+	# extension in an unexpected location to test that pg_amcheck finds it
+	# correctly.  Create tables with names that look like pg_catalog names to
+	# check that pg_amcheck does not get confused by them.  Create functions in
+	# schema public that look like amcheck functions to check that pg_amcheck
+	# does not use them.
+	$node->safe_psql($dbname, q(
+		CREATE SCHEMA amcheck_schema;
+		CREATE EXTENSION amcheck WITH SCHEMA amcheck_schema;
+		CREATE TABLE amcheck_schema.pg_database (junk text);
+		CREATE TABLE amcheck_schema.pg_namespace (junk text);
+		CREATE TABLE amcheck_schema.pg_class (junk text);
+		CREATE TABLE amcheck_schema.pg_operator (junk text);
+		CREATE TABLE amcheck_schema.pg_proc (junk text);
+		CREATE TABLE amcheck_schema.pg_tablespace (junk text);
+
+		CREATE FUNCTION public.bt_index_check(index regclass,
+											  heapallindexed boolean default false)
+		RETURNS VOID AS $$
+		BEGIN
+			RAISE EXCEPTION 'Invoked wrong bt_index_check!';
+		END;
+		$$ LANGUAGE plpgsql;
+
+		CREATE FUNCTION public.bt_index_parent_check(index regclass,
+													 heapallindexed boolean default false,
+													 rootdescend boolean default false)
+		RETURNS VOID AS $$
+		BEGIN
+			RAISE EXCEPTION 'Invoked wrong bt_index_parent_check!';
+		END;
+		$$ LANGUAGE plpgsql;
+
+		CREATE FUNCTION public.verify_heapam(relation regclass,
+											 on_error_stop boolean default false,
+											 check_toast boolean default false,
+											 skip text default 'none',
+											 startblock bigint default null,
+											 endblock bigint default null,
+											 blkno OUT bigint,
+											 offnum OUT integer,
+											 attnum OUT integer,
+											 msg OUT text)
+		RETURNS SETOF record AS $$
+		BEGIN
+			RAISE EXCEPTION 'Invoked wrong verify_heapam!';
+		END;
+		$$ LANGUAGE plpgsql;
+	));
+
+	# Create schemas, tables and indexes in five separate
+	# schemas.  The schemas are all identical to start, but
+	# we will corrupt them differently later.
+	#
+	for my $schema (qw(s1 s2 s3 s4 s5))
+	{
+		$node->safe_psql($dbname, qq(
+			CREATE SCHEMA $schema;
+			CREATE SEQUENCE $schema.seq1;
+			CREATE SEQUENCE $schema.seq2;
+			CREATE TABLE $schema.t1 (
+				i INTEGER,
+				b BOX,
+				ia int4[],
+				ir int4range,
+				t TEXT
+			);
+			CREATE TABLE $schema.t2 (
+				i INTEGER,
+				b BOX,
+				ia int4[],
+				ir int4range,
+				t TEXT
+			);
+			CREATE VIEW $schema.t2_view AS (
+				SELECT i*2, t FROM $schema.t2
+			);
+			ALTER TABLE $schema.t2
+				ALTER COLUMN t
+				SET STORAGE EXTERNAL;
+
+			INSERT INTO $schema.t1 (i, b, ia, ir, t)
+				(SELECT gs::INTEGER AS i,
+						box(point(gs,gs+5),point(gs*2,gs*3)) AS b,
+						array[gs, gs + 1]::int4[] AS ia,
+						int4range(gs, gs+100) AS ir,
+						repeat('foo', gs) AS t
+					 FROM generate_series(1,10000,3000) AS gs);
+
+			INSERT INTO $schema.t2 (i, b, ia, ir, t)
+				(SELECT gs::INTEGER AS i,
+						box(point(gs,gs+5),point(gs*2,gs*3)) AS b,
+						array[gs, gs + 1]::int4[] AS ia,
+						int4range(gs, gs+100) AS ir,
+						repeat('foo', gs) AS t
+					 FROM generate_series(1,10000,3000) AS gs);
+
+			CREATE MATERIALIZED VIEW $schema.t1_mv AS SELECT * FROM $schema.t1;
+			CREATE MATERIALIZED VIEW $schema.t2_mv AS SELECT * FROM $schema.t2;
+
+			CREATE TABLE $schema.p1 (a int, b int) PARTITION BY LIST (a);
+			CREATE TABLE $schema.p2 (a int, b int) PARTITION BY LIST (a);
+
+			CREATE TABLE $schema.p1_1 PARTITION OF $schema.p1 FOR VALUES IN (1, 2, 3);
+			CREATE TABLE $schema.p1_2 PARTITION OF $schema.p1 FOR VALUES IN (4, 5, 6);
+			CREATE TABLE $schema.p2_1 PARTITION OF $schema.p2 FOR VALUES IN (1, 2, 3);
+			CREATE TABLE $schema.p2_2 PARTITION OF $schema.p2 FOR VALUES IN (4, 5, 6);
+
+			CREATE INDEX t1_btree ON $schema.t1 USING BTREE (i);
+			CREATE INDEX t2_btree ON $schema.t2 USING BTREE (i);
+
+			CREATE INDEX t1_hash ON $schema.t1 USING HASH (i);
+			CREATE INDEX t2_hash ON $schema.t2 USING HASH (i);
+
+			CREATE INDEX t1_brin ON $schema.t1 USING BRIN (i);
+			CREATE INDEX t2_brin ON $schema.t2 USING BRIN (i);
+
+			CREATE INDEX t1_gist ON $schema.t1 USING GIST (b);
+			CREATE INDEX t2_gist ON $schema.t2 USING GIST (b);
+
+			CREATE INDEX t1_gin ON $schema.t1 USING GIN (ia);
+			CREATE INDEX t2_gin ON $schema.t2 USING GIN (ia);
+
+			CREATE INDEX t1_spgist ON $schema.t1 USING SPGIST (ir);
+			CREATE INDEX t2_spgist ON $schema.t2 USING SPGIST (ir);
+		));
+	}
+}
+
+# Database 'db1' corruptions
+#
+
+# Corrupt indexes in schema "s1"
+plan_to_remove_relation_file('db1', 's1.t1_btree');
+plan_to_corrupt_first_page('db1', 's1.t2_btree');
+
+# Corrupt tables in schema "s2"
+plan_to_remove_relation_file('db1', 's2.t1');
+plan_to_corrupt_first_page('db1', 's2.t2');
+
+# Corrupt tables, partitions, matviews, and btrees in schema "s3"
+plan_to_remove_relation_file('db1', 's3.t1');
+plan_to_corrupt_first_page('db1', 's3.t2');
+
+plan_to_remove_relation_file('db1', 's3.t1_mv');
+plan_to_remove_relation_file('db1', 's3.p1_1');
+
+plan_to_corrupt_first_page('db1', 's3.t2_mv');
+plan_to_corrupt_first_page('db1', 's3.p2_1');
+
+plan_to_remove_relation_file('db1', 's3.t1_btree');
+plan_to_corrupt_first_page('db1', 's3.t2_btree');
+
+# Corrupt toast table, partitions, and materialized views in schema "s4"
+plan_to_remove_toast_file('db1', 's4.t2');
+
+# Corrupt all other object types in schema "s5".  We don't have amcheck support
+# for these types, but we check that their corruption does not trigger any
+# errors in pg_amcheck.
+plan_to_remove_relation_file('db1', 's5.seq1');
+plan_to_remove_relation_file('db1', 's5.t1_hash');
+plan_to_remove_relation_file('db1', 's5.t1_gist');
+plan_to_remove_relation_file('db1', 's5.t1_gin');
+plan_to_remove_relation_file('db1', 's5.t1_brin');
+plan_to_remove_relation_file('db1', 's5.t1_spgist');
+
+plan_to_corrupt_first_page('db1', 's5.seq2');
+plan_to_corrupt_first_page('db1', 's5.t2_hash');
+plan_to_corrupt_first_page('db1', 's5.t2_gist');
+plan_to_corrupt_first_page('db1', 's5.t2_gin');
+plan_to_corrupt_first_page('db1', 's5.t2_brin');
+plan_to_corrupt_first_page('db1', 's5.t2_spgist');
+
+
+# Database 'db2' corruptions
+#
+plan_to_remove_relation_file('db2', 's1.t1');
+plan_to_remove_relation_file('db2', 's1.t1_btree');
+
+
+# Leave 'db3' uncorrupted
+#
+
+# Perform the corruptions we planned above using only a single database restart.
+#
+perform_all_corruptions();
+
+
+# Standard first arguments to TestLib functions
+my @cmd = ('pg_amcheck', '--quiet', '-p', $port);
+
+# Regular expressions to match various expected output
+my $no_output_re = qr/^$/;
+my $line_pointer_corruption_re = qr/line pointer/;
+my $missing_file_re = qr/could not open file ".*": No such file or directory/;
+my $index_missing_relation_fork_re = qr/index ".*" lacks a main relation fork/;
+
+# When checking databases with amcheck installed and corrupt relations, the
+# pg_amcheck command should exit with status 2, meaning that corruption was
+# found, not status 1, which would mean the pg_amcheck command itself failed.
+# Corruption messages should go to stdout, and nothing to stderr.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1' ],
+	2,
+	[ $index_missing_relation_fork_re,
+	  $line_pointer_corruption_re,
+	  $missing_file_re,
+	],
+	[ $no_output_re ],
+	'pg_amcheck all schemas, tables and indexes in database db1');
+
+$node->command_checks_all(
+	[ @cmd, 'db1', 'db2', 'db3' ],
+	2,
+	[ $index_missing_relation_fork_re,
+	  $line_pointer_corruption_re,
+	  $missing_file_re,
+	],
+	[ $no_output_re ],
+	'pg_amcheck all schemas, tables and indexes in databases db1, db2, and db3');
+
+# Scans of indexes in s1 should detect the specific corruption that we created
+# above.  For missing relation forks, we know what the error message looks
+# like.  For corrupted index pages, the error might vary depending on how the
+# page was formatted on disk, including variations due to alignment differences
+# between platforms, so we accept any non-empty error message.
+#
+# If we don't limit the check to databases with amcheck installed, we expect
+# complaint on stderr, but otherwise stderr should be quiet.
+#
+$node->command_checks_all(
+	[ @cmd, '--all', '-s', 's1', '-i', 't1_btree' ],
+	2,
+	[ $index_missing_relation_fork_re ],
+	[ qr/pg_amcheck: warning: skipping database "postgres": amcheck is not installed/ ],
+	'pg_amcheck index s1.t1_btree reports missing main relation fork');
+
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's1', '-i', 't2_btree' ],
+	2,
+	[ qr/.+/ ],			# Any non-empty error message is acceptable
+	[ $no_output_re ],
+	'pg_amcheck index s1.t2_btree reports index corruption');
+
+# Checking db1.s1 with indexes excluded should show no corruption because we
+# did not corrupt any tables in db1.s1.  Verify that both stdout and stderr
+# are quiet.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-t', 's1.*', '--no-dependent-indexes' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck of db1.s1 excluding indexes');
+
+# Checking db2.s1 should show table corruptions if indexes are excluded
+#
+$node->command_checks_all(
+	[ @cmd, 'db2', '-t', 's1.*', '--no-dependent-indexes' ],
+	2,
+	[ $missing_file_re ],
+	[ $no_output_re ],
+	'pg_amcheck of db2.s1 excluding indexes');
+
+# In schema db1.s3, the tables and indexes are both corrupt.  We should see
+# corruption messages on stdout, and nothing on stderr.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's3' ],
+	2,
+	[ $index_missing_relation_fork_re,
+	  $line_pointer_corruption_re,
+	  $missing_file_re,
+	],
+	[ $no_output_re ],
+	'pg_amcheck schema s3 reports table and index errors');
+
+# In schema db1.s4, only toast tables are corrupt.  Check that under default
+# options the toast corruption is reported, but when excluding toast we get no
+# error reports.
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's4' ],
+	2,
+	[ $missing_file_re ],
+	[ $no_output_re ],
+	'pg_amcheck in schema s4 reports toast corruption');
+
+$node->command_checks_all(
+	[ @cmd, '--no-dependent-toast', '--exclude-toast-pointers', 'db1', '-s', 's4' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck in schema s4 excluding toast reports no corruption');
+
+# Check that no corruption is reported in schema db1.s5
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's5' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck over schema s5 reports no corruption');
+
+# In schema db1.s1, only indexes are corrupt.  Verify that when we exclude
+# the indexes, no corruption is reported about the schema.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's1', '-I', 't1_btree', '-I', 't2_btree' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck over schema s1 with corrupt indexes excluded reports no corruption');
+
+# In schema db1.s1, only indexes are corrupt.  Verify that when we provide only
+# table inclusions, and disable index expansion, no corruption is reported
+# about the schema.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-t', 's1.*', '--no-dependent-indexes' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck over schema s1 with all indexes excluded reports no corruption');
+
+# In schema db1.s2, only tables are corrupt.  Verify that when we exclude those
+# tables, no corruption is reported.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's2', '-T', 't1', '-T', 't2' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck over schema s2 with corrupt tables excluded reports no corruption');
+
+# Check errors about bad block range command line arguments.  We use schema s5
+# to avoid getting messages about corrupt tables or indexes.
+#
+command_fails_like(
+	[ @cmd, 'db1', '-s', 's5', '--startblock', 'junk' ],
+	qr/relation start block argument contains garbage characters/,
+	'pg_amcheck rejects garbage startblock');
+
+command_fails_like(
+	[ @cmd, 'db1', '-s', 's5', '--endblock', '1234junk' ],
+	qr/relation end block argument contains garbage characters/,
+	'pg_amcheck rejects garbage endblock');
+
+command_fails_like(
+	[ @cmd, 'db1', '-s', 's5', '--startblock', '5', '--endblock', '4' ],
+	qr/relation end block argument precedes start block argument/,
+	'pg_amcheck rejects invalid block range');
+
+# Check the bt_index_parent_check alternatives.  We don't create any index
+# that would behave differently under these modes, so just smoke test that the
+# arguments are handled sensibly.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's1', '-i', 't1_btree', '--parent-check' ],
+	2,
+	[ $index_missing_relation_fork_re ],
+	[ $no_output_re ],
+	'pg_amcheck smoke test --parent-check');
+
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's1', '-i', 't1_btree', '--heapallindexed', '--rootdescend' ],
+	2,
+	[ $index_missing_relation_fork_re ],
+	[ $no_output_re ],
+	'pg_amcheck smoke test --heapallindexed --rootdescend');
diff --git a/contrib/pg_amcheck/t/004_verify_heapam.pl b/contrib/pg_amcheck/t/004_verify_heapam.pl
new file mode 100644
index 0000000000..3fdf18931f
--- /dev/null
+++ b/contrib/pg_amcheck/t/004_verify_heapam.pl
@@ -0,0 +1,487 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+
+use Test::More tests => 20;
+
+# This regression test demonstrates that the pg_amcheck binary supplied with
+# the pg_amcheck contrib module correctly identifies specific kinds of
+# corruption within pages.  To test this, we need a mechanism to create corrupt
+# pages with predictable, repeatable corruption.  The postgres backend cannot
+# be expected to help us with this, as its design is not consistent with the
+# goal of intentionally corrupting pages.
+#
+# Instead, we create a table to corrupt, and with careful consideration of how
+# PostgreSQL lays out heap pages, we seek to offsets within the page and
+# overwrite deliberately chosen bytes with specific values calculated to
+# corrupt the page in expected ways.  We then verify that pg_amcheck reports
+# the corruption, and that it runs without crashing.  Note that the backend
+# cannot simply be started to run queries against the corrupt table, as the
+# backend will crash, at least for some of the corruption types we generate.
+#
+# Autovacuum potentially touching the table in the background makes the exact
+# behavior of this test harder to reason about.  We turn it off to keep things
+# simpler.  We use a "belt and suspenders" approach, turning it off for the
+# system generally in postgresql.conf, and turning it off specifically for the
+# test table.
+#
+# This test depends on the table being written to the heap file exactly as we
+# expect it to be, so we take care to arrange the columns of the table, and to
+# insert its rows, in ways that give predictable sizes and locations within
+# the table page.
+#
+# The HeapTupleHeaderData has 23 bytes of fixed size fields before the variable
+# length t_bits[] array.  We have exactly 3 columns in the table, so natts = 3,
+# t_bits is 1 byte long, and t_hoff = MAXALIGN(23 + 1) = 24.
+#
+# We're not too fussy about which datatypes we use for the test, but we do care
+# about some specific properties.  We'd like to test both fixed size and
+# varlena types.  We'd like some varlena data inline and some toasted.  And
+# we'd like the layout of the table such that the datums land at predictable
+# offsets within the tuple.  We choose a structure without padding on all
+# supported architectures:
+#
+# 	a BIGINT
+#	b TEXT
+#	c TEXT
+#
+# We always insert a 7-character ASCII string into field 'b', which with a
+# 1-byte varlena header gives an 8-byte inline value.  We always insert a long
+# text string in field 'c', long enough to force toast storage.
+#
+# We choose to read and write binary copies of our table's tuples, using Perl's
+# pack() and unpack() functions.  Perl uses a packing code system in which:
+#
+#	L = "Unsigned 32-bit Long",
+#	S = "Unsigned 16-bit Short",
+#	C = "Unsigned 8-bit Octet",
+#	c = "signed 8-bit octet",
+#	q = "signed 64-bit quadword"
+#
+# Each tuple in our table has a layout as follows:
+#
+#    xx xx xx xx            t_xmin: xxxx		offset = 0		L
+#    xx xx xx xx            t_xmax: xxxx		offset = 4		L
+#    xx xx xx xx          t_field3: xxxx		offset = 8		L
+#    xx xx                   bi_hi: xx			offset = 12		S
+#    xx xx                   bi_lo: xx			offset = 14		S
+#    xx xx                ip_posid: xx			offset = 16		S
+#    xx xx             t_infomask2: xx			offset = 18		S
+#    xx xx              t_infomask: xx			offset = 20		S
+#    xx                     t_hoff: x			offset = 22		C
+#    xx                     t_bits: x			offset = 23		C
+#    xx xx xx xx xx xx xx xx   'a': xxxxxxxx	offset = 24		q
+#    xx xx xx xx xx xx xx xx   'b': xxxxxxxx	offset = 32		Cccccccc
+#    xx xx xx xx xx xx xx xx   'c': xxxxxxxx	offset = 40		SSSS
+#    xx xx xx xx xx xx xx xx      : xxxxxxxx	 ...continued	SSSS
+#    xx xx                        : xx      	 ...continued	S
+#
+# We could choose to read and write columns 'b' and 'c' in other ways, but
+# it is convenient enough to do it this way.  We define packing code
+# constants here, where they can be compared easily against the layout.
+
+use constant HEAPTUPLE_PACK_CODE => 'LLLSSSSSCCqCcccccccSSSSSSSSS';
+use constant HEAPTUPLE_PACK_LENGTH => 58;     # Total size
+
+# Read a tuple of our table from a heap page.
+#
+# Takes an open filehandle to the heap file, and the offset of the tuple.
+#
+# Rather than returning the binary data from the file, unpacks the data into a
+# perl hash with named fields.  These fields exactly match the ones understood
+# by write_tuple(), below.  Returns a reference to this hash.
+#
+sub read_tuple ($$)
+{
+	my ($fh, $offset) = @_;
+	my ($buffer, %tup);
+	seek($fh, $offset, 0);
+	sysread($fh, $buffer, HEAPTUPLE_PACK_LENGTH);
+
+	@_ = unpack(HEAPTUPLE_PACK_CODE, $buffer);
+	%tup = (t_xmin => shift,
+			t_xmax => shift,
+			t_field3 => shift,
+			bi_hi => shift,
+			bi_lo => shift,
+			ip_posid => shift,
+			t_infomask2 => shift,
+			t_infomask => shift,
+			t_hoff => shift,
+			t_bits => shift,
+			a => shift,
+			b_header => shift,
+			b_body1 => shift,
+			b_body2 => shift,
+			b_body3 => shift,
+			b_body4 => shift,
+			b_body5 => shift,
+			b_body6 => shift,
+			b_body7 => shift,
+			c1 => shift,
+			c2 => shift,
+			c3 => shift,
+			c4 => shift,
+			c5 => shift,
+			c6 => shift,
+			c7 => shift,
+			c8 => shift,
+			c9 => shift);
+	# Stitch together the text for column 'b'
+	$tup{b} = join('', map { chr($tup{"b_body$_"}) } (1..7));
+	return \%tup;
+}
+
+# Write a tuple of our table to a heap page.
+#
+# Takes an open filehandle to the heap file, the offset of the tuple, and a
+# reference to a hash with the tuple values, as returned by read_tuple().
+# Writes the tuple fields from the hash into the heap file.
+#
+# The purpose of this function is to write a tuple back to disk with some
+# subset of fields modified.  The function does no error checking.  Use
+# cautiously.
+#
+sub write_tuple($$$)
+{
+	my ($fh, $offset, $tup) = @_;
+	my $buffer = pack(HEAPTUPLE_PACK_CODE,
+					$tup->{t_xmin},
+					$tup->{t_xmax},
+					$tup->{t_field3},
+					$tup->{bi_hi},
+					$tup->{bi_lo},
+					$tup->{ip_posid},
+					$tup->{t_infomask2},
+					$tup->{t_infomask},
+					$tup->{t_hoff},
+					$tup->{t_bits},
+					$tup->{a},
+					$tup->{b_header},
+					$tup->{b_body1},
+					$tup->{b_body2},
+					$tup->{b_body3},
+					$tup->{b_body4},
+					$tup->{b_body5},
+					$tup->{b_body6},
+					$tup->{b_body7},
+					$tup->{c1},
+					$tup->{c2},
+					$tup->{c3},
+					$tup->{c4},
+					$tup->{c5},
+					$tup->{c6},
+					$tup->{c7},
+					$tup->{c8},
+					$tup->{c9});
+	seek($fh, $offset, 0);
+	syswrite($fh, $buffer, HEAPTUPLE_PACK_LENGTH);
+	return;
+}
+
+# Set umask so test directories and files are created with default permissions
+umask(0077);
+
+# Set up the node.  Once we create and corrupt the table,
+# autovacuum workers visiting the table could crash the backend.
+# Disable autovacuum so that won't happen.
+my $node = get_new_node('test');
+$node->init;
+$node->append_conf('postgresql.conf', 'autovacuum=off');
+
+# Start the node and load the extensions.  We depend on both
+# amcheck and pageinspect for this test.
+$node->start;
+my $port = $node->port;
+my $pgdata = $node->data_dir;
+$node->safe_psql('postgres', "CREATE EXTENSION amcheck");
+$node->safe_psql('postgres', "CREATE EXTENSION pageinspect");
+
+# Get a non-zero datfrozenxid
+$node->safe_psql('postgres', qq(VACUUM FREEZE));
+
+# Create the test table with precisely the schema that our corruption function
+# expects.
+$node->safe_psql(
+	'postgres', qq(
+		CREATE TABLE public.test (a BIGINT, b TEXT, c TEXT);
+		ALTER TABLE public.test SET (autovacuum_enabled=false);
+		ALTER TABLE public.test ALTER COLUMN c SET STORAGE EXTERNAL;
+		CREATE INDEX test_idx ON public.test(a, b);
+	));
+
+# We want (0 < datfrozenxid < test.relfrozenxid).  To achieve this, we freeze
+# an otherwise unused table, public.junk, prior to inserting data and freezing
+# public.test
+$node->safe_psql(
+	'postgres', qq(
+		CREATE TABLE public.junk AS SELECT 'junk'::TEXT AS junk_column;
+		ALTER TABLE public.junk SET (autovacuum_enabled=false);
+		VACUUM FREEZE public.junk
+	));
+
+my $rel = $node->safe_psql('postgres', qq(SELECT pg_relation_filepath('public.test')));
+my $relpath = "$pgdata/$rel";
+
+# Insert data and freeze public.test
+use constant ROWCOUNT => 16;
+$node->safe_psql('postgres', qq(
+	INSERT INTO public.test (a, b, c)
+		VALUES (
+			12345678,
+			'abcdefg',
+			repeat('w', 10000)
+		);
+	VACUUM FREEZE public.test
+	)) for (1..ROWCOUNT);
+
+my $relfrozenxid = $node->safe_psql('postgres',
+	q(select relfrozenxid from pg_class where relname = 'test'));
+my $datfrozenxid = $node->safe_psql('postgres',
+	q(select datfrozenxid from pg_database where datname = 'postgres'));
+
+# Find where each of the tuples is located on the page.
+my @lp_off;
+for my $tup (0..ROWCOUNT-1)
+{
+	push (@lp_off, $node->safe_psql('postgres', qq(
+select lp_off from heap_page_items(get_raw_page('test', 'main', 0))
+	offset $tup limit 1)));
+}
+
+# Check that pg_amcheck runs against the uncorrupted table without error.
+$node->command_ok(['pg_amcheck', '-p', $port, 'postgres'],
+				  'pg_amcheck test table, prior to corruption');
+
+# Check that pg_amcheck runs against the uncorrupted table and index without error.
+$node->command_ok(['pg_amcheck', '-p', $port, 'postgres'],
+				  'pg_amcheck test table and index, prior to corruption');
+
+$node->stop;
+
+# Sanity check that our 'test' table has a relfrozenxid newer than the
+# datfrozenxid for the database, and that the datfrozenxid is greater than the
+# first normal xid.  We rely on these invariants in some of our tests.
+if ($datfrozenxid <= 3 || $datfrozenxid >= $relfrozenxid)
+{
+	fail('Xid thresholds not as expected');
+	$node->clean_node;
+	exit;
+}
+
+# Some #define constants from access/htup_details.h for use while corrupting.
+use constant HEAP_HASNULL            => 0x0001;
+use constant HEAP_XMAX_LOCK_ONLY     => 0x0080;
+use constant HEAP_XMIN_COMMITTED     => 0x0100;
+use constant HEAP_XMIN_INVALID       => 0x0200;
+use constant HEAP_XMAX_COMMITTED     => 0x0400;
+use constant HEAP_XMAX_INVALID       => 0x0800;
+use constant HEAP_NATTS_MASK         => 0x07FF;
+use constant HEAP_XMAX_IS_MULTI      => 0x1000;
+use constant HEAP_KEYS_UPDATED       => 0x2000;
+
+# Helper function to generate a regular expression matching the header we
+# expect verify_heapam() to return given which fields we expect to be non-null.
+sub header
+{
+	my ($blkno, $offnum, $attnum) = @_;
+	return qr/relation postgres\.public\.test, block $blkno, offset $offnum, attribute $attnum\s+/ms
+		if (defined $attnum);
+	return qr/relation postgres\.public\.test, block $blkno, offset $offnum\s+/ms
+		if (defined $offnum);
+	return qr/relation postgres\.public\.test\s+/ms
+		if (defined $blkno);
+	return qr/relation postgres\.public\.test\s+/ms;
+}
+
+# Corrupt the tuples, one type of corruption per tuple.  Some types of
+# corruption cause verify_heapam to skip to the next tuple without
+# performing any remaining checks, so we can't exercise the system properly if
+# we focus all our corruption on a single tuple.
+#
+my @expected;
+my $file;
+open($file, '+<', $relpath);
+binmode $file;
+
+for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
+{
+	my $offnum = $tupidx + 1;  # offnum is 1-based, not zero-based
+	my $offset = $lp_off[$tupidx];
+	my $tup = read_tuple($file, $offset);
+
+	# Sanity-check that the data appears on the page where we expect.
+	if ($tup->{a} ne '12345678' || $tup->{b} ne 'abcdefg')
+	{
+		fail('Page layout differs from our expectations');
+		$node->clean_node;
+		exit;
+	}
+
+	my $header = header(0, $offnum, undef);
+	if ($offnum == 1)
+	{
+		# Corruptly set xmin < relfrozenxid
+		my $xmin = $relfrozenxid - 1;
+		$tup->{t_xmin} = $xmin;
+		$tup->{t_infomask} &= ~HEAP_XMIN_COMMITTED;
+		$tup->{t_infomask} &= ~HEAP_XMIN_INVALID;
+
+		# Expected corruption report
+		push @expected,
+			qr/${header}xmin $xmin precedes relation freeze threshold 0:\d+/;
+	}
+	elsif ($offnum == 2)
+	{
+		# Corruptly set xmin < datfrozenxid
+		my $xmin = 3;
+		$tup->{t_xmin} = $xmin;
+		$tup->{t_infomask} &= ~HEAP_XMIN_COMMITTED;
+		$tup->{t_infomask} &= ~HEAP_XMIN_INVALID;
+
+		push @expected,
+			qr/${header}xmin $xmin precedes oldest valid transaction ID 0:\d+/;
+	}
+	elsif ($offnum == 3)
+	{
+		# Corruptly set xmin < datfrozenxid, further back, noting circularity
+		# of xid comparison.  For a new cluster with epoch = 0, the corrupt
+		# xmin will be interpreted as in the future
+		$tup->{t_xmin} = 4026531839;
+		$tup->{t_infomask} &= ~HEAP_XMIN_COMMITTED;
+		$tup->{t_infomask} &= ~HEAP_XMIN_INVALID;
+
+		push @expected,
+			qr/${header}xmin 4026531839 equals or exceeds next valid transaction ID 0:\d+/;
+	}
+	elsif ($offnum == 4)
+	{
+		# Corruptly set xmax to a transaction ID in the future
+		$tup->{t_xmax} = 4026531839;
+		$tup->{t_infomask} &= ~HEAP_XMAX_INVALID;
+
+		push @expected,
+			qr/${header}xmax 4026531839 equals or exceeds next valid transaction ID 0:\d+/;
+	}
+	elsif ($offnum == 5)
+	{
+		# Corrupt the tuple t_hoff, but keep it aligned properly
+		$tup->{t_hoff} += 128;
+
+		push @expected,
+			qr/${header}data begins at offset 152 beyond the tuple length 58/,
+			qr/${header}tuple data should begin at byte 24, but actually begins at byte 152 \(3 attributes, no nulls\)/;
+	}
+	elsif ($offnum == 6)
+	{
+		# Corrupt the tuple t_hoff, wrong alignment
+		$tup->{t_hoff} += 3;
+
+		push @expected,
+			qr/${header}tuple data should begin at byte 24, but actually begins at byte 27 \(3 attributes, no nulls\)/;
+	}
+	elsif ($offnum == 7)
+	{
+		# Corrupt the tuple t_hoff, underflow but correct alignment
+		$tup->{t_hoff} -= 8;
+
+		push @expected,
+			qr/${header}tuple data should begin at byte 24, but actually begins at byte 16 \(3 attributes, no nulls\)/;
+	}
+	elsif ($offnum == 8)
+	{
+		# Corrupt the tuple t_hoff, underflow and wrong alignment
+		$tup->{t_hoff} -= 3;
+
+		push @expected,
+			qr/${header}tuple data should begin at byte 24, but actually begins at byte 21 \(3 attributes, no nulls\)/;
+	}
+	elsif ($offnum == 9)
+	{
+		# Corrupt the tuple to look like it has lots of attributes, not just 3
+		$tup->{t_infomask2} |= HEAP_NATTS_MASK;
+
+		push @expected,
+			qr/${$header}number of attributes 2047 exceeds maximum expected for table 3/;
+	}
+	elsif ($offnum == 10)
+	{
+		# Corrupt the tuple to look like it has lots of attributes, some of
+		# them null.  This falsely creates the impression that the t_bits
+		# array is longer than just one byte, but t_hoff still says otherwise.
+		$tup->{t_infomask} |= HEAP_HASNULL;
+		$tup->{t_infomask2} |= HEAP_NATTS_MASK;
+		$tup->{t_bits} = 0xAA;
+
+		push @expected,
+			qr/${$header}tuple data should begin at byte 280, but actually begins at byte 24 \(2047 attributes, has nulls\)/;
+	}
+	elsif ($offnum == 11)
+	{
+		# Same as above, but this time t_hoff plays along
+		$tup->{t_infomask} |= HEAP_HASNULL;
+		$tup->{t_infomask2} |= (HEAP_NATTS_MASK & 0x40);
+		$tup->{t_bits} = 0xAA;
+		$tup->{t_hoff} = 32;
+
+		push @expected,
+			qr/${$header}number of attributes 67 exceeds maximum expected for table 3/;
+	}
+	elsif ($offnum == 12)
+	{
+		# Corrupt the bits in column 'b' 1-byte varlena header
+		$tup->{b_header} = 0x80;
+
+		$header = header(0, $offnum, 1);
+		push @expected,
+			qr/${header}attribute 1 with length 4294967295 ends at offset 416848000 beyond total tuple length 58/;
+	}
+	elsif ($offnum == 13)
+	{
+		# Corrupt the bits in column 'c' toast pointer
+		$tup->{c6} = 41;
+		$tup->{c7} = 41;
+
+		$header = header(0, $offnum, 2);
+		push @expected,
+			qr/${header}final toast chunk number 0 differs from expected value 6/,
+			qr/${header}toasted value for attribute 2 missing from toast table/;
+	}
+	elsif ($offnum == 14)
+	{
+		# Set both HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI
+		$tup->{t_infomask} |= HEAP_XMAX_COMMITTED;
+		$tup->{t_infomask} |= HEAP_XMAX_IS_MULTI;
+		$tup->{t_xmax} = 4;
+
+		push @expected,
+			qr/${header}multitransaction ID 4 equals or exceeds next valid multitransaction ID 1/;
+	}
+	elsif ($offnum == 15)	# Last offnum must equal ROWCOUNT
+	{
+		# Set both HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI
+		$tup->{t_infomask} |= HEAP_XMAX_COMMITTED;
+		$tup->{t_infomask} |= HEAP_XMAX_IS_MULTI;
+		$tup->{t_xmax} = 4000000000;
+
+		push @expected,
+			qr/${header}multitransaction ID 4000000000 precedes relation minimum multitransaction ID threshold 1/;
+	}
+	write_tuple($file, $offset, $tup);
+}
+close($file);
+$node->start;
+
+# Run pg_amcheck against the corrupt table with epoch=0, comparing actual
+# corruption messages against the expected messages
+$node->command_checks_all(
+	['pg_amcheck', '--no-dependent-indexes', '-p', $port, 'postgres'],
+	2,
+	[ @expected ],
+	[ ],
+	'Expected corruption message output');
+
+$node->teardown_node;
+$node->clean_node;
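The epoch-0 behavior these xmin/xmax corruptions rely on can be sketched outside the server. The Python model below is illustrative only; the `full_xid` helper is invented for this sketch and is not PostgreSQL's actual FullTransactionId machinery. The point it demonstrates: a 32-bit xid that was pushed "far into the past" wraps around, and once widened with a fresh cluster's epoch of 0 it compares as being beyond the next valid transaction ID, which is why the expected message is "equals or exceeds next valid transaction ID" rather than "precedes".

```python
# Illustrative sketch (hypothetical helper name, not PostgreSQL code):
# widening a 32-bit xid with the cluster epoch into a 64-bit full XID.

EPOCH_SHIFT = 32

def full_xid(epoch: int, xid: int) -> int:
    """Widen a 32-bit xid to a 64-bit full transaction ID."""
    return (epoch << EPOCH_SHIFT) | xid

# A freshly initdb'd cluster: epoch 0, next XID still small.
next_full = full_xid(0, 1000)

# The test wraps xmin around to 4026531839.  Widened with the current
# epoch (0), that value is *larger* than the next valid full XID,
# i.e. it reads as "in the future".
corrupt = full_xid(0, 4026531839)

print(corrupt > next_full)  # True: flagged as equalling or exceeding
                            # the next valid transaction ID
```

This is also why the test only behaves this way on a new cluster: on a cluster whose epoch has advanced, the same 32-bit value would widen differently.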
diff --git a/contrib/pg_amcheck/t/005_opclass_damage.pl b/contrib/pg_amcheck/t/005_opclass_damage.pl
new file mode 100644
index 0000000000..eba8ea9cae
--- /dev/null
+++ b/contrib/pg_amcheck/t/005_opclass_damage.pl
@@ -0,0 +1,54 @@
+# This regression test checks the behavior of the btree validation in the
+# presence of breaking sort order changes.
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 5;
+
+my $node = get_new_node('test');
+$node->init;
+$node->start;
+
+# Create a custom operator class and an index which uses it.
+$node->safe_psql('postgres', q(
+	CREATE EXTENSION amcheck;
+
+	CREATE FUNCTION int4_asc_cmp (a int4, b int4) RETURNS int LANGUAGE sql AS $$
+		SELECT CASE WHEN $1 = $2 THEN 0 WHEN $1 > $2 THEN 1 ELSE -1 END; $$;
+
+	CREATE OPERATOR CLASS int4_fickle_ops FOR TYPE int4 USING btree AS
+	    OPERATOR 1 < (int4, int4), OPERATOR 2 <= (int4, int4),
+	    OPERATOR 3 = (int4, int4), OPERATOR 4 >= (int4, int4),
+	    OPERATOR 5 > (int4, int4), FUNCTION 1 int4_asc_cmp(int4, int4);
+
+	CREATE TABLE int4tbl (i int4);
+	INSERT INTO int4tbl (SELECT * FROM generate_series(1,1000) gs);
+	CREATE INDEX fickleidx ON int4tbl USING btree (i int4_fickle_ops);
+));
+
+# We have not yet broken the index, so we should get no corruption
+$node->command_like(
+	[ 'pg_amcheck', '--quiet', '-p', $node->port, 'postgres' ],
+	qr/^$/,
+	'pg_amcheck all schemas, tables and indexes reports no corruption');
+
+# Change the operator class to use a function which sorts in a different
+# order to corrupt the btree index
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION int4_desc_cmp (int4, int4) RETURNS int LANGUAGE sql AS $$
+		SELECT CASE WHEN $1 = $2 THEN 0 WHEN $1 > $2 THEN -1 ELSE 1 END; $$;
+	UPDATE pg_catalog.pg_amproc
+		SET amproc = 'int4_desc_cmp'::regproc
+		WHERE amproc = 'int4_asc_cmp'::regproc
+));
+
+# Index corruption should now be reported
+$node->command_checks_all(
+	[ 'pg_amcheck', '-p', $node->port, 'postgres' ],
+	2,
+	[ qr/item order invariant violated for index "fickleidx"/ ],
+	[ ],
+	'pg_amcheck all schemas, tables and indexes reports fickleidx corruption'
+);
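Why swapping the comparison support function is detected can be modeled without a server. This is a toy sketch, not amcheck's actual C code: the function names mirror the SQL functions the test defines, and `order_invariant_holds` is an invented stand-in for amcheck's item order invariant, which requires each index item to compare less than or equal to its successor under the *current* opclass comparator. Items laid out under the ascending comparator violate that invariant under the descending one.

```python
def int4_asc_cmp(a: int, b: int) -> int:
    # Mirrors the ascending SQL comparator defined in the test.
    return 0 if a == b else (1 if a > b else -1)

def int4_desc_cmp(a: int, b: int) -> int:
    # Mirrors the descending comparator swapped in via pg_amproc.
    return 0 if a == b else (-1 if a > b else 1)

def order_invariant_holds(items, cmp) -> bool:
    """Toy version of amcheck's item order invariant: every item must
    compare <= its successor under the current comparator."""
    return all(cmp(items[i], items[i + 1]) <= 0
               for i in range(len(items) - 1))

# Index entries were laid out while int4_asc_cmp was the support function:
index_items = list(range(1, 1001))

print(order_invariant_holds(index_items, int4_asc_cmp))   # True
print(order_invariant_holds(index_items, int4_desc_cmp))  # False
```

The second check failing corresponds to the "item order invariant violated" message the test expects after the `pg_amproc` update.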
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index d3ca4b6932..7e101f7c11 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -185,6 +185,7 @@ pages.
   </para>
 
  &oid2name;
+ &pgamcheck;
  &vacuumlo;
  </sect1>
 
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index db1d369743..5115cb03d0 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -133,6 +133,7 @@
 <!ENTITY oldsnapshot     SYSTEM "oldsnapshot.sgml">
 <!ENTITY pageinspect     SYSTEM "pageinspect.sgml">
 <!ENTITY passwordcheck   SYSTEM "passwordcheck.sgml">
+<!ENTITY pgamcheck       SYSTEM "pgamcheck.sgml">
 <!ENTITY pgbuffercache   SYSTEM "pgbuffercache.sgml">
 <!ENTITY pgcrypto        SYSTEM "pgcrypto.sgml">
 <!ENTITY pgfreespacemap  SYSTEM "pgfreespacemap.sgml">
diff --git a/doc/src/sgml/pgamcheck.sgml b/doc/src/sgml/pgamcheck.sgml
new file mode 100644
index 0000000000..dbb9945e67
--- /dev/null
+++ b/doc/src/sgml/pgamcheck.sgml
@@ -0,0 +1,682 @@
+<!-- doc/src/sgml/pgamcheck.sgml -->
+
+<refentry id="pgamcheck">
+ <indexterm zone="pgamcheck">
+  <primary>pg_amcheck</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle><application>pg_amcheck</application></refentrytitle>
+  <manvolnum>1</manvolnum>
+  <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>pg_amcheck</refname>
+  <refpurpose>checks for corruption in one or more
+  <productname>PostgreSQL</productname> databases</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+  <cmdsynopsis>
+   <command>pg_amcheck</command>
+   <arg rep="repeat"><replaceable>option</replaceable></arg>
+   <arg rep="repeat"><replaceable>dbname</replaceable></arg>
+  </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+   <application>pg_amcheck</application> supports running
+   <xref linkend="amcheck"/>'s corruption checking functions against one or
+   more databases, with options to select which schemas, tables and indexes to
+   check, which kinds of checking to perform, and whether to perform the checks
+   in parallel, and if so, the number of parallel connections to establish and
+   use.
+  </para>
+
+  <para>
+   Only table relations and btree indexes are currently supported.  Other
+   relation types are silently skipped.
+  </para>
+
+ </refsect1>
+
+ <refsect1>
+  <title>Options</title>
+
+  <para>
+   <application>pg_amcheck</application> accepts the following command-line arguments:
+
+   <variablelist>
+
+    <varlistentry>
+     <term><option><replaceable class="parameter">dbname</replaceable></option></term>
+     <listitem>
+      <para>
+       Specifies the name of a database to be checked.
+      </para>
+      <para>
+       If no <replaceable>dbname</replaceable> is specified, and if
+       <option>-a</option> <option>--all</option> is not used, the database name
+       is read from the environment variable <envar>PGDATABASE</envar>.  If
+       that is not set, the user name specified for the connection is used.
+       The <replaceable>dbname</replaceable> can be a <link
+       linkend="libpq-connstring">connection string</link>.  If so, connection
+       string parameters will override any conflicting command line options.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-a</option></term>
+     <term><option>--all</option></term>
+       <listitem>
+      <para>
+       Perform checking in all databases which are not otherwise excluded.
+      </para>
+      <para>
+       In the absence of any other options, selects all objects across all
+       schemas and databases.
+      </para>
+      <para>
+       Option <option>-D</option> <option>--exclude-database</option> takes
+       precedence over <option>-a</option> <option>--all</option>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-d <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--database=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Perform checking in databases matching the specified
+       <replaceable>pattern</replaceable> that are not otherwise excluded.
+      </para>
+      <para>
+       This option may be specified multiple times to list more than one
+       pattern.  By default, all objects in all matching databases will be
+       checked.
+      </para>
+      <para>
+       If <option>-a</option> <option>--all</option> is also specified,
+       <option>-d</option> <option>--database</option> does not additionally
+       affect which databases are checked.
+      </para>
+      <para>
+       Option <option>-D</option> <option>--exclude-database</option> takes
+       precedence over <option>-d</option> <option>--database</option>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-D <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--exclude-database=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Exclude databases matching the specified exclusion
+       <replaceable>pattern</replaceable>, even if they match other patterns
+       or are included by option <option>-a</option> <option>--all</option>.
+      </para>
+      <para>
+       This does not exclude any database that was listed explicitly as a
+       <replaceable>dbname</replaceable> on the command line, nor does it exclude
+       the database chosen in the absence of any
+       <replaceable>dbname</replaceable> argument.
+      </para>
+      <para>
+       This option may be specified multiple times to list more than one
+       exclusion pattern.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-e</option></term>
+     <term><option>--echo</option></term>
+     <listitem>
+      <para>
+       Print to stdout all commands and queries being executed against the
+       server.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--endblock=<replaceable class="parameter">block</replaceable></option></term>
+     <listitem>
+      <para>
+       Skip (do not check) all pages after the given ending
+       <replaceable>block</replaceable>.
+      </para>
+      <para>
+       By default, no pages are skipped.  This option will be applied to all
+       table relations that are checked, including toast tables, but note that
+       unless <option>--exclude-toast-pointers</option> is given, toast
+       pointers found in the main table will be followed into the toast table
+       without regard for the location in the toast table.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--exclude-toast-pointers</option></term>
+     <listitem>
+      <para>
+       When checking main relations, do not look up entries in toast tables
+       corresponding to toast pointers in the main relation.
+      </para>
+      <para>
+       The default behavior checks each toast pointer encountered in the main
+       table to verify, as much as possible, that the pointer points at
+       something in the toast table that is reasonable.  Toast pointers which
+       point beyond the end of the toast table, or to the middle (rather than
+       the beginning) of a toast entry, are identified as corrupt.
+      </para>
+      <para>
+       The process by which <xref linkend="amcheck"/>'s
+       <function>verify_heapam</function> function checks each toast pointer is
+       slow and may be improved in a future release.  Some users may wish to
+       disable this check to save time.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--heapallindexed</option></term>
+     <listitem>
+      <para>
+       For each index checked, verify the presence of all heap tuples as index
+       tuples in the index using <xref linkend="amcheck"/>'s
+       <option>heapallindexed</option> option.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-?</option></term>
+     <term><option>--help</option></term>
+     <listitem>
+      <para>
+       Show help about <application>pg_amcheck</application> command line
+       arguments, and exit.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-h <replaceable class="parameter">hostname</replaceable></option></term>
+     <term><option>--host=<replaceable class="parameter">hostname</replaceable></option></term>
+     <listitem>
+      <para>
+       Specifies the host name of the machine on which the server is running.
+       If the value begins with a slash, it is used as the directory for the
+       Unix domain socket.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-i <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--index=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Perform checks on indexes which match the specified
+       <replaceable>pattern</replaceable> unless they are otherwise excluded.
+      </para>
+      <para>
+       This is an alias for the <option>-r</option> <option>--relation</option>
+       option, except that it applies only to indexes, not tables.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-I <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--exclude-index=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Exclude checks on the indexes which match the specified <replaceable>pattern</replaceable>.
+      </para>
+      <para>
+       This is an alias for the <option>-R</option>
+       <option>--exclude-relation</option> option, except that it applies only
+       to indexes, not tables.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-j <replaceable class="parameter">num</replaceable></option></term>
+     <term><option>--jobs=<replaceable class="parameter">num</replaceable></option></term>
+     <listitem>
+      <para>
+       Use <replaceable>num</replaceable> concurrent connections to the server,
+       or one per object to be checked, whichever number is smaller.
+      </para>
+      <para>
+       The default is to use a single connection.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--maintenance-db=<replaceable class="parameter">dbname</replaceable></option></term>
+     <listitem>
+      <para>
+       Specifies the name of the database to connect to in order to discover
+       which databases should be checked when
+       <option>-a</option>/<option>--all</option> is used.  If not specified,
+       the <literal>postgres</literal> database will be used, or if that does
+       not exist, <literal>template1</literal> will be used.  This can be a
+       <link linkend="libpq-connstring">connection string</link>.  If so,
+       connection string parameters will override any conflicting command line
+       options.  Also, connection string parameters other than the database
+       name itself will be re-used when connecting to other databases.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--no-dependent-indexes</option></term>
+     <listitem>
+      <para>
+       When including a table relation in the list of relations to check, do
+       not automatically include btree indexes associated with the table.
+      </para>
+      <para>
+       By default, all tables to be checked will also have checks performed on
+       their associated btree indexes, if any.  If this option is given, only
+       those indexes which match a <option>--relation</option> or
+       <option>--index</option> pattern will be checked.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--no-strict-names</option></term>
+     <listitem>
+      <para>
+       When calculating the list of databases to check, and the objects within
+       those databases to be checked, do not raise an error for database,
+       schema, relation, table, or index inclusion patterns which match no
+       corresponding objects.
+      </para>
+      <para>
+       Exclusion patterns are not required to match any objects, but by
+       default an unmatched inclusion pattern raises an error.  This includes
+       patterns that fail to match only because an exclusion pattern filtered
+       out the objects they would have matched, and patterns that match only
+       databases which do not accept connections (datallowconn is false).
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--no-dependent-toast</option></term>
+     <listitem>
+      <para>
+       When including a table relation in the list of relations to check, do
+       not automatically include toast tables associated with the table.
+      </para>
+      <para>
+       By default, all tables to be checked will also have checks performed on
+       their associated toast tables, if any.  If this option is given, only
+       those toast tables which match a <option>--relation</option> or
+       <option>--table</option> pattern will be checked.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--on-error-stop</option></term>
+     <listitem>
+      <para>
+       After reporting all corruptions on the first page of a table where
+       corruptions are found, stop processing that table relation and move on
+       to the next table or index.
+      </para>
+      <para>
+       Note that index checking always stops after the first corrupt page.
+       This option only has meaning relative to table relations.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--parent-check</option></term>
+     <listitem>
+      <para>
+       For each btree index checked, use <xref linkend="amcheck"/>'s
+       <function>bt_index_parent_check</function> function, which performs
+       additional checks of parent/child relationships during index checking.
+      </para>
+      <para>
+       The default is to use <application>amcheck</application>'s
+       <function>bt_index_check</function> function, but note that use of the
+       <option>--rootdescend</option> option implicitly selects
+       <function>bt_index_parent_check</function>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-p <replaceable class="parameter">port</replaceable></option></term>
+     <term><option>--port=<replaceable class="parameter">port</replaceable></option></term>
+     <listitem>
+      <para>
+       Specifies the TCP port or local Unix domain socket file extension on
+       which the server is listening for connections.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-P</option></term>
+     <term><option>--progress</option></term>
+     <listitem>
+      <para>
+       Show progress information about how many relations have been checked.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-q</option></term>
+     <term><option>--quiet</option></term>
+     <listitem>
+      <para>
+       Do not write additional messages beyond those about corruption.
+      </para>
+      <para>
+       This option does not suppress the output produced by the
+       <option>-e</option> <option>--echo</option> option.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-r <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--relation=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Perform checking on all relations matching the specified <replaceable>pattern</replaceable>
+       unless they are otherwise excluded.
+      </para>
+      <para>
+       This option may be specified multiple times to list more than one
+       pattern.
+      </para>
+      <para>
+       Patterns may be unqualified, or they may be schema-qualified or
+       database- and schema-qualified, such as
+       <literal>"my*relation"</literal>,
+       <literal>"my*schema*.my*relation*"</literal>, or
+       <literal>"my*database.my*schema.my*relation"</literal>.  A relation
+       pattern may match relations in databases that are not otherwise
+       included; such relations will still be checked.
+      </para>
+      <para>
+       Option <option>-R</option> <option>--exclude-relation</option> takes
+       precedence over <option>-r</option> <option>--relation</option>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-R <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--exclude-relation=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Exclude checks on relations matching the specified
+       <replaceable>pattern</replaceable>.
+      </para>
+      <para>
+       As with <option>-r</option> <option>--relation</option>, the
+       <replaceable>pattern</replaceable> may be unqualified, schema-qualified,
+       or database- and schema-qualified.
+      </para>
+      <para>
+       Option <option>-R</option> <option>--exclude-relation</option> takes
+       precedence over <option>-r</option> <option>--relation</option>,
+       <option>-t</option> <option>--table</option> and <option>-i</option>
+       <option>--index</option>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--rootdescend</option></term>
+     <listitem>
+      <para>
+       For each index checked, re-find tuples on the leaf level by performing a
+       new search from the root page for each tuple using
+       <xref linkend="amcheck"/>'s <option>rootdescend</option> option.
+      </para>
+      <para>
+       Use of this option implicitly also selects the
+       <option>--parent-check</option> option.
+      </para>
+      <para>
+       This form of verification was originally written to help in the
+       development of btree index features.  It may be of limited use or even
+       of no use in helping detect the kinds of corruption that occur in
+       practice.  It may also cause corruption checking to take considerably
+       longer and consume considerably more resources on the server.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-s <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--schema=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Perform checking in schemas matching the specified
+       <replaceable>pattern</replaceable> that are not otherwise excluded.
+      </para>
+      <para>
+       This option may be specified multiple times to list more than one
+       pattern for checking.  By default, all objects in all matching schemas
+       will be checked.
+      </para>
+      <para>
+       Option <option>-S</option> <option>--exclude-schema</option> takes
+       precedence over <option>-s</option> <option>--schema</option>.
+      </para>
+      <para>
+       Note that both tables and indexes are included using this option, which
+       might not be what you want if you are also using
+       <option>--no-dependent-indexes</option>.  To specify all tables in a
+       schema without also specifying all indexes, <option>--table</option> can
+       be used with a pattern that specifies the schema.  For example, to check
+       all tables in schema <literal>corp</literal>, the option
+       <literal>--table="corp.*"</literal> may be used.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-S <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--exclude-schema=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Do not perform checking in schemas matching the specified <replaceable>pattern</replaceable>.
+      </para>
+      <para>
+       This option may be specified multiple times to list more than one
+       pattern for exclusion.
+      </para>
+      <para>
+       If a schema which is included using
+       <option>-s</option> <option>--schema</option> is also excluded using
+       <option>-S</option> <option>--exclude-schema</option>, the schema will
+       be excluded.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--skip=<replaceable class="parameter">option</replaceable></option></term>
+     <listitem>
+      <para>
+       If <literal>"all-frozen"</literal> is given, table corruption checks
+       will skip over pages in all tables that are marked as all frozen.
+      </para>
+      <para>
+       If <literal>"all-visible"</literal> is given, table corruption checks
+       will skip over pages in all tables that are marked as all visible.
+      </para>
+      <para>
+       By default, no pages are skipped.  This can be specified as
+       <literal>"none"</literal>, but since this is the default, it need not be
+       mentioned.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--startblock=<replaceable class="parameter">block</replaceable></option></term>
+     <listitem>
+      <para>
+       Skip (do not check) pages prior to the given starting block.
+      </para>
+      <para>
+       By default, no pages are skipped.  This option will be applied to all
+       table relations that are checked, including toast tables, but note
+       that unless <option>--exclude-toast-pointers</option> is given, toast
+       pointers found in the main table will be followed into the toast table
+       without regard for the location in the toast table.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-t <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--table=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Perform checks on all tables matching the specified
+       <replaceable>pattern</replaceable> unless they are otherwise excluded.
+      </para>
+      <para>
+       This is an alias for the <option>-r</option> <option>--relation</option>
+       option, except that it applies only to tables, not indexes.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-T <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--exclude-table=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Exclude checks on tables matching the specified
+       <replaceable>pattern</replaceable>.
+      </para>
+      <para>
+       This is an alias for the <option>-R</option>
+       <option>--exclude-relation</option> option, except that it applies only
+       to tables, not indexes.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-U <replaceable class="parameter">username</replaceable></option></term>
+     <term><option>--username=<replaceable class="parameter">username</replaceable></option></term>
+     <listitem>
+      <para>
+       User name to connect as.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-v</option></term>
+     <term><option>--verbose</option></term>
+     <listitem>
+      <para>
+       Increases the log level verbosity.  This option may be given more than
+       once.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-V</option></term>
+     <term><option>--version</option></term>
+     <listitem>
+      <para>
+       Print the <application>pg_amcheck</application> version and exit.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-w</option></term>
+     <term><option>--no-password</option></term>
+     <listitem>
+      <para>
+       Never issue a password prompt.  If the server requires password
+       authentication and a password is not available by other means such as
+       a <filename>.pgpass</filename> file, the connection attempt will fail.
+       This option can be useful in batch jobs and scripts where no user is
+       present to enter a password.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-W</option></term>
+     <term><option>--password</option></term>
+     <listitem>
+      <para>
+       Force <application>pg_amcheck</application> to prompt for a password
+       before connecting to a database.
+      </para>
+      <para>
+       This option is never essential, since
+       <application>pg_amcheck</application> will automatically prompt for a
+       password if the server demands password authentication.  However,
+       <application>pg_amcheck</application> will waste a connection attempt
+       finding out that the server wants a password.  In some cases it is
+       worth typing <option>-W</option> to avoid the extra connection attempt.
+      </para>
+     </listitem>
+    </varlistentry>
+
+   </variablelist>
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+   <application>pg_amcheck</application> is designed to work with
+   <productname>PostgreSQL</productname> 14.0 and later.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Author</title>
+
+  <para>
+   Mark Dilger <email>mark.dilger@enterprisedb.com</email>
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>See Also</title>
+
+  <simplelist type="inline">
+   <member><xref linkend="amcheck"/></member>
+  </simplelist>
+ </refsect1>
+</refentry>
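The patterns accepted by --database, --schema, --relation and friends are psql-style patterns, as the examples in the --relation description suggest. The rough Python sketch below shows one way such a pattern could map onto a regular expression; it is purely illustrative (the `pattern_to_regex` helper is invented, and the real translation is done in pg_amcheck's query generation, which also handles quoting, case folding, and other cases this sketch ignores).

```python
import re

def pattern_to_regex(pattern: str) -> str:
    """Rough psql-style pattern translation: '*' matches any sequence,
    '?' any single character; everything else is literal.
    (Illustrative sketch; ignores quoting and case folding.)"""
    out = "^"
    for ch in pattern:
        if ch == "*":
            out += ".*"
        elif ch == "?":
            out += "."
        else:
            out += re.escape(ch)
    return out + "$"

# A database- and schema-qualified pattern splits on the dots first
# (naively here -- quoted dots are not handled); each part is then
# translated separately.
parts = "my*database*.my*schema*.my*relation*".split(".")
regexes = [pattern_to_regex(p) for p in parts]

print(bool(re.match(regexes[2], "my_relation_42")))  # True
print(bool(re.match(regexes[0], "otherdb")))         # False
```

The anchoring with `^` and `$` reflects that these patterns must match the whole name, so `--table="corp.*"` selects every table in schema corp rather than tables whose names merely contain "corp.".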
diff --git a/src/tools/msvc/Install.pm b/src/tools/msvc/Install.pm
index ea3af48777..49ad558b74 100644
--- a/src/tools/msvc/Install.pm
+++ b/src/tools/msvc/Install.pm
@@ -18,7 +18,7 @@ our (@ISA, @EXPORT_OK);
 @EXPORT_OK = qw(Install);
 
 my $insttype;
-my @client_contribs = ('oid2name', 'pgbench', 'vacuumlo');
+my @client_contribs = ('oid2name', 'pg_amcheck', 'pgbench', 'vacuumlo');
 my @client_program_files = (
 	'clusterdb',      'createdb',   'createuser',    'dropdb',
 	'dropuser',       'ecpg',       'libecpg',       'libecpg_compat',
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 49614106dc..f680544e07 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -33,9 +33,9 @@ my @unlink_on_exit;
 
 # Set of variables for modules in contrib/ and src/test/modules/
 my $contrib_defines = { 'refint' => 'REFINT_VERBOSE' };
-my @contrib_uselibpq = ('dblink', 'oid2name', 'postgres_fdw', 'vacuumlo');
-my @contrib_uselibpgport   = ('oid2name', 'vacuumlo');
-my @contrib_uselibpgcommon = ('oid2name', 'vacuumlo');
+my @contrib_uselibpq = ('dblink', 'oid2name', 'pg_amcheck', 'postgres_fdw', 'vacuumlo');
+my @contrib_uselibpgport   = ('oid2name', 'pg_amcheck', 'vacuumlo');
+my @contrib_uselibpgcommon = ('oid2name', 'pg_amcheck', 'vacuumlo');
 my $contrib_extralibs      = undef;
 my $contrib_extraincludes = { 'dblink' => ['src/backend'] };
 my $contrib_extrasource = {
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b1dec43f9d..a0dfe164cd 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -101,6 +101,7 @@ AlterUserMappingStmt
 AlteredTableInfo
 AlternativeSubPlan
 AlternativeSubPlanState
+AmcheckOptions
 AnalyzeAttrComputeStatsFunc
 AnalyzeAttrFetchFunc
 AnalyzeForeignTable_function
@@ -499,6 +500,7 @@ DSA
 DWORD
 DataDumperPtr
 DataPageDeleteStack
+DatabaseInfo
 DateADT
 Datum
 DatumTupleFields
@@ -2084,6 +2086,7 @@ RelToCluster
 RelabelType
 Relation
 RelationData
+RelationInfo
 RelationPtr
 RelationSyncEntry
 RelcacheCallbackFunction
-- 
2.21.1 (Apple Git-122.3)

v43-0003-Extending-PostgresNode-to-test-corruption.patch
From 321aa07b22a97fab54521ccf0bf7bc2cc9cc510e Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Tue, 2 Feb 2021 12:37:58 -0800
Subject: [PATCH v43 3/3] Extending PostgresNode to test corruption.

PostgresNode now has functions for overwriting relation files
with full or partial prior versions of those files, creating
corruption beyond merely twiddling the bits of a heap relation
file.

Adding a regression test for pg_amcheck based on this new
functionality.
---
 contrib/pg_amcheck/t/006_relfile_damage.pl    | 145 ++++++++++
 src/test/modules/Makefile                     |   1 +
 src/test/modules/corruption/Makefile          |  16 ++
 .../modules/corruption/t/001_corruption.pl    |  83 ++++++
 src/test/perl/PostgresNode.pm                 | 261 ++++++++++++++++++
 5 files changed, 506 insertions(+)
 create mode 100644 contrib/pg_amcheck/t/006_relfile_damage.pl
 create mode 100644 src/test/modules/corruption/Makefile
 create mode 100644 src/test/modules/corruption/t/001_corruption.pl

diff --git a/contrib/pg_amcheck/t/006_relfile_damage.pl b/contrib/pg_amcheck/t/006_relfile_damage.pl
new file mode 100644
index 0000000000..45ad223531
--- /dev/null
+++ b/contrib/pg_amcheck/t/006_relfile_damage.pl
@@ -0,0 +1,145 @@
+use strict;
+use warnings;
+
+use TestLib;
+use Test::More tests => 22;
+use PostgresNode;
+
+my ($node, $port);
+
+# Returns the name of the toast relation associated with the named relation.
+#
+# Assumes the test node is running
+sub relation_toast($$)
+{
+	my ($dbname, $relname) = @_;
+
+	my $rel = $node->safe_psql($dbname, qq(
+		SELECT ct.relname
+			FROM pg_catalog.pg_class cr, pg_catalog.pg_class ct
+			WHERE cr.oid = '$relname'::regclass
+			  AND cr.reltoastrelid = ct.oid
+			));
+	return undef unless defined $rel;
+	return "pg_toast.$rel";
+}
+
+# Test set-up
+$node = get_new_node('test');
+$node->init;
+$node->start;
+$port = $node->port;
+
+# Load the amcheck extension, upon which pg_amcheck depends
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+
+# Create a table with a btree index.  Use a fillfactor for the table and index
+# that will allow some fraction of updates to be on the original pages and some
+# on new pages.
+#
+$node->safe_psql('postgres', qq(
+create schema t;
+create table t.t1 (id integer, t text) with (fillfactor=75);
+alter table t.t1 alter column t set storage external;
+insert into t.t1 select gs, repeat('x',gs) from generate_series(9990,10000) gs;
+create index t1_idx on t.t1 (id) with (fillfactor=75);
+));
+
+my $toastrel = relation_toast('postgres', 't.t1');
+
+# Flush relation files to disk and take snapshots of the toast and index
+#
+$node->restart;
+$node->take_relfile_snapshot_minimal('postgres', 'idx', 't.t1_idx');
+$node->take_relfile_snapshot_minimal('postgres', 'toast', $toastrel);
+
+# Insert new data into the table and index
+#
+$node->safe_psql('postgres', qq(
+insert into t.t1 select gs, repeat('y',gs) from generate_series(10001,10100) gs;
+));
+
+# Revert index.  The reverted snapshot file is not corrupt, but it also
+# does not match the current contents of the table.
+#
+$node->stop;
+$node->revert_to_snapshot('idx');
+
+# Restart the node and check table and index with varying options.
+#
+$node->start;
+
+# Checks which do not reconcile the index and table via --heapallindexed will
+# not notice any problems
+#
+$node->command_like(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*' ],
+	qr/^$/,
+	'pg_amcheck reverted index at default checking level');
+
+$node->command_like(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*' ],
+	qr/^$/,
+	'pg_amcheck reverted index at default checking level');
+
+$node->command_like(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*', '--parent-check' ],
+	qr/^$/,
+	'pg_amcheck reverted index with --parent-check');
+
+$node->command_like(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*', '--rootdescend' ],
+	qr/^$/,
+	'pg_amcheck reverted index with --rootdescend');
+
+# Checks which do reconcile the index and table via --heapallindexed will
+# notice the mismatch in their contents
+#
+$node->command_checks_all(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*', '--heapallindexed' ],
+	2,
+	[ qr/heap tuple .* from table "t1" lacks matching index tuple within index "t1_idx"/ ],
+	[ ],
+	'pg_amcheck reverted index with --heapallindexed');
+
+$node->command_checks_all(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*', '--heapallindexed', '--rootdescend' ],
+	2,
+	[ qr/heap tuple .* from table "t1" lacks matching index tuple within index "t1_idx"/ ],
+	[ ],
+	'pg_amcheck reverted index with --heapallindexed --rootdescend');
+
+# Revert the toast.  The reverted toast table is not corrupt, but it does not
+# have entries for all toast pointers in the main table
+#
+$node->stop;
+$node->revert_to_snapshot('toast');
+
+# Restart the node and check table and toast with varying options.  When
+# checking the toast pointers, we may get errors produced by verify_heapam, but
+# we may also get errors from failure to read toast blocks that are beyond the
+# end of the toast table, of the form /ERROR:  could not read block/.  To avoid
+# having a brittle test, we accept any error message.
+#
+$node->start;
+
+$node->command_checks_all(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', $toastrel ],
+	0,
+	[ qr/^$/ ],
+	[ ],
+	'pg_amcheck reverted toast table');
+
+$node->command_checks_all(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*', '--exclude-toast-pointers' ],
+	0,
+	[ qr/^$/ ],
+	[ ],
+	'pg_amcheck with reverted toast using --exclude-toast-pointers');
+
+$node->command_checks_all(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*' ],
+	2,
+	[ qr/.+/ ],			# Any non-empty error message is acceptable
+	[ ],
+	'pg_amcheck with reverted toast and default checking');
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 5391f461a2..c92d1702b4 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -7,6 +7,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = \
 		  brin \
 		  commit_ts \
+		  corruption \
 		  delay_execution \
 		  dummy_index_am \
 		  dummy_seclabel \
diff --git a/src/test/modules/corruption/Makefile b/src/test/modules/corruption/Makefile
new file mode 100644
index 0000000000..ba461c645d
--- /dev/null
+++ b/src/test/modules/corruption/Makefile
@@ -0,0 +1,16 @@
+# src/test/modules/corruption/Makefile
+
+# EXTRA_INSTALL = contrib/pg_amcheck
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/corruption
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/corruption/t/001_corruption.pl b/src/test/modules/corruption/t/001_corruption.pl
new file mode 100644
index 0000000000..ae4a262e06
--- /dev/null
+++ b/src/test/modules/corruption/t/001_corruption.pl
@@ -0,0 +1,83 @@
+use strict;
+use warnings;
+
+use TestLib;
+use Test::More tests => 10;
+use PostgresNode;
+
+my $node = get_new_node('test');
+$node->init;
+$node->start;
+
+# Create something non-trivial for the first snapshot
+$node->safe_psql('postgres', qq(
+create table t1 (id integer, short_text text, long_text text);
+insert into t1 (id, short_text, long_text)
+	(select gs, 'foo', repeat('x', gs)
+		from generate_series(1,10000) gs);
+create unique index idx1 on t1 (id, short_text);
+vacuum freeze;
+));
+
+# Flush relation files to disk and take snapshot of them
+$node->restart;
+$node->take_relfile_snapshot('postgres', 'snap1', 'public.t1');
+
+# Update data in the table, toast table, and index
+$node->safe_psql('postgres', qq(
+update t1 set
+	short_text = 'bar',
+	long_text = repeat('y', id);
+));
+
+# Flush relation files to disk and take second snapshot
+$node->restart;
+$node->take_relfile_snapshot('postgres', 'snap2', 'public.t1');
+
+# Revert the first page of t1 using a torn snapshot.  This should be a partial
+# and corrupt reverting of the update.
+$node->stop;
+$node->revert_to_torn_relfile_snapshot('snap1', 8192);
+
+# Restart the node and count the number of rows in t1 with the original
+# (pre-update) values.  It should not be zero, but nor will it be the full
+# 10000.
+$node->start;
+my ($old, $new, $oldtoast, $newtoast) = counts();
+ok($old > 0 && $old < 10000, "Torn snapshot reverts some of the main updates");
+ok($new > 0 && $new <= 10000, "Torn snapshot retains some of the main updates");
+
+# Revert t1 fully to the first snapshot.  This should fully restore the
+# original (pre-update) values.
+$node->stop;
+$node->revert_to_snapshot('snap1');
+
+# Restart the node and verify only old values remain
+$node->start;
+($old, $new, $oldtoast, $newtoast) = counts();
+is($old, 10000, "Full snapshot restores all the old main values");
+is($oldtoast, 10000, "Full snapshot restores all the old toast values");
+is($new, 0, "Full snapshot reverts all the new main values");
+is($newtoast, 0, "Full snapshot reverts all the new toast values");
+
+# Restore t1 fully to the second snapshot.  This should fully restore the
+# new (post-update) values.
+$node->stop;
+$node->revert_to_snapshot('snap2');
+
+# Restart the node and verify only new values remain
+$node->start;
+($old, $new, $oldtoast, $newtoast) = counts();
+is($old, 0, "Full snapshot reverts all the old main values");
+is($oldtoast, 0, "Full snapshot reverts all the old toast values");
+is($new, 10000, "Full snapshot restores all the new main values");
+is($newtoast, 10000, "Full snapshot restores all the new toast values");
+
+sub counts {
+	return map {
+		$node->safe_psql('postgres', qq(select count(*) from t1 where $_))
+	} ("short_text = 'foo'",
+	   "short_text = 'bar'",
+	   "long_text ~ 'x'",
+	   "long_text ~ 'y'");
+}
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 9667f7667e..d470af93c5 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -2225,6 +2225,267 @@ sub pg_recvlogical_upto
 
 =back
 
+=head1 DATABASE CORRUPTION METHODS
+
+=over
+
+=item $node->relfile_snapshot_repository()
+
+The path to the parent directory of all directories storing snapshots of
+relation backing files.
+
+=cut
+
+sub relfile_snapshot_repository
+{
+	my ($self) = @_;
+	my $snaprepo = join('/', $self->basedir, 'snapshot');
+	unless (-d $snaprepo)
+	{
+		mkdir $snaprepo
+			or $!{EEXIST}
+			or BAIL_OUT("could not create snapshot repository directory \"$snaprepo\": $!");
+	}
+	return $snaprepo;
+}
+
+=pod
+
+=item $node->relfile_snapshot_directory(snapname)
+
+The path to the directory for storing the named snapshot.
+
+=cut
+
+sub relfile_snapshot_directory
+{
+	my ($self, $snapname) = @_;
+
+	join("/", $self->relfile_snapshot_repository(), $snapname);
+}
+
+=pod
+
+=item $node->take_relfile_snapshot($self, $dbname, $snapname, @relnames)
+
+Makes a copy of the files backing the relations B<@relnames>, the associated
+toast relations (if any), and all associated indexes (if any).  No attempt is
+made to flush these files to disk, meaning the snapshot taken could be stale
+unless the caller ensures these files have been flushed prior to calling.
+
+Dies on failure to invoke psql.
+
+Dies on missing relations.
+
+Dies if the given B<$snapname> is already in use.
+
+=cut
+
+=pod
+
+=item $node->take_relfile_snapshot_minimal($self, $dbname, $snapname, @relnames)
+
+Makes a copy of the files backing the relations B<@relnames>.  No attempt is made
+to flush these files to disk, meaning the snapshot taken could be stale unless the
+caller ensures these files have been flushed prior to calling.
+
+Dies on failure to invoke psql.
+
+Dies on missing relation.
+
+Dies if the given B<$snapname> is already in use.
+
+=cut
+
+sub take_relfile_snapshot
+{
+	my ($self, $dbname, $snapname, @relnames) = @_;
+	$self->take_relfile_snapshot_helper($dbname, $snapname, 1, @relnames);
+}
+
+sub take_relfile_snapshot_minimal
+{
+	my ($self, $dbname, $snapname, @relnames) = @_;
+	$self->take_relfile_snapshot_helper($dbname, $snapname, 0, @relnames);
+}
+
+sub take_relfile_snapshot_helper
+{
+	my ($self, $dbname, $snapname, $extended, @relnames) = @_;
+
+	croak "dbname must be specified" unless defined $dbname;
+	croak "relnames must be defined" unless scalar(grep { defined $_ } @relnames);
+	croak "snapname must be specified" unless defined $snapname;
+	croak "snapname must be unique" if exists $self->{snapshot}->{$snapname};
+
+	my $pgdata = $self->data_dir;
+	my $snapdir = $self->relfile_snapshot_directory($snapname);
+	croak "snapname directory name already in use: $snapdir" if (-e $snapdir);
+	mkdir $snapdir
+		or BAIL_OUT("could not create snapshot directory \"$snapdir\": $!");
+
+	my @relpaths = map {
+		$self->safe_psql($dbname,
+			qq(SELECT pg_relation_filepath('$_')));
+	} @relnames;
+
+	my (@toastpaths, @idxpaths);
+	if ($extended)
+	{
+		for my $relname (@relnames)
+		{
+			push (@toastpaths, grep /\w/, split(/(?:\s*\r?\n\s*)+/, $self->safe_psql($dbname,
+				qq(SELECT pg_relation_filepath(c.reltoastrelid)
+					FROM pg_catalog.pg_class c
+					WHERE c.oid = '$relname'::regclass
+					AND c.reltoastrelid != 0::oid))));
+			push (@idxpaths, grep /\w/, split(/(?:\s*\r?\n\s*)+/, $self->safe_psql($dbname,
+				qq(SELECT pg_relation_filepath(i.indexrelid)
+					FROM pg_catalog.pg_index i
+					WHERE i.indrelid = '$relname'::regclass))));
+		}
+	}
+
+	$self->{snapshot}->{$snapname} = {};
+	for my $path (@relpaths, grep { defined($_) } @toastpaths, @idxpaths)
+	{
+		croak "file backing relation is missing: $pgdata/$path" unless -f "$pgdata/$path";
+		copy_file($snapdir, $pgdata, 0, $path);
+		$self->{snapshot}->{$snapname}->{$path} = 1;
+	}
+}
+
+=pod
+
+=item $node->revert_to_snapshot($self, $snapname)
+
+Overwrites the database's relation files with files previously saved in
+B<$snapname>.
+
+Dies if the given B<$snapname> does not exist.
+
+=cut
+
+=pod
+
+=item $node->revert_to_torn_relfile_snapshot($self, $snapname, $bytes)
+
+Partially overwrites the database's relation files using prefixes of the given
+number of bytes from the files saved in B<$snapname>.  If B<$bytes> is
+negative, uses suffixes of the given byte length rather than prefixes.
+
+If B<$bytes> is undef, fully replaces the database's relation files with the
+files saved in B<$snapname>; unlike the partial overwrite, this means a file
+may become shorter if the saved file is shorter than the current file.
+
+=cut
+
+sub revert_to_snapshot
+{
+	my ($self, $snapname) = @_;
+	$self->revert_to_torn_relfile_snapshot($snapname, undef);
+}
+
+sub revert_to_torn_relfile_snapshot
+{
+	my ($self, $snapname, $bytes) = @_;
+
+	croak "no such snapshot" unless exists $self->{snapshot}->{$snapname};
+
+	my $pgdata = $self->data_dir;
+	my $snaprepo = join('/', $self->relfile_snapshot_repository, $snapname);
+	croak "snapname directory missing: $snaprepo" unless (-d $snaprepo);
+
+	if (defined $bytes)
+	{
+		tear_file($pgdata, $snaprepo, $bytes, $_)
+			for (keys %{$self->{snapshot}->{$snapname}});
+	}
+	else
+	{
+		copy_file($pgdata, $snaprepo, 1, $_)
+			for (keys %{$self->{snapshot}->{$snapname}});
+	}
+}
+
+sub copy_file
+{
+	my ($dstdir, $srcdir, $overwrite, $path) = @_;
+
+	croak "No such directory: $dstdir" unless -d $dstdir;
+	croak "No such directory: $srcdir" unless -d $srcdir;
+
+	foreach my $part (split(m{/}, $path))
+	{
+		my $srcpart = "$srcdir/$part";
+		my $dstpart = "$dstdir/$part";
+
+		if (-d $srcpart)
+		{
+			$srcdir = $srcpart;
+			$dstdir = $dstpart;
+			die "$dstdir is in the way" if (-e $dstdir && ! -d $dstdir);
+			unless (-d $dstdir)
+			{
+				mkdir $dstdir
+					or BAIL_OUT("could not create directory \"$dstdir\": $!");
+			}
+		}
+		elsif (-f $srcpart)
+		{
+			die "$dstdir/$part is in the way" if (!$overwrite && -e "$dstdir/$part");
+
+			File::Copy::copy($srcpart, "$dstdir/$part");
+		}
+	}
+}
+
+sub tear_file
+{
+	my ($dstdir, $srcdir, $bytes, $path) = @_;
+
+	croak "No such directory: $dstdir" unless -d $dstdir;
+	croak "No such directory: $srcdir" unless -d $srcdir;
+
+	my $srcfile = "$srcdir/$path";
+	my $dstfile = "$dstdir/$path";
+
+	croak "No such file: $srcfile" unless -f $srcfile;
+	croak "No such file: $dstfile" unless -f $dstfile;
+
+	my ($srcfh, $dstfh);
+	open($srcfh, '<', $srcfile) or die "Cannot read $srcfile: $!";
+	open($dstfh, '+<', $dstfile) or die "Cannot modify $dstfile: $!";
+	binmode($srcfh);
+	binmode($dstfh);
+
+	my $buffer;
+	if ($bytes < 0)
+	{
+		$bytes *= -1;		# Easier to use positive value
+		my $srcsize = (stat($srcfh))[7];
+		my $offset = $srcsize - $bytes;
+		seek($srcfh, $offset, 0);
+		seek($dstfh, $offset, 0);
+		sysread($srcfh, $buffer, $bytes);
+		syswrite($dstfh, $buffer, $bytes);
+	}
+	else
+	{
+		seek($srcfh, 0, 0);
+		seek($dstfh, 0, 0);
+		sysread($srcfh, $buffer, $bytes);
+		syswrite($dstfh, $buffer, $bytes);
+	}
+
+	close($srcfh);
+	close($dstfh);
+}
+
+=pod
+
+=back
+
 =cut
 
 1;
-- 
2.21.1 (Apple Git-122.3)

#4Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#3)
Re: pg_amcheck contrib application

Most of these changes sound good. I'll go through the whole patch
again today, or as much of it as I can. But before I do that, I want
to comment on this point specifically.

On Thu, Mar 4, 2021 at 1:25 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

I think this is fixed up now. There is an interaction with amcheck's verify_heapam(), where that function raises an error if the startblock or endblock arguments are out of bounds for the relation in question. Rather than aborting the entire pg_amcheck run, it avoids passing inappropriate block ranges to verify_heapam() and outputs a warning, so:

% pg_amcheck mark.dilger -t foo -t pg_class --progress -v --startblock=35 --endblock=77
pg_amcheck: in database "mark.dilger": using amcheck version "1.3" in schema "public"
0/6 relations (0%) 0/55 pages (0%)
pg_amcheck: checking table "mark.dilger"."public"."foo" (oid 16385) (10/45 pages)
pg_amcheck: warning: ignoring endblock option 77 beyond end of table "mark.dilger"."public"."foo"
pg_amcheck: checking btree index "mark.dilger"."public"."foo_idx" (oid 16388) (30/30 pages)
pg_amcheck: checking table "mark.dilger"."pg_catalog"."pg_class" (oid 1259) (0/13 pages)
pg_amcheck: warning: ignoring startblock option 35 beyond end of table "mark.dilger"."pg_catalog"."pg_class"
pg_amcheck: warning: ignoring endblock option 77 beyond end of table "mark.dilger"."pg_catalog"."pg_class"
pg_amcheck: checking btree index "mark.dilger"."pg_catalog"."pg_class_relname_nsp_index" (oid 2663) (6/6 pages)
pg_amcheck: checking btree index "mark.dilger"."pg_catalog"."pg_class_tblspc_relfilenode_index" (oid 3455) (5/5 pages)
pg_amcheck: checking btree index "mark.dilger"."pg_catalog"."pg_class_oid_index" (oid 2662) (4/4 pages)
6/6 relations (100%) 55/55 pages (100%)

The way the (x/y pages) is printed takes into account that the [startblock..endblock] range may reduce the number of pages to check (x) to something less than the number of pages in the relation (y), but the reporting is a bit of a lie when the startblock is beyond the end of the table, as it doesn't get passed to verify_heapam and so the number of blocks checked may be more than the zero blocks reported. I think I might need to fix this up tomorrow, but I want to get what I have in this patch set posted tonight, so it's not fixed here. Also, there are multiple ways of addressing this, and I'm having trouble deciding which way is best. I can exclude the relation from being checked at all, or realize earlier that I'm not going to honor the startblock argument and compute the blocks to check correctly. Thoughts?

I think this whole approach is pretty suspect because the number of
blocks in the relation can increase (by relation extension) or
decrease (by VACUUM or TRUNCATE) between the time when we query for
the list of target relations and the time we get around to executing
any queries against them. I think it's OK to use the number of
relation pages for progress reporting because progress reporting is
only approximate anyway, but I wouldn't print them out in the progress
messages, and I wouldn't try to fix up the startblock and endblock
arguments on the basis of how long you think that relation is going to
be. You seem to view the fact that the server reported the error as
the reason for the problem, but I don't agree. I think having the
server report the error here is right, and the problem is that the
error reporting sucked because it was long-winded and didn't
necessarily tell you which table had the problem.

There are a LOT of things that can go wrong when we go try to run
verify_heapam on a table. The table might have been dropped; in fact,
on a busy production system, such cases are likely to occur routinely
if DDL is common, which for many users it is. The system catalog
entries might be screwed up, so that the relation can't be opened.
There might be an unreadable page in the relation, either because the
OS reports an I/O error or something like that, or because checksum
verification fails. There are various other possibilities. We
shouldn't view such errors as low-level things that occur only in
fringe cases; this is a corruption-checking tool, and we should expect
that running it against messed-up databases will be common. We
shouldn't try to interpret the errors we get or make any big decisions
about them, but we should have a clear way of reporting them so that
the user can decide what to do.

Just as an experiment, I suggest creating a database with 100 tables
in it, each with 1 index, and then deleting a single pg_attribute
entry for 10 of the tables, and then running pg_amcheck. I think you
will get 20 errors - one for each messed-up table and one for the
corresponding index. Maybe you'll get errors for the TOAST tables
checks too, if the tables have TOAST tables, although that seems like
it should be avoidable. Now, no matter what you do, the tool is going
to produce a lot of output here, because you have a lot of problems,
and that's OK. But how understandable is that output, and how concise
is it? If it says something like:

pg_amcheck: could not check "SCHEMA_NAME"."TABLE_NAME": ERROR: some
attributes are missing or something

...and that line is repeated 20 times, maybe with a context or detail
line for each one or something like that, then you have got a good UI.
If it's not clear which tables have the problem, you have got a bad
UI. If it dumps out 300 lines of output instead of 20 or 40, you have
a UI that is so verbose that usability is going to be somewhat
impaired, which is why I suggested only showing the query in verbose
mode.

BTW, another thing that might be interesting is to call
PQsetErrorVerbosity(conn, PQERRORS_VERBOSE) in verbose mode. It's
probably possible to contrive a case where the server error message is
something generic like "cache lookup failed for relation %u" which
occurs in a whole bunch of places in the source code, and being able
get the file and line number information can be really useful when
trying to track such things down.

--
Robert Haas
EDB: http://www.enterprisedb.com

#5Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#4)
Re: pg_amcheck contrib application

On Thu, Mar 4, 2021 at 10:29 AM Robert Haas <robertmhaas@gmail.com> wrote:

Most of these changes sound good. I'll go through the whole patch
again today, or as much of it as I can. But before I do that, I want
to comment on this point specifically.

Just a thought - I don't feel strongly about this - but you may want
to consider storing your list of patterns in an array that gets
resized as necessary rather than a list. Then the pattern ID would
just be pattern_ptr - pattern_array, and finding the pattern by ID
would just be pattern_ptr = &pattern_array[pattern_id]. I don't think
there's a real efficiency issue here because the list of patterns is
almost always going to be short, and even if somebody decides to
provide a very long list of patterns (e.g. by using xargs) it's
probably still not that big a deal. A sufficiently obstinate user
running an operating system where argument lists can be extremely long
could probably make this the dominant cost by providing a gigantic
number of patterns that don't match anything, but such a person is
trying to prove a point, rather than accomplish anything useful, so I
don't care. But, the code might be more elegant the other way.

This patch increases the number of cases where we use ^ to assert that
exactly one of two things is true from 4 to 5. I think it might be
better to just write out (a && !b) || (b && !a), but there is some
precedent for the way you did it so perhaps it's fine.

The name prepare_table_command() is oddly non-parallel with
verify_heapam_slot_handler(). Seems better to call it either a table
throughout, or a heapam throughout. Actually I think I would prefer
"heap" to either of those, but I definitely think we shouldn't switch
terminology. Note that prepare_btree_command() doesn't have this
issue, since it matches verify_btree_slot_handler(). On a related
note, "c.relam = 2" is really a test for is_heap, not is_table. We
might have other table AMs in the future, but only one of those AMs
will be called heap, and only one will have OID 2.

You've got some weird round-tripping stuff where you send literal
values to the server so that you can turn around and get them back
from the server. For example, you've got prepare_table_command()
select rel->nspname and rel->relname back from the server as literals,
which seems silly because we have to already have that information or
we couldn't ask the server to give it to us ... and if we already have
it, then why do we need to get it again? The reason it's like this
seems to be that after calling prepare_table_command(), we use
ParallelSlotSetHandler() to set verify_heapam_slot_handler() as the
callback, and we pass sql.data as its context argument, so we don't have access
to the RelationInfo object when we're handling the slot result. But
that's easy to fix: just store the sql as a field inside the
RelationInfo, and then pass a pointer to the whole RelationInfo to the
slot handler. Then you don't need to round-trip the table and schema
names; and you have the values available even if an error happens.

On a somewhat related note, I think it might make sense to have the
slot handlers try to free memory. It seems hard to make pg_amcheck
leak enough memory to matter, but I guess it's not entirely
implausible that someone could be checking let's say 10 million
relations. Freeing the query strings could probably prevent a half a
GB or so of accumulated memory usage under those circumstances. I
suppose freeing nspname and relname would save a bit more, but it's
hardly worth doing since they are a lot shorter and you've got to have
all that information in memory at once at some point anyway; similarly
with the RelationInfo structures, which have the further complexity of
being part of a linked list you might not want to corrupt. But you
don't need to have every query string in memory at the same time, just
as many as are running at any one time.

Also, maybe compile_relation_list_one_db() should keep the result set
around so that you don't need to pstrdup() the nspname and relname in
the first place. Right now, just before compile_relation_list_one_db()
calls PQclear() you have two copies of every nspname and relname
allocated. If you just kept the result sets around forever, the peak
memory usage would be lower than it is currently. If you really wanted
to get fancy you could arrange to free each result set when you've
finished that database, but that seems annoying to code and I'm pretty
sure it doesn't matter.

The CTEs called "include_raw" and "exclude_raw" are used as part
of the query to construct a list of tables. The regexes are fished
through there, and the pattern IDs, which makes sense, but the raw
patterns are also fished through, and I don't see a reason for that.
We don't seem to need that for anything. The same seems to apply to
the query used to resolve database patterns.

I see that most of the queries have now been adjusted to be spread
across fewer lines, which is good, but please make sure to do that
everywhere. In particular, I notice that the bt_index_check calls are
still too spread out.

More in a bit, need to grab some lunch.

--
Robert Haas
EDB: http://www.enterprisedb.com

#6Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#5)
Re: pg_amcheck contrib application

On Thu, Mar 4, 2021 at 12:27 PM Robert Haas <robertmhaas@gmail.com> wrote:

More in a bit, need to grab some lunch.

Moving on to the tests, in 003_check.pl, I think it would be slightly
better if relation_toast were to select ct.oid::regclass and then just
have the caller use that value directly. We'd certainly want to do
that if the name could contain any characters that might require
quoting. Here that's not possible, but I think we might as well use
the same technique anyway.

I'm not sure how far to go with it, but I think that you might want to
try to enhance the logging in some of the cases where the TAP tests
might fail. In particular, if either of these trip in the buildfarm,
it doesn't seem like it will be too easy to figure out why they
failed:

+    fail('Xid thresholds not as expected');
+        fail('Page layout differs from our expectations');

You might want to rephrase the message to incorporate the values that
triggered the failure, e.g. "datfrozenxid $datfrozenxid is not between
3 and $relfrozenxid", "expected (a,b) = (12345678,abcdefg) but got
($x,$y)", so that if the buildfarm happens to fail there's a shred of
hope that we might be able to guess the reason from the message. You
could also give some thought to whether there are any tests that can
be improved in similar ways. Test::More is nice in that when you run a
test with eq() or like() and it fails it will tell you about the input
values in the diagnostic, but if you do something like is($x < 4, ...)
instead of cmp_ok($x, '<', 4, ...) then you lose that. I'm not saying
you're doing that exact thing, just saying that looking through the
test code with an eye to finding things where you could output a
little more info about a potential failure might be a worthwhile
activity.

If it were me, I would get rid of ROWCOUNT and have a list of
closures, and then loop over the list and call each one e.g. my
@corruption = ( sub { ... }, sub { ... }, sub { ... }) or maybe
something like what I did with @scenario in
src/bin/pg_verifybackup/t/003_corruption.pl, but this is ultimately a
style preference and I think the way you actually did it is also
reasonable, and some people might find it more readable than the other
way.

The name int4_fickle_ops is positively delightful and I love having a
test case like this.

On the whole, I think these tests look quite solid. I am a little
concerned, as you may gather from the comment above, that they will
not survive contact with the buildfarm, because they will turn out to
be platform or OS-dependent in some way. However, I can see that
you've taken steps to avoid such dependencies, and maybe we'll be
lucky and those will work. Also, while I am suspicious something's
going to break, I don't know what it's going to be, so I can't suggest
any method to avoid it. I think we'll just have to keep an eye on the
buildfarm post-commit and see what crops up.

Turning to the documentation, I see that it is documented that a bare
command-line argument can be a connection string rather than a
database name. That sounds like a good plan, but when I try
'pg_amcheck sslmode=require' it does not work: FATAL: database
"sslmode=require" does not exist. The argument to -e is also
documented to be a connection string, but that also seems not to work.
Some thought might need to be given to what exactly these connection
options are supposed to mean. Like, do the connection options I set via
-e apply to all the connections I make, or just the one to the
maintenance database? How do I set connection options for connections
to databases whose names aren't specified explicitly but are
discovered by querying pg_database? Maybe instead of allowing these to
be a connection string, we should have a separate option that can be
used just for the purpose of setting connection options that then
apply to all connections. That seems a little bit oddly unlike other
tools, but if I want sslmode=verify-ca or something on all my
connections, there should be an easy way to get it.

The documentation makes many references to patterns, but does not
explain what a pattern is. I see that psql's documentation contains an
explanation, and pg_dump's documentation links to psql's
documentation. pg_amcheck should probably link to psql's
documentation, too.

In the documentation for -d, you say that "If -a/--all is also
specified, -d/--database does not additionally affect which databases
are checked." I suggest replacing "does not additionally affect which
databases are checked" with "has no effect."

In two places you say "without regard for" but I think it should be
"without regard to".

In the documentation for --no-strict-names you use "nor" where I think
it should say "or".

I kind of wonder whether we need --quiet. It seems like right now it
only does two things. One is to control complaints about ignoring the
startblock and endblock options, but I don't agree with that behavior
anyway. The other is to control whether we complain about unmatched
patterns, but I think that could just be controlled by --no-strict-names,
i.e. normally an unmatched pattern results in a complaint and a
failure, but with --no-strict-names there is neither a complaint nor a
failure. Having a flag to control whether we get the message
separately from whether we get the failure doesn't seem helpful.

I don't think it's good to say "This is an alias for" in the
documentation of -i -I -t -T. I suggest instead saying "This is
similar to".

Instead of "Option BLAH takes precedence over..." I suggest "The BLAH
option takes precedence over..."

OK, that's it from me for this review pass.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com

#7Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#4)
Re: pg_amcheck contrib application

On Thu, Mar 4, 2021 at 7:29 AM Robert Haas <robertmhaas@gmail.com> wrote:

I think this whole approach is pretty suspect because the number of
blocks in the relation can increase (by relation extension) or
decrease (by VACUUM or TRUNCATE) between the time when we query for
the list of target relations and the time we get around to executing
any queries against them. I think it's OK to use the number of
relation pages for progress reporting because progress reporting is
only approximate anyway, but I wouldn't print them out in the progress
messages, and I wouldn't try to fix up the startblock and endblock
arguments on the basis of how long you think that relation is going to
be.

I don't think that the struct AmcheckOptions block fields (e.g.,
startblock) should be of type 'long' -- that doesn't work well on
Windows, where 'long' is only 32-bit. To be fair we already do the
same thing elsewhere, but there is no reason to repeat those mistakes.
(I'm rather suspicious of 'long' in general.)

I think that you could use BlockNumber + strtoul() without breaking Windows.

There are a LOT of things that can go wrong when we go try to run
verify_heapam on a table. The table might have been dropped; in fact,
on a busy production system, such cases are likely to occur routinely
if DDL is common, which for many users it is. The system catalog
entries might be screwed up, so that the relation can't be opened.
There might be an unreadable page in the relation, either because the
OS reports an I/O error or something like that, or because checksum
verification fails. There are various other possibilities. We
shouldn't view such errors as low-level things that occur only in
fringe cases; this is a corruption-checking tool, and we should expect
that running it against messed-up databases will be common. We
shouldn't try to interpret the errors we get or make any big decisions
about them, but we should have a clear way of reporting them so that
the user can decide what to do.

I agree.

Your database is not supposed to be corrupt. Once your database has
become corrupt, all bets are off -- something happened that was
supposed to be impossible -- which seems like a good reason to be
modest about what we think we know.

The user should always see the unvarnished truth. pg_amcheck should
not presume to suppress errors from lower level code, except perhaps
in well-scoped special cases.

--
Peter Geoghegan

#8Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Peter Geoghegan (#7)
Re: pg_amcheck contrib application

On Mar 4, 2021, at 2:04 PM, Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Mar 4, 2021 at 7:29 AM Robert Haas <robertmhaas@gmail.com> wrote:

I think this whole approach is pretty suspect because the number of
blocks in the relation can increase (by relation extension) or
decrease (by VACUUM or TRUNCATE) between the time when we query for
the list of target relations and the time we get around to executing
any queries against them. I think it's OK to use the number of
relation pages for progress reporting because progress reporting is
only approximate anyway, but I wouldn't print them out in the progress
messages, and I wouldn't try to fix up the startblock and endblock
arguments on the basis of how long you think that relation is going to
be.

I don't think that the struct AmcheckOptions block fields (e.g.,
startblock) should be of type 'long' -- that doesn't work well on
Windows, where 'long' is only 32-bit. To be fair we already do the
same thing elsewhere, but there is no reason to repeat those mistakes.
(I'm rather suspicious of 'long' in general.)

I think that you could use BlockNumber + strtoul() without breaking Windows.

Fair enough.

There are a LOT of things that can go wrong when we go try to run
verify_heapam on a table. The table might have been dropped; in fact,
on a busy production system, such cases are likely to occur routinely
if DDL is common, which for many users it is. The system catalog
entries might be screwed up, so that the relation can't be opened.
There might be an unreadable page in the relation, either because the
OS reports an I/O error or something like that, or because checksum
verification fails. There are various other possibilities. We
shouldn't view such errors as low-level things that occur only in
fringe cases; this is a corruption-checking tool, and we should expect
that running it against messed-up databases will be common. We
shouldn't try to interpret the errors we get or make any big decisions
about them, but we should have a clear way of reporting them so that
the user can decide what to do.

I agree.

Your database is not supposed to be corrupt. Once your database has
become corrupt, all bets are off -- something happened that was
supposed to be impossible -- which seems like a good reason to be
modest about what we think we know.

The user should always see the unvarnished truth. pg_amcheck should
not presume to suppress errors from lower level code, except perhaps
in well-scoped special cases.

I think Robert mistook why I was doing that. I was thinking about a different usage pattern. If somebody thinks a subset of relations has been badly corrupted, but doesn't know which relations those might be, they might try to find them with pg_amcheck, wanting to check just the first few blocks per relation in order to sample the relations. So,

pg_amcheck --startblock=0 --endblock=9 --no-dependent-indexes

or something like that. I don't think it's very fun to have it error out for each relation that doesn't have at least ten blocks, nor is it fun to have those relations skipped by error'ing out before checking any blocks, as they might be the corrupt relations you are looking for. But using --startblock and --endblock for this is not a natural fit, as evidenced by how I was trying to "fix things up" for the user, so I'll punt on this usage until some future version, when I might add a sampling option.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#8)
Re: pg_amcheck contrib application

On Thu, Mar 4, 2021 at 5:39 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

I think Robert mistook why I was doing that. I was thinking about a different usage pattern. If somebody thinks a subset of relations has been badly corrupted, but doesn't know which relations those might be, they might try to find them with pg_amcheck, wanting to check just the first few blocks per relation in order to sample the relations. So,

pg_amcheck --startblock=0 --endblock=9 --no-dependent-indexes

or something like that. I don't think it's very fun to have it error out for each relation that doesn't have at least ten blocks, nor is it fun to have those relations skipped by error'ing out before checking any blocks, as they might be the corrupt relations you are looking for. But using --startblock and --endblock for this is not a natural fit, as evidenced by how I was trying to "fix things up" for the user, so I'll punt on this usage until some future version, when I might add a sampling option.

I admit I hadn't thought of that use case. I guess somebody could want
to do that, but it doesn't seem all that useful. Checking the first
up-to-ten blocks of every relation is not a very representative
sample, and it's not clear to me that sampling is a good idea even if
it were representative. What good is it to know that 10% of my
database is probably not corrupted?

On the other hand, people want to do all kinds of things that seem
strange to me, and this might be another one. But, if that's so, then
I think the right place to implement it is in amcheck itself, not
pg_amcheck. I think pg_amcheck should be, now and in the future, a
thin wrapper around the functionality provided by amcheck, just
providing target selection and parallel execution. If you put
something into pg_amcheck that figures out how long the relation is
and runs it on some of the blocks, that functionality is only
accessible to people who are accessing amcheck via pg_amcheck. If you
put it in amcheck itself and just expose it through pg_amcheck, then
it's accessible either way. It's probably cleaner and more performant
to do it that way, too.

So if you did add a sampling option in the future, that's the way I
would recommend doing it, but I think it is probably best not to go
there right now.

--
Robert Haas
EDB: http://www.enterprisedb.com

#10Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#9)
Re: pg_amcheck contrib application

On Mar 8, 2021, at 8:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 4, 2021 at 5:39 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

I think Robert mistook why I was doing that. I was thinking about a different usage pattern. If somebody thinks a subset of relations has been badly corrupted, but doesn't know which relations those might be, they might try to find them with pg_amcheck, wanting to check just the first few blocks per relation in order to sample the relations. So,

pg_amcheck --startblock=0 --endblock=9 --no-dependent-indexes

or something like that. I don't think it's very fun to have it error out for each relation that doesn't have at least ten blocks, nor is it fun to have those relations skipped by error'ing out before checking any blocks, as they might be the corrupt relations you are looking for. But using --startblock and --endblock for this is not a natural fit, as evidenced by how I was trying to "fix things up" for the user, so I'll punt on this usage until some future version, when I might add a sampling option.

I admit I hadn't thought of that use case. I guess somebody could want
to do that, but it doesn't seem all that useful. Checking the first
up-to-ten blocks of every relation is not a very representative
sample, and it's not clear to me that sampling is a good idea even if
it were representative. What good is it to know that 10% of my
database is probably not corrupted?

`cd $PGDATA; tar xfz my_csv_data.tgz` ctrl-C ctrl-C ctrl-C
`rm -rf $PGDATA` ctrl-C ctrl-C ctrl-C
`/my/stupid/backup/and/restore/script.sh` ctrl-C ctrl-C ctrl-C

# oh wow, i wonder if any relations got overwritten with csv file data, or had their relation files unlinked, or ...?

`pg_amcheck --jobs=8 --startblock=0 --endblock=10`

# ah, darn, it's spewing lots of irrelevant errors because some relations are too short

`pg_amcheck --jobs=8 --startblock=0 --endblock=0`

# ah, darn, it's still spewing lots of irrelevant errors because I have lots of indexes with zero blocks of data

`pg_amcheck --jobs=8`

# ah, darn, it's taking forever, because it's processing huge tables in their entirety

I agree this can be left to later, and the --startblock and --endblock options are the wrong way to do it.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#4)
3 attachment(s)
Re: pg_amcheck contrib application

Robert, Peter, in response to your review comments spanning multiple emails:

On Mar 4, 2021, at 7:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Most of these changes sound good. I'll go through the whole patch
again today, or as much of it as I can. But before I do that, I want
to comment on this point specifically.

On Thu, Mar 4, 2021 at 1:25 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

I think this is fixed up now. There is an interaction with amcheck's verify_heapam(), where that function raises an error if the startblock or endblock arguments are out of bounds for the relation in question. Rather than aborting the entire pg_amcheck run, it avoids passing inappropriate block ranges to verify_heapam() and outputs a warning, so:

% pg_amcheck mark.dilger -t foo -t pg_class --progress -v --startblock=35 --endblock=77
pg_amcheck: in database "mark.dilger": using amcheck version "1.3" in schema "public"
0/6 relations (0%) 0/55 pages (0%)
pg_amcheck: checking table "mark.dilger"."public"."foo" (oid 16385) (10/45 pages)
pg_amcheck: warning: ignoring endblock option 77 beyond end of table "mark.dilger"."public"."foo"
pg_amcheck: checking btree index "mark.dilger"."public"."foo_idx" (oid 16388) (30/30 pages)
pg_amcheck: checking table "mark.dilger"."pg_catalog"."pg_class" (oid 1259) (0/13 pages)
pg_amcheck: warning: ignoring startblock option 35 beyond end of table "mark.dilger"."pg_catalog"."pg_class"
pg_amcheck: warning: ignoring endblock option 77 beyond end of table "mark.dilger"."pg_catalog"."pg_class"
pg_amcheck: checking btree index "mark.dilger"."pg_catalog"."pg_class_relname_nsp_index" (oid 2663) (6/6 pages)
pg_amcheck: checking btree index "mark.dilger"."pg_catalog"."pg_class_tblspc_relfilenode_index" (oid 3455) (5/5 pages)
pg_amcheck: checking btree index "mark.dilger"."pg_catalog"."pg_class_oid_index" (oid 2662) (4/4 pages)
6/6 relations (100%) 55/55 pages (100%)

The way the (x/y pages) is printed takes into account that the [startblock..endblock] range may reduce the number of pages to check (x) to something less than the number of pages in the relation (y), but the reporting is a bit of a lie when the startblock is beyond the end of the table, as it doesn't get passed to verify_heapam and so the number of blocks checked may be more than the zero blocks reported. I think I might need to fix this up tomorrow, but I want to get what I have in this patch set posted tonight, so it's not fixed here. Also, there are multiple ways of addressing this, and I'm having trouble deciding which way is best. I can exclude the relation from being checked at all, or realize earlier that I'm not going to honor the startblock argument and compute the blocks to check correctly. Thoughts?

I think this whole approach is pretty suspect because the number of
blocks in the relation can increase (by relation extension) or
decrease (by VACUUM or TRUNCATE) between the time when we query for
the list of target relations and the time we get around to executing
any queries against them. I think it's OK to use the number of
relation pages for progress reporting because progress reporting is
only approximate anyway,

Fair point.

but I wouldn't print them out in the progress
messages,

Removed.

and I wouldn't try to fix up the startblock and endblock
arguments on the basis of how long you think that relation is going to
be.

Yeah, in light of a new day, that seems like a bad idea to me, too. Removed.

You seem to view the fact that the server reported the error as
the reason for the problem, but I don't agree. I think having the
server report the error here is right, and the problem is that the
error reporting sucked because it was long-winded and didn't
necessarily tell you which table had the problem.

No, I was thinking about a different usage pattern, but I've answered that already elsewhere on this thread.

There are a LOT of things that can go wrong when we go try to run
verify_heapam on a table. The table might have been dropped; in fact,
on a busy production system, such cases are likely to occur routinely
if DDL is common, which for many users it is. The system catalog
entries might be screwed up, so that the relation can't be opened.
There might be an unreadable page in the relation, either because the
OS reports an I/O error or something like that, or because checksum
verification fails. There are various other possibilities. We
shouldn't view such errors as low-level things that occur only in
fringe cases; this is a corruption-checking tool, and we should expect
that running it against messed-up databases will be common. We
shouldn't try to interpret the errors we get or make any big decisions
about them, but we should have a clear way of reporting them so that
the user can decide what to do.

Once again, I think you are right and have removed the objectionable behavior, but....

The --startblock and --endblock options make the most sense when the user is only checking one table, like

pg_amcheck --startblock=17 --endblock=19 --table=my_schema.my_corrupt_table

because the user likely has some knowledge about that table, perhaps from a prior run of pg_amcheck. The --startblock and --endblock arguments are a bit strange when used globally, as relations don't all have the same number of blocks, so

pg_amcheck --startblock=17 --endblock=19 mydb

will very likely emit lots of error messages for tables which don't have blocks in that range. That's not entirely pg_amcheck's fault, as it just did what the user asked, but it also doesn't seem super helpful. I'm not going to do anything about it in this release.

Just as an experiment, I suggest creating a database with 100 tables
in it, each with 1 index, and then deleting a single pg_attribute
entry for 10 of the tables, and then running pg_amcheck. I think you
will get 20 errors - one for each messed-up table and one for the
corresponding index. Maybe you'll get errors for the TOAST tables
checks too, if the tables have TOAST tables, although that seems like
it should be avoidable. Now, no matter what you do, the tool is going
to produce a lot of output here, because you have a lot of problems,
and that's OK. But how understandable is that output, and how concise
is it? If it says something like:

pg_amcheck: could not check "SCHEMA_NAME"."TABLE_NAME": ERROR: some
attributes are missing or something

...and that line is repeated 20 times, maybe with a context or detail
line for each one or something like that, then you have got a good UI.
If it's not clear which tables have the problem, you have got a bad
UI. If it dumps out 300 lines of output instead of 20 or 40, you have
a UI that is so verbose that usability is going to be somewhat
impaired, which is why I suggested only showing the query in verbose
mode.

After running 'make installcheck', if I delete all entries from pg_class where relnamespace = 'pg_toast'::regnamespace and then run 'pg_amcheck regression', I get lines that look like this:

heap relation "regression"."public"."quad_poly_tbl":
ERROR: could not open relation with OID 17177
heap relation "regression"."public"."gin_test_tbl":
ERROR: could not open relation with OID 24793
heap relation "regression"."pg_catalog"."pg_depend":
ERROR: could not open relation with OID 8888
heap relation "regression"."public"."spgist_text_tbl":
ERROR: could not open relation with OID 25624

which seems ok.

If instead I delete pg_attribute entries, as you suggest above, I get rows like this:

heap relation "regression"."regress_rls_schema"."rls_tbl":
ERROR: catalog is missing 1 attribute(s) for relid 26467
heap relation "regression"."regress_rls_schema"."rls_tbl_force":
ERROR: catalog is missing 1 attribute(s) for relid 26474

which also seems ok.

If instead, I manually corrupt relation files belonging to the regression database, I get lines that look like this for corrupt heap relations:

relation "regression"."public"."functional_dependencies", block 28, offset 54, attribute 0
attribute 0 with length 4294967295 ends at offset 50 beyond total tuple length 43
relation "regression"."public"."functional_dependencies", block 28, offset 55
multitransaction ID is invalid
relation "regression"."public"."functional_dependencies", block 28, offset 57
multitransaction ID is invalid

and for corrupt btree relations:

btree relation "regression"."public"."tenk1_unique1":
ERROR: high key invariant violated for index "tenk1_unique1"
DETAIL: Index tid=(1,38) points to heap tid=(70,26) page lsn=0/33A96D0.
btree relation "regression"."public"."tenk1_unique2":
ERROR: index tuple size does not equal lp_len in index "tenk1_unique2"
DETAIL: Index tid=(1,35) tuple size=4913 lp_len=16 page lsn=0/33DFD98.
HINT: This could be a torn page problem.
btree relation "regression"."public"."tenk1_thous_tenthous":
ERROR: index tuple size does not equal lp_len in index "tenk1_thous_tenthous"
DETAIL: Index tid=(1,36) tuple size=4402 lp_len=16 page lsn=0/34C0770.
HINT: This could be a torn page problem.

which likewise seems ok.

BTW, another thing that might be interesting is to call
PQsetErrorVerbosity(conn, PQERRORS_VERBOSE) in verbose mode. It's
probably possible to contrive a case where the server error message is
something generic like "cache lookup failed for relation %u" which
occurs in a whole bunch of places in the source code, and being able
get the file and line number information can be really useful when
trying to track such things down.

Good idea. I decided to also honor the --quiet flag

if (opts.verbose)
PQsetErrorVerbosity(free_slot->connection, PQERRORS_VERBOSE);
else if (opts.quiet)
PQsetErrorVerbosity(free_slot->connection, PQERRORS_TERSE);

On Mar 4, 2021, at 2:04 PM, Peter Geoghegan <pg@bowt.ie> wrote:

I don't think that the struct AmcheckOptions block fields (e.g.,
startblock) should be of type 'long' -- that doesn't work well on
Windows, where 'long' is only 32-bit. To be fair we already do the
same thing elsewhere, but there is no reason to repeat those mistakes.
(I'm rather suspicious of 'long' in general.)

I think that you could use BlockNumber + strtoul() without breaking Windows.

Thanks for reviewing!

Good points. I decided to use int64 instead of BlockNumber. The option processing needs to give a sensible error message if the user gives a negative number for the argument, so unsigned types are a bad fit.

On Thu, Mar 4, 2021 at 10:29 AM Robert Haas <robertmhaas@gmail.com> wrote:

Most of these changes sound good. I'll go through the whole patch
again today, or as much of it as I can. But before I do that, I want
to comment on this point specifically.

Just a thought - I don't feel strongly about this - but you may want
to consider storing your list of patterns in an array that gets
resized as necessary rather than a list. Then the pattern ID would
just be pattern_ptr - pattern_array, and finding the pattern by ID
would just be pattern_ptr = &pattern_array[pattern_id]. I don't think
there's a real efficiency issue here because the list of patterns is
almost always going to be short, and even if somebody decides to
provide a very long list of patterns (e.g. by using xargs) it's
probably still not that big a deal. A sufficiently obstinate user
running an operating system where argument lists can be extremely long
could probably make this the dominant cost by providing a gigantic
number of patterns that don't match anything, but such a person is
trying to prove a point, rather than accomplish anything useful, so I
don't care. But, the code might be more elegant the other way.

Done. I was not too motivated by the efficiency argument, but the code to look up patterns is cleaner when the pattern_id is an index into an array than when it is a field in a struct that has to be searched for in a list.

This patch increases the number of cases where we use ^ to assert that
exactly one of two things is true from 4 to 5. I think it might be
better to just write out (a && !b) || (b && !a), but there is some
precedent for the way you did it so perhaps it's fine.

Your formulation takes longer for me to read and understand (by, perhaps, some milliseconds), but when I checked what C compilers guarantee to store in

bool a = (i == j);
bool b = (k == l);

I found it hard to be sure that some compiler wouldn't do weird things with that. Two "true" values a and b could pass the (a ^ b) test if they represent "true" in two different bit patterns. I don't really think there is a risk here in practice, but looking up the relevant C standards isn't quick for future readers of this code, so I went with your formulation.

The name prepare_table_command() is oddly non-parallel with
verify_heapam_slot_handler(). Seems better to call it either a table
throughout, or a heapam throughout. Actually I think I would prefer
"heap" to either of those, but I definitely think we shouldn't switch
terminology. Note that prepare_btree_command() doesn't have this
issue, since it matches verify_btree_slot_handler(). On a related
note, "c.relam = 2" is really a test for is_heap, not is_table. We
might have other table AMs in the future, but only one of those AMs
will be called heap, and only one will have OID 2.

Changed to use "heap" in many places where "table" was used previously, and to use "btree" in many places where "index" was used previously. The term "heapam" now only occurs as part of "verify_heapam", a function defined in contrib/amcheck and not changed here.

You've got some weird round-tripping stuff where you sent literal
values to the server so that you can turn around and get them back
from the server. For example, you've got prepare_table_command()
select rel->nspname and rel->relname back from the server as literals,
which seems silly because we have to already have that information or
we couldn't ask the server to give it to us ... and if we already have
it, then why do we need to get it again? The reason it's like this
seems to be that after calling prepare_table_command(), we use
ParallelSlotSetHandler() to set verify_heapam_slot_handler() as the
callback, and we set sql.data as the callback, so we don't have access
to the RelationInfo object when we're handling the slot result. But
that's easy to fix: just store the sql as a field inside the
RelationInfo, and then pass a pointer to the whole RelationInfo to the
slot handler. Then you don't need to round-trip the table and schema
names; and you have the values available even if an error happens.

Changed. I was doing that mostly so that people examining the server logs would have something more than the oid in the sql to suggest which table or index is being checked.

On a somewhat related note, I think it might make sense to have the
slot handlers try to free memory. It seems hard to make pg_amcheck
leak enough memory to matter, but I guess it's not entirely
implausible that someone could be checking let's say 10 million
relations. Freeing the query strings could probably prevent a half a
GB or so of accumulated memory usage under those circumstances. I
suppose freeing nspname and relname would save a bit more, but it's
hardly worth doing since they are a lot shorter and you've got to have
all that information in memory at once at some point anyway; similarly
with the RelationInfo structures, which have the further complexity of
being part of a linked list you might not want to corrupt. But you
don't need to have every query string in memory at the same time, just
as many as are running at any one time.

Changed.

Also, maybe compile_relation_list_one_db() should keep the result set
around so that you don't need to pstrdup() the nspname and relname in
the first place. Right now, just before compile_relation_list_one_db()
calls PQclear() you have two copies of every nspname and relname
allocated. If you just kept the result sets around forever, the peak
memory usage would be lower than it is currently. If you really wanted
to get fancy you could arrange to free each result set when you've
finished that database, but that seems annoying to code and I'm pretty
sure it doesn't matter.

Hmm. When compile_relation_list_one_db() is processing the ith database out of N databases, all (nspname,relname) pairs are allocated for databases in [0..i], and additionally the result set for database i is in memory. The result sets for [0..i-1] have already been freed. Keeping around the result sets for all N databases seems more expensive, considering how much stuff is in struct pg_result, if N is large and the relations are spread across the databases rather than clumped together in the last one.

I think your proposal might be a win for some users and a loss for others. Given that it is not a clear win, I don't care to implement it that way, as it takes more effort to remember which object owns which bit of memory.

I have added pfree()s to the handlers to free the nspname and relname when finished. This does little to reduce the peak memory usage, though.

The CTEs called "include_raw" and "exclude_raw" which are used as part
of the query to construct a list of tables. The regexes are fished
through there, and the pattern IDs, which makes sense, but the raw
patterns are also fished through, and I don't see a reason for that.
We don't seem to need that for anything. The same seems to apply to
the query used to resolve database patterns.

Changed.

Both queries are changed to no longer have a "pat" column, and the "id" field (renamed as "pattern_id" for clarity) is used instead.

I see that most of the queries have now been adjusted to be spread
across fewer lines, which is good, but please make sure to do that
everywhere. In particular, I notice that the bt_index_check calls are
still too spread out.

When running `pg_amcheck --echo`, the queries for a table and index now print as:

SELECT blkno, offnum, attnum, msg FROM "public".verify_heapam(
relation := 33024, on_error_stop := false, check_toast := true, skip := 'none')
SELECT * FROM "public".bt_index_check(index := '33029'::regclass, heapallindexed := false)

That is two lines per heap table, and just one line per btree index.

On Thu, Mar 4, 2021 at 12:27 PM Robert Haas <robertmhaas@gmail.com> wrote:

More in a bit, need to grab some lunch.

Moving on to the tests, in 003_check.pl, I think it would be slightly
better if relation_toast were to select ct.oid::regclass and then just
have the caller use that value directly. We'd certainly want to do
that if the name could contain any characters that might require
quoting. Here that's not possible, but I think we might as well use
the same technique anyway.

Using c.reltoastrelid::regclass, which is basically the same idea.

I'm not sure how far to go with it, but I think that you might want to
try to enhance the logging in some of the cases where the TAP tests
might fail. In particular, if either of these trip in the buildfarm,
it doesn't seem like it will be too easy to figure out why they
failed:

+    fail('Xid thresholds not as expected');
+        fail('Page layout differs from our expectations');

Ok, I've extended these messages with the extra debugging information. I have also changed them to use 'plan skip_all', since what we are really talking about here is the test's inability to properly exercise pg_amcheck, not an actual failure of pg_amcheck to function correctly. This should save us some grief if the test isn't portable to all platforms in the build farm, though we'll have to check whether the skip messages are happening on any farm animals.

You might want to rephrase the message to incorporate the values that
triggered the failure, e.g. "datfrozenxid $datfrozenxid is not between
3 and $relfrozenxid", "expected (a,b) = (12345678,abcdefg) but got
($x,$y)", so that if the buildfarm happens to fail there's a shred of
hope that we might be able to guess the reason from the message.

Added to the skip_all message.

You
could also give some thought to whether there are any tests that can
be improved in similar ways. Test::More is nice in that when you run a
test with eq() or like() and it fails it will tell you about the input
values in the diagnostic, but if you do something like is($x < 4, ...)
instead of cmp_ok($x, '<', 4, ...) then you lose that. I'm not saying
you're doing that exact thing, just saying that looking through the
test code with an eye to finding things where you could output a
little more info about a potential failure might be a worthwhile
activity.

I'm mostly using command_checks_all and command_fails_like. The main annoyance is that when a pattern fails to match, you get a rather long error message. I'm not sure that it's lacking information, though.

If it were me, I would get rid of ROWCOUNT and have a list of
closures, and then loop over the list and call each one e.g. my
@corruption = ( sub { ... }, sub { ... }, sub { ... }) or maybe
something like what I did with @scenario in
src/bin/pg_verifybackup/t/003_corruption.pl, but this is ultimately a
style preference and I think the way you actually did it is also
reasonable, and some people might find it more readable than the other
way.

Unchanged. I think the closure idea is ok, but I am using the ROWCOUNT constant elsewhere (specifically, when inserting rows into the table) and using a constant for this helps keep the number of rows of data and the number of corruptions synchronized.

The name int4_fickle_ops is positively delightful and I love having a
test case like this.

I know you know this already, but for others reading this thread, the test using int4_fickle_ops is testing the kind of index corruption that might happen if you changed the sort order underlying an index, such as by updating collation definitions. It was simpler to not muck around with collations in the test itself, but to achieve the sort order breakage this way.
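To illustrate the kind of breakage being simulated — as an analogy in plain C, not the actual int4_fickle_ops definition from the test — a comparator that deviates from true integer order for even a single value produces data that is misordered under the real ordering, which is exactly the inconsistency bt_index_check is designed to report:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * A "fickle" integer comparison: normal ascending order, except that the
 * value 3 sorts after everything else.  Building an index under one
 * ordering and checking it under another models what happens when the
 * sort order underlying an index changes out from under it.
 */
static int
fickle_cmp(const void *a, const void *b)
{
	int			x = *(const int *) a;
	int			y = *(const int *) b;

	if (x == 3 && y != 3)
		return 1;				/* 3 sorts after everything */
	if (y == 3 && x != 3)
		return -1;
	return (x > y) - (x < y);
}

/* Returns 1 if the array is sorted under normal integer order, else 0. */
static int
is_sorted_normally(const int *v, int n)
{
	for (int i = 1; i < n; i++)
		if (v[i - 1] > v[i])
			return 0;
	return 1;
}
```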

On the whole, I think these tests look quite solid. I am a little
concerned, as you may gather from the comment above, that they will
not survive contact with the buildfarm, because they will turn out to
be platform or OS-dependent in some way. However, I can see that
you've taken steps to avoid such dependencies, and maybe we'll be
lucky and those will work. Also, while I am suspicious something's
going to break, I don't know what it's going to be, so I can't suggest
any method to avoid it. I think we'll just have to keep an eye on the
buildfarm post-commit and see what crops up.

As I mentioned above, I've changed some failures to 'plan skip_all => reason', so that the build farm won't break if the tests aren't portable in the ways I'm already anticipating. We'll just have to see whether it breaks in ways I'm not.

Turning to the documentation, I see that it is documented that a bare
command-line argument can be a connection string rather than a
database name. That sounds like a good plan, but when I try
'pg_amcheck sslmode=require' it does not work: FATAL: database
"sslmode=require" does not exist. The argument to -e is also
documented to be a connection string, but that also seems not to work.
Some thought might need to be given to what exactly these connection
options are supposed to mean. Like, do the connection options I set via
-e apply to all the connections I make, or just the one to the
maintenance database? How do I set connection options for connections
to databases whose names aren't specified explicitly but are
discovered by querying pg_database? Maybe instead of allowing these to
be a connection string, we should have a separate option that can be
used just for the purpose of setting connection options that then
apply to all connections. That seems a little bit oddly unlike other
tools, but if I want sslmode=verify-ca or something on all my
connections, there should be an easy way to get it.

I'm not sure where you are getting the '-e' from. That is the short form of --echo, and not what you are likely to want. However, your larger point is valid.

I don't like the idea that pg_amcheck would handle these options in a way that is incompatible with reindexdb or vacuumdb. I think pg_amcheck can have a superset of those tools' options, but it should not have options that are incompatible with those tools' options. That way, if the extra options that pg_amcheck offers become popular, we can add support for them in those other tools. But if the options are incompatible, we'd not be able to do that without breaking backward compatibility of those tools' interfaces, which we wouldn't want to do.

As such, I have solved the problem by reducing the number of dbname arguments you can provide on the command-line to just one. (This does not limit the number of database *patterns* that you can supply.) Those tools only allow one dbname on the command line, so this is not a regression of functionality from what those tools offer. Only the single dbname argument, or single maintenance-db argument, can be a connection string. The database patterns do not support that, nor would it make sense for them to do so.

All of the following should now work:

pg_amcheck --all "port=5555 sslmode=require"

pg_amcheck --maintenance-db="host=myhost port=5555 dbname=mydb sslmode=require" --all

pg_amcheck -d foo -d bar -d baz mydb

pg_amcheck -d foo -d bar -d baz "host=myhost dbname=mydb"

Note that using --all with a connection string is a pg_amcheck extension. It doesn't currently work in reindexdb, which complains.

There is a strange case, `pg_amcheck --maintenance-db="port=5555 dbname=postgres" "port=5432 dbname=regression"`, which doesn't complain, despite there being nothing listening on port 5555. This is because pg_amcheck completely ignores the maintenance-db argument in this instance, but I have not changed this behavior, because reindexdb does the same thing.

The documentation makes many references to patterns, but does not
explain what a pattern is. I see that psql's documentation contains an
explanation, and pg_dump's documentation links to psql's
documentation. pg_amcheck should probably link to psql's
documentation, too.

A prior version of this patch had a reference to that, but no more. Thanks for noticing. I've put it back in. There is some tension here between the desire to keep the docs concise and the desire to explain things better with examples, etc. I'm not sure I've got that balance right, but I'm too close to the project to be the right person to make that call. Does it seem ok?

In the documentation for -d, you say that "If -a --all is also
specified, -d --database does not additionally affect which databases
are checked." I suggest replacing "does not additionally affect which
databases are checked" with "has no effect."

Changed.

In two places you say "without regard for" but I think it should be
"without regard to".

Changed.

In the documentation for --no-strict-names you use "nor" where I think
it should say "or".

Changed.

I kind of wonder whether we need --quiet. It seems like right now it
only does two things. One is to control complaints about ignoring the
startblock and endblock options, but I don't agree with that behavior
anyway. The other is control whether we complain about unmatched
patterns, but I think that could just be controlled by --no-strict-names,
i.e. normally an unmatched pattern results in a complaint and a
failure, but with --no-strict-names there is neither a complaint nor a
failure. Having a flag to control whether we get the message
separately from whether we get the failure doesn't seem helpful.

Hmm. I think that having --quiet plus --no-strict-names suppress the warnings about unmatched patterns has some value.

Also, as discussed above, I now decrease the PGVerbosity to PQERRORS_TERSE, which has additional value, I think.

But I don't feel strongly about this, and if you'd rather --quiet be removed, that's fine, too. But I'll wait to hear back about that.

I don't think it's good to say "This is an alias for" in the
documentation of -i -I -t -T. I suggest instead saying "This is
similar to".

Changed.

Instead of "Option BLAH takes precedence over..." I suggest "The BLAH
option takes precedence over..."

Changed.

Attachments:

v44-0001-Reworking-ParallelSlots-for-mutliple-DB-use.patch (application/octet-stream)
From 5f9396849ff1c38619b7ea7727af377a47319233 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Wed, 3 Mar 2021 07:16:55 -0800
Subject: [PATCH v44 1/3] Reworking ParallelSlots for mutliple DB use

The existing implementation of ParallelSlots is used by reindexdb
and vacuumdb to process tables in parallel in only one database at
a time.  The ParallelSlots interface reflects this usage pattern.
The function to set up the slots assumes all slots should be
connected to the same database, and the function for getting the
next idle slot pays no attention to which database the slot may be
connected to.

In anticipation of pg_amcheck using parallel slots to process
multiple databases in parallel, rework the interface while keeping
it reasonably simple for reindexdb and vacuumdb to use:

ParallelSlotsSetup() no longer creates or receives database
connections.  It takes arguments that it stores for use in
subsequent operations when a connection needs to be formed.

Callers who already have a connection and want to reuse it can give
it to the parallel slots using a new function,
ParallelSlotsAdoptConn().  Both reindexdb and vacuumdb use this.

ParallelSlotsGetIdle() is extended to take a dbname argument
indicating the database to which a connection is desired, and to
manage a heterogeneous set of slots potentially connected to varying
databases and some perhaps not yet connected.  The function will
reuse an existing connection or form a new connection as necessary.

The logic for determining whether a slot's connection is suitable
for reuse is based on the database the slot's connection is
connected to, and whether that matches the database desired.  Other
connection parameters (user, host, port, etc.) are assumed not to
change from slot to slot.
---
 src/bin/scripts/reindexdb.c          |  17 +-
 src/bin/scripts/vacuumdb.c           |  46 +--
 src/fe_utils/parallel_slot.c         | 407 +++++++++++++++++++--------
 src/include/fe_utils/parallel_slot.h |  27 +-
 src/tools/pgindent/typedefs.list     |   2 +
 5 files changed, 338 insertions(+), 161 deletions(-)

diff --git a/src/bin/scripts/reindexdb.c b/src/bin/scripts/reindexdb.c
index cf28176243..fc0681538a 100644
--- a/src/bin/scripts/reindexdb.c
+++ b/src/bin/scripts/reindexdb.c
@@ -36,7 +36,7 @@ static SimpleStringList *get_parallel_object_list(PGconn *conn,
 												  ReindexType type,
 												  SimpleStringList *user_list,
 												  bool echo);
-static void reindex_one_database(const ConnParams *cparams, ReindexType type,
+static void reindex_one_database(ConnParams *cparams, ReindexType type,
 								 SimpleStringList *user_list,
 								 const char *progname,
 								 bool echo, bool verbose, bool concurrently,
@@ -330,7 +330,7 @@ main(int argc, char *argv[])
 }
 
 static void
-reindex_one_database(const ConnParams *cparams, ReindexType type,
+reindex_one_database(ConnParams *cparams, ReindexType type,
 					 SimpleStringList *user_list,
 					 const char *progname, bool echo,
 					 bool verbose, bool concurrently, int concurrentCons,
@@ -341,7 +341,7 @@ reindex_one_database(const ConnParams *cparams, ReindexType type,
 	bool		parallel = concurrentCons > 1;
 	SimpleStringList *process_list = user_list;
 	ReindexType process_type = type;
-	ParallelSlot *slots;
+	ParallelSlotArray *sa;
 	bool		failed = false;
 	int			items_count = 0;
 
@@ -461,7 +461,8 @@ reindex_one_database(const ConnParams *cparams, ReindexType type,
 
 	Assert(process_list != NULL);
 
-	slots = ParallelSlotsSetup(cparams, progname, echo, conn, concurrentCons);
+	sa = ParallelSlotsSetup(concurrentCons, cparams, progname, echo, NULL);
+	ParallelSlotsAdoptConn(sa, conn);
 
 	cell = process_list->head;
 	do
@@ -475,7 +476,7 @@ reindex_one_database(const ConnParams *cparams, ReindexType type,
 			goto finish;
 		}
 
-		free_slot = ParallelSlotsGetIdle(slots, concurrentCons);
+		free_slot = ParallelSlotsGetIdle(sa, NULL);
 		if (!free_slot)
 		{
 			failed = true;
@@ -489,7 +490,7 @@ reindex_one_database(const ConnParams *cparams, ReindexType type,
 		cell = cell->next;
 	} while (cell != NULL);
 
-	if (!ParallelSlotsWaitCompletion(slots, concurrentCons))
+	if (!ParallelSlotsWaitCompletion(sa))
 		failed = true;
 
 finish:
@@ -499,8 +500,8 @@ finish:
 		pg_free(process_list);
 	}
 
-	ParallelSlotsTerminate(slots, concurrentCons);
-	pfree(slots);
+	ParallelSlotsTerminate(sa);
+	pfree(sa);
 
 	if (failed)
 		exit(1);
diff --git a/src/bin/scripts/vacuumdb.c b/src/bin/scripts/vacuumdb.c
index 602fd45c42..7901c41f16 100644
--- a/src/bin/scripts/vacuumdb.c
+++ b/src/bin/scripts/vacuumdb.c
@@ -45,7 +45,7 @@ typedef struct vacuumingOptions
 } vacuumingOptions;
 
 
-static void vacuum_one_database(const ConnParams *cparams,
+static void vacuum_one_database(ConnParams *cparams,
 								vacuumingOptions *vacopts,
 								int stage,
 								SimpleStringList *tables,
@@ -408,7 +408,7 @@ main(int argc, char *argv[])
  * a list of tables from the database.
  */
 static void
-vacuum_one_database(const ConnParams *cparams,
+vacuum_one_database(ConnParams *cparams,
 					vacuumingOptions *vacopts,
 					int stage,
 					SimpleStringList *tables,
@@ -421,13 +421,14 @@ vacuum_one_database(const ConnParams *cparams,
 	PGresult   *res;
 	PGconn	   *conn;
 	SimpleStringListCell *cell;
-	ParallelSlot *slots;
+	ParallelSlotArray *sa;
 	SimpleStringList dbtables = {NULL, NULL};
 	int			i;
 	int			ntups;
 	bool		failed = false;
 	bool		tables_listed = false;
 	bool		has_where = false;
+	const char *initcmd;
 	const char *stage_commands[] = {
 		"SET default_statistics_target=1; SET vacuum_cost_delay=0;",
 		"SET default_statistics_target=10; RESET vacuum_cost_delay;",
@@ -684,26 +685,25 @@ vacuum_one_database(const ConnParams *cparams,
 		concurrentCons = 1;
 
 	/*
-	 * Setup the database connections. We reuse the connection we already have
-	 * for the first slot.  If not in parallel mode, the first slot in the
-	 * array contains the connection.
+	 * All slots need to be prepared to run the appropriate analyze stage, if
+	 * caller requested that mode.  We have to prepare the initial connection
+	 * ourselves before setting up the slots.
 	 */
-	slots = ParallelSlotsSetup(cparams, progname, echo, conn, concurrentCons);
+	if (stage == ANALYZE_NO_STAGE)
+		initcmd = NULL;
+	else
+	{
+		initcmd = stage_commands[stage];
+		executeCommand(conn, initcmd, echo);
+	}
 
 	/*
-	 * Prepare all the connections to run the appropriate analyze stage, if
-	 * caller requested that mode.
+	 * Setup the database connections. We reuse the connection we already have
+	 * for the first slot.  If not in parallel mode, the first slot in the
+	 * array contains the connection.
 	 */
-	if (stage != ANALYZE_NO_STAGE)
-	{
-		int			j;
-
-		/* We already emitted the message above */
-
-		for (j = 0; j < concurrentCons; j++)
-			executeCommand((slots + j)->connection,
-						   stage_commands[stage], echo);
-	}
+	sa = ParallelSlotsSetup(concurrentCons, cparams, progname, echo, initcmd);
+	ParallelSlotsAdoptConn(sa, conn);
 
 	initPQExpBuffer(&sql);
 
@@ -719,7 +719,7 @@ vacuum_one_database(const ConnParams *cparams,
 			goto finish;
 		}
 
-		free_slot = ParallelSlotsGetIdle(slots, concurrentCons);
+		free_slot = ParallelSlotsGetIdle(sa, NULL);
 		if (!free_slot)
 		{
 			failed = true;
@@ -740,12 +740,12 @@ vacuum_one_database(const ConnParams *cparams,
 		cell = cell->next;
 	} while (cell != NULL);
 
-	if (!ParallelSlotsWaitCompletion(slots, concurrentCons))
+	if (!ParallelSlotsWaitCompletion(sa))
 		failed = true;
 
 finish:
-	ParallelSlotsTerminate(slots, concurrentCons);
-	pg_free(slots);
+	ParallelSlotsTerminate(sa);
+	pg_free(sa);
 
 	termPQExpBuffer(&sql);
 
diff --git a/src/fe_utils/parallel_slot.c b/src/fe_utils/parallel_slot.c
index b625deb254..69581157c2 100644
--- a/src/fe_utils/parallel_slot.c
+++ b/src/fe_utils/parallel_slot.c
@@ -25,25 +25,16 @@
 #include "common/logging.h"
 #include "fe_utils/cancel.h"
 #include "fe_utils/parallel_slot.h"
+#include "fe_utils/query_utils.h"
 
 #define ERRCODE_UNDEFINED_TABLE  "42P01"
 
-static void init_slot(ParallelSlot *slot, PGconn *conn);
 static int	select_loop(int maxFd, fd_set *workerset);
 static bool processQueryResult(ParallelSlot *slot, PGresult *result);
 
-static void
-init_slot(ParallelSlot *slot, PGconn *conn)
-{
-	slot->connection = conn;
-	/* Initially assume connection is idle */
-	slot->isFree = true;
-	ParallelSlotClearHandler(slot);
-}
-
 /*
  * Process (and delete) a query result.  Returns true if there's no problem,
- * false otherwise. It's up to the handler to decide what cosntitutes a
+ * false otherwise. It's up to the handler to decide what constitutes a
  * problem.
  */
 static bool
@@ -137,151 +128,316 @@ select_loop(int maxFd, fd_set *workerset)
 }
 
 /*
- * ParallelSlotsGetIdle
- *		Return a connection slot that is ready to execute a command.
- *
- * This returns the first slot we find that is marked isFree, if one is;
- * otherwise, we loop on select() until one socket becomes available.  When
- * this happens, we read the whole set and mark as free all sockets that
- * become available.  If an error occurs, NULL is returned.
+ * Return the offset of a suitable idle slot, or -1 if none are available.  If
+ * the given dbname is not null, only idle slots connected to the given
+ * database are considered suitable, otherwise all idle connected slots are
+ * considered suitable.
  */
-ParallelSlot *
-ParallelSlotsGetIdle(ParallelSlot *slots, int numslots)
+static int
+find_matching_idle_slot(const ParallelSlotArray *sa, const char *dbname)
 {
 	int			i;
-	int			firstFree = -1;
 
-	/*
-	 * Look for any connection currently free.  If there is one, mark it as
-	 * taken and let the caller know the slot to use.
-	 */
-	for (i = 0; i < numslots; i++)
+	for (i = 0; i < sa->numslots; i++)
 	{
-		if (slots[i].isFree)
-		{
-			slots[i].isFree = false;
-			return slots + i;
-		}
+		if (sa->slots[i].inUse)
+			continue;
+
+		if (sa->slots[i].connection == NULL)
+			continue;
+
+		if (dbname == NULL ||
+			strcmp(PQdb(sa->slots[i].connection), dbname) == 0)
+			return i;
+	}
+	return -1;
+}
+
+/*
+ * Return the offset of the first slot without a database connection, or -1 if
+ * all slots are connected.
+ */
+static int
+find_unconnected_slot(const ParallelSlotArray *sa)
+{
+	int			i;
+
+	for (i = 0; i < sa->numslots; i++)
+	{
+		if (sa->slots[i].inUse)
+			continue;
+
+		if (sa->slots[i].connection == NULL)
+			return i;
+	}
+
+	return -1;
+}
+
+/*
+ * Return the offset of the first idle slot, or -1 if all slots are busy.
+ */
+static int
+find_any_idle_slot(const ParallelSlotArray *sa)
+{
+	int			i;
+
+	for (i = 0; i < sa->numslots; i++)
+		if (!sa->slots[i].inUse)
+			return i;
+
+	return -1;
+}
+
+/*
+ * Wait for any slot's connection to have query results, consume the results,
+ * and update the slot's status as appropriate.  Returns true on success,
+ * false on cancellation, on error, or if no slots are connected.
+ */
+static bool
+wait_on_slots(ParallelSlotArray *sa)
+{
+	int			i;
+	fd_set		slotset;
+	int			maxFd = 0;
+	PGconn	   *cancelconn = NULL;
+
+	/* We must reconstruct the fd_set for each call to select_loop */
+	FD_ZERO(&slotset);
+
+	for (i = 0; i < sa->numslots; i++)
+	{
+		int			sock;
+
+		/* We shouldn't get here if we still have slots without connections */
+		Assert(sa->slots[i].connection != NULL);
+
+		sock = PQsocket(sa->slots[i].connection);
+
+		/*
+		 * We don't really expect any connections to lose their sockets after
+		 * startup, but just in case, cope by ignoring them.
+		 */
+		if (sock < 0)
+			continue;
+
+		/* Keep track of the first valid connection we see. */
+		if (cancelconn == NULL)
+			cancelconn = sa->slots[i].connection;
+
+		FD_SET(sock, &slotset);
+		if (sock > maxFd)
+			maxFd = sock;
 	}
 
 	/*
-	 * No free slot found, so wait until one of the connections has finished
-	 * its task and return the available slot.
+	 * If we get this far with no valid connections, processing cannot
+	 * continue.
 	 */
-	while (firstFree < 0)
+	if (cancelconn == NULL)
+		return false;
+
+	SetCancelConn(cancelconn);
+	i = select_loop(maxFd, &slotset);
+	ResetCancelConn();
+
+	/* failure? */
+	if (i < 0)
+		return false;
+
+	for (i = 0; i < sa->numslots; i++)
 	{
-		fd_set		slotset;
-		int			maxFd = 0;
+		int			sock;
 
-		/* We must reconstruct the fd_set for each call to select_loop */
-		FD_ZERO(&slotset);
+		sock = PQsocket(sa->slots[i].connection);
 
-		for (i = 0; i < numslots; i++)
+		if (sock >= 0 && FD_ISSET(sock, &slotset))
 		{
-			int			sock = PQsocket(slots[i].connection);
-
-			/*
-			 * We don't really expect any connections to lose their sockets
-			 * after startup, but just in case, cope by ignoring them.
-			 */
-			if (sock < 0)
-				continue;
-
-			FD_SET(sock, &slotset);
-			if (sock > maxFd)
-				maxFd = sock;
+			/* select() says input is available, so consume it */
+			PQconsumeInput(sa->slots[i].connection);
 		}
 
-		SetCancelConn(slots->connection);
-		i = select_loop(maxFd, &slotset);
-		ResetCancelConn();
-
-		/* failure? */
-		if (i < 0)
-			return NULL;
-
-		for (i = 0; i < numslots; i++)
+		/* Collect result(s) as long as any are available */
+		while (!PQisBusy(sa->slots[i].connection))
 		{
-			int			sock = PQsocket(slots[i].connection);
+			PGresult   *result = PQgetResult(sa->slots[i].connection);
 
-			if (sock >= 0 && FD_ISSET(sock, &slotset))
+			if (result != NULL)
 			{
-				/* select() says input is available, so consume it */
-				PQconsumeInput(slots[i].connection);
+				/* Handle and discard the command result */
+				if (!processQueryResult(&sa->slots[i], result))
+					return false;
 			}
-
-			/* Collect result(s) as long as any are available */
-			while (!PQisBusy(slots[i].connection))
+			else
 			{
-				PGresult   *result = PQgetResult(slots[i].connection);
-
-				if (result != NULL)
-				{
-					/* Handle and discard the command result */
-					if (!processQueryResult(slots + i, result))
-						return NULL;
-				}
-				else
-				{
-					/* This connection has become idle */
-					slots[i].isFree = true;
-					ParallelSlotClearHandler(slots + i);
-					if (firstFree < 0)
-						firstFree = i;
-					break;
-				}
+				/* This connection has become idle */
+				sa->slots[i].inUse = false;
+				ParallelSlotClearHandler(&sa->slots[i]);
+				break;
 			}
 		}
 	}
+	return true;
+}
 
-	slots[firstFree].isFree = false;
-	return slots + firstFree;
+/*
+ * Open a new database connection using the stored connection parameters and
+ * optionally a given dbname if not null, execute the stored initial command if
+ * any, and associate the new connection with the given slot.
+ */
+static void
+connect_slot(ParallelSlotArray *sa, int slotno, const char *dbname)
+{
+	const char *old_override;
+	ParallelSlot *slot = &sa->slots[slotno];
+
+	old_override = sa->cparams->override_dbname;
+	if (dbname)
+		sa->cparams->override_dbname = dbname;
+	slot->connection = connectDatabase(sa->cparams, sa->progname, sa->echo, false, true);
+	sa->cparams->override_dbname = old_override;
+
+	if (PQsocket(slot->connection) >= FD_SETSIZE)
+	{
+		pg_log_fatal("too many jobs for this platform");
+		exit(1);
+	}
+
+	/* Setup the connection using the supplied command, if any. */
+	if (sa->initcmd)
+		executeCommand(slot->connection, sa->initcmd, sa->echo);
 }
 
 /*
- * ParallelSlotsSetup
- *		Prepare a set of parallel slots to use on a given database.
+ * ParallelSlotsGetIdle
+ *		Return a connection slot that is ready to execute a command.
+ *
+ * The slot returned is chosen as follows:
+ *
+ * If any idle slot already has an open connection, and if either dbname is
+ * null or the existing connection is to the given database, that slot will be
+ * returned allowing the connection to be reused.
+ *
+ * Otherwise, if any idle slot is not yet connected to any database, the slot
+ * will be returned with its connection opened using the stored cparams and
+ * optionally the given dbname if not null.
+ *
+ * Otherwise, if any idle slot exists, an idle slot will be chosen and returned
+ * after having its connection disconnected and reconnected using the stored
+ * cparams and optionally the given dbname if not null.
  *
- * This creates and initializes a set of connections to the database
- * using the information given by the caller, marking all parallel slots
- * as free and ready to use.  "conn" is an initial connection set up
- * by the caller and is associated with the first slot in the parallel
- * set.
+ * Otherwise, if any slots have connections that are busy, we loop on select()
+ * until one socket becomes available.  When this happens, we read the whole
+ * set and mark as free all sockets that become available.  We then select a
+ * slot using the same rules as above.
+ *
+ * Otherwise, we cannot return a slot, which is an error, and NULL is returned.
+ *
+ * For any connection created, if the stored initcmd is not null, it will be
+ * executed as a command on the newly formed connection before the slot is
+ * returned.
+ *
+ * If an error occurs, NULL is returned.
  */
 ParallelSlot *
-ParallelSlotsSetup(const ConnParams *cparams,
-				   const char *progname, bool echo,
-				   PGconn *conn, int numslots)
+ParallelSlotsGetIdle(ParallelSlotArray *sa, const char *dbname)
 {
-	ParallelSlot *slots;
-	int			i;
+	int			offset;
 
-	Assert(conn != NULL);
+	Assert(sa);
+	Assert(sa->numslots > 0);
 
-	slots = (ParallelSlot *) pg_malloc(sizeof(ParallelSlot) * numslots);
-	init_slot(slots, conn);
-	if (numslots > 1)
+	while (1)
 	{
-		for (i = 1; i < numslots; i++)
+		/* First choice: a slot already connected to the desired database. */
+		offset = find_matching_idle_slot(sa, dbname);
+		if (offset >= 0)
 		{
-			conn = connectDatabase(cparams, progname, echo, false, true);
-
-			/*
-			 * Fail and exit immediately if trying to use a socket in an
-			 * unsupported range.  POSIX requires open(2) to use the lowest
-			 * unused file descriptor and the hint given relies on that.
-			 */
-			if (PQsocket(conn) >= FD_SETSIZE)
-			{
-				pg_log_fatal("too many jobs for this platform -- try %d", i);
-				exit(1);
-			}
+			sa->slots[offset].inUse = true;
+			return &sa->slots[offset];
+		}
+
+		/* Second choice: a slot not connected to any database. */
+		offset = find_unconnected_slot(sa);
+		if (offset >= 0)
+		{
+			connect_slot(sa, offset, dbname);
+			sa->slots[offset].inUse = true;
+			return &sa->slots[offset];
+		}
 
-			init_slot(slots + i, conn);
+		/* Third choice: a slot connected to the wrong database. */
+		offset = find_any_idle_slot(sa);
+		if (offset >= 0)
+		{
+			disconnectDatabase(sa->slots[offset].connection);
+			sa->slots[offset].connection = NULL;
+			connect_slot(sa, offset, dbname);
+			sa->slots[offset].inUse = true;
+			return &sa->slots[offset];
 		}
+
+		/*
+		 * Fourth choice: block until one or more slots become available. If
+		 * any slots hit a fatal error, we'll find out about that here and
+		 * return NULL.
+		 */
+		if (!wait_on_slots(sa))
+			return NULL;
 	}
+}
+
+/*
+ * ParallelSlotsSetup
+ *		Prepare a set of parallel slots but do not connect to any database.
+ *
+ * This creates and initializes a set of slots, marking all parallel slots as
+ * free and ready to use.  Establishing connections is delayed until requesting
+ * a free slot.  The cparams, progname, echo, and initcmd are stored for later
+ * use and must remain valid for the lifetime of the returned array.
+ */
+ParallelSlotArray *
+ParallelSlotsSetup(int numslots, ConnParams *cparams, const char *progname,
+				   bool echo, const char *initcmd)
+{
+	ParallelSlotArray *sa;
 
-	return slots;
+	Assert(numslots > 0);
+	Assert(cparams != NULL);
+	Assert(progname != NULL);
+
+	sa = (ParallelSlotArray *) palloc0(offsetof(ParallelSlotArray, slots) +
+									   numslots * sizeof(ParallelSlot));
+
+	sa->numslots = numslots;
+	sa->cparams = cparams;
+	sa->progname = progname;
+	sa->echo = echo;
+	sa->initcmd = initcmd;
+
+	return sa;
+}
+
+/*
+ * ParallelSlotsAdoptConn
+ *		Assign an open connection to the slots array for reuse.
+ *
+ * This turns over ownership of an open connection to a slots array.  The
+ * caller should not further use or close the connection.  All the connection's
+ * parameters (user, host, port, etc.) except possibly dbname should match
+ * those of the slots array's cparams, as given in ParallelSlotsSetup.  If
+ * these parameters differ, subsequent behavior is undefined.
+ */
+void
+ParallelSlotsAdoptConn(ParallelSlotArray *sa, PGconn *conn)
+{
+	int			offset;
+
+	offset = find_unconnected_slot(sa);
+	if (offset >= 0)
+		sa->slots[offset].connection = conn;
+	else
+		disconnectDatabase(conn);
 }
 
 /*
@@ -292,13 +448,13 @@ ParallelSlotsSetup(const ConnParams *cparams,
  * terminate all connections.
  */
 void
-ParallelSlotsTerminate(ParallelSlot *slots, int numslots)
+ParallelSlotsTerminate(ParallelSlotArray *sa)
 {
 	int			i;
 
-	for (i = 0; i < numslots; i++)
+	for (i = 0; i < sa->numslots; i++)
 	{
-		PGconn	   *conn = slots[i].connection;
+		PGconn	   *conn = sa->slots[i].connection;
 
 		if (conn == NULL)
 			continue;
@@ -314,13 +470,15 @@ ParallelSlotsTerminate(ParallelSlot *slots, int numslots)
  * error has been found on the way.
  */
 bool
-ParallelSlotsWaitCompletion(ParallelSlot *slots, int numslots)
+ParallelSlotsWaitCompletion(ParallelSlotArray *sa)
 {
 	int			i;
 
-	for (i = 0; i < numslots; i++)
+	for (i = 0; i < sa->numslots; i++)
 	{
-		if (!consumeQueryResult(slots + i))
+		if (sa->slots[i].connection == NULL)
+			continue;
+		if (!consumeQueryResult(&sa->slots[i]))
 			return false;
 	}
 
@@ -350,6 +508,9 @@ ParallelSlotsWaitCompletion(ParallelSlot *slots, int numslots)
 bool
 TableCommandResultHandler(PGresult *res, PGconn *conn, void *context)
 {
+	Assert(res != NULL);
+	Assert(conn != NULL);
+
 	/*
 	 * If it's an error, report it.  Errors about a missing table are harmless
 	 * so we continue processing; but die for other errors.
diff --git a/src/include/fe_utils/parallel_slot.h b/src/include/fe_utils/parallel_slot.h
index 8902f8d4f4..b7e2b0a29b 100644
--- a/src/include/fe_utils/parallel_slot.h
+++ b/src/include/fe_utils/parallel_slot.h
@@ -21,7 +21,7 @@ typedef bool (*ParallelSlotResultHandler) (PGresult *res, PGconn *conn,
 typedef struct ParallelSlot
 {
 	PGconn	   *connection;		/* One connection */
-	bool		isFree;			/* Is it known to be idle? */
+	bool		inUse;			/* Is the slot being used? */
 
 	/*
 	 * Prior to issuing a command or query on 'connection', a handler callback
@@ -33,6 +33,16 @@ typedef struct ParallelSlot
 	void	   *handler_context;
 } ParallelSlot;
 
+typedef struct ParallelSlotArray
+{
+	int			numslots;
+	ConnParams *cparams;
+	const char *progname;
+	bool		echo;
+	const char *initcmd;
+	ParallelSlot slots[FLEXIBLE_ARRAY_MEMBER];
+} ParallelSlotArray;
+
 static inline void
 ParallelSlotSetHandler(ParallelSlot *slot, ParallelSlotResultHandler handler,
 					   void *context)
@@ -48,15 +58,18 @@ ParallelSlotClearHandler(ParallelSlot *slot)
 	slot->handler_context = NULL;
 }
 
-extern ParallelSlot *ParallelSlotsGetIdle(ParallelSlot *slots, int numslots);
+extern ParallelSlot *ParallelSlotsGetIdle(ParallelSlotArray *slots,
+										  const char *dbname);
+
+extern ParallelSlotArray *ParallelSlotsSetup(int numslots, ConnParams *cparams,
+											 const char *progname, bool echo,
+											 const char *initcmd);
 
-extern ParallelSlot *ParallelSlotsSetup(const ConnParams *cparams,
-										const char *progname, bool echo,
-										PGconn *conn, int numslots);
+extern void ParallelSlotsAdoptConn(ParallelSlotArray *sa, PGconn *conn);
 
-extern void ParallelSlotsTerminate(ParallelSlot *slots, int numslots);
+extern void ParallelSlotsTerminate(ParallelSlotArray *sa);
 
-extern bool ParallelSlotsWaitCompletion(ParallelSlot *slots, int numslots);
+extern bool ParallelSlotsWaitCompletion(ParallelSlotArray *sa);
 
 extern bool TableCommandResultHandler(PGresult *res, PGconn *conn,
 									  void *context);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e4d2debb3c..8ef71bd900 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -403,6 +403,7 @@ ConfigData
 ConfigVariable
 ConnCacheEntry
 ConnCacheKey
+ConnParams
 ConnStatusType
 ConnType
 ConnectionStateEnum
@@ -1729,6 +1730,7 @@ ParallelHashJoinState
 ParallelIndexScanDesc
 ParallelReadyList
 ParallelSlot
+ParallelSlotArray
 ParallelState
 ParallelTableScanDesc
 ParallelTableScanDescData
-- 
2.21.1 (Apple Git-122.3)

Attachment: v44-0002-Adding-contrib-module-pg_amcheck.patch (application/octet-stream)
From c1162e467b2ef5e36efce6037d58b7e909d7a4f0 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Tue, 2 Mar 2021 08:34:40 -0800
Subject: [PATCH v44 2/3] Adding contrib module pg_amcheck

Adding new contrib module pg_amcheck, which is a command line
interface for running amcheck's verifications against tables and
indexes.
---
 contrib/Makefile                           |    1 +
 contrib/pg_amcheck/.gitignore              |    3 +
 contrib/pg_amcheck/Makefile                |   29 +
 contrib/pg_amcheck/pg_amcheck.c            | 2104 ++++++++++++++++++++
 contrib/pg_amcheck/t/001_basic.pl          |    9 +
 contrib/pg_amcheck/t/002_nonesuch.pl       |  248 +++
 contrib/pg_amcheck/t/003_check.pl          |  497 +++++
 contrib/pg_amcheck/t/004_verify_heapam.pl  |  517 +++++
 contrib/pg_amcheck/t/005_opclass_damage.pl |   54 +
 doc/src/sgml/contrib.sgml                  |    1 +
 doc/src/sgml/filelist.sgml                 |    1 +
 doc/src/sgml/pgamcheck.sgml                |  701 +++++++
 src/tools/msvc/Install.pm                  |    2 +-
 src/tools/msvc/Mkvcbuild.pm                |    6 +-
 src/tools/pgindent/typedefs.list           |    5 +
 15 files changed, 4174 insertions(+), 4 deletions(-)
 create mode 100644 contrib/pg_amcheck/.gitignore
 create mode 100644 contrib/pg_amcheck/Makefile
 create mode 100644 contrib/pg_amcheck/pg_amcheck.c
 create mode 100644 contrib/pg_amcheck/t/001_basic.pl
 create mode 100644 contrib/pg_amcheck/t/002_nonesuch.pl
 create mode 100644 contrib/pg_amcheck/t/003_check.pl
 create mode 100644 contrib/pg_amcheck/t/004_verify_heapam.pl
 create mode 100644 contrib/pg_amcheck/t/005_opclass_damage.pl
 create mode 100644 doc/src/sgml/pgamcheck.sgml

diff --git a/contrib/Makefile b/contrib/Makefile
index f27e458482..a72dcf7304 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -30,6 +30,7 @@ SUBDIRS = \
 		old_snapshot	\
 		pageinspect	\
 		passwordcheck	\
+		pg_amcheck	\
 		pg_buffercache	\
 		pg_freespacemap \
 		pg_prewarm	\
diff --git a/contrib/pg_amcheck/.gitignore b/contrib/pg_amcheck/.gitignore
new file mode 100644
index 0000000000..c21a14de31
--- /dev/null
+++ b/contrib/pg_amcheck/.gitignore
@@ -0,0 +1,3 @@
+pg_amcheck
+
+/tmp_check/
diff --git a/contrib/pg_amcheck/Makefile b/contrib/pg_amcheck/Makefile
new file mode 100644
index 0000000000..bc61ee7970
--- /dev/null
+++ b/contrib/pg_amcheck/Makefile
@@ -0,0 +1,29 @@
+# contrib/pg_amcheck/Makefile
+
+PGFILEDESC = "pg_amcheck - detects corruption within database relations"
+PGAPPICON = win32
+
+PROGRAM = pg_amcheck
+OBJS = \
+	$(WIN32RES) \
+	pg_amcheck.o
+
+REGRESS_OPTS += --load-extension=amcheck --load-extension=pageinspect
+EXTRA_INSTALL += contrib/amcheck contrib/pageinspect
+
+TAP_TESTS = 1
+
+PG_CPPFLAGS = -I$(libpq_srcdir)
+PG_LIBS_INTERNAL = -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+SHLIB_PREREQS = submake-libpq
+subdir = contrib/pg_amcheck
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_amcheck/pg_amcheck.c b/contrib/pg_amcheck/pg_amcheck.c
new file mode 100644
index 0000000000..336140d962
--- /dev/null
+++ b/contrib/pg_amcheck/pg_amcheck.c
@@ -0,0 +1,2104 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_amcheck.c
+ *		Detects corruption within database relations.
+ *
+ * Copyright (c) 2017-2021, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  contrib/pg_amcheck/pg_amcheck.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <time.h>
+
+#include "catalog/pg_am_d.h"
+#include "catalog/pg_namespace_d.h"
+#include "common/logging.h"
+#include "common/username.h"
+#include "fe_utils/cancel.h"
+#include "fe_utils/option_utils.h"
+#include "fe_utils/parallel_slot.h"
+#include "fe_utils/query_utils.h"
+#include "fe_utils/simple_list.h"
+#include "fe_utils/string_utils.h"
+#include "getopt_long.h"		/* pgrminclude ignore */
+#include "pgtime.h"
+#include "storage/block.h"
+
+typedef struct PatternInfo
+{
+	const char *pattern;		/* Unaltered pattern from the command line */
+	char	   *db_regex;		/* Database regexp parsed from pattern, or
+								 * NULL */
+	char	   *nsp_regex;		/* Schema regexp parsed from pattern, or NULL */
+	char	   *rel_regex;		/* Relation regexp parsed from pattern, or
+								 * NULL */
+	bool		heap_only;		/* true if rel_regex should only match heap
+								 * tables */
+	bool		btree_only;		/* true if rel_regex should only match btree
+								 * indexes */
+	bool		matched;		/* true if the pattern matched in any database */
+} PatternInfo;
+
+typedef struct PatternInfoArray
+{
+	PatternInfo *data;
+	size_t		len;
+} PatternInfoArray;
+
+/* pg_amcheck command line options controlled by user flags */
+typedef struct AmcheckOptions
+{
+	bool		alldb;
+	bool		echo;
+	bool		quiet;
+	bool		verbose;
+	bool		strict_names;
+	bool		show_progress;
+	int			jobs;
+
+	/* Objects to check or not to check, as lists of PatternInfo structs. */
+	PatternInfoArray include;
+	PatternInfoArray exclude;
+
+	/*
+	 * As an optimization, these flags record whether any pattern in the
+	 * exclude list applies to heap tables, btree indexes, or schemas,
+	 * respectively: true if so, otherwise false.  They should always agree
+	 * with what you'd conclude by scanning through the exclude list
+	 * directly.
+	 */
+	bool		excludetbl;
+	bool		excludeidx;
+	bool		excludensp;
+
+	/*
+	 * If any inclusion pattern exists, then we should only be checking
+	 * matching relations rather than all relations, so this is true iff
+	 * include is empty.
+	 */
+	bool		allrel;
+
+	/* heap table checking options */
+	bool		no_toast_expansion;
+	bool		reconcile_toast;
+	bool		on_error_stop;
+	int64		startblock;
+	int64		endblock;
+	const char *skip;
+
+	/* btree index checking options */
+	bool		parent_check;
+	bool		rootdescend;
+	bool		heapallindexed;
+
+	/* heap and btree hybrid option */
+	bool		no_btree_expansion;
+} AmcheckOptions;
+
+static AmcheckOptions opts = {
+	.alldb = false,
+	.echo = false,
+	.quiet = false,
+	.verbose = false,
+	.strict_names = true,
+	.show_progress = false,
+	.jobs = 1,
+	.include = {NULL, 0},
+	.exclude = {NULL, 0},
+	.excludetbl = false,
+	.excludeidx = false,
+	.excludensp = false,
+	.allrel = true,
+	.no_toast_expansion = false,
+	.reconcile_toast = true,
+	.on_error_stop = false,
+	.startblock = -1,
+	.endblock = -1,
+	.skip = "none",
+	.parent_check = false,
+	.rootdescend = false,
+	.heapallindexed = false,
+	.no_btree_expansion = false
+};
+
+static const char *progname = NULL;
+
+/* Whether all relations have so far passed their corruption checks */
+static bool all_checks_pass = true;
+
+/* Time last progress report was displayed */
+static pg_time_t last_progress_report = 0;
+static bool progress_since_last_stderr = false;
+
+typedef struct DatabaseInfo
+{
+	char	   *datname;
+	char	   *amcheck_schema; /* escaped, quoted literal */
+} DatabaseInfo;
+
+typedef struct RelationInfo
+{
+	const DatabaseInfo *datinfo;	/* shared by other relinfos */
+	Oid			reloid;
+	bool		is_heap;		/* true if heap, false if btree */
+	char	   *nspname;
+	char	   *relname;
+	int			relpages;
+	int			blocks_to_check;
+	char	   *sql;			/* set during query run, pg_free'd after */
+} RelationInfo;
+
+/*
+ * Support for using RelationInfo objects to embed qualified relation names
+ * in strings with the pattern \"%s\".\"%s\".\"%s\".
+ */
+#define QUALIFIED_NAME_FIELDS(rel) \
+	(rel)->datinfo->datname, (rel)->nspname, (rel)->relname
+
+/*
+ * Query for determining if contrib's amcheck is installed.  If so, selects
+ * the namespace where amcheck's functions can be found, and its version.
+ */
+static const char *amcheck_sql =
+"SELECT n.nspname, x.extversion FROM pg_catalog.pg_extension x"
+"\nJOIN pg_catalog.pg_namespace n ON x.extnamespace = n.oid"
+"\nWHERE x.extname = 'amcheck'";
+
+static void prepare_heap_command(PQExpBuffer sql, RelationInfo *rel,
+								 PGconn *conn);
+static void prepare_btree_command(PQExpBuffer sql, RelationInfo *rel,
+								  PGconn *conn);
+static void run_command(ParallelSlot *slot, const char *sql,
+						ConnParams *cparams);
+static bool verify_heap_slot_handler(PGresult *res, PGconn *conn,
+									 void *context);
+static bool verify_btree_slot_handler(PGresult *res, PGconn *conn, void *context);
+static void help(const char *progname);
+static void progress_report(uint64 relations_total, uint64 relations_checked,
+							uint64 relpages_total, uint64 relpages_checked,
+							const char *datname, bool force, bool finished);
+
+static void append_database_pattern(PatternInfoArray *pia, const char *pattern,
+									int encoding);
+static void append_schema_pattern(PatternInfoArray *pia, const char *pattern,
+								  int encoding);
+static void append_relation_pattern(PatternInfoArray *pia, const char *pattern,
+									int encoding);
+static void append_heap_pattern(PatternInfoArray *pia, const char *pattern,
+								int encoding);
+static void append_btree_pattern(PatternInfoArray *pia, const char *pattern,
+								 int encoding);
+static void compile_database_list(PGconn *conn, SimplePtrList *databases,
+								  const char *initial_dbname);
+static void compile_relation_list_one_db(PGconn *conn, SimplePtrList *relations,
+										 const DatabaseInfo *datinfo,
+										 uint64 *pagecount);
+
+#define log_no_match(...) do { \
+		if (opts.strict_names) \
+			pg_log_generic(PG_LOG_ERROR, __VA_ARGS__); \
+		else \
+			pg_log_generic(PG_LOG_WARNING, __VA_ARGS__); \
+	} while(0)
+
+int
+main(int argc, char *argv[])
+{
+	PGconn	   *conn;
+	SimplePtrListCell *cell;
+	SimplePtrList databases = {NULL, NULL};
+	SimplePtrList relations = {NULL, NULL};
+	bool		failed = false;
+	const char *latest_datname;
+	int			parallel_workers;
+	ParallelSlotArray *sa;
+	PQExpBufferData sql;
+	uint64		reltotal = 0;
+	uint64		pageschecked = 0;
+	uint64		pagestotal = 0;
+	uint64		relprogress = 0;
+	int			pattern_id;
+
+	static struct option long_options[] = {
+		/* Connection options */
+		{"host", required_argument, NULL, 'h'},
+		{"port", required_argument, NULL, 'p'},
+		{"username", required_argument, NULL, 'U'},
+		{"no-password", no_argument, NULL, 'w'},
+		{"password", no_argument, NULL, 'W'},
+		{"maintenance-db", required_argument, NULL, 1},
+
+		/* check options */
+		{"all", no_argument, NULL, 'a'},
+		{"database", required_argument, NULL, 'd'},
+		{"exclude-database", required_argument, NULL, 'D'},
+		{"echo", no_argument, NULL, 'e'},
+		{"index", required_argument, NULL, 'i'},
+		{"exclude-index", required_argument, NULL, 'I'},
+		{"jobs", required_argument, NULL, 'j'},
+		{"progress", no_argument, NULL, 'P'},
+		{"quiet", no_argument, NULL, 'q'},
+		{"relation", required_argument, NULL, 'r'},
+		{"exclude-relation", required_argument, NULL, 'R'},
+		{"schema", required_argument, NULL, 's'},
+		{"exclude-schema", required_argument, NULL, 'S'},
+		{"table", required_argument, NULL, 't'},
+		{"exclude-table", required_argument, NULL, 'T'},
+		{"verbose", no_argument, NULL, 'v'},
+		{"no-dependent-indexes", no_argument, NULL, 2},
+		{"no-dependent-toast", no_argument, NULL, 3},
+		{"exclude-toast-pointers", no_argument, NULL, 4},
+		{"on-error-stop", no_argument, NULL, 5},
+		{"skip", required_argument, NULL, 6},
+		{"startblock", required_argument, NULL, 7},
+		{"endblock", required_argument, NULL, 8},
+		{"rootdescend", no_argument, NULL, 9},
+		{"no-strict-names", no_argument, NULL, 10},
+		{"heapallindexed", no_argument, NULL, 11},
+		{"parent-check", no_argument, NULL, 12},
+
+		{NULL, 0, NULL, 0}
+	};
+
+	int			optindex;
+	int			c;
+
+	const char *db = NULL;
+	const char *maintenance_db = NULL;
+
+	const char *host = NULL;
+	const char *port = NULL;
+	const char *username = NULL;
+	enum trivalue prompt_password = TRI_DEFAULT;
+	int			encoding = pg_get_encoding_from_locale(NULL, false);
+	ConnParams	cparams;
+
+	pg_logging_init(argv[0]);
+	progname = get_progname(argv[0]);
+	set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("contrib"));
+
+	handle_help_version_opts(argc, argv, progname, help);
+
+	/* process command-line options */
+	while ((c = getopt_long(argc, argv, "ad:D:eh:Hi:I:j:p:Pqr:R:s:S:t:T:U:wWv",
+							long_options, &optindex)) != -1)
+	{
+		char	   *endptr;
+
+		switch (c)
+		{
+			case 'a':
+				opts.alldb = true;
+				break;
+			case 'd':
+				append_database_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'D':
+				append_database_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 'e':
+				opts.echo = true;
+				break;
+			case 'h':
+				host = pg_strdup(optarg);
+				break;
+			case 'i':
+				opts.allrel = false;
+				append_btree_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'I':
+				opts.excludeidx = true;
+				append_btree_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 'j':
+				opts.jobs = atoi(optarg);
+				if (opts.jobs < 1)
+				{
+					fprintf(stderr,
+							"number of parallel jobs must be at least 1\n");
+					exit(1);
+				}
+				break;
+			case 'p':
+				port = pg_strdup(optarg);
+				break;
+			case 'P':
+				opts.show_progress = true;
+				break;
+			case 'q':
+				opts.quiet = true;
+				break;
+			case 'r':
+				opts.allrel = false;
+				append_relation_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'R':
+				opts.excludeidx = true;
+				opts.excludetbl = true;
+				append_relation_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 's':
+				opts.allrel = false;
+				append_schema_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'S':
+				opts.excludensp = true;
+				append_schema_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 't':
+				opts.allrel = false;
+				append_heap_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'T':
+				opts.excludetbl = true;
+				append_heap_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 'U':
+				username = pg_strdup(optarg);
+				break;
+			case 'w':
+				prompt_password = TRI_NO;
+				break;
+			case 'W':
+				prompt_password = TRI_YES;
+				break;
+			case 'v':
+				opts.verbose = true;
+				pg_logging_increase_verbosity();
+				break;
+			case 1:
+				maintenance_db = pg_strdup(optarg);
+				break;
+			case 2:
+				opts.no_btree_expansion = true;
+				break;
+			case 3:
+				opts.no_toast_expansion = true;
+				break;
+			case 4:
+				opts.reconcile_toast = false;
+				break;
+			case 5:
+				opts.on_error_stop = true;
+				break;
+			case 6:
+				if (pg_strcasecmp(optarg, "all-visible") == 0)
+					opts.skip = "all visible";
+				else if (pg_strcasecmp(optarg, "all-frozen") == 0)
+					opts.skip = "all frozen";
+				else
+				{
+					fprintf(stderr, "invalid skip option\n");
+					exit(1);
+				}
+				break;
+			case 7:
+				opts.startblock = strtol(optarg, &endptr, 10);
+				if (*endptr != '\0')
+				{
+					fprintf(stderr,
+							"relation start block argument contains garbage characters\n");
+					exit(1);
+				}
+				if (opts.startblock > MaxBlockNumber || opts.startblock < 0)
+				{
+					fprintf(stderr,
+							"relation start block argument out of bounds\n");
+					exit(1);
+				}
+				break;
+			case 8:
+				opts.endblock = strtol(optarg, &endptr, 10);
+				if (*endptr != '\0')
+				{
+					fprintf(stderr,
+							"relation end block argument contains garbage characters\n");
+					exit(1);
+				}
+				if (opts.endblock > MaxBlockNumber || opts.endblock < 0)
+				{
+					fprintf(stderr,
+							"relation end block argument out of bounds\n");
+					exit(1);
+				}
+				break;
+			case 9:
+				opts.rootdescend = true;
+				opts.parent_check = true;
+				break;
+			case 10:
+				opts.strict_names = false;
+				break;
+			case 11:
+				opts.heapallindexed = true;
+				break;
+			case 12:
+				opts.parent_check = true;
+				break;
+			default:
+				fprintf(stderr,
+						"Try \"%s --help\" for more information.\n",
+						progname);
+				exit(1);
+		}
+	}
+
+	if (opts.endblock >= 0 && opts.endblock < opts.startblock)
+	{
+		fprintf(stderr,
+				"relation end block argument precedes start block argument\n");
+		exit(1);
+	}
+
+	/*
+	 * A single non-option argument specifies a database name or connection
+	 * string.
+	 */
+	if (optind < argc)
+	{
+		db = argv[optind];
+		optind++;
+	}
+
+	if (optind < argc)
+	{
+		pg_log_error("too many command-line arguments (first is \"%s\")",
+					 argv[optind]);
+		fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+		exit(1);
+	}
+
+	/* fill cparams except for dbname, which is set below */
+	cparams.pghost = host;
+	cparams.pgport = port;
+	cparams.pguser = username;
+	cparams.prompt_password = prompt_password;
+	cparams.override_dbname = NULL;
+
+	setup_cancel_handler(NULL);
+
+	/* choose the database for our initial connection */
+	if (opts.alldb)
+	{
+		/*
+		 * Prefer a maintenance_db argument over a database argument when
+		 * --all is specified, but don't ignore the database argument when no
+		 * maintenance_db was given.  This allows users to give a connection
+		 * string with --all, like `pg_amcheck --all "port=7777
+		 * sslmode=require"`.
+		 */
+		if (db != NULL && maintenance_db == NULL)
+			cparams.dbname = db;
+		else
+			cparams.dbname = maintenance_db;
+	}
+	else if (db != NULL)
+		cparams.dbname = db;
+	else
+	{
+		const char *default_db;
+
+		if (getenv("PGDATABASE"))
+			default_db = getenv("PGDATABASE");
+		else if (getenv("PGUSER"))
+			default_db = getenv("PGUSER");
+		else
+			default_db = get_user_name_or_exit(progname);
+
+		cparams.dbname = default_db;
+	}
+
+	if (opts.alldb)
+	{
+		conn = connectMaintenanceDatabase(&cparams, progname, opts.echo);
+		compile_database_list(conn, &databases, NULL);
+	}
+	else
+	{
+		conn = connectDatabase(&cparams, progname, opts.echo, false, true);
+		compile_database_list(conn, &databases, PQdb(conn));
+	}
+
+	disconnectDatabase(conn);
+
+	if (databases.head == NULL)
+	{
+		pg_log_error("no databases to check");
+		exit(1);
+	}
+
+	/*
+	 * Compile a list of all relations spanning all databases to be checked.
+	 */
+	for (cell = databases.head; cell; cell = cell->next)
+	{
+		PGresult   *result;
+		int			ntups;
+		const char *amcheck_schema = NULL;
+		DatabaseInfo *dat = (DatabaseInfo *) cell->ptr;
+
+		cparams.override_dbname = dat->datname;
+		conn = connectDatabase(&cparams, progname, opts.echo, false, true);
+
+		/*
+		 * Verify that amcheck is installed for this database.  User error
+		 * could result in a database not having amcheck that should have
+		 * it, but we also could be iterating over multiple databases
+		 * where not all of them have amcheck installed (for example,
+		 * 'template1').
+		 */
+		result = executeQuery(conn, amcheck_sql, opts.echo);
+		if (PQresultStatus(result) != PGRES_TUPLES_OK)
+		{
+			/* Querying the catalog failed. */
+			pg_log_error("database \"%s\": %s",
+						 PQdb(conn), PQerrorMessage(conn));
+			pg_log_info("query was: %s", amcheck_sql);
+			PQclear(result);
+			disconnectDatabase(conn);
+			exit(1);
+		}
+		ntups = PQntuples(result);
+		if (ntups == 0)
+		{
+			/* Querying the catalog succeeded, but amcheck is missing. */
+			pg_log_warning("skipping database \"%s\": amcheck is not installed",
+						   PQdb(conn));
+			disconnectDatabase(conn);
+			continue;
+		}
+		amcheck_schema = PQgetvalue(result, 0, 0);
+		if (opts.verbose)
+			pg_log_info("in database \"%s\": using amcheck version \"%s\" in schema \"%s\"",
+						PQdb(conn), PQgetvalue(result, 0, 1), amcheck_schema);
+		dat->amcheck_schema = PQescapeIdentifier(conn, amcheck_schema,
+												 strlen(amcheck_schema));
+		PQclear(result);
+
+		compile_relation_list_one_db(conn, &relations, dat, &pagestotal);
+		disconnectDatabase(conn);
+	}
+
+	/*
+	 * Check that all inclusion patterns matched at least one schema or
+	 * relation that we can check.
+	 */
+	for (pattern_id = 0; pattern_id < opts.include.len; pattern_id++)
+	{
+		PatternInfo *pat = &opts.include.data[pattern_id];
+
+		if (!pat->matched && (pat->nsp_regex != NULL || pat->rel_regex != NULL))
+		{
+			failed = opts.strict_names;
+
+			if (!opts.quiet || failed)
+			{
+				if (pat->heap_only)
+					log_no_match("no heap tables to check matching \"%s\"",
+								 pat->pattern);
+				else if (pat->btree_only)
+					log_no_match("no btree indexes to check matching \"%s\"",
+								 pat->pattern);
+				else if (pat->rel_regex == NULL)
+					log_no_match("no relations to check in schemas matching \"%s\"",
+								 pat->pattern);
+				else
+					log_no_match("no relations to check matching \"%s\"",
+								 pat->pattern);
+			}
+		}
+	}
+
+	if (failed)
+		exit(1);
+
+	/*
+	 * Set parallel_workers to the lesser of opts.jobs and the number of
+	 * relations.
+	 */
+	parallel_workers = 0;
+	for (cell = relations.head; cell; cell = cell->next)
+	{
+		reltotal++;
+		if (parallel_workers < opts.jobs)
+			parallel_workers++;
+	}
+
+	if (reltotal == 0)
+	{
+		pg_log_error("no relations to check");
+		exit(1);
+	}
+	progress_report(reltotal, relprogress, pagestotal, pageschecked, NULL, true, false);
+
+	/*
+	 * Main event loop.
+	 *
+	 * We use server-side parallelism to check up to parallel_workers
+	 * relations in parallel.  The list of relations was computed in database
+	 * order, which minimizes the number of connects and disconnects as we
+	 * process the list.
+	 */
+	latest_datname = NULL;
+	sa = ParallelSlotsSetup(parallel_workers, &cparams, progname, opts.echo,
+							NULL);
+
+	initPQExpBuffer(&sql);
+	for (relprogress = 0, cell = relations.head; cell; cell = cell->next)
+	{
+		ParallelSlot *free_slot;
+		RelationInfo *rel;
+
+		rel = (RelationInfo *) cell->ptr;
+
+		if (CancelRequested)
+		{
+			failed = true;
+			break;
+		}
+
+		/*
+		 * The list of relations is in database sorted order.  If this next
+		 * relation is in a different database than the last one seen, we are
+		 * about to start checking this database.  Note that other slots may
+		 * still be working on relations from prior databases.
+		 */
+		latest_datname = rel->datinfo->datname;
+
+		progress_report(reltotal, relprogress, pagestotal, pageschecked, latest_datname, false, false);
+
+		relprogress++;
+		pageschecked += rel->blocks_to_check;
+
+		/*
+		 * Get a parallel slot for the next amcheck command, blocking if
+		 * necessary until one is available, or until a previously issued slot
+		 * command fails, indicating that we should abort checking the
+		 * remaining objects.
+		 */
+		free_slot = ParallelSlotsGetIdle(sa, rel->datinfo->datname);
+		if (!free_slot)
+		{
+			/*
+			 * Something failed.  We don't need to know what it was, because
+			 * the handler should already have emitted the necessary error
+			 * messages.
+			 */
+			failed = true;
+			break;
+		}
+
+		if (opts.verbose)
+			PQsetErrorVerbosity(free_slot->connection, PQERRORS_VERBOSE);
+		else if (opts.quiet)
+			PQsetErrorVerbosity(free_slot->connection, PQERRORS_TERSE);
+
+		/*
+		 * Execute the appropriate amcheck command for this relation using our
+		 * slot's database connection.  We do not wait for the command to
+		 * complete, nor do we perform any error checking, as that is done by
+		 * the parallel slots and our handler callback functions.
+		 */
+		if (rel->is_heap)
+		{
+			if (opts.verbose)
+			{
+				if (opts.show_progress && progress_since_last_stderr)
+					fprintf(stderr, "\n");
+				pg_log_info("checking heap table \"%s\".\"%s\".\"%s\"",
+							rel->datinfo->datname, rel->nspname, rel->relname);
+				progress_since_last_stderr = false;
+			}
+			prepare_heap_command(&sql, rel, free_slot->connection);
+			rel->sql = pstrdup(sql.data);	/* pg_free'd after command */
+			ParallelSlotSetHandler(free_slot, verify_heap_slot_handler, rel);
+			run_command(free_slot, rel->sql, &cparams);
+		}
+		else
+		{
+			if (opts.verbose)
+			{
+				if (opts.show_progress && progress_since_last_stderr)
+					fprintf(stderr, "\n");
+
+				pg_log_info("checking btree index \"%s\".\"%s\".\"%s\"",
+							rel->datinfo->datname, rel->nspname, rel->relname);
+				progress_since_last_stderr = false;
+			}
+			prepare_btree_command(&sql, rel, free_slot->connection);
+			rel->sql = pstrdup(sql.data);	/* pg_free'd after command */
+			ParallelSlotSetHandler(free_slot, verify_btree_slot_handler, rel);
+			run_command(free_slot, rel->sql, &cparams);
+		}
+	}
+	termPQExpBuffer(&sql);
+
+	if (!failed)
+	{
+
+		/*
+		 * Wait for all slots to complete, or for one to indicate that an
+		 * error occurred.  Like above, we rely on the handler emitting the
+		 * necessary error messages.
+		 */
+		if (sa && !ParallelSlotsWaitCompletion(sa))
+			failed = true;
+
+		progress_report(reltotal, relprogress, pagestotal, pageschecked, NULL, true, true);
+	}
+
+	if (sa)
+	{
+		ParallelSlotsTerminate(sa);
+		pg_free(sa);
+	}
+
+	if (failed)
+		exit(1);
+
+	if (!all_checks_pass)
+		exit(2);
+}
+
+/*
+ * prepare_heap_command
+ *
+ * Creates a SQL command for running amcheck checking on the given heap
+ * relation.  The command is phrased as a SQL query, with column order and
+ * names matching the expectations of verify_heap_slot_handler, which will
+ * receive and handle each row returned from the verify_heapam() function.
+ *
+ * sql: buffer into which the heap table checking command will be written
+ * rel: relation information for the heap table to be checked
+ * conn: the connection to be used, for string escaping purposes
+ */
+static void
+prepare_heap_command(PQExpBuffer sql, RelationInfo *rel, PGconn *conn)
+{
+	resetPQExpBuffer(sql);
+	appendPQExpBuffer(sql,
+					  "SELECT blkno, offnum, attnum, msg FROM %s.verify_heapam("
+					  "\nrelation := %u, on_error_stop := %s, check_toast := %s, skip := '%s'",
+					  rel->datinfo->amcheck_schema,
+					  rel->reloid,
+					  opts.on_error_stop ? "true" : "false",
+					  opts.reconcile_toast ? "true" : "false",
+					  opts.skip);
+
+	if (opts.startblock >= 0)
+		appendPQExpBuffer(sql, ", startblock := " INT64_FORMAT, opts.startblock);
+	if (opts.endblock >= 0)
+		appendPQExpBuffer(sql, ", endblock := " INT64_FORMAT, opts.endblock);
+
+	appendPQExpBuffer(sql, ")");
+}
+
+/*
+ * prepare_btree_command
+ *
+ * Creates a SQL command for running amcheck checking on the given btree index
+ * relation.  The command does not select any columns, as btree checking
+ * functions do not return any, but rather return corruption information by
+ * raising errors, which verify_btree_slot_handler expects.
+ *
+ * sql: buffer into which the btree index checking command will be written
+ * rel: relation information for the index to be checked
+ * conn: the connection to be used, for string escaping purposes
+ */
+static void
+prepare_btree_command(PQExpBuffer sql, RelationInfo *rel, PGconn *conn)
+{
+	resetPQExpBuffer(sql);
+
+	/*
+	 * Embed the database, schema, and relation name in the query, so if the
+	 * check throws an error, the user knows which relation the error came
+	 * from.
+	 */
+	if (opts.parent_check)
+		appendPQExpBuffer(sql,
+						  "SELECT * FROM %s.bt_index_parent_check("
+						  "index := '%u'::regclass, heapallindexed := %s, "
+						  "rootdescend := %s)",
+						  rel->datinfo->amcheck_schema,
+						  rel->reloid,
+						  (opts.heapallindexed ? "true" : "false"),
+						  (opts.rootdescend ? "true" : "false"));
+	else
+		appendPQExpBuffer(sql,
+						  "SELECT * FROM %s.bt_index_check("
+						  "index := '%u'::regclass, heapallindexed := %s)",
+						  rel->datinfo->amcheck_schema,
+						  rel->reloid,
+						  (opts.heapallindexed ? "true" : "false"));
+}
+
+/*
+ * run_command
+ *
+ * Sends a command to the server without waiting for the command to complete.
+ * Logs an error and exits if the command cannot be sent; otherwise any errors
+ * are expected to be handled by a ParallelSlotHandler.
+ *
+ * If reconnecting to the database is necessary, the cparams argument may be
+ * modified.
+ *
+ * slot: slot with connection to the server we should use for the command
+ * sql: query to send
+ * cparams: connection parameters in case the slot needs to be reconnected
+ */
+static void
+run_command(ParallelSlot *slot, const char *sql, ConnParams *cparams)
+{
+	if (opts.echo)
+		printf("%s\n", sql);
+
+	if (PQsendQuery(slot->connection, sql) == 0)
+	{
+		pg_log_error("error sending command to database \"%s\": %s",
+					 PQdb(slot->connection),
+					 PQerrorMessage(slot->connection));
+		pg_log_error("command was: %s", sql);
+		exit(1);
+	}
+}
+
+/*
+ * should_processing_continue
+ *
+ * Checks a query result returned from a query (presumably issued on a slot's
+ * connection) to determine if parallel slots should continue issuing further
+ * commands.
+ *
+ * Note: Heap relation corruption is reported by verify_heapam() via the result
+ * set, rather than an ERROR, but running verify_heapam() on a corrupted heap
+ * table may still result in an error being returned from the server due to
+ * missing relation files, bad checksums, etc.  The btree corruption checking
+ * functions always use errors to communicate corruption messages.  We can't
+ * just abort processing because we got a mere ERROR.
+ *
+ * res: result from an executed sql query
+ */
+static bool
+should_processing_continue(PGresult *res)
+{
+	const char *severity;
+
+	switch (PQresultStatus(res))
+	{
+			/* These are expected and ok */
+		case PGRES_COMMAND_OK:
+		case PGRES_TUPLES_OK:
+		case PGRES_NONFATAL_ERROR:
+			break;
+
+			/* This is expected but requires closer scrutiny */
+		case PGRES_FATAL_ERROR:
+			severity = PQresultErrorField(res, PG_DIAG_SEVERITY_NONLOCALIZED);
+			if (severity == NULL)
+				return false;	/* libpq failure, such as a lost connection */
+			if (strcmp(severity, "FATAL") == 0)
+				return false;
+			if (strcmp(severity, "PANIC") == 0)
+				return false;
+			break;
+
+			/* These are unexpected */
+		case PGRES_BAD_RESPONSE:
+		case PGRES_EMPTY_QUERY:
+		case PGRES_COPY_OUT:
+		case PGRES_COPY_IN:
+		case PGRES_COPY_BOTH:
+		case PGRES_SINGLE_TUPLE:
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Returns a copy of the argument string with all lines indented four spaces.
+ *
+ * The caller should pg_free the result when finished with it.
+ */
+static char *
+indent_lines(const char *str)
+{
+	PQExpBufferData buf;
+	const char *c;
+	char	   *result;
+
+	initPQExpBuffer(&buf);
+	appendPQExpBufferStr(&buf, "    ");
+	for (c = str; *c; c++)
+	{
+		appendPQExpBufferChar(&buf, *c);
+		if (c[0] == '\n' && c[1] != '\0')
+			appendPQExpBufferStr(&buf, "    ");
+	}
+	result = pstrdup(buf.data);
+	termPQExpBuffer(&buf);
+
+	return result;
+}
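
As a stand-alone sketch of the behavior indent_lines() implements (a four-space indent at the start of the string and after each interior newline), a version without libpq's PQExpBuffer might look like this; the function name and the plain-malloc buffer sizing are illustrative only:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Stand-alone sketch of indent_lines(): returns a malloc'd copy of str
 * with every line indented four spaces.  The patch uses PQExpBuffer;
 * this version sizes a worst-case buffer by hand for illustration.
 */
static char *
indent_lines_demo(const char *str)
{
	size_t		len = strlen(str);
	/* Worst case: lead indent, plus up to four extra bytes per input byte */
	char	   *result = malloc(4 + len * 5 + 1);
	char	   *out = result;
	const char *c;

	memcpy(out, "    ", 4);
	out += 4;
	for (c = str; *c; c++)
	{
		*out++ = *c;
		/* Indent the next line, but not after a trailing newline */
		if (c[0] == '\n' && c[1] != '\0')
		{
			memcpy(out, "    ", 4);
			out += 4;
		}
	}
	*out = '\0';
	return result;
}
```

As in the real function, a trailing newline does not pick up a dangling indent.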
+
+/*
+ * verify_heap_slot_handler
+ *
+ * ParallelSlotHandler that receives results from a heap table checking command
+ * created by prepare_heap_command and outputs the results for the user.
+ *
+ * res: result from an executed sql query
+ * conn: connection on which the sql query was executed
+ * context: the RelationInfo for the relation that was checked
+ */
+static bool
+verify_heap_slot_handler(PGresult *res, PGconn *conn, void *context)
+{
+	RelationInfo *rel = (RelationInfo *) context;
+
+	if (PQresultStatus(res) == PGRES_TUPLES_OK)
+	{
+		int			i;
+		int			ntups = PQntuples(res);
+
+		if (ntups > 0)
+			all_checks_pass = false;
+
+		for (i = 0; i < ntups; i++)
+		{
+			const char *msg;
+
+			/* The message string should never be null, but check */
+			if (PQgetisnull(res, i, 3))
+				msg = "NO MESSAGE";
+			else
+				msg = PQgetvalue(res, i, 3);
+
+			if (!PQgetisnull(res, i, 2))
+				printf("relation \"%s\".\"%s\".\"%s\", block %s, offset %s, attribute %s\n    %s\n",
+					   rel->datinfo->datname, rel->nspname, rel->relname,
+					   PQgetvalue(res, i, 0),	/* blkno */
+					   PQgetvalue(res, i, 1),	/* offnum */
+					   PQgetvalue(res, i, 2),	/* attnum */
+					   msg);
+
+			else if (!PQgetisnull(res, i, 1))
+				printf("relation \"%s\".\"%s\".\"%s\", block %s, offset %s\n    %s\n",
+					   rel->datinfo->datname, rel->nspname, rel->relname,
+					   PQgetvalue(res, i, 0),	/* blkno */
+					   PQgetvalue(res, i, 1),	/* offnum */
+					   msg);
+
+			else if (!PQgetisnull(res, i, 0))
+				printf("relation \"%s\".\"%s\".\"%s\", block %s\n    %s\n",
+					   rel->datinfo->datname, rel->nspname, rel->relname,
+					   PQgetvalue(res, i, 0),	/* blkno */
+					   msg);
+
+			else
+				printf("relation \"%s\".\"%s\".\"%s\"\n    %s\n",
+					   rel->datinfo->datname, rel->nspname, rel->relname, msg);
+		}
+	}
+	else
+	{
+		char	   *msg = indent_lines(PQerrorMessage(conn));
+
+		all_checks_pass = false;
+		printf("heap relation \"%s\".\"%s\".\"%s\":\n%s",
+			   rel->datinfo->datname, rel->nspname, rel->relname, msg);
+		if (opts.verbose)
+			printf("query was: %s\n", rel->sql);
+		pg_free(msg);
+	}
+
+	pg_free(rel->sql);
+	pg_free(rel->nspname);
+	pg_free(rel->relname);
+
+	return should_processing_continue(res);
+}
+
+/*
+ * verify_btree_slot_handler
+ *
+ * ParallelSlotHandler that receives results from a btree checking command
+ * created by prepare_btree_command and outputs them for the user.  The result
+ * set from the btree checking command is expected to be empty; when the
+ * command fails instead, the useful information about the corruption is
+ * expected in the connection's error message.
+ *
+ * res: result from an executed sql query
+ * conn: connection on which the sql query was executed
+ * context: the RelationInfo for the relation that was checked
+ */
+static bool
+verify_btree_slot_handler(PGresult *res, PGconn *conn, void *context)
+{
+	RelationInfo *rel = (RelationInfo *) context;
+
+	if (PQresultStatus(res) == PGRES_TUPLES_OK)
+	{
+		int			ntups = PQntuples(res);
+
+		if (ntups != 1)
+		{
+			/*
+			 * We expect the btree checking functions to return one void row
+			 * each, so we should output some sort of warning if we get
+			 * anything else, not because it indicates corruption, but because
+			 * it suggests a mismatch between amcheck and pg_amcheck versions.
+			 *
+			 * In conjunction with --progress, anything written to stderr at
+			 * this time would present strangely to the user without an extra
+			 * newline, so we print one.  If we were multithreaded, we'd have
+			 * to avoid splitting this across multiple calls, but we're in an
+			 * event loop, so it doesn't matter.
+			 */
+			if (opts.show_progress && progress_since_last_stderr)
+				fprintf(stderr, "\n");
+			pg_log_warning("btree relation \"%s\".\"%s\".\"%s\": btree checking function returned unexpected number of rows: %d",
+						   rel->datinfo->datname, rel->nspname, rel->relname, ntups);
+			if (opts.verbose)
+				pg_log_info("query was: %s", rel->sql);
+			pg_log_warning("are %s's and amcheck's versions compatible?",
+						   progname);
+			progress_since_last_stderr = false;
+		}
+	}
+	else
+	{
+		char	   *msg = indent_lines(PQerrorMessage(conn));
+
+		all_checks_pass = false;
+		printf("btree relation \"%s\".\"%s\".\"%s\":\n%s",
+			   rel->datinfo->datname, rel->nspname, rel->relname, msg);
+		if (opts.verbose)
+			printf("query was: %s\n", rel->sql);
+		pg_free(msg);
+	}
+
+	pg_free(rel->sql);
+	pg_free(rel->nspname);
+	pg_free(rel->relname);
+
+	return should_processing_continue(res);
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_amcheck"
+ */
+static void
+help(const char *progname)
+{
+	printf("%s uses the amcheck module to check objects in a PostgreSQL database for corruption.\n\n", progname);
+	printf("Usage:\n");
+	printf("  %s [OPTION]... [DBNAME]\n", progname);
+	printf("\nTarget Options:\n");
+	printf("  -a, --all                      check all databases\n");
+	printf("  -d, --database=PATTERN         check matching database(s)\n");
+	printf("  -D, --exclude-database=PATTERN do NOT check matching database(s)\n");
+	printf("  -i, --index=PATTERN            check matching index(es)\n");
+	printf("  -I, --exclude-index=PATTERN    do NOT check matching index(es)\n");
+	printf("  -r, --relation=PATTERN         check matching relation(s)\n");
+	printf("  -R, --exclude-relation=PATTERN do NOT check matching relation(s)\n");
+	printf("  -s, --schema=PATTERN           check matching schema(s)\n");
+	printf("  -S, --exclude-schema=PATTERN   do NOT check matching schema(s)\n");
+	printf("  -t, --table=PATTERN            check matching table(s)\n");
+	printf("  -T, --exclude-table=PATTERN    do NOT check matching table(s)\n");
+	printf("      --no-dependent-indexes     do NOT expand list of relations to include indexes\n");
+	printf("      --no-dependent-toast       do NOT expand list of relations to include toast\n");
+	printf("      --no-strict-names          do NOT require patterns to match objects\n");
+	printf("\nTable Checking Options:\n");
+	printf("      --exclude-toast-pointers   do NOT follow relation toast pointers\n");
+	printf("      --on-error-stop            stop checking at end of first corrupt page\n");
+	printf("      --skip=OPTION              do NOT check \"all-frozen\" or \"all-visible\" blocks\n");
+	printf("      --startblock=BLOCK         begin checking table(s) at the given block number\n");
+	printf("      --endblock=BLOCK           check table(s) only up to the given block number\n");
+	printf("\nBtree Index Checking Options:\n");
+	printf("      --heapallindexed           check all heap tuples are found within indexes\n");
+	printf("      --parent-check             check index parent/child relationships\n");
+	printf("      --rootdescend              search from root page to refind tuples\n");
+	printf("\nConnection options:\n");
+	printf("  -h, --host=HOSTNAME            database server host or socket directory\n");
+	printf("  -p, --port=PORT                database server port\n");
+	printf("  -U, --username=USERNAME        user name to connect as\n");
+	printf("  -w, --no-password              never prompt for password\n");
+	printf("  -W, --password                 force password prompt\n");
+	printf("      --maintenance-db=DBNAME    alternate maintenance database\n");
+	printf("\nOther Options:\n");
+	printf("  -e, --echo                     show the commands being sent to the server\n");
+	printf("  -j, --jobs=NUM                 use this many concurrent connections to the server\n");
+	printf("  -q, --quiet                    don't write any messages\n");
+	printf("  -v, --verbose                  write a lot of output\n");
+	printf("  -V, --version                  output version information, then exit\n");
+	printf("  -P, --progress                 show progress information\n");
+	printf("  -?, --help                     show this help, then exit\n");
+
+	printf("\nReport bugs to <%s>.\n", PACKAGE_BUGREPORT);
+	printf("%s home page: <%s>\n", PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Print a progress report based on the global variables.
+ *
+ * The progress report is written at most once per second, unless the force
+ * parameter is set to true.
+ *
+ * If finished is set to true, this is the last progress report and the cursor
+ * is moved to the next line.
+ */
+static void
+progress_report(uint64 relations_total, uint64 relations_checked,
+				uint64 relpages_total, uint64 relpages_checked,
+				const char *datname, bool force, bool finished)
+{
+	int			percent_rel = 0;
+	int			percent_pages = 0;
+	char		checked_rel[32];
+	char		total_rel[32];
+	char		checked_pages[32];
+	char		total_pages[32];
+	pg_time_t	now;
+
+	if (!opts.show_progress)
+		return;
+
+	now = time(NULL);
+	if (now == last_progress_report && !force && !finished)
+		return;					/* Max once per second */
+
+	last_progress_report = now;
+	if (relations_total)
+		percent_rel = (int) (relations_checked * 100 / relations_total);
+	if (relpages_total)
+		percent_pages = (int) (relpages_checked * 100 / relpages_total);
+
+	/*
+	 * Separate step to keep platform-dependent format code out of fprintf
+	 * calls.  We only test for INT64_FORMAT availability in snprintf, not
+	 * fprintf.
+	 */
+	snprintf(checked_rel, sizeof(checked_rel), INT64_FORMAT, relations_checked);
+	snprintf(total_rel, sizeof(total_rel), INT64_FORMAT, relations_total);
+	snprintf(checked_pages, sizeof(checked_pages), INT64_FORMAT, relpages_checked);
+	snprintf(total_pages, sizeof(total_pages), INT64_FORMAT, relpages_total);
+
+#define VERBOSE_DATNAME_LENGTH 35
+	if (opts.verbose)
+	{
+		if (!datname)
+
+			/*
+			 * No datname given, so clear the status line (used for first and
+			 * last call)
+			 */
+			fprintf(stderr,
+					"%*s/%s relations (%d%%) %*s/%s pages (%d%%) %*s",
+					(int) strlen(total_rel),
+					checked_rel, total_rel, percent_rel,
+					(int) strlen(total_pages),
+					checked_pages, total_pages, percent_pages,
+					VERBOSE_DATNAME_LENGTH + 2, "");
+		else
+		{
+			bool		truncate = (strlen(datname) > VERBOSE_DATNAME_LENGTH);
+
+			fprintf(stderr,
+					"%*s/%s relations (%d%%) %*s/%s pages (%d%%), (%s%-*.*s)",
+					(int) strlen(total_rel),
+					checked_rel, total_rel, percent_rel,
+					(int) strlen(total_pages),
+					checked_pages, total_pages, percent_pages,
+			/* Prefix with "..." if we do leading truncation */
+					truncate ? "..." : "",
+					truncate ? VERBOSE_DATNAME_LENGTH - 3 : VERBOSE_DATNAME_LENGTH,
+					truncate ? VERBOSE_DATNAME_LENGTH - 3 : VERBOSE_DATNAME_LENGTH,
+			/* Truncate datname at beginning if it's too long */
+					truncate ? datname + strlen(datname) - VERBOSE_DATNAME_LENGTH + 3 : datname);
+		}
+	}
+	else
+		fprintf(stderr,
+				"%*s/%s relations (%d%%) %*s/%s pages (%d%%)",
+				(int) strlen(total_rel),
+				checked_rel, total_rel, percent_rel,
+				(int) strlen(total_pages),
+				checked_pages, total_pages, percent_pages);
+
+	/*
+	 * Stay on the same line if reporting to a terminal and we're not done
+	 * yet.
+	 */
+	if (!finished && isatty(fileno(stderr)))
+	{
+		fputc('\r', stderr);
+		progress_since_last_stderr = true;
+	}
+	else
+		fputc('\n', stderr);
+}
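
The leading-truncation trick above (drop the front of an over-long database name and prefix "...") can be shown in isolation.  This sketch uses a hypothetical width of 10 instead of VERBOSE_DATNAME_LENGTH (35) so the effect is easy to see; the helper name is invented for illustration:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Sketch of the datname formatting in progress_report(): names longer
 * than the field width are truncated at the *beginning* and prefixed
 * with "...", so the most specific trailing part of the name survives.
 */
#define DEMO_WIDTH 10

static void
format_datname_demo(char *dst, size_t dstlen, const char *datname)
{
	int			truncate = strlen(datname) > DEMO_WIDTH;

	snprintf(dst, dstlen, "%s%-*.*s",
			 truncate ? "..." : "",
			 truncate ? DEMO_WIDTH - 3 : DEMO_WIDTH,
			 truncate ? DEMO_WIDTH - 3 : DEMO_WIDTH,
			 truncate ? datname + strlen(datname) - DEMO_WIDTH + 3 : datname);
}
```

Short names are padded right to a fixed width, so the progress line stays the same length either way.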
+
+/*
+ * Extend the pattern info array to hold one additional initialized pattern
+ * info entry.
+ *
+ * Returns a pointer to the new entry.
+ */
+static PatternInfo *
+extend_pattern_info_array(PatternInfoArray *pia)
+{
+	PatternInfo *result;
+
+	pia->len++;
+	pia->data = (PatternInfo *) pg_realloc(pia->data, pia->len * sizeof(PatternInfo));
+	result = &pia->data[pia->len - 1];
+	memset(result, 0, sizeof(*result));
+
+	return result;
+}
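
The grow-by-one idiom above can be sketched on its own.  This version uses a hypothetical IntArray instead of PatternInfoArray and plain realloc instead of pg_realloc (which additionally exits on out-of-memory), but the shape is the same: bump the length, reallocate, zero the new tail slot, return a pointer to it:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical analogue of PatternInfoArray, holding ints for brevity. */
typedef struct
{
	int		   *data;
	int			len;
} IntArray;

/*
 * Grow the array by one element, zero-initialize the new slot, and
 * return a pointer to it, mirroring extend_pattern_info_array().
 */
static int *
extend_int_array(IntArray *arr)
{
	int		   *result;

	arr->len++;
	arr->data = realloc(arr->data, arr->len * sizeof(int));
	result = &arr->data[arr->len - 1];
	memset(result, 0, sizeof(*result));
	return result;
}
```
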
+
+/*
+ * append_database_pattern
+ *
+ * Adds the given pattern interpreted as a database name pattern.
+ *
+ * pia: the pattern info array to be appended
+ * pattern: the database name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_database_pattern(PatternInfoArray *pia, const char *pattern, int encoding)
+{
+	PQExpBufferData buf;
+	PatternInfo *info = extend_pattern_info_array(pia);
+
+	initPQExpBuffer(&buf);
+	patternToSQLRegex(encoding, NULL, NULL, &buf, pattern, false);
+	info->pattern = pattern;
+	info->db_regex = pstrdup(buf.data);
+
+	termPQExpBuffer(&buf);
+}
+
+/*
+ * append_schema_pattern
+ *
+ * Adds the given pattern interpreted as a schema name pattern.
+ *
+ * pia: the pattern info array to be appended
+ * pattern: the schema name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_schema_pattern(PatternInfoArray *pia, const char *pattern, int encoding)
+{
+	PQExpBufferData buf;
+	PatternInfo *info = extend_pattern_info_array(pia);
+
+	initPQExpBuffer(&buf);
+	patternToSQLRegex(encoding, NULL, NULL, &buf, pattern, false);
+	info->pattern = pattern;
+	info->nsp_regex = pstrdup(buf.data);
+	termPQExpBuffer(&buf);
+}
+
+/*
+ * append_relation_pattern_helper
+ *
+ * Adds to a list the given pattern interpreted as a relation pattern.
+ *
+ * pia: the pattern info array to be appended
+ * pattern: the relation name pattern
+ * encoding: client encoding for parsing the pattern
+ * heap_only: whether the pattern should only be matched against heap tables
+ * btree_only: whether the pattern should only be matched against btree indexes
+ */
+static void
+append_relation_pattern_helper(PatternInfoArray *pia, const char *pattern,
+							   int encoding, bool heap_only, bool btree_only)
+{
+	PQExpBufferData dbbuf;
+	PQExpBufferData nspbuf;
+	PQExpBufferData relbuf;
+	PatternInfo *info = extend_pattern_info_array(pia);
+
+	initPQExpBuffer(&dbbuf);
+	initPQExpBuffer(&nspbuf);
+	initPQExpBuffer(&relbuf);
+
+	patternToSQLRegex(encoding, &dbbuf, &nspbuf, &relbuf, pattern, false);
+	info->pattern = pattern;
+	if (dbbuf.data[0])
+		info->db_regex = pstrdup(dbbuf.data);
+	if (nspbuf.data[0])
+		info->nsp_regex = pstrdup(nspbuf.data);
+	if (relbuf.data[0])
+		info->rel_regex = pstrdup(relbuf.data);
+
+	termPQExpBuffer(&dbbuf);
+	termPQExpBuffer(&nspbuf);
+	termPQExpBuffer(&relbuf);
+
+	info->heap_only = heap_only;
+	info->btree_only = btree_only;
+}
+
+/*
+ * append_relation_pattern
+ *
+ * Adds the given pattern interpreted as a relation pattern, to be matched
+ * against both heap tables and btree indexes.
+ *
+ * pia: the pattern info array to be appended
+ * pattern: the relation name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_relation_pattern(PatternInfoArray *pia, const char *pattern, int encoding)
+{
+	append_relation_pattern_helper(pia, pattern, encoding, false, false);
+}
+
+/*
+ * append_heap_pattern
+ *
+ * Adds the given pattern interpreted as a relation pattern, to be matched only
+ * against heap tables.
+ *
+ * pia: the pattern info array to be appended
+ * pattern: the relation name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_heap_pattern(PatternInfoArray *pia, const char *pattern, int encoding)
+{
+	append_relation_pattern_helper(pia, pattern, encoding, true, false);
+}
+
+/*
+ * append_btree_pattern
+ *
+ * Adds the given pattern interpreted as a relation pattern, to be matched only
+ * against btree indexes.
+ *
+ * pia: the pattern info array to be appended
+ * pattern: the relation name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_btree_pattern(PatternInfoArray *pia, const char *pattern, int encoding)
+{
+	append_relation_pattern_helper(pia, pattern, encoding, false, true);
+}
+
+/*
+ * append_db_pattern_cte
+ *
+ * Appends to the buffer the body of a Common Table Expression (CTE) containing
+ * the database portions filtered from the list of patterns expressed as two
+ * columns:
+ *
+ *     pattern_id: the index of this pattern in pia->data[]
+ *     rgx: the database regular expression parsed from the pattern
+ *
+ * Patterns without a database portion are skipped.  Patterns with more than
+ * just a database portion are optionally skipped, depending on argument
+ * 'inclusive'.
+ *
+ * buf: the buffer to be appended
+ * pia: the array of patterns to be inserted into the CTE
+ * conn: the database connection
+ * inclusive: whether to include patterns with schema and/or relation parts
+ *
+ * Returns whether any database patterns were appended.
+ */
+static bool
+append_db_pattern_cte(PQExpBuffer buf, const PatternInfoArray *pia,
+					  PGconn *conn, bool inclusive)
+{
+	int			pattern_id;
+	const char *comma;
+	bool		have_values;
+
+	comma = "";
+	have_values = false;
+	for (pattern_id = 0; pattern_id < pia->len; pattern_id++)
+	{
+		PatternInfo *info = &pia->data[pattern_id];
+
+		if (info->db_regex != NULL &&
+			(inclusive || (info->nsp_regex == NULL && info->rel_regex == NULL)))
+		{
+			if (!have_values)
+				appendPQExpBufferStr(buf, "\nVALUES");
+			have_values = true;
+			appendPQExpBuffer(buf, "%s\n(%d, ", comma, pattern_id);
+			appendStringLiteralConn(buf, info->db_regex, conn);
+			appendPQExpBufferStr(buf, ")");
+			comma = ",";
+		}
+	}
+
+	if (!have_values)
+		appendPQExpBufferStr(buf, "\nSELECT NULL, NULL WHERE false");
+
+	return have_values;
+}
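
To make the generated SQL concrete: each pattern with a database part becomes one "(pattern_id, 'rgx')" row in a VALUES list.  This sketch builds the same shape without libpq; the regexes are assumed already escaped (the real code must run them through appendStringLiteralConn), and the function name is invented:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Sketch of the VALUES-list construction in append_db_pattern_cte():
 * pair each pattern's array index with its database regex.  Literal
 * escaping is intentionally omitted here.
 */
static void
build_values_demo(char *buf, size_t buflen,
				  const char *const *regexes, int n)
{
	const char *comma = "";
	size_t		used = 0;
	int			i;

	used += snprintf(buf + used, buflen - used, "VALUES");
	for (i = 0; i < n; i++)
	{
		used += snprintf(buf + used, buflen - used, "%s (%d, '%s')",
						 comma, i, regexes[i]);
		comma = ",";
	}
}
```

The pattern_id column lets the outer query report which inclusion pattern failed to match anything, as compile_database_list() does below.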
+
+/*
+ * compile_database_list
+ *
+ * If any database patterns exist, or if --all was given, compiles a distinct
+ * list of databases to check using a SQL query based on the patterns plus the
+ * literal initial database name, if given.  If no database patterns exist and
+ * --all was not given, the query is not necessary, and only the initial
+ * database name (if any) is added to the list.
+ *
+ * conn: connection to the initial database
+ * databases: the list onto which databases should be appended
+ * initial_dbname: an optional extra database name to include in the list
+ */
+static void
+compile_database_list(PGconn *conn, SimplePtrList *databases,
+					  const char *initial_dbname)
+{
+	PGresult   *res;
+	PQExpBufferData sql;
+	int			ntups;
+	int			i;
+	bool		fatal;
+
+	if (initial_dbname)
+	{
+		DatabaseInfo *dat = (DatabaseInfo *) pg_malloc0(sizeof(DatabaseInfo));
+
+		/* This database is included.  Add to list */
+		if (opts.verbose)
+			pg_log_info("including database: \"%s\"", initial_dbname);
+
+		dat->datname = pstrdup(initial_dbname);
+		simple_ptr_list_append(databases, dat);
+	}
+
+	initPQExpBuffer(&sql);
+
+	/* Append the include patterns CTE. */
+	appendPQExpBufferStr(&sql, "WITH include_raw (pattern_id, rgx) AS (");
+	if (!append_db_pattern_cte(&sql, &opts.include, conn, true) &&
+		!opts.alldb)
+	{
+		/*
+		 * None of the inclusion patterns (if any) contain database portions,
+		 * so there is no need to query the database to resolve database
+		 * patterns.
+		 *
+		 * Since we're also not operating under --all, we don't need to query
+		 * the exhaustive list of connectable databases, either.
+		 */
+		termPQExpBuffer(&sql);
+		return;
+	}
+
+	/* Append the exclude patterns CTE. */
+	appendPQExpBufferStr(&sql, "),\nexclude_raw (pattern_id, rgx) AS (");
+	append_db_pattern_cte(&sql, &opts.exclude, conn, false);
+	appendPQExpBufferStr(&sql, "),");
+
+	/*
+	 * Append the database CTE, which includes whether each database is
+	 * connectable and also joins against exclude_raw to determine whether
+	 * each database is excluded.
+	 */
+	appendPQExpBufferStr(&sql,
+						 "\ndatabase (datname) AS ("
+						 "\nSELECT d.datname "
+						 "FROM pg_catalog.pg_database d "
+						 "LEFT OUTER JOIN exclude_raw e "
+						 "ON d.datname ~ e.rgx "
+						 "\nWHERE d.datallowconn "
+						 "AND e.pattern_id IS NULL"
+						 "),"
+
+	/*
+	 * Append the include_pat CTE, which joins the include_raw CTE against the
+	 * databases CTE to determine if all the inclusion patterns had matches,
+	 * and whether each matched pattern had the misfortune of only matching
+	 * excluded or unconnectable databases.
+	 */
+						 "\ninclude_pat (pattern_id, checkable) AS ("
+						 "\nSELECT i.pattern_id, "
+						 "COUNT(*) FILTER ("
+						 "WHERE d IS NOT NULL"
+						 ") AS checkable"
+						 "\nFROM include_raw i "
+						 "LEFT OUTER JOIN database d "
+						 "ON d.datname ~ i.rgx"
+						 "\nGROUP BY i.pattern_id"
+						 "),"
+
+	/*
+	 * Append the filtered_databases CTE, which selects from the database CTE
+	 * optionally joined against the include_raw CTE to only select databases
+	 * that match an inclusion pattern.  This appears to duplicate what the
+	 * include_pat CTE already did above, but here we want only databases, and
+	 * there we wanted patterns.
+	 */
+						 "\nfiltered_databases (datname) AS ("
+						 "\nSELECT DISTINCT d.datname "
+						 "FROM database d");
+	if (!opts.alldb)
+		appendPQExpBufferStr(&sql,
+							 " INNER JOIN include_raw i "
+							 "ON d.datname ~ i.rgx");
+	appendPQExpBufferStr(&sql,
+						 ")"
+
+	/*
+	 * Select the checkable databases and the unmatched inclusion patterns.
+	 */
+						 "\nSELECT pattern_id, datname FROM ("
+						 "\nSELECT pattern_id, NULL::TEXT AS datname "
+						 "FROM include_pat "
+						 "WHERE checkable = 0 "
+						 "UNION ALL"
+						 "\nSELECT NULL, datname "
+						 "FROM filtered_databases"
+						 ") AS combined_records"
+						 "\nORDER BY pattern_id NULLS LAST, datname");
+
+	res = executeQuery(conn, sql.data, opts.echo);
+	if (PQresultStatus(res) != PGRES_TUPLES_OK)
+	{
+		pg_log_error("query failed: %s", PQerrorMessage(conn));
+		pg_log_info("query was: %s", sql.data);
+		disconnectDatabase(conn);
+		exit(1);
+	}
+	termPQExpBuffer(&sql);
+
+	ntups = PQntuples(res);
+	for (fatal = false, i = 0; i < ntups; i++)
+	{
+		int			pattern_id = -1;
+		const char *datname = NULL;
+
+		if (!PQgetisnull(res, i, 0))
+			pattern_id = atoi(PQgetvalue(res, i, 0));
+		if (!PQgetisnull(res, i, 1))
+			datname = PQgetvalue(res, i, 1);
+
+		if (pattern_id >= 0)
+		{
+			/*
+			 * Current record pertains to an inclusion pattern that matched no
+			 * checkable databases.
+			 */
+			fatal = opts.strict_names;
+			if (pattern_id >= opts.include.len)
+			{
+				pg_log_error("internal error: received unexpected database pattern_id %d",
+							 pattern_id);
+				exit(1);
+			}
+			log_no_match("no connectable databases to check matching \"%s\"",
+						 opts.include.data[pattern_id].pattern);
+		}
+		else
+		{
+			DatabaseInfo *dat;
+
+			/* Current record pertains to a database */
+			Assert(datname != NULL);
+
+			/* Avoid entering a duplicate entry matching the initial_dbname */
+			if (initial_dbname != NULL && strcmp(initial_dbname, datname) == 0)
+				continue;
+
+			/* This database is included.  Add to list */
+			if (opts.verbose)
+				pg_log_info("including database: \"%s\"", datname);
+
+			dat = (DatabaseInfo *) pg_malloc0(sizeof(DatabaseInfo));
+			dat->datname = pstrdup(datname);
+			simple_ptr_list_append(databases, dat);
+		}
+	}
+	PQclear(res);
+
+	if (fatal)
+	{
+		disconnectDatabase(conn);
+		exit(1);
+	}
+}
+
+/*
+ * append_rel_pattern_raw_cte
+ *
+ * Appends to the buffer the body of a Common Table Expression (CTE) containing
+ * the given patterns as six columns:
+ *
+ *     pattern_id: the index of this pattern in pia->data[]
+ *     db_regex: the database regexp parsed from the pattern, or NULL if the
+ *               pattern had no database part
+ *     nsp_regex: the namespace regexp parsed from the pattern, or NULL if the
+ *                pattern had no namespace part
+ *     rel_regex: the relname regexp parsed from the pattern, or NULL if the
+ *                pattern had no relname part
+ *     heap_only: true if the pattern applies only to heap tables (not indexes)
+ *     btree_only: true if the pattern applies only to btree indexes (not tables)
+ *
+ * buf: the buffer to be appended
+ * pia: the array of patterns to be inserted into the CTE
+ * conn: the database connection
+ */
+static void
+append_rel_pattern_raw_cte(PQExpBuffer buf, const PatternInfoArray *pia,
+						   PGconn *conn)
+{
+	int			pattern_id;
+	const char *comma;
+	bool		have_values;
+
+	comma = "";
+	have_values = false;
+	for (pattern_id = 0; pattern_id < pia->len; pattern_id++)
+	{
+		PatternInfo *info = &pia->data[pattern_id];
+
+		if (!have_values)
+			appendPQExpBufferStr(buf, "\nVALUES");
+		have_values = true;
+		appendPQExpBuffer(buf, "%s\n(%d::INTEGER, ", comma, pattern_id);
+		if (info->db_regex == NULL)
+			appendPQExpBufferStr(buf, "NULL");
+		else
+			appendStringLiteralConn(buf, info->db_regex, conn);
+		appendPQExpBufferStr(buf, "::TEXT, ");
+		if (info->nsp_regex == NULL)
+			appendPQExpBufferStr(buf, "NULL");
+		else
+			appendStringLiteralConn(buf, info->nsp_regex, conn);
+		appendPQExpBufferStr(buf, "::TEXT, ");
+		if (info->rel_regex == NULL)
+			appendPQExpBufferStr(buf, "NULL");
+		else
+			appendStringLiteralConn(buf, info->rel_regex, conn);
+		if (info->heap_only)
+			appendPQExpBufferStr(buf, "::TEXT, true::BOOLEAN");
+		else
+			appendPQExpBufferStr(buf, "::TEXT, false::BOOLEAN");
+		if (info->btree_only)
+			appendPQExpBufferStr(buf, ", true::BOOLEAN");
+		else
+			appendPQExpBufferStr(buf, ", false::BOOLEAN");
+		appendPQExpBufferStr(buf, ")");
+		comma = ",";
+	}
+
+	if (!have_values)
+		appendPQExpBufferStr(buf,
+							 "\nSELECT NULL::INTEGER, NULL::TEXT, NULL::TEXT, "
+							 "NULL::TEXT, NULL::BOOLEAN, NULL::BOOLEAN "
+							 "WHERE false");
+}
+
+/*
+ * append_rel_pattern_filtered_cte
+ *
+ * Appends to the buffer a Common Table Expression (CTE) which selects
+ * all patterns from the named raw CTE, filtered by database.  All patterns
+ * which have no database portion or whose database portion matches our
+ * connection's database name are selected, with other patterns excluded.
+ *
+ * The basic idea here is that if we're connected to database "foo" and we have
+ * patterns "foo.bar.baz", "alpha.beta" and "one.two.three", we only want to
+ * use the first two while processing relations in this database, as the third
+ * one is not relevant.
+ *
+ * buf: the buffer to be appended
+ * raw: the name of the CTE to select from
+ * filtered: the name of the CTE to create
+ * conn: the database connection
+ */
+static void
+append_rel_pattern_filtered_cte(PQExpBuffer buf, const char *raw,
+								const char *filtered, PGconn *conn)
+{
+	appendPQExpBuffer(buf,
+					  "\n%s (pattern_id, nsp_regex, rel_regex, heap_only, btree_only) AS ("
+					  "\nSELECT pattern_id, nsp_regex, rel_regex, heap_only, btree_only "
+					  "FROM %s r"
+					  "\nWHERE (r.db_regex IS NULL "
+					  "OR ",
+					  filtered, raw);
+	appendStringLiteralConn(buf, PQdb(conn), conn);
+	appendPQExpBufferStr(buf, " ~ r.db_regex)");
+	appendPQExpBufferStr(buf,
+						 " AND (r.nsp_regex IS NOT NULL"
+						 " OR r.rel_regex IS NOT NULL)"
+						 "),");
+}
+
+/*
+ * compile_relation_list_one_db
+ *
+ * Compiles a list of relations to check within the currently connected
+ * database based on the user supplied options, sorted by descending size,
+ * and appends them to the given list of relations.
+ *
+ * The cells of the constructed list contain all information about the relation
+ * necessary to connect to the database and check the object, including which
+ * database to connect to, where contrib/amcheck is installed, and the Oid and
+ * type of object (heap table vs. btree index).  Rather than duplicating the
+ * database details per relation, the relation structs use references to the
+ * same database object, provided by the caller.
+ *
+ * conn: connection to the database to be checked; must match 'dat'
+ * relations: list onto which the relation information should be appended
+ * dat: the database info struct for use by each relation
+ * pagecount: incremented by the total number of blocks to be checked in the
+ * relations added
+ */
+static void
+compile_relation_list_one_db(PGconn *conn, SimplePtrList *relations,
+							 const DatabaseInfo *dat,
+							 uint64 *pagecount)
+{
+	PGresult   *res;
+	PQExpBufferData sql;
+	int			ntups;
+	int			i;
+
+	initPQExpBuffer(&sql);
+	appendPQExpBufferStr(&sql, "WITH");
+
+	/* Append CTEs for the relation inclusion patterns, if any */
+	if (!opts.allrel)
+	{
+		appendPQExpBufferStr(&sql,
+							 " include_raw (pattern_id, db_regex, nsp_regex, rel_regex, heap_only, btree_only) AS (");
+		append_rel_pattern_raw_cte(&sql, &opts.include, conn);
+		appendPQExpBufferStr(&sql, "\n),");
+		append_rel_pattern_filtered_cte(&sql, "include_raw", "include_pat", conn);
+	}
+
+	/* Append CTEs for the relation exclusion patterns, if any */
+	if (opts.excludetbl || opts.excludeidx || opts.excludensp)
+	{
+		appendPQExpBufferStr(&sql,
+							 " exclude_raw (pattern_id, db_regex, nsp_regex, rel_regex, heap_only, btree_only) AS (");
+		append_rel_pattern_raw_cte(&sql, &opts.exclude, conn);
+		appendPQExpBufferStr(&sql, "\n),");
+		append_rel_pattern_filtered_cte(&sql, "exclude_raw", "exclude_pat", conn);
+	}
+
+	/* Append the relation CTE. */
+	appendPQExpBufferStr(&sql,
+						 " relation (pattern_id, oid, nspname, relname, reltoastrelid, relpages, is_heap, is_btree) AS ("
+						 "\nSELECT DISTINCT ON (c.oid");
+	if (!opts.allrel)
+		appendPQExpBufferStr(&sql, ", ip.pattern_id) ip.pattern_id,");
+	else
+		appendPQExpBufferStr(&sql, ") NULL::INTEGER AS pattern_id,");
+	appendPQExpBuffer(&sql,
+					  "\nc.oid, n.nspname, c.relname, c.reltoastrelid, c.relpages, "
+					  "c.relam = %u AS is_heap, "
+					  "c.relam = %u AS is_btree"
+					  "\nFROM pg_catalog.pg_class c "
+					  "INNER JOIN pg_catalog.pg_namespace n "
+					  "ON c.relnamespace = n.oid",
+					  HEAP_TABLE_AM_OID, BTREE_AM_OID);
+	if (!opts.allrel)
+		appendPQExpBuffer(&sql,
+						  "\nINNER JOIN include_pat ip"
+						  "\nON (n.nspname ~ ip.nsp_regex OR ip.nsp_regex IS NULL)"
+						  "\nAND (c.relname ~ ip.rel_regex OR ip.rel_regex IS NULL)"
+						  "\nAND (c.relam = %u OR NOT ip.heap_only)"
+						  "\nAND (c.relam = %u OR NOT ip.btree_only)",
+						  HEAP_TABLE_AM_OID, BTREE_AM_OID);
+	if (opts.excludetbl || opts.excludeidx || opts.excludensp)
+		appendPQExpBuffer(&sql,
+						  "\nLEFT OUTER JOIN exclude_pat ep"
+						  "\nON (n.nspname ~ ep.nsp_regex OR ep.nsp_regex IS NULL)"
+						  "\nAND (c.relname ~ ep.rel_regex OR ep.rel_regex IS NULL)"
+						  "\nAND (c.relam = %u OR NOT ep.heap_only)"
+						  "\nAND (c.relam = %u OR NOT ep.btree_only)",
+						  HEAP_TABLE_AM_OID, BTREE_AM_OID);
+
+	if (opts.excludetbl || opts.excludeidx || opts.excludensp)
+		appendPQExpBufferStr(&sql, "\nWHERE ep.pattern_id IS NULL");
+	else
+		appendPQExpBufferStr(&sql, "\nWHERE true");
+
+	/*
+	 * We need to be careful not to break the --no-dependent-toast and
+	 * --no-dependent-indexes options.  By default, the btree indexes, toast
+	 * tables, and toast table btree indexes associated with primary heap
+	 * tables are included, using their own CTEs below.  We implement the
+	 * --no-dependent-* options by not creating those CTEs, but that's no use if
+	 * we've already selected the toast and indexes here.  On the other hand,
+	 * we want inclusion patterns that match indexes or toast tables to be
+	 * honored.  So, if inclusion patterns were given, we want to select all
+	 * tables, toast tables, or indexes that match the patterns.  But if no
+	 * inclusion patterns were given, and we're simply matching all relations,
+	 * then we only want to match the primary tables here.
+	 */
+	if (opts.allrel)
+		appendPQExpBuffer(&sql,
+						  " AND c.relam = %u "
+						  "AND c.relkind IN ('r', 'm', 't') "
+						  "AND c.relnamespace != %u",
+						  HEAP_TABLE_AM_OID, PG_TOAST_NAMESPACE);
+	else
+		appendPQExpBuffer(&sql,
+						  " AND c.relam IN (%u, %u)"
+						  "AND c.relkind IN ('r', 'm', 't', 'i') "
+						  "AND ((c.relam = %u AND c.relkind IN ('r', 'm', 't')) OR "
+						  "(c.relam = %u AND c.relkind = 'i'))",
+						  HEAP_TABLE_AM_OID, BTREE_AM_OID,
+						  HEAP_TABLE_AM_OID, BTREE_AM_OID);
+
+	appendPQExpBufferStr(&sql,
+						 "\nORDER BY c.oid)");
+
+	if (!opts.no_toast_expansion)
+	{
+		/*
+		 * Include a CTE for toast tables associated with primary heap tables
+		 * selected above, filtering by exclusion patterns (if any) that match
+		 * toast table names.
+		 */
+		appendPQExpBufferStr(&sql,
+							 ", toast (oid, nspname, relname, relpages) AS ("
+							 "\nSELECT t.oid, 'pg_toast', t.relname, t.relpages"
+							 "\nFROM pg_catalog.pg_class t "
+							 "INNER JOIN relation r "
+							 "ON r.reltoastrelid = t.oid");
+		if (opts.excludetbl || opts.excludensp)
+			appendPQExpBufferStr(&sql,
+								 "\nLEFT OUTER JOIN exclude_pat ep"
+								 "\nON ('pg_toast' ~ ep.nsp_regex OR ep.nsp_regex IS NULL)"
+								 "\nAND (t.relname ~ ep.rel_regex OR ep.rel_regex IS NULL)"
+								 "\nAND ep.heap_only"
+								 "\nWHERE ep.pattern_id IS NULL");
+		appendPQExpBufferStr(&sql,
+							 "\n)");
+	}
+	if (!opts.no_btree_expansion)
+	{
+		/*
+		 * Include a CTE for btree indexes associated with primary heap tables
+		 * selected above, filtering by exclusion patterns (if any) that match
+		 * btree index names.
+		 */
+		appendPQExpBufferStr(&sql,
+							 ", index (oid, nspname, relname, relpages) AS ("
+							 "\nSELECT c.oid, r.nspname, c.relname, c.relpages "
+							 "FROM relation r"
+							 "\nINNER JOIN pg_catalog.pg_index i "
+							 "ON r.oid = i.indrelid "
+							 "INNER JOIN pg_catalog.pg_class c "
+							 "ON i.indexrelid = c.oid");
+		if (opts.excludeidx || opts.excludensp)
+			appendPQExpBufferStr(&sql,
+								 "\nINNER JOIN pg_catalog.pg_namespace n "
+								 "ON c.relnamespace = n.oid"
+								 "\nLEFT OUTER JOIN exclude_pat ep "
+								 "ON (n.nspname ~ ep.nsp_regex OR ep.nsp_regex IS NULL) "
+								 "AND (c.relname ~ ep.rel_regex OR ep.rel_regex IS NULL) "
+								 "AND ep.btree_only"
+								 "\nWHERE ep.pattern_id IS NULL");
+		else
+			appendPQExpBufferStr(&sql,
+								 "\nWHERE true");
+		appendPQExpBuffer(&sql,
+						  " AND c.relam = %u "
+						  "AND c.relkind = 'i'",
+						  BTREE_AM_OID);
+		if (opts.no_toast_expansion)
+			appendPQExpBuffer(&sql,
+							  " AND c.relnamespace != %u",
+							  PG_TOAST_NAMESPACE);
+		appendPQExpBufferStr(&sql, "\n)");
+	}
+
+	if (!opts.no_toast_expansion && !opts.no_btree_expansion)
+	{
+		/*
+		 * Include a CTE for btree indexes associated with toast tables of
+		 * primary heap tables selected above, filtering by exclusion patterns
+		 * (if any) that match the toast index names.
+		 */
+		appendPQExpBufferStr(&sql,
+							 ", toast_index (oid, nspname, relname, relpages) AS ("
+							 "\nSELECT c.oid, 'pg_toast', c.relname, c.relpages "
+							 "FROM toast t "
+							 "INNER JOIN pg_catalog.pg_index i "
+							 "ON t.oid = i.indrelid"
+							 "\nINNER JOIN pg_catalog.pg_class c "
+							 "ON i.indexrelid = c.oid");
+		if (opts.excludeidx)
+			appendPQExpBufferStr(&sql,
+								 "\nLEFT OUTER JOIN exclude_pat ep "
+								 "ON ('pg_toast' ~ ep.nsp_regex OR ep.nsp_regex IS NULL) "
+								 "AND (c.relname ~ ep.rel_regex OR ep.rel_regex IS NULL) "
+								 "AND ep.btree_only "
+								 "WHERE ep.pattern_id IS NULL");
+		else
+			appendPQExpBufferStr(&sql,
+								 "\nWHERE true");
+		appendPQExpBuffer(&sql,
+						  " AND c.relam = %u"
+						  " AND c.relkind = 'i')",
+						  BTREE_AM_OID);
+	}
+
+	/*
+	 * Roll-up distinct rows from CTEs.
+	 *
+	 * Relations that match more than one pattern may occur more than once in
+	 * the list, and indexes and toast for primary relations may also have
+	 * matched in their own right, so we rely on UNION to deduplicate the
+	 * list.
+	 */
+	appendPQExpBufferStr(&sql,
+						 "\nSELECT pattern_id, is_heap, is_btree, oid, nspname, relname, relpages "
+						 "FROM (");
+	appendPQExpBufferStr(&sql,
+	/* Rows reporting which inclusion patterns matched */
+						 "\nSELECT pattern_id, is_heap, is_btree, "
+						 "NULL::OID AS oid, "
+						 "NULL::TEXT AS nspname, "
+						 "NULL::TEXT AS relname, "
+						 "NULL::INTEGER AS relpages"
+						 "\nFROM relation "
+						 "WHERE pattern_id IS NOT NULL "
+						 "UNION"
+	/* Primary relations */
+						 "\nSELECT NULL::INTEGER AS pattern_id, "
+						 "is_heap, is_btree, oid, nspname, relname, relpages "
+						 "FROM relation");
+	if (!opts.no_toast_expansion)
+		appendPQExpBufferStr(&sql,
+							 " UNION"
+		/* Toast tables for primary relations */
+							 "\nSELECT NULL::INTEGER AS pattern_id, TRUE AS is_heap, "
+							 "FALSE AS is_btree, oid, nspname, relname, relpages "
+							 "FROM toast");
+	if (!opts.no_btree_expansion)
+		appendPQExpBufferStr(&sql,
+							 " UNION"
+		/* Indexes for primary relations */
+							 "\nSELECT NULL::INTEGER AS pattern_id, FALSE AS is_heap, "
+							 "TRUE AS is_btree, oid, nspname, relname, relpages "
+							 "FROM index");
+	if (!opts.no_toast_expansion && !opts.no_btree_expansion)
+		appendPQExpBufferStr(&sql,
+							 " UNION"
+		/* Indexes for toast relations */
+							 "\nSELECT NULL::INTEGER AS pattern_id, FALSE AS is_heap, "
+							 "TRUE AS is_btree, oid, nspname, relname, relpages "
+							 "FROM toast_index");
+	appendPQExpBufferStr(&sql,
+						 "\n) AS combined_records "
+						 "ORDER BY relpages DESC NULLS FIRST, oid");
+
+	res = executeQuery(conn, sql.data, opts.echo);
+	if (PQresultStatus(res) != PGRES_TUPLES_OK)
+	{
+		pg_log_error("query failed: %s", PQerrorMessage(conn));
+		pg_log_info("query was: %s", sql.data);
+		disconnectDatabase(conn);
+		exit(1);
+	}
+	termPQExpBuffer(&sql);
+
+	ntups = PQntuples(res);
+	for (i = 0; i < ntups; i++)
+	{
+		int			pattern_id = -1;
+		bool		is_heap = false;
+		bool		is_btree = false;
+		Oid			oid = InvalidOid;
+		const char *nspname = NULL;
+		const char *relname = NULL;
+		int			relpages = 0;
+
+		if (!PQgetisnull(res, i, 0))
+			pattern_id = atoi(PQgetvalue(res, i, 0));
+		if (!PQgetisnull(res, i, 1))
+			is_heap = (PQgetvalue(res, i, 1)[0] == 't');
+		if (!PQgetisnull(res, i, 2))
+			is_btree = (PQgetvalue(res, i, 2)[0] == 't');
+		if (!PQgetisnull(res, i, 3))
+			oid = atooid(PQgetvalue(res, i, 3));
+		if (!PQgetisnull(res, i, 4))
+			nspname = PQgetvalue(res, i, 4);
+		if (!PQgetisnull(res, i, 5))
+			relname = PQgetvalue(res, i, 5);
+		if (!PQgetisnull(res, i, 6))
+			relpages = atoi(PQgetvalue(res, i, 6));
+
+		if (pattern_id >= 0)
+		{
+			/*
+			 * Current record pertains to an inclusion pattern.  Record that
+			 * it matched.
+			 */
+
+			if (pattern_id >= opts.include.len)
+			{
+				pg_log_error("internal error: received unexpected relation pattern_id %d",
+							 pattern_id);
+				exit(1);
+			}
+
+			opts.include.data[pattern_id].matched = true;
+		}
+		else
+		{
+			/* Current record pertains to a relation */
+
+			RelationInfo *rel = (RelationInfo *) pg_malloc0(sizeof(RelationInfo));
+
+			Assert(OidIsValid(oid));
+			Assert((is_heap && !is_btree) || (is_btree && !is_heap));
+
+			rel->datinfo = dat;
+			rel->reloid = oid;
+			rel->is_heap = is_heap;
+			rel->nspname = pstrdup(nspname);
+			rel->relname = pstrdup(relname);
+			rel->relpages = relpages;
+			rel->blocks_to_check = relpages;
+			if (is_heap && (opts.startblock >= 0 || opts.endblock >= 0))
+			{
+				/*
+				 * We apply --startblock and --endblock to heap tables, but
+				 * not btree indexes, and for progress purposes we need to
+				 * track how many blocks we expect to check.
+				 */
+				if (opts.endblock >= 0 && rel->blocks_to_check > opts.endblock)
+					rel->blocks_to_check = opts.endblock + 1;
+				if (opts.startblock >= 0)
+				{
+					if (rel->blocks_to_check > opts.startblock)
+						rel->blocks_to_check -= opts.startblock;
+					else
+						rel->blocks_to_check = 0;
+				}
+			}
+			*pagecount += rel->blocks_to_check;
+
+			simple_ptr_list_append(relations, rel);
+		}
+	}
+	PQclear(res);
+}
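A quick reviewer's aside on the --startblock/--endblock handling above: the block-count clamping applied to heap relations amounts to the arithmetic below. This Python sketch is purely illustrative and not part of the patch; the function name is made up, and a negative value stands in for "option not given":

```python
def blocks_to_check(relpages, startblock=-1, endblock=-1):
    """Hypothetical restatement of the clamping above: --endblock caps
    the range at endblock + 1 blocks, then --startblock subtracts the
    skipped leading blocks, bottoming out at zero."""
    n = relpages
    if endblock >= 0 and n > endblock:
        n = endblock + 1
    if startblock >= 0:
        n = n - startblock if n > startblock else 0
    return n
```

So, for example, a 100-page table checked with --startblock=90 --endblock=95 contributes 6 blocks to the progress total, and a table smaller than --startblock contributes none.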
diff --git a/contrib/pg_amcheck/t/001_basic.pl b/contrib/pg_amcheck/t/001_basic.pl
new file mode 100644
index 0000000000..dfa0ae9e06
--- /dev/null
+++ b/contrib/pg_amcheck/t/001_basic.pl
@@ -0,0 +1,9 @@
+use strict;
+use warnings;
+
+use TestLib;
+use Test::More tests => 8;
+
+program_help_ok('pg_amcheck');
+program_version_ok('pg_amcheck');
+program_options_handling_ok('pg_amcheck');
diff --git a/contrib/pg_amcheck/t/002_nonesuch.pl b/contrib/pg_amcheck/t/002_nonesuch.pl
new file mode 100644
index 0000000000..b1adf965a8
--- /dev/null
+++ b/contrib/pg_amcheck/t/002_nonesuch.pl
@@ -0,0 +1,248 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 76;
+
+# Test set-up
+my ($node, $port);
+$node = get_new_node('test');
+$node->init;
+$node->start;
+$port = $node->port;
+
+# Load the amcheck extension, upon which pg_amcheck depends
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+
+#########################################
+# Test non-existent databases
+
+# Failing to connect to the initial database is an error.
+$node->command_checks_all(
+	[ 'pg_amcheck', 'qqq' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/FATAL:  database "qqq" does not exist/ ],
+	'checking a non-existent database');
+
+# Failing to resolve a database pattern is an error by default.
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', '-d', 'qqq' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no connectable databases to check matching "qqq"/ ],
+	'checking an unresolvable database pattern');
+
+# But only a warning under --no-strict-names
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', '--no-strict-names', '-d', 'qqq' ],
+	0,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: no connectable databases to check matching "qqq"/ ],
+	'checking an unresolvable database pattern under --no-strict-names');
+
+# Check that a substring of an existent database name does not get interpreted
+# as a matching pattern.
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', '-d', 'post' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no connectable databases to check matching "post"/ ],
+	'checking an unresolvable database pattern (substring of existent database)');
+
+# Check that a superstring of an existent database name does not get interpreted
+# as a matching pattern.
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', '-d', 'postgresql' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no connectable databases to check matching "postgresql"/ ],
+	'checking an unresolvable database pattern (superstring of existent database)');
+
+#########################################
+# Test connecting with a non-existent user
+
+# Failing to connect to the initial database due to bad username is an error.
+$node->command_checks_all(
+	[ 'pg_amcheck', '-U', 'no_such_user', 'postgres' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/role "no_such_user" does not exist/ ],
+	'checking with a non-existent user');
+
+# Failing to connect to the initial database due to a bad username is still an
+# error under --no-strict-names.
+$node->command_checks_all(
+	[ 'pg_amcheck', '--no-strict-names', '-U', 'no_such_user', 'postgres' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/role "no_such_user" does not exist/ ],
+	'checking with a non-existent user under --no-strict-names');
+
+#########################################
+# Test checking databases without amcheck installed
+
+# Attempting to check a database by name where amcheck is not installed should
+# raise a warning.  If all databases are skipped, having no relations to check
+# raises an error.
+$node->command_checks_all(
+	[ 'pg_amcheck', 'template1' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/,
+	  qr/pg_amcheck: error: no relations to check/ ],
+	'checking a database by name without amcheck installed, no other databases');
+
+# Again, but this time with another database to check, so no error is raised.
+$node->command_checks_all(
+	[ 'pg_amcheck', 'template1', '-d', 'postgres' ],
+	0,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/ ],
+	'checking a database by name without amcheck installed, with other databases');
+
+# Again, but by way of checking all databases
+$node->command_checks_all(
+	[ 'pg_amcheck', '--all' ],
+	0,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/ ],
+	'checking a database by pattern without amcheck installed, with other databases');
+
+#########################################
+# Test unreasonable patterns
+
+# Check three-part unreasonable pattern that has zero-length names
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', '-t', '..' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no connectable databases to check matching "\.\."/ ],
+	'checking table pattern ".."');
+
+# Again, but with non-trivial schema and relation parts
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', '-t', '.foo.bar' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no connectable databases to check matching "\.foo\.bar"/ ],
+	'checking table pattern ".foo.bar"');
+
+# Check two-part unreasonable pattern that has zero-length names
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', '-t', '.' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no heap tables to check matching "\."/ ],
+	'checking table pattern "."');
+
+#########################################
+# Test checking non-existent databases, schemas, tables, and indexes
+
+# Use --no-strict-names and a single existent table so we only get warnings
+# about the failed pattern matches
+$node->command_checks_all(
+	[ 'pg_amcheck', '--no-strict-names',
+		'-t', 'no_such_table',
+		'-t', 'no*such*table',
+		'-i', 'no_such_index',
+		'-i', 'no*such*index',
+		'-r', 'no_such_relation',
+		'-r', 'no*such*relation',
+		'-d', 'no_such_database',
+		'-d', 'no*such*database',
+		'-r', 'none.none',
+		'-r', 'none.none.none',
+		'-r', 'this.is.a.really.long.dotted.string',
+		'-r', 'postgres.none.none',
+		'-r', 'postgres.long.dotted.string',
+		'-r', 'postgres.pg_catalog.none',
+		'-r', 'postgres.none.pg_class',
+		'-t', 'postgres.pg_catalog.pg_class',	# This exists
+	],
+	0,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: no heap tables to check matching "no_such_table"/,
+	  qr/pg_amcheck: warning: no heap tables to check matching "no\*such\*table"/,
+	  qr/pg_amcheck: warning: no btree indexes to check matching "no_such_index"/,
+	  qr/pg_amcheck: warning: no btree indexes to check matching "no\*such\*index"/,
+	  qr/pg_amcheck: warning: no relations to check matching "no_such_relation"/,
+	  qr/pg_amcheck: warning: no relations to check matching "no\*such\*relation"/,
+	  qr/pg_amcheck: warning: no heap tables to check matching "no\*such\*table"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "no_such_database"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "no\*such\*database"/,
+	  qr/pg_amcheck: warning: no relations to check matching "none\.none"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "none\.none\.none"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "this\.is\.a\.really\.long\.dotted\.string"/,
+	  qr/pg_amcheck: warning: no relations to check matching "postgres\.none\.none"/,
+	  qr/pg_amcheck: warning: no relations to check matching "postgres\.long\.dotted\.string"/,
+	  qr/pg_amcheck: warning: no relations to check matching "postgres\.pg_catalog\.none"/,
+	  qr/pg_amcheck: warning: no relations to check matching "postgres\.none\.pg_class"/,
+	],
+	'many unmatched patterns and one matched pattern under --no-strict-names');
+
+#########################################
+# Test checking otherwise existent objects but in databases where they do not exist
+
+$node->safe_psql('postgres', q(
+	CREATE TABLE public.foo (f integer);
+	CREATE INDEX foo_idx ON foo(f);
+));
+$node->safe_psql('postgres', q(CREATE DATABASE another_db));
+
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', '--no-strict-names',
+		'-t', 'template1.public.foo',
+		'-t', 'another_db.public.foo',
+		'-t', 'no_such_database.public.foo',
+		'-i', 'template1.public.foo_idx',
+		'-i', 'another_db.public.foo_idx',
+		'-i', 'no_such_database.public.foo_idx',
+	],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/,
+	  qr/pg_amcheck: warning: no heap tables to check matching "template1\.public\.foo"/,
+	  qr/pg_amcheck: warning: no heap tables to check matching "another_db\.public\.foo"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "no_such_database\.public\.foo"/,
+	  qr/pg_amcheck: warning: no btree indexes to check matching "template1\.public\.foo_idx"/,
+	  qr/pg_amcheck: warning: no btree indexes to check matching "another_db\.public\.foo_idx"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "no_such_database\.public\.foo_idx"/,
+	  qr/pg_amcheck: error: no relations to check/,
+	],
+	'checking otherwise existent objects in the wrong databases');
+
+
+#########################################
+# Test schema exclusion patterns
+
+# Check with only schema exclusion patterns
+$node->command_checks_all(
+	[ 'pg_amcheck', '--all', '--no-strict-names',
+		'-S', 'public',
+		'-S', 'pg_catalog',
+		'-S', 'pg_toast',
+		'-S', 'information_schema',
+	],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/,
+	  qr/pg_amcheck: error: no relations to check/ ],
+	'schema exclusion patterns exclude all relations');
+
+# Check with schema exclusion patterns overriding relation and schema inclusion patterns
+$node->command_checks_all(
+	[ 'pg_amcheck', '--all', '--no-strict-names',
+		'-s', 'public',
+		'-s', 'pg_catalog',
+		'-s', 'pg_toast',
+		'-s', 'information_schema',
+		'-t', 'pg_catalog.pg_class',
+		'-S', '*'
+	],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/,
+	  qr/pg_amcheck: error: no relations to check/ ],
+	'schema exclusion pattern overrides all inclusion patterns');
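The dotted-pattern tests above exercise how pg_amcheck splits a pattern into database, schema, and relation parts. The sketch below is only my reading of the behavior the tests expect (one part names a relation, two parts a schema.relation, three or more a db.schema.relation, with extra leading parts folded into the database pattern); the helper name is invented and this is not the patch's actual parsing code:

```python
def classify_pattern(pattern):
    """Hypothetical split of a dotted relation pattern into
    (database, schema, relation) parts, matching what the tests
    above appear to expect."""
    parts = pattern.split('.')
    if len(parts) == 1:
        return (None, None, parts[0])
    if len(parts) == 2:
        return (None, parts[0], parts[1])
    # Three or more parts: everything before the last two belongs to the
    # database pattern, which is why "this.is.a.really.long.dotted.string"
    # draws a "no connectable databases" complaint rather than a
    # relation-pattern complaint.
    return ('.'.join(parts[:-2]), parts[-2], parts[-1])
```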
diff --git a/contrib/pg_amcheck/t/003_check.pl b/contrib/pg_amcheck/t/003_check.pl
new file mode 100644
index 0000000000..78453aa2e5
--- /dev/null
+++ b/contrib/pg_amcheck/t/003_check.pl
@@ -0,0 +1,497 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 57;
+
+my ($node, $port, %corrupt_page, %remove_relation);
+
+# Returns the filesystem path for the named relation.
+#
+# Assumes the test node is running
+sub relation_filepath($$)
+{
+	my ($dbname, $relname) = @_;
+
+	my $pgdata = $node->data_dir;
+	my $rel = $node->safe_psql($dbname,
+							   qq(SELECT pg_relation_filepath('$relname')));
+	die "path not found for relation $relname" unless defined $rel && $rel ne '';
+	return "$pgdata/$rel";
+}
+
+# Returns the name of the toast relation associated with the named relation.
+#
+# Assumes the test node is running
+sub relation_toast($$)
+{
+	my ($dbname, $relname) = @_;
+
+	my $rel = $node->safe_psql($dbname, qq(
+		SELECT c.reltoastrelid::regclass
+			FROM pg_catalog.pg_class c
+			WHERE c.oid = '$relname'::regclass
+			  AND c.reltoastrelid != 0
+			));
+	return undef unless defined $rel && $rel ne '';
+	return $rel;
+}
+
+# Adds the relation file for the given (dbname, relname) to the list
+# to be corrupted by means of overwriting junk in the first page.
+#
+# Assumes the test node is running.
+sub plan_to_corrupt_first_page($$)
+{
+	my ($dbname, $relname) = @_;
+	my $relpath = relation_filepath($dbname, $relname);
+	$corrupt_page{$relpath} = 1;
+}
+
+# Adds the relation file for the given (dbname, relname) to the list
+# to be corrupted by means of removing the file.
+#
+# Assumes the test node is running
+sub plan_to_remove_relation_file($$)
+{
+	my ($dbname, $relname) = @_;
+	my $relpath = relation_filepath($dbname, $relname);
+	$remove_relation{$relpath} = 1;
+}
+
+# For the given (dbname, relname), if a corresponding toast table
+# exists, adds that toast table's relation file to the list to be
+# corrupted by means of removing the file.
+#
+# Assumes the test node is running.
+sub plan_to_remove_toast_file($$)
+{
+	my ($dbname, $relname) = @_;
+	my $toastname = relation_toast($dbname, $relname);
+	plan_to_remove_relation_file($dbname, $toastname) if ($toastname);
+}
+
+# Corrupts the first page of the given file path
+sub corrupt_first_page($)
+{
+	my ($relpath) = @_;
+
+	my $fh;
+	open($fh, '+<', $relpath)
+		or BAIL_OUT("open failed: $!");
+	binmode $fh;
+
+	# Corrupt some line pointers.  The values are chosen to hit the
+	# various line-pointer-corruption checks in verify_heapam.c
+	# on both little-endian and big-endian architectures.
+	seek($fh, 32, 0)
+		or BAIL_OUT("seek failed: $!");
+	syswrite(
+		$fh,
+		pack("L*",
+			0xAAA15550, 0xAAA0D550, 0x00010000,
+			0x00008000, 0x0000800F, 0x001e8000,
+			0xFFFFFFFF)
+	) or BAIL_OUT("syswrite failed: $!");
+	close($fh)
+		or BAIL_OUT("close failed: $!");
+}
+
+# Stops the node, performs all the corruptions previously planned, and
+# starts the node again.
+#
+sub perform_all_corruptions()
+{
+	$node->stop();
+	for my $relpath (keys %corrupt_page)
+	{
+		corrupt_first_page($relpath);
+	}
+	for my $relpath (keys %remove_relation)
+	{
+		unlink($relpath);
+	}
+	$node->start;
+}
+
+# Test set-up
+$node = get_new_node('test');
+$node->init;
+$node->start;
+$port = $node->port;
+
+for my $dbname (qw(db1 db2 db3))
+{
+	# Create the database
+	$node->safe_psql('postgres', qq(CREATE DATABASE $dbname));
+
+	# Load the amcheck extension, upon which pg_amcheck depends.  Put the
+	# extension in an unexpected location to test that pg_amcheck finds it
+	# correctly.  Create tables with names that look like pg_catalog names to
+	# check that pg_amcheck does not get confused by them.  Create functions in
+	# schema public that look like amcheck functions to check that pg_amcheck
+	# does not use them.
+	$node->safe_psql($dbname, q(
+		CREATE SCHEMA amcheck_schema;
+		CREATE EXTENSION amcheck WITH SCHEMA amcheck_schema;
+		CREATE TABLE amcheck_schema.pg_database (junk text);
+		CREATE TABLE amcheck_schema.pg_namespace (junk text);
+		CREATE TABLE amcheck_schema.pg_class (junk text);
+		CREATE TABLE amcheck_schema.pg_operator (junk text);
+		CREATE TABLE amcheck_schema.pg_proc (junk text);
+		CREATE TABLE amcheck_schema.pg_tablespace (junk text);
+
+		CREATE FUNCTION public.bt_index_check(index regclass,
+											  heapallindexed boolean default false)
+		RETURNS VOID AS $$
+		BEGIN
+			RAISE EXCEPTION 'Invoked wrong bt_index_check!';
+		END;
+		$$ LANGUAGE plpgsql;
+
+		CREATE FUNCTION public.bt_index_parent_check(index regclass,
+													 heapallindexed boolean default false,
+													 rootdescend boolean default false)
+		RETURNS VOID AS $$
+		BEGIN
+			RAISE EXCEPTION 'Invoked wrong bt_index_parent_check!';
+		END;
+		$$ LANGUAGE plpgsql;
+
+		CREATE FUNCTION public.verify_heapam(relation regclass,
+											 on_error_stop boolean default false,
+											 check_toast boolean default false,
+											 skip text default 'none',
+											 startblock bigint default null,
+											 endblock bigint default null,
+											 blkno OUT bigint,
+											 offnum OUT integer,
+											 attnum OUT integer,
+											 msg OUT text)
+		RETURNS SETOF record AS $$
+		BEGIN
+			RAISE EXCEPTION 'Invoked wrong verify_heapam!';
+		END;
+		$$ LANGUAGE plpgsql;
+	));
+
+	# Create tables, views, sequences, and indexes in five separate
+	# schemas.  The schemas are all identical to start, but
+	# we will corrupt them differently later.
+	#
+	for my $schema (qw(s1 s2 s3 s4 s5))
+	{
+		$node->safe_psql($dbname, qq(
+			CREATE SCHEMA $schema;
+			CREATE SEQUENCE $schema.seq1;
+			CREATE SEQUENCE $schema.seq2;
+			CREATE TABLE $schema.t1 (
+				i INTEGER,
+				b BOX,
+				ia int4[],
+				ir int4range,
+				t TEXT
+			);
+			CREATE TABLE $schema.t2 (
+				i INTEGER,
+				b BOX,
+				ia int4[],
+				ir int4range,
+				t TEXT
+			);
+			CREATE VIEW $schema.t2_view AS (
+				SELECT i*2, t FROM $schema.t2
+			);
+			ALTER TABLE $schema.t2
+				ALTER COLUMN t
+				SET STORAGE EXTERNAL;
+
+			INSERT INTO $schema.t1 (i, b, ia, ir, t)
+				(SELECT gs::INTEGER AS i,
+						box(point(gs,gs+5),point(gs*2,gs*3)) AS b,
+						array[gs, gs + 1]::int4[] AS ia,
+						int4range(gs, gs+100) AS ir,
+						repeat('foo', gs) AS t
+					 FROM generate_series(1,10000,3000) AS gs);
+
+			INSERT INTO $schema.t2 (i, b, ia, ir, t)
+				(SELECT gs::INTEGER AS i,
+						box(point(gs,gs+5),point(gs*2,gs*3)) AS b,
+						array[gs, gs + 1]::int4[] AS ia,
+						int4range(gs, gs+100) AS ir,
+						repeat('foo', gs) AS t
+					 FROM generate_series(1,10000,3000) AS gs);
+
+			CREATE MATERIALIZED VIEW $schema.t1_mv AS SELECT * FROM $schema.t1;
+			CREATE MATERIALIZED VIEW $schema.t2_mv AS SELECT * FROM $schema.t2;
+
+			create table $schema.p1 (a int, b int) PARTITION BY list (a);
+			create table $schema.p2 (a int, b int) PARTITION BY list (a);
+
+			create table $schema.p1_1 partition of $schema.p1 for values in (1, 2, 3);
+			create table $schema.p1_2 partition of $schema.p1 for values in (4, 5, 6);
+			create table $schema.p2_1 partition of $schema.p2 for values in (1, 2, 3);
+			create table $schema.p2_2 partition of $schema.p2 for values in (4, 5, 6);
+
+			CREATE INDEX t1_btree ON $schema.t1 USING BTREE (i);
+			CREATE INDEX t2_btree ON $schema.t2 USING BTREE (i);
+
+			CREATE INDEX t1_hash ON $schema.t1 USING HASH (i);
+			CREATE INDEX t2_hash ON $schema.t2 USING HASH (i);
+
+			CREATE INDEX t1_brin ON $schema.t1 USING BRIN (i);
+			CREATE INDEX t2_brin ON $schema.t2 USING BRIN (i);
+
+			CREATE INDEX t1_gist ON $schema.t1 USING GIST (b);
+			CREATE INDEX t2_gist ON $schema.t2 USING GIST (b);
+
+			CREATE INDEX t1_gin ON $schema.t1 USING GIN (ia);
+			CREATE INDEX t2_gin ON $schema.t2 USING GIN (ia);
+
+			CREATE INDEX t1_spgist ON $schema.t1 USING SPGIST (ir);
+			CREATE INDEX t2_spgist ON $schema.t2 USING SPGIST (ir);
+		));
+	}
+}
+
+# Database 'db1' corruptions
+#
+
+# Corrupt indexes in schema "s1"
+plan_to_remove_relation_file('db1', 's1.t1_btree');
+plan_to_corrupt_first_page('db1', 's1.t2_btree');
+
+# Corrupt tables in schema "s2"
+plan_to_remove_relation_file('db1', 's2.t1');
+plan_to_corrupt_first_page('db1', 's2.t2');
+
+# Corrupt tables, partitions, matviews, and btrees in schema "s3"
+plan_to_remove_relation_file('db1', 's3.t1');
+plan_to_corrupt_first_page('db1', 's3.t2');
+
+plan_to_remove_relation_file('db1', 's3.t1_mv');
+plan_to_remove_relation_file('db1', 's3.p1_1');
+
+plan_to_corrupt_first_page('db1', 's3.t2_mv');
+plan_to_corrupt_first_page('db1', 's3.p2_1');
+
+plan_to_remove_relation_file('db1', 's3.t1_btree');
+plan_to_corrupt_first_page('db1', 's3.t2_btree');
+
+# Corrupt the toast table associated with t2 in schema "s4"
+plan_to_remove_toast_file('db1', 's4.t2');
+
+# Corrupt all other object types in schema "s5".  We don't have amcheck support
+# for these types, but we check that corrupting them does not trigger any
+# errors in pg_amcheck.
+plan_to_remove_relation_file('db1', 's5.seq1');
+plan_to_remove_relation_file('db1', 's5.t1_hash');
+plan_to_remove_relation_file('db1', 's5.t1_gist');
+plan_to_remove_relation_file('db1', 's5.t1_gin');
+plan_to_remove_relation_file('db1', 's5.t1_brin');
+plan_to_remove_relation_file('db1', 's5.t1_spgist');
+
+plan_to_corrupt_first_page('db1', 's5.seq2');
+plan_to_corrupt_first_page('db1', 's5.t2_hash');
+plan_to_corrupt_first_page('db1', 's5.t2_gist');
+plan_to_corrupt_first_page('db1', 's5.t2_gin');
+plan_to_corrupt_first_page('db1', 's5.t2_brin');
+plan_to_corrupt_first_page('db1', 's5.t2_spgist');
+
+
+# Database 'db2' corruptions
+#
+plan_to_remove_relation_file('db2', 's1.t1');
+plan_to_remove_relation_file('db2', 's1.t1_btree');
+
+
+# Leave 'db3' uncorrupted
+#
+
+# Perform the corruptions we planned above using only a single database restart.
+#
+perform_all_corruptions();
+
+
+# Standard first arguments to TestLib functions
+my @cmd = ('pg_amcheck', '--quiet', '-p', $port);
+
+# Regular expressions to match various expected output
+my $no_output_re = qr/^$/;
+my $line_pointer_corruption_re = qr/line pointer/;
+my $missing_file_re = qr/could not open file ".*": No such file or directory/;
+my $index_missing_relation_fork_re = qr/index ".*" lacks a main relation fork/;
+
+# When checking databases that have amcheck installed and corrupt relations,
+# the pg_amcheck command should return exit status 2, indicating that tables
+# and indexes are corrupt, not exit status 1, which would mean that the
+# pg_amcheck command itself failed.  Corruption messages should go to stdout,
+# and nothing to stderr.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1' ],
+	2,
+	[ $index_missing_relation_fork_re,
+	  $line_pointer_corruption_re,
+	  $missing_file_re,
+	],
+	[ $no_output_re ],
+	'pg_amcheck all schemas, tables and indexes in database db1');
+
+$node->command_checks_all(
+	[ @cmd, 'db1', '-d', 'db2', '-d', 'db3' ],
+	2,
+	[ $index_missing_relation_fork_re,
+	  $line_pointer_corruption_re,
+	  $missing_file_re,
+	],
+	[ $no_output_re ],
+	'pg_amcheck all schemas, tables and indexes in databases db1, db2, and db3');
+
+# Scans of indexes in s1 should detect the specific corruption that we created
+# above.  For missing relation forks, we know what the error message looks
+# like.  For corrupted index pages, the error might vary depending on how the
+# page was formatted on disk, including variations due to alignment differences
+# between platforms, so we accept any non-empty error message.
+#
+# If we don't limit the check to databases with amcheck installed, we expect
+# a complaint on stderr, but otherwise stderr should be quiet.
+#
+$node->command_checks_all(
+	[ @cmd, '--all', '-s', 's1', '-i', 't1_btree' ],
+	2,
+	[ $index_missing_relation_fork_re ],
+	[ qr/pg_amcheck: warning: skipping database "postgres": amcheck is not installed/ ],
+	'pg_amcheck index s1.t1_btree reports missing main relation fork');
+
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's1', '-i', 't2_btree' ],
+	2,
+	[ qr/.+/ ],			# Any non-empty error message is acceptable
+	[ $no_output_re ],
+	'pg_amcheck index s1.t2_btree reports index corruption');
+
+# Checking db1.s1 with indexes excluded should show no corruptions because we
+# did not corrupt any tables in db1.s1.  Verify that both stdout and stderr
+# are quiet.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-t', 's1.*', '--no-dependent-indexes' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck of db1.s1 excluding indexes');
+
+# Checking db2.s1 should show table corruptions if indexes are excluded
+#
+$node->command_checks_all(
+	[ @cmd, 'db2', '-t', 's1.*', '--no-dependent-indexes' ],
+	2,
+	[ $missing_file_re ],
+	[ $no_output_re ],
+	'pg_amcheck of db2.s1 excluding indexes');
+
+# In schema db1.s3, the tables and indexes are both corrupt.  We should see
+# corruption messages on stdout, and nothing on stderr.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's3' ],
+	2,
+	[ $index_missing_relation_fork_re,
+	  $line_pointer_corruption_re,
+	  $missing_file_re,
+	],
+	[ $no_output_re ],
+	'pg_amcheck schema s3 reports table and index errors');
+
+# In schema db1.s4, only toast tables are corrupt.  Check that under default
+# options the toast corruption is reported, but when excluding toast we get no
+# error reports.
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's4' ],
+	2,
+	[ $missing_file_re ],
+	[ $no_output_re ],
+	'pg_amcheck in schema s4 reports toast corruption');
+
+$node->command_checks_all(
+	[ @cmd, '--no-dependent-toast', '--exclude-toast-pointers', 'db1', '-s', 's4' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck in schema s4 excluding toast reports no corruption');
+
+# Check that no corruption is reported in schema db1.s5
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's5' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck over schema s5 reports no corruption');
+
+# In schema db1.s1, only indexes are corrupt.  Verify that when we exclude
+# the indexes, no corruption is reported about the schema.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's1', '-I', 't1_btree', '-I', 't2_btree' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck over schema s1 with corrupt indexes excluded reports no corruption');
+
+# In schema db1.s1, only indexes are corrupt.  Verify that when we provide only
+# table inclusions, and disable index expansion, no corruption is reported
+# about the schema.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-t', 's1.*', '--no-dependent-indexes' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck over schema s1 with all indexes excluded reports no corruption');
+
+# In schema db1.s2, only tables are corrupt.  Verify that when we exclude those
+# tables, no corruption is reported.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's2', '-T', 't1', '-T', 't2' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck over schema s2 with corrupt tables excluded reports no corruption');
+
+# Check errors about bad block range command line arguments.  We use schema s5
+# to avoid getting messages about corrupt tables or indexes.
+#
+command_fails_like(
+	[ @cmd, 'db1', '-s', 's5', '--startblock', 'junk' ],
+	qr/relation start block argument contains garbage characters/,
+	'pg_amcheck rejects garbage startblock');
+
+command_fails_like(
+	[ @cmd, 'db1', '-s', 's5', '--endblock', '1234junk' ],
+	qr/relation end block argument contains garbage characters/,
+	'pg_amcheck rejects garbage endblock');
+
+command_fails_like(
+	[ @cmd, 'db1', '-s', 's5', '--startblock', '5', '--endblock', '4' ],
+	qr/relation end block argument precedes start block argument/,
+	'pg_amcheck rejects invalid block range');
+
+# Check the bt_index_parent_check variants.  We don't create any index corruption
+# that would behave differently under these modes, so just smoke test that the
+# arguments are handled sensibly.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's1', '-i', 't1_btree', '--parent-check' ],
+	2,
+	[ $index_missing_relation_fork_re ],
+	[ $no_output_re ],
+	'pg_amcheck smoke test --parent-check');
+
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's1', '-i', 't1_btree', '--heapallindexed', '--rootdescend' ],
+	2,
+	[ $index_missing_relation_fork_re ],
+	[ $no_output_re ],
+	'pg_amcheck smoke test --heapallindexed --rootdescend');
diff --git a/contrib/pg_amcheck/t/004_verify_heapam.pl b/contrib/pg_amcheck/t/004_verify_heapam.pl
new file mode 100644
index 0000000000..ee7193bdc0
--- /dev/null
+++ b/contrib/pg_amcheck/t/004_verify_heapam.pl
@@ -0,0 +1,517 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+
+use Test::More;
+
+# This regression test demonstrates that the pg_amcheck binary supplied with
+# the pg_amcheck contrib module correctly identifies specific kinds of
+# corruption within pages.  To test this, we need a mechanism to create corrupt
+# pages with predictable, repeatable corruption.  The postgres backend cannot
+# be expected to help us with this, as its design is not consistent with the
+# goal of intentionally corrupting pages.
+#
+# Instead, we create a table to corrupt, and with careful consideration of how
+# PostgreSQL lays out heap pages, we seek to offsets within the page and
+# overwrite deliberately chosen bytes with specific values calculated to
+# corrupt the page in expected ways.  We then verify that pg_amcheck reports
+# the corruption, and that it runs without crashing.  Note that the backend
+# cannot simply be started to run queries against the corrupt table, as the
+# backend will crash, at least for some of the corruption types we generate.
+#
+# Autovacuum potentially touching the table in the background makes the exact
+# behavior of this test harder to reason about.  We turn it off to keep things
+# simpler.  We use a "belt and suspenders" approach, turning it off for the
+# system generally in postgresql.conf, and turning it off specifically for the
+# test table.
+#
+# This test depends on the table being written to the heap file exactly as we
+# expect it to be, so we take care to arrange the table's columns, and the rows
+# we insert, to give predictable sizes and locations within the table page.
+#
+# The HeapTupleHeaderData has 23 bytes of fixed size fields before the variable
+# length t_bits[] array.  We have exactly 3 columns in the table, so natts = 3,
+# t_bits is 1 byte long, and t_hoff = MAXALIGN(23 + 1) = 24.
+#
+# We're not too fussy about which datatypes we use for the test, but we do care
+# about some specific properties.  We'd like to test both fixed size and
+# varlena types.  We'd like some varlena data inline and some toasted.  And
+# we'd like the layout of the table such that the datums land at predictable
+# offsets within the tuple.  We choose a structure without padding on all
+# supported architectures:
+#
+# 	a BIGINT
+#	b TEXT
+#	c TEXT
+#
+# We always insert a 7-ascii character string into field 'b', which with a
+# 1-byte varlena header gives an 8 byte inline value.  We always insert a long
+# text string in field 'c', long enough to force toast storage.
+#
+# We choose to read and write binary copies of our table's tuples, using perl's
+# pack() and unpack() functions.  Perl uses a packing code system in which:
+#
+#	L = "Unsigned 32-bit Long",
+#	S = "Unsigned 16-bit Short",
+#	C = "Unsigned 8-bit Octet",
+#	c = "signed 8-bit octet",
+#	q = "signed 64-bit quadword"
+#
+# Each tuple in our table has a layout as follows:
+#
+#    xx xx xx xx            t_xmin: xxxx		offset = 0		L
+#    xx xx xx xx            t_xmax: xxxx		offset = 4		L
+#    xx xx xx xx          t_field3: xxxx		offset = 8		L
+#    xx xx                   bi_hi: xx			offset = 12		S
+#    xx xx                   bi_lo: xx			offset = 14		S
+#    xx xx                ip_posid: xx			offset = 16		S
+#    xx xx             t_infomask2: xx			offset = 18		S
+#    xx xx              t_infomask: xx			offset = 20		S
+#    xx                     t_hoff: x			offset = 22		C
+#    xx                     t_bits: x			offset = 23		C
+#    xx xx xx xx xx xx xx xx   'a': xxxxxxxx	offset = 24		q
+#    xx xx xx xx xx xx xx xx   'b': xxxxxxxx	offset = 32		Cccccccc
+#    xx xx xx xx xx xx xx xx   'c': xxxxxxxx	offset = 40		SSSS
+#    xx xx xx xx xx xx xx xx      : xxxxxxxx	 ...continued	SSSS
+#    xx xx                        : xx      	 ...continued	S
+#
+# We could choose to read and write columns 'b' and 'c' in other ways, but
+# it is convenient enough to do it this way.  We define packing code
+# constants here, where they can be compared easily against the layout.
+
+use constant HEAPTUPLE_PACK_CODE => 'LLLSSSSSCCqCcccccccSSSSSSSSS';
+use constant HEAPTUPLE_PACK_LENGTH => 58;     # Total size
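+# As a cross-check, the sizes implied by the packing code are:
+#
+#   3 * L (4 bytes) + 5 * S (2 bytes) + 2 * C (1 byte) = 24 byte tuple header
+#   1 * q (8 bytes)                                    =  8 bytes for 'a'
+#   1 * C + 7 * c (1 byte each)                        =  8 bytes for 'b'
+#   9 * S (2 bytes)                                    = 18 bytes for 'c'
+#
+# for a total of 58 bytes, which is HEAPTUPLE_PACK_LENGTH above.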
+
+# Read a tuple of our table from a heap page.
+#
+# Takes an open filehandle to the heap file, and the offset of the tuple.
+#
+# Rather than returning the binary data from the file, unpacks the data into a
+# perl hash with named fields.  These fields exactly match the ones understood
+# by write_tuple(), below.  Returns a reference to this hash.
+#
+sub read_tuple ($$)
+{
+	my ($fh, $offset) = @_;
+	my ($buffer, %tup);
+	seek($fh, $offset, 0)
+		or BAIL_OUT("seek failed: $!");
+	defined(sysread($fh, $buffer, HEAPTUPLE_PACK_LENGTH))
+		or BAIL_OUT("sysread failed: $!");
+
+	@_ = unpack(HEAPTUPLE_PACK_CODE, $buffer);
+	%tup = (t_xmin => shift,
+			t_xmax => shift,
+			t_field3 => shift,
+			bi_hi => shift,
+			bi_lo => shift,
+			ip_posid => shift,
+			t_infomask2 => shift,
+			t_infomask => shift,
+			t_hoff => shift,
+			t_bits => shift,
+			a => shift,
+			b_header => shift,
+			b_body1 => shift,
+			b_body2 => shift,
+			b_body3 => shift,
+			b_body4 => shift,
+			b_body5 => shift,
+			b_body6 => shift,
+			b_body7 => shift,
+			c1 => shift,
+			c2 => shift,
+			c3 => shift,
+			c4 => shift,
+			c5 => shift,
+			c6 => shift,
+			c7 => shift,
+			c8 => shift,
+			c9 => shift);
+	# Stitch together the text for column 'b'
+	$tup{b} = join('', map { chr($tup{"b_body$_"}) } (1..7));
+	return \%tup;
+}
+
+# Write a tuple of our table to a heap page.
+#
+# Takes an open filehandle to the heap file, the offset of the tuple, and a
+# reference to a hash with the tuple values, as returned by read_tuple().
+# Writes the tuple fields from the hash into the heap file.
+#
+# The purpose of this function is to write a tuple back to disk with some
+# subset of fields modified.  The function does no error checking.  Use
+# cautiously.
+#
+sub write_tuple($$$)
+{
+	my ($fh, $offset, $tup) = @_;
+	my $buffer = pack(HEAPTUPLE_PACK_CODE,
+					$tup->{t_xmin},
+					$tup->{t_xmax},
+					$tup->{t_field3},
+					$tup->{bi_hi},
+					$tup->{bi_lo},
+					$tup->{ip_posid},
+					$tup->{t_infomask2},
+					$tup->{t_infomask},
+					$tup->{t_hoff},
+					$tup->{t_bits},
+					$tup->{a},
+					$tup->{b_header},
+					$tup->{b_body1},
+					$tup->{b_body2},
+					$tup->{b_body3},
+					$tup->{b_body4},
+					$tup->{b_body5},
+					$tup->{b_body6},
+					$tup->{b_body7},
+					$tup->{c1},
+					$tup->{c2},
+					$tup->{c3},
+					$tup->{c4},
+					$tup->{c5},
+					$tup->{c6},
+					$tup->{c7},
+					$tup->{c8},
+					$tup->{c9});
+	seek($fh, $offset, 0)
+		or BAIL_OUT("seek failed: $!");
+	defined(syswrite($fh, $buffer, HEAPTUPLE_PACK_LENGTH))
+		or BAIL_OUT("syswrite failed: $!");
+	return;
+}
+
+# Set umask so test directories and files are created with default permissions
+umask(0077);
+
+# Set up the node.  Once we create and corrupt the table,
+# autovacuum workers visiting the table could crash the backend.
+# Disable autovacuum so that won't happen.
+my $node = get_new_node('test');
+$node->init;
+$node->append_conf('postgresql.conf', 'autovacuum=off');
+
+# Start the node and load the extensions.  We depend on both
+# amcheck and pageinspect for this test.
+$node->start;
+my $port = $node->port;
+my $pgdata = $node->data_dir;
+$node->safe_psql('postgres', "CREATE EXTENSION amcheck");
+$node->safe_psql('postgres', "CREATE EXTENSION pageinspect");
+
+# Get a non-zero datfrozenxid
+$node->safe_psql('postgres', qq(VACUUM FREEZE));
+
+# Create the test table with precisely the schema that our corruption function
+# expects.
+$node->safe_psql(
+	'postgres', qq(
+		CREATE TABLE public.test (a BIGINT, b TEXT, c TEXT);
+		ALTER TABLE public.test SET (autovacuum_enabled=false);
+		ALTER TABLE public.test ALTER COLUMN c SET STORAGE EXTERNAL;
+		CREATE INDEX test_idx ON public.test(a, b);
+	));
+
+# We want (0 < datfrozenxid < test.relfrozenxid).  To achieve this, we freeze
+# an otherwise unused table, public.junk, prior to inserting data and freezing
+# public.test
+$node->safe_psql(
+	'postgres', qq(
+		CREATE TABLE public.junk AS SELECT 'junk'::TEXT AS junk_column;
+		ALTER TABLE public.junk SET (autovacuum_enabled=false);
+		VACUUM FREEZE public.junk
+	));
+
+my $rel = $node->safe_psql('postgres', qq(SELECT pg_relation_filepath('public.test')));
+my $relpath = "$pgdata/$rel";
+
+# Insert data and freeze public.test
+use constant ROWCOUNT => 16;
+$node->safe_psql('postgres', qq(
+	INSERT INTO public.test (a, b, c)
+		VALUES (
+			12345678,
+			'abcdefg',
+			repeat('w', 10000)
+		);
+	VACUUM FREEZE public.test
+	)) for (1..ROWCOUNT);
+
+my $relfrozenxid = $node->safe_psql('postgres',
+	q(select relfrozenxid from pg_class where relname = 'test'));
+my $datfrozenxid = $node->safe_psql('postgres',
+	q(select datfrozenxid from pg_database where datname = 'postgres'));
+
+# Sanity check that our 'test' table has a relfrozenxid newer than the
+# datfrozenxid for the database, and that the datfrozenxid is greater than the
+# first normal xid.  We rely on these invariants in some of our tests.
+if ($datfrozenxid <= 3 || $datfrozenxid >= $relfrozenxid)
+{
+	$node->clean_node;
+	plan skip_all => "Xid thresholds not as expected: got datfrozenxid = $datfrozenxid, relfrozenxid = $relfrozenxid";
+	exit;
+}
+
+# Find where each of the tuples is located on the page.
+my @lp_off;
+for my $tup (0..ROWCOUNT-1)
+{
+	push (@lp_off, $node->safe_psql('postgres', qq(
+select lp_off from heap_page_items(get_raw_page('test', 'main', 0))
+	offset $tup limit 1)));
+}
+
+# Sanity check that our 'test' table's on-disk layout matches expectations.  If
+# this is not so, we will have to skip the test until somebody updates the test
+# to work on this platform.
+$node->stop;
+my $file;
+open($file, '+<', $relpath)
+	or BAIL_OUT("open failed: $!");
+binmode $file;
+
+for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
+{
+	my $offnum = $tupidx + 1;  # offnum is 1-based, not zero-based
+	my $offset = $lp_off[$tupidx];
+	my $tup = read_tuple($file, $offset);
+
+	# Sanity-check that the data appears on the page where we expect.
+	my $a = $tup->{a};
+	my $b = $tup->{b};
+	if ($a ne '12345678' || $b ne 'abcdefg')
+	{
+		close($file);  # ignore errors on close; we're exiting anyway
+		$node->clean_node;
+		plan skip_all => qq(Page layout differs from our expectations: expected (12345678, "abcdefg"), got ($a, "$b"));
+		exit;
+	}
+}
+close($file)
+	or BAIL_OUT("close failed: $!");
+$node->start;
+
+# OK, xids and page layout look as expected.  We can run corruption tests.
+plan tests => 20;
+
+# Check that pg_amcheck runs against the uncorrupted table without error.
+$node->command_ok(['pg_amcheck', '-p', $port, 'postgres'],
+				  'pg_amcheck test table, prior to corruption');
+
+# Check that pg_amcheck runs against the uncorrupted table and index without error.
+$node->command_ok(['pg_amcheck', '-p', $port, 'postgres'],
+				  'pg_amcheck test table and index, prior to corruption');
+
+$node->stop;
+
+# Some #define constants from access/htup_details.h for use while corrupting.
+use constant HEAP_HASNULL            => 0x0001;
+use constant HEAP_XMAX_LOCK_ONLY     => 0x0080;
+use constant HEAP_XMIN_COMMITTED     => 0x0100;
+use constant HEAP_XMIN_INVALID       => 0x0200;
+use constant HEAP_XMAX_COMMITTED     => 0x0400;
+use constant HEAP_XMAX_INVALID       => 0x0800;
+use constant HEAP_NATTS_MASK         => 0x07FF;
+use constant HEAP_XMAX_IS_MULTI      => 0x1000;
+use constant HEAP_KEYS_UPDATED       => 0x2000;
+
+# Helper function to generate a regular expression matching the header we
+# expect verify_heapam() to return given which fields we expect to be non-null.
+sub header
+{
+	my ($blkno, $offnum, $attnum) = @_;
+	return qr/relation "postgres"\."public"\."test", block $blkno, offset $offnum, attribute $attnum\s+/ms
+		if (defined $attnum);
+	return qr/relation "postgres"\."public"\."test", block $blkno, offset $offnum\s+/ms
+		if (defined $offnum);
+	return qr/relation "postgres"\."public"\."test", block $blkno\s+/ms
+		if (defined $blkno);
+	return qr/relation "postgres"\."public"\."test"\s+/ms;
+}
+
+# Corrupt the tuples, one type of corruption per tuple.  Some types of
+# corruption cause verify_heapam to skip to the next tuple without
+# performing any remaining checks, so we can't exercise the system properly if
+# we focus all our corruption on a single tuple.
+#
+my @expected;
+open($file, '+<', $relpath)
+	or BAIL_OUT("open failed: $!");
+binmode $file;
+
+for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
+{
+	my $offnum = $tupidx + 1;  # offnum is 1-based, not zero-based
+	my $offset = $lp_off[$tupidx];
+	my $tup = read_tuple($file, $offset);
+
+	my $header = header(0, $offnum, undef);
+	if ($offnum == 1)
+	{
+		# Corruptly set xmin < relfrozenxid
+		my $xmin = $relfrozenxid - 1;
+		$tup->{t_xmin} = $xmin;
+		$tup->{t_infomask} &= ~HEAP_XMIN_COMMITTED;
+		$tup->{t_infomask} &= ~HEAP_XMIN_INVALID;
+
+		# Expected corruption report
+		push @expected,
+			qr/${header}xmin $xmin precedes relation freeze threshold 0:\d+/;
+	}
+	elsif ($offnum == 2)
+	{
+		# Corruptly set xmin < datfrozenxid
+		my $xmin = 3;
+		$tup->{t_xmin} = $xmin;
+		$tup->{t_infomask} &= ~HEAP_XMIN_COMMITTED;
+		$tup->{t_infomask} &= ~HEAP_XMIN_INVALID;
+
+		push @expected,
+			qr/${header}xmin $xmin precedes oldest valid transaction ID 0:\d+/;
+	}
+	elsif ($offnum == 3)
+	{
+		# Corruptly set xmin < datfrozenxid, further back, noting circularity
+		# of xid comparison.  For a new cluster with epoch = 0, the corrupt
+		# xmin will be interpreted as in the future
+		$tup->{t_xmin} = 4026531839;
+		$tup->{t_infomask} &= ~HEAP_XMIN_COMMITTED;
+		$tup->{t_infomask} &= ~HEAP_XMIN_INVALID;
+
+		push @expected,
+			qr/${header}xmin 4026531839 equals or exceeds next valid transaction ID 0:\d+/;
+	}
+	elsif ($offnum == 4)
+	{
+		# Corruptly set xmax < relminmxid;
+		$tup->{t_xmax} = 4026531839;
+		$tup->{t_infomask} &= ~HEAP_XMAX_INVALID;
+
+		push @expected,
+			qr/${header}xmax 4026531839 equals or exceeds next valid transaction ID 0:\d+/;
+	}
+	elsif ($offnum == 5)
+	{
+		# Corrupt the tuple t_hoff, but keep it aligned properly
+		$tup->{t_hoff} += 128;
+
+		push @expected,
+			qr/${header}data begins at offset 152 beyond the tuple length 58/,
+			qr/${header}tuple data should begin at byte 24, but actually begins at byte 152 \(3 attributes, no nulls\)/;
+	}
+	elsif ($offnum == 6)
+	{
+		# Corrupt the tuple t_hoff, wrong alignment
+		$tup->{t_hoff} += 3;
+
+		push @expected,
+			qr/${header}tuple data should begin at byte 24, but actually begins at byte 27 \(3 attributes, no nulls\)/;
+	}
+	elsif ($offnum == 7)
+	{
+		# Corrupt the tuple t_hoff, underflow but correct alignment
+		$tup->{t_hoff} -= 8;
+
+		push @expected,
+			qr/${header}tuple data should begin at byte 24, but actually begins at byte 16 \(3 attributes, no nulls\)/;
+	}
+	elsif ($offnum == 8)
+	{
+		# Corrupt the tuple t_hoff, underflow and wrong alignment
+		$tup->{t_hoff} -= 3;
+
+		push @expected,
+			qr/${header}tuple data should begin at byte 24, but actually begins at byte 21 \(3 attributes, no nulls\)/;
+	}
+	elsif ($offnum == 9)
+	{
+		# Corrupt the tuple to look like it has lots of attributes, not just 3
+		$tup->{t_infomask2} |= HEAP_NATTS_MASK;
+
+		push @expected,
+			qr/${header}number of attributes 2047 exceeds maximum expected for table 3/;
+	}
+	elsif ($offnum == 10)
+	{
+		# Corrupt the tuple to look like it has lots of attributes, some of
+		# them null.  This falsely creates the impression that the t_bits
+		# array is longer than just one byte, but t_hoff still says otherwise.
+		$tup->{t_infomask} |= HEAP_HASNULL;
+		$tup->{t_infomask2} |= HEAP_NATTS_MASK;
+		$tup->{t_bits} = 0xAA;
+
+		push @expected,
+			qr/${header}tuple data should begin at byte 280, but actually begins at byte 24 \(2047 attributes, has nulls\)/;
+	}
+	elsif ($offnum == 11)
+	{
+		# Same as above, but this time t_hoff plays along
+		$tup->{t_infomask} |= HEAP_HASNULL;
+		$tup->{t_infomask2} |= (HEAP_NATTS_MASK & 0x40);
+		$tup->{t_bits} = 0xAA;
+		$tup->{t_hoff} = 32;
+
+		push @expected,
+			qr/${header}number of attributes 67 exceeds maximum expected for table 3/;
+	}
+	elsif ($offnum == 12)
+	{
+		# Corrupt the bits in column 'b' 1-byte varlena header
+		$tup->{b_header} = 0x80;
+
+		$header = header(0, $offnum, 1);
+		push @expected,
+			qr/${header}attribute 1 with length 4294967295 ends at offset 416848000 beyond total tuple length 58/;
+	}
+	elsif ($offnum == 13)
+	{
+		# Corrupt the bits in column 'c' toast pointer
+		$tup->{c6} = 41;
+		$tup->{c7} = 41;
+
+		$header = header(0, $offnum, 2);
+		push @expected,
+			qr/${header}final toast chunk number 0 differs from expected value 6/,
+			qr/${header}toasted value for attribute 2 missing from toast table/;
+	}
+	elsif ($offnum == 14)
+	{
+		# Set both HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI
+		$tup->{t_infomask} |= HEAP_XMAX_COMMITTED;
+		$tup->{t_infomask} |= HEAP_XMAX_IS_MULTI;
+		$tup->{t_xmax} = 4;
+
+		push @expected,
+			qr/${header}multitransaction ID 4 equals or exceeds next valid multitransaction ID 1/;
+	}
+	elsif ($offnum == 15)	# Last tuple we corrupt; the ROWCOUNTth tuple is left uncorrupted
+	{
+		# Set both HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI
+		$tup->{t_infomask} |= HEAP_XMAX_COMMITTED;
+		$tup->{t_infomask} |= HEAP_XMAX_IS_MULTI;
+		$tup->{t_xmax} = 4000000000;
+
+		push @expected,
+			qr/${header}multitransaction ID 4000000000 precedes relation minimum multitransaction ID threshold 1/;
+	}
+	write_tuple($file, $offset, $tup);
+}
+close($file)
+	or BAIL_OUT("close failed: $!");
+$node->start;
+
+# Run pg_amcheck against the corrupt table with epoch=0, comparing actual
+# corruption messages against the expected messages
+$node->command_checks_all(
+	['pg_amcheck', '--no-dependent-indexes', '-p', $port, 'postgres'],
+	2,
+	[ @expected ],
+	[ ],
+	'Expected corruption message output');
+
+$node->teardown_node;
+$node->clean_node;
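For readers more familiar with Python than with Perl's pack()/unpack(), the read/modify/write cycle performed by read_tuple() and write_tuple() above can be sketched with Python's struct module. This is an illustrative analog only, not part of the patch; it covers just the 24-byte fixed tuple header rather than the full 58-byte layout:

```python
import struct

# The 23 bytes of fixed HeapTupleHeaderData fields plus the 1-byte t_bits[]
# array, matching the layout table above: three uint32s, five uint16s, and
# two uint8s, little-endian and unpadded (hence the '<' prefix).
HEADER_FMT = "<LLLHHHHHBB"
assert struct.calcsize(HEADER_FMT) == 24

HEAP_XMIN_INVALID = 0x0200  # from access/htup_details.h

def corrupt_xmin(raw_header, new_xmin):
    # Unpack the header, overwrite t_xmin, clear HEAP_XMIN_INVALID in
    # t_infomask, and repack -- the same read/modify/write cycle the Perl
    # read_tuple()/write_tuple() pair performs against the heap file.
    fields = list(struct.unpack(HEADER_FMT, raw_header))
    fields[0] = new_xmin                 # t_xmin
    fields[7] &= ~HEAP_XMIN_INVALID      # t_infomask
    return struct.pack(HEADER_FMT, *fields)

# Round-trip on a synthetic header: t_xmin=700, ip_posid=1, natts=3, t_hoff=24.
orig = struct.pack(HEADER_FMT, 700, 0, 0, 0, 0, 1, 3, HEAP_XMIN_INVALID, 24, 0)
assert struct.unpack(HEADER_FMT, corrupt_xmin(orig, 3))[0] == 3
```

The test script works the same way, except that it seeks within the real heap file and packs all 28 fields of the tuple, not just the header.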
diff --git a/contrib/pg_amcheck/t/005_opclass_damage.pl b/contrib/pg_amcheck/t/005_opclass_damage.pl
new file mode 100644
index 0000000000..eba8ea9cae
--- /dev/null
+++ b/contrib/pg_amcheck/t/005_opclass_damage.pl
@@ -0,0 +1,54 @@
+# This regression test checks the behavior of the btree validation in the
+# presence of breaking sort order changes.
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 5;
+
+my $node = get_new_node('test');
+$node->init;
+$node->start;
+
+# Create a custom operator class and an index which uses it.
+$node->safe_psql('postgres', q(
+	CREATE EXTENSION amcheck;
+
+	CREATE FUNCTION int4_asc_cmp (a int4, b int4) RETURNS int LANGUAGE sql AS $$
+		SELECT CASE WHEN $1 = $2 THEN 0 WHEN $1 > $2 THEN 1 ELSE -1 END; $$;
+
+	CREATE OPERATOR CLASS int4_fickle_ops FOR TYPE int4 USING btree AS
+	    OPERATOR 1 < (int4, int4), OPERATOR 2 <= (int4, int4),
+	    OPERATOR 3 = (int4, int4), OPERATOR 4 >= (int4, int4),
+	    OPERATOR 5 > (int4, int4), FUNCTION 1 int4_asc_cmp(int4, int4);
+
+	CREATE TABLE int4tbl (i int4);
+	INSERT INTO int4tbl (SELECT * FROM generate_series(1,1000) gs);
+	CREATE INDEX fickleidx ON int4tbl USING btree (i int4_fickle_ops);
+));
+
+# We have not yet broken the index, so we should get no corruption
+$node->command_like(
+	[ 'pg_amcheck', '--quiet', '-p', $node->port, 'postgres' ],
+	qr/^$/,
+	'pg_amcheck all schemas, tables and indexes reports no corruption');
+
+# Change the operator class to use a function which sorts in a different
+# order to corrupt the btree index
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION int4_desc_cmp (int4, int4) RETURNS int LANGUAGE sql AS $$
+		SELECT CASE WHEN $1 = $2 THEN 0 WHEN $1 > $2 THEN -1 ELSE 1 END; $$;
+	UPDATE pg_catalog.pg_amproc
+		SET amproc = 'int4_desc_cmp'::regproc
+		WHERE amproc = 'int4_asc_cmp'::regproc
+));
+
+# Index corruption should now be reported
+$node->command_checks_all(
+	[ 'pg_amcheck', '-p', $node->port, 'postgres' ],
+	2,
+	[ qr/item order invariant violated for index "fickleidx"/ ],
+	[ ],
+	'pg_amcheck all schemas, tables and indexes reports fickleidx corruption'
+);
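The reason the pg_amproc swap above corrupts the index can be seen in miniature: a btree descent trusts its comparison support function, so data ordered under one comparator and then searched with a different one silently misses keys. A small Python sketch of the effect (illustrative only, not part of the patch):

```python
from functools import cmp_to_key

# Comparators analogous to int4_asc_cmp / int4_desc_cmp in the test above.
def asc_cmp(a, b):
    return (a > b) - (a < b)

def desc_cmp(a, b):
    return (b > a) - (b < a)

# "Build the index": sort 1..1000 with the ascending comparator.
keys = sorted(range(1, 1001), key=cmp_to_key(asc_cmp))

def probe(sorted_keys, target, cmp):
    # A binary search that trusts cmp, the way a btree descent trusts its
    # opclass support function.  Returns the index of target, or None.
    lo, hi = 0, len(sorted_keys)
    while lo < hi:
        mid = (lo + hi) // 2
        c = cmp(target, sorted_keys[mid])
        if c == 0:
            return mid
        elif c < 0:
            hi = mid
        else:
            lo = mid + 1
    return None

# With the comparator used at build time, lookups work.
assert probe(keys, 617, asc_cmp) == 616
# After "swapping the support function", the same key is unreachable.
assert probe(keys, 617, desc_cmp) is None
```

bt_index_check's item-order invariant catches exactly this: adjacent index tuples that are out of order under the comparator currently installed for the opclass.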
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index d3ca4b6932..7e101f7c11 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -185,6 +185,7 @@ pages.
   </para>
 
  &oid2name;
+ &pgamcheck;
  &vacuumlo;
  </sect1>
 
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index db1d369743..5115cb03d0 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -133,6 +133,7 @@
 <!ENTITY oldsnapshot     SYSTEM "oldsnapshot.sgml">
 <!ENTITY pageinspect     SYSTEM "pageinspect.sgml">
 <!ENTITY passwordcheck   SYSTEM "passwordcheck.sgml">
+<!ENTITY pgamcheck       SYSTEM "pgamcheck.sgml">
 <!ENTITY pgbuffercache   SYSTEM "pgbuffercache.sgml">
 <!ENTITY pgcrypto        SYSTEM "pgcrypto.sgml">
 <!ENTITY pgfreespacemap  SYSTEM "pgfreespacemap.sgml">
diff --git a/doc/src/sgml/pgamcheck.sgml b/doc/src/sgml/pgamcheck.sgml
new file mode 100644
index 0000000000..d5b145a133
--- /dev/null
+++ b/doc/src/sgml/pgamcheck.sgml
@@ -0,0 +1,701 @@
+<!-- doc/src/sgml/pgamcheck.sgml -->
+
+<refentry id="pgamcheck">
+ <indexterm zone="pgamcheck">
+  <primary>pg_amcheck</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle><application>pg_amcheck</application></refentrytitle>
+  <manvolnum>1</manvolnum>
+  <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>pg_amcheck</refname>
+  <refpurpose>checks for corruption in one or more
+  <productname>PostgreSQL</productname> databases</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+  <cmdsynopsis>
+   <command>pg_amcheck</command>
+   <arg rep="repeat"><replaceable>option</replaceable></arg>
+   <arg><replaceable>dbname</replaceable></arg>
+  </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+   <application>pg_amcheck</application> supports running
+   <xref linkend="amcheck"/>'s corruption checking functions against one or
+   more databases, with options to select which schemas, tables and indexes to
+   check, which kinds of checking to perform, and whether to perform the checks
+   in parallel, and if so, the number of parallel connections to establish and
+   use.
+  </para>
+
+  <para>
+   Only table relations and btree indexes are currently supported.  Other
+   relation types are silently skipped.
+  </para>
+
+ </refsect1>
+
+ <refsect1>
+  <title>Options</title>
+
+  <para>
+   <application>pg_amcheck</application> accepts the following command-line arguments:
+
+   <variablelist>
+
+    <varlistentry>
+     <term><option><replaceable class="parameter">dbname</replaceable></option></term>
+     <listitem>
+      <para>
+       Specifies the name of a database to be checked, or a connection string
+       to use while connecting.
+      </para>
+      <para>
+       If no <replaceable>dbname</replaceable> is specified, and if
+       <option>-a</option>/<option>--all</option> is not used, the database name
+       is read from the environment variable <envar>PGDATABASE</envar>.  If
+       that is not set, the user name specified for the connection is used.
+       The <replaceable>dbname</replaceable> can be a <link
+       linkend="libpq-connstring">connection string</link>.  If so, connection
+       string parameters will override any conflicting command line options,
+       and connection string parameters other than the database
+       name itself will be re-used when connecting to other databases.
+      </para>
+      <para>
+       If a connection string is given that contains no database name, the
+       database name to use is determined as described above, and the string's
+       other parameters are used for the connection.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-a</option></term>
+     <term><option>--all</option></term>
+     <listitem>
+      <para>
+       Perform checking in all databases which are not otherwise excluded.
+      </para>
+      <para>
+       In the absence of any other options, selects all objects across all
+       schemas and databases.
+      </para>
+      <para>
+       Option <option>-D</option>/<option>--exclude-database</option> takes
+       precedence over <option>-a</option>/<option>--all</option>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-d <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--database=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Perform checking in databases matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>
+       that are not otherwise excluded.
+      </para>
+      <para>
+       This option may be specified multiple times to list more than one
+       pattern.  By default, all objects in all matching databases will be
+       checked.
+      </para>
+      <para>
+       If <option>-a</option>/<option>--all</option> is also specified,
+       <option>-d</option>/<option>--database</option> has no effect.
+      </para>
+      <para>
+       Option <option>-D</option>/<option>--exclude-database</option> takes
+       precedence over <option>-d</option>/<option>--database</option>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-D <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--exclude-database=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Do not include databases matching other patterns or included by option
+       <option>-a</option>/<option>--all</option> if they also match the
+       specified exclusion
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>.
+      </para>
+      <para>
+       This does not exclude any database that was listed explicitly as a
+       <replaceable>dbname</replaceable> on the command line, nor does it exclude
+       the database chosen in the absence of any
+       <replaceable>dbname</replaceable> argument.
+      </para>
+      <para>
+       This option may be specified multiple times to list more than one
+       exclusion pattern.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-e</option></term>
+     <term><option>--echo</option></term>
+     <listitem>
+      <para>
+       Print to stdout all commands and queries being executed against the
+       server.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--endblock=<replaceable class="parameter">block</replaceable></option></term>
+     <listitem>
+      <para>
+       Skip (do not check) all pages after the given ending
+       <replaceable>block</replaceable>.
+      </para>
+      <para>
+       By default, no pages are skipped.  This option will be applied to all
+       table relations that are checked, including toast tables, but note that
+       unless <option>--exclude-toast-pointers</option> is given, toast
+       pointers found in the main table will be followed into the toast table
+       without regard to the location in the toast table.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--exclude-toast-pointers</option></term>
+     <listitem>
+      <para>
+       When checking main relations, do not look up entries in toast tables
+       corresponding to toast pointers in the main relation.
+      </para>
+      <para>
+       The default behavior checks each toast pointer encountered in the main
+       table to verify, as much as possible, that the pointer points at
+       something in the toast table that is reasonable.  Toast pointers which
+       point beyond the end of the toast table, or to the middle (rather than
+       the beginning) of a toast entry, are identified as corrupt.
+      </para>
+      <para>
+       The process by which <xref linkend="amcheck"/>'s
+       <function>verify_heapam</function> function checks each toast pointer is
+       slow and may be improved in a future release.  Some users may wish to
+       disable this check to save time.
+      </para>
+     </listitem>
+    </varlistentry>
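The toast-pointer validation described above can be sketched in miniature. The following Python model is purely illustrative (pg_amcheck performs these checks server-side via amcheck's `verify_heapam`); the chunk layout, sizes, and helper names here are invented for the sketch:

```python
# Illustrative sketch (not amcheck's real code): model a toast table as a
# dict mapping value id -> list of stored chunk lengths, and report
# pointers that run past the end of their value or store the wrong size.

def check_toast_pointer(pointer, toast_table, chunk_size=2000):
    """Return a list of corruption messages for one toast pointer.

    pointer: (value_id, total_size) -- the toast value id and its full size
    toast_table: dict mapping value_id -> list of chunk lengths
    """
    problems = []
    value_id, total_size = pointer
    chunks = toast_table.get(value_id)
    if chunks is None:
        # Pointer refers to a value with no chunks at all
        problems.append(f"toast value {value_id} missing from toast table")
        return problems
    expected = -(-total_size // chunk_size)  # ceiling division
    if len(chunks) < expected:
        problems.append(
            f"toast value {value_id} ends at chunk {len(chunks) - 1}, "
            f"expected {expected} chunks")
    stored = sum(chunks)
    if stored != total_size:
        problems.append(
            f"toast value {value_id} stores {stored} bytes, "
            f"expected {total_size}")
    return problems

toast = {1: [2000, 2000, 500]}                # a complete 4500-byte value
print(check_toast_pointer((1, 4500), toast))  # []
print(check_toast_pointer((1, 9000), toast))  # truncated value reported
print(check_toast_pointer((2, 100), toast))   # missing value reported
```

With `--exclude-toast-pointers`, this whole class of per-pointer lookup is skipped, which is why the option can save substantial time on heavily toasted tables.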
+
+    <varlistentry>
+     <term><option>--heapallindexed</option></term>
+     <listitem>
+      <para>
+       For each index checked, verify the presence of all heap tuples as index
+       tuples in the index using <xref linkend="amcheck"/>'s
+       <option>heapallindexed</option> option.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-?</option></term>
+     <term><option>--help</option></term>
+     <listitem>
+      <para>
+       Show help about <application>pg_amcheck</application> command line
+       arguments, and exit.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-h <replaceable class="parameter">hostname</replaceable></option></term>
+     <term><option>--host=<replaceable class="parameter">hostname</replaceable></option></term>
+     <listitem>
+      <para>
+       Specifies the host name of the machine on which the server is running.
+       If the value begins with a slash, it is used as the directory for the
+       Unix domain socket.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-i <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--index=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Perform checks on indexes which match the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>
+       unless they are otherwise excluded.
+      </para>
+      <para>
+       This is similar to the <option>-r</option>/<option>--relation</option>
+       option, except that it applies only to indexes, not tables.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-I <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--exclude-index=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Exclude checks on the indexes which match the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>.
+      </para>
+      <para>
+       This is similar to the
+       <option>-R</option>/<option>--exclude-relation</option> option, except
+       that it applies only to indexes, not tables.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-j <replaceable class="parameter">num</replaceable></option></term>
+     <term><option>--jobs=<replaceable class="parameter">num</replaceable></option></term>
+     <listitem>
+      <para>
+       Use <replaceable>num</replaceable> concurrent connections to the server,
+       or one per object to be checked, whichever number is smaller.
+      </para>
+      <para>
+       The default is to use a single connection.
+      </para>
+     </listitem>
+    </varlistentry>
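The connection cap described for <option>--jobs</option> (the smaller of the requested job count and the number of objects) can be sketched conceptually. This is an illustrative Python sketch using threads; the real tool drives libpq connections through its parallel-slot machinery, and `check_object` here is a made-up stand-in:

```python
# Conceptual sketch of pg_amcheck's --jobs behavior: use at most
# min(jobs, number of objects) concurrent workers, dispatching each
# object to whichever worker becomes idle.
from concurrent.futures import ThreadPoolExecutor

def check_object(obj):
    # Stand-in for issuing a verification query over one connection
    return f"checked {obj}"

def run_checks(objects, jobs=1):
    # Never open more workers than there are objects to check
    workers = min(jobs, len(objects)) or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(check_object, objects))

print(run_checks(["t1", "t1_idx", "pg_toast.t1"], jobs=8))
```

Even with `jobs=8`, only three workers are used here, mirroring the documented "whichever number is smaller" rule.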
+
+    <varlistentry>
+     <term><option>--maintenance-db=<replaceable class="parameter">dbname</replaceable></option></term>
+     <listitem>
+      <para>
+       Specifies the name of the database to connect to when discovering
+       which databases should be checked, when
+       <option>-a</option>/<option>--all</option> is used.  If not specified,
+       the <literal>postgres</literal> database will be used, or if that does
+       not exist, <literal>template1</literal> will be used.  This can be a
+       <link linkend="libpq-connstring">connection string</link>.  If so,
+       connection string parameters will override any conflicting command line
+       options.  Also, connection string parameters other than the database
+       name itself will be re-used when connecting to other databases.
+      </para>
+      <para>
+       If a connection string is given which contains no database name, the other
+       parameters of the string will be used while the database name to use is
+       determined as described above.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--no-dependent-indexes</option></term>
+     <listitem>
+      <para>
+       When including a table relation in the list of relations to check, do
+       not automatically include btree indexes associated with the table.
+      </para>
+      <para>
+       By default, all tables to be checked will also have checks performed on
+       their associated btree indexes, if any.  If this option is given, only
+       those indexes which match a <option>--relation</option> or
+       <option>--index</option> pattern will be checked.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--no-strict-names</option></term>
+     <listitem>
+      <para>
+       When calculating the list of databases to check, and the objects within
+       those databases to be checked, do not raise an error for database,
+       schema, relation, table, or index inclusion patterns which match no
+       corresponding objects.
+      </para>
+      <para>
+       Exclusion patterns are never required to match any objects, but by
+       default an inclusion pattern that matches nothing raises an error.
+       This includes the case where the pattern fails to match only because
+       an exclusion pattern filtered out the object it would have matched,
+       and the case where it fails to match a database because that database
+       is unconnectable (<literal>datallowconn</literal> is false).
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--no-dependent-toast</option></term>
+     <listitem>
+      <para>
+       When including a table relation in the list of relations to check, do
+       not automatically include toast tables associated with the table.
+      </para>
+      <para>
+       By default, all tables to be checked will also have checks performed on
+       their associated toast tables, if any.  If this option is given, only
+       those toast tables which match a <option>--relation</option> or
+       <option>--table</option> pattern will be checked.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--on-error-stop</option></term>
+     <listitem>
+      <para>
+       After reporting all corruptions on the first page of a table where
+       corruptions are found, stop processing that table relation and move on
+       to the next table or index.
+      </para>
+      <para>
+       Note that index checking always stops after the first corrupt page.
+       This option therefore affects only the checking of table relations.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--parent-check</option></term>
+     <listitem>
+      <para>
+       For each btree index checked, use <xref linkend="amcheck"/>'s
+       <function>bt_index_parent_check</function> function, which performs
+       additional checks of parent/child relationships during index checking.
+      </para>
+      <para>
+       The default is to use <application>amcheck</application>'s
+       <function>bt_index_check</function> function, but note that use of the
+       <option>--rootdescend</option> option implicitly selects
+       <function>bt_index_parent_check</function>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-p <replaceable class="parameter">port</replaceable></option></term>
+     <term><option>--port=<replaceable class="parameter">port</replaceable></option></term>
+     <listitem>
+      <para>
+       Specifies the TCP port or local Unix domain socket file extension on
+       which the server is listening for connections.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-P</option></term>
+     <term><option>--progress</option></term>
+     <listitem>
+      <para>
+       Show progress information about how many relations have been checked.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-q</option></term>
+     <term><option>--quiet</option></term>
+     <listitem>
+      <para>
+       Do not write additional messages beyond those about corruption.
+      </para>
+      <para>
+       This option does not suppress output produced as a result of the
+       <option>-e</option>/<option>--echo</option> option.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-r <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--relation=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Perform checking on all relations matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>
+       unless they are otherwise excluded.
+      </para>
+      <para>
+       This option may be specified multiple times to list more than one
+       pattern.
+      </para>
+      <para>
+       Patterns may be unqualified, or they may be schema-qualified or
+       database- and schema-qualified, such as
+       <literal>"my*relation"</literal>,
+       <literal>"my*schema*.my*relation*"</literal>, or
+       <literal>"my*database.my*schema.my*relation"</literal>.  It is fine to
+       specify a relation pattern that matches databases which are not
+       otherwise included; matching relations in such databases will still
+       be checked.
+      </para>
+      <para>
+       The <option>-R</option>/<option>--exclude-relation</option> option
+       takes precedence over <option>-r</option>/<option>--relation</option>.
+     </listitem>
+    </varlistentry>
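The qualified-pattern matching described above can be sketched with a simplified model. The real semantics are psql's pattern rules (which also cover double quoting and regular-expression characters); this Python sketch handles only dotted qualification and the `*` and `?` wildcards, and the helper names are invented:

```python
# Simplified sketch of psql-style pattern matching as used by
# -r/--relation: split an optionally qualified pattern on dots, then
# translate shell-style wildcards (* and ?) into anchored regexes.
import re

def pattern_to_regexes(pattern):
    regexes = []
    for part in pattern.split('.'):
        rx = ''.join('.*' if c == '*' else '.' if c == '?' else re.escape(c)
                     for c in part)
        regexes.append(re.compile('^' + rx + '$'))
    return regexes

def matches(pattern, *names):
    """Match (database, schema, relation) names against a pattern,
    aligning an unqualified pattern with the trailing name(s) only."""
    regexes = pattern_to_regexes(pattern)
    if len(regexes) > len(names):
        return False
    tail = names[-len(regexes):]
    return all(r.match(n) for r, n in zip(regexes, tail))

print(matches('my*relation', 'mydb', 'public', 'my_big_relation'))  # True
print(matches('mydb.public.my*relation',
              'mydb', 'public', 'my_big_relation'))                 # True
print(matches('otherdb.*.my*', 'mydb', 'public', 'my_big_relation'))  # False
```

The last case shows why a database-qualified relation pattern can pull in databases that no other option selected: the database component of the pattern is matched independently.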
+
+    <varlistentry>
+     <term><option>-R <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--exclude-relation=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Exclude checks on relations matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>.
+      </para>
+      <para>
+       As with <option>-r</option>/<option>--relation</option>, the
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link> may be unqualified, schema-qualified,
+       or database- and schema-qualified.
+      </para>
+      <para>
+       The <option>-R</option>/<option>--exclude-relation</option> option
+       takes precedence over <option>-r</option>/<option>--relation</option>,
+       <option>-t</option>/<option>--table</option>, and
+       <option>-i</option>/<option>--index</option>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--rootdescend</option></term>
+     <listitem>
+      <para>
+       For each index checked, re-find tuples on the leaf level by performing a
+       new search from the root page for each tuple using
+       <xref linkend="amcheck"/>'s <option>rootdescend</option> option.
+      </para>
+      <para>
+       Use of this option implicitly also selects the
+       <option>--parent-check</option> option.
+      </para>
+      <para>
+       This form of verification was originally written to help in the
+       development of btree index features.  It may be of limited use or even
+       of no use in helping detect the kinds of corruption that occur in
+       practice.  It may also cause corruption checking to take considerably
+       longer and consume considerably more resources on the server.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-s <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--schema=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Perform checking in schemas matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link> that are not otherwise excluded.
+      </para>
+      <para>
+       This option may be specified multiple times to list more than one
+       pattern for checking.  By default, all objects in all matching schemas
+       will be checked.
+      </para>
+      <para>
+       The <option>-S</option>/<option>--exclude-schema</option> option takes
+       precedence over <option>-s</option>/<option>--schema</option>.
+      </para>
+      <para>
+       Note that both tables and indexes are included using this option, which
+       might not be what you want if you are also using
+       <option>--no-dependent-indexes</option>.  To specify all tables in a
+       schema without also specifying all indexes, <option>--table</option> can
+       be used with a pattern that specifies the schema.  For example, to check
+       all tables in schema <literal>corp</literal>, the option
+       <literal>--table="corp.*"</literal> may be used.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-S <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--exclude-schema=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Do not perform checking in schemas matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>.
+      </para>
+      <para>
+       This option may be specified multiple times to list more than one
+       pattern for exclusion.
+      </para>
+      <para>
+       If a schema which is included using
+       <option>-s</option>/<option>--schema</option> is also excluded using
+       <option>-S</option>/<option>--exclude-schema</option>, the schema will
+       be excluded.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--skip=<replaceable class="parameter">option</replaceable></option></term>
+     <listitem>
+      <para>
+       If <literal>"all-frozen"</literal> is given, table corruption checks
+       will skip over pages in all tables that are marked as all frozen.
+      </para>
+      <para>
+       If <literal>"all-visible"</literal> is given, table corruption checks
+       will skip over pages in all tables that are marked as all visible.
+      </para>
+      <para>
+       By default, no pages are skipped.  This can be specified as
+       <literal>"none"</literal>, but since this is the default, it need not be
+       mentioned.
+      </para>
+     </listitem>
+    </varlistentry>
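The page-skipping behavior of <option>--skip</option> can be illustrated with a toy model of per-page visibility-map flags. This is a conceptual sketch only; the server tracks these bits in each relation's visibility map fork, and the flag encoding below is invented for illustration:

```python
# Conceptual sketch of --skip: given per-page visibility-map flags,
# select which block numbers a table check would actually read.

ALL_VISIBLE = 0x1
ALL_FROZEN = 0x2

def blocks_to_check(vm_flags, skip='none'):
    blocks = []
    for blkno, flags in enumerate(vm_flags):
        if skip == 'all-frozen' and flags & ALL_FROZEN:
            continue  # page is marked all frozen; skip it
        if skip == 'all-visible' and flags & ALL_VISIBLE:
            continue  # page is marked all visible; skip it
        blocks.append(blkno)
    return blocks

# Page 0: neither bit; page 1: all-visible; page 2: all-visible + all-frozen
vm = [0, ALL_VISIBLE, ALL_VISIBLE | ALL_FROZEN]
print(blocks_to_check(vm))                 # [0, 1, 2]
print(blocks_to_check(vm, 'all-visible'))  # [0]
print(blocks_to_check(vm, 'all-frozen'))   # [0, 1]
```

Since all-frozen pages are also all-visible, `all-visible` skips at least as many pages as `all-frozen`, as the example shows.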
+
+    <varlistentry>
+     <term><option>--startblock=<replaceable class="parameter">block</replaceable></option></term>
+     <listitem>
+      <para>
+       Skip (do not check) all pages before the given starting
+       <replaceable>block</replaceable>.
+      </para>
+      <para>
+       By default, no pages are skipped.  This option will be applied to all
+       table relations that are checked, including toast tables, but note
+       that unless <option>--exclude-toast-pointers</option> is given, toast
+       pointers found in the main table will be followed into the toast table
+       without regard to the location in the toast table.
+      </para>
+     </listitem>
+    </varlistentry>
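The interaction of <option>--startblock</option> and <option>--endblock</option> with a relation's actual length can be sketched as a simple range clamp; this is a hypothetical helper for illustration, not pg_amcheck's implementation:

```python
# Sketch of --startblock/--endblock handling: clamp the requested
# block range to the pages that actually exist in the relation.
def block_range(nblocks, startblock=0, endblock=None):
    last = nblocks - 1 if endblock is None else min(endblock, nblocks - 1)
    return range(max(0, startblock), last + 1)

print(list(block_range(10)))                # all ten pages
print(list(block_range(10, startblock=3)))  # pages 3..9
print(list(block_range(10, endblock=4)))    # pages 0..4
```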
+
+    <varlistentry>
+     <term><option>-t <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--table=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Perform checks on all tables matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>
+       unless they are otherwise excluded.
+      </para>
+      <para>
+       This is similar to the <option>-r</option>/<option>--relation</option>
+       option, except that it applies only to tables, not indexes.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-T <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--exclude-table=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Exclude checks on tables matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>.
+      </para>
+      <para>
+       This is similar to the
+       <option>-R</option>/<option>--exclude-relation</option> option, except
+       that it applies only to tables, not indexes.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-U <replaceable class="parameter">username</replaceable></option></term>
+     <term><option>--username=<replaceable class="parameter">username</replaceable></option></term>
+     <listitem>
+      <para>
+       User name to connect as.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-v</option></term>
+     <term><option>--verbose</option></term>
+     <listitem>
+      <para>
+       Increases the log level verbosity.  This option may be given more than
+       once.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-V</option></term>
+     <term><option>--version</option></term>
+     <listitem>
+      <para>
+       Print the <application>pg_amcheck</application> version and exit.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-w</option></term>
+     <term><option>--no-password</option></term>
+     <listitem>
+      <para>
+       Never issue a password prompt.  If the server requires password
+       authentication and a password is not available by other means such as
+       a <filename>.pgpass</filename> file, the connection attempt will fail.
+       This option can be useful in batch jobs and scripts where no user is
+       present to enter a password.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-W</option></term>
+     <term><option>--password</option></term>
+     <listitem>
+      <para>
+       Force <application>pg_amcheck</application> to prompt for a password
+       before connecting to a database.
+      </para>
+      <para>
+       This option is never essential, since
+       <application>pg_amcheck</application> will automatically prompt for a
+       password if the server demands password authentication.  However,
+       <application>pg_amcheck</application> will waste a connection attempt
+       finding out that the server wants a password.  In some cases it is
+       worth typing <option>-W</option> to avoid the extra connection attempt.
+      </para>
+     </listitem>
+    </varlistentry>
+
+   </variablelist>
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+   <application>pg_amcheck</application> is designed to work with
+   <productname>PostgreSQL</productname> 14.0 and later.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Author</title>
+
+  <para>
+   Mark Dilger <email>mark.dilger@enterprisedb.com</email>
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>See Also</title>
+
+  <simplelist type="inline">
+   <member><xref linkend="amcheck"/></member>
+  </simplelist>
+ </refsect1>
+</refentry>
diff --git a/src/tools/msvc/Install.pm b/src/tools/msvc/Install.pm
index ea3af48777..49ad558b74 100644
--- a/src/tools/msvc/Install.pm
+++ b/src/tools/msvc/Install.pm
@@ -18,7 +18,7 @@ our (@ISA, @EXPORT_OK);
 @EXPORT_OK = qw(Install);
 
 my $insttype;
-my @client_contribs = ('oid2name', 'pgbench', 'vacuumlo');
+my @client_contribs = ('oid2name', 'pg_amcheck', 'pgbench', 'vacuumlo');
 my @client_program_files = (
 	'clusterdb',      'createdb',   'createuser',    'dropdb',
 	'dropuser',       'ecpg',       'libecpg',       'libecpg_compat',
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 49614106dc..f680544e07 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -33,9 +33,9 @@ my @unlink_on_exit;
 
 # Set of variables for modules in contrib/ and src/test/modules/
 my $contrib_defines = { 'refint' => 'REFINT_VERBOSE' };
-my @contrib_uselibpq = ('dblink', 'oid2name', 'postgres_fdw', 'vacuumlo');
-my @contrib_uselibpgport   = ('oid2name', 'vacuumlo');
-my @contrib_uselibpgcommon = ('oid2name', 'vacuumlo');
+my @contrib_uselibpq = ('dblink', 'oid2name', 'pg_amcheck', 'postgres_fdw', 'vacuumlo');
+my @contrib_uselibpgport   = ('oid2name', 'pg_amcheck', 'vacuumlo');
+my @contrib_uselibpgcommon = ('oid2name', 'pg_amcheck', 'vacuumlo');
 my $contrib_extralibs      = undef;
 my $contrib_extraincludes = { 'dblink' => ['src/backend'] };
 my $contrib_extrasource = {
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 8ef71bd900..2eb3d12310 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -101,6 +101,7 @@ AlterUserMappingStmt
 AlteredTableInfo
 AlternativeSubPlan
 AlternativeSubPlanState
+AmcheckOptions
 AnalyzeAttrComputeStatsFunc
 AnalyzeAttrFetchFunc
 AnalyzeForeignTable_function
@@ -499,6 +500,7 @@ DSA
 DWORD
 DataDumperPtr
 DataPageDeleteStack
+DatabaseInfo
 DateADT
 Datum
 DatumTupleFields
@@ -1802,6 +1804,8 @@ PathHashStack
 PathKey
 PathKeysComparison
 PathTarget
+PatternInfo
+PatternInfoArray
 Pattern_Prefix_Status
 Pattern_Type
 PendingFsyncEntry
@@ -2084,6 +2088,7 @@ RelToCluster
 RelabelType
 Relation
 RelationData
+RelationInfo
 RelationPtr
 RelationSyncEntry
 RelcacheCallbackFunction
-- 
2.21.1 (Apple Git-122.3)

Attachment: v44-0003-Extending-PostgresNode-to-test-corruption.patch
From 773c47eb1c1c77e7488e2a9150e8087f2a1631fa Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Tue, 2 Feb 2021 12:37:58 -0800
Subject: [PATCH v44 3/3] Extending PostgresNode to test corruption.

PostgresNode now has functions for overwriting relation files
with full or partial prior versions of those files, creating
corruption beyond merely twiddling the bits of a heap relation
file.

Adding a regression test for pg_amcheck based on this new
functionality.
---
 contrib/pg_amcheck/t/006_relfile_damage.pl    | 145 ++++++++++
 src/test/modules/Makefile                     |   1 +
 src/test/modules/corruption/Makefile          |  16 ++
 .../modules/corruption/t/001_corruption.pl    |  83 ++++++
 src/test/perl/PostgresNode.pm                 | 265 ++++++++++++++++++
 5 files changed, 510 insertions(+)
 create mode 100644 contrib/pg_amcheck/t/006_relfile_damage.pl
 create mode 100644 src/test/modules/corruption/Makefile
 create mode 100644 src/test/modules/corruption/t/001_corruption.pl

diff --git a/contrib/pg_amcheck/t/006_relfile_damage.pl b/contrib/pg_amcheck/t/006_relfile_damage.pl
new file mode 100644
index 0000000000..45ad223531
--- /dev/null
+++ b/contrib/pg_amcheck/t/006_relfile_damage.pl
@@ -0,0 +1,145 @@
+use strict;
+use warnings;
+
+use TestLib;
+use Test::More tests => 22;
+use PostgresNode;
+
+my ($node, $port);
+
+# Returns the name of the toast relation associated with the named relation.
+#
+# Assumes the test node is running
+sub relation_toast($$)
+{
+	my ($dbname, $relname) = @_;
+
+	my $rel = $node->safe_psql($dbname, qq(
+		SELECT ct.relname
+			FROM pg_catalog.pg_class cr, pg_catalog.pg_class ct
+			WHERE cr.oid = '$relname'::regclass
+			  AND cr.reltoastrelid = ct.oid
+			));
+	return undef unless defined $rel;
+	return "pg_toast.$rel";
+}
+
+# Test set-up
+$node = get_new_node('test');
+$node->init;
+$node->start;
+$port = $node->port;
+
+# Load the amcheck extension, upon which pg_amcheck depends
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+
+# Create a table with a btree index.  Use a fillfactor for the table and index
+# that will allow some fraction of updates to be on the original pages and some
+# on new pages.
+#
+$node->safe_psql('postgres', qq(
+create schema t;
+create table t.t1 (id integer, t text) with (fillfactor=75);
+alter table t.t1 alter column t set storage external;
+insert into t.t1 select gs, repeat('x',gs) from generate_series(9990,10000) gs;
+create index t1_idx on t.t1 (id) with (fillfactor=75);
+));
+
+my $toastrel = relation_toast('postgres', 't.t1');
+
+# Flush relation files to disk and take snapshots of the toast and index
+#
+$node->restart;
+$node->take_relfile_snapshot_minimal('postgres', 'idx', 't.t1_idx');
+$node->take_relfile_snapshot_minimal('postgres', 'toast', $toastrel);
+
+# Insert new data into the table and index
+#
+$node->safe_psql('postgres', qq(
+insert into t.t1 select gs, repeat('y',gs) from generate_series(10001,10100) gs;
+));
+
+# Revert index.  The reverted snapshot file is not corrupt, but it also
+# does not match the current contents of the table.
+#
+$node->stop;
+$node->revert_to_snapshot('idx');
+
+# Restart the node and check table and index with varying options.
+#
+$node->start;
+
+# Checks which do not reconcile the index and table via --heapallindexed will
+# not notice any problems
+#
+$node->command_like(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*' ],
+	qr/^$/,
+	'pg_amcheck reverted index at default checking level');
+
+$node->command_like(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*', '--parent-check' ],
+	qr/^$/,
+	'pg_amcheck reverted index with --parent-check');
+
+$node->command_like(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*', '--rootdescend' ],
+	qr/^$/,
+	'pg_amcheck reverted index with --rootdescend');
+
+# Checks which do reconcile the index and table via --heapallindexed will
+# notice the mismatch in their contents
+#
+$node->command_checks_all(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*', '--heapallindexed' ],
+	2,
+	[ qr/heap tuple .* from table "t1" lacks matching index tuple within index "t1_idx"/ ],
+	[ ],
+	'pg_amcheck reverted index with --heapallindexed');
+
+$node->command_checks_all(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*', '--heapallindexed', '--rootdescend' ],
+	2,
+	[ qr/heap tuple .* from table "t1" lacks matching index tuple within index "t1_idx"/ ],
+	[ ],
+	'pg_amcheck reverted index with --heapallindexed --rootdescend');
+
+# Revert the toast.  The reverted toast table is not corrupt, but it does not
+# have entries for all toast pointers in the main table
+#
+$node->stop;
+$node->revert_to_snapshot('toast');
+
+# Restart the node and check table and toast with varying options.  When
+# checking the toast pointers, we may get errors produced by verify_heapam, but
+# we may also get errors from failure to read toast blocks that are beyond the
+# end of the toast table, of the form /ERROR:  could not read block/.  To avoid
+# having a brittle test, we accept any error message.
+#
+$node->start;
+
+$node->command_checks_all(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', $toastrel ],
+	0,
+	[ qr/^$/ ],
+	[ ],
+	'pg_amcheck reverted toast table');
+
+$node->command_checks_all(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*', '--exclude-toast-pointers' ],
+	0,
+	[ qr/^$/ ],
+	[ ],
+	'pg_amcheck with reverted toast using --exclude-toast-pointers');
+
+$node->command_checks_all(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*' ],
+	2,
+	[ qr/.+/ ],			# Any non-empty error message is acceptable
+	[ ],
+	'pg_amcheck with reverted toast and default checking');
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 5391f461a2..c92d1702b4 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -7,6 +7,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = \
 		  brin \
 		  commit_ts \
+		  corruption \
 		  delay_execution \
 		  dummy_index_am \
 		  dummy_seclabel \
diff --git a/src/test/modules/corruption/Makefile b/src/test/modules/corruption/Makefile
new file mode 100644
index 0000000000..ba461c645d
--- /dev/null
+++ b/src/test/modules/corruption/Makefile
@@ -0,0 +1,16 @@
+# src/test/modules/corruption/Makefile
+
+# EXTRA_INSTALL = contrib/pg_amcheck
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/corruption
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/corruption/t/001_corruption.pl b/src/test/modules/corruption/t/001_corruption.pl
new file mode 100644
index 0000000000..ae4a262e06
--- /dev/null
+++ b/src/test/modules/corruption/t/001_corruption.pl
@@ -0,0 +1,83 @@
+use strict;
+use warnings;
+
+use TestLib;
+use Test::More tests => 10;
+use PostgresNode;
+
+my $node = get_new_node('test');
+$node->init;
+$node->start;
+
+# Create something non-trivial for the first snapshot
+$node->safe_psql('postgres', qq(
+create table t1 (id integer, short_text text, long_text text);
+insert into t1 (id, short_text, long_text)
+	(select gs, 'foo', repeat('x', gs)
+		from generate_series(1,10000) gs);
+create unique index idx1 on t1 (id, short_text);
+vacuum freeze;
+));
+
+# Flush relation files to disk and take snapshot of them
+$node->restart;
+$node->take_relfile_snapshot('postgres', 'snap1', 'public.t1');
+
+# Update data in the table, toast table, and index
+$node->safe_psql('postgres', qq(
+update t1 set
+	short_text = 'bar',
+	long_text = repeat('y', id);
+));
+
+# Flush relation files to disk and take second snapshot
+$node->restart;
+$node->take_relfile_snapshot('postgres', 'snap2', 'public.t1');
+
+# Revert the first page of t1 using a torn snapshot.  This should be a partial
+# and corrupt reverting of the update.
+$node->stop;
+$node->revert_to_torn_relfile_snapshot('snap1', 8192);
+
+# Restart the node and count the number of rows in t1 with the original
+# (pre-update) values.  It should not be zero, but nor will it be the full
+# 10000.
+$node->start;
+my ($old, $new, $oldtoast, $newtoast) = counts();
+ok($old > 0 && $old < 10000, "Torn snapshot reverts some of the main updates");
+ok($new > 0 && $new <= 10000, "Torn snapshot retains some of the main updates");
+
+# Revert t1 fully to the first snapshot.  This should fully restore the
+# original (pre-update) values.
+$node->stop;
+$node->revert_to_snapshot('snap1');
+
+# Restart the node and verify only old values remain
+$node->start;
+($old, $new, $oldtoast, $newtoast) = counts();
+is($old, 10000, "Full snapshot restores all the old main values");
+is($oldtoast, 10000, "Full snapshot restores all the old toast values");
+is($new, 0, "Full snapshot reverts all the new main values");
+is($newtoast, 0, "Full snapshot reverts all the new toast values");
+
+# Restore t1 fully to the second snapshot.  This should fully restore the
+# new (post-update) values.
+$node->stop;
+$node->revert_to_snapshot('snap2');
+
+# Restart the node and verify only new values remain
+$node->start;
+($old, $new, $oldtoast, $newtoast) = counts();
+is($old, 0, "Full snapshot reverts all the old main values");
+is($oldtoast, 0, "Full snapshot reverts all the old toast values");
+is($new, 10000, "Full snapshot restores all the new main values");
+is($newtoast, 10000, "Full snapshot restores all the new toast values");
+
+sub counts {
+	return map {
+		$node->safe_psql('postgres', qq(select count(*) from t1 where $_))
+	} ("short_text = 'foo'",
+	   "short_text = 'bar'",
+	   "long_text ~ 'x'",
+	   "long_text ~ 'y'");
+}
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 9667f7667e..5402d020f1 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -2225,6 +2225,271 @@ sub pg_recvlogical_upto
 
 =back
 
+=head1 DATABASE CORRUPTION METHODS
+
+=over
+
+=item $node->relfile_snapshot_repository()
+
+The path to the parent directory of all directories storing snapshots of
+relation backing files.
+
+=cut
+
+sub relfile_snapshot_repository
+{
+	my ($self) = @_;
+	my $snaprepo = join('/', $self->basedir, 'snapshot');
+	unless (-d $snaprepo)
+	{
+		mkdir $snaprepo
+			or $!{EEXIST}
+			or BAIL_OUT("could not create snapshot repository directory \"$snaprepo\": $!");
+	}
+	return $snaprepo;
+}
+
+=pod
+
+=item $node->relfile_snapshot_directory(snapname)
+
+The path to the directory for storing the named snapshot.
+
+=cut
+
+sub relfile_snapshot_directory
+{
+	my ($self, $snapname) = @_;
+
+	join("/", $self->relfile_snapshot_repository(), $snapname);
+}
+
+=pod
+
+=item $node->take_relfile_snapshot($self, $dbname, $snapname, @relnames)
+
+Makes a copy of the files backing the relations B<@relnames>, the associated
+toast relations (if any), and all associated indexes (if any).  No attempt is
+made to flush these files to disk, meaning the snapshot taken could be stale
+unless the caller ensures these files have been flushed prior to calling.
+
+Dies on failure to invoke psql.
+
+Dies on missing relations.
+
+Dies if the given B<$snapname> is already in use.
+
+=cut
+
+=pod
+
+=item $node->take_relfile_snapshot_minimal($self, $dbname, $snapname, @relnames)
+
+Makes a copy of the files backing the relations B<@relnames>.  No attempt is made
+to flush these files to disk, meaning the snapshot taken could be stale unless the
+caller ensures these files have been flushed prior to calling.
+
+Dies on failure to invoke psql.
+
+Dies on missing relations.
+
+Dies if the given B<$snapname> is already in use.
+
+=cut
+
+sub take_relfile_snapshot
+{
+	my ($self, $dbname, $snapname, @relnames) = @_;
+	$self->take_relfile_snapshot_helper($dbname, $snapname, 1, @relnames);
+}
+
+sub take_relfile_snapshot_minimal
+{
+	my ($self, $dbname, $snapname, @relnames) = @_;
+	$self->take_relfile_snapshot_helper($dbname, $snapname, 0, @relnames);
+}
+
+sub take_relfile_snapshot_helper
+{
+	my ($self, $dbname, $snapname, $extended, @relnames) = @_;
+
+	croak "dbname must be specified" unless defined $dbname;
+	croak "relnames must be defined" unless scalar(grep { defined $_ } @relnames);
+	croak "snapname must be specified" unless defined $snapname;
+	croak "snapname must be unique" if exists $self->{snapshot}->{$snapname};
+
+	my $pgdata = $self->data_dir;
+	my $snapdir = $self->relfile_snapshot_directory($snapname);
+	croak "snapname directory name already in use: $snapdir" if (-e $snapdir);
+	mkdir $snapdir
+		or BAIL_OUT("could not create snapshot directory \"$snapdir\": $!");
+
+	my @relpaths = map {
+		$self->safe_psql($dbname,
+			qq(SELECT pg_relation_filepath('$_')));
+	} @relnames;
+
+	my (@toastpaths, @idxpaths);
+	if ($extended)
+	{
+		for my $relname (@relnames)
+		{
+			push (@toastpaths, grep /\w/, split(/(?:\s*\r?\n\s*)+/, $self->safe_psql($dbname,
+				qq(SELECT pg_relation_filepath(c.reltoastrelid)
+					FROM pg_catalog.pg_class c
+					WHERE c.oid = '$relname'::regclass
+					AND c.reltoastrelid != 0::oid))));
+			push (@idxpaths, grep /\w/, split(/(?:\s*\r?\n\s*)+/, $self->safe_psql($dbname,
+				qq(SELECT pg_relation_filepath(i.indexrelid)
+					FROM pg_catalog.pg_index i
+					WHERE i.indrelid = '$relname'::regclass))));
+		}
+	}
+
+	$self->{snapshot}->{$snapname} = {};
+	for my $path (@relpaths, grep { defined($_) } @toastpaths, @idxpaths)
+	{
+		croak "file backing relation is missing: $pgdata/$path" unless -f "$pgdata/$path";
+		copy_file($snapdir, $pgdata, 0, $path);
+		$self->{snapshot}->{$snapname}->{$path} = 1;
+	}
+}
+
+=pod
+
+=item $node->revert_to_snapshot($self, $snapname)
+
+Overwrites the database's relation files with files previously saved in
+B<$snapname>.
+
+Dies if the given B<$snapname> does not exist.
+
+=cut
+
+=pod
+
+=item $node->revert_to_torn_relfile_snapshot($self, $snapname, $bytes)
+
+Partially overwrites the database's relation files using prefixes of the given
+number of bytes from the files saved in B<$snapname>.  If B<$bytes> is
+negative, uses suffixes of the given byte length rather than prefixes.
+
+If B<$bytes> is undef, replaces the database's relation files entirely with
+the files saved in B<$snapname>.  Unlike the partial overwrite, a file may
+then become shorter if the saved file is shorter than the current file.
+
+=cut
+
+sub revert_to_snapshot
+{
+	my ($self, $snapname) = @_;
+	$self->revert_to_torn_relfile_snapshot($snapname, undef);
+}
+
+sub revert_to_torn_relfile_snapshot
+{
+	my ($self, $snapname, $bytes) = @_;
+
+	croak "no such snapshot" unless exists $self->{snapshot}->{$snapname};
+
+	my $pgdata = $self->data_dir;
+	my $snaprepo = join('/', $self->relfile_snapshot_repository, $snapname);
+	croak "snapname directory missing: $snaprepo" unless (-d $snaprepo);
+
+	if (defined $bytes)
+	{
+		tear_file($pgdata, $snaprepo, $bytes, $_)
+			for (keys %{$self->{snapshot}->{$snapname}});
+	}
+	else
+	{
+		copy_file($pgdata, $snaprepo, 1, $_)
+			for (keys %{$self->{snapshot}->{$snapname}});
+	}
+}
+
+sub copy_file
+{
+	my ($dstdir, $srcdir, $overwrite, $path) = @_;
+
+	croak "No such directory: $dstdir" unless -d $dstdir;
+	croak "No such directory: $srcdir" unless -d $srcdir;
+
+	foreach my $part (split(m{/}, $path))
+	{
+		my $srcpart = "$srcdir/$part";
+		my $dstpart = "$dstdir/$part";
+
+		if (-d $srcpart)
+		{
+			$srcdir = $srcpart;
+			$dstdir = $dstpart;
+			die "$dstdir is in the way" if (-e $dstdir && ! -d $dstdir);
+			unless (-d $dstdir)
+			{
+				mkdir $dstdir
+					or BAIL_OUT("could not create directory \"$dstdir\": $!");
+			}
+		}
+		elsif (-f $srcpart)
+		{
+			die "$dstdir/$part is in the way" if (!$overwrite && -e "$dstdir/$part");
+
+			File::Copy::copy($srcpart, "$dstdir/$part");
+		}
+	}
+}
+
+sub tear_file
+{
+	my ($dstdir, $srcdir, $bytes, $path) = @_;
+
+	croak "No such directory: $dstdir" unless -d $dstdir;
+	croak "No such directory: $srcdir" unless -d $srcdir;
+
+	my $srcfile = "$srcdir/$path";
+	my $dstfile = "$dstdir/$path";
+
+	croak "No such file: $srcfile" unless -f $srcfile;
+	croak "No such file: $dstfile" unless -f $dstfile;
+
+	my ($srcfh, $dstfh);
+	open($srcfh, '<', $srcfile) or die "Cannot read $srcfile: $!";
+	open($dstfh, '+<', $dstfile) or die "Cannot modify $dstfile: $!";
+	binmode($srcfh);
+	binmode($dstfh);
+
+	my $buffer;
+	if ($bytes < 0)
+	{
+		$bytes *= -1;		# Easier to use positive value
+		my $srcsize = (stat($srcfh))[7];
+		my $offset = $srcsize - $bytes;
+		sysseek($srcfh, $offset, 0) or die "sysseek failed: $!";
+		sysseek($dstfh, $offset, 0) or die "sysseek failed: $!";
+		defined(sysread($srcfh, $buffer, $bytes))
+			or die "sysread failed: $!";
+		defined(syswrite($dstfh, $buffer, $bytes))
+			or die "syswrite failed: $!";
+	}
+	else
+	{
+		sysseek($srcfh, 0, 0) or die "sysseek failed: $!";
+		sysseek($dstfh, 0, 0) or die "sysseek failed: $!";
+		defined(sysread($srcfh, $buffer, $bytes))
+			or die "sysread failed: $!";
+		defined(syswrite($dstfh, $buffer, $bytes))
+			or die "syswrite failed: $!";
+	}
+
+	close($srcfh);
+	close($dstfh);
+}
+
+=pod
+
+=back
+
 =cut
 
 1;
-- 
2.21.1 (Apple Git-122.3)

#12Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#11)
Re: pg_amcheck contrib application

On Wed, Mar 10, 2021 at 11:10 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

Once again, I think you are right and have removed the objectionable behavior, but....

The --startblock and --endblock options make the most sense when the user is only checking one table, like

pg_amcheck --startblock=17 --endblock=19 --table=my_schema.my_corrupt_table

because the user likely has some knowledge about that table, perhaps from a prior run of pg_amcheck. The --startblock and --endblock arguments are a bit strange when used globally, as relations don't all have the same number of blocks, so

pg_amcheck --startblock=17 --endblock=19 mydb

will very likely emit lots of error messages for tables which don't have blocks in that range. That's not entirely pg_amcheck's fault, as it just did what the user asked, but it also doesn't seem super helpful. I'm not going to do anything about it in this release.

+1 to all that. I tend toward the opinion that trying to make
--startblock and --endblock do anything useful in the context of
checking multiple relations is not really possible, and therefore we
just shouldn't put any effort into it. But if user feedback shows
otherwise, we can always do something about it later.

After running 'make installcheck', if I delete all entries from pg_class where relnamespace = 'pg_toast'::regclass, by running 'pg_amcheck regression', I get lines that look like this:

heap relation "regression"."public"."quad_poly_tbl":
ERROR: could not open relation with OID 17177

In this here example, the first line ends in a colon.

relation "regression"."public"."functional_dependencies", block 28, offset 54, attribute 0
attribute 0 with length 4294967295 ends at offset 50 beyond total tuple length 43

But this here one does not. Seems like it should be consistent.

The QUALIFIED_NAME_FIELDS macro doesn't seem to be used anywhere,
which is good, because macros with unbalanced parentheses are usually
not a good plan; and a macro that expands to a comma-separated list of
things is suspect too.

"invalid skip options\n" seems too plural.

With regard to your use of strtol() for --{start,end}block, telling
the user that their input is garbage seems pejorative, even though it
may be accurate. Compare:

[rhaas EDBAS]$ pg_dump -jdsgdsgd
pg_dump: error: invalid number of parallel jobs

In the message "relation end block argument precedes start block
argument\n", I think you could lose both instances of the word
"argument" and probably the word "relation" as well. I actually don't
know why all of these messages about start and end block mention
"relation". It's not like there is some other kind of
non-relation-related start block with which it could be confused.

The comment for run_command() explains some things about the cparams
argument, but those things are false. In fact the argument is unused.

Usual PostgreSQL practice when freeing memory in e.g.
verify_heap_slot_handler is to set the pointers to NULL as well. The
performance cost of this is trivial, and it makes debugging a lot
easier should somebody accidentally write code to access one of those
things after it's been freed.

The documentation says that -D "does not exclude any database that was
listed explicitly as dbname on the command line, nor does it exclude
the database chosen in the absence of any dbname argument." The first
part of this makes complete sense to me, but I'm not sure about the
second part. If I type pg_amcheck --all -D 'r*', I think I'm expecting
that "rhaas" won't be checked. Likewise, if I say pg_amcheck -d
'bob*', I think I only want to check the bob-related databases and not
rhaas.

I suggest documenting --endblock as "Check table blocks up to and
including the specified ending block number. An error will occur if a
relation being checked has fewer than this number of blocks." And
similarly for --startblock: "Check table blocks beginning with the
specified block number. An error will occur, etc." Perhaps even
mention something like "This option is probably only useful when
checking a single table." Also, the documentation here isn't clear
that this affects only table checking, not index checking.

It appears that pg_amcheck sometimes makes dummy connections to the
database that don't do anything, e.g. pg_amcheck -t 'q*' resulted in:

2021-03-10 15:00:14.273 EST [95473] LOG: connection received: host=[local]
2021-03-10 15:00:14.274 EST [95473] LOG: connection authorized:
user=rhaas database=rhaas application_name=pg_amcheck
2021-03-10 15:00:14.286 EST [95473] LOG: statement: SELECT
pg_catalog.set_config('search_path', '', false);
2021-03-10 15:00:14.290 EST [95464] DEBUG: forked new backend,
pid=95474 socket=11
2021-03-10 15:00:14.291 EST [95464] DEBUG: server process (PID 95473)
exited with exit code 0
2021-03-10 15:00:14.291 EST [95474] LOG: connection received: host=[local]
2021-03-10 15:00:14.293 EST [95474] LOG: connection authorized:
user=rhaas database=rhaas application_name=pg_amcheck
2021-03-10 15:00:14.296 EST [95474] LOG: statement: SELECT
pg_catalog.set_config('search_path', '', false);
<...more queries from PID 95474...>
2021-03-10 15:00:14.321 EST [95464] DEBUG: server process (PID 95474)
exited with exit code 0

It doesn't seem to make sense to connect to a database, set the search
path, exit, and then immediately reconnect to the same database.

This is slightly inconsistent:

pg_amcheck: checking heap table "rhaas"."public"."foo"
heap relation "rhaas"."public"."foo":
ERROR: XX000: catalog is missing 144 attribute(s) for relid 16392
LOCATION: RelationBuildTupleDesc, relcache.c:652
query was: SELECT blkno, offnum, attnum, msg FROM "public".verify_heapam(
relation := 16392, on_error_stop := false, check_toast := true, skip := 'none')

In line 1 it's a heap table, but in line 2 it's a heap relation.

That's all I've got.

--
Robert Haas
EDB: http://www.enterprisedb.com

#13Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#12)
3 attachment(s)
Re: pg_amcheck contrib application

On Mar 10, 2021, at 12:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 10, 2021 at 11:10 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

Once again, I think you are right and have removed the objectionable behavior, but....

The --startblock and --endblock options make the most sense when the user is only checking one table, like

pg_amcheck --startblock=17 --endblock=19 --table=my_schema.my_corrupt_table

because the user likely has some knowledge about that table, perhaps from a prior run of pg_amcheck. The --startblock and --endblock arguments are a bit strange when used globally, as relations don't all have the same number of blocks, so

pg_amcheck --startblock=17 --endblock=19 mydb

will very likely emit lots of error messages for tables which don't have blocks in that range. That's not entirely pg_amcheck's fault, as it just did what the user asked, but it also doesn't seem super helpful. I'm not going to do anything about it in this release.

+1 to all that. I tend toward the opinion that trying to make
--startblock and --endblock do anything useful in the context of
checking multiple relations is not really possible, and therefore we
just shouldn't put any effort into it. But if user feedback shows
otherwise, we can always do something about it later.

After running 'make installcheck', if I delete all entries from pg_class where relnamespace = 'pg_toast'::regclass, by running 'pg_amcheck regression', I get lines that look like this:

heap relation "regression"."public"."quad_poly_tbl":
ERROR: could not open relation with OID 17177

In this here example, the first line ends in a colon.

relation "regression"."public"."functional_dependencies", block 28, offset 54, attribute 0
attribute 0 with length 4294967295 ends at offset 50 beyond total tuple length 43

But this here one does not. Seems like it should be consistent.

Good point. It also seems inconsistent that one refers to a "relation" and the other to a "heap relation", though both are heap relations. Changed to use "heap relation" in both places, and to use colons in both.

The QUALIFIED_NAME_FIELDS macro doesn't seem to be used anywhere,
which is good, because macros with unbalanced parentheses are usually
not a good plan; and a macro that expands to a comma-separated list of
things is suspect too.

Yeah, that whole macro was supposed to be removed. Looks like I somehow only removed the end of it, plus some functions that were using it. Not sure how I fat fingered that in the editor, but I've removed the rest now.

"invalid skip options\n" seems too plural.

Changed to something less plural.

With regard to your use of strtol() for --{start,end}block, telling
the user that their input is garbage seems pejorative, even though it
may be accurate. Compare:

[rhaas EDBAS]$ pg_dump -jdsgdsgd
pg_dump: error: invalid number of parallel jobs

In the message "relation end block argument precedes start block
argument\n", I think you could lose both instances of the word
"argument" and probably the word "relation" as well. I actually don't
know why all of these messages about start and end block mention
"relation". It's not like there is some other kind of
non-relation-related start block with which it could be confused.

Changed.

The comment for run_command() explains some things about the cparams
argument, but those things are false. In fact the argument is unused.

Removed unused argument and associated comment.

Usual PostgreSQL practice when freeing memory in e.g.
verify_heap_slot_handler is to set the pointers to NULL as well. The
performance cost of this is trivial, and it makes debugging a lot
easier should somebody accidentally write code to access one of those
things after it's been freed.

I had been doing that and removed it, anticipating a complaint about useless code. Ok, I put it back.

The documentation says that -D "does not exclude any database that was
listed explicitly as dbname on the command line, nor does it exclude
the database chosen in the absence of any dbname argument." The first
part of this makes complete sense to me, but I'm not sure about the
second part. If I type pg_amcheck --all -D 'r*', I think I'm expecting
that "rhaas" won't be checked. Likewise, if I say pg_amcheck -d
'bob*', I think I only want to check the bob-related databases and not
rhaas.

I think it's a tricky definitional problem. I'll argue the other side for the moment:

If you say `pg_amcheck bob`, I think it is fair to assume that "bob" gets checked. If you say `pg_amcheck bob -d="b*" -D="bo*"`, it is fair to expect all databases starting with /b/ to be checked, except those starting with /bo/; but since you *explicitly* asked for "bob", "bob" still gets checked. We both agree on this point, I think.

If you say `pg_amcheck --maintenance-db=bob -d="b*" -D="bo*"`, you don't expect "bob" to get checked, even though it was explicitly stated.

If you are named "bob", and run `pg_amcheck`, you expect it to get your name "bob" from the environment, and check database "bob". It's implicit rather than explicit, but that doesn't change what you expect to happen. It's just a short-hand for saying `pg_amcheck bob`.

Saying that `pg_amcheck -d="b*" -D="bo*"` should not check "bob" implies that the database retrieved from the environment is acting like a maintenance-db. But that's not how it is treated when you just say `pg_amcheck` with no arguments. I think treating it as a maintenance-db in some situations but not in others is strangely non-orthogonal.

On the other hand, I would expect some users to come back with precisely your complaint, so I don't know how best to solve this.

I suggest documenting --endblock as "Check table blocks up to and
including the specified ending block number. An error will occur if a
relation being checked has fewer than this number of blocks." And
similarly for --startblock: "Check table blocks beginning with the
specified block number. An error will occur, etc." Perhaps even
mention something like "This option is probably only useful when
checking a single table." Also, the documentation here isn't clear
that this affects only table checking, not index checking.

Changed.

It appears that pg_amcheck sometimes makes dummy connections to the
database that don't do anything, e.g. pg_amcheck -t 'q*' resulted in:

2021-03-10 15:00:14.273 EST [95473] LOG: connection received: host=[local]
2021-03-10 15:00:14.274 EST [95473] LOG: connection authorized:
user=rhaas database=rhaas application_name=pg_amcheck
2021-03-10 15:00:14.286 EST [95473] LOG: statement: SELECT
pg_catalog.set_config('search_path', '', false);
2021-03-10 15:00:14.290 EST [95464] DEBUG: forked new backend,
pid=95474 socket=11
2021-03-10 15:00:14.291 EST [95464] DEBUG: server process (PID 95473)
exited with exit code 0
2021-03-10 15:00:14.291 EST [95474] LOG: connection received: host=[local]
2021-03-10 15:00:14.293 EST [95474] LOG: connection authorized:
user=rhaas database=rhaas application_name=pg_amcheck
2021-03-10 15:00:14.296 EST [95474] LOG: statement: SELECT
pg_catalog.set_config('search_path', '', false);
<...more queries from PID 95474...>
2021-03-10 15:00:14.321 EST [95464] DEBUG: server process (PID 95474)
exited with exit code 0

It doesn't seem to make sense to connect to a database, set the search
path, exit, and then immediately reconnect to the same database.

I think I've cleaned that up now.

This is slightly inconsistent:

pg_amcheck: checking heap table "rhaas"."public"."foo"
heap relation "rhaas"."public"."foo":
ERROR: XX000: catalog is missing 144 attribute(s) for relid 16392
LOCATION: RelationBuildTupleDesc, relcache.c:652
query was: SELECT blkno, offnum, attnum, msg FROM "public".verify_heapam(
relation := 16392, on_error_stop := false, check_toast := true, skip := 'none')

In line 1 it's a heap table, but in line 2 it's a heap relation.

Changed to use "heap table" consistently, and along those lines, to use "btree index" rather than "btree relation".

Attachments:

v45-0001-Reworking-ParallelSlots-for-mutliple-DB-use.patchapplication/octet-stream; name=v45-0001-Reworking-ParallelSlots-for-mutliple-DB-use.patch; x-unix-mode=0644Download
From 3469b2304fd5c6604607fcd5c7c405c0995d0fa7 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Wed, 3 Mar 2021 07:16:55 -0800
Subject: [PATCH v45 1/3] Reworking ParallelSlots for multiple DB use

The existing implementation of ParallelSlots is used by reindexdb
and vacuumdb to process tables in parallel in only one database at
a time.  The ParallelSlots interface reflects this usage pattern.
The function to set up the slots assumes all slots should be
connected to the same database, and the function for getting the
next idle slot pays no attention to which database the slot may be
connected to.

In anticipation of pg_amcheck using parallel slots to process
multiple databases in parallel, rework the interface while
keeping it reasonably simple for reindexdb and vacuumdb to
use:

ParallelSlotsSetup() no longer creates or receives database
connections.  It takes arguments that it stores for use in
subsequent operations when a connection needs to be formed.

Callers who already have a connection and want to reuse it can give
it to the parallel slots using a new function,
ParallelSlotsAdoptConn().  Both reindexdb and vacuumdb use this.

ParallelSlotsGetIdle() is extended to take a dbname argument
indicating the database to which a connection is desired, and to
manage a heterogeneous set of slots potentially connected to varying
databases and some perhaps not yet connected.  The function will
reuse an existing connection or form a new connection as necessary.

The logic for determining whether a slot's connection is suitable
for reuse is based on the database the slot's connection is
connected to, and whether that matches the database desired.  Other
connection parameters (user, host, port, etc.) are assumed not to
change from slot to slot.
---
 src/bin/scripts/reindexdb.c          |  17 +-
 src/bin/scripts/vacuumdb.c           |  46 +--
 src/fe_utils/parallel_slot.c         | 407 +++++++++++++++++++--------
 src/include/fe_utils/parallel_slot.h |  27 +-
 src/tools/pgindent/typedefs.list     |   2 +
 5 files changed, 338 insertions(+), 161 deletions(-)

diff --git a/src/bin/scripts/reindexdb.c b/src/bin/scripts/reindexdb.c
index cf28176243..fc0681538a 100644
--- a/src/bin/scripts/reindexdb.c
+++ b/src/bin/scripts/reindexdb.c
@@ -36,7 +36,7 @@ static SimpleStringList *get_parallel_object_list(PGconn *conn,
 												  ReindexType type,
 												  SimpleStringList *user_list,
 												  bool echo);
-static void reindex_one_database(const ConnParams *cparams, ReindexType type,
+static void reindex_one_database(ConnParams *cparams, ReindexType type,
 								 SimpleStringList *user_list,
 								 const char *progname,
 								 bool echo, bool verbose, bool concurrently,
@@ -330,7 +330,7 @@ main(int argc, char *argv[])
 }
 
 static void
-reindex_one_database(const ConnParams *cparams, ReindexType type,
+reindex_one_database(ConnParams *cparams, ReindexType type,
 					 SimpleStringList *user_list,
 					 const char *progname, bool echo,
 					 bool verbose, bool concurrently, int concurrentCons,
@@ -341,7 +341,7 @@ reindex_one_database(const ConnParams *cparams, ReindexType type,
 	bool		parallel = concurrentCons > 1;
 	SimpleStringList *process_list = user_list;
 	ReindexType process_type = type;
-	ParallelSlot *slots;
+	ParallelSlotArray *sa;
 	bool		failed = false;
 	int			items_count = 0;
 
@@ -461,7 +461,8 @@ reindex_one_database(const ConnParams *cparams, ReindexType type,
 
 	Assert(process_list != NULL);
 
-	slots = ParallelSlotsSetup(cparams, progname, echo, conn, concurrentCons);
+	sa = ParallelSlotsSetup(concurrentCons, cparams, progname, echo, NULL);
+	ParallelSlotsAdoptConn(sa, conn);
 
 	cell = process_list->head;
 	do
@@ -475,7 +476,7 @@ reindex_one_database(const ConnParams *cparams, ReindexType type,
 			goto finish;
 		}
 
-		free_slot = ParallelSlotsGetIdle(slots, concurrentCons);
+		free_slot = ParallelSlotsGetIdle(sa, NULL);
 		if (!free_slot)
 		{
 			failed = true;
@@ -489,7 +490,7 @@ reindex_one_database(const ConnParams *cparams, ReindexType type,
 		cell = cell->next;
 	} while (cell != NULL);
 
-	if (!ParallelSlotsWaitCompletion(slots, concurrentCons))
+	if (!ParallelSlotsWaitCompletion(sa))
 		failed = true;
 
 finish:
@@ -499,8 +500,8 @@ finish:
 		pg_free(process_list);
 	}
 
-	ParallelSlotsTerminate(slots, concurrentCons);
-	pfree(slots);
+	ParallelSlotsTerminate(sa);
+	pfree(sa);
 
 	if (failed)
 		exit(1);
diff --git a/src/bin/scripts/vacuumdb.c b/src/bin/scripts/vacuumdb.c
index 602fd45c42..7901c41f16 100644
--- a/src/bin/scripts/vacuumdb.c
+++ b/src/bin/scripts/vacuumdb.c
@@ -45,7 +45,7 @@ typedef struct vacuumingOptions
 } vacuumingOptions;
 
 
-static void vacuum_one_database(const ConnParams *cparams,
+static void vacuum_one_database(ConnParams *cparams,
 								vacuumingOptions *vacopts,
 								int stage,
 								SimpleStringList *tables,
@@ -408,7 +408,7 @@ main(int argc, char *argv[])
  * a list of tables from the database.
  */
 static void
-vacuum_one_database(const ConnParams *cparams,
+vacuum_one_database(ConnParams *cparams,
 					vacuumingOptions *vacopts,
 					int stage,
 					SimpleStringList *tables,
@@ -421,13 +421,14 @@ vacuum_one_database(const ConnParams *cparams,
 	PGresult   *res;
 	PGconn	   *conn;
 	SimpleStringListCell *cell;
-	ParallelSlot *slots;
+	ParallelSlotArray *sa;
 	SimpleStringList dbtables = {NULL, NULL};
 	int			i;
 	int			ntups;
 	bool		failed = false;
 	bool		tables_listed = false;
 	bool		has_where = false;
+	const char *initcmd;
 	const char *stage_commands[] = {
 		"SET default_statistics_target=1; SET vacuum_cost_delay=0;",
 		"SET default_statistics_target=10; RESET vacuum_cost_delay;",
@@ -684,26 +685,25 @@ vacuum_one_database(const ConnParams *cparams,
 		concurrentCons = 1;
 
 	/*
-	 * Setup the database connections. We reuse the connection we already have
-	 * for the first slot.  If not in parallel mode, the first slot in the
-	 * array contains the connection.
+	 * All slots need to be prepared to run the appropriate analyze stage, if
+	 * caller requested that mode.  We have to prepare the initial connection
+	 * ourselves before setting up the slots.
 	 */
-	slots = ParallelSlotsSetup(cparams, progname, echo, conn, concurrentCons);
+	if (stage == ANALYZE_NO_STAGE)
+		initcmd = NULL;
+	else
+	{
+		initcmd = stage_commands[stage];
+		executeCommand(conn, initcmd, echo);
+	}
 
 	/*
-	 * Prepare all the connections to run the appropriate analyze stage, if
-	 * caller requested that mode.
+	 * Setup the database connections. We reuse the connection we already have
+	 * for the first slot.  If not in parallel mode, the first slot in the
+	 * array contains the connection.
 	 */
-	if (stage != ANALYZE_NO_STAGE)
-	{
-		int			j;
-
-		/* We already emitted the message above */
-
-		for (j = 0; j < concurrentCons; j++)
-			executeCommand((slots + j)->connection,
-						   stage_commands[stage], echo);
-	}
+	sa = ParallelSlotsSetup(concurrentCons, cparams, progname, echo, initcmd);
+	ParallelSlotsAdoptConn(sa, conn);
 
 	initPQExpBuffer(&sql);
 
@@ -719,7 +719,7 @@ vacuum_one_database(const ConnParams *cparams,
 			goto finish;
 		}
 
-		free_slot = ParallelSlotsGetIdle(slots, concurrentCons);
+		free_slot = ParallelSlotsGetIdle(sa, NULL);
 		if (!free_slot)
 		{
 			failed = true;
@@ -740,12 +740,12 @@ vacuum_one_database(const ConnParams *cparams,
 		cell = cell->next;
 	} while (cell != NULL);
 
-	if (!ParallelSlotsWaitCompletion(slots, concurrentCons))
+	if (!ParallelSlotsWaitCompletion(sa))
 		failed = true;
 
 finish:
-	ParallelSlotsTerminate(slots, concurrentCons);
-	pg_free(slots);
+	ParallelSlotsTerminate(sa);
+	pg_free(sa);
 
 	termPQExpBuffer(&sql);
 
diff --git a/src/fe_utils/parallel_slot.c b/src/fe_utils/parallel_slot.c
index b625deb254..69581157c2 100644
--- a/src/fe_utils/parallel_slot.c
+++ b/src/fe_utils/parallel_slot.c
@@ -25,25 +25,16 @@
 #include "common/logging.h"
 #include "fe_utils/cancel.h"
 #include "fe_utils/parallel_slot.h"
+#include "fe_utils/query_utils.h"
 
 #define ERRCODE_UNDEFINED_TABLE  "42P01"
 
-static void init_slot(ParallelSlot *slot, PGconn *conn);
 static int	select_loop(int maxFd, fd_set *workerset);
 static bool processQueryResult(ParallelSlot *slot, PGresult *result);
 
-static void
-init_slot(ParallelSlot *slot, PGconn *conn)
-{
-	slot->connection = conn;
-	/* Initially assume connection is idle */
-	slot->isFree = true;
-	ParallelSlotClearHandler(slot);
-}
-
 /*
  * Process (and delete) a query result.  Returns true if there's no problem,
- * false otherwise. It's up to the handler to decide what cosntitutes a
+ * false otherwise. It's up to the handler to decide what constitutes a
  * problem.
  */
 static bool
@@ -137,151 +128,316 @@ select_loop(int maxFd, fd_set *workerset)
 }
 
 /*
- * ParallelSlotsGetIdle
- *		Return a connection slot that is ready to execute a command.
- *
- * This returns the first slot we find that is marked isFree, if one is;
- * otherwise, we loop on select() until one socket becomes available.  When
- * this happens, we read the whole set and mark as free all sockets that
- * become available.  If an error occurs, NULL is returned.
+ * Return the offset of a suitable idle slot, or -1 if none are available.  If
+ * the given dbname is not null, only idle slots connected to the given
+ * database are considered suitable, otherwise all idle connected slots are
+ * considered suitable.
  */
-ParallelSlot *
-ParallelSlotsGetIdle(ParallelSlot *slots, int numslots)
+static int
+find_matching_idle_slot(const ParallelSlotArray *sa, const char *dbname)
 {
 	int			i;
-	int			firstFree = -1;
 
-	/*
-	 * Look for any connection currently free.  If there is one, mark it as
-	 * taken and let the caller know the slot to use.
-	 */
-	for (i = 0; i < numslots; i++)
+	for (i = 0; i < sa->numslots; i++)
 	{
-		if (slots[i].isFree)
-		{
-			slots[i].isFree = false;
-			return slots + i;
-		}
+		if (sa->slots[i].inUse)
+			continue;
+
+		if (sa->slots[i].connection == NULL)
+			continue;
+
+		if (dbname == NULL ||
+			strcmp(PQdb(sa->slots[i].connection), dbname) == 0)
+			return i;
+	}
+	return -1;
+}
+
+/*
+ * Return the offset of the first slot without a database connection, or -1 if
+ * all slots are connected.
+ */
+static int
+find_unconnected_slot(const ParallelSlotArray *sa)
+{
+	int			i;
+
+	for (i = 0; i < sa->numslots; i++)
+	{
+		if (sa->slots[i].inUse)
+			continue;
+
+		if (sa->slots[i].connection == NULL)
+			return i;
+	}
+
+	return -1;
+}
+
+/*
+ * Return the offset of the first idle slot, or -1 if all slots are busy.
+ */
+static int
+find_any_idle_slot(const ParallelSlotArray *sa)
+{
+	int			i;
+
+	for (i = 0; i < sa->numslots; i++)
+		if (!sa->slots[i].inUse)
+			return i;
+
+	return -1;
+}
+
+/*
+ * Wait for any slot's connection to have query results, consume the results,
+ * and update the slot's status as appropriate.  Returns true on success,
+ * false on cancellation, on error, or if no slots are connected.
+ */
+static bool
+wait_on_slots(ParallelSlotArray *sa)
+{
+	int			i;
+	fd_set		slotset;
+	int			maxFd = 0;
+	PGconn	   *cancelconn = NULL;
+
+	/* We must reconstruct the fd_set for each call to select_loop */
+	FD_ZERO(&slotset);
+
+	for (i = 0; i < sa->numslots; i++)
+	{
+		int			sock;
+
+		/* We shouldn't get here if we still have slots without connections */
+		Assert(sa->slots[i].connection != NULL);
+
+		sock = PQsocket(sa->slots[i].connection);
+
+		/*
+		 * We don't really expect any connections to lose their sockets after
+		 * startup, but just in case, cope by ignoring them.
+		 */
+		if (sock < 0)
+			continue;
+
+		/* Keep track of the first valid connection we see. */
+		if (cancelconn == NULL)
+			cancelconn = sa->slots[i].connection;
+
+		FD_SET(sock, &slotset);
+		if (sock > maxFd)
+			maxFd = sock;
 	}
 
 	/*
-	 * No free slot found, so wait until one of the connections has finished
-	 * its task and return the available slot.
+	 * If we get this far with no valid connections, processing cannot
+	 * continue.
 	 */
-	while (firstFree < 0)
+	if (cancelconn == NULL)
+		return false;
+
+	SetCancelConn(cancelconn);
+	i = select_loop(maxFd, &slotset);
+	ResetCancelConn();
+
+	/* failure? */
+	if (i < 0)
+		return false;
+
+	for (i = 0; i < sa->numslots; i++)
 	{
-		fd_set		slotset;
-		int			maxFd = 0;
+		int			sock;
 
-		/* We must reconstruct the fd_set for each call to select_loop */
-		FD_ZERO(&slotset);
+		sock = PQsocket(sa->slots[i].connection);
 
-		for (i = 0; i < numslots; i++)
+		if (sock >= 0 && FD_ISSET(sock, &slotset))
 		{
-			int			sock = PQsocket(slots[i].connection);
-
-			/*
-			 * We don't really expect any connections to lose their sockets
-			 * after startup, but just in case, cope by ignoring them.
-			 */
-			if (sock < 0)
-				continue;
-
-			FD_SET(sock, &slotset);
-			if (sock > maxFd)
-				maxFd = sock;
+			/* select() says input is available, so consume it */
+			PQconsumeInput(sa->slots[i].connection);
 		}
 
-		SetCancelConn(slots->connection);
-		i = select_loop(maxFd, &slotset);
-		ResetCancelConn();
-
-		/* failure? */
-		if (i < 0)
-			return NULL;
-
-		for (i = 0; i < numslots; i++)
+		/* Collect result(s) as long as any are available */
+		while (!PQisBusy(sa->slots[i].connection))
 		{
-			int			sock = PQsocket(slots[i].connection);
+			PGresult   *result = PQgetResult(sa->slots[i].connection);
 
-			if (sock >= 0 && FD_ISSET(sock, &slotset))
+			if (result != NULL)
 			{
-				/* select() says input is available, so consume it */
-				PQconsumeInput(slots[i].connection);
+				/* Handle and discard the command result */
+				if (!processQueryResult(&sa->slots[i], result))
+					return false;
 			}
-
-			/* Collect result(s) as long as any are available */
-			while (!PQisBusy(slots[i].connection))
+			else
 			{
-				PGresult   *result = PQgetResult(slots[i].connection);
-
-				if (result != NULL)
-				{
-					/* Handle and discard the command result */
-					if (!processQueryResult(slots + i, result))
-						return NULL;
-				}
-				else
-				{
-					/* This connection has become idle */
-					slots[i].isFree = true;
-					ParallelSlotClearHandler(slots + i);
-					if (firstFree < 0)
-						firstFree = i;
-					break;
-				}
+				/* This connection has become idle */
+				sa->slots[i].inUse = false;
+				ParallelSlotClearHandler(&sa->slots[i]);
+				break;
 			}
 		}
 	}
+	return true;
+}
 
-	slots[firstFree].isFree = false;
-	return slots + firstFree;
+/*
+ * Open a new database connection using the stored connection parameters and
+ * optionally a given dbname if not null, execute the stored initial command if
+ * any, and associate the new connection with the given slot.
+ */
+static void
+connect_slot(ParallelSlotArray *sa, int slotno, const char *dbname)
+{
+	const char *old_override;
+	ParallelSlot *slot = &sa->slots[slotno];
+
+	old_override = sa->cparams->override_dbname;
+	if (dbname)
+		sa->cparams->override_dbname = dbname;
+	slot->connection = connectDatabase(sa->cparams, sa->progname, sa->echo, false, true);
+	sa->cparams->override_dbname = old_override;
+
+	if (PQsocket(slot->connection) >= FD_SETSIZE)
+	{
+		pg_log_fatal("too many jobs for this platform");
+		exit(1);
+	}
+
+	/* Setup the connection using the supplied command, if any. */
+	if (sa->initcmd)
+		executeCommand(slot->connection, sa->initcmd, sa->echo);
 }
 
 /*
- * ParallelSlotsSetup
- *		Prepare a set of parallel slots to use on a given database.
+ * ParallelSlotsGetIdle
+ *		Return a connection slot that is ready to execute a command.
+ *
+ * The slot returned is chosen as follows:
+ *
+ * If any idle slot already has an open connection, and if either dbname is
+ * null or the existing connection is to the given database, that slot will be
+ * returned allowing the connection to be reused.
+ *
+ * Otherwise, if any idle slot is not yet connected to any database, the slot
+ * will be returned with its connection opened using the stored cparams and
+ * optionally the given dbname if not null.
+ *
+ * Otherwise, if any idle slot exists, an idle slot will be chosen and returned
+ * after having its connection disconnected and reconnected using the stored
+ * cparams and optionally the given dbname if not null.
  *
- * This creates and initializes a set of connections to the database
- * using the information given by the caller, marking all parallel slots
- * as free and ready to use.  "conn" is an initial connection set up
- * by the caller and is associated with the first slot in the parallel
- * set.
+ * Otherwise, if any slots have connections that are busy, we loop on select()
+ * until one socket becomes available.  When this happens, we read the whole
+ * set and mark as free all sockets that become available.  We then select a
+ * slot using the same rules as above.
+ *
+ * Otherwise, we cannot return a slot, which is an error, and NULL is returned.
+ *
+ * For any connection created, if the stored initcmd is not null, it will be
+ * executed as a command on the newly formed connection before the slot is
+ * returned.
+ *
+ * If an error occurs, NULL is returned.
  */
 ParallelSlot *
-ParallelSlotsSetup(const ConnParams *cparams,
-				   const char *progname, bool echo,
-				   PGconn *conn, int numslots)
+ParallelSlotsGetIdle(ParallelSlotArray *sa, const char *dbname)
 {
-	ParallelSlot *slots;
-	int			i;
+	int			offset;
 
-	Assert(conn != NULL);
+	Assert(sa);
+	Assert(sa->numslots > 0);
 
-	slots = (ParallelSlot *) pg_malloc(sizeof(ParallelSlot) * numslots);
-	init_slot(slots, conn);
-	if (numslots > 1)
+	while (1)
 	{
-		for (i = 1; i < numslots; i++)
+		/* First choice: a slot already connected to the desired database. */
+		offset = find_matching_idle_slot(sa, dbname);
+		if (offset >= 0)
 		{
-			conn = connectDatabase(cparams, progname, echo, false, true);
-
-			/*
-			 * Fail and exit immediately if trying to use a socket in an
-			 * unsupported range.  POSIX requires open(2) to use the lowest
-			 * unused file descriptor and the hint given relies on that.
-			 */
-			if (PQsocket(conn) >= FD_SETSIZE)
-			{
-				pg_log_fatal("too many jobs for this platform -- try %d", i);
-				exit(1);
-			}
+			sa->slots[offset].inUse = true;
+			return &sa->slots[offset];
+		}
+
+		/* Second choice: a slot not connected to any database. */
+		offset = find_unconnected_slot(sa);
+		if (offset >= 0)
+		{
+			connect_slot(sa, offset, dbname);
+			sa->slots[offset].inUse = true;
+			return &sa->slots[offset];
+		}
 
-			init_slot(slots + i, conn);
+		/* Third choice: a slot connected to the wrong database. */
+		offset = find_any_idle_slot(sa);
+		if (offset >= 0)
+		{
+			disconnectDatabase(sa->slots[offset].connection);
+			sa->slots[offset].connection = NULL;
+			connect_slot(sa, offset, dbname);
+			sa->slots[offset].inUse = true;
+			return &sa->slots[offset];
 		}
+
+		/*
+		 * Fourth choice: block until one or more slots become available. If
+		 * any slots hit a fatal error, we'll find out about that here and
+		 * return NULL.
+		 */
+		if (!wait_on_slots(sa))
+			return NULL;
 	}
+}
+
+/*
+ * ParallelSlotsSetup
+ *		Prepare a set of parallel slots but do not connect to any database.
+ *
+ * This creates and initializes a set of slots, marking all parallel slots as
+ * free and ready to use.  Establishing connections is delayed until requesting
+ * a free slot.  The cparams, progname, echo, and initcmd are stored for later
+ * use and must remain valid for the lifetime of the returned array.
+ */
+ParallelSlotArray *
+ParallelSlotsSetup(int numslots, ConnParams *cparams, const char *progname,
+				   bool echo, const char *initcmd)
+{
+	ParallelSlotArray *sa;
 
-	return slots;
+	Assert(numslots > 0);
+	Assert(cparams != NULL);
+	Assert(progname != NULL);
+
+	sa = (ParallelSlotArray *) palloc0(offsetof(ParallelSlotArray, slots) +
+									   numslots * sizeof(ParallelSlot));
+
+	sa->numslots = numslots;
+	sa->cparams = cparams;
+	sa->progname = progname;
+	sa->echo = echo;
+	sa->initcmd = initcmd;
+
+	return sa;
+}
+
+/*
+ * ParallelSlotsAdoptConn
+ *		Assign an open connection to the slots array for reuse.
+ *
+ * This turns over ownership of an open connection to a slots array.  The
+ * caller should not further use or close the connection.  All the connection's
+ * parameters (user, host, port, etc.) except possibly dbname should match
+ * those of the slots array's cparams, as given in ParallelSlotsSetup.  If
+ * these parameters differ, subsequent behavior is undefined.
+ */
+void
+ParallelSlotsAdoptConn(ParallelSlotArray *sa, PGconn *conn)
+{
+	int			offset;
+
+	offset = find_unconnected_slot(sa);
+	if (offset >= 0)
+		sa->slots[offset].connection = conn;
+	else
+		disconnectDatabase(conn);
 }
 
 /*
@@ -292,13 +448,13 @@ ParallelSlotsSetup(const ConnParams *cparams,
  * terminate all connections.
  */
 void
-ParallelSlotsTerminate(ParallelSlot *slots, int numslots)
+ParallelSlotsTerminate(ParallelSlotArray *sa)
 {
 	int			i;
 
-	for (i = 0; i < numslots; i++)
+	for (i = 0; i < sa->numslots; i++)
 	{
-		PGconn	   *conn = slots[i].connection;
+		PGconn	   *conn = sa->slots[i].connection;
 
 		if (conn == NULL)
 			continue;
@@ -314,13 +470,15 @@ ParallelSlotsTerminate(ParallelSlot *slots, int numslots)
  * error has been found on the way.
  */
 bool
-ParallelSlotsWaitCompletion(ParallelSlot *slots, int numslots)
+ParallelSlotsWaitCompletion(ParallelSlotArray *sa)
 {
 	int			i;
 
-	for (i = 0; i < numslots; i++)
+	for (i = 0; i < sa->numslots; i++)
 	{
-		if (!consumeQueryResult(slots + i))
+		if (sa->slots[i].connection == NULL)
+			continue;
+		if (!consumeQueryResult(&sa->slots[i]))
 			return false;
 	}
 
@@ -350,6 +508,9 @@ ParallelSlotsWaitCompletion(ParallelSlot *slots, int numslots)
 bool
 TableCommandResultHandler(PGresult *res, PGconn *conn, void *context)
 {
+	Assert(res != NULL);
+	Assert(conn != NULL);
+
 	/*
 	 * If it's an error, report it.  Errors about a missing table are harmless
 	 * so we continue processing; but die for other errors.
diff --git a/src/include/fe_utils/parallel_slot.h b/src/include/fe_utils/parallel_slot.h
index 8902f8d4f4..b7e2b0a29b 100644
--- a/src/include/fe_utils/parallel_slot.h
+++ b/src/include/fe_utils/parallel_slot.h
@@ -21,7 +21,7 @@ typedef bool (*ParallelSlotResultHandler) (PGresult *res, PGconn *conn,
 typedef struct ParallelSlot
 {
 	PGconn	   *connection;		/* One connection */
-	bool		isFree;			/* Is it known to be idle? */
+	bool		inUse;			/* Is the slot being used? */
 
 	/*
 	 * Prior to issuing a command or query on 'connection', a handler callback
@@ -33,6 +33,16 @@ typedef struct ParallelSlot
 	void	   *handler_context;
 } ParallelSlot;
 
+typedef struct ParallelSlotArray
+{
+	int			numslots;
+	ConnParams *cparams;
+	const char *progname;
+	bool		echo;
+	const char *initcmd;
+	ParallelSlot slots[FLEXIBLE_ARRAY_MEMBER];
+} ParallelSlotArray;
+
 static inline void
 ParallelSlotSetHandler(ParallelSlot *slot, ParallelSlotResultHandler handler,
 					   void *context)
@@ -48,15 +58,18 @@ ParallelSlotClearHandler(ParallelSlot *slot)
 	slot->handler_context = NULL;
 }
 
-extern ParallelSlot *ParallelSlotsGetIdle(ParallelSlot *slots, int numslots);
+extern ParallelSlot *ParallelSlotsGetIdle(ParallelSlotArray *sa,
+										  const char *dbname);
+
+extern ParallelSlotArray *ParallelSlotsSetup(int numslots, ConnParams *cparams,
+											 const char *progname, bool echo,
+											 const char *initcmd);
 
-extern ParallelSlot *ParallelSlotsSetup(const ConnParams *cparams,
-										const char *progname, bool echo,
-										PGconn *conn, int numslots);
+extern void ParallelSlotsAdoptConn(ParallelSlotArray *sa, PGconn *conn);
 
-extern void ParallelSlotsTerminate(ParallelSlot *slots, int numslots);
+extern void ParallelSlotsTerminate(ParallelSlotArray *sa);
 
-extern bool ParallelSlotsWaitCompletion(ParallelSlot *slots, int numslots);
+extern bool ParallelSlotsWaitCompletion(ParallelSlotArray *sa);
 
 extern bool TableCommandResultHandler(PGresult *res, PGconn *conn,
 									  void *context);
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 574a8a94fa..e017557e3e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -404,6 +404,7 @@ ConfigData
 ConfigVariable
 ConnCacheEntry
 ConnCacheKey
+ConnParams
 ConnStatusType
 ConnType
 ConnectionStateEnum
@@ -1730,6 +1731,7 @@ ParallelHashJoinState
 ParallelIndexScanDesc
 ParallelReadyList
 ParallelSlot
+ParallelSlotArray
 ParallelState
 ParallelTableScanDesc
 ParallelTableScanDescData
-- 
2.21.1 (Apple Git-122.3)

v45-0002-Adding-contrib-module-pg_amcheck.patch (application/octet-stream)
From 486366c57b683607d3513fa3d3bf41cd97f8cd5b Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Tue, 2 Mar 2021 08:34:40 -0800
Subject: [PATCH v45 2/3] Adding contrib module pg_amcheck

Adding new contrib module pg_amcheck, which is a command line
interface for running amcheck's verifications against tables and
indexes.
---
 contrib/Makefile                           |    1 +
 contrib/pg_amcheck/.gitignore              |    3 +
 contrib/pg_amcheck/Makefile                |   29 +
 contrib/pg_amcheck/pg_amcheck.c            | 2117 ++++++++++++++++++++
 contrib/pg_amcheck/t/001_basic.pl          |    9 +
 contrib/pg_amcheck/t/002_nonesuch.pl       |  248 +++
 contrib/pg_amcheck/t/003_check.pl          |  497 +++++
 contrib/pg_amcheck/t/004_verify_heapam.pl  |  517 +++++
 contrib/pg_amcheck/t/005_opclass_damage.pl |   54 +
 doc/src/sgml/contrib.sgml                  |    1 +
 doc/src/sgml/filelist.sgml                 |    1 +
 doc/src/sgml/pgamcheck.sgml                |  713 +++++++
 src/tools/msvc/Install.pm                  |    2 +-
 src/tools/msvc/Mkvcbuild.pm                |    6 +-
 src/tools/pgindent/typedefs.list           |    5 +
 15 files changed, 4199 insertions(+), 4 deletions(-)
 create mode 100644 contrib/pg_amcheck/.gitignore
 create mode 100644 contrib/pg_amcheck/Makefile
 create mode 100644 contrib/pg_amcheck/pg_amcheck.c
 create mode 100644 contrib/pg_amcheck/t/001_basic.pl
 create mode 100644 contrib/pg_amcheck/t/002_nonesuch.pl
 create mode 100644 contrib/pg_amcheck/t/003_check.pl
 create mode 100644 contrib/pg_amcheck/t/004_verify_heapam.pl
 create mode 100644 contrib/pg_amcheck/t/005_opclass_damage.pl
 create mode 100644 doc/src/sgml/pgamcheck.sgml

diff --git a/contrib/Makefile b/contrib/Makefile
index f27e458482..a72dcf7304 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -30,6 +30,7 @@ SUBDIRS = \
 		old_snapshot	\
 		pageinspect	\
 		passwordcheck	\
+		pg_amcheck	\
 		pg_buffercache	\
 		pg_freespacemap \
 		pg_prewarm	\
diff --git a/contrib/pg_amcheck/.gitignore b/contrib/pg_amcheck/.gitignore
new file mode 100644
index 0000000000..c21a14de31
--- /dev/null
+++ b/contrib/pg_amcheck/.gitignore
@@ -0,0 +1,3 @@
+pg_amcheck
+
+/tmp_check/
diff --git a/contrib/pg_amcheck/Makefile b/contrib/pg_amcheck/Makefile
new file mode 100644
index 0000000000..bc61ee7970
--- /dev/null
+++ b/contrib/pg_amcheck/Makefile
@@ -0,0 +1,29 @@
+# contrib/pg_amcheck/Makefile
+
+PGFILEDESC = "pg_amcheck - detects corruption within database relations"
+PGAPPICON = win32
+
+PROGRAM = pg_amcheck
+OBJS = \
+	$(WIN32RES) \
+	pg_amcheck.o
+
+REGRESS_OPTS += --load-extension=amcheck --load-extension=pageinspect
+EXTRA_INSTALL += contrib/amcheck contrib/pageinspect
+
+TAP_TESTS = 1
+
+PG_CPPFLAGS = -I$(libpq_srcdir)
+PG_LIBS_INTERNAL = -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+SHLIB_PREREQS = submake-libpq
+subdir = contrib/pg_amcheck
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_amcheck/pg_amcheck.c b/contrib/pg_amcheck/pg_amcheck.c
new file mode 100644
index 0000000000..96092cd0d5
--- /dev/null
+++ b/contrib/pg_amcheck/pg_amcheck.c
@@ -0,0 +1,2117 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_amcheck.c
+ *		Detects corruption within database relations.
+ *
+ * Copyright (c) 2017-2021, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  contrib/pg_amcheck/pg_amcheck.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <time.h>
+
+#include "catalog/pg_am_d.h"
+#include "catalog/pg_namespace_d.h"
+#include "common/logging.h"
+#include "common/username.h"
+#include "fe_utils/cancel.h"
+#include "fe_utils/option_utils.h"
+#include "fe_utils/parallel_slot.h"
+#include "fe_utils/query_utils.h"
+#include "fe_utils/simple_list.h"
+#include "fe_utils/string_utils.h"
+#include "getopt_long.h"		/* pgrminclude ignore */
+#include "pgtime.h"
+#include "storage/block.h"
+
+typedef struct PatternInfo
+{
+	const char *pattern;		/* Unaltered pattern from the command line */
+	char	   *db_regex;		/* Database regexp parsed from pattern, or
+								 * NULL */
+	char	   *nsp_regex;		/* Schema regexp parsed from pattern, or NULL */
+	char	   *rel_regex;		/* Relation regexp parsed from pattern, or
+								 * NULL */
+	bool		heap_only;		/* true if rel_regex should only match heap
+								 * tables */
+	bool		btree_only;		/* true if rel_regex should only match btree
+								 * indexes */
+	bool		matched;		/* true if the pattern matched in any database */
+} PatternInfo;
+
+typedef struct PatternInfoArray
+{
+	PatternInfo *data;
+	size_t		len;
+} PatternInfoArray;
+
+/* pg_amcheck command line options controlled by user flags */
+typedef struct AmcheckOptions
+{
+	bool		alldb;
+	bool		echo;
+	bool		quiet;
+	bool		verbose;
+	bool		strict_names;
+	bool		show_progress;
+	int			jobs;
+
+	/* Objects to check or not to check, as lists of PatternInfo structs. */
+	PatternInfoArray include;
+	PatternInfoArray exclude;
+
+	/*
+	 * As an optimization, if any pattern in the exclude list applies to heap
+	 * tables, or similarly if any such pattern applies to btree indexes, or
+	 * to schemas, then these will be true, otherwise false.  These should
+	 * always agree with what you'd conclude by grep'ing through the exclude
+	 * list.
+	 */
+	bool		excludetbl;
+	bool		excludeidx;
+	bool		excludensp;
+
+	/*
+	 * If any inclusion pattern exists, then we should only be checking
+	 * matching relations rather than all relations, so this is true iff
+	 * include is empty.
+	 */
+	bool		allrel;
+
+	/* heap table checking options */
+	bool		no_toast_expansion;
+	bool		reconcile_toast;
+	bool		on_error_stop;
+	int64		startblock;
+	int64		endblock;
+	const char *skip;
+
+	/* btree index checking options */
+	bool		parent_check;
+	bool		rootdescend;
+	bool		heapallindexed;
+
+	/* heap and btree hybrid option */
+	bool		no_btree_expansion;
+} AmcheckOptions;
+
+static AmcheckOptions opts = {
+	.alldb = false,
+	.echo = false,
+	.quiet = false,
+	.verbose = false,
+	.strict_names = true,
+	.show_progress = false,
+	.jobs = 1,
+	.include = {NULL, 0},
+	.exclude = {NULL, 0},
+	.excludetbl = false,
+	.excludeidx = false,
+	.excludensp = false,
+	.allrel = true,
+	.no_toast_expansion = false,
+	.reconcile_toast = true,
+	.on_error_stop = false,
+	.startblock = -1,
+	.endblock = -1,
+	.skip = "none",
+	.parent_check = false,
+	.rootdescend = false,
+	.heapallindexed = false,
+	.no_btree_expansion = false
+};
+
+static const char *progname = NULL;
+
+/* Whether all relations have so far passed their corruption checks */
+static bool all_checks_pass = true;
+
+/* Time last progress report was displayed */
+static pg_time_t last_progress_report = 0;
+static bool progress_since_last_stderr = false;
+
+typedef struct DatabaseInfo
+{
+	char	   *datname;
+	char	   *amcheck_schema; /* escaped, quoted literal */
+} DatabaseInfo;
+
+typedef struct RelationInfo
+{
+	const DatabaseInfo *datinfo;	/* shared by other relinfos */
+	Oid			reloid;
+	bool		is_heap;		/* true if heap, false if btree */
+	char	   *nspname;
+	char	   *relname;
+	int			relpages;
+	int			blocks_to_check;
+	char	   *sql;			/* set during query run, pg_free'd after */
+} RelationInfo;
+
+/*
+ * Query for determining if contrib's amcheck is installed.  If so, selects the
+ * namespace name where amcheck's functions can be found.
+ */
+static const char *amcheck_sql =
+"SELECT n.nspname, x.extversion FROM pg_catalog.pg_extension x"
+"\nJOIN pg_catalog.pg_namespace n ON x.extnamespace = n.oid"
+"\nWHERE x.extname = 'amcheck'";
+
+static void prepare_heap_command(PQExpBuffer sql, RelationInfo *rel,
+								 PGconn *conn);
+static void prepare_btree_command(PQExpBuffer sql, RelationInfo *rel,
+								  PGconn *conn);
+static void run_command(ParallelSlot *slot, const char *sql);
+static bool verify_heap_slot_handler(PGresult *res, PGconn *conn,
+									 void *context);
+static bool verify_btree_slot_handler(PGresult *res, PGconn *conn, void *context);
+static void help(const char *progname);
+static void progress_report(uint64 relations_total, uint64 relations_checked,
+							uint64 relpages_total, uint64 relpages_checked,
+							const char *datname, bool force, bool finished);
+
+static void append_database_pattern(PatternInfoArray *pia, const char *pattern,
+									int encoding);
+static void append_schema_pattern(PatternInfoArray *pia, const char *pattern,
+								  int encoding);
+static void append_relation_pattern(PatternInfoArray *pia, const char *pattern,
+									int encoding);
+static void append_heap_pattern(PatternInfoArray *pia, const char *pattern,
+								int encoding);
+static void append_btree_pattern(PatternInfoArray *pia, const char *pattern,
+								 int encoding);
+static void compile_database_list(PGconn *conn, SimplePtrList *databases,
+								  const char *initial_dbname);
+static void compile_relation_list_one_db(PGconn *conn, SimplePtrList *relations,
+										 const DatabaseInfo *datinfo,
+										 uint64 *pagecount);
+
+#define log_no_match(...) do { \
+		if (opts.strict_names) \
+			pg_log_generic(PG_LOG_ERROR, __VA_ARGS__); \
+		else \
+			pg_log_generic(PG_LOG_WARNING, __VA_ARGS__); \
+	} while(0)
+
+#define FREE_AND_SET_NULL(x) do { \
+	pg_free(x); \
+	(x) = NULL; \
+	} while (0)
+
+int
+main(int argc, char *argv[])
+{
+	PGconn	   *conn = NULL;
+	SimplePtrListCell *cell;
+	SimplePtrList databases = {NULL, NULL};
+	SimplePtrList relations = {NULL, NULL};
+	bool		failed = false;
+	const char *latest_datname;
+	int			parallel_workers;
+	ParallelSlotArray *sa;
+	PQExpBufferData sql;
+	uint64		reltotal = 0;
+	uint64		pageschecked = 0;
+	uint64		pagestotal = 0;
+	uint64		relprogress = 0;
+	int			pattern_id;
+
+	static struct option long_options[] = {
+		/* Connection options */
+		{"host", required_argument, NULL, 'h'},
+		{"port", required_argument, NULL, 'p'},
+		{"username", required_argument, NULL, 'U'},
+		{"no-password", no_argument, NULL, 'w'},
+		{"password", no_argument, NULL, 'W'},
+		{"maintenance-db", required_argument, NULL, 1},
+
+		/* check options */
+		{"all", no_argument, NULL, 'a'},
+		{"database", required_argument, NULL, 'd'},
+		{"exclude-database", required_argument, NULL, 'D'},
+		{"echo", no_argument, NULL, 'e'},
+		{"index", required_argument, NULL, 'i'},
+		{"exclude-index", required_argument, NULL, 'I'},
+		{"jobs", required_argument, NULL, 'j'},
+		{"progress", no_argument, NULL, 'P'},
+		{"quiet", no_argument, NULL, 'q'},
+		{"relation", required_argument, NULL, 'r'},
+		{"exclude-relation", required_argument, NULL, 'R'},
+		{"schema", required_argument, NULL, 's'},
+		{"exclude-schema", required_argument, NULL, 'S'},
+		{"table", required_argument, NULL, 't'},
+		{"exclude-table", required_argument, NULL, 'T'},
+		{"verbose", no_argument, NULL, 'v'},
+		{"no-dependent-indexes", no_argument, NULL, 2},
+		{"no-dependent-toast", no_argument, NULL, 3},
+		{"exclude-toast-pointers", no_argument, NULL, 4},
+		{"on-error-stop", no_argument, NULL, 5},
+		{"skip", required_argument, NULL, 6},
+		{"startblock", required_argument, NULL, 7},
+		{"endblock", required_argument, NULL, 8},
+		{"rootdescend", no_argument, NULL, 9},
+		{"no-strict-names", no_argument, NULL, 10},
+		{"heapallindexed", no_argument, NULL, 11},
+		{"parent-check", no_argument, NULL, 12},
+
+		{NULL, 0, NULL, 0}
+	};
+
+	int			optindex;
+	int			c;
+
+	const char *db = NULL;
+	const char *maintenance_db = NULL;
+
+	const char *host = NULL;
+	const char *port = NULL;
+	const char *username = NULL;
+	enum trivalue prompt_password = TRI_DEFAULT;
+	int			encoding = pg_get_encoding_from_locale(NULL, false);
+	ConnParams	cparams;
+
+	pg_logging_init(argv[0]);
+	progname = get_progname(argv[0]);
+	set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("contrib"));
+
+	handle_help_version_opts(argc, argv, progname, help);
+
+	/* process command-line options */
+	while ((c = getopt_long(argc, argv, "ad:D:eh:Hi:I:j:p:Pqr:R:s:S:t:T:U:wWv",
+							long_options, &optindex)) != -1)
+	{
+		char	   *endptr;
+
+		switch (c)
+		{
+			case 'a':
+				opts.alldb = true;
+				break;
+			case 'd':
+				append_database_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'D':
+				append_database_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 'e':
+				opts.echo = true;
+				break;
+			case 'h':
+				host = pg_strdup(optarg);
+				break;
+			case 'i':
+				opts.allrel = false;
+				append_btree_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'I':
+				opts.excludeidx = true;
+				append_btree_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 'j':
+				opts.jobs = atoi(optarg);
+				if (opts.jobs < 1)
+				{
+					fprintf(stderr,
+							"number of parallel jobs must be at least 1\n");
+					exit(1);
+				}
+				break;
+			case 'p':
+				port = pg_strdup(optarg);
+				break;
+			case 'P':
+				opts.show_progress = true;
+				break;
+			case 'q':
+				opts.quiet = true;
+				break;
+			case 'r':
+				opts.allrel = false;
+				append_relation_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'R':
+				opts.excludeidx = true;
+				opts.excludetbl = true;
+				append_relation_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 's':
+				opts.allrel = false;
+				append_schema_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'S':
+				opts.excludensp = true;
+				append_schema_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 't':
+				opts.allrel = false;
+				append_heap_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'T':
+				opts.excludetbl = true;
+				append_heap_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 'U':
+				username = pg_strdup(optarg);
+				break;
+			case 'w':
+				prompt_password = TRI_NO;
+				break;
+			case 'W':
+				prompt_password = TRI_YES;
+				break;
+			case 'v':
+				opts.verbose = true;
+				pg_logging_increase_verbosity();
+				break;
+			case 1:
+				maintenance_db = pg_strdup(optarg);
+				break;
+			case 2:
+				opts.no_btree_expansion = true;
+				break;
+			case 3:
+				opts.no_toast_expansion = true;
+				break;
+			case 4:
+				opts.reconcile_toast = false;
+				break;
+			case 5:
+				opts.on_error_stop = true;
+				break;
+			case 6:
+				if (pg_strcasecmp(optarg, "all-visible") == 0)
+					opts.skip = "all visible";
+				else if (pg_strcasecmp(optarg, "all-frozen") == 0)
+					opts.skip = "all frozen";
+				else
+				{
+					fprintf(stderr, "invalid skip option\n");
+					exit(1);
+				}
+				break;
+			case 7:
+				opts.startblock = strtol(optarg, &endptr, 10);
+				if (*endptr != '\0')
+				{
+					fprintf(stderr,
+							"invalid start block\n");
+					exit(1);
+				}
+				if (opts.startblock > MaxBlockNumber || opts.startblock < 0)
+				{
+					fprintf(stderr,
+							"start block out of bounds\n");
+					exit(1);
+				}
+				break;
+			case 8:
+				opts.endblock = strtol(optarg, &endptr, 10);
+				if (*endptr != '\0')
+				{
+					fprintf(stderr,
+							"invalid end block\n");
+					exit(1);
+				}
+				if (opts.endblock > MaxBlockNumber || opts.endblock < 0)
+				{
+					fprintf(stderr,
+							"end block out of bounds\n");
+					exit(1);
+				}
+				break;
+			case 9:
+				opts.rootdescend = true;
+				opts.parent_check = true;
+				break;
+			case 10:
+				opts.strict_names = false;
+				break;
+			case 11:
+				opts.heapallindexed = true;
+				break;
+			case 12:
+				opts.parent_check = true;
+				break;
+			default:
+				fprintf(stderr,
+						"Try \"%s --help\" for more information.\n",
+						progname);
+				exit(1);
+		}
+	}
+
+	if (opts.endblock >= 0 && opts.endblock < opts.startblock)
+	{
+		fprintf(stderr,
+				"end block precedes start block\n");
+		exit(1);
+	}
+
+	/*
+	 * A single non-option arguments specifies a database name or connection
+	 * string.
+	 */
+	if (optind < argc)
+	{
+		db = argv[optind];
+		optind++;
+	}
+
+	if (optind < argc)
+	{
+		pg_log_error("too many command-line arguments (first is \"%s\")",
+					 argv[optind]);
+		fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+		exit(1);
+	}
+
+	/* fill cparams except for dbname, which is set below */
+	cparams.pghost = host;
+	cparams.pgport = port;
+	cparams.pguser = username;
+	cparams.prompt_password = prompt_password;
+	cparams.override_dbname = NULL;
+
+	setup_cancel_handler(NULL);
+
+	/* choose the database for our initial connection */
+	if (opts.alldb)
+	{
+		/*
+		 * Prefer a maintenance_db argument over a database argument when
+		 * --all is specified, but don't ignore the database argument when no
+		 * maintenance_db was given.  This allows users to give a connection
+		 * string with --all, like `pg_amcheck --all "port=7777
+		 * sslmode=require"`.
+		 */
+		if (db != NULL && maintenance_db == NULL)
+			cparams.dbname = db;
+		else
+			cparams.dbname = maintenance_db;
+	}
+	else if (db != NULL)
+		cparams.dbname = db;
+	else
+	{
+		const char *default_db;
+
+		if (getenv("PGDATABASE"))
+			default_db = getenv("PGDATABASE");
+		else if (getenv("PGUSER"))
+			default_db = getenv("PGUSER");
+		else
+			default_db = get_user_name_or_exit(progname);
+
+		cparams.dbname = default_db;
+	}
+
+	if (opts.alldb)
+	{
+		conn = connectMaintenanceDatabase(&cparams, progname, opts.echo);
+		compile_database_list(conn, &databases, NULL);
+	}
+	else
+	{
+		conn = connectDatabase(&cparams, progname, opts.echo, false, true);
+		compile_database_list(conn, &databases, PQdb(conn));
+	}
+
+	if (databases.head == NULL)
+	{
+		if (conn != NULL)
+			disconnectDatabase(conn);
+		pg_log_error("no databases to check");
+		exit(1);
+	}
+
+	/*
+	 * Compile a list of all relations spanning all databases to be checked.
+	 */
+	for (cell = databases.head; cell; cell = cell->next)
+	{
+		PGresult   *result;
+		int			ntups;
+		const char *amcheck_schema = NULL;
+		DatabaseInfo *dat = (DatabaseInfo *) cell->ptr;
+
+		cparams.override_dbname = dat->datname;
+		if (conn == NULL || strcmp(PQdb(conn), dat->datname) != 0)
+		{
+			if (conn != NULL)
+				disconnectDatabase(conn);
+			conn = connectDatabase(&cparams, progname, opts.echo, false, true);
+		}
+
+		/*
+		 * Verify that amcheck is installed in this database.  User error
+		 * could mean that a database which should have amcheck does not,
+		 * but we could also be iterating over multiple databases where not
+		 * all of them have amcheck installed (for example, 'template1').
+		 */
+		result = executeQuery(conn, amcheck_sql, opts.echo);
+		if (PQresultStatus(result) != PGRES_TUPLES_OK)
+		{
+			/* Querying the catalog failed. */
+			pg_log_error("database \"%s\": %s",
+						 PQdb(conn), PQerrorMessage(conn));
+			pg_log_info("query was: %s", amcheck_sql);
+			PQclear(result);
+			disconnectDatabase(conn);
+			exit(1);
+		}
+		ntups = PQntuples(result);
+		if (ntups == 0)
+		{
+			/* Querying the catalog succeeded, but amcheck is missing. */
+			pg_log_warning("skipping database \"%s\": amcheck is not installed",
+						   PQdb(conn));
+			disconnectDatabase(conn);
+			conn = NULL;
+			continue;
+		}
+		amcheck_schema = PQgetvalue(result, 0, 0);
+		if (opts.verbose)
+			pg_log_info("in database \"%s\": using amcheck version \"%s\" in schema \"%s\"",
+						PQdb(conn), PQgetvalue(result, 0, 1), amcheck_schema);
+		dat->amcheck_schema = PQescapeIdentifier(conn, amcheck_schema,
+												 strlen(amcheck_schema));
+		PQclear(result);
+
+		compile_relation_list_one_db(conn, &relations, dat, &pagestotal);
+	}
+
+	/*
+	 * Check that all inclusion patterns matched at least one schema or
+	 * relation that we can check.
+	 */
+	for (pattern_id = 0; pattern_id < opts.include.len; pattern_id++)
+	{
+		PatternInfo *pat = &opts.include.data[pattern_id];
+
+		if (!pat->matched && (pat->nsp_regex != NULL || pat->rel_regex != NULL))
+		{
+			failed = opts.strict_names;
+
+			if (!opts.quiet || failed)
+			{
+				if (pat->heap_only)
+					log_no_match("no heap tables to check matching \"%s\"",
+								 pat->pattern);
+				else if (pat->btree_only)
+					log_no_match("no btree indexes to check matching \"%s\"",
+								 pat->pattern);
+				else if (pat->rel_regex == NULL)
+					log_no_match("no relations to check in schemas matching \"%s\"",
+								 pat->pattern);
+				else
+					log_no_match("no relations to check matching \"%s\"",
+								 pat->pattern);
+			}
+		}
+	}
+
+	if (failed)
+	{
+		if (conn != NULL)
+			disconnectDatabase(conn);
+		exit(1);
+	}
+
+	/*
+	 * Set parallel_workers to the lesser of opts.jobs and the number of
+	 * relations.
+	 */
+	parallel_workers = 0;
+	for (cell = relations.head; cell; cell = cell->next)
+	{
+		reltotal++;
+		if (parallel_workers < opts.jobs)
+			parallel_workers++;
+	}
+
+	if (reltotal == 0)
+	{
+		if (conn != NULL)
+			disconnectDatabase(conn);
+		pg_log_error("no relations to check");
+		exit(1);
+	}
+	progress_report(reltotal, relprogress, pagestotal, pageschecked, NULL, true, false);
+
+	/*
+	 * Main event loop.
+	 *
+	 * We use server-side parallelism to check up to parallel_workers
+	 * relations in parallel.  The list of relations was computed in database
+	 * order, which minimizes the number of connects and disconnects as we
+	 * process the list.
+	 */
+	latest_datname = NULL;
+	sa = ParallelSlotsSetup(parallel_workers, &cparams, progname, opts.echo,
+							NULL);
+	if (conn != NULL)
+	{
+		ParallelSlotsAdoptConn(sa, conn);
+		conn = NULL;
+	}
+
+	initPQExpBuffer(&sql);
+	for (relprogress = 0, cell = relations.head; cell; cell = cell->next)
+	{
+		ParallelSlot *free_slot;
+		RelationInfo *rel;
+
+		rel = (RelationInfo *) cell->ptr;
+
+		if (CancelRequested)
+		{
+			failed = true;
+			break;
+		}
+
+		/*
+		 * The list of relations is in database sorted order.  If this next
+		 * relation is in a different database than the last one seen, we are
+		 * about to start checking this database.  Note that other slots may
+		 * still be working on relations from prior databases.
+		 */
+		latest_datname = rel->datinfo->datname;
+
+		progress_report(reltotal, relprogress, pagestotal, pageschecked, latest_datname, false, false);
+
+		relprogress++;
+		pageschecked += rel->blocks_to_check;
+
+		/*
+		 * Get a parallel slot for the next amcheck command, blocking if
+		 * necessary until one is available, or until a previously issued slot
+		 * command fails, indicating that we should abort checking the
+		 * remaining objects.
+		 */
+		free_slot = ParallelSlotsGetIdle(sa, rel->datinfo->datname);
+		if (!free_slot)
+		{
+			/*
+			 * Something failed.  We don't need to know what it was, because
+			 * the handler should already have emitted the necessary error
+			 * messages.
+			 */
+			failed = true;
+			break;
+		}
+
+		if (opts.verbose)
+			PQsetErrorVerbosity(free_slot->connection, PQERRORS_VERBOSE);
+		else if (opts.quiet)
+			PQsetErrorVerbosity(free_slot->connection, PQERRORS_TERSE);
+
+		/*
+		 * Execute the appropriate amcheck command for this relation using our
+		 * slot's database connection.  We do not wait for the command to
+		 * complete, nor do we perform any error checking, as that is done by
+		 * the parallel slots and our handler callback functions.
+		 */
+		if (rel->is_heap)
+		{
+			if (opts.verbose)
+			{
+				if (opts.show_progress && progress_since_last_stderr)
+					fprintf(stderr, "\n");
+				pg_log_info("checking heap table \"%s\".\"%s\".\"%s\"",
+							rel->datinfo->datname, rel->nspname, rel->relname);
+				progress_since_last_stderr = false;
+			}
+			prepare_heap_command(&sql, rel, free_slot->connection);
+			rel->sql = pstrdup(sql.data);	/* pg_free'd after command */
+			ParallelSlotSetHandler(free_slot, verify_heap_slot_handler, rel);
+			run_command(free_slot, rel->sql);
+		}
+		else
+		{
+			if (opts.verbose)
+			{
+				if (opts.show_progress && progress_since_last_stderr)
+					fprintf(stderr, "\n");
+
+				pg_log_info("checking btree index \"%s\".\"%s\".\"%s\"",
+							rel->datinfo->datname, rel->nspname, rel->relname);
+				progress_since_last_stderr = false;
+			}
+			prepare_btree_command(&sql, rel, free_slot->connection);
+			rel->sql = pstrdup(sql.data);	/* pg_free'd after command */
+			ParallelSlotSetHandler(free_slot, verify_btree_slot_handler, rel);
+			run_command(free_slot, rel->sql);
+		}
+	}
+	termPQExpBuffer(&sql);
+
+	if (!failed)
+	{
+
+		/*
+		 * Wait for all slots to complete, or for one to indicate that an
+		 * error occurred.  Like above, we rely on the handler emitting the
+		 * necessary error messages.
+		 */
+		if (sa && !ParallelSlotsWaitCompletion(sa))
+			failed = true;
+
+		progress_report(reltotal, relprogress, pagestotal, pageschecked, NULL, true, true);
+	}
+
+	if (sa)
+	{
+		ParallelSlotsTerminate(sa);
+		FREE_AND_SET_NULL(sa);
+	}
+
+	if (failed)
+		exit(1);
+
+	if (!all_checks_pass)
+		exit(2);
+}
+
+/*
+ * prepare_heap_command
+ *
+ * Creates a SQL command for running amcheck checking on the given heap
+ * relation.  The command is phrased as a SQL query, with column order and
+ * names matching the expectations of verify_heap_slot_handler, which will
+ * receive and handle each row returned from the verify_heapam() function.
+ *
+ * sql: buffer into which the heap table checking command will be written
+ * rel: relation information for the heap table to be checked
+ * conn: the connection to be used, for string escaping purposes
+ */
+static void
+prepare_heap_command(PQExpBuffer sql, RelationInfo *rel, PGconn *conn)
+{
+	resetPQExpBuffer(sql);
+	appendPQExpBuffer(sql,
+					  "SELECT blkno, offnum, attnum, msg FROM %s.verify_heapam("
+					  "\nrelation := %u, on_error_stop := %s, check_toast := %s, skip := '%s'",
+					  rel->datinfo->amcheck_schema,
+					  rel->reloid,
+					  opts.on_error_stop ? "true" : "false",
+					  opts.reconcile_toast ? "true" : "false",
+					  opts.skip);
+
+	if (opts.startblock >= 0)
+		appendPQExpBuffer(sql, ", startblock := " INT64_FORMAT, opts.startblock);
+	if (opts.endblock >= 0)
+		appendPQExpBuffer(sql, ", endblock := " INT64_FORMAT, opts.endblock);
+
+	appendPQExpBufferStr(sql, ")");
+}
+
+/*
+ * prepare_btree_command
+ *
+ * Creates a SQL command for running amcheck checking on the given btree index
+ * relation.  The command does not select any columns, as btree checking
+ * functions do not return any, but rather return corruption information by
+ * raising errors, which verify_btree_slot_handler expects.
+ *
+ * sql: buffer into which the btree index checking command will be written
+ * rel: relation information for the index to be checked
+ * conn: the connection to be used, for string escaping purposes
+ */
+static void
+prepare_btree_command(PQExpBuffer sql, RelationInfo *rel, PGconn *conn)
+{
+	resetPQExpBuffer(sql);
+
+	/*
+	 * The query identifies the index by OID only, so any error the server
+	 * raises will not name the index; verify_btree_slot_handler prefixes its
+	 * output with the database, schema, and relation name instead.
+	 */
+	if (opts.parent_check)
+		appendPQExpBuffer(sql,
+						  "SELECT * FROM %s.bt_index_parent_check("
+						  "index := '%u'::regclass, heapallindexed := %s, "
+						  "rootdescend := %s)",
+						  rel->datinfo->amcheck_schema,
+						  rel->reloid,
+						  (opts.heapallindexed ? "true" : "false"),
+						  (opts.rootdescend ? "true" : "false"));
+	else
+		appendPQExpBuffer(sql,
+						  "SELECT * FROM %s.bt_index_check("
+						  "index := '%u'::regclass, heapallindexed := %s)",
+						  rel->datinfo->amcheck_schema,
+						  rel->reloid,
+						  (opts.heapallindexed ? "true" : "false"));
+}
+
+/*
+ * run_command
+ *
+ * Sends a command to the server without waiting for the command to complete.
+ * Logs an error if the command cannot be sent, but otherwise any errors are
+ * expected to be handled by a ParallelSlotHandler.
+ *
+ * slot: slot with connection to the server we should use for the command
+ * sql: query to send
+ */
+static void
+run_command(ParallelSlot *slot, const char *sql)
+{
+	if (opts.echo)
+		printf("%s\n", sql);
+
+	if (PQsendQuery(slot->connection, sql) == 0)
+	{
+		pg_log_error("error sending command to database \"%s\": %s",
+					 PQdb(slot->connection),
+					 PQerrorMessage(slot->connection));
+		pg_log_error("command was: %s", sql);
+		exit(1);
+	}
+}
+
+/*
+ * should_processing_continue
+ *
+ * Checks the result of a query (presumably issued on a slot's connection) to
+ * determine whether the parallel slots should continue issuing further
+ * commands.
+ *
+ * Note: Heap relation corruption is reported by verify_heapam() via the result
+ * set, rather than an ERROR, but running verify_heapam() on a corrupted heap
+ * table may still result in an error being returned from the server due to
+ * missing relation files, bad checksums, etc.  The btree corruption checking
+ * functions always use errors to communicate corruption messages.  We can't
+ * just abort processing because we got a mere ERROR.
+ *
+ * res: result from an executed sql query
+ */
+static bool
+should_processing_continue(PGresult *res)
+{
+	const char *severity;
+
+	switch (PQresultStatus(res))
+	{
+			/* These are expected and ok */
+		case PGRES_COMMAND_OK:
+		case PGRES_TUPLES_OK:
+		case PGRES_NONFATAL_ERROR:
+			break;
+
+			/* This is expected but requires closer scrutiny */
+		case PGRES_FATAL_ERROR:
+			severity = PQresultErrorField(res, PG_DIAG_SEVERITY_NONLOCALIZED);
+			if (severity != NULL &&
+				(strcmp(severity, "FATAL") == 0 ||
+				 strcmp(severity, "PANIC") == 0))
+				return false;
+			break;
+
+			/* These are unexpected */
+		case PGRES_BAD_RESPONSE:
+		case PGRES_EMPTY_QUERY:
+		case PGRES_COPY_OUT:
+		case PGRES_COPY_IN:
+		case PGRES_COPY_BOTH:
+		case PGRES_SINGLE_TUPLE:
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Returns a copy of the argument string with all lines indented four spaces.
+ *
+ * The caller should pg_free the result when finished with it.
+ */
+static char *
+indent_lines(const char *str)
+{
+	PQExpBufferData buf;
+	const char *c;
+	char	   *result;
+
+	initPQExpBuffer(&buf);
+	appendPQExpBufferStr(&buf, "    ");
+	for (c = str; *c; c++)
+	{
+		appendPQExpBufferChar(&buf, *c);
+		if (c[0] == '\n' && c[1] != '\0')
+			appendPQExpBufferStr(&buf, "    ");
+	}
+	result = pstrdup(buf.data);
+	termPQExpBuffer(&buf);
+
+	return result;
+}
+
+/*
+ * verify_heap_slot_handler
+ *
+ * ParallelSlotHandler that receives results from a heap table checking command
+ * created by prepare_heap_command and outputs the results for the user.
+ *
+ * res: result from an executed sql query
+ * conn: connection on which the sql query was executed
+ * context: the RelationInfo for the table being checked
+ */
+static bool
+verify_heap_slot_handler(PGresult *res, PGconn *conn, void *context)
+{
+	RelationInfo *rel = (RelationInfo *) context;
+
+	if (PQresultStatus(res) == PGRES_TUPLES_OK)
+	{
+		int			i;
+		int			ntups = PQntuples(res);
+
+		if (ntups > 0)
+			all_checks_pass = false;
+
+		for (i = 0; i < ntups; i++)
+		{
+			const char *msg;
+
+			/* The message string should never be null, but check */
+			if (PQgetisnull(res, i, 3))
+				msg = "NO MESSAGE";
+			else
+				msg = PQgetvalue(res, i, 3);
+
+			if (!PQgetisnull(res, i, 2))
+				printf("heap table \"%s\".\"%s\".\"%s\", block %s, offset %s, attribute %s:\n    %s\n",
+					   rel->datinfo->datname, rel->nspname, rel->relname,
+					   PQgetvalue(res, i, 0),	/* blkno */
+					   PQgetvalue(res, i, 1),	/* offnum */
+					   PQgetvalue(res, i, 2),	/* attnum */
+					   msg);
+
+			else if (!PQgetisnull(res, i, 1))
+				printf("heap table \"%s\".\"%s\".\"%s\", block %s, offset %s:\n    %s\n",
+					   rel->datinfo->datname, rel->nspname, rel->relname,
+					   PQgetvalue(res, i, 0),	/* blkno */
+					   PQgetvalue(res, i, 1),	/* offnum */
+					   msg);
+
+			else if (!PQgetisnull(res, i, 0))
+				printf("heap table \"%s\".\"%s\".\"%s\", block %s:\n    %s\n",
+					   rel->datinfo->datname, rel->nspname, rel->relname,
+					   PQgetvalue(res, i, 0),	/* blkno */
+					   msg);
+
+			else
+				printf("heap table \"%s\".\"%s\".\"%s\":\n    %s\n",
+					   rel->datinfo->datname, rel->nspname, rel->relname, msg);
+		}
+	}
+	else
+	{
+		char	   *msg = indent_lines(PQerrorMessage(conn));
+
+		all_checks_pass = false;
+		printf("heap table \"%s\".\"%s\".\"%s\":\n%s",
+			   rel->datinfo->datname, rel->nspname, rel->relname, msg);
+		if (opts.verbose)
+			printf("query was: %s\n", rel->sql);
+		FREE_AND_SET_NULL(msg);
+	}
+
+	FREE_AND_SET_NULL(rel->sql);
+	FREE_AND_SET_NULL(rel->nspname);
+	FREE_AND_SET_NULL(rel->relname);
+
+	return should_processing_continue(res);
+}
+
+/*
+ * verify_btree_slot_handler
+ *
+ * ParallelSlotHandler that receives results from a btree checking command
+ * created by prepare_btree_command and outputs them for the user.  The btree
+ * checking functions are expected to return an empty result set; when they
+ * instead raise an error, the useful information about the corruption is
+ * found in the connection's error message.
+ *
+ * res: result from an executed sql query
+ * conn: connection on which the sql query was executed
+ * context: the RelationInfo for the index being checked
+ */
+static bool
+verify_btree_slot_handler(PGresult *res, PGconn *conn, void *context)
+{
+	RelationInfo *rel = (RelationInfo *) context;
+
+	if (PQresultStatus(res) == PGRES_TUPLES_OK)
+	{
+		int			ntups = PQntuples(res);
+
+		if (ntups != 1)
+		{
+			/*
+			 * We expect the btree checking functions to return one void row
+			 * each, so we should output some sort of warning if we get
+			 * anything else, not because it indicates corruption, but because
+			 * it suggests a mismatch between amcheck and pg_amcheck versions.
+			 *
+			 * In conjunction with --progress, anything written to stderr at
+			 * this time would present strangely to the user without an extra
+			 * newline, so we print one.  If we were multithreaded, we'd have
+			 * to avoid splitting this across multiple calls, but we're in an
+			 * event loop, so it doesn't matter.
+			 */
+			if (opts.show_progress && progress_since_last_stderr)
+				fprintf(stderr, "\n");
+			pg_log_warning("btree index \"%s\".\"%s\".\"%s\": btree checking function returned unexpected number of rows: %d",
+						   rel->datinfo->datname, rel->nspname, rel->relname, ntups);
+			if (opts.verbose)
+				pg_log_info("query was: %s", rel->sql);
+			pg_log_warning("are %s's and amcheck's versions compatible?",
+						   progname);
+			progress_since_last_stderr = false;
+		}
+	}
+	else
+	{
+		char	   *msg = indent_lines(PQerrorMessage(conn));
+
+		all_checks_pass = false;
+		printf("btree index \"%s\".\"%s\".\"%s\":\n%s",
+			   rel->datinfo->datname, rel->nspname, rel->relname, msg);
+		if (opts.verbose)
+			printf("query was: %s\n", rel->sql);
+		FREE_AND_SET_NULL(msg);
+	}
+
+	FREE_AND_SET_NULL(rel->sql);
+	FREE_AND_SET_NULL(rel->nspname);
+	FREE_AND_SET_NULL(rel->relname);
+
+	return should_processing_continue(res);
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_amcheck"
+ */
+static void
+help(const char *progname)
+{
+	printf("%s uses the amcheck module to check objects in a PostgreSQL database for corruption.\n\n", progname);
+	printf("Usage:\n");
+	printf("  %s [OPTION]... [DBNAME]\n", progname);
+	printf("\nTarget Options:\n");
+	printf("  -a, --all                      check all databases\n");
+	printf("  -d, --database=PATTERN         check matching database(s)\n");
+	printf("  -D, --exclude-database=PATTERN do NOT check matching database(s)\n");
+	printf("  -i, --index=PATTERN            check matching index(es)\n");
+	printf("  -I, --exclude-index=PATTERN    do NOT check matching index(es)\n");
+	printf("  -r, --relation=PATTERN         check matching relation(s)\n");
+	printf("  -R, --exclude-relation=PATTERN do NOT check matching relation(s)\n");
+	printf("  -s, --schema=PATTERN           check matching schema(s)\n");
+	printf("  -S, --exclude-schema=PATTERN   do NOT check matching schema(s)\n");
+	printf("  -t, --table=PATTERN            check matching table(s)\n");
+	printf("  -T, --exclude-table=PATTERN    do NOT check matching table(s)\n");
+	printf("      --no-dependent-indexes     do NOT expand list of relations to include indexes\n");
+	printf("      --no-dependent-toast       do NOT expand list of relations to include toast\n");
+	printf("      --no-strict-names          do NOT require patterns to match objects\n");
+	printf("\nTable Checking Options:\n");
+	printf("      --exclude-toast-pointers   do NOT follow relation toast pointers\n");
+	printf("      --on-error-stop            stop checking at end of first corrupt page\n");
+	printf("      --skip=OPTION              do NOT check \"all-frozen\" or \"all-visible\" blocks\n");
+	printf("      --startblock=BLOCK         begin checking table(s) at the given block number\n");
+	printf("      --endblock=BLOCK           check table(s) only up to the given block number\n");
+	printf("\nBtree Index Checking Options:\n");
+	printf("      --heapallindexed           check all heap tuples are found within indexes\n");
+	printf("      --parent-check             check index parent/child relationships\n");
+	printf("      --rootdescend              search from root page to refind tuples\n");
+	printf("\nConnection options:\n");
+	printf("  -h, --host=HOSTNAME            database server host or socket directory\n");
+	printf("  -p, --port=PORT                database server port\n");
+	printf("  -U, --username=USERNAME        user name to connect as\n");
+	printf("  -w, --no-password              never prompt for password\n");
+	printf("  -W, --password                 force password prompt\n");
+	printf("      --maintenance-db=DBNAME    alternate maintenance database\n");
+	printf("\nOther Options:\n");
+	printf("  -e, --echo                     show the commands being sent to the server\n");
+	printf("  -j, --jobs=NUM                 use this many concurrent connections to the server\n");
+	printf("  -q, --quiet                    don't write any messages\n");
+	printf("  -v, --verbose                  write a lot of output\n");
+	printf("  -V, --version                  output version information, then exit\n");
+	printf("  -P, --progress                 show progress information\n");
+	printf("  -?, --help                     show this help, then exit\n");
+
+	printf("\nReport bugs to <%s>.\n", PACKAGE_BUGREPORT);
+	printf("%s home page: <%s>\n", PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Print a progress report based on the global variables.
+ *
+ * The progress report is written at most once per second, unless the force
+ * parameter is set to true.
+ *
+ * If finished is set to true, this is the last progress report. The cursor
+ * is moved to the next line.
+ */
+static void
+progress_report(uint64 relations_total, uint64 relations_checked,
+				uint64 relpages_total, uint64 relpages_checked,
+				const char *datname, bool force, bool finished)
+{
+	int			percent_rel = 0;
+	int			percent_pages = 0;
+	char		checked_rel[32];
+	char		total_rel[32];
+	char		checked_pages[32];
+	char		total_pages[32];
+	pg_time_t	now;
+
+	if (!opts.show_progress)
+		return;
+
+	now = time(NULL);
+	if (now == last_progress_report && !force && !finished)
+		return;					/* Max once per second */
+
+	last_progress_report = now;
+	if (relations_total)
+		percent_rel = (int) (relations_checked * 100 / relations_total);
+	if (relpages_total)
+		percent_pages = (int) (relpages_checked * 100 / relpages_total);
+
+	/*
+	 * Separate step to keep platform-dependent format code out of fprintf
+	 * calls.  We only test for INT64_FORMAT availability in snprintf, not
+	 * fprintf.
+	 */
+	snprintf(checked_rel, sizeof(checked_rel), INT64_FORMAT, relations_checked);
+	snprintf(total_rel, sizeof(total_rel), INT64_FORMAT, relations_total);
+	snprintf(checked_pages, sizeof(checked_pages), INT64_FORMAT, relpages_checked);
+	snprintf(total_pages, sizeof(total_pages), INT64_FORMAT, relpages_total);
+
+#define VERBOSE_DATNAME_LENGTH 35
+	if (opts.verbose)
+	{
+		if (!datname)
+
+			/*
+			 * No datname given, so clear the status line (used for first and
+			 * last call)
+			 */
+			fprintf(stderr,
+					"%*s/%s relations (%d%%) %*s/%s pages (%d%%) %*s",
+					(int) strlen(total_rel),
+					checked_rel, total_rel, percent_rel,
+					(int) strlen(total_pages),
+					checked_pages, total_pages, percent_pages,
+					VERBOSE_DATNAME_LENGTH + 2, "");
+		else
+		{
+			bool		truncate = (strlen(datname) > VERBOSE_DATNAME_LENGTH);
+
+			fprintf(stderr,
+					"%*s/%s relations (%d%%) %*s/%s pages (%d%%), (%s%-*.*s)",
+					(int) strlen(total_rel),
+					checked_rel, total_rel, percent_rel,
+					(int) strlen(total_pages),
+					checked_pages, total_pages, percent_pages,
+			/* Prefix with "..." if we do leading truncation */
+					truncate ? "..." : "",
+					truncate ? VERBOSE_DATNAME_LENGTH - 3 : VERBOSE_DATNAME_LENGTH,
+					truncate ? VERBOSE_DATNAME_LENGTH - 3 : VERBOSE_DATNAME_LENGTH,
+			/* Truncate datname at beginning if it's too long */
+					truncate ? datname + strlen(datname) - VERBOSE_DATNAME_LENGTH + 3 : datname);
+		}
+	}
+	else
+		fprintf(stderr,
+				"%*s/%s relations (%d%%) %*s/%s pages (%d%%)",
+				(int) strlen(total_rel),
+				checked_rel, total_rel, percent_rel,
+				(int) strlen(total_pages),
+				checked_pages, total_pages, percent_pages);
+
+	/*
+	 * Stay on the same line if reporting to a terminal and we're not done
+	 * yet.
+	 */
+	if (!finished && isatty(fileno(stderr)))
+	{
+		fputc('\r', stderr);
+		progress_since_last_stderr = true;
+	}
+	else
+		fputc('\n', stderr);
+}
+
+/*
+ * Extend the pattern info array to hold one additional initialized pattern
+ * info entry.
+ *
+ * Returns a pointer to the new entry.
+ */
+static PatternInfo *
+extend_pattern_info_array(PatternInfoArray *pia)
+{
+	PatternInfo *result;
+
+	pia->len++;
+	pia->data = (PatternInfo *) pg_realloc(pia->data, pia->len * sizeof(PatternInfo));
+	result = &pia->data[pia->len - 1];
+	memset(result, 0, sizeof(*result));
+
+	return result;
+}
+
+/*
+ * append_database_pattern
+ *
+ * Adds the given pattern interpreted as a database name pattern.
+ *
+ * pia: the pattern info array to be appended
+ * pattern: the database name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_database_pattern(PatternInfoArray *pia, const char *pattern, int encoding)
+{
+	PQExpBufferData buf;
+	PatternInfo *info = extend_pattern_info_array(pia);
+
+	initPQExpBuffer(&buf);
+	patternToSQLRegex(encoding, NULL, NULL, &buf, pattern, false);
+	info->pattern = pattern;
+	info->db_regex = pstrdup(buf.data);
+
+	termPQExpBuffer(&buf);
+}
+
+/*
+ * append_schema_pattern
+ *
+ * Adds the given pattern interpreted as a schema name pattern.
+ *
+ * pia: the pattern info array to be appended
+ * pattern: the schema name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_schema_pattern(PatternInfoArray *pia, const char *pattern, int encoding)
+{
+	PQExpBufferData buf;
+	PatternInfo *info = extend_pattern_info_array(pia);
+
+	initPQExpBuffer(&buf);
+	patternToSQLRegex(encoding, NULL, NULL, &buf, pattern, false);
+	info->pattern = pattern;
+	info->nsp_regex = pstrdup(buf.data);
+	termPQExpBuffer(&buf);
+}
+
+/*
+ * append_relation_pattern_helper
+ *
+ * Adds the given pattern, interpreted as a relation pattern, to the array.
+ *
+ * pia: the pattern info array to be appended
+ * pattern: the relation name pattern
+ * encoding: client encoding for parsing the pattern
+ * heap_only: whether the pattern should only be matched against heap tables
+ * btree_only: whether the pattern should only be matched against btree indexes
+ */
+static void
+append_relation_pattern_helper(PatternInfoArray *pia, const char *pattern,
+							   int encoding, bool heap_only, bool btree_only)
+{
+	PQExpBufferData dbbuf;
+	PQExpBufferData nspbuf;
+	PQExpBufferData relbuf;
+	PatternInfo *info = extend_pattern_info_array(pia);
+
+	initPQExpBuffer(&dbbuf);
+	initPQExpBuffer(&nspbuf);
+	initPQExpBuffer(&relbuf);
+
+	patternToSQLRegex(encoding, &dbbuf, &nspbuf, &relbuf, pattern, false);
+	info->pattern = pattern;
+	if (dbbuf.data[0])
+		info->db_regex = pstrdup(dbbuf.data);
+	if (nspbuf.data[0])
+		info->nsp_regex = pstrdup(nspbuf.data);
+	if (relbuf.data[0])
+		info->rel_regex = pstrdup(relbuf.data);
+
+	termPQExpBuffer(&dbbuf);
+	termPQExpBuffer(&nspbuf);
+	termPQExpBuffer(&relbuf);
+
+	info->heap_only = heap_only;
+	info->btree_only = btree_only;
+}
+
+/*
+ * append_relation_pattern
+ *
+ * Adds the given pattern interpreted as a relation pattern, to be matched
+ * against both heap tables and btree indexes.
+ *
+ * pia: the pattern info array to be appended
+ * pattern: the relation name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_relation_pattern(PatternInfoArray *pia, const char *pattern, int encoding)
+{
+	append_relation_pattern_helper(pia, pattern, encoding, false, false);
+}
+
+/*
+ * append_heap_pattern
+ *
+ * Adds the given pattern interpreted as a relation pattern, to be matched only
+ * against heap tables.
+ *
+ * pia: the pattern info array to be appended
+ * pattern: the relation name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_heap_pattern(PatternInfoArray *pia, const char *pattern, int encoding)
+{
+	append_relation_pattern_helper(pia, pattern, encoding, true, false);
+}
+
+/*
+ * append_btree_pattern
+ *
+ * Adds the given pattern interpreted as a relation pattern, to be matched only
+ * against btree indexes.
+ *
+ * pia: the pattern info array to be appended
+ * pattern: the relation name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_btree_pattern(PatternInfoArray *pia, const char *pattern, int encoding)
+{
+	append_relation_pattern_helper(pia, pattern, encoding, false, true);
+}
+
+/*
+ * append_db_pattern_cte
+ *
+ * Appends to the buffer the body of a Common Table Expression (CTE) containing
+ * the database portions filtered from the list of patterns expressed as two
+ * columns:
+ *
+ *     pattern_id: the index of this pattern in pia->data[]
+ *     rgx: the database regular expression parsed from the pattern
+ *
+ * Patterns without a database portion are skipped.  Patterns with more than
+ * just a database portion are optionally skipped, depending on argument
+ * 'inclusive'.
+ *
+ * buf: the buffer to be appended
+ * pia: the array of patterns to be inserted into the CTE
+ * conn: the database connection
+ * inclusive: whether to include patterns with schema and/or relation parts
+ *
+ * Returns whether any database patterns were appended.
+ */
+static bool
+append_db_pattern_cte(PQExpBuffer buf, const PatternInfoArray *pia,
+					  PGconn *conn, bool inclusive)
+{
+	int			pattern_id;
+	const char *comma;
+	bool		have_values;
+
+	comma = "";
+	have_values = false;
+	for (pattern_id = 0; pattern_id < pia->len; pattern_id++)
+	{
+		PatternInfo *info = &pia->data[pattern_id];
+
+		if (info->db_regex != NULL &&
+			(inclusive || (info->nsp_regex == NULL && info->rel_regex == NULL)))
+		{
+			if (!have_values)
+				appendPQExpBufferStr(buf, "\nVALUES");
+			have_values = true;
+			appendPQExpBuffer(buf, "%s\n(%d, ", comma, pattern_id);
+			appendStringLiteralConn(buf, info->db_regex, conn);
+			appendPQExpBufferStr(buf, ")");
+			comma = ",";
+		}
+	}
+
+	if (!have_values)
+		appendPQExpBufferStr(buf, "\nSELECT NULL, NULL WHERE false");
+
+	return have_values;
+}
+
+/*
+ * compile_database_list
+ *
+ * If any database patterns exist, or if --all was given, compiles a distinct
+ * list of databases to check using a SQL query based on the patterns plus the
+ * literal initial database name, if given.  If no database patterns exist and
+ * --all was not given, the query is not necessary, and only the initial
+ * database name (if any) is added to the list.
+ *
+ * conn: connection to the initial database
+ * databases: the list onto which databases should be appended
+ * initial_dbname: an optional extra database name to include in the list
+ */
+static void
+compile_database_list(PGconn *conn, SimplePtrList *databases,
+					  const char *initial_dbname)
+{
+	PGresult   *res;
+	PQExpBufferData sql;
+	int			ntups;
+	int			i;
+	bool		fatal;
+
+	if (initial_dbname)
+	{
+		DatabaseInfo *dat = (DatabaseInfo *) pg_malloc0(sizeof(DatabaseInfo));
+
+		/* This database is included.  Add to list */
+		if (opts.verbose)
+			pg_log_info("including database: \"%s\"", initial_dbname);
+
+		dat->datname = pstrdup(initial_dbname);
+		simple_ptr_list_append(databases, dat);
+	}
+
+	initPQExpBuffer(&sql);
+
+	/* Append the include patterns CTE. */
+	appendPQExpBufferStr(&sql, "WITH include_raw (pattern_id, rgx) AS (");
+	if (!append_db_pattern_cte(&sql, &opts.include, conn, true) &&
+		!opts.alldb)
+	{
+		/*
+		 * None of the inclusion patterns (if any) contain database portions,
+		 * so there is no need to query the database to resolve database
+		 * patterns.
+		 *
+		 * Since we're also not operating under --all, we don't need to query
+		 * the exhaustive list of connectable databases, either.
+		 */
+		termPQExpBuffer(&sql);
+		return;
+	}
+
+	/* Append the exclude patterns CTE. */
+	appendPQExpBufferStr(&sql, "),\nexclude_raw (pattern_id, rgx) AS (");
+	append_db_pattern_cte(&sql, &opts.exclude, conn, false);
+	appendPQExpBufferStr(&sql, "),");
+
+	/*
+	 * Append the database CTE, which includes whether each database is
+	 * connectable and also joins against exclude_raw to determine whether
+	 * each database is excluded.
+	 */
+	appendPQExpBufferStr(&sql,
+						 "\ndatabase (datname) AS ("
+						 "\nSELECT d.datname "
+						 "FROM pg_catalog.pg_database d "
+						 "LEFT OUTER JOIN exclude_raw e "
+						 "ON d.datname ~ e.rgx "
+						 "\nWHERE d.datallowconn "
+						 "AND e.pattern_id IS NULL"
+						 "),"
+
+	/*
+	 * Append the include_pat CTE, which joins the include_raw CTE against the
+	 * database CTE to determine if all the inclusion patterns had matches,
+	 * and whether each matched pattern had the misfortune of only matching
+	 * excluded or unconnectable databases.
+	 */
+						 "\ninclude_pat (pattern_id, checkable) AS ("
+						 "\nSELECT i.pattern_id, "
+						 "COUNT(*) FILTER ("
+						 "WHERE d IS NOT NULL"
+						 ") AS checkable"
+						 "\nFROM include_raw i "
+						 "LEFT OUTER JOIN database d "
+						 "ON d.datname ~ i.rgx"
+						 "\nGROUP BY i.pattern_id"
+						 "),"
+
+	/*
+	 * Append the filtered_databases CTE, which selects from the database CTE
+	 * optionally joined against the include_raw CTE to only select databases
+	 * that match an inclusion pattern.  This appears to duplicate what the
+	 * include_pat CTE already did above, but here we want only databases, and
+	 * there we wanted patterns.
+	 */
+						 "\nfiltered_databases (datname) AS ("
+						 "\nSELECT DISTINCT d.datname "
+						 "FROM database d");
+	if (!opts.alldb)
+		appendPQExpBufferStr(&sql,
+							 " INNER JOIN include_raw i "
+							 "ON d.datname ~ i.rgx");
+	appendPQExpBufferStr(&sql,
+						 ")"
+
+	/*
+	 * Select the checkable databases and the unmatched inclusion patterns.
+	 */
+						 "\nSELECT pattern_id, datname FROM ("
+						 "\nSELECT pattern_id, NULL::TEXT AS datname "
+						 "FROM include_pat "
+						 "WHERE checkable = 0 "
+						 "UNION ALL"
+						 "\nSELECT NULL, datname "
+						 "FROM filtered_databases"
+						 ") AS combined_records"
+						 "\nORDER BY pattern_id NULLS LAST, datname");
+
+	res = executeQuery(conn, sql.data, opts.echo);
+	if (PQresultStatus(res) != PGRES_TUPLES_OK)
+	{
+		pg_log_error("query failed: %s", PQerrorMessage(conn));
+		pg_log_info("query was: %s", sql.data);
+		disconnectDatabase(conn);
+		exit(1);
+	}
+	termPQExpBuffer(&sql);
+
+	ntups = PQntuples(res);
+	for (fatal = false, i = 0; i < ntups; i++)
+	{
+		int			pattern_id = -1;
+		const char *datname = NULL;
+
+		if (!PQgetisnull(res, i, 0))
+			pattern_id = atoi(PQgetvalue(res, i, 0));
+		if (!PQgetisnull(res, i, 1))
+			datname = PQgetvalue(res, i, 1);
+
+		if (pattern_id >= 0)
+		{
+			/*
+			 * Current record pertains to an inclusion pattern that matched no
+			 * checkable databases.
+			 */
+			fatal = opts.strict_names;
+			if (pattern_id >= opts.include.len)
+			{
+				pg_log_error("internal error: received unexpected database pattern_id %d",
+							 pattern_id);
+				exit(1);
+			}
+			log_no_match("no connectable databases to check matching \"%s\"",
+						 opts.include.data[pattern_id].pattern);
+		}
+		else
+		{
+			/* Current record pertains to a database */
+			Assert(datname != NULL);
+
+			/* Avoid entering a duplicate entry matching the initial_dbname */
+			if (initial_dbname != NULL && strcmp(initial_dbname, datname) == 0)
+				continue;
+
+			DatabaseInfo *dat = (DatabaseInfo *) pg_malloc0(sizeof(DatabaseInfo));
+
+			/* This database is included.  Add to list */
+			if (opts.verbose)
+				pg_log_info("including database: \"%s\"", datname);
+
+			dat->datname = pstrdup(datname);
+			simple_ptr_list_append(databases, dat);
+		}
+	}
+	PQclear(res);
+
+	if (fatal)
+	{
+		if (conn != NULL)
+			disconnectDatabase(conn);
+		exit(1);
+	}
+}
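For reference, here is a sketch of the overall query this function assembles from the CTE-building helpers above; the pattern values shown are hypothetical, and the filtered_databases join on include_raw is omitted under --all:

```sql
WITH include_raw (pattern_id, rgx) AS (
    VALUES (0, 'foo')            -- database parts of -d/--database patterns
), exclude_raw (pattern_id, rgx) AS (
    VALUES (0, 'template0')      -- database-only exclusion patterns
),
database (datname) AS (
    SELECT d.datname
    FROM pg_catalog.pg_database d
    LEFT OUTER JOIN exclude_raw e ON d.datname ~ e.rgx
    WHERE d.datallowconn AND e.pattern_id IS NULL
),
include_pat (pattern_id, checkable) AS (
    SELECT i.pattern_id,
           COUNT(*) FILTER (WHERE d IS NOT NULL) AS checkable
    FROM include_raw i
    LEFT OUTER JOIN database d ON d.datname ~ i.rgx
    GROUP BY i.pattern_id
),
filtered_databases (datname) AS (
    SELECT DISTINCT d.datname
    FROM database d
    INNER JOIN include_raw i ON d.datname ~ i.rgx
)
SELECT pattern_id, datname FROM (
    SELECT pattern_id, NULL::TEXT AS datname
    FROM include_pat WHERE checkable = 0
    UNION ALL
    SELECT NULL, datname FROM filtered_databases
) AS combined_records
ORDER BY pattern_id NULLS LAST, datname;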
+
+/*
+ * append_rel_pattern_raw_cte
+ *
+ * Appends to the buffer the body of a Common Table Expression (CTE) containing
+ * the given patterns as six columns:
+ *
+ *     pattern_id: the index of this pattern in pia->data[]
+ *     db_regex: the database regexp parsed from the pattern, or NULL if the
+ *               pattern had no database part
+ *     nsp_regex: the namespace regexp parsed from the pattern, or NULL if the
+ *                pattern had no namespace part
+ *     rel_regex: the relname regexp parsed from the pattern, or NULL if the
+ *                pattern had no relname part
+ *     heap_only: true if the pattern applies only to heap tables (not indexes)
+ *     btree_only: true if the pattern applies only to btree indexes (not tables)
+ *
+ * buf: the buffer to be appended
+ * pia: the array of patterns to be inserted into the CTE
+ * conn: the database connection
+ */
+static void
+append_rel_pattern_raw_cte(PQExpBuffer buf, const PatternInfoArray *pia,
+						   PGconn *conn)
+{
+	int			pattern_id;
+	const char *comma;
+	bool		have_values;
+
+	comma = "";
+	have_values = false;
+	for (pattern_id = 0; pattern_id < pia->len; pattern_id++)
+	{
+		PatternInfo *info = &pia->data[pattern_id];
+
+		if (!have_values)
+			appendPQExpBufferStr(buf, "\nVALUES");
+		have_values = true;
+		appendPQExpBuffer(buf, "%s\n(%d::INTEGER, ", comma, pattern_id);
+		if (info->db_regex == NULL)
+			appendPQExpBufferStr(buf, "NULL");
+		else
+			appendStringLiteralConn(buf, info->db_regex, conn);
+		appendPQExpBufferStr(buf, "::TEXT, ");
+		if (info->nsp_regex == NULL)
+			appendPQExpBufferStr(buf, "NULL");
+		else
+			appendStringLiteralConn(buf, info->nsp_regex, conn);
+		appendPQExpBufferStr(buf, "::TEXT, ");
+		if (info->rel_regex == NULL)
+			appendPQExpBufferStr(buf, "NULL");
+		else
+			appendStringLiteralConn(buf, info->rel_regex, conn);
+		if (info->heap_only)
+			appendPQExpBufferStr(buf, "::TEXT, true::BOOLEAN");
+		else
+			appendPQExpBufferStr(buf, "::TEXT, false::BOOLEAN");
+		if (info->btree_only)
+			appendPQExpBufferStr(buf, ", true::BOOLEAN");
+		else
+			appendPQExpBufferStr(buf, ", false::BOOLEAN");
+		appendPQExpBufferStr(buf, ")");
+		comma = ",";
+	}
+
+	if (!have_values)
+		appendPQExpBufferStr(buf,
+							 "\nSELECT NULL::INTEGER, NULL::TEXT, NULL::TEXT, "
+							 "NULL::TEXT, NULL::BOOLEAN, NULL::BOOLEAN "
+							 "WHERE false");
+}
+
+/*
+ * append_rel_pattern_filtered_cte
+ *
+ * Appends to the buffer a Common Table Expression (CTE) which selects
+ * all patterns from the named raw CTE, filtered by database.  All patterns
+ * which have no database portion or whose database portion matches our
+ * connection's database name are selected, with other patterns excluded.
+ * Patterns consisting of only a database portion are likewise excluded, as
+ * they cannot match any relation.
+ *
+ * The basic idea here is that if we're connected to database "foo" and we have
+ * patterns "foo.bar.baz", "alpha.beta" and "one.two.three", we only want to
+ * use the first two while processing relations in this database, as the third
+ * one is not relevant.
+ *
+ * buf: the buffer to be appended
+ * raw: the name of the CTE to select from
+ * filtered: the name of the CTE to create
+ * conn: the database connection
+ */
+static void
+append_rel_pattern_filtered_cte(PQExpBuffer buf, const char *raw,
+								const char *filtered, PGconn *conn)
+{
+	appendPQExpBuffer(buf,
+					  "\n%s (pattern_id, nsp_regex, rel_regex, heap_only, btree_only) AS ("
+					  "\nSELECT pattern_id, nsp_regex, rel_regex, heap_only, btree_only "
+					  "FROM %s r"
+					  "\nWHERE (r.db_regex IS NULL "
+					  "OR ",
+					  filtered, raw);
+	appendStringLiteralConn(buf, PQdb(conn), conn);
+	appendPQExpBufferStr(buf, " ~ r.db_regex)");
+	appendPQExpBufferStr(buf,
+						 " AND (r.nsp_regex IS NOT NULL"
+						 " OR r.rel_regex IS NOT NULL)"
+						 "),");
+}
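Continuing the illustration, when connected to database `foo` (an example name, not from the patch) this helper appends a CTE of roughly this shape, mirroring the string literals above:

```sql
include_pat (pattern_id, nsp_regex, rel_regex, heap_only, btree_only) AS (
    SELECT pattern_id, nsp_regex, rel_regex, heap_only, btree_only
    FROM include_raw r
    WHERE (r.db_regex IS NULL OR 'foo' ~ r.db_regex)
      AND (r.nsp_regex IS NOT NULL OR r.rel_regex IS NOT NULL)
),
```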
+
+/*
+ * compile_relation_list_one_db
+ *
+ * Compiles a list of relations to check within the currently connected
+ * database based on the user supplied options, sorted by descending size,
+ * and appends them to the given list of relations.
+ *
+ * The cells of the constructed list contain all information about the relation
+ * necessary to connect to the database and check the object, including which
+ * database to connect to, where contrib/amcheck is installed, and the Oid and
+ * type of object (heap table vs. btree index).  Rather than duplicating the
+ * database details per relation, the relation structs use references to the
+ * same database object, provided by the caller.
+ *
+ * conn: connection to the database to be checked, which must match the one in 'dat'
+ * relations: list onto which the relations information should be appended
+ * dat: the database info struct for use by each relation
+ * pagecount: gets incremented by the number of blocks to check in all
+ * relations added
+ */
+static void
+compile_relation_list_one_db(PGconn *conn, SimplePtrList *relations,
+							 const DatabaseInfo *dat,
+							 uint64 *pagecount)
+{
+	PGresult   *res;
+	PQExpBufferData sql;
+	int			ntups;
+	int			i;
+
+	initPQExpBuffer(&sql);
+	appendPQExpBufferStr(&sql, "WITH");
+
+	/* Append CTEs for the relation inclusion patterns, if any */
+	if (!opts.allrel)
+	{
+		appendPQExpBufferStr(&sql,
+							 " include_raw (pattern_id, db_regex, nsp_regex, rel_regex, heap_only, btree_only) AS (");
+		append_rel_pattern_raw_cte(&sql, &opts.include, conn);
+		appendPQExpBufferStr(&sql, "\n),");
+		append_rel_pattern_filtered_cte(&sql, "include_raw", "include_pat", conn);
+	}
+
+	/* Append CTEs for the relation exclusion patterns, if any */
+	if (opts.excludetbl || opts.excludeidx || opts.excludensp)
+	{
+		appendPQExpBufferStr(&sql,
+							 " exclude_raw (pattern_id, db_regex, nsp_regex, rel_regex, heap_only, btree_only) AS (");
+		append_rel_pattern_raw_cte(&sql, &opts.exclude, conn);
+		appendPQExpBufferStr(&sql, "\n),");
+		append_rel_pattern_filtered_cte(&sql, "exclude_raw", "exclude_pat", conn);
+	}
+
+	/* Append the relation CTE. */
+	appendPQExpBufferStr(&sql,
+						 " relation (pattern_id, oid, nspname, relname, reltoastrelid, relpages, is_heap, is_btree) AS ("
+						 "\nSELECT DISTINCT ON (c.oid");
+	if (!opts.allrel)
+		appendPQExpBufferStr(&sql, ", ip.pattern_id) ip.pattern_id,");
+	else
+		appendPQExpBufferStr(&sql, ") NULL::INTEGER AS pattern_id,");
+	appendPQExpBuffer(&sql,
+					  "\nc.oid, n.nspname, c.relname, c.reltoastrelid, c.relpages, "
+					  "c.relam = %u AS is_heap, "
+					  "c.relam = %u AS is_btree"
+					  "\nFROM pg_catalog.pg_class c "
+					  "INNER JOIN pg_catalog.pg_namespace n "
+					  "ON c.relnamespace = n.oid",
+					  HEAP_TABLE_AM_OID, BTREE_AM_OID);
+	if (!opts.allrel)
+		appendPQExpBuffer(&sql,
+						  "\nINNER JOIN include_pat ip"
+						  "\nON (n.nspname ~ ip.nsp_regex OR ip.nsp_regex IS NULL)"
+						  "\nAND (c.relname ~ ip.rel_regex OR ip.rel_regex IS NULL)"
+						  "\nAND (c.relam = %u OR NOT ip.heap_only)"
+						  "\nAND (c.relam = %u OR NOT ip.btree_only)",
+						  HEAP_TABLE_AM_OID, BTREE_AM_OID);
+	if (opts.excludetbl || opts.excludeidx || opts.excludensp)
+		appendPQExpBuffer(&sql,
+						  "\nLEFT OUTER JOIN exclude_pat ep"
+						  "\nON (n.nspname ~ ep.nsp_regex OR ep.nsp_regex IS NULL)"
+						  "\nAND (c.relname ~ ep.rel_regex OR ep.rel_regex IS NULL)"
+						  "\nAND (c.relam = %u OR NOT ep.heap_only)"
+						  "\nAND (c.relam = %u OR NOT ep.btree_only)",
+						  HEAP_TABLE_AM_OID, BTREE_AM_OID);
+
+	if (opts.excludetbl || opts.excludeidx || opts.excludensp)
+		appendPQExpBufferStr(&sql, "\nWHERE ep.pattern_id IS NULL");
+	else
+		appendPQExpBufferStr(&sql, "\nWHERE true");
+
+	/*
+	 * We need to be careful not to break the --no-dependent-toast and
+	 * --no-dependent-indexes options.  By default, the btree indexes, toast
+	 * tables, and toast table btree indexes associated with primary heap
+	 * tables are included, using their own CTEs below.  We implement the
+	 * --no-dependent-* options by not creating those CTEs, but that's no use if
+	 * we've already selected the toast and indexes here.  On the other hand,
+	 * we want inclusion patterns that match indexes or toast tables to be
+	 * honored.  So, if inclusion patterns were given, we want to select all
+	 * tables, toast tables, or indexes that match the patterns.  But if no
+	 * inclusion patterns were given, and we're simply matching all relations,
+	 * then we only want to match the primary tables here.
+	 */
+	if (opts.allrel)
+		appendPQExpBuffer(&sql,
+						  " AND c.relam = %u "
+						  "AND c.relkind IN ('r', 'm', 't') "
+						  "AND c.relnamespace != %u",
+						  HEAP_TABLE_AM_OID, PG_TOAST_NAMESPACE);
+	else
+		appendPQExpBuffer(&sql,
+						  " AND c.relam IN (%u, %u) "
+						  "AND c.relkind IN ('r', 'm', 't', 'i') "
+						  "AND ((c.relam = %u AND c.relkind IN ('r', 'm', 't')) OR "
+						  "(c.relam = %u AND c.relkind = 'i'))",
+						  HEAP_TABLE_AM_OID, BTREE_AM_OID,
+						  HEAP_TABLE_AM_OID, BTREE_AM_OID);
+
+	appendPQExpBufferStr(&sql,
+						 "\nORDER BY c.oid)");
+
+	if (!opts.no_toast_expansion)
+	{
+		/*
+		 * Include a CTE for toast tables associated with primary heap tables
+		 * selected above, filtering by exclusion patterns (if any) that match
+		 * toast table names.
+		 */
+		appendPQExpBufferStr(&sql,
+							 ", toast (oid, nspname, relname, relpages) AS ("
+							 "\nSELECT t.oid, 'pg_toast', t.relname, t.relpages"
+							 "\nFROM pg_catalog.pg_class t "
+							 "INNER JOIN relation r "
+							 "ON r.reltoastrelid = t.oid");
+		if (opts.excludetbl || opts.excludensp)
+			appendPQExpBufferStr(&sql,
+								 "\nLEFT OUTER JOIN exclude_pat ep"
+								 "\nON ('pg_toast' ~ ep.nsp_regex OR ep.nsp_regex IS NULL)"
+								 "\nAND (t.relname ~ ep.rel_regex OR ep.rel_regex IS NULL)"
+								 "\nAND ep.heap_only"
+								 "\nWHERE ep.pattern_id IS NULL");
+		appendPQExpBufferStr(&sql,
+							 "\n)");
+	}
+	if (!opts.no_btree_expansion)
+	{
+		/*
+		 * Include a CTE for btree indexes associated with primary heap tables
+		 * selected above, filtering by exclusion patterns (if any) that match
+		 * btree index names.
+		 */
+		appendPQExpBuffer(&sql,
+						  ", index (oid, nspname, relname, relpages) AS ("
+						  "\nSELECT c.oid, r.nspname, c.relname, c.relpages "
+						  "FROM relation r"
+						  "\nINNER JOIN pg_catalog.pg_index i "
+						  "ON r.oid = i.indrelid "
+						  "INNER JOIN pg_catalog.pg_class c "
+						  "ON i.indexrelid = c.oid");
+		if (opts.excludeidx || opts.excludensp)
+			appendPQExpBufferStr(&sql,
+								 "\nINNER JOIN pg_catalog.pg_namespace n "
+								 "ON c.relnamespace = n.oid"
+								 "\nLEFT OUTER JOIN exclude_pat ep "
+								 "ON (n.nspname ~ ep.nsp_regex OR ep.nsp_regex IS NULL) "
+								 "AND (c.relname ~ ep.rel_regex OR ep.rel_regex IS NULL) "
+								 "AND ep.btree_only"
+								 "\nWHERE ep.pattern_id IS NULL");
+		else
+			appendPQExpBufferStr(&sql,
+								 "\nWHERE true");
+		appendPQExpBuffer(&sql,
+						  " AND c.relam = %u "
+						  "AND c.relkind = 'i'",
+						  BTREE_AM_OID);
+		if (opts.no_toast_expansion)
+			appendPQExpBuffer(&sql,
+							  " AND c.relnamespace != %u",
+							  PG_TOAST_NAMESPACE);
+		appendPQExpBufferStr(&sql, "\n)");
+	}
+
+	if (!opts.no_toast_expansion && !opts.no_btree_expansion)
+	{
+		/*
+		 * Include a CTE for btree indexes associated with toast tables of
+		 * primary heap tables selected above, filtering by exclusion patterns
+		 * (if any) that match the toast index names.
+		 */
+		appendPQExpBuffer(&sql,
+						  ", toast_index (oid, nspname, relname, relpages) AS ("
+						  "\nSELECT c.oid, 'pg_toast', c.relname, c.relpages "
+						  "FROM toast t "
+						  "INNER JOIN pg_catalog.pg_index i "
+						  "ON t.oid = i.indrelid"
+						  "\nINNER JOIN pg_catalog.pg_class c "
+						  "ON i.indexrelid = c.oid");
+		if (opts.excludeidx)
+			appendPQExpBufferStr(&sql,
+								 "\nLEFT OUTER JOIN exclude_pat ep "
+								 "ON ('pg_toast' ~ ep.nsp_regex OR ep.nsp_regex IS NULL) "
+								 "AND (c.relname ~ ep.rel_regex OR ep.rel_regex IS NULL) "
+								 "AND ep.btree_only "
+								 "WHERE ep.pattern_id IS NULL");
+		else
+			appendPQExpBufferStr(&sql,
+								 "\nWHERE true");
+		appendPQExpBuffer(&sql,
+						  " AND c.relam = %u"
+						  " AND c.relkind = 'i')",
+						  BTREE_AM_OID);
+	}
+
+	/*
+	 * Roll-up distinct rows from CTEs.
+	 *
+	 * Relations that match more than one pattern may occur more than once in
+	 * the list, and indexes and toast for primary relations may also have
+	 * matched in their own right, so we rely on UNION to deduplicate the
+	 * list.
+	 */
+	appendPQExpBuffer(&sql,
+					  "\nSELECT pattern_id, is_heap, is_btree, oid, nspname, relname, relpages "
+					  "FROM (");
+	appendPQExpBufferStr(&sql,
+	/* Inclusion patterns that matched at least one relation */
+						 "\nSELECT pattern_id, is_heap, is_btree, "
+						 "NULL::OID AS oid, "
+						 "NULL::TEXT AS nspname, "
+						 "NULL::TEXT AS relname, "
+						 "NULL::INTEGER AS relpages"
+						 "\nFROM relation "
+						 "WHERE pattern_id IS NOT NULL "
+						 "UNION"
+	/* Primary relations */
+						 "\nSELECT NULL::INTEGER AS pattern_id, "
+						 "is_heap, is_btree, oid, nspname, relname, relpages "
+						 "FROM relation");
+	if (!opts.no_toast_expansion)
+		appendPQExpBufferStr(&sql,
+							 " UNION"
+		/* Toast tables for primary relations */
+							 "\nSELECT NULL::INTEGER AS pattern_id, TRUE AS is_heap, "
+							 "FALSE AS is_btree, oid, nspname, relname, relpages "
+							 "FROM toast");
+	if (!opts.no_btree_expansion)
+		appendPQExpBufferStr(&sql,
+							 " UNION"
+		/* Indexes for primary relations */
+							 "\nSELECT NULL::INTEGER AS pattern_id, FALSE AS is_heap, "
+							 "TRUE AS is_btree, oid, nspname, relname, relpages "
+							 "FROM index");
+	if (!opts.no_toast_expansion && !opts.no_btree_expansion)
+		appendPQExpBufferStr(&sql,
+							 " UNION"
+		/* Indexes for toast relations */
+							 "\nSELECT NULL::INTEGER AS pattern_id, FALSE AS is_heap, "
+							 "TRUE AS is_btree, oid, nspname, relname, relpages "
+							 "FROM toast_index");
+	appendPQExpBufferStr(&sql,
+						 "\n) AS combined_records "
+						 "ORDER BY relpages DESC NULLS FIRST, oid");
+
+	res = executeQuery(conn, sql.data, opts.echo);
+	if (PQresultStatus(res) != PGRES_TUPLES_OK)
+	{
+		pg_log_error("query failed: %s", PQerrorMessage(conn));
+		pg_log_info("query was: %s", sql.data);
+		disconnectDatabase(conn);
+		exit(1);
+	}
+	termPQExpBuffer(&sql);
+
+	ntups = PQntuples(res);
+	for (i = 0; i < ntups; i++)
+	{
+		int			pattern_id = -1;
+		bool		is_heap = false;
+		bool		is_btree = false;
+		Oid			oid = InvalidOid;
+		const char *nspname = NULL;
+		const char *relname = NULL;
+		int			relpages = 0;
+
+		if (!PQgetisnull(res, i, 0))
+			pattern_id = atoi(PQgetvalue(res, i, 0));
+		if (!PQgetisnull(res, i, 1))
+			is_heap = (PQgetvalue(res, i, 1)[0] == 't');
+		if (!PQgetisnull(res, i, 2))
+			is_btree = (PQgetvalue(res, i, 2)[0] == 't');
+		if (!PQgetisnull(res, i, 3))
+			oid = atooid(PQgetvalue(res, i, 3));
+		if (!PQgetisnull(res, i, 4))
+			nspname = PQgetvalue(res, i, 4);
+		if (!PQgetisnull(res, i, 5))
+			relname = PQgetvalue(res, i, 5);
+		if (!PQgetisnull(res, i, 6))
+			relpages = atoi(PQgetvalue(res, i, 6));
+
+		if (pattern_id >= 0)
+		{
+			/*
+			 * Current record pertains to an inclusion pattern.  Record that
+			 * it matched.
+			 */
+
+			if (pattern_id >= opts.include.len)
+			{
+				pg_log_error("internal error: received unexpected relation pattern_id %d",
+							 pattern_id);
+				exit(1);
+			}
+
+			opts.include.data[pattern_id].matched = true;
+		}
+		else
+		{
+			/* Current record pertains to a relation */
+
+			RelationInfo *rel = (RelationInfo *) pg_malloc0(sizeof(RelationInfo));
+
+			Assert(OidIsValid(oid));
+			Assert((is_heap && !is_btree) || (is_btree && !is_heap));
+
+			rel->datinfo = dat;
+			rel->reloid = oid;
+			rel->is_heap = is_heap;
+			rel->nspname = pstrdup(nspname);
+			rel->relname = pstrdup(relname);
+			rel->relpages = relpages;
+			rel->blocks_to_check = relpages;
+			if (is_heap && (opts.startblock >= 0 || opts.endblock >= 0))
+			{
+				/*
+				 * We apply --startblock and --endblock to heap tables, but
+				 * not btree indexes, and for progress purposes we need to
+				 * track how many blocks we expect to check.
+				 */
+				if (opts.endblock >= 0 && rel->blocks_to_check > opts.endblock)
+					rel->blocks_to_check = opts.endblock + 1;
+				if (opts.startblock >= 0)
+				{
+					if (rel->blocks_to_check > opts.startblock)
+						rel->blocks_to_check -= opts.startblock;
+					else
+						rel->blocks_to_check = 0;
+				}
+			}
+			*pagecount += rel->blocks_to_check;
+
+			simple_ptr_list_append(relations, rel);
+		}
+	}
+	PQclear(res);
+}
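For orientation, the query assembled by compile_relation_list_one_db() has roughly the following skeleton (heavily simplified; the include/exclude CTEs and the toast, index, and toast_index CTEs appear only when the corresponding options call for them):

```sql
WITH include_raw (...) AS (...), include_pat (...) AS (...),
     exclude_raw (...) AS (...), exclude_pat (...) AS (...),
     relation (pattern_id, oid, nspname, relname, reltoastrelid,
               relpages, is_heap, is_btree) AS (...),
     toast (oid, nspname, relname, relpages) AS (...),
     index (oid, nspname, relname, relpages) AS (...),
     toast_index (oid, nspname, relname, relpages) AS (...)
SELECT pattern_id, is_heap, is_btree, oid, nspname, relname, relpages
FROM (
    SELECT ... FROM relation WHERE pattern_id IS NOT NULL  -- matched patterns
    UNION SELECT ... FROM relation                         -- primary relations
    UNION SELECT ... FROM toast
    UNION SELECT ... FROM index
    UNION SELECT ... FROM toast_index
) AS combined_records
ORDER BY relpages DESC NULLS FIRST, oid;
```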
diff --git a/contrib/pg_amcheck/t/001_basic.pl b/contrib/pg_amcheck/t/001_basic.pl
new file mode 100644
index 0000000000..dfa0ae9e06
--- /dev/null
+++ b/contrib/pg_amcheck/t/001_basic.pl
@@ -0,0 +1,9 @@
+use strict;
+use warnings;
+
+use TestLib;
+use Test::More tests => 8;
+
+program_help_ok('pg_amcheck');
+program_version_ok('pg_amcheck');
+program_options_handling_ok('pg_amcheck');
diff --git a/contrib/pg_amcheck/t/002_nonesuch.pl b/contrib/pg_amcheck/t/002_nonesuch.pl
new file mode 100644
index 0000000000..b1adf965a8
--- /dev/null
+++ b/contrib/pg_amcheck/t/002_nonesuch.pl
@@ -0,0 +1,248 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 76;
+
+# Test set-up
+my ($node, $port);
+$node = get_new_node('test');
+$node->init;
+$node->start;
+$port = $node->port;
+
+# Load the amcheck extension, upon which pg_amcheck depends
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+
+#########################################
+# Test non-existent databases
+
+# Failing to connect to the initial database is an error.
+$node->command_checks_all(
+	[ 'pg_amcheck', 'qqq' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/FATAL:  database "qqq" does not exist/ ],
+	'checking a non-existent database');
+
+# Failing to resolve a database pattern is an error by default.
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', '-d', 'qqq' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no connectable databases to check matching "qqq"/ ],
+	'checking an unresolvable database pattern');
+
+# But only a warning under --no-strict-names
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', '--no-strict-names', '-d', 'qqq' ],
+	0,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: no connectable databases to check matching "qqq"/ ],
+	'checking an unresolvable database pattern under --no-strict-names');
+
+# Check that a substring of an existent database name does not get interpreted
+# as a matching pattern.
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', '-d', 'post' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no connectable databases to check matching "post"/ ],
+	'checking an unresolvable database pattern (substring of existent database)');
+
+# Check that a superstring of an existent database name does not get interpreted
+# as a matching pattern.
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', '-d', 'postgresql' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no connectable databases to check matching "postgresql"/ ],
+	'checking an unresolvable database pattern (superstring of existent database)');
+
+#########################################
+# Test connecting with a non-existent user
+
+# Failing to connect to the initial database due to bad username is an error.
+$node->command_checks_all(
+	[ 'pg_amcheck', '-U', 'no_such_user', 'postgres' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/role "no_such_user" does not exist/ ],
+	'checking with a non-existent user');
+
+# Failing to connect to the initial database due to bad username is still an
+# error under --no-strict-names.
+$node->command_checks_all(
+	[ 'pg_amcheck', '--no-strict-names', '-U', 'no_such_user', 'postgres' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/role "no_such_user" does not exist/ ],
+	'checking with a non-existent user under --no-strict-names');
+
+#########################################
+# Test checking databases without amcheck installed
+
+# Attempting to check a database by name where amcheck is not installed should
+# raise a warning.  If all databases are skipped, having no relations to check
+# raises an error.
+$node->command_checks_all(
+	[ 'pg_amcheck', 'template1' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/,
+	  qr/pg_amcheck: error: no relations to check/ ],
+	'checking a database by name without amcheck installed, no other databases');
+
+# Again, but this time with another database to check, so no error is raised.
+$node->command_checks_all(
+	[ 'pg_amcheck', 'template1', '-d', 'postgres' ],
+	0,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/ ],
+	'checking a database by name without amcheck installed, with other databases');
+
+# Again, but by way of checking all databases
+$node->command_checks_all(
+	[ 'pg_amcheck', '--all' ],
+	0,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/ ],
+	'checking a database by pattern without amcheck installed, with other databases');
+
+#########################################
+# Test unreasonable patterns
+
+# Check three-part unreasonable pattern that has zero-length names
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', '-t', '..' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no connectable databases to check matching "\.\."/ ],
+	'checking table pattern ".."');
+
+# Again, but with non-trivial schema and relation parts
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', '-t', '.foo.bar' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no connectable databases to check matching "\.foo\.bar"/ ],
+	'checking table pattern ".foo.bar"');
+
+# Check two-part unreasonable pattern that has zero-length names
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', '-t', '.' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no heap tables to check matching "\."/ ],
+	'checking table pattern "."');
+
+#########################################
+# Test checking non-existent databases, schemas, tables, and indexes
+
+# Use --no-strict-names and a single existent table so we only get warnings
+# about the failed pattern matches
+$node->command_checks_all(
+	[ 'pg_amcheck', '--no-strict-names',
+		'-t', 'no_such_table',
+		'-t', 'no*such*table',
+		'-i', 'no_such_index',
+		'-i', 'no*such*index',
+		'-r', 'no_such_relation',
+		'-r', 'no*such*relation',
+		'-d', 'no_such_database',
+		'-d', 'no*such*database',
+		'-r', 'none.none',
+		'-r', 'none.none.none',
+		'-r', 'this.is.a.really.long.dotted.string',
+		'-r', 'postgres.none.none',
+		'-r', 'postgres.long.dotted.string',
+		'-r', 'postgres.pg_catalog.none',
+		'-r', 'postgres.none.pg_class',
+		'-t', 'postgres.pg_catalog.pg_class',	# This exists
+	],
+	0,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: no heap tables to check matching "no_such_table"/,
+	  qr/pg_amcheck: warning: no heap tables to check matching "no\*such\*table"/,
+	  qr/pg_amcheck: warning: no btree indexes to check matching "no_such_index"/,
+	  qr/pg_amcheck: warning: no btree indexes to check matching "no\*such\*index"/,
+	  qr/pg_amcheck: warning: no relations to check matching "no_such_relation"/,
+	  qr/pg_amcheck: warning: no relations to check matching "no\*such\*relation"/,
+	  qr/pg_amcheck: warning: no heap tables to check matching "no\*such\*table"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "no_such_database"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "no\*such\*database"/,
+	  qr/pg_amcheck: warning: no relations to check matching "none\.none"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "none\.none\.none"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "this\.is\.a\.really\.long\.dotted\.string"/,
+	  qr/pg_amcheck: warning: no relations to check matching "postgres\.none\.none"/,
+	  qr/pg_amcheck: warning: no relations to check matching "postgres\.long\.dotted\.string"/,
+	  qr/pg_amcheck: warning: no relations to check matching "postgres\.pg_catalog\.none"/,
+	  qr/pg_amcheck: warning: no relations to check matching "postgres\.none\.pg_class"/,
+	],
+	'many unmatched patterns and one matched pattern under --no-strict-names');
+
+#########################################
+# Test checking otherwise existent objects but in databases where they do not exist
+
+$node->safe_psql('postgres', q(
+	CREATE TABLE public.foo (f integer);
+	CREATE INDEX foo_idx ON foo(f);
+));
+$node->safe_psql('postgres', q(CREATE DATABASE another_db));
+
+$node->command_checks_all(
+	[ 'pg_amcheck', 'postgres', '--no-strict-names',
+		'-t', 'template1.public.foo',
+		'-t', 'another_db.public.foo',
+		'-t', 'no_such_database.public.foo',
+		'-i', 'template1.public.foo_idx',
+		'-i', 'another_db.public.foo_idx',
+		'-i', 'no_such_database.public.foo_idx',
+	],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/,
+	  qr/pg_amcheck: warning: no heap tables to check matching "template1\.public\.foo"/,
+	  qr/pg_amcheck: warning: no heap tables to check matching "another_db\.public\.foo"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "no_such_database\.public\.foo"/,
+	  qr/pg_amcheck: warning: no btree indexes to check matching "template1\.public\.foo_idx"/,
+	  qr/pg_amcheck: warning: no btree indexes to check matching "another_db\.public\.foo_idx"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "no_such_database\.public\.foo_idx"/,
+	  qr/pg_amcheck: error: no relations to check/,
+	],
+	'checking otherwise existent objects in the wrong databases');
+
+
+#########################################
+# Test schema exclusion patterns
+
+# Check with only schema exclusion patterns
+$node->command_checks_all(
+	[ 'pg_amcheck', '--all', '--no-strict-names',
+		'-S', 'public',
+		'-S', 'pg_catalog',
+		'-S', 'pg_toast',
+		'-S', 'information_schema',
+	],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/,
+	  qr/pg_amcheck: error: no relations to check/ ],
+	'schema exclusion patterns exclude all relations');
+
+# Check with schema exclusion patterns overriding relation and schema inclusion patterns
+$node->command_checks_all(
+	[ 'pg_amcheck', '--all', '--no-strict-names',
+		'-s', 'public',
+		'-s', 'pg_catalog',
+		'-s', 'pg_toast',
+		'-s', 'information_schema',
+		'-t', 'pg_catalog.pg_class',
+		'-S', '*'
+	],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/,
+	  qr/pg_amcheck: error: no relations to check/ ],
+	'schema exclusion pattern overrides all inclusion patterns');
diff --git a/contrib/pg_amcheck/t/003_check.pl b/contrib/pg_amcheck/t/003_check.pl
new file mode 100644
index 0000000000..54889f260d
--- /dev/null
+++ b/contrib/pg_amcheck/t/003_check.pl
@@ -0,0 +1,497 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 57;
+
+my ($node, $port, %corrupt_page, %remove_relation);
+
+# Returns the filesystem path for the named relation.
+#
+# Assumes the test node is running
+sub relation_filepath($$)
+{
+	my ($dbname, $relname) = @_;
+
+	my $pgdata = $node->data_dir;
+	my $rel = $node->safe_psql($dbname,
+							   qq(SELECT pg_relation_filepath('$relname')));
+	die "path not found for relation $relname" unless defined $rel;
+	return "$pgdata/$rel";
+}
+
+# Returns the name of the toast relation associated with the named relation.
+#
+# Assumes the test node is running
+sub relation_toast($$)
+{
+	my ($dbname, $relname) = @_;
+
+	my $rel = $node->safe_psql($dbname, qq(
+		SELECT c.reltoastrelid::regclass
+			FROM pg_catalog.pg_class c
+			WHERE c.oid = '$relname'::regclass
+			  AND c.reltoastrelid != 0
+			));
+	return undef unless defined $rel;
+	return $rel;
+}
+
+# Adds the relation file for the given (dbname, relname) to the list
+# to be corrupted by means of overwriting junk in the first page.
+#
+# Assumes the test node is running.
+sub plan_to_corrupt_first_page($$)
+{
+	my ($dbname, $relname) = @_;
+	my $relpath = relation_filepath($dbname, $relname);
+	$corrupt_page{$relpath} = 1;
+}
+
+# Adds the relation file for the given (dbname, relname) to the list
+# to be corrupted by means of removing the file.
+#
+# Assumes the test node is running
+sub plan_to_remove_relation_file($$)
+{
+	my ($dbname, $relname) = @_;
+	my $relpath = relation_filepath($dbname, $relname);
+	$remove_relation{$relpath} = 1;
+}
+
+# For the given (dbname, relname), if a corresponding toast table
+# exists, adds that toast table's relation file to the list to be
+# corrupted by means of removing the file.
+#
+# Assumes the test node is running.
+sub plan_to_remove_toast_file($$)
+{
+	my ($dbname, $relname) = @_;
+	my $toastname = relation_toast($dbname, $relname);
+	plan_to_remove_relation_file($dbname, $toastname) if ($toastname);
+}
+
+# Corrupts the first page of the given file path
+sub corrupt_first_page($)
+{
+	my ($relpath) = @_;
+
+	my $fh;
+	open($fh, '+<', $relpath)
+		or BAIL_OUT("open failed: $!");
+	binmode $fh;
+
+	# Corrupt some line pointers.  The values are chosen to hit the
+	# various line-pointer-corruption checks in verify_heapam.c
+	# on both little-endian and big-endian architectures.
+	seek($fh, 32, 0)
+		or BAIL_OUT("seek failed: $!");
+	syswrite(
+		$fh,
+		pack("L*",
+			0xAAA15550, 0xAAA0D550, 0x00010000,
+			0x00008000, 0x0000800F, 0x001e8000,
+			0xFFFFFFFF)
+	) or BAIL_OUT("syswrite failed: $!");
+	close($fh)
+		or BAIL_OUT("close failed: $!");
+}
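
As an aid when adjusting the constants above, the 32-bit words can be decoded against PostgreSQL's ItemIdData bitfield layout (lp_off:15, lp_flags:2, lp_len:15, per src/include/storage/itemid.h). This is only an illustrative sketch assuming a little-endian build; the helper name is made up here, not part of the test:

```python
# Illustrative decoder for the corrupt line-pointer words written by
# corrupt_first_page(), assuming a little-endian build and PostgreSQL's
# ItemIdData bitfields: lp_off:15, lp_flags:2, lp_len:15.

def decode_item_id(word):
    """Split a 32-bit ItemIdData word into (lp_off, lp_flags, lp_len)."""
    lp_off = word & 0x7FFF           # low 15 bits: offset into the page
    lp_flags = (word >> 15) & 0x3    # 2 flag bits (unused/normal/redirect/dead)
    lp_len = (word >> 17) & 0x7FFF   # high 15 bits: tuple length
    return lp_off, lp_flags, lp_len

for word in (0xAAA15550, 0xAAA0D550, 0x00010000,
             0x00008000, 0x0000800F, 0x001E8000,
             0xFFFFFFFF):
    off, flags, length = decode_item_id(word)
    print(f"0x{word:08X}: lp_off={off} lp_flags={flags} lp_len={length}")
```

Decoding shows why these particular values were chosen: they produce out-of-range offsets, impossible flag/length combinations, and all-ones fields that trip the various line-pointer checks in verify_heapam.c.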
+
+# Stops the node, performs all the corruptions previously planned, and
+# starts the node again.
+#
+sub perform_all_corruptions()
+{
+	$node->stop();
+	for my $relpath (keys %corrupt_page)
+	{
+		corrupt_first_page($relpath);
+	}
+	for my $relpath (keys %remove_relation)
+	{
+		unlink($relpath);
+	}
+	$node->start;
+}
+
+# Test set-up
+$node = get_new_node('test');
+$node->init;
+$node->start;
+$port = $node->port;
+
+for my $dbname (qw(db1 db2 db3))
+{
+	# Create the database
+	$node->safe_psql('postgres', qq(CREATE DATABASE $dbname));
+
+	# Load the amcheck extension, upon which pg_amcheck depends.  Put the
+	# extension in an unexpected location to test that pg_amcheck finds it
+	# correctly.  Create tables with names that look like pg_catalog names to
+	# check that pg_amcheck does not get confused by them.  Create functions in
+	# schema public that look like amcheck functions to check that pg_amcheck
+	# does not use them.
+	$node->safe_psql($dbname, q(
+		CREATE SCHEMA amcheck_schema;
+		CREATE EXTENSION amcheck WITH SCHEMA amcheck_schema;
+		CREATE TABLE amcheck_schema.pg_database (junk text);
+		CREATE TABLE amcheck_schema.pg_namespace (junk text);
+		CREATE TABLE amcheck_schema.pg_class (junk text);
+		CREATE TABLE amcheck_schema.pg_operator (junk text);
+		CREATE TABLE amcheck_schema.pg_proc (junk text);
+		CREATE TABLE amcheck_schema.pg_tablespace (junk text);
+
+		CREATE FUNCTION public.bt_index_check(index regclass,
+											  heapallindexed boolean default false)
+		RETURNS VOID AS $$
+		BEGIN
+			RAISE EXCEPTION 'Invoked wrong bt_index_check!';
+		END;
+		$$ LANGUAGE plpgsql;
+
+		CREATE FUNCTION public.bt_index_parent_check(index regclass,
+													 heapallindexed boolean default false,
+													 rootdescend boolean default false)
+		RETURNS VOID AS $$
+		BEGIN
+			RAISE EXCEPTION 'Invoked wrong bt_index_parent_check!';
+		END;
+		$$ LANGUAGE plpgsql;
+
+		CREATE FUNCTION public.verify_heapam(relation regclass,
+											 on_error_stop boolean default false,
+											 check_toast boolean default false,
+											 skip text default 'none',
+											 startblock bigint default null,
+											 endblock bigint default null,
+											 blkno OUT bigint,
+											 offnum OUT integer,
+											 attnum OUT integer,
+											 msg OUT text)
+		RETURNS SETOF record AS $$
+		BEGIN
+			RAISE EXCEPTION 'Invoked wrong verify_heapam!';
+		END;
+		$$ LANGUAGE plpgsql;
+	));
+
+	# Create tables and indexes in five separate schemas.  The
+	# schemas are identical to start with, but we will corrupt
+	# them differently later.
+	#
+	for my $schema (qw(s1 s2 s3 s4 s5))
+	{
+		$node->safe_psql($dbname, qq(
+			CREATE SCHEMA $schema;
+			CREATE SEQUENCE $schema.seq1;
+			CREATE SEQUENCE $schema.seq2;
+			CREATE TABLE $schema.t1 (
+				i INTEGER,
+				b BOX,
+				ia int4[],
+				ir int4range,
+				t TEXT
+			);
+			CREATE TABLE $schema.t2 (
+				i INTEGER,
+				b BOX,
+				ia int4[],
+				ir int4range,
+				t TEXT
+			);
+			CREATE VIEW $schema.t2_view AS (
+				SELECT i*2, t FROM $schema.t2
+			);
+			ALTER TABLE $schema.t2
+				ALTER COLUMN t
+				SET STORAGE EXTERNAL;
+
+			INSERT INTO $schema.t1 (i, b, ia, ir, t)
+				(SELECT gs::INTEGER AS i,
+						box(point(gs,gs+5),point(gs*2,gs*3)) AS b,
+						array[gs, gs + 1]::int4[] AS ia,
+						int4range(gs, gs+100) AS ir,
+						repeat('foo', gs) AS t
+					 FROM generate_series(1,10000,3000) AS gs);
+
+			INSERT INTO $schema.t2 (i, b, ia, ir, t)
+				(SELECT gs::INTEGER AS i,
+						box(point(gs,gs+5),point(gs*2,gs*3)) AS b,
+						array[gs, gs + 1]::int4[] AS ia,
+						int4range(gs, gs+100) AS ir,
+						repeat('foo', gs) AS t
+					 FROM generate_series(1,10000,3000) AS gs);
+
+			CREATE MATERIALIZED VIEW $schema.t1_mv AS SELECT * FROM $schema.t1;
+			CREATE MATERIALIZED VIEW $schema.t2_mv AS SELECT * FROM $schema.t2;
+
+			create table $schema.p1 (a int, b int) PARTITION BY list (a);
+			create table $schema.p2 (a int, b int) PARTITION BY list (a);
+
+			create table $schema.p1_1 partition of $schema.p1 for values in (1, 2, 3);
+			create table $schema.p1_2 partition of $schema.p1 for values in (4, 5, 6);
+			create table $schema.p2_1 partition of $schema.p2 for values in (1, 2, 3);
+			create table $schema.p2_2 partition of $schema.p2 for values in (4, 5, 6);
+
+			CREATE INDEX t1_btree ON $schema.t1 USING BTREE (i);
+			CREATE INDEX t2_btree ON $schema.t2 USING BTREE (i);
+
+			CREATE INDEX t1_hash ON $schema.t1 USING HASH (i);
+			CREATE INDEX t2_hash ON $schema.t2 USING HASH (i);
+
+			CREATE INDEX t1_brin ON $schema.t1 USING BRIN (i);
+			CREATE INDEX t2_brin ON $schema.t2 USING BRIN (i);
+
+			CREATE INDEX t1_gist ON $schema.t1 USING GIST (b);
+			CREATE INDEX t2_gist ON $schema.t2 USING GIST (b);
+
+			CREATE INDEX t1_gin ON $schema.t1 USING GIN (ia);
+			CREATE INDEX t2_gin ON $schema.t2 USING GIN (ia);
+
+			CREATE INDEX t1_spgist ON $schema.t1 USING SPGIST (ir);
+			CREATE INDEX t2_spgist ON $schema.t2 USING SPGIST (ir);
+		));
+	}
+}
+
+# Database 'db1' corruptions
+#
+
+# Corrupt indexes in schema "s1"
+plan_to_remove_relation_file('db1', 's1.t1_btree');
+plan_to_corrupt_first_page('db1', 's1.t2_btree');
+
+# Corrupt tables in schema "s2"
+plan_to_remove_relation_file('db1', 's2.t1');
+plan_to_corrupt_first_page('db1', 's2.t2');
+
+# Corrupt tables, partitions, matviews, and btrees in schema "s3"
+plan_to_remove_relation_file('db1', 's3.t1');
+plan_to_corrupt_first_page('db1', 's3.t2');
+
+plan_to_remove_relation_file('db1', 's3.t1_mv');
+plan_to_remove_relation_file('db1', 's3.p1_1');
+
+plan_to_corrupt_first_page('db1', 's3.t2_mv');
+plan_to_corrupt_first_page('db1', 's3.p2_1');
+
+plan_to_remove_relation_file('db1', 's3.t1_btree');
+plan_to_corrupt_first_page('db1', 's3.t2_btree');
+
+# Corrupt the toast table for table t2 in schema "s4"
+plan_to_remove_toast_file('db1', 's4.t2');
+
+# Corrupt all other object types in schema "s5".  We don't have amcheck support
+# for these types, but we check that their corruption does not trigger any
+# errors in pg_amcheck.
+plan_to_remove_relation_file('db1', 's5.seq1');
+plan_to_remove_relation_file('db1', 's5.t1_hash');
+plan_to_remove_relation_file('db1', 's5.t1_gist');
+plan_to_remove_relation_file('db1', 's5.t1_gin');
+plan_to_remove_relation_file('db1', 's5.t1_brin');
+plan_to_remove_relation_file('db1', 's5.t1_spgist');
+
+plan_to_corrupt_first_page('db1', 's5.seq2');
+plan_to_corrupt_first_page('db1', 's5.t2_hash');
+plan_to_corrupt_first_page('db1', 's5.t2_gist');
+plan_to_corrupt_first_page('db1', 's5.t2_gin');
+plan_to_corrupt_first_page('db1', 's5.t2_brin');
+plan_to_corrupt_first_page('db1', 's5.t2_spgist');
+
+
+# Database 'db2' corruptions
+#
+plan_to_remove_relation_file('db2', 's1.t1');
+plan_to_remove_relation_file('db2', 's1.t1_btree');
+
+
+# Leave 'db3' uncorrupted
+#
+
+# Perform the corruptions we planned above using only a single database restart.
+#
+perform_all_corruptions();
+
+
+# Standard first arguments to TestLib functions
+my @cmd = ('pg_amcheck', '--quiet', '-p', $port);
+
+# Regular expressions to match various expected output
+my $no_output_re = qr/^$/;
+my $line_pointer_corruption_re = qr/line pointer/;
+my $missing_file_re = qr/could not open file ".*": No such file or directory/;
+my $index_missing_relation_fork_re = qr/index ".*" lacks a main relation fork/;
+
+# When checking databases with amcheck installed and corrupt relations, the
+# pg_amcheck command should return exit status = 2, because tables and indexes
+# are corrupt, not exit status = 1, which would mean the pg_amcheck command
+# itself failed.  Corruption messages should go to stdout, and nothing to
+# stderr.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1' ],
+	2,
+	[ $index_missing_relation_fork_re,
+	  $line_pointer_corruption_re,
+	  $missing_file_re,
+	],
+	[ $no_output_re ],
+	'pg_amcheck all schemas, tables and indexes in database db1');
+
+$node->command_checks_all(
+	[ @cmd, 'db1', '-d', 'db2', '-d', 'db3' ],
+	2,
+	[ $index_missing_relation_fork_re,
+	  $line_pointer_corruption_re,
+	  $missing_file_re,
+	],
+	[ $no_output_re ],
+	'pg_amcheck all schemas, tables and indexes in databases db1, db2, and db3');
+
+# Scans of indexes in s1 should detect the specific corruption that we created
+# above.  For missing relation forks, we know what the error message looks
+# like.  For corrupted index pages, the error might vary depending on how the
+# page was formatted on disk, including variations due to alignment differences
+# between platforms, so we accept any non-empty error message.
+#
+# If we don't limit the check to databases with amcheck installed, we expect a
+# complaint on stderr, but otherwise stderr should be quiet.
+#
+$node->command_checks_all(
+	[ @cmd, '--all', '-s', 's1', '-i', 't1_btree' ],
+	2,
+	[ $index_missing_relation_fork_re ],
+	[ qr/pg_amcheck: warning: skipping database "postgres": amcheck is not installed/ ],
+	'pg_amcheck index s1.t1_btree reports missing main relation fork');
+
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's1', '-i', 't2_btree' ],
+	2,
+	[ qr/.+/ ],			# Any non-empty error message is acceptable
+	[ $no_output_re ],
+	'pg_amcheck index s1.t2_btree reports index corruption');
+
+# Checking db1.s1 with indexes excluded should show no corruptions because we
+# did not corrupt any tables in db1.s1.  Verify that both stdout and stderr
+# are quiet.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-t', 's1.*', '--no-dependent-indexes' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck of db1.s1 excluding indexes');
+
+# Checking db2.s1 should show table corruptions if indexes are excluded
+#
+$node->command_checks_all(
+	[ @cmd, 'db2', '-t', 's1.*', '--no-dependent-indexes' ],
+	2,
+	[ $missing_file_re ],
+	[ $no_output_re ],
+	'pg_amcheck of db2.s1 excluding indexes');
+
+# In schema db1.s3, the tables and indexes are both corrupt.  We should see
+# corruption messages on stdout, and nothing on stderr.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's3' ],
+	2,
+	[ $index_missing_relation_fork_re,
+	  $line_pointer_corruption_re,
+	  $missing_file_re,
+	],
+	[ $no_output_re ],
+	'pg_amcheck schema s3 reports table and index errors');
+
+# In schema db1.s4, only toast tables are corrupt.  Check that under default
+# options the toast corruption is reported, but when excluding toast we get no
+# error reports.
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's4' ],
+	2,
+	[ $missing_file_re ],
+	[ $no_output_re ],
+	'pg_amcheck in schema s4 reports toast corruption');
+
+$node->command_checks_all(
+	[ @cmd, '--no-dependent-toast', '--exclude-toast-pointers', 'db1', '-s', 's4' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck in schema s4 excluding toast reports no corruption');
+
+# Check that no corruption is reported in schema db1.s5
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's5' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck over schema s5 reports no corruption');
+
+# In schema db1.s1, only indexes are corrupt.  Verify that when we exclude
+# the indexes, no corruption is reported about the schema.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's1', '-I', 't1_btree', '-I', 't2_btree' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck over schema s1 with corrupt indexes excluded reports no corruption');
+
+# In schema db1.s1, only indexes are corrupt.  Verify that when we provide only
+# table inclusions, and disable index expansion, no corruption is reported
+# about the schema.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-t', 's1.*', '--no-dependent-indexes' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck over schema s1 with all indexes excluded reports no corruption');
+
+# In schema db1.s2, only tables are corrupt.  Verify that when we exclude those
+# tables, no corruption is reported.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's2', '-T', 't1', '-T', 't2' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck over schema s2 with corrupt tables excluded reports no corruption');
+
+# Check errors about bad block range command line arguments.  We use schema s5
+# to avoid getting messages about corrupt tables or indexes.
+#
+command_fails_like(
+	[ @cmd, 'db1', '-s', 's5', '--startblock', 'junk' ],
+	qr/invalid start block/,
+	'pg_amcheck rejects garbage startblock');
+
+command_fails_like(
+	[ @cmd, 'db1', '-s', 's5', '--endblock', '1234junk' ],
+	qr/invalid end block/,
+	'pg_amcheck rejects garbage endblock');
+
+command_fails_like(
+	[ @cmd, 'db1', '-s', 's5', '--startblock', '5', '--endblock', '4' ],
+	qr/end block precedes start block/,
+	'pg_amcheck rejects invalid block range');
+
+# Check bt_index_parent_check alternates.  We don't create any index corruption
+# that would behave differently under these modes, so just smoke test that the
+# arguments are handled sensibly.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's1', '-i', 't1_btree', '--parent-check' ],
+	2,
+	[ $index_missing_relation_fork_re ],
+	[ $no_output_re ],
+	'pg_amcheck smoke test --parent-check');
+
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's1', '-i', 't1_btree', '--heapallindexed', '--rootdescend' ],
+	2,
+	[ $index_missing_relation_fork_re ],
+	[ $no_output_re ],
+	'pg_amcheck smoke test --heapallindexed --rootdescend');
diff --git a/contrib/pg_amcheck/t/004_verify_heapam.pl b/contrib/pg_amcheck/t/004_verify_heapam.pl
new file mode 100644
index 0000000000..8ba1c4aea6
--- /dev/null
+++ b/contrib/pg_amcheck/t/004_verify_heapam.pl
@@ -0,0 +1,517 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+
+use Test::More;
+
+# This regression test demonstrates that the pg_amcheck binary supplied with
+# the pg_amcheck contrib module correctly identifies specific kinds of
+# corruption within pages.  To test this, we need a mechanism to create corrupt
+# pages with predictable, repeatable corruption.  The postgres backend cannot
+# be expected to help us with this, as its design is not consistent with the
+# goal of intentionally corrupting pages.
+#
+# Instead, we create a table to corrupt, and with careful consideration of how
+# postgresql lays out heap pages, we seek to offsets within the page and
+# overwrite deliberately chosen bytes with specific values calculated to
+# corrupt the page in expected ways.  We then verify that pg_amcheck reports
+# the corruption, and that it runs without crashing.  Note that the backend
+# cannot simply be started to run queries against the corrupt table, as the
+# backend will crash, at least for some of the corruption types we generate.
+#
+# Autovacuum potentially touching the table in the background makes the exact
+# behavior of this test harder to reason about.  We turn it off to keep things
+# simpler.  We use a "belt and suspenders" approach, turning it off for the
+# system generally in postgresql.conf, and turning it off specifically for the
+# test table.
+#
+# This test depends on the table being written to the heap file exactly as we
+# expect it to be, so we take care to arrange the columns of the table, and
+# insert rows of the table, that give predictable sizes and locations within
+# the table page.
+#
+# The HeapTupleHeaderData has 23 bytes of fixed size fields before the variable
+# length t_bits[] array.  We have exactly 3 columns in the table, so natts = 3,
+# t_bits is 1 byte long, and t_hoff = MAXALIGN(23 + 1) = 24.
+#
+# We're not too fussy about which datatypes we use for the test, but we do care
+# about some specific properties.  We'd like to test both fixed size and
+# varlena types.  We'd like some varlena data inline and some toasted.  And
+# we'd like the layout of the table such that the datums land at predictable
+# offsets within the tuple.  We choose a structure without padding on all
+# supported architectures:
+#
+# 	a BIGINT
+#	b TEXT
+#	c TEXT
+#
+# We always insert a 7-ascii character string into field 'b', which with a
+# 1-byte varlena header gives an 8 byte inline value.  We always insert a long
+# text string in field 'c', long enough to force toast storage.
+#
+# We choose to read and write binary copies of our table's tuples, using perl's
+# pack() and unpack() functions.  Perl uses a packing code system in which:
+#
+#	L = "Unsigned 32-bit Long",
+#	S = "Unsigned 16-bit Short",
+#	C = "Unsigned 8-bit Octet",
+#	c = "signed 8-bit octet",
+#	q = "signed 64-bit quadword"
+#
+# Each tuple in our table has a layout as follows:
+#
+#    xx xx xx xx            t_xmin: xxxx		offset = 0		L
+#    xx xx xx xx            t_xmax: xxxx		offset = 4		L
+#    xx xx xx xx          t_field3: xxxx		offset = 8		L
+#    xx xx                   bi_hi: xx			offset = 12		S
+#    xx xx                   bi_lo: xx			offset = 14		S
+#    xx xx                ip_posid: xx			offset = 16		S
+#    xx xx             t_infomask2: xx			offset = 18		S
+#    xx xx              t_infomask: xx			offset = 20		S
+#    xx                     t_hoff: x			offset = 22		C
+#    xx                     t_bits: x			offset = 23		C
+#    xx xx xx xx xx xx xx xx   'a': xxxxxxxx	offset = 24		q
+#    xx xx xx xx xx xx xx xx   'b': xxxxxxxx	offset = 32		Cccccccc
+#    xx xx xx xx xx xx xx xx   'c': xxxxxxxx	offset = 40		SSSS
+#    xx xx xx xx xx xx xx xx      : xxxxxxxx	 ...continued	SSSS
+#    xx xx                        : xx      	 ...continued	S
+#
+# We could choose to read and write columns 'b' and 'c' in other ways, but
+# it is convenient enough to do it this way.  We define packing code
+# constants here, where they can be compared easily against the layout.
+
+use constant HEAPTUPLE_PACK_CODE => 'LLLSSSSSCCqCcccccccSSSSSSSSS';
+use constant HEAPTUPLE_PACK_LENGTH => 58;     # Total size
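
As a cross-check on the template and its 58-byte total, the same layout can be expressed with Python's struct module (shown only as a sketch; '<' disables padding just as Perl's pack() template above does, and each code mirrors the Perl template field for field):

```python
# Sketch: cross-check HEAPTUPLE_PACK_LENGTH against the pack template using
# Python's struct module.  '<' means little-endian with no padding, mirroring
# Perl's pack().  Perl L ~ Python I, S ~ H, C ~ B, c ~ b, q ~ q, so
# 'LLLSSSSSCCqCcccccccSSSSSSSSS' becomes the format below.
import struct

HEAPTUPLE_FORMAT = '<IIIHHHHHBBqB7b9H'
print(struct.calcsize(HEAPTUPLE_FORMAT))  # 58, matching HEAPTUPLE_PACK_LENGTH
```

The arithmetic matches the layout diagram above: 3*4 (xmin/xmax/t_field3) + 5*2 (ctid halves and infomasks) + 2 (t_hoff, t_bits) + 8 (column 'a') + 8 (column 'b' header plus seven body bytes) + 18 (nine shorts for the toast pointer in 'c') = 58.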
+
+# Read a tuple of our table from a heap page.
+#
+# Takes an open filehandle to the heap file, and the offset of the tuple.
+#
+# Rather than returning the binary data from the file, unpacks the data into a
+# perl hash with named fields.  These fields exactly match the ones understood
+# by write_tuple(), below.  Returns a reference to this hash.
+#
+sub read_tuple ($$)
+{
+	my ($fh, $offset) = @_;
+	my ($buffer, %tup);
+	seek($fh, $offset, 0)
+		or BAIL_OUT("seek failed: $!");
+	defined(sysread($fh, $buffer, HEAPTUPLE_PACK_LENGTH))
+		or BAIL_OUT("sysread failed: $!");
+
+	@_ = unpack(HEAPTUPLE_PACK_CODE, $buffer);
+	%tup = (t_xmin => shift,
+			t_xmax => shift,
+			t_field3 => shift,
+			bi_hi => shift,
+			bi_lo => shift,
+			ip_posid => shift,
+			t_infomask2 => shift,
+			t_infomask => shift,
+			t_hoff => shift,
+			t_bits => shift,
+			a => shift,
+			b_header => shift,
+			b_body1 => shift,
+			b_body2 => shift,
+			b_body3 => shift,
+			b_body4 => shift,
+			b_body5 => shift,
+			b_body6 => shift,
+			b_body7 => shift,
+			c1 => shift,
+			c2 => shift,
+			c3 => shift,
+			c4 => shift,
+			c5 => shift,
+			c6 => shift,
+			c7 => shift,
+			c8 => shift,
+			c9 => shift);
+	# Stitch together the text for column 'b'
+	$tup{b} = join('', map { chr($tup{"b_body$_"}) } (1..7));
+	return \%tup;
+}
+
+# Write a tuple of our table to a heap page.
+#
+# Takes an open filehandle to the heap file, the offset of the tuple, and a
+# reference to a hash with the tuple values, as returned by read_tuple().
+# Writes the tuple fields from the hash into the heap file.
+#
+# The purpose of this function is to write a tuple back to disk with some
+# subset of fields modified.  The function does no error checking.  Use
+# cautiously.
+#
+sub write_tuple($$$)
+{
+	my ($fh, $offset, $tup) = @_;
+	my $buffer = pack(HEAPTUPLE_PACK_CODE,
+					$tup->{t_xmin},
+					$tup->{t_xmax},
+					$tup->{t_field3},
+					$tup->{bi_hi},
+					$tup->{bi_lo},
+					$tup->{ip_posid},
+					$tup->{t_infomask2},
+					$tup->{t_infomask},
+					$tup->{t_hoff},
+					$tup->{t_bits},
+					$tup->{a},
+					$tup->{b_header},
+					$tup->{b_body1},
+					$tup->{b_body2},
+					$tup->{b_body3},
+					$tup->{b_body4},
+					$tup->{b_body5},
+					$tup->{b_body6},
+					$tup->{b_body7},
+					$tup->{c1},
+					$tup->{c2},
+					$tup->{c3},
+					$tup->{c4},
+					$tup->{c5},
+					$tup->{c6},
+					$tup->{c7},
+					$tup->{c8},
+					$tup->{c9});
+	seek($fh, $offset, 0)
+		or BAIL_OUT("seek failed: $!");
+	defined(syswrite($fh, $buffer, HEAPTUPLE_PACK_LENGTH))
+		or BAIL_OUT("syswrite failed: $!");
+	return;
+}
+
+# Set umask so test directories and files are created with default permissions
+umask(0077);
+
+# Set up the node.  Once we create and corrupt the table,
+# autovacuum workers visiting the table could crash the backend.
+# Disable autovacuum so that won't happen.
+my $node = get_new_node('test');
+$node->init;
+$node->append_conf('postgresql.conf', 'autovacuum=off');
+
+# Start the node and load the extensions.  We depend on both
+# amcheck and pageinspect for this test.
+$node->start;
+my $port = $node->port;
+my $pgdata = $node->data_dir;
+$node->safe_psql('postgres', "CREATE EXTENSION amcheck");
+$node->safe_psql('postgres', "CREATE EXTENSION pageinspect");
+
+# Get a non-zero datfrozenxid
+$node->safe_psql('postgres', qq(VACUUM FREEZE));
+
+# Create the test table with precisely the schema that our corruption function
+# expects.
+$node->safe_psql(
+	'postgres', qq(
+		CREATE TABLE public.test (a BIGINT, b TEXT, c TEXT);
+		ALTER TABLE public.test SET (autovacuum_enabled=false);
+		ALTER TABLE public.test ALTER COLUMN c SET STORAGE EXTERNAL;
+		CREATE INDEX test_idx ON public.test(a, b);
+	));
+
+# We want (0 < datfrozenxid < test.relfrozenxid).  To achieve this, we freeze
+# an otherwise unused table, public.junk, prior to inserting data and freezing
+# public.test
+$node->safe_psql(
+	'postgres', qq(
+		CREATE TABLE public.junk AS SELECT 'junk'::TEXT AS junk_column;
+		ALTER TABLE public.junk SET (autovacuum_enabled=false);
+		VACUUM FREEZE public.junk
+	));
+
+my $rel = $node->safe_psql('postgres', qq(SELECT pg_relation_filepath('public.test')));
+my $relpath = "$pgdata/$rel";
+
+# Insert data and freeze public.test
+use constant ROWCOUNT => 16;
+$node->safe_psql('postgres', qq(
+	INSERT INTO public.test (a, b, c)
+		VALUES (
+			12345678,
+			'abcdefg',
+			repeat('w', 10000)
+		);
+	VACUUM FREEZE public.test
+	)) for (1..ROWCOUNT);
+
+my $relfrozenxid = $node->safe_psql('postgres',
+	q(select relfrozenxid from pg_class where relname = 'test'));
+my $datfrozenxid = $node->safe_psql('postgres',
+	q(select datfrozenxid from pg_database where datname = 'postgres'));
+
+# Sanity check that our 'test' table has a relfrozenxid newer than the
+# datfrozenxid for the database, and that the datfrozenxid is greater than the
+# first normal xid.  We rely on these invariants in some of our tests.
+if ($datfrozenxid <= 3 || $datfrozenxid >= $relfrozenxid)
+{
+	$node->clean_node;
+	plan skip_all => "Xid thresholds not as expected: got datfrozenxid = $datfrozenxid, relfrozenxid = $relfrozenxid";
+	exit;
+}
+
+# Find where each of the tuples is located on the page.
+my @lp_off;
+for my $tup (0..ROWCOUNT-1)
+{
+	push (@lp_off, $node->safe_psql('postgres', qq(
+select lp_off from heap_page_items(get_raw_page('test', 'main', 0))
+	offset $tup limit 1)));
+}
+
+# Sanity check that our 'test' table on disk layout matches expectations.  If
+# this is not so, we will have to skip the test until somebody updates the test
+# to work on this platform.
+$node->stop;
+my $file;
+open($file, '+<', $relpath)
+	or BAIL_OUT("open failed: $!");
+binmode $file;
+
+for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
+{
+	my $offnum = $tupidx + 1;  # offnum is 1-based, not zero-based
+	my $offset = $lp_off[$tupidx];
+	my $tup = read_tuple($file, $offset);
+
+	# Sanity-check that the data appears on the page where we expect.
+	my $a = $tup->{a};
+	my $b = $tup->{b};
+	if ($a ne '12345678' || $b ne 'abcdefg')
+	{
+		close($file);  # ignore errors on close; we're exiting anyway
+		$node->clean_node;
+		plan skip_all => qq(Page layout differs from our expectations: expected (12345678, "abcdefg"), got ($a, "$b"));
+		exit;
+	}
+}
+close($file)
+	or BAIL_OUT("close failed: $!");
+$node->start;
+
+# Ok, Xids and page layout look ok.  We can run corruption tests.
+plan tests => 20;
+
+# Check that pg_amcheck runs against the uncorrupted table without error.
+$node->command_ok(['pg_amcheck', '-p', $port, 'postgres'],
+				  'pg_amcheck test table, prior to corruption');
+
+# Check that pg_amcheck runs against the uncorrupted table and index without error.
+$node->command_ok(['pg_amcheck', '-p', $port, 'postgres'],
+				  'pg_amcheck test table and index, prior to corruption');
+
+$node->stop;
+
+# Some #define constants from access/htup_details.h for use while corrupting.
+use constant HEAP_HASNULL            => 0x0001;
+use constant HEAP_XMAX_LOCK_ONLY     => 0x0080;
+use constant HEAP_XMIN_COMMITTED     => 0x0100;
+use constant HEAP_XMIN_INVALID       => 0x0200;
+use constant HEAP_XMAX_COMMITTED     => 0x0400;
+use constant HEAP_XMAX_INVALID       => 0x0800;
+use constant HEAP_NATTS_MASK         => 0x07FF;
+use constant HEAP_XMAX_IS_MULTI      => 0x1000;
+use constant HEAP_KEYS_UPDATED       => 0x2000;
+
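
Some of the corruption cases below set xmin or xmax to 4026531839 and expect it to read as "in the future". With 64-bit full transaction IDs (epoch in the high 32 bits), a new cluster has epoch 0, so the raw xid order is the full-xid order and a large raw xid compares above the cluster's next xid rather than wrapping around. A minimal sketch of that comparison (the helper name is illustrative, not PostgreSQL API):

```python
# Sketch: why xmin/xmax 4026531839 reads as "in the future" on a new cluster.
# A full transaction ID is (epoch << 32) | xid; a fresh cluster has epoch 0,
# so large raw xids equal or exceed the next valid xid instead of wrapping.

def full_xid(epoch, xid):
    return (epoch << 32) | xid

next_valid = full_xid(0, 100)        # next xid on a young cluster (epoch 0)
corrupt = full_xid(0, 4026531839)    # value written into t_xmin/t_xmax below
print(corrupt >= next_valid)         # True: "equals or exceeds next valid xid"
```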
+# Helper function to generate a regular expression matching the header we
+# expect verify_heapam() to return given which fields we expect to be non-null.
+sub header
+{
+	my ($blkno, $offnum, $attnum) = @_;
+	return qr/heap table "postgres"\."public"\."test", block $blkno, offset $offnum, attribute $attnum:\s+/ms
+		if (defined $attnum);
+	return qr/heap table "postgres"\."public"\."test", block $blkno, offset $offnum:\s+/ms
+		if (defined $offnum);
+	return qr/heap table "postgres"\."public"\."test", block $blkno:\s+/ms
+		if (defined $blkno);
+	return qr/heap table "postgres"\."public"\."test":\s+/ms;
+}
+
+# Corrupt the tuples, one type of corruption per tuple.  Some types of
+# corruption cause verify_heapam to skip to the next tuple without
+# performing any remaining checks, so we can't exercise the system properly if
+# we focus all our corruption on a single tuple.
+#
+my @expected;
+open($file, '+<', $relpath)
+	or BAIL_OUT("open failed: $!");
+binmode $file;
+
+for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
+{
+	my $offnum = $tupidx + 1;  # offnum is 1-based, not zero-based
+	my $offset = $lp_off[$tupidx];
+	my $tup = read_tuple($file, $offset);
+
+	my $header = header(0, $offnum, undef);
+	if ($offnum == 1)
+	{
+		# Corruptly set xmin < relfrozenxid
+		my $xmin = $relfrozenxid - 1;
+		$tup->{t_xmin} = $xmin;
+		$tup->{t_infomask} &= ~HEAP_XMIN_COMMITTED;
+		$tup->{t_infomask} &= ~HEAP_XMIN_INVALID;
+
+		# Expected corruption report
+		push @expected,
+			qr/${header}xmin $xmin precedes relation freeze threshold 0:\d+/;
+	}
+	if ($offnum == 2)
+	{
+		# Corruptly set xmin < datfrozenxid
+		my $xmin = 3;
+		$tup->{t_xmin} = $xmin;
+		$tup->{t_infomask} &= ~HEAP_XMIN_COMMITTED;
+		$tup->{t_infomask} &= ~HEAP_XMIN_INVALID;
+
+		push @expected,
+			qr/${header}xmin $xmin precedes oldest valid transaction ID 0:\d+/;
+	}
+	elsif ($offnum == 3)
+	{
+		# Corruptly set xmin < datfrozenxid, further back, noting circularity
+		# of xid comparison.  For a new cluster with epoch = 0, the corrupt
+		# xmin will be interpreted as in the future
+		$tup->{t_xmin} = 4026531839;
+		$tup->{t_infomask} &= ~HEAP_XMIN_COMMITTED;
+		$tup->{t_infomask} &= ~HEAP_XMIN_INVALID;
+
+		push @expected,
+			qr/${header}xmin 4026531839 equals or exceeds next valid transaction ID 0:\d+/;
+	}
+	elsif ($offnum == 4)
+	{
+		# Corruptly set xmax to a value beyond the next valid transaction ID
+		$tup->{t_xmax} = 4026531839;
+		$tup->{t_infomask} &= ~HEAP_XMAX_INVALID;
+
+		push @expected,
+			qr/${header}xmax 4026531839 equals or exceeds next valid transaction ID 0:\d+/;
+	}
+	elsif ($offnum == 5)
+	{
+		# Corrupt the tuple t_hoff, but keep it aligned properly
+		$tup->{t_hoff} += 128;
+
+		push @expected,
+			qr/${header}data begins at offset 152 beyond the tuple length 58/,
+			qr/${header}tuple data should begin at byte 24, but actually begins at byte 152 \(3 attributes, no nulls\)/;
+	}
+	elsif ($offnum == 6)
+	{
+		# Corrupt the tuple t_hoff, wrong alignment
+		$tup->{t_hoff} += 3;
+
+		push @expected,
+			qr/${header}tuple data should begin at byte 24, but actually begins at byte 27 \(3 attributes, no nulls\)/;
+	}
+	elsif ($offnum == 7)
+	{
+		# Corrupt the tuple t_hoff, underflow but correct alignment
+		$tup->{t_hoff} -= 8;
+
+		push @expected,
+			qr/${header}tuple data should begin at byte 24, but actually begins at byte 16 \(3 attributes, no nulls\)/;
+	}
+	elsif ($offnum == 8)
+	{
+		# Corrupt the tuple t_hoff, underflow and wrong alignment
+		$tup->{t_hoff} -= 3;
+
+		push @expected,
+			qr/${header}tuple data should begin at byte 24, but actually begins at byte 21 \(3 attributes, no nulls\)/;
+	}
+	elsif ($offnum == 9)
+	{
+		# Corrupt the tuple to look like it has lots of attributes, not just 3
+		$tup->{t_infomask2} |= HEAP_NATTS_MASK;
+
+		push @expected,
+			qr/${header}number of attributes 2047 exceeds maximum expected for table 3/;
+	}
+	elsif ($offnum == 10)
+	{
+		# Corrupt the tuple to look like it has lots of attributes, some of
+		# them null.  This falsely creates the impression that the t_bits
+		# array is longer than just one byte, but t_hoff still says otherwise.
+		$tup->{t_infomask} |= HEAP_HASNULL;
+		$tup->{t_infomask2} |= HEAP_NATTS_MASK;
+		$tup->{t_bits} = 0xAA;
+
+		push @expected,
+			qr/${$header}tuple data should begin at byte 280, but actually begins at byte 24 \(2047 attributes, has nulls\)/;
+	}
+	elsif ($offnum == 11)
+	{
+		# Same as above, but this time t_hoff plays along
+		$tup->{t_infomask} |= HEAP_HASNULL;
+		$tup->{t_infomask2} |= (HEAP_NATTS_MASK & 0x40);
+		$tup->{t_bits} = 0xAA;
+		$tup->{t_hoff} = 32;
+
+		push @expected,
+			qr/${$header}number of attributes 67 exceeds maximum expected for table 3/;
+	}
+	elsif ($offnum == 12)
+	{
+		# Corrupt the bits in column 'b' 1-byte varlena header
+		$tup->{b_header} = 0x80;
+
+		$header = header(0, $offnum, 1);
+		push @expected,
+			qr/${header}attribute 1 with length 4294967295 ends at offset 416848000 beyond total tuple length 58/;
+	}
+	elsif ($offnum == 13)
+	{
+		# Corrupt the bits in column 'c' toast pointer
+		$tup->{c6} = 41;
+		$tup->{c7} = 41;
+
+		$header = header(0, $offnum, 2);
+		push @expected,
+			qr/${header}final toast chunk number 0 differs from expected value 6/,
+			qr/${header}toasted value for attribute 2 missing from toast table/;
+	}
+	elsif ($offnum == 14)
+	{
+		# Set both HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI
+		$tup->{t_infomask} |= HEAP_XMAX_COMMITTED;
+		$tup->{t_infomask} |= HEAP_XMAX_IS_MULTI;
+		$tup->{t_xmax} = 4;
+
+		push @expected,
+			qr/${header}multitransaction ID 4 equals or exceeds next valid multitransaction ID 1/;
+	}
+	elsif ($offnum == 15)	# Last offnum must equal ROWCOUNT
+	{
+		# Set both HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI
+		$tup->{t_infomask} |= HEAP_XMAX_COMMITTED;
+		$tup->{t_infomask} |= HEAP_XMAX_IS_MULTI;
+		$tup->{t_xmax} = 4000000000;
+
+		push @expected,
+			qr/${header}multitransaction ID 4000000000 precedes relation minimum multitransaction ID threshold 1/;
+	}
+	write_tuple($file, $offset, $tup);
+}
+close($file)
+	or BAIL_OUT("close failed: $!");
+$node->start;
+
+# Run pg_amcheck against the corrupt table with epoch=0, comparing actual
+# corruption messages against the expected messages
+$node->command_checks_all(
+	['pg_amcheck', '--no-dependent-indexes', '-p', $port, 'postgres'],
+	2,
+	[ @expected ],
+	[ ],
+	'Expected corruption message output');
+
+$node->teardown_node;
+$node->clean_node;
diff --git a/contrib/pg_amcheck/t/005_opclass_damage.pl b/contrib/pg_amcheck/t/005_opclass_damage.pl
new file mode 100644
index 0000000000..eba8ea9cae
--- /dev/null
+++ b/contrib/pg_amcheck/t/005_opclass_damage.pl
@@ -0,0 +1,54 @@
+# This regression test checks the behavior of the btree validation in the
+# presence of breaking sort order changes.
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 5;
+
+my $node = get_new_node('test');
+$node->init;
+$node->start;
+
+# Create a custom operator class and an index which uses it.
+$node->safe_psql('postgres', q(
+	CREATE EXTENSION amcheck;
+
+	CREATE FUNCTION int4_asc_cmp (a int4, b int4) RETURNS int LANGUAGE sql AS $$
+		SELECT CASE WHEN $1 = $2 THEN 0 WHEN $1 > $2 THEN 1 ELSE -1 END; $$;
+
+	CREATE OPERATOR CLASS int4_fickle_ops FOR TYPE int4 USING btree AS
+	    OPERATOR 1 < (int4, int4), OPERATOR 2 <= (int4, int4),
+	    OPERATOR 3 = (int4, int4), OPERATOR 4 >= (int4, int4),
+	    OPERATOR 5 > (int4, int4), FUNCTION 1 int4_asc_cmp(int4, int4);
+
+	CREATE TABLE int4tbl (i int4);
+	INSERT INTO int4tbl (SELECT * FROM generate_series(1,1000) gs);
+	CREATE INDEX fickleidx ON int4tbl USING btree (i int4_fickle_ops);
+));
+
+# We have not yet broken the index, so we should get no corruption
+$node->command_like(
+	[ 'pg_amcheck', '--quiet', '-p', $node->port, 'postgres' ],
+	qr/^$/,
+	'pg_amcheck all schemas, tables and indexes reports no corruption');
+
+# Change the operator class to use a function which sorts in a different
+# order to corrupt the btree index
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION int4_desc_cmp (int4, int4) RETURNS int LANGUAGE sql AS $$
+		SELECT CASE WHEN $1 = $2 THEN 0 WHEN $1 > $2 THEN -1 ELSE 1 END; $$;
+	UPDATE pg_catalog.pg_amproc
+		SET amproc = 'int4_desc_cmp'::regproc
+		WHERE amproc = 'int4_asc_cmp'::regproc
+));
+
+# Index corruption should now be reported
+$node->command_checks_all(
+	[ 'pg_amcheck', '-p', $node->port, 'postgres' ],
+	2,
+	[ qr/item order invariant violated for index "fickleidx"/ ],
+	[ ],
+	'pg_amcheck all schemas, tables and indexes reports fickleidx corruption'
+);
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index d3ca4b6932..7e101f7c11 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -185,6 +185,7 @@ pages.
   </para>
 
  &oid2name;
+ &pgamcheck;
  &vacuumlo;
  </sect1>
 
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index db1d369743..5115cb03d0 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -133,6 +133,7 @@
 <!ENTITY oldsnapshot     SYSTEM "oldsnapshot.sgml">
 <!ENTITY pageinspect     SYSTEM "pageinspect.sgml">
 <!ENTITY passwordcheck   SYSTEM "passwordcheck.sgml">
+<!ENTITY pgamcheck       SYSTEM "pgamcheck.sgml">
 <!ENTITY pgbuffercache   SYSTEM "pgbuffercache.sgml">
 <!ENTITY pgcrypto        SYSTEM "pgcrypto.sgml">
 <!ENTITY pgfreespacemap  SYSTEM "pgfreespacemap.sgml">
diff --git a/doc/src/sgml/pgamcheck.sgml b/doc/src/sgml/pgamcheck.sgml
new file mode 100644
index 0000000000..b960c47305
--- /dev/null
+++ b/doc/src/sgml/pgamcheck.sgml
@@ -0,0 +1,713 @@
+<!-- doc/src/sgml/pgamcheck.sgml -->
+
+<refentry id="pgamcheck">
+ <indexterm zone="pgamcheck">
+  <primary>pg_amcheck</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle><application>pg_amcheck</application></refentrytitle>
+  <manvolnum>1</manvolnum>
+  <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>pg_amcheck</refname>
+  <refpurpose>checks for corruption in one or more
+  <productname>PostgreSQL</productname> databases</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+  <cmdsynopsis>
+   <command>pg_amcheck</command>
+   <arg rep="repeat"><replaceable>option</replaceable></arg>
+   <arg><replaceable>dbname</replaceable></arg>
+  </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+   <application>pg_amcheck</application> supports running
+   <xref linkend="amcheck"/>'s corruption checking functions against one or
+   more databases, with options to select which schemas, tables and indexes
+   to check, which kinds of checking to perform, and whether to perform the
+   checks in parallel and, if so, how many parallel connections to establish
+   and use.
+  </para>
+
+  <para>
+   Only table relations and btree indexes are currently supported.  Other
+   relation types are silently skipped.
+  </para>
+
+ </refsect1>
+
+ <refsect1>
+  <title>Options</title>
+
+  <para>
+   <application>pg_amcheck</application> accepts the following command-line arguments:
+
+   <variablelist>
+
+    <varlistentry>
+     <term><option><replaceable class="parameter">dbname</replaceable></option></term>
+     <listitem>
+      <para>
+       Specifies the name of a database to be checked, or a connection string
+       to use while connecting.
+      </para>
+      <para>
+       If no <replaceable>dbname</replaceable> is specified, and if
+       <option>-a</option> <option>--all</option> is not used, the database name
+       is read from the environment variable <envar>PGDATABASE</envar>.  If
+       that is not set, the user name specified for the connection is used.
+       The <replaceable>dbname</replaceable> can be a <link
+       linkend="libpq-connstring">connection string</link>.  If so, connection
+       string parameters will override any conflicting command line options,
+       and connection string parameters other than the database
+       name itself will be re-used when connecting to other databases.
+      </para>
+      <para>
+       If a connection string is given which contains no database name, the other
+       parameters of the string will be used while the database name to use is
+       determined as described above.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-a</option></term>
+     <term><option>--all</option></term>
+       <listitem>
+      <para>
+       Perform checking in all databases which are not otherwise excluded.
+      </para>
+      <para>
+       In the absence of any other options, selects all objects across all
+       schemas and databases.
+      </para>
+      <para>
+       Option <option>-D</option> <option>--exclude-database</option> takes
+       precedence over <option>-a</option> <option>--all</option>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-d <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--database=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Perform checking in databases matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>
+       that are not otherwise excluded.
+      </para>
+      <para>
+       This option may be specified multiple times to list more than one
+       pattern.  By default, all objects in all matching databases will be
+       checked.
+      </para>
+      <para>
+       If <option>-a</option> <option>--all</option> is also specified,
+       <option>-d</option> <option>--database</option> has no effect.
+      </para>
+      <para>
+       Option <option>-D</option> <option>--exclude-database</option> takes
+       precedence over <option>-d</option> <option>--database</option>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-D <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--exclude-database=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Exclude databases matching the specified exclusion
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>,
+       even if they match other patterns or are selected by option
+       <option>-a</option> <option>--all</option>.
+      </para>
+      <para>
+       This does not exclude any database that was listed explicitly as a
+       <replaceable>dbname</replaceable> on the command line, nor does it exclude
+       the database chosen in the absence of any
+       <replaceable>dbname</replaceable> argument.
+      </para>
+      <para>
+       This option may be specified multiple times to list more than one
+       exclusion pattern.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-e</option></term>
+     <term><option>--echo</option></term>
+     <listitem>
+      <para>
+       Print to stdout all commands and queries being executed against the
+       server.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--endblock=<replaceable class="parameter">block</replaceable></option></term>
+     <listitem>
+      <para>
+       Check table blocks up to and including the specified ending block
+       number.  An error will occur if the table relation being checked has
+       fewer than this number of blocks.
+      </para>
+      <para>
+       By default, checking is performed up to and including the final block.
+       This option will be applied to all table relations that are checked,
+       including toast tables, but note that unless
+       <option>--exclude-toast-pointers</option> is given, toast pointers found
+       in the main table will be followed into the toast table regardless of
+       where in the toast table they point.
+      </para>
+      <para>
+       This option does not apply to indexes, and is probably only useful when
+       checking a single table relation.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--exclude-toast-pointers</option></term>
+     <listitem>
+      <para>
+       When checking main relations, do not look up entries in toast tables
+       corresponding to toast pointers in the main relation.
+      </para>
+      <para>
+       The default behavior checks each toast pointer encountered in the main
+       table to verify, as much as possible, that the pointer points at
+       something in the toast table that is reasonable.  Toast pointers which
+       point beyond the end of the toast table, or to the middle (rather than
+       the beginning) of a toast entry, are identified as corrupt.
+      </para>
+      <para>
+       The process by which <xref linkend="amcheck"/>'s
+       <function>verify_heapam</function> function checks each toast pointer is
+       slow and may be improved in a future release.  Some users may wish to
+       disable this check to save time.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--heapallindexed</option></term>
+     <listitem>
+      <para>
+       For each index checked, verify the presence of all heap tuples as index
+       tuples in the index using <xref linkend="amcheck"/>'s
+       <option>heapallindexed</option> option.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-?</option></term>
+     <term><option>--help</option></term>
+     <listitem>
+      <para>
+       Show help about <application>pg_amcheck</application> command line
+       arguments, and exit.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-h <replaceable class="parameter">hostname</replaceable></option></term>
+     <term><option>--host=<replaceable class="parameter">hostname</replaceable></option></term>
+     <listitem>
+      <para>
+       Specifies the host name of the machine on which the server is running.
+       If the value begins with a slash, it is used as the directory for the
+       Unix domain socket.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-i <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--index=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Perform checks on indexes which match the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>
+       unless they are otherwise excluded.
+      </para>
+      <para>
+       This is similar to the <option>-r</option> <option>--relation</option>
+       option, except that it applies only to indexes, not tables.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-I <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--exclude-index=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Exclude checks on the indexes which match the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>.
+      </para>
+      <para>
+       This is similar to the <option>-R</option>
+       <option>--exclude-relation</option> option, except that it applies only
+       to indexes, not tables.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-j <replaceable class="parameter">num</replaceable></option></term>
+     <term><option>--jobs=<replaceable class="parameter">num</replaceable></option></term>
+     <listitem>
+      <para>
+       Use <replaceable>num</replaceable> concurrent connections to the server,
+       or one per object to be checked, whichever number is smaller.
+      </para>
+      <para>
+       The default is to use a single connection.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--maintenance-db=<replaceable class="parameter">dbname</replaceable></option></term>
+     <listitem>
+      <para>
+       Specifies the name of the database to connect to in order to discover
+       which databases should be checked, when
+       <option>-a</option>/<option>--all</option> is used.  If not specified,
+       the <literal>postgres</literal> database will be used, or if that does
+       not exist, <literal>template1</literal> will be used.  This can be a
+       <link linkend="libpq-connstring">connection string</link>.  If so,
+       connection string parameters will override any conflicting command line
+       options.  Also, connection string parameters other than the database
+       name itself will be re-used when connecting to other databases.
+      </para>
+      <para>
+       If a connection string is given which contains no database name, the other
+       parameters of the string will be used while the database name to use is
+       determined as described above.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--no-dependent-indexes</option></term>
+     <listitem>
+      <para>
+       When including a table relation in the list of relations to check, do
+       not automatically include btree indexes associated with that table.
+      </para>
+      <para>
+       By default, all tables to be checked will also have checks performed on
+       their associated btree indexes, if any.  If this option is given, only
+       those indexes which match a <option>--relation</option> or
+       <option>--index</option> pattern will be checked.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--no-strict-names</option></term>
+     <listitem>
+      <para>
+       When calculating the list of databases to check, and the objects within
+       those databases to be checked, do not raise an error for database,
+       schema, relation, table, or index inclusion patterns which match no
+       corresponding objects.
+      </para>
+      <para>
+       Exclusion patterns are not required to match any objects, but by
+       default an unmatched inclusion pattern raises an error.  This includes
+       patterns that fail to match only because an exclusion pattern excluded
+       the matching object, and patterns that fail to match a database
+       because it does not accept connections (datallowconn is false).
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--no-dependent-toast</option></term>
+     <listitem>
+      <para>
+       When including a table relation in the list of relations to check, do
+       not automatically include toast tables associated with that table.
+      </para>
+      <para>
+       By default, all tables to be checked will also have checks performed on
+       their associated toast tables, if any.  If this option is given, only
+       those toast tables which match a <option>--relation</option> or
+       <option>--table</option> pattern will be checked.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--on-error-stop</option></term>
+     <listitem>
+      <para>
+       After reporting all corruptions on the first page of a table where
+       corruptions are found, stop processing that table relation and move on
+       to the next table or index.
+      </para>
+      <para>
+       Note that index checking always stops after the first corrupt page.
+       This option affects only table relations.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--parent-check</option></term>
+     <listitem>
+      <para>
+       For each btree index checked, use <xref linkend="amcheck"/>'s
+       <function>bt_index_parent_check</function> function, which performs
+       additional checks of parent/child relationships during index checking.
+      </para>
+      <para>
+       The default is to use <application>amcheck</application>'s
+       <function>bt_index_check</function> function, but note that use of the
+       <option>--rootdescend</option> option implicitly selects
+       <function>bt_index_parent_check</function>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-p <replaceable class="parameter">port</replaceable></option></term>
+     <term><option>--port=<replaceable class="parameter">port</replaceable></option></term>
+     <listitem>
+      <para>
+       Specifies the TCP port or local Unix domain socket file extension on
+       which the server is listening for connections.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-P</option></term>
+     <term><option>--progress</option></term>
+     <listitem>
+      <para>
+       Show progress information about how many relations have been checked.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-q</option></term>
+     <term><option>--quiet</option></term>
+     <listitem>
+      <para>
+       Do not write additional messages beyond those about corruption.
+      </para>
+      <para>
+       This option does not suppress output generated by use of the
+       <option>-e</option> <option>--echo</option> option.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-r <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--relation=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Perform checking on all relations matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>
+       unless they are otherwise excluded.
+      </para>
+      <para>
+       This option may be specified multiple times to list more than one
+       pattern.
+      </para>
+      <para>
+       Patterns may be unqualified, schema-qualified, or database- and
+       schema-qualified, such as <literal>"my*relation"</literal>,
+       <literal>"my*schema*.my*relation*"</literal>, or
+       <literal>"my*database.my*schema.my*relation"</literal>.  A relation
+       pattern may match databases that are not otherwise selected for
+       checking; matching relations in those databases will still be
+       checked.
+      </para>
+      <para>
+       The <option>-R</option> <option>--exclude-relation</option> option takes
+       precedence over <option>-r</option> <option>--relation</option>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-R <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--exclude-relation=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Exclude checks on relations matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>.
+      </para>
+      <para>
+       As with <option>-r</option> <option>--relation</option>, the
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link> may be unqualified, schema-qualified,
+       or database- and schema-qualified.
+      </para>
+      <para>
+       The <option>-R</option> <option>--exclude-relation</option> option takes
+       precedence over <option>-r</option> <option>--relation</option>,
+       <option>-t</option> <option>--table</option> and <option>-i</option>
+       <option>--index</option>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--rootdescend</option></term>
+     <listitem>
+      <para>
+       For each index checked, re-find tuples on the leaf level by performing a
+       new search from the root page for each tuple using
+       <xref linkend="amcheck"/>'s <option>rootdescend</option> option.
+      </para>
+      <para>
+       Use of this option implicitly also selects the
+       <option>--parent-check</option> option.
+      </para>
+      <para>
+       This form of verification was originally written to help in the
+       development of btree index features.  It may be of limited use or even
+       of no use in helping detect the kinds of corruption that occur in
+       practice.  It may also cause corruption checking to take considerably
+       longer and consume considerably more resources on the server.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-s <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--schema=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Perform checking in schemas matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link> that are not otherwise excluded.
+      </para>
+      <para>
+       This option may be specified multiple times to list more than one
+       pattern for checking.  By default, all objects in all matching
+       schemas will be checked.
+      </para>
+      <para>
+       Option <option>-S</option> <option>--exclude-schema</option> takes
+       precedence over <option>-s</option> <option>--schema</option>.
+      </para>
+      <para>
+       Note that both tables and indexes are included using this option, which
+       might not be what you want if you are also using
+       <option>--no-dependent-indexes</option>.  To specify all tables in a
+       schema without also specifying all indexes, <option>--table</option> can
+       be used with a pattern that specifies the schema.  For example, to check
+       all tables in schema <literal>corp</literal>, the option
+       <literal>--table="corp.*"</literal> may be used.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-S <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--exclude-schema=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Do not perform checking in schemas matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>.
+      </para>
+      <para>
+       This option may be specified multiple times to list more than one
+       pattern for exclusion.
+      </para>
+      <para>
+       If a schema which is included using
+       <option>-s</option> <option>--schema</option> is also excluded using
+       <option>-S</option> <option>--exclude-schema</option>, the schema will
+       be excluded.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--skip=<replaceable class="parameter">option</replaceable></option></term>
+     <listitem>
+      <para>
+       If <literal>"all-frozen"</literal> is given, table corruption checks
+       will skip over pages in all tables that are marked as all frozen.
+      </para>
+      <para>
+       If <literal>"all-visible"</literal> is given, table corruption checks
+       will skip over pages in all tables that are marked as all visible.
+      </para>
+      <para>
+       By default, no pages are skipped.  This can be specified as
+       <literal>"none"</literal>, but since this is the default, it need not be
+       mentioned.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--startblock=<replaceable class="parameter">block</replaceable></option></term>
+     <listitem>
+      <para>
+       Check table blocks beginning with the specified block number.  An error
+       will occur if the table relation being checked has fewer than this number
+       of blocks.
+      </para>
+      <para>
+       By default, checking begins with block zero.  This option will be
+       applied to all table relations that are checked, including toast
+       tables, but note that unless <option>--exclude-toast-pointers</option>
+       is given, toast pointers found in the main table will be followed into
+       the toast table regardless of where in the toast table they point.
+      </para>
+      <para>
+       This option does not apply to indexes, and is probably only useful when
+       checking a single table.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-t <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--table=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Perform checks on all tables matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>
+       unless they are otherwise excluded.
+      </para>
+      <para>
+       This is similar to the <option>-r</option> <option>--relation</option>
+       option, except that it applies only to tables, not indexes.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-T <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--exclude-table=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Exclude checks on tables matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>.
+      </para>
+      <para>
+       This is similar to the <option>-R</option>
+       <option>--exclude-relation</option> option, except that it applies only
+       to tables, not indexes.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-U <replaceable class="parameter">username</replaceable></option></term>
+     <term><option>--username=<replaceable class="parameter">username</replaceable></option></term>
+     <listitem>
+      <para>
+       User name to connect as.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-v</option></term>
+     <term><option>--verbose</option></term>
+     <listitem>
+      <para>
+       Increase message verbosity.  This option may be given more than
+       once.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-V</option></term>
+     <term><option>--version</option></term>
+     <listitem>
+      <para>
+       Print the <application>pg_amcheck</application> version and exit.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-w</option></term>
+     <term><option>--no-password</option></term>
+     <listitem>
+      <para>
+       Never issue a password prompt.  If the server requires password
+       authentication and a password is not available by other means such as
+       a <filename>.pgpass</filename> file, the connection attempt will fail.
+       This option can be useful in batch jobs and scripts where no user is
+       present to enter a password.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-W</option></term>
+     <term><option>--password</option></term>
+     <listitem>
+      <para>
+       Force <application>pg_amcheck</application> to prompt for a password
+       before connecting to a database.
+      </para>
+      <para>
+       This option is never essential, since
+       <application>pg_amcheck</application> will automatically prompt for a
+       password if the server demands password authentication.  However,
+       <application>pg_amcheck</application> will waste a connection attempt
+       finding out that the server wants a password.  In some cases it is
+       worth typing <option>-W</option> to avoid the extra connection attempt.
+      </para>
+     </listitem>
+    </varlistentry>
+
+   </variablelist>
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+   <application>pg_amcheck</application> is designed to work with
+   <productname>PostgreSQL</productname> 14.0 and later.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Author</title>
+
+  <para>
+   Mark Dilger <email>mark.dilger@enterprisedb.com</email>
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>See Also</title>
+
+  <simplelist type="inline">
+   <member><xref linkend="amcheck"/></member>
+  </simplelist>
+ </refsect1>
+</refentry>
diff --git a/src/tools/msvc/Install.pm b/src/tools/msvc/Install.pm
index ea3af48777..49ad558b74 100644
--- a/src/tools/msvc/Install.pm
+++ b/src/tools/msvc/Install.pm
@@ -18,7 +18,7 @@ our (@ISA, @EXPORT_OK);
 @EXPORT_OK = qw(Install);
 
 my $insttype;
-my @client_contribs = ('oid2name', 'pgbench', 'vacuumlo');
+my @client_contribs = ('oid2name', 'pg_amcheck', 'pgbench', 'vacuumlo');
 my @client_program_files = (
 	'clusterdb',      'createdb',   'createuser',    'dropdb',
 	'dropuser',       'ecpg',       'libecpg',       'libecpg_compat',
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 49614106dc..f680544e07 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -33,9 +33,9 @@ my @unlink_on_exit;
 
 # Set of variables for modules in contrib/ and src/test/modules/
 my $contrib_defines = { 'refint' => 'REFINT_VERBOSE' };
-my @contrib_uselibpq = ('dblink', 'oid2name', 'postgres_fdw', 'vacuumlo');
-my @contrib_uselibpgport   = ('oid2name', 'vacuumlo');
-my @contrib_uselibpgcommon = ('oid2name', 'vacuumlo');
+my @contrib_uselibpq = ('dblink', 'oid2name', 'pg_amcheck', 'postgres_fdw', 'vacuumlo');
+my @contrib_uselibpgport   = ('oid2name', 'pg_amcheck', 'vacuumlo');
+my @contrib_uselibpgcommon = ('oid2name', 'pg_amcheck', 'vacuumlo');
 my $contrib_extralibs      = undef;
 my $contrib_extraincludes = { 'dblink' => ['src/backend'] };
 my $contrib_extrasource = {
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e017557e3e..202673d37f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -101,6 +101,7 @@ AlterUserMappingStmt
 AlteredTableInfo
 AlternativeSubPlan
 AlternativeSubPlanState
+AmcheckOptions
 AnalyzeAttrComputeStatsFunc
 AnalyzeAttrFetchFunc
 AnalyzeForeignTable_function
@@ -500,6 +501,7 @@ DSA
 DWORD
 DataDumperPtr
 DataPageDeleteStack
+DatabaseInfo
 DateADT
 Datum
 DatumTupleFields
@@ -1803,6 +1805,8 @@ PathHashStack
 PathKey
 PathKeysComparison
 PathTarget
+PatternInfo
+PatternInfoArray
 Pattern_Prefix_Status
 Pattern_Type
 PendingFsyncEntry
@@ -2085,6 +2089,7 @@ RelToCluster
 RelabelType
 Relation
 RelationData
+RelationInfo
 RelationPtr
 RelationSyncEntry
 RelcacheCallbackFunction
-- 
2.21.1 (Apple Git-122.3)

Attachment: v45-0003-Extending-PostgresNode-to-test-corruption.patch (application/octet-stream)
From 9a68cf04ce1f847b2c4f02ed8e5e88fce997b664 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Tue, 2 Feb 2021 12:37:58 -0800
Subject: [PATCH v45 3/3] Extending PostgresNode to test corruption.

PostgresNode now has functions for overwriting relation files
with full or partial prior versions of those files, creating
corruption beyond merely twiddling the bits of a heap relation
file.

Adding a regression test for pg_amcheck based on this new
functionality.
---
 contrib/pg_amcheck/t/006_relfile_damage.pl    | 145 ++++++++++
 src/test/modules/Makefile                     |   1 +
 src/test/modules/corruption/Makefile          |  16 ++
 .../modules/corruption/t/001_corruption.pl    |  83 ++++++
 src/test/perl/PostgresNode.pm                 | 265 ++++++++++++++++++
 5 files changed, 510 insertions(+)
 create mode 100644 contrib/pg_amcheck/t/006_relfile_damage.pl
 create mode 100644 src/test/modules/corruption/Makefile
 create mode 100644 src/test/modules/corruption/t/001_corruption.pl

diff --git a/contrib/pg_amcheck/t/006_relfile_damage.pl b/contrib/pg_amcheck/t/006_relfile_damage.pl
new file mode 100644
index 0000000000..45ad223531
--- /dev/null
+++ b/contrib/pg_amcheck/t/006_relfile_damage.pl
@@ -0,0 +1,145 @@
+use strict;
+use warnings;
+
+use TestLib;
+use Test::More tests => 22;
+use PostgresNode;
+
+my ($node, $port);
+
+# Returns the name of the toast relation associated with the named relation.
+#
+# Assumes the test node is running
+sub relation_toast($$)
+{
+	my ($dbname, $relname) = @_;
+
+	my $rel = $node->safe_psql($dbname, qq(
+		SELECT ct.relname
+			FROM pg_catalog.pg_class cr, pg_catalog.pg_class ct
+			WHERE cr.oid = '$relname'::regclass
+			  AND cr.reltoastrelid = ct.oid
+			));
+	return undef unless defined $rel;
+	return "pg_toast.$rel";
+}
+
+# Test set-up
+$node = get_new_node('test');
+$node->init;
+$node->start;
+$port = $node->port;
+
+# Load the amcheck extension, upon which pg_amcheck depends
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+
+# Create a table with a btree index.  Use a fillfactor for the table and index
+# that will allow some fraction of updates to be on the original pages and some
+# on new pages.
+#
+$node->safe_psql('postgres', qq(
+create schema t;
+create table t.t1 (id integer, t text) with (fillfactor=75);
+alter table t.t1 alter column t set storage external;
+insert into t.t1 select gs, repeat('x',gs) from generate_series(9990,10000) gs;
+create index t1_idx on t.t1 (id) with (fillfactor=75);
+));
+
+my $toastrel = relation_toast('postgres', 't.t1');
+
+# Flush relation files to disk and take snapshots of the toast and index
+#
+$node->restart;
+$node->take_relfile_snapshot_minimal('postgres', 'idx', 't.t1_idx');
+$node->take_relfile_snapshot_minimal('postgres', 'toast', $toastrel);
+
+# Insert new data into the table and index
+#
+$node->safe_psql('postgres', qq(
+insert into t.t1 select gs, repeat('y',gs) from generate_series(10001,10100) gs;
+));
+
+# Revert index.  The reverted snapshot file is not corrupt, but it also
+# does not match the current contents of the table.
+#
+$node->stop;
+$node->revert_to_snapshot('idx');
+
+# Restart the node and check table and index with varying options.
+#
+$node->start;
+
+# Checks which do not reconcile the index and table via --heapallindexed will
+# not notice any problems
+#
+$node->command_like(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*' ],
+	qr/^$/,
+	'pg_amcheck reverted index at default checking level');
+
+$node->command_like(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*' ],
+	qr/^$/,
+	'pg_amcheck reverted index at default checking level');
+
+$node->command_like(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*', '--parent-check' ],
+	qr/^$/,
+	'pg_amcheck reverted index with --parent-check');
+
+$node->command_like(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*', '--rootdescend' ],
+	qr/^$/,
+	'pg_amcheck reverted index with --rootdescend');
+
+# Checks which do reconcile the index and table via --heapallindexed will
+# notice the mismatch in their contents
+#
+$node->command_checks_all(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*', '--heapallindexed' ],
+	2,
+	[ qr/heap tuple .* from table "t1" lacks matching index tuple within index "t1_idx"/ ],
+	[ ],
+	'pg_amcheck reverted index with --heapallindexed');
+
+$node->command_checks_all(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*', '--heapallindexed', '--rootdescend' ],
+	2,
+	[ qr/heap tuple .* from table "t1" lacks matching index tuple within index "t1_idx"/ ],
+	[ ],
+	'pg_amcheck reverted index with --heapallindexed --rootdescend');
+
+# Revert the toast.  The reverted toast table is not corrupt, but it does not
+# have entries for all toast pointers in the main table
+#
+$node->stop;
+$node->revert_to_snapshot('toast');
+
+# Restart the node and check table and toast with varying options.  When
+# checking the toast pointers, we may get errors produced by verify_heapam, but
+# we may also get errors from failure to read toast blocks that are beyond the
+# end of the toast table, of the form /ERROR:  could not read block/.  To avoid
+# having a brittle test, we accept any error message.
+#
+$node->start;
+
+$node->command_checks_all(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', $toastrel ],
+	0,
+	[ qr/^$/ ],
+	[ ],
+	'pg_amcheck reverted toast table');
+
+$node->command_checks_all(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*', '--exclude-toast-pointers' ],
+	0,
+	[ qr/^$/ ],
+	[ ],
+	'pg_amcheck with reverted toast using --exclude-toast-pointers');
+
+$node->command_checks_all(
+	[ 'pg_amcheck', '--quiet', '-p', $port, '-r', 'postgres.t.*' ],
+	2,
+	[ qr/.+/ ],			# Any non-empty error message is acceptable
+	[ ],
+	'pg_amcheck with reverted toast and default checking');
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 5391f461a2..c92d1702b4 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -7,6 +7,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = \
 		  brin \
 		  commit_ts \
+		  corruption \
 		  delay_execution \
 		  dummy_index_am \
 		  dummy_seclabel \
diff --git a/src/test/modules/corruption/Makefile b/src/test/modules/corruption/Makefile
new file mode 100644
index 0000000000..ba461c645d
--- /dev/null
+++ b/src/test/modules/corruption/Makefile
@@ -0,0 +1,16 @@
+# src/test/modules/corruption/Makefile
+
+# EXTRA_INSTALL = contrib/pg_amcheck
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/corruption
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/corruption/t/001_corruption.pl b/src/test/modules/corruption/t/001_corruption.pl
new file mode 100644
index 0000000000..ae4a262e06
--- /dev/null
+++ b/src/test/modules/corruption/t/001_corruption.pl
@@ -0,0 +1,83 @@
+use strict;
+use warnings;
+
+use TestLib;
+use Test::More tests => 10;
+use PostgresNode;
+
+my $node = get_new_node('test');
+$node->init;
+$node->start;
+
+# Create something non-trivial for the first snapshot
+$node->safe_psql('postgres', qq(
+create table t1 (id integer, short_text text, long_text text);
+insert into t1 (id, short_text, long_text)
+	(select gs, 'foo', repeat('x', gs)
+		from generate_series(1,10000) gs);
+create unique index idx1 on t1 (id, short_text);
+vacuum freeze;
+));
+
+# Flush relation files to disk and take snapshot of them
+$node->restart;
+$node->take_relfile_snapshot('postgres', 'snap1', 'public.t1');
+
+# Update data in the table, toast table, and index
+$node->safe_psql('postgres', qq(
+update t1 set
+	short_text = 'bar',
+	long_text = repeat('y', id);
+));
+
+# Flush relation files to disk and take second snapshot
+$node->restart;
+$node->take_relfile_snapshot('postgres', 'snap2', 'public.t1');
+
+# Revert the first page of t1 using a torn snapshot.  This should be a partial
+# and corrupt reverting of the update.
+$node->stop;
+$node->revert_to_torn_relfile_snapshot('snap1', 8192);
+
+# Restart the node and count the number of rows in t1 with the original
+# (pre-update) values.  It should not be zero, but nor will it be the full
+# 10000.
+$node->start;
+my ($old, $new, $oldtoast, $newtoast) = counts();
+ok($old > 0 && $old < 10000, "Torn snapshot reverts some of the main updates");
+ok($new > 0 && $new <= 10000, "Torn snapshot retains some of the main updates");
+
+# Revert t1 fully to the first snapshot.  This should fully restore the
+# original (pre-update) values.
+$node->stop;
+$node->revert_to_snapshot('snap1');
+
+# Restart the node and verify only old values remain
+$node->start;
+($old, $new, $oldtoast, $newtoast) = counts();
+is($old, 10000, "Full snapshot restores all the old main values");
+is($oldtoast, 10000, "Full snapshot restores all the old toast values");
+is($new, 0, "Full snapshot reverts all the new main values");
+is($newtoast, 0, "Full snapshot reverts all the new toast values");
+
+# Restore t1 fully to the second snapshot.  This should fully restore the
+# new (post-update) values.
+$node->stop;
+$node->revert_to_snapshot('snap2');
+
+# Restart the node and verify only new values remain
+$node->start;
+($old, $new, $oldtoast, $newtoast) = counts();
+is($old, 0, "Full snapshot reverts all the old main values");
+is($oldtoast, 0, "Full snapshot reverts all the old toast values");
+is($new, 10000, "Full snapshot restores all the new main values");
+is($newtoast, 10000, "Full snapshot restores all the new toast values");
+
+sub counts {
+	return map {
+		$node->safe_psql('postgres', qq(select count(*) from t1 where $_))
+	} ("short_text = 'foo'",
+	   "short_text = 'bar'",
+	   "long_text ~ 'x'",
+	   "long_text ~ 'y'");
+}
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 9667f7667e..5402d020f1 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -2225,6 +2225,271 @@ sub pg_recvlogical_upto
 
 =back
 
+=head1 DATABASE CORRUPTION METHODS
+
+=over
+
+=item $node->relfile_snapshot_repository()
+
+The path to the parent directory of all directories storing snapshots of
+relation backing files.
+
+=cut
+
+sub relfile_snapshot_repository
+{
+	my ($self) = @_;
+	my $snaprepo = join('/', $self->basedir, 'snapshot');
+	unless (-d $snaprepo)
+	{
+		mkdir $snaprepo
+			or $!{EEXIST}
+			or BAIL_OUT("could not create snapshot repository directory \"$snaprepo\": $!");
+	}
+	return $snaprepo;
+}
+
+=pod
+
+=item $node->relfile_snapshot_directory(snapname)
+
+The path to the directory for storing the named snapshot.
+
+=cut
+
+sub relfile_snapshot_directory
+{
+	my ($self, $snapname) = @_;
+
+	join("/", $self->relfile_snapshot_repository(), $snapname);
+}
+
+=pod
+
+=item $node->take_relfile_snapshot($self, $dbname, $snapname, @relnames)
+
+Makes a copy of the files backing the relations B<@relnames>, the associated
+toast relations (if any), and all associated indexes (if any).  No attempt is
+made to flush these files to disk, meaning the snapshot taken could be stale
+unless the caller ensures these files have been flushed prior to calling.
+
+Dies on failure to invoke psql.
+
+Dies on missing relations.
+
+Dies if the given B<$snapname> is already in use.
+
+=cut
+
+=pod
+
+=item $node->take_relfile_snapshot_minimal($self, $dbname, $snapname, @relnames)
+
+Makes a copy of the files backing the relations B<@relnames>.  No attempt is made
+to flush these files to disk, meaning the snapshot taken could be stale unless the
+caller ensures these files have been flushed prior to calling.
+
+Dies on failure to invoke psql.
+
+Dies on missing relation.
+
+Dies if the given B<$snapname> is already in use.
+
+=cut
+
+sub take_relfile_snapshot
+{
+	my ($self, $dbname, $snapname, @relnames) = @_;
+	$self->take_relfile_snapshot_helper($dbname, $snapname, 1, @relnames);
+}
+
+sub take_relfile_snapshot_minimal
+{
+	my ($self, $dbname, $snapname, @relnames) = @_;
+	$self->take_relfile_snapshot_helper($dbname, $snapname, 0, @relnames);
+}
+
+sub take_relfile_snapshot_helper
+{
+	my ($self, $dbname, $snapname, $extended, @relnames) = @_;
+
+	croak "dbname must be specified" unless defined $dbname;
+	croak "relnames must be defined" unless scalar(grep { defined $_ } @relnames);
+	croak "snapname must be specified" unless defined $snapname;
+	croak "snapname must be unique" if exists $self->{snapshot}->{$snapname};
+
+	my $pgdata = $self->data_dir;
+	my $snapdir = $self->relfile_snapshot_directory($snapname);
+	croak "snapname directory name already in use: $snapdir" if (-e $snapdir);
+	mkdir $snapdir
+		or BAIL_OUT("could not create snapshot directory \"$snapdir\": $!");
+
+	my @relpaths = map {
+		$self->safe_psql($dbname,
+			qq(SELECT pg_relation_filepath('$_')));
+	} @relnames;
+
+	my (@toastpaths, @idxpaths);
+	if ($extended)
+	{
+		for my $relname (@relnames)
+		{
+			push (@toastpaths, grep /\w/, split(/(?:\s*\r?\n\s*)+/, $self->safe_psql($dbname,
+				qq(SELECT pg_relation_filepath(c.reltoastrelid)
+					FROM pg_catalog.pg_class c
+					WHERE c.oid = '$relname'::regclass
+					AND c.reltoastrelid != 0::oid))));
+			push (@idxpaths, grep /\w/, split(/(?:\s*\r?\n\s*)+/, $self->safe_psql($dbname,
+				qq(SELECT pg_relation_filepath(i.indexrelid)
+					FROM pg_catalog.pg_index i
+					WHERE i.indrelid = '$relname'::regclass))));
+		}
+	}
+
+	$self->{snapshot}->{$snapname} = {};
+	for my $path (@relpaths, grep { defined($_) } @toastpaths, @idxpaths)
+	{
+		croak "file backing relation is missing: $pgdata/$path" unless -f "$pgdata/$path";
+		copy_file($snapdir, $pgdata, 0, $path);
+		$self->{snapshot}->{$snapname}->{$path} = 1;
+	}
+}
+
+=pod
+
+=item $node->revert_to_snapshot($self, $snapname)
+
+Overwrites the database's relation files with files previously saved in
+B<$snapname>.
+
+Dies if the given B<$snapname> does not exist.
+
+=cut
+
+=pod
+
+=item $node->revert_to_torn_relfile_snapshot($self, $snapname, $bytes)
+
+Partially overwrites the database's relation files using prefixes of the given
+number of bytes from the files saved in B<$snapname>.  If B<$bytes> is
+negative, uses suffixes of the given byte length rather than prefixes.
+
+If B<$bytes> is undef, replaces the database's relation files entirely with
+the files saved in B<$snapname>; unlike with defined values, this means a
+file may become shorter if the saved file is shorter than the current file.
+
+=cut
+
+sub revert_to_snapshot
+{
+	my ($self, $snapname) = @_;
+	$self->revert_to_torn_relfile_snapshot($snapname, undef);
+}
+
+sub revert_to_torn_relfile_snapshot
+{
+	my ($self, $snapname, $bytes) = @_;
+
+	croak "no such snapshot" unless exists $self->{snapshot}->{$snapname};
+
+	my $pgdata = $self->data_dir;
+	my $snaprepo = join('/', $self->relfile_snapshot_repository, $snapname);
+	croak "snapname directory missing: $snaprepo" unless (-d $snaprepo);
+
+	if (defined $bytes)
+	{
+		tear_file($pgdata, $snaprepo, $bytes, $_)
+			for (keys %{$self->{snapshot}->{$snapname}});
+	}
+	else
+	{
+		copy_file($pgdata, $snaprepo, 1, $_)
+			for (keys %{$self->{snapshot}->{$snapname}});
+	}
+}
+
+sub copy_file
+{
+	my ($dstdir, $srcdir, $overwrite, $path) = @_;
+
+	croak "No such directory: $dstdir" unless -d $dstdir;
+	croak "No such directory: $srcdir" unless -d $srcdir;
+
+	foreach my $part (split(m{/}, $path))
+	{
+		my $srcpart = "$srcdir/$part";
+		my $dstpart = "$dstdir/$part";
+
+		if (-d $srcpart)
+		{
+			$srcdir = $srcpart;
+			$dstdir = $dstpart;
+			die "$dstdir is in the way" if (-e $dstdir && ! -d $dstdir);
+			unless (-d $dstdir)
+			{
+				mkdir $dstdir
+					or BAIL_OUT("could not create directory \"$dstdir\": $!");
+			}
+		}
+		elsif (-f $srcpart)
+		{
+			die "$dstdir/$part is in the way" if (!$overwrite && -e "$dstdir/$part");
+
+			File::Copy::copy($srcpart, "$dstdir/$part");
+		}
+	}
+}
+
+sub tear_file
+{
+	my ($dstdir, $srcdir, $bytes, $path) = @_;
+
+	croak "No such directory: $dstdir" unless -d $dstdir;
+	croak "No such directory: $srcdir" unless -d $srcdir;
+
+	my $srcfile = "$srcdir/$path";
+	my $dstfile = "$dstdir/$path";
+
+	croak "No such file: $srcfile" unless -f $srcfile;
+	croak "No such file: $dstfile" unless -f $dstfile;
+
+	my ($srcfh, $dstfh);
+	open($srcfh, '<', $srcfile) or die "Cannot read $srcfile: $!";
+	open($dstfh, '+<', $dstfile) or die "Cannot modify $dstfile: $!";
+	binmode($srcfh);
+	binmode($dstfh);
+
+	my $buffer;
+	if ($bytes < 0)
+	{
+		$bytes *= -1;		# Easier to use positive value
+		my $srcsize = (stat($srcfh))[7];
+		my $offset = $srcsize - $bytes;
+		seek($srcfh, $offset, 0) or die "seek failed: $!";
+		seek($dstfh, $offset, 0) or die "seek failed: $!";
+		defined(sysread($srcfh, $buffer, $bytes))
+			or die "sysread failed: $!";
+		defined(syswrite($dstfh, $buffer, $bytes))
+			or die "syswrite failed: $!";
+	}
+	else
+	{
+		seek($srcfh, 0, 0) or die "seek failed: $!";
+		seek($dstfh, 0, 0) or die "seek failed: $!";
+		defined(sysread($srcfh, $buffer, $bytes))
+			or die "sysread failed: $!";
+		defined(syswrite($dstfh, $buffer, $bytes))
+			or die "syswrite failed: $!";
+	}
+
+	close($srcfh);
+	close($dstfh);
+}
+
+=pod
+
+=back
+
 =cut
 
 1;
-- 
2.21.1 (Apple Git-122.3)

#14Peter Eisentraut
peter.eisentraut@enterprisedb.com
In reply to: Mark Dilger (#13)
Re: pg_amcheck contrib application

(I'm still not a fan of adding more client-side tools whose sole task is
to execute server-side functionality in a slightly filtered way, but it
seems people are really interested in this, so ...)

I want to register, if we are going to add this, it ought to be in
src/bin/. If we think it's a useful tool, it should be there with all
the other useful tools.

I realize there is a dependency on a module in contrib, and it's
probably now not the time to re-debate reorganizing contrib. But if we
ever get to that, this program should be the prime example why the
current organization is problematic, and we should be prepared to make
the necessary moves then.

#15Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Peter Eisentraut (#14)
Re: pg_amcheck contrib application

11 марта 2021 г., в 13:12, Peter Eisentraut <peter.eisentraut@enterprisedb.com> написал(а):

client-side tools whose sole task is to execute server-side functionality in a slightly filtered way

By the way, can we teach pg_amcheck to verify a database without creating a local PGDATA and using only a bare minimum of file system quota?

We can implement a way for pg_amcheck to ask for some specific file, which would be downloaded by a backup tool and streamed to pg_amcheck.
E.g. pg_amcheck could have a restore_file_command = 'backup-tool bring-my-file %backup_id %file_name' and probably a list_files_command = 'backup-tool list-files %backup_id'. pg_amcheck could then fetch the bare minimum of what is needed.

I see that this is somewhat orthogonal idea, but from my POV interesting one.

Best regards, Andrey Borodin.

#16Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Peter Eisentraut (#14)
Re: pg_amcheck contrib application

On Mar 11, 2021, at 12:12 AM, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:

(I'm still not a fan of adding more client-side tools whose sole task is to execute server-side functionality in a slightly filtered way, but it seems people are really interested in this, so ...)

I want to register, if we are going to add this, it ought to be in src/bin/. If we think it's a useful tool, it should be there with all the other useful tools.

I considered putting it in src/bin/scripts where reindexdb and vacuumdb also live. It seems most similar to those two tools.

I realize there is a dependency on a module in contrib, and it's probably now not the time to re-debate reorganizing contrib. But if we ever get to that, this program should be the prime example why the current organization is problematic, and we should be prepared to make the necessary moves then.

Before settling on contrib/pg_amcheck as the location, I checked whether any tools under src/bin had dependencies on a contrib module, and couldn't find any current examples. (There seems to have been one in the past, though I forget at the moment which one it was.)

I have no argument with changing the location of this tool before it gets committed, but I wonder if we should do that now, or wait until some future time when contrib gets reorganized? I can't quite tell which you prefer from your comments above.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#17Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Andrey Borodin (#15)
Re: pg_amcheck contrib application

On Mar 11, 2021, at 3:36 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

11 марта 2021 г., в 13:12, Peter Eisentraut <peter.eisentraut@enterprisedb.com> написал(а):

client-side tools whose sole task is to execute server-side functionality in a slightly filtered way

By the way, can we teach pg_amcheck to verify a database without creating a local PGDATA and using only a bare minimum of file system quota?

pg_amcheck does not need a local data directory to check a remote database server, though it does need to connect to that server. The local file system quota should not be a problem, as pg_amcheck does not download and save any data to disk. I am uncertain if this answers your question. If you are imagining pg_amcheck running on the same server as the database cluster, then of course running pg_amcheck puts a burden on the server to read all the relation files necessary, much as running queries over the same relations would do.

We can implement a way for pg_amcheck to ask for some specific file, which would be downloaded by a backup tool and streamed to pg_amcheck.
E.g. pg_amcheck could have a restore_file_command = 'backup-tool bring-my-file %backup_id %file_name' and probably a list_files_command = 'backup-tool list-files %backup_id'. pg_amcheck could then fetch the bare minimum of what is needed.

I see that this is somewhat orthogonal idea, but from my POV interesting one.

pg_amcheck is not designed to detect corruption directly, but rather to open one or more connections to the database and execute sql queries which employ the contrib/amcheck sql functions.
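To make that concrete, the queries such a driver sends are thin wrappers around the amcheck functions. The sketch below is illustrative only: the function names verify_heapam() and bt_index_check() come from contrib/amcheck, but the query-building helpers (heap_check_query, btree_check_query) are invented here and are not pg_amcheck's actual implementation.

```python
# Hypothetical sketch of the kind of amcheck queries a client-side
# driver like pg_amcheck sends over its connections.  The amcheck
# function names are real; the query builders are not pg_amcheck's code.

def heap_check_query(reloid: int, on_error_stop: bool = False) -> str:
    # verify_heapam() returns one row per corruption it detects.
    return (
        "SELECT blkno, offnum, attnum, msg "
        f"FROM verify_heapam(relation := {reloid}, "
        f"on_error_stop := {'true' if on_error_stop else 'false'})"
    )

def btree_check_query(indexoid: int, heapallindexed: bool = False) -> str:
    # bt_index_check() raises an ERROR on corruption rather than
    # returning rows; --heapallindexed maps onto its second argument.
    return (
        f"SELECT bt_index_check(index := {indexoid}, "
        f"heapallindexed := {'true' if heapallindexed else 'false'})"
    )
```

The tool's job is then scheduling these queries across parallel connections and reporting the rows and errors they produce.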


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#14)
Re: pg_amcheck contrib application

On Thu, Mar 11, 2021 at 3:12 AM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:

(I'm still not a fan of adding more client-side tools whose sole task is
to execute server-side functionality in a slightly filtered way, but it
seems people are really interested in this, so ...)

I want to register, if we are going to add this, it ought to be in
src/bin/. If we think it's a useful tool, it should be there with all
the other useful tools.

I think this provides a pretty massive gain in usability. If you
wanted to check all of your tables and btree indexes without this, or
worse yet some subset of them that satisfied certain criteria, it
would be a real nuisance. You don't want to run all of the check
commands in a single transaction, because that keeps snapshots open,
and there's a good chance you do want to use parallelism. Even if you
ignore all that, the output you're going to get from running the
queries individually in psql is not going to be easy to sort through,
whereas the tool is going to distill that down to what you really need
to know.

Perhaps we should try to think of some way that some of these tools
could be unified, since it does seem a bit silly to have reindexdb,
vacuumdb, and pg_amcheck all as separate commands basically doing the
same kind of thing but for different maintenance operations, but I
don't think getting rid of them entirely is the way - and I don't
think that unifying them is a v14 project.

I also had the thought that maybe this should go in src/bin, because I
think this is going to be awfully handy for a lot of people. However,
I don't think there's a rule that binaries can't go in contrib --
oid2name and vacuumlo are existing precedents. But I guess that's only
2 out of quite a large number of binaries that we ship, so maybe it's
best not to add to it, especially for a tool which I at least suspect
is going to get a lot more use than either of those.

Anyone else want to vote for or against moving this to src/bin?

--
Robert Haas
EDB: http://www.enterprisedb.com

#19Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#13)
Re: pg_amcheck contrib application

On Wed, Mar 10, 2021 at 11:02 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

The documentation says that -D "does not exclude any database that was
listed explicitly as dbname on the command line, nor does it exclude
the database chosen in the absence of any dbname argument." The first
part of this makes complete sense to me, but I'm not sure about the
second part. If I type pg_amcheck --all -D 'r*', I think I'm expecting
that "rhaas" won't be checked. Likewise, if I say pg_amcheck -d
'bob*', I think I only want to check the bob-related databases and not
rhaas.

I think it's a tricky definitional problem. I'll argue the other side for the moment:

If you say `pg_amcheck bob`, I think it is fair to assume that "bob" gets checked. If you say `pg_amcheck bob -d="b*" -D="bo*"`, it is fair to expect all databases starting with /b/ to be checked, except those starting with /bo/; but since you *explicitly* asked for "bob", "bob" still gets checked. We both agree on this point, I think.

+1.

If you say `pg_amcheck --maintenance-db=bob -d="b*" -D="bo*"`, you don't expect "bob" to get checked, even though it was explicitly stated.

I expect that specifying --maintenance-db has zero effect on what gets
checked. The only thing that should do is tell me which database to
use to get the list of databases that I am going to check, just in
case the default is unsuitable and will fail.

If you are named "bob", and run `pg_amcheck`, you expect it to get your name "bob" from the environment, and check database "bob". It's implicit rather than explicit, but that doesn't change what you expect to happen. It's just a short-hand for saying `pg_amcheck bob`.

+1.

Saying that `pg_amcheck -d="b*" -D="bo*"` should not check "bob" implies that the database being retrieved from the environment is acting like a maintenance-db. But that's not how it is treated when you just say `pg_amcheck` with no arguments. I think treating it as a maintenance-db in some situations but not in others is strangely non-orthogonal.

I don't think I agree with this. A maintenance DB in my mind doesn't
mean "a database we're not actually checking," but rather "a database
that we're using to get a list of other databases."

TBH, I guess I actually don't know why we ever treat a bare
command-line argument as a maintenance DB. I probably wouldn't do
that. We should only need a maintenance DB if we need to query for a
list of database to check, and if the user has explicitly named the
database to check, then we do not need to do that... unless they've
also done something like -D or -d, but then the explicitly-specified
database name is playing a double role. It is both one of the
databases we will check, and also the database we will use to figure
out what other databases to check. I think that's why this seems
non-orthogonal.

Here's my proposal:

1. If there are options present which require querying for a list of
databases (e.g. --all, -d, -D) then use connectMaintenanceDatabase()
and go figure out what they mean. The cparams passed to that function
are only affected by the use of --maintenance-db, not by any bare
command line arguments. If there are no arguments present which
require querying for a list of databases, then --maintenance-db has no
effect.

2. If there is a bare command line argument, add the named database to
the list of databases to be checked. This might be empty if no
relevant options were specified in step 1, or if those options matched
nothing. It might be a noop if the named database was already selected
by the options mentioned in step 1.

3. If there were no options present which required querying for a list
of databases, and if there is also no bare command line argument, then
default to the checking whatever database we connect to by default.

With this approach, --maintenance-db only ever affects how we get the
list of databases to check, and a bare command-line argument only ever
specifies a database to be checked. That seems cleaner.

An alternate possibility would be to say that there should only ever
be EITHER a bare command-line argument OR options that require
querying for a list of databases OR neither BUT NOT both. Then it's
simple:

0. If you have both options which require querying for a list of
databases and also a bare database name, error and die.
1. As above.
2. As above except the only possibility is now increasing the list of
target databases from length 0 to length 1.
3. As above.

--
Robert Haas
EDB: http://www.enterprisedb.com

#20Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Mark Dilger (#17)
Re: pg_amcheck contrib application

11 марта 2021 г., в 20:30, Mark Dilger <mark.dilger@enterprisedb.com> написал(а):

pg_amcheck does not need a local data directory to check a remote database server, though it does need to connect to that server.

No, I mean it would be great if we did not need to materialise the whole DB anywhere. Let's say I have a backup of a 10Tb cluster in S3, and don't have that cluster's hardware anymore. I want to spawn a tiny VM with a few GiBs of RAM and storage no larger than the biggest index within the DB + WAL from start to end, and stream-check the whole backup, mark it safe, and sleep well. It would be perfect if we could do backup verification at the cost of corruption monitoring (and not vice versa, which is trivial).

Best regards, Andrey Borodin.

#21Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Andrey Borodin (#20)
Re: pg_amcheck contrib application

On Mar 11, 2021, at 9:10 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

11 марта 2021 г., в 20:30, Mark Dilger <mark.dilger@enterprisedb.com> написал(а):

pg_amcheck does not need a local data directory to check a remote database server, though it does need to connect to that server.

No, I mean it would be great if we did not need to materialise the whole DB anywhere. Let's say I have a backup of a 10Tb cluster in S3, and don't have that cluster's hardware anymore. I want to spawn a tiny VM with a few GiBs of RAM and storage no larger than the biggest index within the DB + WAL from start to end, and stream-check the whole backup, mark it safe, and sleep well. It would be perfect if we could do backup verification at the cost of corruption monitoring (and not vice versa, which is trivial).

Thanks for clarifying. I agree that would be useful. I don't see any way to make that part of this project, but maybe after the v14 cycle you'll look over the code and propose a way forward for that?


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#22Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#13)
Re: pg_amcheck contrib application

On Wed, Mar 10, 2021 at 11:02 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

[ new patches ]

Seems like this is mostly ready to commit now, modulo exactly what to
do about the maintenance DB stuff, and whether to move it to src/bin.
Since neither of those affects 0001, I went ahead and committed that
part.

--
Robert Haas
EDB: http://www.enterprisedb.com

#23Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#19)
1 attachment(s)
Re: pg_amcheck contrib application

On Thu, Mar 11, 2021 at 11:09 AM Robert Haas <robertmhaas@gmail.com> wrote:

An alternate possibility would be to say that there should only ever
be EITHER a bare command-line argument OR options that require
querying for a list of databases OR neither BUT NOT both. Then it's
simple:

0. If you have both options which require querying for a list of
databases and also a bare database name, error and die.
1. As above.
2. As above except the only possibility is now increasing the list of
target databases from length 0 to length 1.
3. As above.

Here's a proposed incremental patch, applying on top of your last
version, that describes the above behavior, plus makes a lot of other
changes to the documentation that seemed like good ideas to me. Your
mileage may vary, but I think this version is substantially more
concise than what you have while basically containing the same
information.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

doc-hacking.patchapplication/octet-stream; name=doc-hacking.patchDownload
diff --git a/doc/src/sgml/pgamcheck.sgml b/doc/src/sgml/pgamcheck.sgml
index b960c47305..062e1e7a3f 100644
--- a/doc/src/sgml/pgamcheck.sgml
+++ b/doc/src/sgml/pgamcheck.sgml
@@ -42,6 +42,25 @@
    relation types are silently skipped.
   </para>
 
+  <para>
+   If <literal>dbname</literal> is specified, it should be the name of a
+   single database to check, and no other database selection options should
+   be present. Otherwise, if any database selection options are present,
+   all matching databases will be checked. If no such options are present,
+   the default database will be checked. Database selection options include
+   <option>--all</option>, <option>--database</option> and
+   <option>--exclude-database</option>. They also include
+   <option>--relation</option>, <option>--exclude-relation</option>,
+   <option>--table</option>, <option>--exclude-table</option>,
+   <option>--index</option>, and <option>--exclude-index</option>,
+   but only when such options are used with a three-part pattern
+   (e.g. <option>mydb*.myschema*.myrel*</option>).
+  </para>
+
+  <para>
+   <replaceable>dbname</replaceable> can also be a
+   <link linkend="libpq-connstring">connection string</link>.
+  </para>
  </refsect1>
 
  <refsect1>
@@ -51,47 +70,13 @@
    pg_amcheck accepts the following command-line arguments:
 
    <variablelist>
-
-    <varlistentry>
-     <term><option><replaceable class="parameter">dbname</replaceable></option></term>
-     <listitem>
-      <para>
-       Specifies the name of a database to be checked, or a connection string
-       to use while connecting.
-      </para>
-      <para>
-       If no <replaceable>dbname</replaceable> is specified, and if
-       <option>-a</option> <option>--all</option> is not used, the database name
-       is read from the environment variable <envar>PGDATABASE</envar>.  If
-       that is not set, the user name specified for the connection is used.
-       The <replaceable>dbname</replaceable> can be a <link
-       linkend="libpq-connstring">connection string</link>.  If so, connection
-       string parameters will override any conflicting command line options,
-       and connection string parameters other than the database
-       name itself will be re-used when connecting to other databases.
-      </para>
-      <para>
-       If a connection string is given which contains no database name, the other
-       parameters of the string will be used while the database name to use is
-       determined as described above.
-      </para>
-     </listitem>
-    </varlistentry>
-
     <varlistentry>
      <term><option>-a</option></term>
      <term><option>--all</option></term>
        <listitem>
       <para>
-       Perform checking in all databases which are not otherwise excluded.
-      </para>
-      <para>
-       In the absence of any other options, selects all objects across all
-       schemas and databases.
-      </para>
-      <para>
-       Option <option>-D</option> <option>--exclude-database</option> takes
-       precedence over <option>-a</option> <option>--all</option>.
+       Check all databases, except for any excluded via
+       <option>--exclude-database</option>.
       </para>
      </listitem>
     </varlistentry>
@@ -101,22 +86,10 @@
      <term><option>--database=<replaceable class="parameter">pattern</replaceable></option></term>
      <listitem>
       <para>
-       Perform checking in databases matching the specified
-       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>
-       that are not otherwise excluded.
-      </para>
-      <para>
-       This option may be specified multiple times to list more than one
-       pattern.  By default, all objects in all matching databases will be
-       checked.
-      </para>
-      <para>
-       If <option>-a</option> <option>--all</option> is also specified,
-       <option>-d</option> <option>--database</option> has no effect.
-      </para>
-      <para>
-       Option <option>-D</option> <option>--exclude-database</option> takes
-       precedence over <option>-d</option> <option>--database</option>.
+       Check databases matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>,
+       except for any excluded by <option>--exclude-database</option>.
+       This option can be specified more than once.
       </para>
      </listitem>
     </varlistentry>
@@ -126,20 +99,9 @@
      <term><option>--exclude-database=<replaceable class="parameter">pattern</replaceable></option></term>
      <listitem>
       <para>
-       Do not include databases matching other patterns or included by option
-       <option>-a</option> <option>--all</option> if they also match the
-       specified exclusion
+       Exclude databases matching the given
        <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>.
-      </para>
-      <para>
-       This does not exclude any database that was listed explicitly as a
-       <replaceable>dbname</replaceable> on the command line, nor does it exclude
-       the database chosen in the absence of any
-       <replaceable>dbname</replaceable> argument.
-      </para>
-      <para>
-       This option may be specified multiple times to list more than one
-       exclusion pattern.
+       This option can be specified more than once.
       </para>
      </listitem>
     </varlistentry>
@@ -149,8 +111,7 @@
      <term><option>--echo</option></term>
      <listitem>
       <para>
-       Print to stdout all commands and queries being executed against the
-       server.
+      Echo to stdout all SQL sent to the server.
       </para>
      </listitem>
     </varlistentry>
@@ -159,21 +120,13 @@
      <term><option>--endblock=<replaceable class="parameter">block</replaceable></option></term>
      <listitem>
       <para>
-       Check table blocks up to and including the specified ending block
-       number.  An error will occur if the table relation being checked has
-       fewer than this number of blocks.
-      </para>
-      <para>
-       By default, checking is performed up to and including the final block.
-       This option will be applied to all table relations that are checked,
-       including toast tables, but note that unless
-       <option>--exclude-toast-pointers</option> is given, toast pointers found
-       in the main table will be followed into the toast table without regard
-       to the location in the toast table.
-      </para>
-      <para>
+       End checking at the specified block number.  An error will occur if the
+       table relation being checked has fewer than this number of blocks.
        This option does not apply to indexes, and is probably only useful when
-       checking a single table relation.
+       checking a single table relation. If both a regular table and a toast
+       table are checked, this option will apply to both, but higher-numbered
+       toast blocks may still be accessed while validating toast pointers,
+       unless that is suppressed using <option>--exclude-toast-pointers</option>.
       </para>
      </listitem>
     </varlistentry>
@@ -182,21 +135,10 @@
      <term><option>--exclude-toast-pointers</option></term>
      <listitem>
       <para>
-       When checking main relations, do not look up entries in toast tables
-       corresponding to toast pointers in the main relation.
-      </para>
-      <para>
-       The default behavior checks each toast pointer encountered in the main
-       table to verify, as much as possible, that the pointer points at
-       something in the toast table that is reasonable.  Toast pointers which
-       point beyond the end of the toast table, or to the middle (rather than
-       the beginning) of a toast entry, are identified as corrupt.
-      </para>
-      <para>
-       The process by which <xref linkend="amcheck"/>'s
-       <function>verify_heapam</function> function checks each toast pointer is
-       slow and may be improved in a future release.  Some users may wish to
-       disable this check to save time.
+       By default, whenever a toast pointer is encountered in a table,
+       a lookup is performed to ensure that it references apparently-valid
+       entries in the toast table. These checks can be quite slow, and this
+       option can be used to skip them.
       </para>
      </listitem>
     </varlistentry>
@@ -240,13 +182,14 @@
      <term><option>--index=<replaceable class="parameter">pattern</replaceable></option></term>
      <listitem>
       <para>
-       Perform checks on indexes which match the specified
-       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>
+       Check indexes matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>,
        unless they are otherwise excluded.
+       This option can be specified more than once.
       </para>
       <para>
-       This is similar to the <option>-r</option> <option>--relation</option>
-       option, except that it applies only to indexes, not tables.
+       This is similar to the <option>--relation</option> option, except that
+       it applies only to indexes, not tables.
       </para>
      </listitem>
     </varlistentry>
@@ -256,13 +199,13 @@
      <term><option>--exclude-index=<replaceable class="parameter">pattern</replaceable></option></term>
      <listitem>
       <para>
-       Exclude checks on the indexes which match the specified
+       Exclude indexes matching the specified
        <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>.
+       This option can be specified more than once.
       </para>
       <para>
-       This is similar to the <option>-R</option>
-       <option>--exclude-relation</option> option, except that it applies only
-       to indexes, not tables.
+       This is similar to the <option>--exclude-relation</option> option,
+       except that it applies only to indexes, not tables.
       </para>
      </listitem>
     </varlistentry>
@@ -273,7 +216,7 @@
      <listitem>
       <para>
        Use <replaceable>num</replaceable> concurrent connections to the server,
-       or one per object to be checked, whichever number is smaller.
+       or one per object to be checked, whichever is less.
       </para>
       <para>
        The default is to use a single connection.
@@ -285,20 +228,17 @@
      <term><option>--maintenance-db=<replaceable class="parameter">dbname</replaceable></option></term>
      <listitem>
       <para>
-       Specifies the name of the database to connect to to discover which
-       databases should be checked, when
-       <option>-a</option>/<option>--all</option> is used.  If not specified,
-       the <literal>postgres</literal> database will be used, or if that does
-       not exist, <literal>template1</literal> will be used.  This can be a
-       <link linkend="libpq-connstring">connection string</link>.  If so,
-       connection string parameters will override any conflicting command line
-       options.  Also, connection string parameters other than the database
-       name itself will be re-used when connecting to other databases.
-      </para>
-      <para>
-       If a connection string is given which contains no database name, the other
-       parameters of the string will be used while the database name to use is
-       determined as described above.
+       Specifies a database or
+       <link linkend="libpq-connstring">connection string</link> to be
+       used to discover the list of databases to be checked. If neither
+       <option>--all</option> nor any option including a database pattern is
+       used, no such connection is required and this option does nothing.
+       Otherwise, any connection string parameters other than
+       the database name which are included in the value for this option
+       will also be used when connecting to the databases
+       being checked. If this option is omitted, the default is
+       <literal>postgres</literal> or, if that fails,
+       <literal>template1</literal>.
       </para>
      </listitem>
     </varlistentry>
@@ -307,14 +247,10 @@
      <term><option>--no-dependent-indexes</option></term>
      <listitem>
       <para>
-       When including a table relation in the list of relations to check, do
-       not automatically include btree indexes associated with table. 
-      </para>
-      <para>
-       By default, all tables to be checked will also have checks performed on
-       their associated btree indexes, if any.  If this option is given, only
-       those indexes which match a <option>--relation</option> or
-       <option>--index</option> pattern will be checked.
+       By default, if a table is checked, any btree indexes of that table
+       will also be checked, even if they are not explicitly selected by
+       an option such as <literal>--index</literal> or
+       <literal>--relation</literal>. This option suppresses that behavior.
       </para>
      </listitem>
     </varlistentry>
@@ -323,17 +259,12 @@
      <term><option>--no-strict-names</option></term>
      <listitem>
       <para>
-       When calculating the list of databases to check, and the objects within
-       those databases to be checked, do not raise an error for database,
-       schema, relation, table, or index inclusion patterns which match no
-       corresponding objects.
-      </para>
-      <para>
-       Exclusion patterns are not required to match any objects, but by
-       default unmatched inclusion patterns raise an error, including when
-       they fail to match as a result of an exclusion pattern having
-       prohibited them matching an existent object, and when they fail to
-       match a database because it is unconnectable (datallowconn is false).
+       By default, if an argument to <literal>--database</literal>,
+       <literal>--table</literal>, <literal>--index</literal>,
+       or <literal>--relation</literal> matches no objects, it is a fatal
+       error. This option downgrades that error to a warning.
+       If this option is used with <literal>--quiet</literal>, the warning
+       will be suppressed as well.
       </para>
      </listitem>
     </varlistentry>
@@ -342,14 +273,10 @@
      <term><option>--no-dependent-toast</option></term>
      <listitem>
       <para>
-       When including a table relation in the list of relations to check, do
-       not automatically include toast tables associated with table. 
-      </para>
-      <para>
-       By default, all tables to be checked will also have checks performed on
-       their associated toast tables, if any.  If this option is given, only
-       those toast tables which match a <option>--relation</option> or
-       <option>--table</option> pattern will be checked.
+       By default, if a table is checked, its toast table, if any, will also
+       be checked, even if it is not explicitly selected by an option
+       such as <literal>--table</literal> or <literal>--relation</literal>.
+       This option suppresses that behavior.
       </para>
      </listitem>
     </varlistentry>
@@ -359,7 +286,7 @@
      <listitem>
       <para>
        After reporting all corruptions on the first page of a table where
-       corruptions are found, stop processing that table relation and move on
+       corruption is found, stop processing that table relation and move on
        to the next table or index.
       </para>
       <para>
@@ -402,7 +329,11 @@
      <term><option>--progress</option></term>
      <listitem>
       <para>
-       Show progress information about how many relations have been checked.
+       Show progress information. Progress information includes the number
+       of relations for which checking has been completed, and the total
+       size of those relations. It also includes the total number of relations
+       that will eventually be checked, and the estimated size of those
+       relations.
       </para>
      </listitem>
     </varlistentry>
@@ -412,11 +343,7 @@
      <term><option>--quiet</option></term>
      <listitem>
       <para>
-       Do not write additional messages beyond those about corruption.
-      </para>
-      <para>
-       This option does not quiet any output specifically due to the use of
-       the <option>-e</option> <option>--echo</option> option.
+       Print fewer messages, and less detail regarding any server errors.
       </para>
      </listitem>
     </varlistentry>
@@ -426,27 +353,18 @@
      <term><option>--relation=<replaceable class="parameter">pattern</replaceable></option></term>
      <listitem>
       <para>
-       Perform checking on all relations matching the specified
-       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>
+       Check relations matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>,
        unless they are otherwise excluded.
+       This option can be specified more than once.
       </para>
       <para>
-       This option may be specified multiple times to list more than one
-       pattern.
-      </para>
-      <para>
-       Patterns may be unqualified, or they may be schema-qualified or
-       database- and schema-qualified, such as
-       <literal>"my*relation"</literal>,
-       <literal>"my*schema*.my*relation*"</literal>, or
-       <literal>"my*database.my*schema.my*relation</literal>.  There is no
-       problem specifying relation patterns that match databases that are not
-       otherwise included, as the relation in the matching database will still
-       be checked.
-      </para>
-      <para>
-       The <option>-R</option> <option>--exclude-relation</option> option takes
-       precedence over <option>-r</option> <option>--relation</option>.
+       Patterns may be unqualified, e.g. <literal>myrel*</literal>, or they
+       may be schema-qualified, e.g. <literal>myschema*.myrel*</literal> or
+       database-qualified and schema-qualified, e.g.
+       <literal>mydb*.myschema*.myrel*</literal>. A database-qualified
+       pattern will add matching databases to the list of databases to be
+       checked.
       </para>
      </listitem>
     </varlistentry>
@@ -456,20 +374,15 @@
      <term><option>--exclude-relation=<replaceable class="parameter">pattern</replaceable></option></term>
      <listitem>
       <para>
-       Exclude checks on relations matching the specified
+       Exclude relations matching the specified
        <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>.
+       This option can be specified more than once.
       </para>
       <para>
        As with <option>-r</option> <option>--relation</option>, the
        <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link> may be unqualified, schema-qualified,
        or database- and schema-qualified.
       </para>
-      <para>
-       The <option>-R</option> <option>--exclude-relation</option> option takes
-       precedence over <option>-r</option> <option>--relation</option>,
-       <option>-t</option> <option>--table</option> and <option>-i</option>
-       <option>--index</option>.
-      </para>
      </listitem>
     </varlistentry>
 
@@ -500,26 +413,16 @@
      <term><option>--schema=<replaceable class="parameter">pattern</replaceable></option></term>
      <listitem>
       <para>
-       Perform checking in schemas matching the specified
-       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link> that are not otherwise excluded.
+       Check tables and indexes in schemas matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>, unless they are otherwise excluded.
+       This option can be specified more than once.
       </para>
       <para>
-       This option may be specified multiple times to list more than one
-       pattern for checking.  By default, all objects in all matching schema(s)
-       will be checked.
-      </para>
-      <para>
-       Option <option>-S</option> <option>--exclude-schema</option> takes
-       precedence over <option>-s</option> <option>--schema</option>.
-      </para>
-      <para>
-       Note that both tables and indexes are included using this option, which
-       might not be what you want if you are also using
-       <option>--no-dependent-indexes</option>.  To specify all tables in a
-       schema without also specifying all indexes, <option>--table</option> can
-       be used with a pattern that specifies the schema.  For example, to check
-       all tables in schema <literal>corp</literal>, the option
-       <literal>--table="corp.*"</literal> may be used.
+       To select only tables in schemas matching a particular pattern,
+       consider using something like
+       <literal>--table=SCHEMAPAT.* --no-dependent-indexes</literal>.
+       To select only indexes, consider using something like
+       <literal>--index=SCHEMAPAT.*</literal>.
       </para>
      </listitem>
     </varlistentry>
@@ -529,18 +432,9 @@
      <term><option>--exclude-schema=<replaceable class="parameter">pattern</replaceable></option></term>
      <listitem>
       <para>
-       Do not perform checking in schemas matching the specified
+       Exclude tables and indexes in schemas matching the specified
        <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>.
-      </para>
-      <para>
-       This option may be specified multiple times to list more than one
-       pattern for exclusion.
-      </para>
-      <para>
-       If a schema which is included using
-       <option>-s</option> <option>--schema</option> is also excluded using
-       <option>-S</option> <option>--exclude-schema</option>, the schema will
-       be excluded.
+       This option can be specified more than once.
       </para>
      </listitem>
     </varlistentry>
@@ -553,12 +447,12 @@
        will skip over pages in all tables that are marked as all frozen.
       </para>
       <para>
-       If <literal>"all-visible"</literal> is given, table corruption checks
+       If <literal>all-visible</literal> is given, table corruption checks
        will skip over pages in all tables that are marked as all visible.
       </para>
       <para>
        By default, no pages are skipped.  This can be specified as
-       <literal>"none"</literal>, but since this is the default, it need not be
+       <literal>none</literal>, but since this is the default, it need not be
        mentioned.
       </para>
      </listitem>
@@ -568,20 +462,11 @@
      <term><option>--startblock=<replaceable class="parameter">block</replaceable></option></term>
      <listitem>
       <para>
-       Check table blocks beginning with the specified block number.  An error
-       will occur if the table relation being checked has fewer than this number
-       of blocks.
-      </para>
-      <para>
-       By default, checking begins with block zero.  This option will be applied to all
-       table relations that are checked, including toast tables, but note
-       that unless <option>--exclude-toast-pointers</option> is given, toast
-       pointers found in the main table will be followed into the toast table
-       without regard to the location in the toast table.
-      </para>
-      <para>
-       This option does not apply to indexes, and is probably only useful when
-       checking a single table.
+       Start checking at the specified block number. An error will occur if
+       the table relation being checked has fewer than this number of blocks.
+       This option does not apply to indexes, and is probably only useful
+       when checking a single table relation. See <literal>--endblock</literal>
+       for further caveats.
       </para>
      </listitem>
     </varlistentry>
@@ -591,13 +476,14 @@
      <term><option>--table=<replaceable class="parameter">pattern</replaceable></option></term>
      <listitem>
       <para>
-       Perform checks on all tables matching the specified
-       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>
+       Check tables matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>,
        unless they are otherwise excluded.
+       This option can be specified more than once.
       </para>
       <para>
-       This is similar to the <option>-r</option> <option>--relation</option>
-       option, except that it applies only to tables, not indexes.
+       This is similar to the <option>--relation</option> option, except that
+       it applies only to tables, not indexes.
       </para>
      </listitem>
     </varlistentry>
@@ -607,13 +493,13 @@
      <term><option>--exclude-table=<replaceable class="parameter">pattern</replaceable></option></term>
      <listitem>
       <para>
-       Exclude checks on tables matching the specified
+       Exclude tables matching the specified
        <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>.
+       This option can be specified more than once.
       </para>
       <para>
-       This is similar to the <option>-R</option>
-       <option>--exclude-relation</option> option, except that it applies only
-       to tables, not indexes.
+       This is similar to the <option>--exclude-relation</option> option,
+       except that it applies only to tables, not indexes.
       </para>
      </listitem>
     </varlistentry>
@@ -633,8 +519,9 @@
      <term><option>--verbose</option></term>
      <listitem>
       <para>
-       Increases the log level verbosity.  This option may be given more than
-       once.
+       Print more messages. In particular, this will print a message for
+       each relation being checked, and will increase the level of detail
+       shown for server errors.
       </para>
      </listitem>
     </varlistentry>
#24Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#23)
1 attachment(s)
Re: pg_amcheck contrib application

On Mar 11, 2021, at 1:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 11, 2021 at 11:09 AM Robert Haas <robertmhaas@gmail.com> wrote:

An alternate possibility would be to say that there should only ever
be EITHER a bare command-line argument OR options that require
querying for a list of databases OR neither BUT NOT both. Then it's
simple:

0. If you have both options which require querying for a list of
databases and also a bare database name, error and die.
1. As above.
2. As above except the only possibility is now increasing the list of
target databases from length 0 to length 1.
3. As above.

Here's a proposed incremental patch, applying on top of your last
version, that describes the above behavior, plus makes a lot of other
changes to the documentation that seemed like good ideas to me. Your
mileage may vary, but I think this version is substantially more
concise than what you have while basically containing the same
information.

Your proposal is used in this next version of the patch, along with a resolution of the -D option handling discussed before, and a change to make the --schema and --exclude-schema options accept "database.schema" patterns as well as "schema" patterns. Previously the parameter was interpreted only as a schema, without treating embedded dots as separators, but that seems strangely inconsistent with the way all the other pattern options work, so I made it consistent. (I think the previous behavior was defensible, but harder to explain and perhaps less intuitive.)

Attachments:

v46-0001-Adding-contrib-module-pg_amcheck.patchapplication/octet-stream; name=v46-0001-Adding-contrib-module-pg_amcheck.patch; x-unix-mode=0644Download
From 81cf3aef373ea27b213e4da2c7735dc2cf232b20 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Tue, 2 Mar 2021 08:34:40 -0800
Subject: [PATCH v46] Adding contrib module pg_amcheck

Adding new contrib module pg_amcheck, which is a command line
interface for running amcheck's verifications against tables and
indexes.
---
 contrib/Makefile                           |    1 +
 contrib/pg_amcheck/.gitignore              |    3 +
 contrib/pg_amcheck/Makefile                |   29 +
 contrib/pg_amcheck/pg_amcheck.c            | 2134 ++++++++++++++++++++
 contrib/pg_amcheck/t/001_basic.pl          |    9 +
 contrib/pg_amcheck/t/002_nonesuch.pl       |  248 +++
 contrib/pg_amcheck/t/003_check.pl          |  504 +++++
 contrib/pg_amcheck/t/004_verify_heapam.pl  |  517 +++++
 contrib/pg_amcheck/t/005_opclass_damage.pl |   54 +
 doc/src/sgml/contrib.sgml                  |    1 +
 doc/src/sgml/filelist.sgml                 |    1 +
 doc/src/sgml/pgamcheck.sgml                |  600 ++++++
 src/tools/msvc/Install.pm                  |    2 +-
 src/tools/msvc/Mkvcbuild.pm                |    6 +-
 src/tools/pgindent/typedefs.list           |    5 +
 15 files changed, 4110 insertions(+), 4 deletions(-)
 create mode 100644 contrib/pg_amcheck/.gitignore
 create mode 100644 contrib/pg_amcheck/Makefile
 create mode 100644 contrib/pg_amcheck/pg_amcheck.c
 create mode 100644 contrib/pg_amcheck/t/001_basic.pl
 create mode 100644 contrib/pg_amcheck/t/002_nonesuch.pl
 create mode 100644 contrib/pg_amcheck/t/003_check.pl
 create mode 100644 contrib/pg_amcheck/t/004_verify_heapam.pl
 create mode 100644 contrib/pg_amcheck/t/005_opclass_damage.pl
 create mode 100644 doc/src/sgml/pgamcheck.sgml

diff --git a/contrib/Makefile b/contrib/Makefile
index f27e458482..a72dcf7304 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -30,6 +30,7 @@ SUBDIRS = \
 		old_snapshot	\
 		pageinspect	\
 		passwordcheck	\
+		pg_amcheck	\
 		pg_buffercache	\
 		pg_freespacemap \
 		pg_prewarm	\
diff --git a/contrib/pg_amcheck/.gitignore b/contrib/pg_amcheck/.gitignore
new file mode 100644
index 0000000000..c21a14de31
--- /dev/null
+++ b/contrib/pg_amcheck/.gitignore
@@ -0,0 +1,3 @@
+pg_amcheck
+
+/tmp_check/
diff --git a/contrib/pg_amcheck/Makefile b/contrib/pg_amcheck/Makefile
new file mode 100644
index 0000000000..bc61ee7970
--- /dev/null
+++ b/contrib/pg_amcheck/Makefile
@@ -0,0 +1,29 @@
+# contrib/pg_amcheck/Makefile
+
+PGFILEDESC = "pg_amcheck - detects corruption within database relations"
+PGAPPICON = win32
+
+PROGRAM = pg_amcheck
+OBJS = \
+	$(WIN32RES) \
+	pg_amcheck.o
+
+REGRESS_OPTS += --load-extension=amcheck --load-extension=pageinspect
+EXTRA_INSTALL += contrib/amcheck contrib/pageinspect
+
+TAP_TESTS = 1
+
+PG_CPPFLAGS = -I$(libpq_srcdir)
+PG_LIBS_INTERNAL = -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+SHLIB_PREREQS = submake-libpq
+subdir = contrib/pg_amcheck
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_amcheck/pg_amcheck.c b/contrib/pg_amcheck/pg_amcheck.c
new file mode 100644
index 0000000000..5045ea59af
--- /dev/null
+++ b/contrib/pg_amcheck/pg_amcheck.c
@@ -0,0 +1,2134 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_amcheck.c
+ *		Detects corruption within database relations.
+ *
+ * Copyright (c) 2017-2021, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  contrib/pg_amcheck/pg_amcheck.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <time.h>
+
+#include "catalog/pg_am_d.h"
+#include "catalog/pg_namespace_d.h"
+#include "common/logging.h"
+#include "common/username.h"
+#include "fe_utils/cancel.h"
+#include "fe_utils/option_utils.h"
+#include "fe_utils/parallel_slot.h"
+#include "fe_utils/query_utils.h"
+#include "fe_utils/simple_list.h"
+#include "fe_utils/string_utils.h"
+#include "getopt_long.h"		/* pgrminclude ignore */
+#include "pgtime.h"
+#include "storage/block.h"
+
+typedef struct PatternInfo
+{
+	const char *pattern;		/* Unaltered pattern from the command line */
+	char	   *db_regex;		/* Database regexp parsed from pattern, or
+								 * NULL */
+	char	   *nsp_regex;		/* Schema regexp parsed from pattern, or NULL */
+	char	   *rel_regex;		/* Relation regexp parsed from pattern, or
+								 * NULL */
+	bool		heap_only;		/* true if rel_regex should only match heap
+								 * tables */
+	bool		btree_only;		/* true if rel_regex should only match btree
+								 * indexes */
+	bool		matched;		/* true if the pattern matched in any database */
+} PatternInfo;
+
+typedef struct PatternInfoArray
+{
+	PatternInfo *data;
+	size_t		len;
+} PatternInfoArray;
+
+/* pg_amcheck command line options controlled by user flags */
+typedef struct AmcheckOptions
+{
+	bool		dbpattern;
+	bool		alldb;
+	bool		echo;
+	bool		quiet;
+	bool		verbose;
+	bool		strict_names;
+	bool		show_progress;
+	int			jobs;
+
+	/* Objects to check or not to check, as lists of PatternInfo structs. */
+	PatternInfoArray include;
+	PatternInfoArray exclude;
+
+	/*
+	 * As an optimization, if any pattern in the exclude list applies to heap
+	 * tables, or similarly if any such pattern applies to btree indexes, or
+	 * to schemas, then these will be true, otherwise false.  These should
+	 * always agree with what you'd conclude by grep'ing through the exclude
+	 * list.
+	 */
+	bool		excludetbl;
+	bool		excludeidx;
+	bool		excludensp;
+
+	/*
+	 * If any inclusion pattern exists, then we should only be checking
+	 * matching relations rather than all relations, so this is true iff
+	 * include is empty.
+	 */
+	bool		allrel;
+
+	/* heap table checking options */
+	bool		no_toast_expansion;
+	bool		reconcile_toast;
+	bool		on_error_stop;
+	int64		startblock;
+	int64		endblock;
+	const char *skip;
+
+	/* btree index checking options */
+	bool		parent_check;
+	bool		rootdescend;
+	bool		heapallindexed;
+
+	/* heap and btree hybrid option */
+	bool		no_btree_expansion;
+} AmcheckOptions;
+
+static AmcheckOptions opts = {
+	.dbpattern = false,
+	.alldb = false,
+	.echo = false,
+	.quiet = false,
+	.verbose = false,
+	.strict_names = true,
+	.show_progress = false,
+	.jobs = 1,
+	.include = {NULL, 0},
+	.exclude = {NULL, 0},
+	.excludetbl = false,
+	.excludeidx = false,
+	.excludensp = false,
+	.allrel = true,
+	.no_toast_expansion = false,
+	.reconcile_toast = true,
+	.on_error_stop = false,
+	.startblock = -1,
+	.endblock = -1,
+	.skip = "none",
+	.parent_check = false,
+	.rootdescend = false,
+	.heapallindexed = false,
+	.no_btree_expansion = false
+};
+
+static const char *progname = NULL;
+
+/* Whether all relations have so far passed their corruption checks */
+static bool all_checks_pass = true;
+
+/* Time last progress report was displayed */
+static pg_time_t last_progress_report = 0;
+static bool progress_since_last_stderr = false;
+
+typedef struct DatabaseInfo
+{
+	char	   *datname;
+	char	   *amcheck_schema; /* escaped, quoted literal */
+} DatabaseInfo;
+
+typedef struct RelationInfo
+{
+	const DatabaseInfo *datinfo;	/* shared by other relinfos */
+	Oid			reloid;
+	bool		is_heap;		/* true if heap, false if btree */
+	char	   *nspname;
+	char	   *relname;
+	int			relpages;
+	int			blocks_to_check;
+	char	   *sql;			/* set during query run, pg_free'd after */
+} RelationInfo;
+
+/*
+ * Query for determining if contrib's amcheck is installed.  If so, selects the
+ * namespace name where amcheck's functions can be found.
+ */
+static const char *amcheck_sql =
+"SELECT n.nspname, x.extversion FROM pg_catalog.pg_extension x"
+"\nJOIN pg_catalog.pg_namespace n ON x.extnamespace = n.oid"
+"\nWHERE x.extname = 'amcheck'";
+
+static void prepare_heap_command(PQExpBuffer sql, RelationInfo *rel,
+								 PGconn *conn);
+static void prepare_btree_command(PQExpBuffer sql, RelationInfo *rel,
+								  PGconn *conn);
+static void run_command(ParallelSlot *slot, const char *sql);
+static bool verify_heap_slot_handler(PGresult *res, PGconn *conn,
+									 void *context);
+static bool verify_btree_slot_handler(PGresult *res, PGconn *conn, void *context);
+static void help(const char *progname);
+static void progress_report(uint64 relations_total, uint64 relations_checked,
+							uint64 relpages_total, uint64 relpages_checked,
+							const char *datname, bool force, bool finished);
+
+static void append_database_pattern(PatternInfoArray *pia, const char *pattern,
+									int encoding);
+static void append_schema_pattern(PatternInfoArray *pia, const char *pattern,
+								  int encoding);
+static void append_relation_pattern(PatternInfoArray *pia, const char *pattern,
+									int encoding);
+static void append_heap_pattern(PatternInfoArray *pia, const char *pattern,
+								int encoding);
+static void append_btree_pattern(PatternInfoArray *pia, const char *pattern,
+								 int encoding);
+static void compile_database_list(PGconn *conn, SimplePtrList *databases,
+								  const char *initial_dbname);
+static void compile_relation_list_one_db(PGconn *conn, SimplePtrList *relations,
+										 const DatabaseInfo *datinfo,
+										 uint64 *pagecount);
+
+#define log_no_match(...) do { \
+		if (opts.strict_names) \
+			pg_log_generic(PG_LOG_ERROR, __VA_ARGS__); \
+		else \
+			pg_log_generic(PG_LOG_WARNING, __VA_ARGS__); \
+	} while(0)
+
+#define FREE_AND_SET_NULL(x) do { \
+	pg_free(x); \
+	(x) = NULL; \
+	} while (0)
+
+int
+main(int argc, char *argv[])
+{
+	PGconn	   *conn = NULL;
+	SimplePtrListCell *cell;
+	SimplePtrList databases = {NULL, NULL};
+	SimplePtrList relations = {NULL, NULL};
+	bool		failed = false;
+	const char *latest_datname;
+	int			parallel_workers;
+	ParallelSlotArray *sa;
+	PQExpBufferData sql;
+	uint64		reltotal = 0;
+	uint64		pageschecked = 0;
+	uint64		pagestotal = 0;
+	uint64		relprogress = 0;
+	int			pattern_id;
+
+	static struct option long_options[] = {
+		/* Connection options */
+		{"host", required_argument, NULL, 'h'},
+		{"port", required_argument, NULL, 'p'},
+		{"username", required_argument, NULL, 'U'},
+		{"no-password", no_argument, NULL, 'w'},
+		{"password", no_argument, NULL, 'W'},
+		{"maintenance-db", required_argument, NULL, 1},
+
+		/* check options */
+		{"all", no_argument, NULL, 'a'},
+		{"database", required_argument, NULL, 'd'},
+		{"exclude-database", required_argument, NULL, 'D'},
+		{"echo", no_argument, NULL, 'e'},
+		{"index", required_argument, NULL, 'i'},
+		{"exclude-index", required_argument, NULL, 'I'},
+		{"jobs", required_argument, NULL, 'j'},
+		{"progress", no_argument, NULL, 'P'},
+		{"quiet", no_argument, NULL, 'q'},
+		{"relation", required_argument, NULL, 'r'},
+		{"exclude-relation", required_argument, NULL, 'R'},
+		{"schema", required_argument, NULL, 's'},
+		{"exclude-schema", required_argument, NULL, 'S'},
+		{"table", required_argument, NULL, 't'},
+		{"exclude-table", required_argument, NULL, 'T'},
+		{"verbose", no_argument, NULL, 'v'},
+		{"no-dependent-indexes", no_argument, NULL, 2},
+		{"no-dependent-toast", no_argument, NULL, 3},
+		{"exclude-toast-pointers", no_argument, NULL, 4},
+		{"on-error-stop", no_argument, NULL, 5},
+		{"skip", required_argument, NULL, 6},
+		{"startblock", required_argument, NULL, 7},
+		{"endblock", required_argument, NULL, 8},
+		{"rootdescend", no_argument, NULL, 9},
+		{"no-strict-names", no_argument, NULL, 10},
+		{"heapallindexed", no_argument, NULL, 11},
+		{"parent-check", no_argument, NULL, 12},
+
+		{NULL, 0, NULL, 0}
+	};
+
+	int			optindex;
+	int			c;
+
+	const char *db = NULL;
+	const char *maintenance_db = NULL;
+
+	const char *host = NULL;
+	const char *port = NULL;
+	const char *username = NULL;
+	enum trivalue prompt_password = TRI_DEFAULT;
+	int			encoding = pg_get_encoding_from_locale(NULL, false);
+	ConnParams	cparams;
+
+	pg_logging_init(argv[0]);
+	progname = get_progname(argv[0]);
+	set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("contrib"));
+
+	handle_help_version_opts(argc, argv, progname, help);
+
+	/* process command-line options */
+	while ((c = getopt_long(argc, argv, "ad:D:eh:Hi:I:j:p:Pqr:R:s:S:t:T:U:wWv",
+							long_options, &optindex)) != -1)
+	{
+		char	   *endptr;
+
+		switch (c)
+		{
+			case 'a':
+				opts.alldb = true;
+				break;
+			case 'd':
+				opts.dbpattern = true;
+				append_database_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'D':
+				opts.dbpattern = true;
+				append_database_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 'e':
+				opts.echo = true;
+				break;
+			case 'h':
+				host = pg_strdup(optarg);
+				break;
+			case 'i':
+				opts.allrel = false;
+				append_btree_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'I':
+				opts.excludeidx = true;
+				append_btree_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 'j':
+				opts.jobs = atoi(optarg);
+				if (opts.jobs < 1)
+				{
+					fprintf(stderr,
+							"number of parallel jobs must be at least 1\n");
+					exit(1);
+				}
+				break;
+			case 'p':
+				port = pg_strdup(optarg);
+				break;
+			case 'P':
+				opts.show_progress = true;
+				break;
+			case 'q':
+				opts.quiet = true;
+				break;
+			case 'r':
+				opts.allrel = false;
+				append_relation_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'R':
+				opts.excludeidx = true;
+				opts.excludetbl = true;
+				append_relation_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 's':
+				opts.allrel = false;
+				append_schema_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'S':
+				opts.excludensp = true;
+				append_schema_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 't':
+				opts.allrel = false;
+				append_heap_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'T':
+				opts.excludetbl = true;
+				append_heap_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 'U':
+				username = pg_strdup(optarg);
+				break;
+			case 'w':
+				prompt_password = TRI_NO;
+				break;
+			case 'W':
+				prompt_password = TRI_YES;
+				break;
+			case 'v':
+				opts.verbose = true;
+				pg_logging_increase_verbosity();
+				break;
+			case 1:
+				maintenance_db = pg_strdup(optarg);
+				break;
+			case 2:
+				opts.no_btree_expansion = true;
+				break;
+			case 3:
+				opts.no_toast_expansion = true;
+				break;
+			case 4:
+				opts.reconcile_toast = false;
+				break;
+			case 5:
+				opts.on_error_stop = true;
+				break;
+			case 6:
+				if (pg_strcasecmp(optarg, "all-visible") == 0)
+					opts.skip = "all visible";
+				else if (pg_strcasecmp(optarg, "all-frozen") == 0)
+					opts.skip = "all frozen";
+				else
+				{
+					fprintf(stderr, "invalid skip option\n");
+					exit(1);
+				}
+				break;
+			case 7:
+				opts.startblock = strtol(optarg, &endptr, 10);
+				if (*endptr != '\0')
+				{
+					fprintf(stderr,
+							"invalid start block\n");
+					exit(1);
+				}
+				if (opts.startblock > MaxBlockNumber || opts.startblock < 0)
+				{
+					fprintf(stderr,
+							"start block out of bounds\n");
+					exit(1);
+				}
+				break;
+			case 8:
+				opts.endblock = strtol(optarg, &endptr, 10);
+				if (*endptr != '\0')
+				{
+					fprintf(stderr,
+							"invalid end block\n");
+					exit(1);
+				}
+				if (opts.endblock > MaxBlockNumber || opts.endblock < 0)
+				{
+					fprintf(stderr,
+							"end block out of bounds\n");
+					exit(1);
+				}
+				break;
+			case 9:
+				opts.rootdescend = true;
+				opts.parent_check = true;
+				break;
+			case 10:
+				opts.strict_names = false;
+				break;
+			case 11:
+				opts.heapallindexed = true;
+				break;
+			case 12:
+				opts.parent_check = true;
+				break;
+			default:
+				fprintf(stderr,
+						"Try \"%s --help\" for more information.\n",
+						progname);
+				exit(1);
+		}
+	}
+
+	if (opts.endblock >= 0 && opts.endblock < opts.startblock)
+	{
+		fprintf(stderr,
+				"end block precedes start block\n");
+		exit(1);
+	}
+
+	/*
+	 * A single non-option argument specifies a database name or connection
+	 * string.
+	 */
+	if (optind < argc)
+	{
+		db = argv[optind];
+		optind++;
+	}
+
+	if (optind < argc)
+	{
+		pg_log_error("too many command-line arguments (first is \"%s\")",
+					 argv[optind]);
+		fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+		exit(1);
+	}
+
+	/* fill cparams except for dbname, which is set below */
+	cparams.pghost = host;
+	cparams.pgport = port;
+	cparams.pguser = username;
+	cparams.prompt_password = prompt_password;
+	cparams.dbname = NULL;
+	cparams.override_dbname = NULL;
+
+	setup_cancel_handler(NULL);
+
+	/* choose the database for our initial connection */
+	if (opts.alldb)
+	{
+		if (db != NULL)
+		{
+			pg_log_error("cannot check all databases and a specific one at the same time");
+			exit(1);
+		}
+		cparams.dbname = maintenance_db;
+	}
+	else if (db != NULL)
+	{
+		if (opts.dbpattern)
+		{
+			pg_log_error("cannot check a specific database and specify database patterns at the same time");
+			exit(1);
+		}
+		cparams.dbname = db;
+	}
+
+	if (opts.alldb || opts.dbpattern)
+	{
+		conn = connectMaintenanceDatabase(&cparams, progname, opts.echo);
+		compile_database_list(conn, &databases, NULL);
+	}
+	else
+	{
+		if (cparams.dbname == NULL)
+		{
+			if (getenv("PGDATABASE"))
+				cparams.dbname = getenv("PGDATABASE");
+			else if (getenv("PGUSER"))
+				cparams.dbname = getenv("PGUSER");
+			else
+				cparams.dbname = get_user_name_or_exit(progname);
+		}
+		conn = connectDatabase(&cparams, progname, opts.echo, false, true);
+		compile_database_list(conn, &databases, PQdb(conn));
+	}
+
+	if (databases.head == NULL)
+	{
+		if (conn != NULL)
+			disconnectDatabase(conn);
+		pg_log_error("no databases to check");
+		exit(0);
+	}
+
+	/*
+	 * Compile a list of all relations spanning all databases to be checked.
+	 */
+	for (cell = databases.head; cell; cell = cell->next)
+	{
+		PGresult   *result;
+		int			ntups;
+		const char *amcheck_schema = NULL;
+		DatabaseInfo *dat = (DatabaseInfo *) cell->ptr;
+
+		cparams.override_dbname = dat->datname;
+		if (conn == NULL || strcmp(PQdb(conn), dat->datname) != 0)
+		{
+			if (conn != NULL)
+				disconnectDatabase(conn);
+			conn = connectDatabase(&cparams, progname, opts.echo, false, true);
+		}
+
+		/*
+		 * Verify that amcheck is installed for this next database.  User
+		 * error could result in a database not having amcheck that should
+		 * have it, but we also could be iterating over multiple databases
+		 * where not all of them have amcheck installed (for example,
+		 * 'template1').
+		 */
+		result = executeQuery(conn, amcheck_sql, opts.echo);
+		if (PQresultStatus(result) != PGRES_TUPLES_OK)
+		{
+			/* Querying the catalog failed. */
+			pg_log_error("database \"%s\": %s",
+						 PQdb(conn), PQerrorMessage(conn));
+			pg_log_info("query was: %s", amcheck_sql);
+			PQclear(result);
+			disconnectDatabase(conn);
+			exit(1);
+		}
+		ntups = PQntuples(result);
+		if (ntups == 0)
+		{
+			/* Querying the catalog succeeded, but amcheck is missing. */
+			pg_log_warning("skipping database \"%s\": amcheck is not installed",
+						   PQdb(conn));
+			disconnectDatabase(conn);
+			conn = NULL;
+			continue;
+		}
+		amcheck_schema = PQgetvalue(result, 0, 0);
+		if (opts.verbose)
+			pg_log_info("in database \"%s\": using amcheck version \"%s\" in schema \"%s\"",
+						PQdb(conn), PQgetvalue(result, 0, 1), amcheck_schema);
+		dat->amcheck_schema = PQescapeIdentifier(conn, amcheck_schema,
+												 strlen(amcheck_schema));
+		PQclear(result);
+
+		compile_relation_list_one_db(conn, &relations, dat, &pagestotal);
+	}
+
+	/*
+	 * Check that all inclusion patterns matched at least one schema or
+	 * relation that we can check.
+	 */
+	for (pattern_id = 0; pattern_id < opts.include.len; pattern_id++)
+	{
+		PatternInfo *pat = &opts.include.data[pattern_id];
+
+		if (!pat->matched && (pat->nsp_regex != NULL || pat->rel_regex != NULL))
+		{
+			failed = opts.strict_names;
+
+			if (!opts.quiet || failed)
+			{
+				if (pat->heap_only)
+					log_no_match("no heap tables to check matching \"%s\"",
+								 pat->pattern);
+				else if (pat->btree_only)
+					log_no_match("no btree indexes to check matching \"%s\"",
+								 pat->pattern);
+				else if (pat->rel_regex == NULL)
+					log_no_match("no relations to check in schemas matching \"%s\"",
+								 pat->pattern);
+				else
+					log_no_match("no relations to check matching \"%s\"",
+								 pat->pattern);
+			}
+		}
+	}
+
+	if (failed)
+	{
+		if (conn != NULL)
+			disconnectDatabase(conn);
+		exit(1);
+	}
+
+	/*
+	 * Set parallel_workers to the lesser of opts.jobs and the number of
+	 * relations.
+	 */
+	parallel_workers = 0;
+	for (cell = relations.head; cell; cell = cell->next)
+	{
+		reltotal++;
+		if (parallel_workers < opts.jobs)
+			parallel_workers++;
+	}
+
+	if (reltotal == 0)
+	{
+		if (conn != NULL)
+			disconnectDatabase(conn);
+		pg_log_error("no relations to check");
+		exit(1);
+	}
+	progress_report(reltotal, relprogress, pagestotal, pageschecked, NULL, true, false);
+
+	/*
+	 * Main event loop.
+	 *
+	 * We use server-side parallelism to check up to parallel_workers
+	 * relations in parallel.  The list of relations was computed in database
+	 * order, which minimizes the number of connects and disconnects as we
+	 * process the list.
+	 */
+	latest_datname = NULL;
+	sa = ParallelSlotsSetup(parallel_workers, &cparams, progname, opts.echo,
+							NULL);
+	if (conn != NULL)
+	{
+		ParallelSlotsAdoptConn(sa, conn);
+		conn = NULL;
+	}
+
+	initPQExpBuffer(&sql);
+	for (relprogress = 0, cell = relations.head; cell; cell = cell->next)
+	{
+		ParallelSlot *free_slot;
+		RelationInfo *rel;
+
+		rel = (RelationInfo *) cell->ptr;
+
+		if (CancelRequested)
+		{
+			failed = true;
+			break;
+		}
+
+		/*
+		 * The list of relations is in database sorted order.  If this next
+		 * relation is in a different database than the last one seen, we are
+		 * about to start checking this database.  Note that other slots may
+		 * still be working on relations from prior databases.
+		 */
+		latest_datname = rel->datinfo->datname;
+
+		progress_report(reltotal, relprogress, pagestotal, pageschecked, latest_datname, false, false);
+
+		relprogress++;
+		pageschecked += rel->blocks_to_check;
+
+		/*
+		 * Get a parallel slot for the next amcheck command, blocking if
+		 * necessary until one is available, or until a previously issued slot
+		 * command fails, indicating that we should abort checking the
+		 * remaining objects.
+		 */
+		free_slot = ParallelSlotsGetIdle(sa, rel->datinfo->datname);
+		if (!free_slot)
+		{
+			/*
+			 * Something failed.  We don't need to know what it was, because
+			 * the handler should already have emitted the necessary error
+			 * messages.
+			 */
+			failed = true;
+			break;
+		}
+
+		if (opts.verbose)
+			PQsetErrorVerbosity(free_slot->connection, PQERRORS_VERBOSE);
+		else if (opts.quiet)
+			PQsetErrorVerbosity(free_slot->connection, PQERRORS_TERSE);
+
+		/*
+		 * Execute the appropriate amcheck command for this relation using our
+		 * slot's database connection.  We do not wait for the command to
+		 * complete, nor do we perform any error checking, as that is done by
+		 * the parallel slots and our handler callback functions.
+		 */
+		if (rel->is_heap)
+		{
+			if (opts.verbose)
+			{
+				if (opts.show_progress && progress_since_last_stderr)
+					fprintf(stderr, "\n");
+				pg_log_info("checking heap table \"%s\".\"%s\".\"%s\"",
+							rel->datinfo->datname, rel->nspname, rel->relname);
+				progress_since_last_stderr = false;
+			}
+			prepare_heap_command(&sql, rel, free_slot->connection);
+			rel->sql = pstrdup(sql.data);	/* pg_free'd after command */
+			ParallelSlotSetHandler(free_slot, verify_heap_slot_handler, rel);
+			run_command(free_slot, rel->sql);
+		}
+		else
+		{
+			if (opts.verbose)
+			{
+				if (opts.show_progress && progress_since_last_stderr)
+					fprintf(stderr, "\n");
+
+				pg_log_info("checking btree index \"%s\".\"%s\".\"%s\"",
+							rel->datinfo->datname, rel->nspname, rel->relname);
+				progress_since_last_stderr = false;
+			}
+			prepare_btree_command(&sql, rel, free_slot->connection);
+			rel->sql = pstrdup(sql.data);	/* pg_free'd after command */
+			ParallelSlotSetHandler(free_slot, verify_btree_slot_handler, rel);
+			run_command(free_slot, rel->sql);
+		}
+	}
+	termPQExpBuffer(&sql);
+
+	if (!failed)
+	{
+		/*
+		 * Wait for all slots to complete, or for one to indicate that an
+		 * error occurred.  Like above, we rely on the handler emitting the
+		 * necessary error messages.
+		 */
+		if (sa && !ParallelSlotsWaitCompletion(sa))
+			failed = true;
+
+		progress_report(reltotal, relprogress, pagestotal, pageschecked, NULL, true, true);
+	}
+
+	if (sa)
+	{
+		ParallelSlotsTerminate(sa);
+		FREE_AND_SET_NULL(sa);
+	}
+
+	if (failed)
+		exit(1);
+
+	if (!all_checks_pass)
+		exit(2);
+}
+
+/*
+ * prepare_heap_command
+ *
+ * Creates a SQL command for running amcheck checking on the given heap
+ * relation.  The command is phrased as a SQL query, with column order and
+ * names matching the expectations of verify_heap_slot_handler, which will
+ * receive and handle each row returned from the verify_heapam() function.
+ *
+ * sql: buffer into which the heap table checking command will be written
+ * rel: relation information for the heap table to be checked
+ * conn: the connection to be used, for string escaping purposes
+ */
+static void
+prepare_heap_command(PQExpBuffer sql, RelationInfo *rel, PGconn *conn)
+{
+	resetPQExpBuffer(sql);
+	appendPQExpBuffer(sql,
+					  "SELECT blkno, offnum, attnum, msg FROM %s.verify_heapam("
+					  "\nrelation := %u, on_error_stop := %s, check_toast := %s, skip := '%s'",
+					  rel->datinfo->amcheck_schema,
+					  rel->reloid,
+					  opts.on_error_stop ? "true" : "false",
+					  opts.reconcile_toast ? "true" : "false",
+					  opts.skip);
+
+	if (opts.startblock >= 0)
+		appendPQExpBuffer(sql, ", startblock := " INT64_FORMAT, opts.startblock);
+	if (opts.endblock >= 0)
+		appendPQExpBuffer(sql, ", endblock := " INT64_FORMAT, opts.endblock);
+
+	appendPQExpBuffer(sql, ")");
+}
+
+/*
+ * prepare_btree_command
+ *
+ * Creates a SQL command for running amcheck checking on the given btree index
+ * relation.  The command does not select any columns, as btree checking
+ * functions do not return any, but rather return corruption information by
+ * raising errors, which verify_btree_slot_handler expects.
+ *
+ * sql: buffer into which the btree index checking command will be written
+ * rel: relation information for the index to be checked
+ * conn: the connection to be used, for string escaping purposes
+ */
+static void
+prepare_btree_command(PQExpBuffer sql, RelationInfo *rel, PGconn *conn)
+{
+	resetPQExpBuffer(sql);
+
+	/*
+	 * The index is identified by OID rather than by name.  If the check
+	 * throws an error, verify_btree_slot_handler reports the database,
+	 * schema, and relation names, so the user knows which relation the
+	 * error came from.
+	 */
+	if (opts.parent_check)
+		appendPQExpBuffer(sql,
+						  "SELECT * FROM %s.bt_index_parent_check("
+						  "index := '%u'::regclass, heapallindexed := %s, "
+						  "rootdescend := %s)",
+						  rel->datinfo->amcheck_schema,
+						  rel->reloid,
+						  (opts.heapallindexed ? "true" : "false"),
+						  (opts.rootdescend ? "true" : "false"));
+	else
+		appendPQExpBuffer(sql,
+						  "SELECT * FROM %s.bt_index_check("
+						  "index := '%u'::regclass, heapallindexed := %s)",
+						  rel->datinfo->amcheck_schema,
+						  rel->reloid,
+						  (opts.heapallindexed ? "true" : "false"));
+}
+
+/*
+ * run_command
+ *
+ * Sends a command to the server without waiting for the command to complete.
+ * Logs an error if the command cannot be sent, but otherwise any errors are
+ * expected to be handled by a ParallelSlotHandler.
+ *
+ * slot: slot with connection to the server we should use for the command
+ * sql: query to send
+ */
+static void
+run_command(ParallelSlot *slot, const char *sql)
+{
+	if (opts.echo)
+		printf("%s\n", sql);
+
+	if (PQsendQuery(slot->connection, sql) == 0)
+	{
+		pg_log_error("error sending command to database \"%s\": %s",
+					 PQdb(slot->connection),
+					 PQerrorMessage(slot->connection));
+		pg_log_error("command was: %s", sql);
+		exit(1);
+	}
+}
+
+/*
+ * should_processing_continue
+ *
+ * Checks a query result returned from a query (presumably issued on a slot's
+ * connection) to determine if parallel slots should continue issuing further
+ * commands.
+ *
+ * Note: Heap relation corruption is reported by verify_heapam() via the result
+ * set, rather than an ERROR, but running verify_heapam() on a corrupted heap
+ * table may still result in an error being returned from the server due to
+ * missing relation files, bad checksums, etc.  The btree corruption checking
+ * functions always use errors to communicate corruption messages.  We can't
+ * just abort processing because we got a mere ERROR.
+ *
+ * res: result from an executed sql query
+ */
+static bool
+should_processing_continue(PGresult *res)
+{
+	const char *severity;
+
+	switch (PQresultStatus(res))
+	{
+			/* These are expected and ok */
+		case PGRES_COMMAND_OK:
+		case PGRES_TUPLES_OK:
+		case PGRES_NONFATAL_ERROR:
+			break;
+
+			/* This is expected but requires closer scrutiny */
+		case PGRES_FATAL_ERROR:
+			severity = PQresultErrorField(res, PG_DIAG_SEVERITY_NONLOCALIZED);
+			if (strcmp(severity, "FATAL") == 0)
+				return false;
+			if (strcmp(severity, "PANIC") == 0)
+				return false;
+			break;
+
+			/* These are unexpected */
+		case PGRES_BAD_RESPONSE:
+		case PGRES_EMPTY_QUERY:
+		case PGRES_COPY_OUT:
+		case PGRES_COPY_IN:
+		case PGRES_COPY_BOTH:
+		case PGRES_SINGLE_TUPLE:
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Returns a copy of the argument string with all lines indented four spaces.
+ *
+ * The caller should pg_free the result when finished with it.
+ */
+static char *
+indent_lines(const char *str)
+{
+	PQExpBufferData buf;
+	const char *c;
+	char	   *result;
+
+	initPQExpBuffer(&buf);
+	appendPQExpBufferStr(&buf, "    ");
+	for (c = str; *c; c++)
+	{
+		appendPQExpBufferChar(&buf, *c);
+		if (c[0] == '\n' && c[1] != '\0')
+			appendPQExpBufferStr(&buf, "    ");
+	}
+	result = pstrdup(buf.data);
+	termPQExpBuffer(&buf);
+
+	return result;
+}
+
+/*
+ * verify_heap_slot_handler
+ *
+ * ParallelSlotHandler that receives results from a heap table checking command
+ * created by prepare_heap_command and outputs the results for the user.
+ *
+ * res: result from an executed sql query
+ * conn: connection on which the sql query was executed
+ * context: the sql query being handled, as a cstring
+ */
+static bool
+verify_heap_slot_handler(PGresult *res, PGconn *conn, void *context)
+{
+	RelationInfo *rel = (RelationInfo *) context;
+
+	if (PQresultStatus(res) == PGRES_TUPLES_OK)
+	{
+		int			i;
+		int			ntups = PQntuples(res);
+
+		if (ntups > 0)
+			all_checks_pass = false;
+
+		for (i = 0; i < ntups; i++)
+		{
+			const char *msg;
+
+			/* The message string should never be null, but check */
+			if (PQgetisnull(res, i, 3))
+				msg = "NO MESSAGE";
+			else
+				msg = PQgetvalue(res, i, 3);
+
+			if (!PQgetisnull(res, i, 2))
+				printf("heap table \"%s\".\"%s\".\"%s\", block %s, offset %s, attribute %s:\n    %s\n",
+					   rel->datinfo->datname, rel->nspname, rel->relname,
+					   PQgetvalue(res, i, 0),	/* blkno */
+					   PQgetvalue(res, i, 1),	/* offnum */
+					   PQgetvalue(res, i, 2),	/* attnum */
+					   msg);
+
+			else if (!PQgetisnull(res, i, 1))
+				printf("heap table \"%s\".\"%s\".\"%s\", block %s, offset %s:\n    %s\n",
+					   rel->datinfo->datname, rel->nspname, rel->relname,
+					   PQgetvalue(res, i, 0),	/* blkno */
+					   PQgetvalue(res, i, 1),	/* offnum */
+					   msg);
+
+			else if (!PQgetisnull(res, i, 0))
+				printf("heap table \"%s\".\"%s\".\"%s\", block %s:\n    %s\n",
+					   rel->datinfo->datname, rel->nspname, rel->relname,
+					   PQgetvalue(res, i, 0),	/* blkno */
+					   msg);
+
+			else
+				printf("heap table \"%s\".\"%s\".\"%s\":\n    %s\n",
+					   rel->datinfo->datname, rel->nspname, rel->relname, msg);
+		}
+	}
+	else
+	{
+		char	   *msg = indent_lines(PQerrorMessage(conn));
+
+		all_checks_pass = false;
+		printf("heap table \"%s\".\"%s\".\"%s\":\n%s",
+			   rel->datinfo->datname, rel->nspname, rel->relname, msg);
+		if (opts.verbose)
+			printf("query was: %s\n", rel->sql);
+		FREE_AND_SET_NULL(msg);
+	}
+
+	FREE_AND_SET_NULL(rel->sql);
+	FREE_AND_SET_NULL(rel->nspname);
+	FREE_AND_SET_NULL(rel->relname);
+
+	return should_processing_continue(res);
+}
+
+/*
+ * verify_btree_slot_handler
+ *
+ * ParallelSlotHandler that receives results from a btree checking command
+ * created by prepare_btree_command and outputs them for the user.  The result
+ * set from the btree checking command is expected to be empty; when the
+ * command instead fails, the useful information about the corruption is
+ * expected in the connection's error message.
+ *
+ * res: result from an executed sql query
+ * conn: connection on which the sql query was executed
+ * context: the RelationInfo of the index being checked
+ */
+static bool
+verify_btree_slot_handler(PGresult *res, PGconn *conn, void *context)
+{
+	RelationInfo *rel = (RelationInfo *) context;
+
+	if (PQresultStatus(res) == PGRES_TUPLES_OK)
+	{
+		int			ntups = PQntuples(res);
+
+		if (ntups != 1)
+		{
+			/*
+			 * We expect the btree checking functions to return one void row
+			 * each, so we should output some sort of warning if we get
+			 * anything else, not because it indicates corruption, but because
+			 * it suggests a mismatch between amcheck and pg_amcheck versions.
+			 *
+			 * In conjunction with --progress, anything written to stderr at
+			 * this time would present strangely to the user without an extra
+			 * newline, so we print one.  If we were multithreaded, we'd have
+			 * to avoid splitting this across multiple calls, but we're in an
+			 * event loop, so it doesn't matter.
+			 */
+			if (opts.show_progress && progress_since_last_stderr)
+				fprintf(stderr, "\n");
+			pg_log_warning("btree index \"%s\".\"%s\".\"%s\": btree checking function returned unexpected number of rows: %d",
+						   rel->datinfo->datname, rel->nspname, rel->relname, ntups);
+			if (opts.verbose)
+				pg_log_info("query was: %s", rel->sql);
+			pg_log_warning("are %s's and amcheck's versions compatible?",
+						   progname);
+			progress_since_last_stderr = false;
+		}
+	}
+	else
+	{
+		char	   *msg = indent_lines(PQerrorMessage(conn));
+
+		all_checks_pass = false;
+		printf("btree index \"%s\".\"%s\".\"%s\":\n%s",
+			   rel->datinfo->datname, rel->nspname, rel->relname, msg);
+		if (opts.verbose)
+			printf("query was: %s\n", rel->sql);
+		FREE_AND_SET_NULL(msg);
+	}
+
+	FREE_AND_SET_NULL(rel->sql);
+	FREE_AND_SET_NULL(rel->nspname);
+	FREE_AND_SET_NULL(rel->relname);
+
+	return should_processing_continue(res);
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_amcheck"
+ */
+static void
+help(const char *progname)
+{
+	printf("%s uses the amcheck module to check objects in a PostgreSQL database for corruption.\n\n", progname);
+	printf("Usage:\n");
+	printf("  %s [OPTION]... [DBNAME]\n", progname);
+	printf("\nTarget Options:\n");
+	printf("  -a, --all                      check all databases\n");
+	printf("  -d, --database=PATTERN         check matching database(s)\n");
+	printf("  -D, --exclude-database=PATTERN do NOT check matching database(s)\n");
+	printf("  -i, --index=PATTERN            check matching index(es)\n");
+	printf("  -I, --exclude-index=PATTERN    do NOT check matching index(es)\n");
+	printf("  -r, --relation=PATTERN         check matching relation(s)\n");
+	printf("  -R, --exclude-relation=PATTERN do NOT check matching relation(s)\n");
+	printf("  -s, --schema=PATTERN           check matching schema(s)\n");
+	printf("  -S, --exclude-schema=PATTERN   do NOT check matching schema(s)\n");
+	printf("  -t, --table=PATTERN            check matching table(s)\n");
+	printf("  -T, --exclude-table=PATTERN    do NOT check matching table(s)\n");
+	printf("      --no-dependent-indexes     do NOT expand list of relations to include indexes\n");
+	printf("      --no-dependent-toast       do NOT expand list of relations to include toast\n");
+	printf("      --no-strict-names          do NOT require patterns to match objects\n");
+	printf("\nTable Checking Options:\n");
+	printf("      --exclude-toast-pointers   do NOT follow relation toast pointers\n");
+	printf("      --on-error-stop            stop checking at end of first corrupt page\n");
+	printf("      --skip=OPTION              do NOT check \"all-frozen\" or \"all-visible\" blocks\n");
+	printf("      --startblock=BLOCK         begin checking table(s) at the given block number\n");
+	printf("      --endblock=BLOCK           check table(s) only up to the given block number\n");
+	printf("\nBtree Index Checking Options:\n");
+	printf("      --heapallindexed           check all heap tuples are found within indexes\n");
+	printf("      --parent-check             check index parent/child relationships\n");
+	printf("      --rootdescend              search from root page to refind tuples\n");
+	printf("\nConnection options:\n");
+	printf("  -h, --host=HOSTNAME            database server host or socket directory\n");
+	printf("  -p, --port=PORT                database server port\n");
+	printf("  -U, --username=USERNAME        user name to connect as\n");
+	printf("  -w, --no-password              never prompt for password\n");
+	printf("  -W, --password                 force password prompt\n");
+	printf("      --maintenance-db=DBNAME    alternate maintenance database\n");
+	printf("\nOther Options:\n");
+	printf("  -e, --echo                     show the commands being sent to the server\n");
+	printf("  -j, --jobs=NUM                 use this many concurrent connections to the server\n");
+	printf("  -q, --quiet                    don't write any messages\n");
+	printf("  -v, --verbose                  write a lot of output\n");
+	printf("  -V, --version                  output version information, then exit\n");
+	printf("  -P, --progress                 show progress information\n");
+	printf("  -?, --help                     show this help, then exit\n");
+
+	printf("\nReport bugs to <%s>.\n", PACKAGE_BUGREPORT);
+	printf("%s home page: <%s>\n", PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Print a progress report based on the global variables.
+ *
+ * Progress report is written at maximum once per second, unless the force
+ * parameter is set to true.
+ *
+ * If finished is set to true, this is the last progress report. The cursor
+ * is moved to the next line.
+ */
+static void
+progress_report(uint64 relations_total, uint64 relations_checked,
+				uint64 relpages_total, uint64 relpages_checked,
+				const char *datname, bool force, bool finished)
+{
+	int			percent_rel = 0;
+	int			percent_pages = 0;
+	char		checked_rel[32];
+	char		total_rel[32];
+	char		checked_pages[32];
+	char		total_pages[32];
+	pg_time_t	now;
+
+	if (!opts.show_progress)
+		return;
+
+	now = time(NULL);
+	if (now == last_progress_report && !force && !finished)
+		return;					/* Max once per second */
+
+	last_progress_report = now;
+	if (relations_total)
+		percent_rel = (int) (relations_checked * 100 / relations_total);
+	if (relpages_total)
+		percent_pages = (int) (relpages_checked * 100 / relpages_total);
+
+	/*
+	 * Separate step to keep platform-dependent format code out of fprintf
+	 * calls.  We only test for INT64_FORMAT availability in snprintf, not
+	 * fprintf.
+	 */
+	snprintf(checked_rel, sizeof(checked_rel), INT64_FORMAT, relations_checked);
+	snprintf(total_rel, sizeof(total_rel), INT64_FORMAT, relations_total);
+	snprintf(checked_pages, sizeof(checked_pages), INT64_FORMAT, relpages_checked);
+	snprintf(total_pages, sizeof(total_pages), INT64_FORMAT, relpages_total);
+
+#define VERBOSE_DATNAME_LENGTH 35
+	if (opts.verbose)
+	{
+		if (!datname)
+
+			/*
+			 * No datname given, so clear the status line (used for first and
+			 * last call)
+			 */
+			fprintf(stderr,
+					"%*s/%s relations (%d%%) %*s/%s pages (%d%%) %*s",
+					(int) strlen(total_rel),
+					checked_rel, total_rel, percent_rel,
+					(int) strlen(total_pages),
+					checked_pages, total_pages, percent_pages,
+					VERBOSE_DATNAME_LENGTH + 2, "");
+		else
+		{
+			bool		truncate = (strlen(datname) > VERBOSE_DATNAME_LENGTH);
+
+			fprintf(stderr,
+					"%*s/%s relations (%d%%) %*s/%s pages (%d%%), (%s%-*.*s)",
+					(int) strlen(total_rel),
+					checked_rel, total_rel, percent_rel,
+					(int) strlen(total_pages),
+					checked_pages, total_pages, percent_pages,
+			/* Prefix with "..." if we do leading truncation */
+					truncate ? "..." : "",
+					truncate ? VERBOSE_DATNAME_LENGTH - 3 : VERBOSE_DATNAME_LENGTH,
+					truncate ? VERBOSE_DATNAME_LENGTH - 3 : VERBOSE_DATNAME_LENGTH,
+			/* Truncate datname at beginning if it's too long */
+					truncate ? datname + strlen(datname) - VERBOSE_DATNAME_LENGTH + 3 : datname);
+		}
+	}
+	else
+		fprintf(stderr,
+				"%*s/%s relations (%d%%) %*s/%s pages (%d%%)",
+				(int) strlen(total_rel),
+				checked_rel, total_rel, percent_rel,
+				(int) strlen(total_pages),
+				checked_pages, total_pages, percent_pages);
+
+	/*
+	 * Stay on the same line if reporting to a terminal and we're not done
+	 * yet.
+	 */
+	if (!finished && isatty(fileno(stderr)))
+	{
+		fputc('\r', stderr);
+		progress_since_last_stderr = true;
+	}
+	else
+		fputc('\n', stderr);
+}
+
+/*
+ * Extend the pattern info array to hold one additional initialized pattern
+ * info entry.
+ *
+ * Returns a pointer to the new entry.
+ */
+static PatternInfo *
+extend_pattern_info_array(PatternInfoArray *pia)
+{
+	PatternInfo *result;
+
+	pia->len++;
+	pia->data = (PatternInfo *) pg_realloc(pia->data, pia->len * sizeof(PatternInfo));
+	result = &pia->data[pia->len - 1];
+	memset(result, 0, sizeof(*result));
+
+	return result;
+}
+
+/*
+ * append_database_pattern
+ *
+ * Adds the given pattern interpreted as a database name pattern.
+ *
+ * pia: the pattern info array to append to
+ * pattern: the database name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_database_pattern(PatternInfoArray *pia, const char *pattern, int encoding)
+{
+	PQExpBufferData buf;
+	PatternInfo *info = extend_pattern_info_array(pia);
+
+	initPQExpBuffer(&buf);
+	patternToSQLRegex(encoding, NULL, NULL, &buf, pattern, false);
+	info->pattern = pattern;
+	info->db_regex = pstrdup(buf.data);
+
+	termPQExpBuffer(&buf);
+}
+
+/*
+ * append_schema_pattern
+ *
+ * Adds the given pattern interpreted as a schema name pattern.
+ *
+ * pia: the pattern info array to append to
+ * pattern: the schema name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_schema_pattern(PatternInfoArray *pia, const char *pattern, int encoding)
+{
+	PQExpBufferData dbbuf;
+	PQExpBufferData nspbuf;
+	PatternInfo *info = extend_pattern_info_array(pia);
+
+	initPQExpBuffer(&dbbuf);
+	initPQExpBuffer(&nspbuf);
+
+	patternToSQLRegex(encoding, NULL, &dbbuf, &nspbuf, pattern, false);
+	info->pattern = pattern;
+	if (dbbuf.data[0])
+	{
+		opts.dbpattern = true;
+		info->db_regex = pstrdup(dbbuf.data);
+	}
+	if (nspbuf.data[0])
+		info->nsp_regex = pstrdup(nspbuf.data);
+
+	termPQExpBuffer(&dbbuf);
+	termPQExpBuffer(&nspbuf);
+}
+
+/*
+ * append_relation_pattern_helper
+ *
+ * Adds to a list the given pattern interpreted as a relation pattern.
+ *
+ * pia: the pattern info array to append to
+ * pattern: the relation name pattern
+ * encoding: client encoding for parsing the pattern
+ * heap_only: whether the pattern should only be matched against heap tables
+ * btree_only: whether the pattern should only be matched against btree indexes
+ */
+static void
+append_relation_pattern_helper(PatternInfoArray *pia, const char *pattern,
+							   int encoding, bool heap_only, bool btree_only)
+{
+	PQExpBufferData dbbuf;
+	PQExpBufferData nspbuf;
+	PQExpBufferData relbuf;
+	PatternInfo *info = extend_pattern_info_array(pia);
+
+	initPQExpBuffer(&dbbuf);
+	initPQExpBuffer(&nspbuf);
+	initPQExpBuffer(&relbuf);
+
+	patternToSQLRegex(encoding, &dbbuf, &nspbuf, &relbuf, pattern, false);
+	info->pattern = pattern;
+	if (dbbuf.data[0])
+	{
+		opts.dbpattern = true;
+		info->db_regex = pstrdup(dbbuf.data);
+	}
+	if (nspbuf.data[0])
+		info->nsp_regex = pstrdup(nspbuf.data);
+	if (relbuf.data[0])
+		info->rel_regex = pstrdup(relbuf.data);
+
+	termPQExpBuffer(&dbbuf);
+	termPQExpBuffer(&nspbuf);
+	termPQExpBuffer(&relbuf);
+
+	info->heap_only = heap_only;
+	info->btree_only = btree_only;
+}
+
+/*
+ * append_relation_pattern
+ *
+ * Adds the given pattern interpreted as a relation pattern, to be matched
+ * against both heap tables and btree indexes.
+ *
+ * pia: the pattern info array to append to
+ * pattern: the relation name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_relation_pattern(PatternInfoArray *pia, const char *pattern, int encoding)
+{
+	append_relation_pattern_helper(pia, pattern, encoding, false, false);
+}
+
+/*
+ * append_heap_pattern
+ *
+ * Adds the given pattern interpreted as a relation pattern, to be matched only
+ * against heap tables.
+ *
+ * pia: the pattern info array to append to
+ * pattern: the relation name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_heap_pattern(PatternInfoArray *pia, const char *pattern, int encoding)
+{
+	append_relation_pattern_helper(pia, pattern, encoding, true, false);
+}
+
+/*
+ * append_btree_pattern
+ *
+ * Adds the given pattern interpreted as a relation pattern, to be matched only
+ * against btree indexes.
+ *
+ * pia: the pattern info array to append to
+ * pattern: the relation name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_btree_pattern(PatternInfoArray *pia, const char *pattern, int encoding)
+{
+	append_relation_pattern_helper(pia, pattern, encoding, false, true);
+}
+
+/*
+ * append_db_pattern_cte
+ *
+ * Appends to the buffer the body of a Common Table Expression (CTE) containing
+ * the database portions extracted from the list of patterns, expressed as two
+ * columns:
+ *
+ *     pattern_id: the index of this pattern in pia->data[]
+ *     rgx: the database regular expression parsed from the pattern
+ *
+ * Patterns without a database portion are skipped.  Patterns with more than
+ * just a database portion are optionally skipped, depending on argument
+ * 'inclusive'.
+ *
+ * buf: the buffer to append to
+ * pia: the array of patterns to be inserted into the CTE
+ * conn: the database connection
+ * inclusive: whether to include patterns with schema and/or relation parts
+ *
+ * Returns whether any database patterns were appended.
+ */
+static bool
+append_db_pattern_cte(PQExpBuffer buf, const PatternInfoArray *pia,
+					  PGconn *conn, bool inclusive)
+{
+	int			pattern_id;
+	const char *comma;
+	bool		have_values;
+
+	comma = "";
+	have_values = false;
+	for (pattern_id = 0; pattern_id < pia->len; pattern_id++)
+	{
+		PatternInfo *info = &pia->data[pattern_id];
+
+		if (info->db_regex != NULL &&
+			(inclusive || (info->nsp_regex == NULL && info->rel_regex == NULL)))
+		{
+			if (!have_values)
+				appendPQExpBufferStr(buf, "\nVALUES");
+			have_values = true;
+			appendPQExpBuffer(buf, "%s\n(%d, ", comma, pattern_id);
+			appendStringLiteralConn(buf, info->db_regex, conn);
+			appendPQExpBufferStr(buf, ")");
+			comma = ",";
+		}
+	}
+
+	if (!have_values)
+		appendPQExpBufferStr(buf, "\nSELECT NULL, NULL WHERE false");
+
+	return have_values;
+}
+
+/*
+ * compile_database_list
+ *
+ * If any database patterns exist, or if --all was given, compiles a distinct
+ * list of databases to check using a SQL query based on the patterns plus the
+ * literal initial database name, if given.  If no database patterns exist and
+ * --all was not given, the query is not necessary, and only the initial
+ * database name (if any) is added to the list.
+ *
+ * conn: connection to the initial database
+ * databases: the list onto which databases should be appended
+ * initial_dbname: an optional extra database name to include in the list
+ */
+static void
+compile_database_list(PGconn *conn, SimplePtrList *databases,
+					  const char *initial_dbname)
+{
+	PGresult   *res;
+	PQExpBufferData sql;
+	int			ntups;
+	int			i;
+	bool		fatal;
+
+	if (initial_dbname)
+	{
+		DatabaseInfo *dat = (DatabaseInfo *) pg_malloc0(sizeof(DatabaseInfo));
+
+		/* This database is included.  Add to list */
+		if (opts.verbose)
+			pg_log_info("including database: \"%s\"", initial_dbname);
+
+		dat->datname = pstrdup(initial_dbname);
+		simple_ptr_list_append(databases, dat);
+	}
+
+	initPQExpBuffer(&sql);
+
+	/* Append the include patterns CTE. */
+	appendPQExpBufferStr(&sql, "WITH include_raw (pattern_id, rgx) AS (");
+	if (!append_db_pattern_cte(&sql, &opts.include, conn, true) &&
+		!opts.alldb)
+	{
+		/*
+		 * None of the inclusion patterns (if any) contain database portions,
+		 * so there is no need to query the database to resolve database
+		 * patterns.
+		 *
+		 * Since we're also not operating under --all, we don't need to query
+		 * the exhaustive list of connectable databases, either.
+		 */
+		termPQExpBuffer(&sql);
+		return;
+	}
+
+	/* Append the exclude patterns CTE. */
+	appendPQExpBufferStr(&sql, "),\nexclude_raw (pattern_id, rgx) AS (");
+	append_db_pattern_cte(&sql, &opts.exclude, conn, false);
+	appendPQExpBufferStr(&sql, "),");
+
+	/*
+	 * Append the database CTE, which includes whether each database is
+	 * connectable and also joins against exclude_raw to determine whether
+	 * each database is excluded.
+	 */
+	appendPQExpBufferStr(&sql,
+						 "\ndatabase (datname) AS ("
+						 "\nSELECT d.datname "
+						 "FROM pg_catalog.pg_database d "
+						 "LEFT OUTER JOIN exclude_raw e "
+						 "ON d.datname ~ e.rgx "
+						 "\nWHERE d.datallowconn "
+						 "AND e.pattern_id IS NULL"
+						 "),"
+
+	/*
+	 * Append the include_pat CTE, which joins the include_raw CTE against the
+	 * databases CTE to determine if all the inclusion patterns had matches,
+	 * and whether each matched pattern had the misfortune of only matching
+	 * excluded or unconnectable databases.
+	 */
+						 "\ninclude_pat (pattern_id, checkable) AS ("
+						 "\nSELECT i.pattern_id, "
+						 "COUNT(*) FILTER ("
+						 "WHERE d IS NOT NULL"
+						 ") AS checkable"
+						 "\nFROM include_raw i "
+						 "LEFT OUTER JOIN database d "
+						 "ON d.datname ~ i.rgx"
+						 "\nGROUP BY i.pattern_id"
+						 "),"
+
+	/*
+	 * Append the filtered_databases CTE, which selects from the database CTE
+	 * optionally joined against the include_raw CTE to only select databases
+	 * that match an inclusion pattern.  This appears to duplicate what the
+	 * include_pat CTE already did above, but here we want only databases, and
+	 * there we wanted patterns.
+	 */
+						 "\nfiltered_databases (datname) AS ("
+						 "\nSELECT DISTINCT d.datname "
+						 "FROM database d");
+	if (!opts.alldb)
+		appendPQExpBufferStr(&sql,
+							 " INNER JOIN include_raw i "
+							 "ON d.datname ~ i.rgx");
+	appendPQExpBufferStr(&sql,
+						 ")"
+
+	/*
+	 * Select the checkable databases and the unmatched inclusion patterns.
+	 */
+						 "\nSELECT pattern_id, datname FROM ("
+						 "\nSELECT pattern_id, NULL::TEXT AS datname "
+						 "FROM include_pat "
+						 "WHERE checkable = 0 "
+						 "UNION ALL"
+						 "\nSELECT NULL, datname "
+						 "FROM filtered_databases"
+						 ") AS combined_records"
+						 "\nORDER BY pattern_id NULLS LAST, datname");
+
+	res = executeQuery(conn, sql.data, opts.echo);
+	if (PQresultStatus(res) != PGRES_TUPLES_OK)
+	{
+		pg_log_error("query failed: %s", PQerrorMessage(conn));
+		pg_log_info("query was: %s", sql.data);
+		disconnectDatabase(conn);
+		exit(1);
+	}
+	termPQExpBuffer(&sql);
+
+	ntups = PQntuples(res);
+	for (fatal = false, i = 0; i < ntups; i++)
+	{
+		int			pattern_id = -1;
+		const char *datname = NULL;
+
+		if (!PQgetisnull(res, i, 0))
+			pattern_id = atoi(PQgetvalue(res, i, 0));
+		if (!PQgetisnull(res, i, 1))
+			datname = PQgetvalue(res, i, 1);
+
+		if (pattern_id >= 0)
+		{
+			/*
+			 * Current record pertains to an inclusion pattern that matched no
+			 * checkable databases.
+			 */
+			fatal = opts.strict_names;
+			if (pattern_id >= opts.include.len)
+			{
+				pg_log_error("internal error: received unexpected database pattern_id %d",
+							 pattern_id);
+				exit(1);
+			}
+			log_no_match("no connectable databases to check matching \"%s\"",
+						 opts.include.data[pattern_id].pattern);
+		}
+		else
+		{
+			DatabaseInfo *dat;
+
+			/* Current record pertains to a database */
+			Assert(datname != NULL);
+
+			/* Avoid entering a duplicate entry matching the initial_dbname */
+			if (initial_dbname != NULL && strcmp(initial_dbname, datname) == 0)
+				continue;
+
+			/* This database is included.  Add to list */
+			if (opts.verbose)
+				pg_log_info("including database: \"%s\"", datname);
+
+			dat = (DatabaseInfo *) pg_malloc0(sizeof(DatabaseInfo));
+			dat->datname = pstrdup(datname);
+			simple_ptr_list_append(databases, dat);
+		}
+	}
+	PQclear(res);
+
+	if (fatal)
+	{
+		if (conn != NULL)
+			disconnectDatabase(conn);
+		exit(1);
+	}
+}
+
+/*
+ * append_rel_pattern_raw_cte
+ *
+ * Appends to the buffer the body of a Common Table Expression (CTE) containing
+ * the given patterns as six columns:
+ *
+ *     pattern_id: the index of this pattern in pia->data[]
+ *     db_regex: the database regexp parsed from the pattern, or NULL if the
+ *               pattern had no database part
+ *     nsp_regex: the namespace regexp parsed from the pattern, or NULL if the
+ *                pattern had no namespace part
+ *     rel_regex: the relname regexp parsed from the pattern, or NULL if the
+ *                pattern had no relname part
+ *     heap_only: true if the pattern applies only to heap tables (not indexes)
+ *     btree_only: true if the pattern applies only to btree indexes (not tables)
+ *
+ * buf: the buffer to append to
+ * pia: the array of patterns to be inserted into the CTE
+ * conn: the database connection
+ */
+static void
+append_rel_pattern_raw_cte(PQExpBuffer buf, const PatternInfoArray *pia,
+						   PGconn *conn)
+{
+	int			pattern_id;
+	const char *comma;
+	bool		have_values;
+
+	comma = "";
+	have_values = false;
+	for (pattern_id = 0; pattern_id < pia->len; pattern_id++)
+	{
+		PatternInfo *info = &pia->data[pattern_id];
+
+		if (!have_values)
+			appendPQExpBufferStr(buf, "\nVALUES");
+		have_values = true;
+		appendPQExpBuffer(buf, "%s\n(%d::INTEGER, ", comma, pattern_id);
+		if (info->db_regex == NULL)
+			appendPQExpBufferStr(buf, "NULL");
+		else
+			appendStringLiteralConn(buf, info->db_regex, conn);
+		appendPQExpBufferStr(buf, "::TEXT, ");
+		if (info->nsp_regex == NULL)
+			appendPQExpBufferStr(buf, "NULL");
+		else
+			appendStringLiteralConn(buf, info->nsp_regex, conn);
+		appendPQExpBufferStr(buf, "::TEXT, ");
+		if (info->rel_regex == NULL)
+			appendPQExpBufferStr(buf, "NULL");
+		else
+			appendStringLiteralConn(buf, info->rel_regex, conn);
+		if (info->heap_only)
+			appendPQExpBufferStr(buf, "::TEXT, true::BOOLEAN");
+		else
+			appendPQExpBufferStr(buf, "::TEXT, false::BOOLEAN");
+		if (info->btree_only)
+			appendPQExpBufferStr(buf, ", true::BOOLEAN");
+		else
+			appendPQExpBufferStr(buf, ", false::BOOLEAN");
+		appendPQExpBufferStr(buf, ")");
+		comma = ",";
+	}
+
+	if (!have_values)
+		appendPQExpBufferStr(buf,
+							 "\nSELECT NULL::INTEGER, NULL::TEXT, NULL::TEXT, "
+							 "NULL::TEXT, NULL::BOOLEAN, NULL::BOOLEAN "
+							 "WHERE false");
+}
+
+/*
+ * append_rel_pattern_filtered_cte
+ *
+ * Appends to the buffer a Common Table Expression (CTE) which selects
+ * all patterns from the named raw CTE, filtered by database.  All patterns
+ * which have no database portion or whose database portion matches our
+ * connection's database name are selected, with other patterns excluded.
+ *
+ * The basic idea here is that if we're connected to database "foo" and we have
+ * patterns "foo.bar.baz", "alpha.beta" and "one.two.three", we only want to
+ * use the first two while processing relations in this database, as the third
+ * one is not relevant.
+ *
+ * buf: the buffer to append to
+ * raw: the name of the CTE to select from
+ * filtered: the name of the CTE to create
+ * conn: the database connection
+ */
+static void
+append_rel_pattern_filtered_cte(PQExpBuffer buf, const char *raw,
+								const char *filtered, PGconn *conn)
+{
+	appendPQExpBuffer(buf,
+					  "\n%s (pattern_id, nsp_regex, rel_regex, heap_only, btree_only) AS ("
+					  "\nSELECT pattern_id, nsp_regex, rel_regex, heap_only, btree_only "
+					  "FROM %s r"
+					  "\nWHERE (r.db_regex IS NULL "
+					  "OR ",
+					  filtered, raw);
+	appendStringLiteralConn(buf, PQdb(conn), conn);
+	appendPQExpBufferStr(buf, " ~ r.db_regex)");
+	appendPQExpBufferStr(buf,
+						 " AND (r.nsp_regex IS NOT NULL"
+						 " OR r.rel_regex IS NOT NULL)"
+						 "),");
+}
+
+/*
+ * compile_relation_list_one_db
+ *
+ * Compiles a list of relations to check within the currently connected
+ * database based on the user supplied options, sorted by descending size,
+ * and appends them to the given list of relations.
+ *
+ * The cells of the constructed list contain all information about the relation
+ * necessary to connect to the database and check the object, including which
+ * database to connect to, where contrib/amcheck is installed, and the Oid and
+ * type of object (heap table vs. btree index).  Rather than duplicating the
+ * database details per relation, the relation structs use references to the
+ * same database object, provided by the caller.
+ *
+ * conn: connection to the database, which should be the same one named in 'dat'
+ * relations: list onto which the relations information should be appended
+ * dat: the database info struct for use by each relation
+ * pagecount: gets incremented by the number of blocks to check in all
+ * relations added
+ */
+static void
+compile_relation_list_one_db(PGconn *conn, SimplePtrList *relations,
+							 const DatabaseInfo *dat,
+							 uint64 *pagecount)
+{
+	PGresult   *res;
+	PQExpBufferData sql;
+	int			ntups;
+	int			i;
+
+	initPQExpBuffer(&sql);
+	appendPQExpBufferStr(&sql, "WITH");
+
+	/* Append CTEs for the relation inclusion patterns, if any */
+	if (!opts.allrel)
+	{
+		appendPQExpBufferStr(&sql,
+							 " include_raw (pattern_id, db_regex, nsp_regex, rel_regex, heap_only, btree_only) AS (");
+		append_rel_pattern_raw_cte(&sql, &opts.include, conn);
+		appendPQExpBufferStr(&sql, "\n),");
+		append_rel_pattern_filtered_cte(&sql, "include_raw", "include_pat", conn);
+	}
+
+	/* Append CTEs for the relation exclusion patterns, if any */
+	if (opts.excludetbl || opts.excludeidx || opts.excludensp)
+	{
+		appendPQExpBufferStr(&sql,
+							 " exclude_raw (pattern_id, db_regex, nsp_regex, rel_regex, heap_only, btree_only) AS (");
+		append_rel_pattern_raw_cte(&sql, &opts.exclude, conn);
+		appendPQExpBufferStr(&sql, "\n),");
+		append_rel_pattern_filtered_cte(&sql, "exclude_raw", "exclude_pat", conn);
+	}
+
+	/* Append the relation CTE. */
+	appendPQExpBufferStr(&sql,
+						 " relation (pattern_id, oid, nspname, relname, reltoastrelid, relpages, is_heap, is_btree) AS ("
+						 "\nSELECT DISTINCT ON (c.oid");
+	if (!opts.allrel)
+		appendPQExpBufferStr(&sql, ", ip.pattern_id) ip.pattern_id,");
+	else
+		appendPQExpBufferStr(&sql, ") NULL::INTEGER AS pattern_id,");
+	appendPQExpBuffer(&sql,
+					  "\nc.oid, n.nspname, c.relname, c.reltoastrelid, c.relpages, "
+					  "c.relam = %u AS is_heap, "
+					  "c.relam = %u AS is_btree"
+					  "\nFROM pg_catalog.pg_class c "
+					  "INNER JOIN pg_catalog.pg_namespace n "
+					  "ON c.relnamespace = n.oid",
+					  HEAP_TABLE_AM_OID, BTREE_AM_OID);
+	if (!opts.allrel)
+		appendPQExpBuffer(&sql,
+						  "\nINNER JOIN include_pat ip"
+						  "\nON (n.nspname ~ ip.nsp_regex OR ip.nsp_regex IS NULL)"
+						  "\nAND (c.relname ~ ip.rel_regex OR ip.rel_regex IS NULL)"
+						  "\nAND (c.relam = %u OR NOT ip.heap_only)"
+						  "\nAND (c.relam = %u OR NOT ip.btree_only)",
+						  HEAP_TABLE_AM_OID, BTREE_AM_OID);
+	if (opts.excludetbl || opts.excludeidx || opts.excludensp)
+		appendPQExpBuffer(&sql,
+						  "\nLEFT OUTER JOIN exclude_pat ep"
+						  "\nON (n.nspname ~ ep.nsp_regex OR ep.nsp_regex IS NULL)"
+						  "\nAND (c.relname ~ ep.rel_regex OR ep.rel_regex IS NULL)"
+						  "\nAND (c.relam = %u OR NOT ep.heap_only OR ep.rel_regex IS NULL)"
+						  "\nAND (c.relam = %u OR NOT ep.btree_only OR ep.rel_regex IS NULL)",
+						  HEAP_TABLE_AM_OID, BTREE_AM_OID);
+
+	if (opts.excludetbl || opts.excludeidx || opts.excludensp)
+		appendPQExpBufferStr(&sql, "\nWHERE ep.pattern_id IS NULL");
+	else
+		appendPQExpBufferStr(&sql, "\nWHERE true");
+
+	/*
+	 * We need to be careful not to break the --no-dependent-toast and
+	 * --no-dependent-indexes options.  By default, the btree indexes, toast
+	 * tables, and toast table btree indexes associated with primary heap
+	 * tables are included, using their own CTEs below.  We implement the
+	 * --exclude-* options by not creating those CTEs, but that's no use if
+	 * we've already selected the toast and indexes here.  On the other hand,
+	 * we want inclusion patterns that match indexes or toast tables to be
+	 * honored.  So, if inclusion patterns were given, we want to select all
+	 * tables, toast tables, or indexes that match the patterns.  But if no
+	 * inclusion patterns were given, and we're simply matching all relations,
+	 * then we only want to match the primary tables here.
+	 */
+	if (opts.allrel)
+		appendPQExpBuffer(&sql,
+						  " AND c.relam = %u "
+						  "AND c.relkind IN ('r', 'm', 't') "
+						  "AND c.relnamespace != %u",
+						  HEAP_TABLE_AM_OID, PG_TOAST_NAMESPACE);
+	else
+		appendPQExpBuffer(&sql,
+						  " AND c.relam IN (%u, %u) "
+						  "AND c.relkind IN ('r', 'm', 't', 'i') "
+						  "AND ((c.relam = %u AND c.relkind IN ('r', 'm', 't')) OR "
+						  "(c.relam = %u AND c.relkind = 'i'))",
+						  HEAP_TABLE_AM_OID, BTREE_AM_OID,
+						  HEAP_TABLE_AM_OID, BTREE_AM_OID);
+
+	appendPQExpBufferStr(&sql,
+						 "\nORDER BY c.oid)");
+
+	if (!opts.no_toast_expansion)
+	{
+		/*
+		 * Include a CTE for toast tables associated with primary heap tables
+		 * selected above, filtering by exclusion patterns (if any) that match
+		 * toast table names.
+		 */
+		appendPQExpBufferStr(&sql,
+							 ", toast (oid, nspname, relname, relpages) AS ("
+							 "\nSELECT t.oid, 'pg_toast', t.relname, t.relpages"
+							 "\nFROM pg_catalog.pg_class t "
+							 "INNER JOIN relation r "
+							 "ON r.reltoastrelid = t.oid");
+		if (opts.excludetbl || opts.excludensp)
+			appendPQExpBufferStr(&sql,
+								 "\nLEFT OUTER JOIN exclude_pat ep"
+								 "\nON ('pg_toast' ~ ep.nsp_regex OR ep.nsp_regex IS NULL)"
+								 "\nAND (t.relname ~ ep.rel_regex OR ep.rel_regex IS NULL)"
+								 "\nAND ep.heap_only"
+								 "\nWHERE ep.pattern_id IS NULL");
+		appendPQExpBufferStr(&sql,
+							 "\n)");
+	}
+	if (!opts.no_btree_expansion)
+	{
+		/*
+		 * Include a CTE for btree indexes associated with primary heap tables
+		 * selected above, filtering by exclusion patterns (if any) that match
+		 * btree index names.
+		 */
+		appendPQExpBuffer(&sql,
+						  ", index (oid, nspname, relname, relpages) AS ("
+						  "\nSELECT c.oid, r.nspname, c.relname, c.relpages "
+						  "FROM relation r"
+						  "\nINNER JOIN pg_catalog.pg_index i "
+						  "ON r.oid = i.indrelid "
+						  "INNER JOIN pg_catalog.pg_class c "
+						  "ON i.indexrelid = c.oid");
+		if (opts.excludeidx || opts.excludensp)
+			appendPQExpBufferStr(&sql,
+								 "\nINNER JOIN pg_catalog.pg_namespace n "
+								 "ON c.relnamespace = n.oid"
+								 "\nLEFT OUTER JOIN exclude_pat ep "
+								 "ON (n.nspname ~ ep.nsp_regex OR ep.nsp_regex IS NULL) "
+								 "AND (c.relname ~ ep.rel_regex OR ep.rel_regex IS NULL) "
+								 "AND ep.btree_only"
+								 "\nWHERE ep.pattern_id IS NULL");
+		else
+			appendPQExpBufferStr(&sql,
+								 "\nWHERE true");
+		appendPQExpBuffer(&sql,
+						  " AND c.relam = %u "
+						  "AND c.relkind = 'i'",
+						  BTREE_AM_OID);
+		if (opts.no_toast_expansion)
+			appendPQExpBuffer(&sql,
+							  " AND c.relnamespace != %u",
+							  PG_TOAST_NAMESPACE);
+		appendPQExpBufferStr(&sql, "\n)");
+	}
+
+	if (!opts.no_toast_expansion && !opts.no_btree_expansion)
+	{
+		/*
+		 * Include a CTE for btree indexes associated with toast tables of
+		 * primary heap tables selected above, filtering by exclusion patterns
+		 * (if any) that match the toast index names.
+		 */
+		appendPQExpBuffer(&sql,
+						  ", toast_index (oid, nspname, relname, relpages) AS ("
+						  "\nSELECT c.oid, 'pg_toast', c.relname, c.relpages "
+						  "FROM toast t "
+						  "INNER JOIN pg_catalog.pg_index i "
+						  "ON t.oid = i.indrelid"
+						  "\nINNER JOIN pg_catalog.pg_class c "
+						  "ON i.indexrelid = c.oid");
+		if (opts.excludeidx)
+			appendPQExpBufferStr(&sql,
+								 "\nLEFT OUTER JOIN exclude_pat ep "
+								 "ON ('pg_toast' ~ ep.nsp_regex OR ep.nsp_regex IS NULL) "
+								 "AND (c.relname ~ ep.rel_regex OR ep.rel_regex IS NULL) "
+								 "AND ep.btree_only "
+								 "WHERE ep.pattern_id IS NULL");
+		else
+			appendPQExpBufferStr(&sql,
+								 "\nWHERE true");
+		appendPQExpBuffer(&sql,
+						  " AND c.relam = %u"
+						  " AND c.relkind = 'i')",
+						  BTREE_AM_OID);
+	}
+
+	/*
+	 * Roll-up distinct rows from CTEs.
+	 *
+	 * Relations that match more than one pattern may occur more than once in
+	 * the list, and indexes and toast for primary relations may also have
+	 * matched in their own right, so we rely on UNION to deduplicate the
+	 * list.
+	 */
+	appendPQExpBuffer(&sql,
+					  "\nSELECT pattern_id, is_heap, is_btree, oid, nspname, relname, relpages "
+					  "FROM (");
+	appendPQExpBufferStr(&sql,
+	/* Inclusion patterns that failed to match */
+						 "\nSELECT pattern_id, is_heap, is_btree, "
+						 "NULL::OID AS oid, "
+						 "NULL::TEXT AS nspname, "
+						 "NULL::TEXT AS relname, "
+						 "NULL::INTEGER AS relpages"
+						 "\nFROM relation "
+						 "WHERE pattern_id IS NOT NULL "
+						 "UNION"
+	/* Primary relations */
+						 "\nSELECT NULL::INTEGER AS pattern_id, "
+						 "is_heap, is_btree, oid, nspname, relname, relpages "
+						 "FROM relation");
+	if (!opts.no_toast_expansion)
+		appendPQExpBufferStr(&sql,
+							 " UNION"
+		/* Toast tables for primary relations */
+							 "\nSELECT NULL::INTEGER AS pattern_id, TRUE AS is_heap, "
+							 "FALSE AS is_btree, oid, nspname, relname, relpages "
+							 "FROM toast");
+	if (!opts.no_btree_expansion)
+		appendPQExpBufferStr(&sql,
+							 " UNION"
+		/* Indexes for primary relations */
+							 "\nSELECT NULL::INTEGER AS pattern_id, FALSE AS is_heap, "
+							 "TRUE AS is_btree, oid, nspname, relname, relpages "
+							 "FROM index");
+	if (!opts.no_toast_expansion && !opts.no_btree_expansion)
+		appendPQExpBufferStr(&sql,
+							 " UNION"
+		/* Indexes for toast relations */
+							 "\nSELECT NULL::INTEGER AS pattern_id, FALSE AS is_heap, "
+							 "TRUE AS is_btree, oid, nspname, relname, relpages "
+							 "FROM toast_index");
+	appendPQExpBufferStr(&sql,
+						 "\n) AS combined_records "
+						 "ORDER BY relpages DESC NULLS FIRST, oid");
+
+	res = executeQuery(conn, sql.data, opts.echo);
+	if (PQresultStatus(res) != PGRES_TUPLES_OK)
+	{
+		pg_log_error("query failed: %s", PQerrorMessage(conn));
+		pg_log_info("query was: %s", sql.data);
+		disconnectDatabase(conn);
+		exit(1);
+	}
+	termPQExpBuffer(&sql);
+
+	ntups = PQntuples(res);
+	for (i = 0; i < ntups; i++)
+	{
+		int			pattern_id = -1;
+		bool		is_heap = false;
+		bool		is_btree = false;
+		Oid			oid = InvalidOid;
+		const char *nspname = NULL;
+		const char *relname = NULL;
+		int			relpages = 0;
+
+		if (!PQgetisnull(res, i, 0))
+			pattern_id = atoi(PQgetvalue(res, i, 0));
+		if (!PQgetisnull(res, i, 1))
+			is_heap = (PQgetvalue(res, i, 1)[0] == 't');
+		if (!PQgetisnull(res, i, 2))
+			is_btree = (PQgetvalue(res, i, 2)[0] == 't');
+		if (!PQgetisnull(res, i, 3))
+			oid = atooid(PQgetvalue(res, i, 3));
+		if (!PQgetisnull(res, i, 4))
+			nspname = PQgetvalue(res, i, 4);
+		if (!PQgetisnull(res, i, 5))
+			relname = PQgetvalue(res, i, 5);
+		if (!PQgetisnull(res, i, 6))
+			relpages = atoi(PQgetvalue(res, i, 6));
+
+		if (pattern_id >= 0)
+		{
+			/*
+			 * Current record pertains to an inclusion pattern.  Record that
+			 * it matched.
+			 */
+
+			if (pattern_id >= opts.include.len)
+			{
+				pg_log_error("internal error: received unexpected relation pattern_id %d",
+							 pattern_id);
+				exit(1);
+			}
+
+			opts.include.data[pattern_id].matched = true;
+		}
+		else
+		{
+			/* Current record pertains to a relation */
+
+			RelationInfo *rel = (RelationInfo *) pg_malloc0(sizeof(RelationInfo));
+
+			Assert(OidIsValid(oid));
+			Assert((is_heap && !is_btree) || (is_btree && !is_heap));
+
+			rel->datinfo = dat;
+			rel->reloid = oid;
+			rel->is_heap = is_heap;
+			rel->nspname = pstrdup(nspname);
+			rel->relname = pstrdup(relname);
+			rel->relpages = relpages;
+			rel->blocks_to_check = relpages;
+			if (is_heap && (opts.startblock >= 0 || opts.endblock >= 0))
+			{
+				/*
+				 * We apply --startblock and --endblock to heap tables, but
+				 * not btree indexes, and for progress purposes we need to
+				 * track how many blocks we expect to check.
+				 */
+				if (opts.endblock >= 0 && rel->blocks_to_check > opts.endblock)
+					rel->blocks_to_check = opts.endblock + 1;
+				if (opts.startblock >= 0)
+				{
+					if (rel->blocks_to_check > opts.startblock)
+						rel->blocks_to_check -= opts.startblock;
+					else
+						rel->blocks_to_check = 0;
+				}
+			}
+			*pagecount += rel->blocks_to_check;
+
+			simple_ptr_list_append(relations, rel);
+		}
+	}
+	PQclear(res);
+}
diff --git a/contrib/pg_amcheck/t/001_basic.pl b/contrib/pg_amcheck/t/001_basic.pl
new file mode 100644
index 0000000000..dfa0ae9e06
--- /dev/null
+++ b/contrib/pg_amcheck/t/001_basic.pl
@@ -0,0 +1,9 @@
+use strict;
+use warnings;
+
+use TestLib;
+use Test::More tests => 8;
+
+program_help_ok('pg_amcheck');
+program_version_ok('pg_amcheck');
+program_options_handling_ok('pg_amcheck');
diff --git a/contrib/pg_amcheck/t/002_nonesuch.pl b/contrib/pg_amcheck/t/002_nonesuch.pl
new file mode 100644
index 0000000000..b7d41c9b49
--- /dev/null
+++ b/contrib/pg_amcheck/t/002_nonesuch.pl
@@ -0,0 +1,248 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 76;
+
+# Test set-up
+my ($node, $port);
+$node = get_new_node('test');
+$node->init;
+$node->start;
+$port = $node->port;
+
+# Load the amcheck extension, upon which pg_amcheck depends
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+
+#########################################
+# Test non-existent databases
+
+# Failing to connect to the initial database is an error.
+$node->command_checks_all(
+	[ 'pg_amcheck', 'qqq' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/FATAL:  database "qqq" does not exist/ ],
+	'checking a non-existent database');
+
+# Failing to resolve a database pattern is an error by default.
+$node->command_checks_all(
+	[ 'pg_amcheck', '-d', 'qqq', '-d', 'postgres' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no connectable databases to check matching "qqq"/ ],
+	'checking an unresolvable database pattern');
+
+# But only a warning under --no-strict-names
+$node->command_checks_all(
+	[ 'pg_amcheck', '--no-strict-names', '-d', 'qqq', '-d', 'postgres' ],
+	0,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: no connectable databases to check matching "qqq"/ ],
+	'checking an unresolvable database pattern under --no-strict-names');
+
+# Check that a substring of an existent database name does not get interpreted
+# as a matching pattern.
+$node->command_checks_all(
+	[ 'pg_amcheck', '-d', 'post', '-d', 'postgres' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no connectable databases to check matching "post"/ ],
+	'checking an unresolvable database pattern (substring of existent database)');
+
+# Check that a superstring of an existent database name does not get interpreted
+# as a matching pattern.
+$node->command_checks_all(
+	[ 'pg_amcheck', '-d', 'postgresql', '-d', 'postgres' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no connectable databases to check matching "postgresql"/ ],
+	'checking an unresolvable database pattern (superstring of existent database)');
+
+#########################################
+# Test connecting with a non-existent user
+
+# Failing to connect to the initial database due to bad username is an error.
+$node->command_checks_all(
+	[ 'pg_amcheck', '-U', 'no_such_user', 'postgres' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/role "no_such_user" does not exist/ ],
+	'checking with a non-existent user');
+
+# Failing to connect to the initial database due to a bad username is still an
+# error under --no-strict-names.
+$node->command_checks_all(
+	[ 'pg_amcheck', '--no-strict-names', '-U', 'no_such_user', 'postgres' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/role "no_such_user" does not exist/ ],
+	'checking with a non-existent user under --no-strict-names');
+
+#########################################
+# Test checking databases without amcheck installed
+
+# Attempting to check a database by name where amcheck is not installed should
+# raise a warning.  If all databases are skipped, having no relations to check
+# raises an error.
+$node->command_checks_all(
+	[ 'pg_amcheck', 'template1' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/,
+	  qr/pg_amcheck: error: no relations to check/ ],
+	'checking a database by name without amcheck installed, no other databases');
+
+# Again, but this time with another database to check, so no error is raised.
+$node->command_checks_all(
+	[ 'pg_amcheck', '-d', 'template1', '-d', 'postgres' ],
+	0,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/ ],
+	'checking a database by name without amcheck installed, with other databases');
+
+# Again, but by way of checking all databases
+$node->command_checks_all(
+	[ 'pg_amcheck', '--all' ],
+	0,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/ ],
+	'checking a database by pattern without amcheck installed, with other databases');
+
+#########################################
+# Test unreasonable patterns
+
+# Check three-part unreasonable pattern that has zero-length names
+$node->command_checks_all(
+	[ 'pg_amcheck', '-d', 'postgres', '-t', '..' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no connectable databases to check matching "\.\."/ ],
+	'checking table pattern ".."');
+
+# Again, but with non-trivial schema and relation parts
+$node->command_checks_all(
+	[ 'pg_amcheck', '-d', 'postgres', '-t', '.foo.bar' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no connectable databases to check matching "\.foo\.bar"/ ],
+	'checking table pattern ".foo.bar"');
+
+# Check two-part unreasonable pattern that has zero-length names
+$node->command_checks_all(
+	[ 'pg_amcheck', '-d', 'postgres', '-t', '.' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no heap tables to check matching "\."/ ],
+	'checking table pattern "."');
+
+#########################################
+# Test checking non-existent databases, schemas, tables, and indexes
+
+# Use --no-strict-names and a single existent table so we only get warnings
+# about the failed pattern matches
+$node->command_checks_all(
+	[ 'pg_amcheck', '--no-strict-names',
+		'-t', 'no_such_table',
+		'-t', 'no*such*table',
+		'-i', 'no_such_index',
+		'-i', 'no*such*index',
+		'-r', 'no_such_relation',
+		'-r', 'no*such*relation',
+		'-d', 'no_such_database',
+		'-d', 'no*such*database',
+		'-r', 'none.none',
+		'-r', 'none.none.none',
+		'-r', 'this.is.a.really.long.dotted.string',
+		'-r', 'postgres.none.none',
+		'-r', 'postgres.long.dotted.string',
+		'-r', 'postgres.pg_catalog.none',
+		'-r', 'postgres.none.pg_class',
+		'-t', 'postgres.pg_catalog.pg_class',	# This exists
+	],
+	0,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: no heap tables to check matching "no_such_table"/,
+	  qr/pg_amcheck: warning: no heap tables to check matching "no\*such\*table"/,
+	  qr/pg_amcheck: warning: no btree indexes to check matching "no_such_index"/,
+	  qr/pg_amcheck: warning: no btree indexes to check matching "no\*such\*index"/,
+	  qr/pg_amcheck: warning: no relations to check matching "no_such_relation"/,
+	  qr/pg_amcheck: warning: no relations to check matching "no\*such\*relation"/,
+	  qr/pg_amcheck: warning: no heap tables to check matching "no\*such\*table"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "no_such_database"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "no\*such\*database"/,
+	  qr/pg_amcheck: warning: no relations to check matching "none\.none"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "none\.none\.none"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "this\.is\.a\.really\.long\.dotted\.string"/,
+	  qr/pg_amcheck: warning: no relations to check matching "postgres\.none\.none"/,
+	  qr/pg_amcheck: warning: no relations to check matching "postgres\.long\.dotted\.string"/,
+	  qr/pg_amcheck: warning: no relations to check matching "postgres\.pg_catalog\.none"/,
+	  qr/pg_amcheck: warning: no relations to check matching "postgres\.none\.pg_class"/,
+	],
+	'many unmatched patterns and one matched pattern under --no-strict-names');
+
+#########################################
+# Test checking otherwise existent objects but in databases where they do not exist
+
+$node->safe_psql('postgres', q(
+	CREATE TABLE public.foo (f integer);
+	CREATE INDEX foo_idx ON foo(f);
+));
+$node->safe_psql('postgres', q(CREATE DATABASE another_db));
+
+$node->command_checks_all(
+	[ 'pg_amcheck', '-d', 'postgres', '--no-strict-names',
+		'-t', 'template1.public.foo',
+		'-t', 'another_db.public.foo',
+		'-t', 'no_such_database.public.foo',
+		'-i', 'template1.public.foo_idx',
+		'-i', 'another_db.public.foo_idx',
+		'-i', 'no_such_database.public.foo_idx',
+	],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/,
+	  qr/pg_amcheck: warning: no heap tables to check matching "template1\.public\.foo"/,
+	  qr/pg_amcheck: warning: no heap tables to check matching "another_db\.public\.foo"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "no_such_database\.public\.foo"/,
+	  qr/pg_amcheck: warning: no btree indexes to check matching "template1\.public\.foo_idx"/,
+	  qr/pg_amcheck: warning: no btree indexes to check matching "another_db\.public\.foo_idx"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "no_such_database\.public\.foo_idx"/,
+	  qr/pg_amcheck: error: no relations to check/,
+	],
+	'checking otherwise existent objects in the wrong databases');
+
+
+#########################################
+# Test schema exclusion patterns
+
+# Check with only schema exclusion patterns
+$node->command_checks_all(
+	[ 'pg_amcheck', '--all', '--no-strict-names',
+		'-S', 'public',
+		'-S', 'pg_catalog',
+		'-S', 'pg_toast',
+		'-S', 'information_schema',
+	],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/,
+	  qr/pg_amcheck: error: no relations to check/ ],
+	'schema exclusion patterns exclude all relations');
+
+# Check with schema exclusion patterns overriding relation and schema inclusion patterns
+$node->command_checks_all(
+	[ 'pg_amcheck', '--all', '--no-strict-names',
+		'-s', 'public',
+		'-s', 'pg_catalog',
+		'-s', 'pg_toast',
+		'-s', 'information_schema',
+		'-t', 'pg_catalog.pg_class',
+		'-S', '*'
+	],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/,
+	  qr/pg_amcheck: error: no relations to check/ ],
+	'schema exclusion pattern overrides all inclusion patterns');
diff --git a/contrib/pg_amcheck/t/003_check.pl b/contrib/pg_amcheck/t/003_check.pl
new file mode 100644
index 0000000000..e43ffe7ed6
--- /dev/null
+++ b/contrib/pg_amcheck/t/003_check.pl
@@ -0,0 +1,504 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 60;
+
+my ($node, $port, %corrupt_page, %remove_relation);
+
+# Returns the filesystem path for the named relation.
+#
+# Assumes the test node is running.
+sub relation_filepath($$)
+{
+	my ($dbname, $relname) = @_;
+
+	my $pgdata = $node->data_dir;
+	my $rel = $node->safe_psql($dbname,
+							   qq(SELECT pg_relation_filepath('$relname')));
+	die "path not found for relation $relname" unless defined $rel;
+	return "$pgdata/$rel";
+}
+
+# Returns the name of the toast relation associated with the named relation.
+#
+# Assumes the test node is running.
+sub relation_toast($$)
+{
+	my ($dbname, $relname) = @_;
+
+	my $rel = $node->safe_psql($dbname, qq(
+		SELECT c.reltoastrelid::regclass
+			FROM pg_catalog.pg_class c
+			WHERE c.oid = '$relname'::regclass
+			  AND c.reltoastrelid != 0
+			));
+	return undef unless defined $rel;
+	return $rel;
+}
+
+# Adds the relation file for the given (dbname, relname) to the list
+# to be corrupted by means of overwriting junk in the first page.
+#
+# Assumes the test node is running.
+sub plan_to_corrupt_first_page($$)
+{
+	my ($dbname, $relname) = @_;
+	my $relpath = relation_filepath($dbname, $relname);
+	$corrupt_page{$relpath} = 1;
+}
+
+# Adds the relation file for the given (dbname, relname) to the list
+# to be corrupted by means of removing the file.
+#
+# Assumes the test node is running.
+sub plan_to_remove_relation_file($$)
+{
+	my ($dbname, $relname) = @_;
+	my $relpath = relation_filepath($dbname, $relname);
+	$remove_relation{$relpath} = 1;
+}
+
+# For the given (dbname, relname), if a corresponding toast table
+# exists, adds that toast table's relation file to the list to be
+# corrupted by means of removing the file.
+#
+# Assumes the test node is running.
+sub plan_to_remove_toast_file($$)
+{
+	my ($dbname, $relname) = @_;
+	my $toastname = relation_toast($dbname, $relname);
+	plan_to_remove_relation_file($dbname, $toastname) if ($toastname);
+}
+
+# Corrupts the first page of the given file path
+sub corrupt_first_page($)
+{
+	my ($relpath) = @_;
+
+	my $fh;
+	open($fh, '+<', $relpath)
+		or BAIL_OUT("open failed: $!");
+	binmode $fh;
+
+	# Corrupt some line pointers.  The values are chosen to hit the
+	# various line-pointer-corruption checks in verify_heapam.c
+	# on both little-endian and big-endian architectures.
+	seek($fh, 32, 0)
+		or BAIL_OUT("seek failed: $!");
+	syswrite(
+		$fh,
+		pack("L*",
+			0xAAA15550, 0xAAA0D550, 0x00010000,
+			0x00008000, 0x0000800F, 0x001e8000,
+			0xFFFFFFFF)
+	) or BAIL_OUT("syswrite failed: $!");
+	close($fh)
+		or BAIL_OUT("close failed: $!");
+}
+
+# Stops the node, performs all the corruptions previously planned, and
+# starts the node again.
+#
+sub perform_all_corruptions()
+{
+	$node->stop();
+	for my $relpath (keys %corrupt_page)
+	{
+		corrupt_first_page($relpath);
+	}
+	for my $relpath (keys %remove_relation)
+	{
+		unlink($relpath);
+	}
+	$node->start;
+}
+
+# Test set-up
+$node = get_new_node('test');
+$node->init;
+$node->start;
+$port = $node->port;
+
+for my $dbname (qw(db1 db2 db3))
+{
+	# Create the database
+	$node->safe_psql('postgres', qq(CREATE DATABASE $dbname));
+
+	# Load the amcheck extension, upon which pg_amcheck depends.  Put the
+	# extension in an unexpected location to test that pg_amcheck finds it
+	# correctly.  Create tables with names that look like pg_catalog names to
+	# check that pg_amcheck does not get confused by them.  Create functions in
+	# schema public that look like amcheck functions to check that pg_amcheck
+	# does not use them.
+	$node->safe_psql($dbname, q(
+		CREATE SCHEMA amcheck_schema;
+		CREATE EXTENSION amcheck WITH SCHEMA amcheck_schema;
+		CREATE TABLE amcheck_schema.pg_database (junk text);
+		CREATE TABLE amcheck_schema.pg_namespace (junk text);
+		CREATE TABLE amcheck_schema.pg_class (junk text);
+		CREATE TABLE amcheck_schema.pg_operator (junk text);
+		CREATE TABLE amcheck_schema.pg_proc (junk text);
+		CREATE TABLE amcheck_schema.pg_tablespace (junk text);
+
+		CREATE FUNCTION public.bt_index_check(index regclass,
+											  heapallindexed boolean default false)
+		RETURNS VOID AS $$
+		BEGIN
+			RAISE EXCEPTION 'Invoked wrong bt_index_check!';
+		END;
+		$$ LANGUAGE plpgsql;
+
+		CREATE FUNCTION public.bt_index_parent_check(index regclass,
+													 heapallindexed boolean default false,
+													 rootdescend boolean default false)
+		RETURNS VOID AS $$
+		BEGIN
+			RAISE EXCEPTION 'Invoked wrong bt_index_parent_check!';
+		END;
+		$$ LANGUAGE plpgsql;
+
+		CREATE FUNCTION public.verify_heapam(relation regclass,
+											 on_error_stop boolean default false,
+											 check_toast boolean default false,
+											 skip text default 'none',
+											 startblock bigint default null,
+											 endblock bigint default null,
+											 blkno OUT bigint,
+											 offnum OUT integer,
+											 attnum OUT integer,
+											 msg OUT text)
+		RETURNS SETOF record AS $$
+		BEGIN
+			RAISE EXCEPTION 'Invoked wrong verify_heapam!';
+		END;
+		$$ LANGUAGE plpgsql;
+	));
+
+	# Create tables and indexes in five separate schemas.  The
+	# schemas are all identical to start, but we will corrupt them
+	# differently later.
+	#
+	for my $schema (qw(s1 s2 s3 s4 s5))
+	{
+		$node->safe_psql($dbname, qq(
+			CREATE SCHEMA $schema;
+			CREATE SEQUENCE $schema.seq1;
+			CREATE SEQUENCE $schema.seq2;
+			CREATE TABLE $schema.t1 (
+				i INTEGER,
+				b BOX,
+				ia int4[],
+				ir int4range,
+				t TEXT
+			);
+			CREATE TABLE $schema.t2 (
+				i INTEGER,
+				b BOX,
+				ia int4[],
+				ir int4range,
+				t TEXT
+			);
+			CREATE VIEW $schema.t2_view AS (
+				SELECT i*2, t FROM $schema.t2
+			);
+			ALTER TABLE $schema.t2
+				ALTER COLUMN t
+				SET STORAGE EXTERNAL;
+
+			INSERT INTO $schema.t1 (i, b, ia, ir, t)
+				(SELECT gs::INTEGER AS i,
+						box(point(gs,gs+5),point(gs*2,gs*3)) AS b,
+						array[gs, gs + 1]::int4[] AS ia,
+						int4range(gs, gs+100) AS ir,
+						repeat('foo', gs) AS t
+					 FROM generate_series(1,10000,3000) AS gs);
+
+			INSERT INTO $schema.t2 (i, b, ia, ir, t)
+				(SELECT gs::INTEGER AS i,
+						box(point(gs,gs+5),point(gs*2,gs*3)) AS b,
+						array[gs, gs + 1]::int4[] AS ia,
+						int4range(gs, gs+100) AS ir,
+						repeat('foo', gs) AS t
+					 FROM generate_series(1,10000,3000) AS gs);
+
+			CREATE MATERIALIZED VIEW $schema.t1_mv AS SELECT * FROM $schema.t1;
+			CREATE MATERIALIZED VIEW $schema.t2_mv AS SELECT * FROM $schema.t2;
+
+			create table $schema.p1 (a int, b int) PARTITION BY list (a);
+			create table $schema.p2 (a int, b int) PARTITION BY list (a);
+
+			create table $schema.p1_1 partition of $schema.p1 for values in (1, 2, 3);
+			create table $schema.p1_2 partition of $schema.p1 for values in (4, 5, 6);
+			create table $schema.p2_1 partition of $schema.p2 for values in (1, 2, 3);
+			create table $schema.p2_2 partition of $schema.p2 for values in (4, 5, 6);
+
+			CREATE INDEX t1_btree ON $schema.t1 USING BTREE (i);
+			CREATE INDEX t2_btree ON $schema.t2 USING BTREE (i);
+
+			CREATE INDEX t1_hash ON $schema.t1 USING HASH (i);
+			CREATE INDEX t2_hash ON $schema.t2 USING HASH (i);
+
+			CREATE INDEX t1_brin ON $schema.t1 USING BRIN (i);
+			CREATE INDEX t2_brin ON $schema.t2 USING BRIN (i);
+
+			CREATE INDEX t1_gist ON $schema.t1 USING GIST (b);
+			CREATE INDEX t2_gist ON $schema.t2 USING GIST (b);
+
+			CREATE INDEX t1_gin ON $schema.t1 USING GIN (ia);
+			CREATE INDEX t2_gin ON $schema.t2 USING GIN (ia);
+
+			CREATE INDEX t1_spgist ON $schema.t1 USING SPGIST (ir);
+			CREATE INDEX t2_spgist ON $schema.t2 USING SPGIST (ir);
+		));
+	}
+}
+
+# Database 'db1' corruptions
+#
+
+# Corrupt indexes in schema "s1"
+plan_to_remove_relation_file('db1', 's1.t1_btree');
+plan_to_corrupt_first_page('db1', 's1.t2_btree');
+
+# Corrupt tables in schema "s2"
+plan_to_remove_relation_file('db1', 's2.t1');
+plan_to_corrupt_first_page('db1', 's2.t2');
+
+# Corrupt tables, partitions, matviews, and btrees in schema "s3"
+plan_to_remove_relation_file('db1', 's3.t1');
+plan_to_corrupt_first_page('db1', 's3.t2');
+
+plan_to_remove_relation_file('db1', 's3.t1_mv');
+plan_to_remove_relation_file('db1', 's3.p1_1');
+
+plan_to_corrupt_first_page('db1', 's3.t2_mv');
+plan_to_corrupt_first_page('db1', 's3.p2_1');
+
+plan_to_remove_relation_file('db1', 's3.t1_btree');
+plan_to_corrupt_first_page('db1', 's3.t2_btree');
+
+# Corrupt the toast table associated with table t2 in schema "s4"
+plan_to_remove_toast_file('db1', 's4.t2');
+
+# Corrupt all other object types in schema "s5".  We don't have amcheck support
+# for these types, but we check that their corruption does not trigger any
+# errors in pg_amcheck.
+plan_to_remove_relation_file('db1', 's5.seq1');
+plan_to_remove_relation_file('db1', 's5.t1_hash');
+plan_to_remove_relation_file('db1', 's5.t1_gist');
+plan_to_remove_relation_file('db1', 's5.t1_gin');
+plan_to_remove_relation_file('db1', 's5.t1_brin');
+plan_to_remove_relation_file('db1', 's5.t1_spgist');
+
+plan_to_corrupt_first_page('db1', 's5.seq2');
+plan_to_corrupt_first_page('db1', 's5.t2_hash');
+plan_to_corrupt_first_page('db1', 's5.t2_gist');
+plan_to_corrupt_first_page('db1', 's5.t2_gin');
+plan_to_corrupt_first_page('db1', 's5.t2_brin');
+plan_to_corrupt_first_page('db1', 's5.t2_spgist');
+
+
+# Database 'db2' corruptions
+#
+plan_to_remove_relation_file('db2', 's1.t1');
+plan_to_remove_relation_file('db2', 's1.t1_btree');
+
+
+# Leave 'db3' uncorrupted
+#
+
+# Perform the corruptions we planned above using only a single database restart.
+#
+perform_all_corruptions();
+
+
+# Standard first arguments to TestLib functions
+my @cmd = ('pg_amcheck', '--quiet', '-p', $port);
+
+# Regular expressions to match various expected output
+my $no_output_re = qr/^$/;
+my $line_pointer_corruption_re = qr/line pointer/;
+my $missing_file_re = qr/could not open file ".*": No such file or directory/;
+my $index_missing_relation_fork_re = qr/index ".*" lacks a main relation fork/;
+
+# When checking databases with amcheck installed and corrupt relations, the
+# pg_amcheck command should return exit status = 2, meaning corrupt tables or
+# indexes were found, not exit status = 1, which would mean the pg_amcheck
+# command itself failed.  Corruption messages should go to stdout, and nothing
+# to stderr.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1' ],
+	2,
+	[ $index_missing_relation_fork_re,
+	  $line_pointer_corruption_re,
+	  $missing_file_re,
+	],
+	[ $no_output_re ],
+	'pg_amcheck all schemas, tables and indexes in database db1');
+
+$node->command_checks_all(
+	[ @cmd, '-d', 'db1', '-d', 'db2', '-d', 'db3' ],
+	2,
+	[ $index_missing_relation_fork_re,
+	  $line_pointer_corruption_re,
+	  $missing_file_re,
+	],
+	[ $no_output_re ],
+	'pg_amcheck all schemas, tables and indexes in databases db1, db2, and db3');
+
+# Scans of indexes in s1 should detect the specific corruption that we created
+# above.  For missing relation forks, we know what the error message looks
+# like.  For corrupted index pages, the error might vary depending on how the
+# page was formatted on disk, including variations due to alignment differences
+# between platforms, so we accept any non-empty error message.
+#
+# If we don't limit the check to databases with amcheck installed, we expect
+# complaint on stderr, but otherwise stderr should be quiet.
+#
+$node->command_checks_all(
+	[ @cmd, '--all', '-s', 's1', '-i', 't1_btree' ],
+	2,
+	[ $index_missing_relation_fork_re ],
+	[ qr/pg_amcheck: warning: skipping database "postgres": amcheck is not installed/ ],
+	'pg_amcheck index s1.t1_btree reports missing main relation fork');
+
+$node->command_checks_all(
+	[ @cmd, '-d', 'db1', '-s', 's1', '-i', 't2_btree' ],
+	2,
+	[ qr/.+/ ],			# Any non-empty error message is acceptable
+	[ $no_output_re ],
+	'pg_amcheck index s1.t2_btree reports index corruption');
+
+# Checking db1.s1 with indexes excluded should show no corruptions because we
+# did not corrupt any tables in db1.s1.  Verify that both stdout and stderr
+# are quiet.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-t', 's1.*', '--no-dependent-indexes' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck of db1.s1 excluding indexes');
+
+# Checking db2.s1 should show table corruptions if indexes are excluded
+#
+$node->command_checks_all(
+	[ @cmd, 'db2', '-t', 's1.*', '--no-dependent-indexes' ],
+	2,
+	[ $missing_file_re ],
+	[ $no_output_re ],
+	'pg_amcheck of db2.s1 excluding indexes');
+
+# In schema db1.s3, the tables and indexes are both corrupt.  We should see
+# corruption messages on stdout, and nothing on stderr.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's3' ],
+	2,
+	[ $index_missing_relation_fork_re,
+	  $line_pointer_corruption_re,
+	  $missing_file_re,
+	],
+	[ $no_output_re ],
+	'pg_amcheck schema s3 reports table and index errors');
+
+# In schema db1.s4, only toast tables are corrupt.  Check that under default
+# options the toast corruption is reported, but when excluding toast we get no
+# error reports.
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's4' ],
+	2,
+	[ $missing_file_re ],
+	[ $no_output_re ],
+	'pg_amcheck in schema s4 reports toast corruption');
+
+$node->command_checks_all(
+	[ @cmd, '--no-dependent-toast', '--exclude-toast-pointers', 'db1', '-s', 's4' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck in schema s4 excluding toast reports no corruption');
+
+# Check that no corruption is reported in schema db1.s5
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's5' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck over schema s5 reports no corruption');
+
+# In schema db1.s1, only indexes are corrupt.  Verify that when we exclude
+# the indexes, no corruption is reported about the schema.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's1', '-I', 't1_btree', '-I', 't2_btree' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck over schema s1 with corrupt indexes excluded reports no corruption');
+
+# In schema db1.s1, only indexes are corrupt.  Verify that when we provide only
+# table inclusions, and disable index expansion, no corruption is reported
+# about the schema.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-t', 's1.*', '--no-dependent-indexes' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck over schema s1 with all indexes excluded reports no corruption');
+
+# In schema db1.s2, only tables are corrupt.  Verify that when we exclude those
+# tables, no corruption is reported.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's2', '-T', 't1', '-T', 't2' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck over schema s2 with corrupt tables excluded reports no corruption');
+
+# Check errors about bad block range command line arguments.  We use schema s5
+# to avoid getting messages about corrupt tables or indexes.
+#
+command_fails_like(
+	[ @cmd, 'db1', '-s', 's5', '--startblock', 'junk' ],
+	qr/invalid start block/,
+	'pg_amcheck rejects garbage startblock');
+
+command_fails_like(
+	[ @cmd, 'db1', '-s', 's5', '--endblock', '1234junk' ],
+	qr/invalid end block/,
+	'pg_amcheck rejects garbage endblock');
+
+command_fails_like(
+	[ @cmd, 'db1', '-s', 's5', '--startblock', '5', '--endblock', '4' ],
+	qr/end block precedes start block/,
+	'pg_amcheck rejects invalid block range');
+
+# Check bt_index_parent_check alternates.  We don't create any index corruption
+# that would behave differently under these modes, so just smoke test that the
+# arguments are handled sensibly.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's1', '-i', 't1_btree', '--parent-check' ],
+	2,
+	[ $index_missing_relation_fork_re ],
+	[ $no_output_re ],
+	'pg_amcheck smoke test --parent-check');
+
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's1', '-i', 't1_btree', '--heapallindexed', '--rootdescend' ],
+	2,
+	[ $index_missing_relation_fork_re ],
+	[ $no_output_re ],
+	'pg_amcheck smoke test --heapallindexed --rootdescend');
+
+$node->command_checks_all(
+	[ @cmd, '-d', 'db1', '-d', 'db2', '-d', 'db3', '-S', 's*' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck excluding all corrupt schemas');
diff --git a/contrib/pg_amcheck/t/004_verify_heapam.pl b/contrib/pg_amcheck/t/004_verify_heapam.pl
new file mode 100644
index 0000000000..8ba1c4aea6
--- /dev/null
+++ b/contrib/pg_amcheck/t/004_verify_heapam.pl
@@ -0,0 +1,517 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+
+use Test::More;
+
+# This regression test demonstrates that the pg_amcheck contrib application
+# correctly identifies specific kinds of
+# corruption within pages.  To test this, we need a mechanism to create corrupt
+# pages with predictable, repeatable corruption.  The postgres backend cannot
+# be expected to help us with this, as its design is not consistent with the
+# goal of intentionally corrupting pages.
+#
+# Instead, we create a table to corrupt, and with careful consideration of how
+# postgresql lays out heap pages, we seek to offsets within the page and
+# overwrite deliberately chosen bytes with specific values calculated to
+# corrupt the page in expected ways.  We then verify that pg_amcheck reports
+# the corruption, and that it runs without crashing.  Note that the backend
+# cannot simply be started to run queries against the corrupt table, as the
+# backend will crash, at least for some of the corruption types we generate.
+#
+# Autovacuum potentially touching the table in the background makes the exact
+# behavior of this test harder to reason about.  We turn it off to keep things
+# simpler.  We use a "belt and suspenders" approach, turning it off for the
+# system generally in postgresql.conf, and turning it off specifically for the
+# test table.
+#
+# This test depends on the table being written to the heap file exactly as we
+# expect it to be, so we take care to arrange the table's columns, and the
+# rows we insert, to give predictable sizes and locations within the table
+# page.
+#
+# The HeapTupleHeaderData has 23 bytes of fixed size fields before the variable
+# length t_bits[] array.  We have exactly 3 columns in the table, so natts = 3,
+# t_bits is 1 byte long, and t_hoff = MAXALIGN(23 + 1) = 24.
+#
+# We're not too fussy about which datatypes we use for the test, but we do care
+# about some specific properties.  We'd like to test both fixed size and
+# varlena types.  We'd like some varlena data inline and some toasted.  And
+# we'd like the layout of the table such that the datums land at predictable
+# offsets within the tuple.  We choose a structure without padding on all
+# supported architectures:
+#
+# 	a BIGINT
+#	b TEXT
+#	c TEXT
+#
+# We always insert a 7-character ASCII string into field 'b', which with a
+# 1-byte varlena header gives an 8-byte inline value.  We always insert a long
+# text string in field 'c', long enough to force toast storage.
+#
+# We choose to read and write binary copies of our table's tuples, using perl's
+# pack() and unpack() functions.  Perl uses a packing code system in which:
+#
+#	L = "unsigned 32-bit long",
+#	S = "unsigned 16-bit short",
+#	C = "unsigned 8-bit octet",
+#	c = "signed 8-bit octet",
+#	q = "signed 64-bit quadword"
+#
+# Each tuple in our table has a layout as follows:
+#
+#    xx xx xx xx            t_xmin: xxxx		offset = 0		L
+#    xx xx xx xx            t_xmax: xxxx		offset = 4		L
+#    xx xx xx xx          t_field3: xxxx		offset = 8		L
+#    xx xx                   bi_hi: xx			offset = 12		S
+#    xx xx                   bi_lo: xx			offset = 14		S
+#    xx xx                ip_posid: xx			offset = 16		S
+#    xx xx             t_infomask2: xx			offset = 18		S
+#    xx xx              t_infomask: xx			offset = 20		S
+#    xx                     t_hoff: x			offset = 22		C
+#    xx                     t_bits: x			offset = 23		C
+#    xx xx xx xx xx xx xx xx   'a': xxxxxxxx	offset = 24		q
+#    xx xx xx xx xx xx xx xx   'b': xxxxxxxx	offset = 32		Cccccccc
+#    xx xx xx xx xx xx xx xx   'c': xxxxxxxx	offset = 40		SSSS
+#    xx xx xx xx xx xx xx xx      : xxxxxxxx	 ...continued	SSSS
+#    xx xx                        : xx      	 ...continued	S
+#
+# We could choose to read and write columns 'b' and 'c' in other ways, but
+# it is convenient enough to do it this way.  We define packing code
+# constants here, where they can be compared easily against the layout.
+
+use constant HEAPTUPLE_PACK_CODE => 'LLLSSSSSCCqCcccccccSSSSSSSSS';
+use constant HEAPTUPLE_PACK_LENGTH => 58;     # Total size
+
+# Read a tuple of our table from a heap page.
+#
+# Takes an open filehandle to the heap file, and the offset of the tuple.
+#
+# Rather than returning the binary data from the file, unpacks the data into a
+# perl hash with named fields.  These fields exactly match the ones understood
+# by write_tuple(), below.  Returns a reference to this hash.
+#
+sub read_tuple ($$)
+{
+	my ($fh, $offset) = @_;
+	my ($buffer, %tup);
+	seek($fh, $offset, 0)
+		or BAIL_OUT("seek failed: $!");
+	defined(sysread($fh, $buffer, HEAPTUPLE_PACK_LENGTH))
+		or BAIL_OUT("sysread failed: $!");
+
+	@_ = unpack(HEAPTUPLE_PACK_CODE, $buffer);
+	%tup = (t_xmin => shift,
+			t_xmax => shift,
+			t_field3 => shift,
+			bi_hi => shift,
+			bi_lo => shift,
+			ip_posid => shift,
+			t_infomask2 => shift,
+			t_infomask => shift,
+			t_hoff => shift,
+			t_bits => shift,
+			a => shift,
+			b_header => shift,
+			b_body1 => shift,
+			b_body2 => shift,
+			b_body3 => shift,
+			b_body4 => shift,
+			b_body5 => shift,
+			b_body6 => shift,
+			b_body7 => shift,
+			c1 => shift,
+			c2 => shift,
+			c3 => shift,
+			c4 => shift,
+			c5 => shift,
+			c6 => shift,
+			c7 => shift,
+			c8 => shift,
+			c9 => shift);
+	# Stitch together the text for column 'b'
+	$tup{b} = join('', map { chr($tup{"b_body$_"}) } (1..7));
+	return \%tup;
+}
+
+# Write a tuple of our table to a heap page.
+#
+# Takes an open filehandle to the heap file, the offset of the tuple, and a
+# reference to a hash with the tuple values, as returned by read_tuple().
+# Writes the tuple fields from the hash into the heap file.
+#
+# The purpose of this function is to write a tuple back to disk with some
+# subset of fields modified.  The function does no error checking.  Use
+# cautiously.
+#
+sub write_tuple($$$)
+{
+	my ($fh, $offset, $tup) = @_;
+	my $buffer = pack(HEAPTUPLE_PACK_CODE,
+					$tup->{t_xmin},
+					$tup->{t_xmax},
+					$tup->{t_field3},
+					$tup->{bi_hi},
+					$tup->{bi_lo},
+					$tup->{ip_posid},
+					$tup->{t_infomask2},
+					$tup->{t_infomask},
+					$tup->{t_hoff},
+					$tup->{t_bits},
+					$tup->{a},
+					$tup->{b_header},
+					$tup->{b_body1},
+					$tup->{b_body2},
+					$tup->{b_body3},
+					$tup->{b_body4},
+					$tup->{b_body5},
+					$tup->{b_body6},
+					$tup->{b_body7},
+					$tup->{c1},
+					$tup->{c2},
+					$tup->{c3},
+					$tup->{c4},
+					$tup->{c5},
+					$tup->{c6},
+					$tup->{c7},
+					$tup->{c8},
+					$tup->{c9});
+	seek($fh, $offset, 0)
+		or BAIL_OUT("seek failed: $!");
+	defined(syswrite($fh, $buffer, HEAPTUPLE_PACK_LENGTH))
+		or BAIL_OUT("syswrite failed: $!");
+	return;
+}
+
+# Set umask so test directories and files are created with default permissions
+umask(0077);
+
+# Set up the node.  Once we create and corrupt the table,
+# autovacuum workers visiting the table could crash the backend.
+# Disable autovacuum so that won't happen.
+my $node = get_new_node('test');
+$node->init;
+$node->append_conf('postgresql.conf', 'autovacuum=off');
+
+# Start the node and load the extensions.  We depend on both
+# amcheck and pageinspect for this test.
+$node->start;
+my $port = $node->port;
+my $pgdata = $node->data_dir;
+$node->safe_psql('postgres', "CREATE EXTENSION amcheck");
+$node->safe_psql('postgres', "CREATE EXTENSION pageinspect");
+
+# Get a non-zero datfrozenxid
+$node->safe_psql('postgres', qq(VACUUM FREEZE));
+
+# Create the test table with precisely the schema that our corruption function
+# expects.
+$node->safe_psql(
+	'postgres', qq(
+		CREATE TABLE public.test (a BIGINT, b TEXT, c TEXT);
+		ALTER TABLE public.test SET (autovacuum_enabled=false);
+		ALTER TABLE public.test ALTER COLUMN c SET STORAGE EXTERNAL;
+		CREATE INDEX test_idx ON public.test(a, b);
+	));
+
+# We want (0 < datfrozenxid < test.relfrozenxid).  To achieve this, we freeze
+# an otherwise unused table, public.junk, prior to inserting data and freezing
+# public.test
+$node->safe_psql(
+	'postgres', qq(
+		CREATE TABLE public.junk AS SELECT 'junk'::TEXT AS junk_column;
+		ALTER TABLE public.junk SET (autovacuum_enabled=false);
+		VACUUM FREEZE public.junk
+	));
+
+my $rel = $node->safe_psql('postgres', qq(SELECT pg_relation_filepath('public.test')));
+my $relpath = "$pgdata/$rel";
+
+# Insert data and freeze public.test
+use constant ROWCOUNT => 16;
+$node->safe_psql('postgres', qq(
+	INSERT INTO public.test (a, b, c)
+		VALUES (
+			12345678,
+			'abcdefg',
+			repeat('w', 10000)
+		);
+	VACUUM FREEZE public.test
+	)) for (1..ROWCOUNT);
+
+my $relfrozenxid = $node->safe_psql('postgres',
+	q(select relfrozenxid from pg_class where relname = 'test'));
+my $datfrozenxid = $node->safe_psql('postgres',
+	q(select datfrozenxid from pg_database where datname = 'postgres'));
+
+# Sanity check that our 'test' table has a relfrozenxid newer than the
+# datfrozenxid for the database, and that the datfrozenxid is greater than the
+# first normal xid.  We rely on these invariants in some of our tests.
+if ($datfrozenxid <= 3 || $datfrozenxid >= $relfrozenxid)
+{
+	$node->clean_node;
+	plan skip_all => "Xid thresholds not as expected: got datfrozenxid = $datfrozenxid, relfrozenxid = $relfrozenxid";
+	exit;
+}
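A side note on why these thresholds matter (illustrative Python, not part of the patch): verify_heapam compares XIDs as 64-bit "full" XIDs built from an epoch and a 32-bit XID, so on a young cluster with epoch 0 a numerically huge xmin such as 4026531839 sorts after the next valid XID and is reported as being in the future, as one of the corruption cases below expects. The next-XID value here is a hypothetical stand-in for a freshly initdb'd cluster.

```python
# Illustrative only -- not part of the patch.  A 64-bit "full" XID is the
# 32-bit epoch shifted left, OR'd with the 32-bit XID.
def full_xid(epoch, xid):
    return (epoch << 32) | xid

next_full_xid = full_xid(0, 1000)        # young cluster: epoch 0, low next XID
corrupt_xmin = full_xid(0, 4026531839)   # the value written by one case below

# Reported as "equals or exceeds next valid transaction ID":
assert corrupt_xmin >= next_full_xid
```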
+
+# Find where each of the tuples is located on the page.
+my @lp_off;
+for my $tup (0..ROWCOUNT-1)
+{
+	push (@lp_off, $node->safe_psql('postgres', qq(
+select lp_off from heap_page_items(get_raw_page('test', 'main', 0))
+	offset $tup limit 1)));
+}
+
+# Sanity check that our 'test' table's on-disk layout matches expectations.  If
+# this is not so, we will have to skip the test until somebody updates the test
+# to work on this platform.
+$node->stop;
+my $file;
+open($file, '+<', $relpath)
+	or BAIL_OUT("open failed: $!");
+binmode $file;
+
+for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
+{
+	my $offnum = $tupidx + 1;  # offnum is 1-based, not zero-based
+	my $offset = $lp_off[$tupidx];
+	my $tup = read_tuple($file, $offset);
+
+	# Sanity-check that the data appears on the page where we expect.
+	my $a = $tup->{a};
+	my $b = $tup->{b};
+	if ($a ne '12345678' || $b ne 'abcdefg')
+	{
+		close($file);  # ignore errors on close; we're exiting anyway
+		$node->clean_node;
+		plan skip_all => qq(Page layout differs from our expectations: expected (12345678, "abcdefg"), got ($a, "$b"));
+		exit;
+	}
+}
+close($file)
+	or BAIL_OUT("close failed: $!");
+$node->start;
+
+# Ok, Xids and page layout look ok.  We can run corruption tests.
+plan tests => 20;
+
+# Check that pg_amcheck runs against the uncorrupted table without error.
+$node->command_ok(['pg_amcheck', '-p', $port, 'postgres'],
+				  'pg_amcheck test table, prior to corruption');
+
+# Check that pg_amcheck runs against the uncorrupted table and index without error.
+$node->command_ok(['pg_amcheck', '-p', $port, 'postgres'],
+				  'pg_amcheck test table and index, prior to corruption');
+
+$node->stop;
+
+# Some #define constants from access/htup_details.h for use while corrupting.
+use constant HEAP_HASNULL            => 0x0001;
+use constant HEAP_XMAX_LOCK_ONLY     => 0x0080;
+use constant HEAP_XMIN_COMMITTED     => 0x0100;
+use constant HEAP_XMIN_INVALID       => 0x0200;
+use constant HEAP_XMAX_COMMITTED     => 0x0400;
+use constant HEAP_XMAX_INVALID       => 0x0800;
+use constant HEAP_NATTS_MASK         => 0x07FF;
+use constant HEAP_XMAX_IS_MULTI      => 0x1000;
+use constant HEAP_KEYS_UPDATED       => 0x2000;
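For readers unfamiliar with these infomask tricks, the corruption cases below amount to simple bit operations on t_infomask. A sketch (illustrative only, not part of the patch; the starting mask value is hypothetical):

```python
# Illustrative only -- not part of the patch.  Constants as defined above.
HEAP_XMIN_COMMITTED = 0x0100
HEAP_XMIN_INVALID   = 0x0200
HEAP_XMAX_IS_MULTI  = 0x1000

t_infomask = 0x0902                   # hypothetical mask read from a tuple
t_infomask &= ~HEAP_XMIN_COMMITTED    # clear the "xmin committed" hint bit
t_infomask &= ~HEAP_XMIN_INVALID      # clear the "xmin invalid" hint bit
t_infomask |= HEAP_XMAX_IS_MULTI      # falsely claim xmax is a multixact
assert t_infomask == 0x1802
```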
+
+# Helper function to generate a regular expression matching the header we
+# expect verify_heapam() to return given which fields we expect to be non-null.
+sub header
+{
+	my ($blkno, $offnum, $attnum) = @_;
+	return qr/heap table "postgres"\."public"\."test", block $blkno, offset $offnum, attribute $attnum:\s+/ms
+		if (defined $attnum);
+	return qr/heap table "postgres"\."public"\."test", block $blkno, offset $offnum:\s+/ms
+		if (defined $offnum);
+	return qr/heap table "postgres"\."public"\."test", block $blkno:\s+/ms
+		if (defined $blkno);
+	return qr/heap table "postgres"\."public"\."test":\s+/ms;
+}
+
+# Corrupt the tuples, one type of corruption per tuple.  Some types of
+# corruption cause verify_heapam to skip to the next tuple without
+# performing any remaining checks, so we can't exercise the system properly if
+# we focus all our corruption on a single tuple.
+#
+my @expected;
+open($file, '+<', $relpath)
+	or BAIL_OUT("open failed: $!");
+binmode $file;
+
+for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
+{
+	my $offnum = $tupidx + 1;  # offnum is 1-based, not zero-based
+	my $offset = $lp_off[$tupidx];
+	my $tup = read_tuple($file, $offset);
+
+	my $header = header(0, $offnum, undef);
+	if ($offnum == 1)
+	{
+		# Corruptly set xmin < relfrozenxid
+		my $xmin = $relfrozenxid - 1;
+		$tup->{t_xmin} = $xmin;
+		$tup->{t_infomask} &= ~HEAP_XMIN_COMMITTED;
+		$tup->{t_infomask} &= ~HEAP_XMIN_INVALID;
+
+		# Expected corruption report
+		push @expected,
+			qr/${header}xmin $xmin precedes relation freeze threshold 0:\d+/;
+	}
+	elsif ($offnum == 2)
+	{
+		# Corruptly set xmin < datfrozenxid
+		my $xmin = 3;
+		$tup->{t_xmin} = $xmin;
+		$tup->{t_infomask} &= ~HEAP_XMIN_COMMITTED;
+		$tup->{t_infomask} &= ~HEAP_XMIN_INVALID;
+
+		push @expected,
+			qr/${header}xmin $xmin precedes oldest valid transaction ID 0:\d+/;
+	}
+	elsif ($offnum == 3)
+	{
+		# Corruptly set xmin < datfrozenxid, further back, noting circularity
+		# of xid comparison.  For a new cluster with epoch = 0, the corrupt
+		# xmin will be interpreted as in the future
+		$tup->{t_xmin} = 4026531839;
+		$tup->{t_infomask} &= ~HEAP_XMIN_COMMITTED;
+		$tup->{t_infomask} &= ~HEAP_XMIN_INVALID;
+
+		push @expected,
+			qr/${header}xmin 4026531839 equals or exceeds next valid transaction ID 0:\d+/;
+	}
+	elsif ($offnum == 4)
+	{
+		# Corruptly set xmax < relminmxid;
+		$tup->{t_xmax} = 4026531839;
+		$tup->{t_infomask} &= ~HEAP_XMAX_INVALID;
+
+		push @expected,
+			qr/${header}xmax 4026531839 equals or exceeds next valid transaction ID 0:\d+/;
+	}
+	elsif ($offnum == 5)
+	{
+		# Corrupt the tuple t_hoff, but keep it aligned properly
+		$tup->{t_hoff} += 128;
+
+		push @expected,
+			qr/${header}data begins at offset 152 beyond the tuple length 58/,
+			qr/${header}tuple data should begin at byte 24, but actually begins at byte 152 \(3 attributes, no nulls\)/;
+	}
+	elsif ($offnum == 6)
+	{
+		# Corrupt the tuple t_hoff, wrong alignment
+		$tup->{t_hoff} += 3;
+
+		push @expected,
+			qr/${header}tuple data should begin at byte 24, but actually begins at byte 27 \(3 attributes, no nulls\)/;
+	}
+	elsif ($offnum == 7)
+	{
+		# Corrupt the tuple t_hoff, underflow but correct alignment
+		$tup->{t_hoff} -= 8;
+
+		push @expected,
+			qr/${header}tuple data should begin at byte 24, but actually begins at byte 16 \(3 attributes, no nulls\)/;
+	}
+	elsif ($offnum == 8)
+	{
+		# Corrupt the tuple t_hoff, underflow and wrong alignment
+		$tup->{t_hoff} -= 3;
+
+		push @expected,
+			qr/${header}tuple data should begin at byte 24, but actually begins at byte 21 \(3 attributes, no nulls\)/;
+	}
+	elsif ($offnum == 9)
+	{
+		# Corrupt the tuple to look like it has lots of attributes, not just 3
+		$tup->{t_infomask2} |= HEAP_NATTS_MASK;
+
+		push @expected,
+			qr/${header}number of attributes 2047 exceeds maximum expected for table 3/;
+	}
+	elsif ($offnum == 10)
+	{
+		# Corrupt the tuple to look like it has lots of attributes, some of
+		# them null.  This falsely creates the impression that the t_bits
+		# array is longer than just one byte, but t_hoff still says otherwise.
+		$tup->{t_infomask} |= HEAP_HASNULL;
+		$tup->{t_infomask2} |= HEAP_NATTS_MASK;
+		$tup->{t_bits} = 0xAA;
+
+		push @expected,
+			qr/${header}tuple data should begin at byte 280, but actually begins at byte 24 \(2047 attributes, has nulls\)/;
+	}
+	elsif ($offnum == 11)
+	{
+		# Same as above, but this time t_hoff plays along
+		$tup->{t_infomask} |= HEAP_HASNULL;
+		$tup->{t_infomask2} |= (HEAP_NATTS_MASK & 0x40);
+		$tup->{t_bits} = 0xAA;
+		$tup->{t_hoff} = 32;
+
+		push @expected,
+			qr/${header}number of attributes 67 exceeds maximum expected for table 3/;
+	}
+	elsif ($offnum == 12)
+	{
+		# Corrupt the bits in column 'b' 1-byte varlena header
+		$tup->{b_header} = 0x80;
+
+		$header = header(0, $offnum, 1);
+		push @expected,
+			qr/${header}attribute 1 with length 4294967295 ends at offset 416848000 beyond total tuple length 58/;
+	}
+	elsif ($offnum == 13)
+	{
+		# Corrupt the bits in column 'c' toast pointer
+		$tup->{c6} = 41;
+		$tup->{c7} = 41;
+
+		$header = header(0, $offnum, 2);
+		push @expected,
+			qr/${header}final toast chunk number 0 differs from expected value 6/,
+			qr/${header}toasted value for attribute 2 missing from toast table/;
+	}
+	elsif ($offnum == 14)
+	{
+		# Set both HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI
+		$tup->{t_infomask} |= HEAP_XMAX_COMMITTED;
+		$tup->{t_infomask} |= HEAP_XMAX_IS_MULTI;
+		$tup->{t_xmax} = 4;
+
+		push @expected,
+			qr/${header}multitransaction ID 4 equals or exceeds next valid multitransaction ID 1/;
+	}
+	elsif ($offnum == 15)	# offnums above this are left uncorrupted
+	{
+		# Set both HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI
+		$tup->{t_infomask} |= HEAP_XMAX_COMMITTED;
+		$tup->{t_infomask} |= HEAP_XMAX_IS_MULTI;
+		$tup->{t_xmax} = 4000000000;
+
+		push @expected,
+			qr/${header}multitransaction ID 4000000000 precedes relation minimum multitransaction ID threshold 1/;
+	}
+	write_tuple($file, $offset, $tup);
+}
+close($file)
+	or BAIL_OUT("close failed: $!");
+$node->start;
+
+# Run pg_amcheck against the corrupt table with epoch=0, comparing actual
+# corruption messages against the expected messages
+$node->command_checks_all(
+	['pg_amcheck', '--no-dependent-indexes', '-p', $port, 'postgres'],
+	2,
+	[ @expected ],
+	[ ],
+	'Expected corruption message output');
+
+$node->teardown_node;
+$node->clean_node;
diff --git a/contrib/pg_amcheck/t/005_opclass_damage.pl b/contrib/pg_amcheck/t/005_opclass_damage.pl
new file mode 100644
index 0000000000..eba8ea9cae
--- /dev/null
+++ b/contrib/pg_amcheck/t/005_opclass_damage.pl
@@ -0,0 +1,54 @@
+# This regression test checks the behavior of the btree validation in the
+# presence of breaking sort order changes.
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 5;
+
+my $node = get_new_node('test');
+$node->init;
+$node->start;
+
+# Create a custom operator class and an index which uses it.
+$node->safe_psql('postgres', q(
+	CREATE EXTENSION amcheck;
+
+	CREATE FUNCTION int4_asc_cmp (a int4, b int4) RETURNS int LANGUAGE sql AS $$
+		SELECT CASE WHEN $1 = $2 THEN 0 WHEN $1 > $2 THEN 1 ELSE -1 END; $$;
+
+	CREATE OPERATOR CLASS int4_fickle_ops FOR TYPE int4 USING btree AS
+	    OPERATOR 1 < (int4, int4), OPERATOR 2 <= (int4, int4),
+	    OPERATOR 3 = (int4, int4), OPERATOR 4 >= (int4, int4),
+	    OPERATOR 5 > (int4, int4), FUNCTION 1 int4_asc_cmp(int4, int4);
+
+	CREATE TABLE int4tbl (i int4);
+	INSERT INTO int4tbl (SELECT * FROM generate_series(1,1000) gs);
+	CREATE INDEX fickleidx ON int4tbl USING btree (i int4_fickle_ops);
+));
+
+# We have not yet broken the index, so we should get no corruption
+$node->command_like(
+	[ 'pg_amcheck', '--quiet', '-p', $node->port, 'postgres' ],
+	qr/^$/,
+	'pg_amcheck all schemas, tables and indexes reports no corruption');
+
+# Change the operator class to use a function which sorts in a different
+# order to corrupt the btree index
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION int4_desc_cmp (int4, int4) RETURNS int LANGUAGE sql AS $$
+		SELECT CASE WHEN $1 = $2 THEN 0 WHEN $1 > $2 THEN -1 ELSE 1 END; $$;
+	UPDATE pg_catalog.pg_amproc
+		SET amproc = 'int4_desc_cmp'::regproc
+		WHERE amproc = 'int4_asc_cmp'::regproc
+));
+
+# Index corruption should now be reported
+$node->command_checks_all(
+	[ 'pg_amcheck', '-p', $node->port, 'postgres' ],
+	2,
+	[ qr/item order invariant violated for index "fickleidx"/ ],
+	[ ],
+	'pg_amcheck all schemas, tables and indexes reports fickleidx corruption'
+);
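To see why swapping the comparator corrupts the index without touching a single page, consider this sketch (illustrative only, not part of the patch): a binary search that assumes one ordering silently misses keys stored under another, which is the same class of inconsistency amcheck's order-invariant check reports.

```python
# Illustrative only -- not part of the patch.
import bisect

asc = list(range(1, 11))          # keys laid out by the original comparator
desc = sorted(asc, reverse=True)  # same keys, ordered by the swapped one

def probe_assuming_ascending(keys, k):
    # bisect assumes ascending order, much as a btree descent assumes
    # the ordering its operator class promised at build time
    i = bisect.bisect_left(keys, k)
    return i < len(keys) and keys[i] == k

assert probe_assuming_ascending(asc, 7)       # consistent ordering: found
assert not probe_assuming_ascending(desc, 7)  # mismatched ordering: missed
```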
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index d3ca4b6932..7e101f7c11 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -185,6 +185,7 @@ pages.
   </para>
 
  &oid2name;
+ &pgamcheck;
  &vacuumlo;
  </sect1>
 
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index db1d369743..5115cb03d0 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -133,6 +133,7 @@
 <!ENTITY oldsnapshot     SYSTEM "oldsnapshot.sgml">
 <!ENTITY pageinspect     SYSTEM "pageinspect.sgml">
 <!ENTITY passwordcheck   SYSTEM "passwordcheck.sgml">
+<!ENTITY pgamcheck       SYSTEM "pgamcheck.sgml">
 <!ENTITY pgbuffercache   SYSTEM "pgbuffercache.sgml">
 <!ENTITY pgcrypto        SYSTEM "pgcrypto.sgml">
 <!ENTITY pgfreespacemap  SYSTEM "pgfreespacemap.sgml">
diff --git a/doc/src/sgml/pgamcheck.sgml b/doc/src/sgml/pgamcheck.sgml
new file mode 100644
index 0000000000..12573e950a
--- /dev/null
+++ b/doc/src/sgml/pgamcheck.sgml
@@ -0,0 +1,600 @@
+<!-- doc/src/sgml/pgamcheck.sgml -->
+
+<refentry id="pgamcheck">
+ <indexterm zone="pgamcheck">
+  <primary>pg_amcheck</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle><application>pg_amcheck</application></refentrytitle>
+  <manvolnum>1</manvolnum>
+  <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>pg_amcheck</refname>
+  <refpurpose>checks for corruption in one or more
+  <productname>PostgreSQL</productname> databases</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+  <cmdsynopsis>
+   <command>pg_amcheck</command>
+   <arg rep="repeat"><replaceable>option</replaceable></arg>
+   <arg><replaceable>dbname</replaceable></arg>
+  </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+   <application>pg_amcheck</application> supports running
+   <xref linkend="amcheck"/>'s corruption checking functions against one or
+   more databases, with options to select which schemas, tables and indexes to
+   check, which kinds of checking to perform, and whether to perform the checks
+   in parallel, and if so, the number of parallel connections to establish and
+   use.
+  </para>
+
+  <para>
+   Only table relations and btree indexes are currently supported.  Other
+   relation types are silently skipped.
+  </para>
+
+  <para>
+   If <replaceable>dbname</replaceable> is specified, it should be the name of a
+   single database to check, and no other database selection options should
+   be present. Otherwise, if any database selection options are present,
+   all matching databases will be checked. If no such options are present,
+   the default database will be checked. Database selection options include
+   <option>--all</option>, <option>--database</option> and
+   <option>--exclude-database</option>. They also include
+   <option>--relation</option>, <option>--exclude-relation</option>,
+   <option>--table</option>, <option>--exclude-table</option>,
+   <option>--index</option>, and <option>--exclude-index</option>,
+   but only when such options are used with a three-part pattern
+   (e.g. <literal>mydb*.myschema*.myrel*</literal>).
+  </para>
+
+  <para>
+   <replaceable>dbname</replaceable> can also be a
+   <link linkend="libpq-connstring">connection string</link>.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Options</title>
+
+  <para>
+   <application>pg_amcheck</application> accepts the following command-line arguments:
+
+   <variablelist>
+    <varlistentry>
+     <term><option>-a</option></term>
+     <term><option>--all</option></term>
+     <listitem>
+      <para>
+       Check all databases, except for any excluded via
+       <option>--exclude-database</option>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-d <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--database=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Check databases matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>,
+       except for any excluded by <option>--exclude-database</option>.
+       This option can be specified more than once.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-D <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--exclude-database=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Exclude databases matching the given
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>.
+       This option can be specified more than once.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-e</option></term>
+     <term><option>--echo</option></term>
+     <listitem>
+      <para>
+      Echo to stdout all SQL sent to the server.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--endblock=<replaceable class="parameter">block</replaceable></option></term>
+     <listitem>
+      <para>
+       End checking at the specified block number.  An error will occur if the
+       table relation being checked has fewer than this number of blocks.
+       This option does not apply to indexes, and is probably only useful when
+       checking a single table relation. If both a regular table and a toast
+       table are checked, this option will apply to both, but higher-numbered
+       toast blocks may still be accessed while validating toast pointers,
+       unless that is suppressed using <option>--exclude-toast-pointers</option>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--exclude-toast-pointers</option></term>
+     <listitem>
+      <para>
+       By default, whenever a toast pointer is encountered in a table,
+       a lookup is performed to ensure that it references apparently-valid
+       entries in the toast table. These checks can be quite slow, and this
+       option can be used to skip them.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--heapallindexed</option></term>
+     <listitem>
+      <para>
+       For each index checked, verify the presence of all heap tuples as index
+       tuples in the index using <xref linkend="amcheck"/>'s
+       <option>heapallindexed</option> option.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-?</option></term>
+     <term><option>--help</option></term>
+     <listitem>
+      <para>
+       Show help about <application>pg_amcheck</application> command line
+       arguments, and exit.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-h <replaceable class="parameter">hostname</replaceable></option></term>
+     <term><option>--host=<replaceable class="parameter">hostname</replaceable></option></term>
+     <listitem>
+      <para>
+       Specifies the host name of the machine on which the server is running.
+       If the value begins with a slash, it is used as the directory for the
+       Unix domain socket.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-i <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--index=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Check indexes matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>,
+       unless they are otherwise excluded.
+       This option can be specified more than once.
+      </para>
+      <para>
+       This is similar to the <option>--relation</option> option, except that
+       it applies only to indexes, not tables.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-I <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--exclude-index=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Exclude indexes matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>.
+       This option can be specified more than once.
+      </para>
+      <para>
+       This is similar to the <option>--exclude-relation</option> option,
+       except that it applies only to indexes, not tables.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-j <replaceable class="parameter">num</replaceable></option></term>
+     <term><option>--jobs=<replaceable class="parameter">num</replaceable></option></term>
+     <listitem>
+      <para>
+       Use <replaceable>num</replaceable> concurrent connections to the server,
+       or one per object to be checked, whichever is less.
+      </para>
+      <para>
+       The default is to use a single connection.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--maintenance-db=<replaceable class="parameter">dbname</replaceable></option></term>
+     <listitem>
+      <para>
+       Specifies a database or
+       <link linkend="libpq-connstring">connection string</link> to be
+       used to discover the list of databases to be checked. If neither
+       <option>--all</option> nor any option including a database pattern is
+       used, no such connection is required and this option does nothing.
+       Otherwise, any connection string parameters other than
+       the database name which are included in the value for this option
+       will also be used when connecting to the databases
+       being checked. If this option is omitted, the default is
+       <literal>postgres</literal> or, if that fails,
+       <literal>template1</literal>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--no-dependent-indexes</option></term>
+     <listitem>
+      <para>
+       By default, if a table is checked, any btree indexes of that table
+       will also be checked, even if they are not explicitly selected by
+       an option such as <literal>--index</literal> or
+       <literal>--relation</literal>. This option suppresses that behavior.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--no-strict-names</option></term>
+     <listitem>
+      <para>
+       By default, if an argument to <literal>--database</literal>,
+       <literal>--table</literal>, <literal>--index</literal>,
+       or <literal>--relation</literal> matches no objects, it is a fatal
+       error. This option downgrades that error to a warning.
+       If this option is used with <literal>--quiet</literal>, the warning
+       will be suppressed as well.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--no-dependent-toast</option></term>
+     <listitem>
+      <para>
+       By default, if a table is checked, its toast table, if any, will also
+       be checked, even if it is not explicitly selected by an option
+       such as <literal>--table</literal> or <literal>--relation</literal>.
+       This option suppresses that behavior.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--on-error-stop</option></term>
+     <listitem>
+      <para>
+       After reporting all corruptions on the first page of a table where
+       corruption is found, stop processing that table relation and move on
+       to the next table or index.
+      </para>
+      <para>
+       Note that index checking always stops after the first corrupt page.
+       This option only has meaning relative to table relations.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--parent-check</option></term>
+     <listitem>
+      <para>
+       For each btree index checked, use <xref linkend="amcheck"/>'s
+       <function>bt_index_parent_check</function> function, which performs
+       additional checks of parent/child relationships during index checking.
+      </para>
+      <para>
+       The default is to use <application>amcheck</application>'s
+       <function>bt_index_check</function> function, but note that use of the
+       <option>--rootdescend</option> option implicitly selects
+       <function>bt_index_parent_check</function>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-p <replaceable class="parameter">port</replaceable></option></term>
+     <term><option>--port=<replaceable class="parameter">port</replaceable></option></term>
+     <listitem>
+      <para>
+       Specifies the TCP port or local Unix domain socket file extension on
+       which the server is listening for connections.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-P</option></term>
+     <term><option>--progress</option></term>
+     <listitem>
+      <para>
+       Show progress information. Progress information includes the number
+       of relations for which checking has been completed, and the total
+       size of those relations. It also includes the total number of relations
+       that will eventually be checked, and the estimated size of those
+       relations.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-q</option></term>
+     <term><option>--quiet</option></term>
+     <listitem>
+      <para>
+       Print fewer messages, and less detail regarding any server errors.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-r <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--relation=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Check relations matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>,
+       unless they are otherwise excluded.
+       This option can be specified more than once.
+      </para>
+      <para>
+       Patterns may be unqualified, e.g. <literal>myrel*</literal>, or they
+       may be schema-qualified, e.g. <literal>myschema*.myrel*</literal> or
+       database- and schema-qualified, e.g.
+       <literal>mydb*.myschema*.myrel*</literal>. A database-qualified
+       pattern will add matching databases to the list of databases to be
+       checked.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-R <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--exclude-relation=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Exclude relations matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>.
+       This option can be specified more than once.
+      </para>
+      <para>
+       As with <option>-r</option> <option>--relation</option>, the
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link> may be unqualified, schema-qualified,
+       or database- and schema-qualified.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--rootdescend</option></term>
+     <listitem>
+      <para>
+       For each index checked, re-find tuples on the leaf level by performing a
+       new search from the root page for each tuple using
+       <xref linkend="amcheck"/>'s <option>rootdescend</option> option.
+      </para>
+      <para>
+       Use of this option implicitly also selects the
+       <option>--parent-check</option> option.
+      </para>
+      <para>
+       This form of verification was originally written to help in the
+       development of btree index features.  It may be of limited use or even
+       of no use in helping detect the kinds of corruption that occur in
+       practice.  It may also cause corruption checking to take considerably
+       longer and consume considerably more resources on the server.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-s <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--schema=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Check tables and indexes in schemas matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>, unless they are otherwise excluded.
+       This option can be specified more than once.
+      </para>
+      <para>
+       To select only tables in schemas matching a particular pattern,
+       consider using something like
+       <literal>--table=SCHEMAPAT.* --no-dependent-indexes</literal>.
+       To select only indexes, consider using something like
+       <literal>--index=SCHEMAPAT.*</literal>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-S <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--exclude-schema=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Exclude tables and indexes in schemas matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>.
+       This option can be specified more than once.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--skip=<replaceable class="parameter">option</replaceable></option></term>
+     <listitem>
+      <para>
+       If <literal>all-frozen</literal> is given, table corruption checks
+       will skip over pages in all tables that are marked as all frozen.
+      </para>
+      <para>
+       If <literal>all-visible</literal> is given, table corruption checks
+       will skip over pages in all tables that are marked as all visible.
+      </para>
+      <para>
+       By default, no pages are skipped.  This can be specified as
+       <literal>none</literal>, but since this is the default, it need not be
+       mentioned.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--startblock=<replaceable class="parameter">block</replaceable></option></term>
+     <listitem>
+      <para>
+       Start checking at the specified block number. An error will occur if
+       the table relation being checked has fewer than this number of blocks.
+       This option does not apply to indexes, and is probably only useful
+       when checking a single table relation. See <literal>--endblock</literal>
+       for further caveats.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-t <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--table=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Check tables matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>,
+       unless they are otherwise excluded.
+       This option can be specified more than once.
+      </para>
+      <para>
+       This is similar to the <option>--relation</option> option, except that
+       it applies only to tables, not indexes.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-T <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--exclude-table=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Exclude tables matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>.
+       This option can be specified more than once.
+      </para>
+      <para>
+       This is similar to the <option>--exclude-relation</option> option,
+       except that it applies only to tables, not indexes.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-U <replaceable class="parameter">username</replaceable></option></term>
+     <term><option>--username=<replaceable class="parameter">username</replaceable></option></term>
+     <listitem>
+      <para>
+       User name to connect as.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-v</option></term>
+     <term><option>--verbose</option></term>
+     <listitem>
+      <para>
+       Print more messages. In particular, this will print a message for
+       each relation being checked, and will increase the level of detail
+       shown for server errors.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-V</option></term>
+     <term><option>--version</option></term>
+     <listitem>
+      <para>
+       Print the <application>pg_amcheck</application> version and exit.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-w</option></term>
+     <term><option>--no-password</option></term>
+     <listitem>
+      <para>
+       Never issue a password prompt.  If the server requires password
+       authentication and a password is not available by other means such as
+       a <filename>.pgpass</filename> file, the connection attempt will fail.
+       This option can be useful in batch jobs and scripts where no user is
+       present to enter a password.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-W</option></term>
+     <term><option>--password</option></term>
+     <listitem>
+      <para>
+       Force <application>pg_amcheck</application> to prompt for a password
+       before connecting to a database.
+      </para>
+      <para>
+       This option is never essential, since
+       <application>pg_amcheck</application> will automatically prompt for a
+       password if the server demands password authentication.  However,
+       <application>pg_amcheck</application> will waste a connection attempt
+       finding out that the server wants a password.  In some cases it is
+       worth typing <option>-W</option> to avoid the extra connection attempt.
+      </para>
+     </listitem>
+    </varlistentry>
+
+   </variablelist>
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+   <application>pg_amcheck</application> is designed to work with
+   <productname>PostgreSQL</productname> 14.0 and later.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Author</title>
+
+  <para>
+   Mark Dilger <email>mark.dilger@enterprisedb.com</email>
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>See Also</title>
+
+  <simplelist type="inline">
+   <member><xref linkend="amcheck"/></member>
+  </simplelist>
+ </refsect1>
+</refentry>
diff --git a/src/tools/msvc/Install.pm b/src/tools/msvc/Install.pm
index ea3af48777..49ad558b74 100644
--- a/src/tools/msvc/Install.pm
+++ b/src/tools/msvc/Install.pm
@@ -18,7 +18,7 @@ our (@ISA, @EXPORT_OK);
 @EXPORT_OK = qw(Install);
 
 my $insttype;
-my @client_contribs = ('oid2name', 'pgbench', 'vacuumlo');
+my @client_contribs = ('oid2name', 'pg_amcheck', 'pgbench', 'vacuumlo');
 my @client_program_files = (
 	'clusterdb',      'createdb',   'createuser',    'dropdb',
 	'dropuser',       'ecpg',       'libecpg',       'libecpg_compat',
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 49614106dc..f680544e07 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -33,9 +33,9 @@ my @unlink_on_exit;
 
 # Set of variables for modules in contrib/ and src/test/modules/
 my $contrib_defines = { 'refint' => 'REFINT_VERBOSE' };
-my @contrib_uselibpq = ('dblink', 'oid2name', 'postgres_fdw', 'vacuumlo');
-my @contrib_uselibpgport   = ('oid2name', 'vacuumlo');
-my @contrib_uselibpgcommon = ('oid2name', 'vacuumlo');
+my @contrib_uselibpq = ('dblink', 'oid2name', 'pg_amcheck', 'postgres_fdw', 'vacuumlo');
+my @contrib_uselibpgport   = ('oid2name', 'pg_amcheck', 'vacuumlo');
+my @contrib_uselibpgcommon = ('oid2name', 'pg_amcheck', 'vacuumlo');
 my $contrib_extralibs      = undef;
 my $contrib_extraincludes = { 'dblink' => ['src/backend'] };
 my $contrib_extrasource = {
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e017557e3e..202673d37f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -101,6 +101,7 @@ AlterUserMappingStmt
 AlteredTableInfo
 AlternativeSubPlan
 AlternativeSubPlanState
+AmcheckOptions
 AnalyzeAttrComputeStatsFunc
 AnalyzeAttrFetchFunc
 AnalyzeForeignTable_function
@@ -500,6 +501,7 @@ DSA
 DWORD
 DataDumperPtr
 DataPageDeleteStack
+DatabaseInfo
 DateADT
 Datum
 DatumTupleFields
@@ -1803,6 +1805,8 @@ PathHashStack
 PathKey
 PathKeysComparison
 PathTarget
+PatternInfo
+PatternInfoArray
 Pattern_Prefix_Status
 Pattern_Type
 PendingFsyncEntry
@@ -2085,6 +2089,7 @@ RelToCluster
 RelabelType
 Relation
 RelationData
+RelationInfo
 RelationPtr
 RelationSyncEntry
 RelcacheCallbackFunction
-- 
2.21.1 (Apple Git-122.3)
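The include/exclude options documented in the patch above compose in a consistent way: an object is checked if it matches at least one include pattern (or none were given) and is not matched by any exclude pattern. A toy model of that precedence, not pg_amcheck's actual implementation (real psql patterns are not shell globs, so `fnmatch` is only a stand-in here):

```python
from fnmatch import fnmatchcase

def selected(qualified_name, include, exclude):
    """Toy model of pg_amcheck-style include/exclude precedence.

    An object is checked when it matches no exclude pattern and
    matches at least one include pattern (or none were supplied).
    """
    # Exclusion always wins over inclusion.
    if any(fnmatchcase(qualified_name, p) for p in exclude):
        return False
    # With no include patterns, everything not excluded is selected.
    return not include or any(fnmatchcase(qualified_name, p) for p in include)

# e.g. --table='public.t*' --exclude-table='public.t_old'
print(selected("public.t_new", ["public.t*"], ["public.t_old"]))  # True
print(selected("public.t_old", ["public.t*"], ["public.t_old"]))  # False
```

The names and glob syntax here are illustrative only; pg_amcheck translates psql-style patterns to server-side regular expressions rather than matching them client-side.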

#25Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#24)
1 attachment(s)
Re: pg_amcheck contrib application

On Fri, Mar 12, 2021 at 12:00 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

Your proposal is used in this next version of the patch, along with a resolution of the -D option handling discussed before, and a change to make the --schema and --exclude-schema options accept "database.schema" patterns as well as "schema" patterns. Previously the parameter was interpreted only as a schema, without treating embedded dots as separators, but that seemed strangely inconsistent with the way all the other pattern options work, so I made it consistent. (I think the previous behavior was defensible, but harder to explain and perhaps less intuitive.)

Well, OK. In that case I guess we need to patch the docs a little
more. Here's a patch documenting that revised behavior and also
tidying up a few other things I noticed along the way.

Since nobody is saying we *shouldn't* move this to src/bin, I think
you may as well go put it there per Peter's suggestion.

Then I think it's time to get this committed and move on to the next thing.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

more-doc-hacking.patch (application/octet-stream)
diff --git a/doc/src/sgml/pgamcheck.sgml b/doc/src/sgml/pgamcheck.sgml
index 12573e950a..62812a751f 100644
--- a/doc/src/sgml/pgamcheck.sgml
+++ b/doc/src/sgml/pgamcheck.sgml
@@ -54,7 +54,10 @@
    <option>--table</option>, <option>--exclude-table</option>,
    <option>--index</option>, and <option>--exclude-index</option>,
    but only when such options are used with a three-part pattern
-   (e.g. <option>mydb*.myschema*.myrel*</option>).
+   (e.g. <option>mydb*.myschema*.myrel*</option>). Finally, they include
+   <option>--schema</option> and <option>--exclude-schema</option>
+   when such options are used with a two-part pattern
+   (e.g. <option>mydb*.myschema*</option>).
   </para>
 
   <para>
@@ -126,7 +129,8 @@
        checking a single table relation. If both a regular table and a toast
        table are checked, this option will apply to both, but higher-numbered
        toast blocks may still be accessed while validating toast pointers,
-       unless that is suppressed using <option>--exclude-toast-pointers</option>.
+       unless that is suppressed using
+       <option>--exclude-toast-pointers</option>.
       </para>
      </listitem>
     </varlistentry>
@@ -379,7 +383,7 @@
        This option can be specified more than once.
       </para>
       <para>
-       As with <option>-r</option> <option>--relation</option>, the
+       As with <option>--relation</option>, the
        <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link> may be unqualified, schema-qualified,
        or database- and schema-qualified.
       </para>
@@ -424,6 +428,12 @@
        To select only indexes, consider using something like
        <literal>--index=SCHEMAPAT.*</literal>.
       </para>
+      <para>
+       A schema pattern may be database-qualified. For example, you may
+       write <literal>--schema=mydb*.myschema*</literal> to select
+       schemas matching <literal>myschema*</literal> in databases matching
+       <literal>mydb*</literal>.
+      </para>
      </listitem>
     </varlistentry>
 
@@ -436,6 +446,10 @@
        <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>.
        This option can be specified more than once.
       </para>
+      <para>
+       As with <option>--schema</option>, the pattern may be
+       database-qualified.
+      </para>
      </listitem>
    </varlistentry>
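The two-part --schema behavior documented above can be sketched with a toy splitter. This is not pg_amcheck's real parser (which handles quoted psql patterns and translates them to regular expressions); it only illustrates how a dotted pattern narrows the database list:

```python
def split_schema_pattern(pattern):
    """Toy split of a --schema argument into (database, schema) parts.

    A bare pattern selects schemas within the databases already chosen;
    a two-part pattern such as 'mydb*.myschema*' additionally narrows
    the set of databases to be checked.  Quoting inside psql patterns
    is ignored by this simplistic split.
    """
    parts = pattern.split(".")
    if len(parts) == 1:
        return (None, parts[0])       # schema pattern only
    if len(parts) == 2:
        return (parts[0], parts[1])   # database.schema pattern
    raise ValueError("too many dotted parts for a schema pattern")

print(split_schema_pattern("myschema*"))        # (None, 'myschema*')
print(split_schema_pattern("mydb*.myschema*"))  # ('mydb*', 'myschema*')
```

A three-part pattern is rejected here because --schema, unlike --relation, names no relation part; the pattern names in the example are hypothetical.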
 
#26Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#25)
1 attachment(s)
Re: pg_amcheck contrib application

On Mar 12, 2021, at 5:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Well, OK. In that case I guess we need to patch the docs a little
more. Here's a patch documentation that revised behavior, and also
tidying up a few other things I noticed along the way.

Since nobody is saying we *shouldn't* move this to src/bin, I think
you may as well go put it there per Peter's suggestion.

Then I think it's time to get this committed and move on to the next thing.

In this next patch, your documentation patch has been applied, and the whole project has been relocated from contrib/pg_amcheck to src/bin/pg_amcheck.

Attachments:

v47-0001-Adding-frontend-utility-program-pg_amcheck.patch (application/octet-stream)
From bf0f4ae87ccbdff8d7e268a62547cb3d50870240 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Fri, 12 Mar 2021 08:24:49 -0800
Subject: [PATCH v47] Adding frontend utility program pg_amcheck

Adding new utility pg_amcheck, which is a command line interface for
running contrib/amcheck's verifications against tables and indexes.
---
 doc/src/sgml/ref/allfiles.sgml             |    1 +
 doc/src/sgml/ref/pg_amcheck.sgml           |  609 ++++++
 doc/src/sgml/reference.sgml                |    1 +
 src/bin/Makefile                           |    1 +
 src/bin/pg_amcheck/.gitignore              |    3 +
 src/bin/pg_amcheck/Makefile                |   51 +
 src/bin/pg_amcheck/pg_amcheck.c            | 2134 ++++++++++++++++++++
 src/bin/pg_amcheck/t/001_basic.pl          |    9 +
 src/bin/pg_amcheck/t/002_nonesuch.pl       |  248 +++
 src/bin/pg_amcheck/t/003_check.pl          |  504 +++++
 src/bin/pg_amcheck/t/004_verify_heapam.pl  |  516 +++++
 src/bin/pg_amcheck/t/005_opclass_damage.pl |   54 +
 src/tools/msvc/Install.pm                  |   12 +-
 src/tools/msvc/Mkvcbuild.pm                |   26 +-
 14 files changed, 4155 insertions(+), 14 deletions(-)
 create mode 100644 doc/src/sgml/ref/pg_amcheck.sgml
 create mode 100644 src/bin/pg_amcheck/.gitignore
 create mode 100644 src/bin/pg_amcheck/Makefile
 create mode 100644 src/bin/pg_amcheck/pg_amcheck.c
 create mode 100644 src/bin/pg_amcheck/t/001_basic.pl
 create mode 100644 src/bin/pg_amcheck/t/002_nonesuch.pl
 create mode 100644 src/bin/pg_amcheck/t/003_check.pl
 create mode 100644 src/bin/pg_amcheck/t/004_verify_heapam.pl
 create mode 100644 src/bin/pg_amcheck/t/005_opclass_damage.pl

diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index bee7d28928..202711c005 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -196,6 +196,7 @@ Complete list of usable sgml source files in this directory.
 <!ENTITY dropuser           SYSTEM "dropuser.sgml">
 <!ENTITY ecpgRef            SYSTEM "ecpg-ref.sgml">
 <!ENTITY initdb             SYSTEM "initdb.sgml">
+<!ENTITY pgAMCheck          SYSTEM "pg_amcheck.sgml">
 <!ENTITY pgarchivecleanup   SYSTEM "pgarchivecleanup.sgml">
 <!ENTITY pgBasebackup       SYSTEM "pg_basebackup.sgml">
 <!ENTITY pgbench            SYSTEM "pgbench.sgml">
diff --git a/doc/src/sgml/ref/pg_amcheck.sgml b/doc/src/sgml/ref/pg_amcheck.sgml
new file mode 100644
index 0000000000..fcc96b430a
--- /dev/null
+++ b/doc/src/sgml/ref/pg_amcheck.sgml
@@ -0,0 +1,609 @@
+<!--
+doc/src/sgml/ref/pg_amcheck.sgml
+PostgreSQL documentation
+-->
+
+<refentry id="app-pgamcheck">
+ <indexterm zone="app-pgamcheck">
+  <primary>pg_amcheck</primary>
+ </indexterm>
+
+ <refmeta>
+  <refentrytitle><application>pg_amcheck</application></refentrytitle>
+  <manvolnum>1</manvolnum>
+  <refmiscinfo>Application</refmiscinfo>
+ </refmeta>
+
+ <refnamediv>
+  <refname>pg_amcheck</refname>
+  <refpurpose>checks for corruption in one or more
+  <productname>PostgreSQL</productname> databases</refpurpose>
+ </refnamediv>
+
+ <refsynopsisdiv>
+  <cmdsynopsis>
+   <command>pg_amcheck</command>
+   <arg rep="repeat"><replaceable>option</replaceable></arg>
+   <arg><replaceable>dbname</replaceable></arg>
+  </cmdsynopsis>
+ </refsynopsisdiv>
+
+ <refsect1>
+  <title>Description</title>
+
+  <para>
+   <application>pg_amcheck</application> supports running
+   <xref linkend="amcheck"/>'s corruption checking functions against one or
+   more databases, with options to select which schemas, tables and indexes to
+   check, which kinds of checking to perform, and whether to perform the checks
+   in parallel, and if so, the number of parallel connections to establish and
+   use.
+  </para>
+
+  <para>
+   Only table relations and btree indexes are currently supported.  Other
+   relation types are silently skipped.
+  </para>
+
+  <para>
+   If <replaceable class="parameter">dbname</replaceable> is specified, it should be the name of a
+   single database to check, and no other database selection options should
+   be present. Otherwise, if any database selection options are present,
+   all matching databases will be checked. If no such options are present,
+   the default database will be checked. Database selection options include
+   <option>--all</option>, <option>--database</option> and
+   <option>--exclude-database</option>. They also include
+   <option>--relation</option>, <option>--exclude-relation</option>,
+   <option>--table</option>, <option>--exclude-table</option>,
+   <option>--index</option>, and <option>--exclude-index</option>,
+   but only when such options are used with a three-part pattern
+   (e.g. <option>mydb*.myschema*.myrel*</option>).  Finally, they include
+   <option>--schema</option> and <option>--exclude-schema</option>
+   when such options are used with a two-part pattern
+   (e.g. <option>mydb*.myschema*</option>).
+  </para>
+
+  <para>
+   <replaceable>dbname</replaceable> can also be a
+   <link linkend="libpq-connstring">connection string</link>.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Options</title>
+
+  <para>
+   <application>pg_amcheck</application> accepts the following command-line arguments:
+
+   <variablelist>
+    <varlistentry>
+     <term><option>-a</option></term>
+     <term><option>--all</option></term>
+     <listitem>
+      <para>
+       Check all databases, except for any excluded via
+       <option>--exclude-database</option>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-d <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--database=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Check databases matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>,
+       except for any excluded by <option>--exclude-database</option>.
+       This option can be specified more than once.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-D <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--exclude-database=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Exclude databases matching the given
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>.
+       This option can be specified more than once.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-e</option></term>
+     <term><option>--echo</option></term>
+     <listitem>
+      <para>
+      Echo to stdout all SQL sent to the server.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--endblock=<replaceable class="parameter">block</replaceable></option></term>
+     <listitem>
+      <para>
+       End checking at the specified block number.  An error will occur if the
+       table relation being checked has fewer than this number of blocks.
+       This option does not apply to indexes, and is probably only useful when
+       checking a single table relation. If both a regular table and a toast
+       table are checked, this option will apply to both, but higher-numbered
+       toast blocks may still be accessed while validating toast pointers,
+       unless that is suppressed using
+       <option>--exclude-toast-pointers</option>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--exclude-toast-pointers</option></term>
+     <listitem>
+      <para>
+       By default, whenever a toast pointer is encountered in a table,
+       a lookup is performed to ensure that it references apparently-valid
+       entries in the toast table. These checks can be quite slow, and this
+       option can be used to skip them.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--heapallindexed</option></term>
+     <listitem>
+      <para>
+       For each index checked, verify the presence of all heap tuples as index
+       tuples in the index using <xref linkend="amcheck"/>'s
+       <option>heapallindexed</option> option.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-?</option></term>
+     <term><option>--help</option></term>
+     <listitem>
+      <para>
+       Show help about <application>pg_amcheck</application> command line
+       arguments, and exit.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-h <replaceable class="parameter">hostname</replaceable></option></term>
+     <term><option>--host=<replaceable class="parameter">hostname</replaceable></option></term>
+     <listitem>
+      <para>
+       Specifies the host name of the machine on which the server is running.
+       If the value begins with a slash, it is used as the directory for the
+       Unix domain socket.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-i <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--index=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Check indexes matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>,
+       unless they are otherwise excluded.
+       This option can be specified more than once.
+      </para>
+      <para>
+       This is similar to the <option>--relation</option> option, except that
+       it applies only to indexes, not tables.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-I <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--exclude-index=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Exclude indexes matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>.
+       This option can be specified more than once.
+      </para>
+      <para>
+       This is similar to the <option>--exclude-relation</option> option,
+       except that it applies only to indexes, not tables.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-j <replaceable class="parameter">num</replaceable></option></term>
+     <term><option>--jobs=<replaceable class="parameter">num</replaceable></option></term>
+     <listitem>
+      <para>
+       Use <replaceable>num</replaceable> concurrent connections to the server,
+       or one per object to be checked, whichever is less.
+      </para>
+      <para>
+       The default is to use a single connection.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--maintenance-db=<replaceable class="parameter">dbname</replaceable></option></term>
+     <listitem>
+      <para>
+       Specifies a database or
+       <link linkend="libpq-connstring">connection string</link> to be
+       used to discover the list of databases to be checked. If neither
+       <option>--all</option> nor any option including a database pattern is
+       used, no such connection is required and this option does nothing.
+       Otherwise, any connection string parameters other than
+       the database name which are included in the value for this option
+       will also be used when connecting to the databases
+       being checked. If this option is omitted, the default is
+       <literal>postgres</literal> or, if that fails,
+       <literal>template1</literal>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--no-dependent-indexes</option></term>
+     <listitem>
+      <para>
+       By default, if a table is checked, any btree indexes of that table
+       will also be checked, even if they are not explicitly selected by
+       an option such as <option>--index</option> or
+       <option>--relation</option>. This option suppresses that behavior.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--no-strict-names</option></term>
+     <listitem>
+      <para>
+       By default, if an argument to <option>--database</option>,
+       <option>--table</option>, <option>--index</option>,
+       or <option>--relation</option> matches no objects, it is a fatal
+       error. This option downgrades that error to a warning.
+       If this option is used with <option>--quiet</option>, the warning
+       will be suppressed as well.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--no-dependent-toast</option></term>
+     <listitem>
+      <para>
+       By default, if a table is checked, its toast table, if any, will also
+       be checked, even if it is not explicitly selected by an option
+       such as <option>--table</option> or <option>--relation</option>.
+       This option suppresses that behavior.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--on-error-stop</option></term>
+     <listitem>
+      <para>
+       After reporting all corruptions on the first page of a table where
+       corruption is found, stop processing that table relation and move on
+       to the next table or index.
+      </para>
+      <para>
+       Note that index checking always stops after the first corrupt page.
+       This option only has meaning relative to table relations.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--parent-check</option></term>
+     <listitem>
+      <para>
+       For each btree index checked, use <xref linkend="amcheck"/>'s
+       <function>bt_index_parent_check</function> function, which performs
+       additional checks of parent/child relationships during index checking.
+      </para>
+      <para>
+       The default is to use <application>amcheck</application>'s
+       <function>bt_index_check</function> function, but note that use of the
+       <option>--rootdescend</option> option implicitly selects
+       <function>bt_index_parent_check</function>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-p <replaceable class="parameter">port</replaceable></option></term>
+     <term><option>--port=<replaceable class="parameter">port</replaceable></option></term>
+     <listitem>
+      <para>
+       Specifies the TCP port or local Unix domain socket file extension on
+       which the server is listening for connections.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-P</option></term>
+     <term><option>--progress</option></term>
+     <listitem>
+      <para>
+       Show progress information. Progress information includes the number
+       of relations for which checking has been completed, and the total
+       size of those relations. It also includes the total number of relations
+       that will eventually be checked, and the estimated size of those
+       relations.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-q</option></term>
+     <term><option>--quiet</option></term>
+     <listitem>
+      <para>
+       Print fewer messages, and less detail regarding any server errors.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-r <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--relation=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Check relations matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>,
+       unless they are otherwise excluded.
+       This option can be specified more than once.
+      </para>
+      <para>
+       Patterns may be unqualified, e.g. <literal>myrel*</literal>, or they
+       may be schema-qualified, e.g. <literal>myschema*.myrel*</literal>, or
+       database- and schema-qualified, e.g.
+       <literal>mydb*.myschema*.myrel*</literal>. A database-qualified
+       pattern will add matching databases to the list of databases to be
+       checked.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-R <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--exclude-relation=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Exclude relations matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>.
+       This option can be specified more than once.
+      </para>
+      <para>
+       As with <option>--relation</option>, the
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link> may be unqualified, schema-qualified,
+       or database- and schema-qualified.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--rootdescend</option></term>
+     <listitem>
+      <para>
+       For each index checked, re-find tuples on the leaf level by performing a
+       new search from the root page for each tuple using
+       <xref linkend="amcheck"/>'s <option>rootdescend</option> option.
+      </para>
+      <para>
+       Use of this option implicitly also selects the
+       <option>--parent-check</option> option.
+      </para>
+      <para>
+       This form of verification was originally written to help in the
+       development of btree index features.  It may be of limited use or even
+       of no use in helping detect the kinds of corruption that occur in
+       practice.  It may also cause corruption checking to take considerably
+       longer and consume considerably more resources on the server.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-s <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--schema=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Check tables and indexes in schemas matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>, unless they are otherwise excluded.
+       This option can be specified more than once.
+      </para>
+      <para>
+       To select only tables in schemas matching a particular pattern,
+       consider using something like
+       <literal>--table=SCHEMAPAT.* --no-dependent-indexes</literal>.
+       To select only indexes, consider using something like
+       <literal>--index=SCHEMAPAT.*</literal>.
+      </para>
+      <para>
+       A schema pattern may be database-qualified. For example, you may
+       write <literal>--schema=mydb*.myschema*</literal> to select
+       schemas matching <literal>myschema*</literal> in databases matching
+       <literal>mydb*</literal>.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-S <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--exclude-schema=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Exclude tables and indexes in schemas matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>.
+       This option can be specified more than once.
+      </para>
+      <para>
+       As with <option>--schema</option>, the pattern may be
+       database-qualified.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--skip=<replaceable class="parameter">option</replaceable></option></term>
+     <listitem>
+      <para>
+       If <literal>all-frozen</literal> is given, table corruption checks
+       will skip over pages in all tables that are marked as all frozen.
+      </para>
+      <para>
+       If <literal>all-visible</literal> is given, table corruption checks
+       will skip over pages in all tables that are marked as all visible.
+      </para>
+      <para>
+       By default, no pages are skipped.  This can be specified as
+       <literal>none</literal>, but since this is the default, it need not be
+       mentioned.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>--startblock=<replaceable class="parameter">block</replaceable></option></term>
+     <listitem>
+      <para>
+       Start checking at the specified block number. An error will occur if
+       the table relation being checked has fewer than this number of blocks.
+       This option does not apply to indexes, and is probably only useful
+       when checking a single table relation. See <option>--endblock</option>
+       for further caveats.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-t <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--table=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Check tables matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>,
+       unless they are otherwise excluded.
+       This option can be specified more than once.
+      </para>
+      <para>
+       This is similar to the <option>--relation</option> option, except that
+       it applies only to tables, not indexes.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-T <replaceable class="parameter">pattern</replaceable></option></term>
+     <term><option>--exclude-table=<replaceable class="parameter">pattern</replaceable></option></term>
+     <listitem>
+      <para>
+       Exclude tables matching the specified
+       <link linkend="app-psql-patterns"><replaceable class="parameter">pattern</replaceable></link>.
+       This option can be specified more than once.
+      </para>
+      <para>
+       This is similar to the <option>--exclude-relation</option> option,
+       except that it applies only to tables, not indexes.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-U <replaceable class="parameter">username</replaceable></option></term>
+     <term><option>--username=<replaceable class="parameter">username</replaceable></option></term>
+     <listitem>
+      <para>
+       User name to connect as.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-v</option></term>
+     <term><option>--verbose</option></term>
+     <listitem>
+      <para>
+       Print more messages. In particular, this will print a message for
+       each relation being checked, and will increase the level of detail
+       shown for server errors.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-V</option></term>
+     <term><option>--version</option></term>
+     <listitem>
+      <para>
+       Print the <application>pg_amcheck</application> version and exit.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-w</option></term>
+     <term><option>--no-password</option></term>
+     <listitem>
+      <para>
+       Never issue a password prompt.  If the server requires password
+       authentication and a password is not available by other means such as
+       a <filename>.pgpass</filename> file, the connection attempt will fail.
+       This option can be useful in batch jobs and scripts where no user is
+       present to enter a password.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><option>-W</option></term>
+     <term><option>--password</option></term>
+     <listitem>
+      <para>
+       Force <application>pg_amcheck</application> to prompt for a password
+       before connecting to a database.
+      </para>
+      <para>
+       This option is never essential, since
+       <application>pg_amcheck</application> will automatically prompt for a
+       password if the server demands password authentication.  However,
+       <application>pg_amcheck</application> will waste a connection attempt
+       finding out that the server wants a password.  In some cases it is
+       worth typing <option>-W</option> to avoid the extra connection attempt.
+      </para>
+     </listitem>
+    </varlistentry>
+
+   </variablelist>
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>Notes</title>
+
+  <para>
+   <application>pg_amcheck</application> is designed to work with
+   <productname>PostgreSQL</productname> 14.0 and later.
+  </para>
+ </refsect1>
+
+ <refsect1>
+  <title>See Also</title>
+
+  <simplelist type="inline">
+   <member><xref linkend="amcheck"/></member>
+  </simplelist>
+ </refsect1>
+</refentry>
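As a side note on the options documented in the page above: they compose in a predictable way, and a wrapper script can assemble an invocation mechanically. The sketch below is a hypothetical illustration, not part of the patch; the `build_amcheck_argv` helper and its parameter names are invented here, while the flag spellings (`--jobs`, `--progress`, `--skip`, `--table`, `--exclude-index`) come from the documentation.

```python
# Hypothetical helper showing how the documented pg_amcheck flags compose
# into a command line.  Not part of the patch; for illustration only.

def build_amcheck_argv(jobs=1, progress=False, skip=None,
                       tables=(), exclude_indexes=()):
    """Return an argv list for pg_amcheck using the documented options."""
    argv = ["pg_amcheck"]
    if jobs > 1:
        argv.append("--jobs=%d" % jobs)     # capped server-side at one per object
    if progress:
        argv.append("--progress")
    if skip is not None:
        argv.append("--skip=%s" % skip)     # "all-frozen" or "all-visible"
    for pat in tables:
        argv.append("--table=%s" % pat)     # may be schema- or database-qualified
    for pat in exclude_indexes:
        argv.append("--exclude-index=%s" % pat)
    return argv

print(build_amcheck_argv(jobs=4, progress=True, skip="all-frozen",
                         tables=["mydb.public.big*"]))
```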
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index dd2bddab8c..40b0406f1c 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -246,6 +246,7 @@
    &dropdb;
    &dropuser;
    &ecpgRef;
+   &pgAMCheck;
    &pgBasebackup;
    &pgbench;
    &pgConfig;
diff --git a/src/bin/Makefile b/src/bin/Makefile
index f7573efcd3..2fe0ae6652 100644
--- a/src/bin/Makefile
+++ b/src/bin/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
 
 SUBDIRS = \
 	initdb \
+	pg_amcheck \
 	pg_archivecleanup \
 	pg_basebackup \
 	pg_checksums \
diff --git a/src/bin/pg_amcheck/.gitignore b/src/bin/pg_amcheck/.gitignore
new file mode 100644
index 0000000000..c21a14de31
--- /dev/null
+++ b/src/bin/pg_amcheck/.gitignore
@@ -0,0 +1,3 @@
+pg_amcheck
+
+/tmp_check/
diff --git a/src/bin/pg_amcheck/Makefile b/src/bin/pg_amcheck/Makefile
new file mode 100644
index 0000000000..6192523f10
--- /dev/null
+++ b/src/bin/pg_amcheck/Makefile
@@ -0,0 +1,51 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/bin/pg_amcheck
+#
+# Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/pg_amcheck/Makefile
+#
+#-------------------------------------------------------------------------
+
+PGFILEDESC = "pg_amcheck - detect corruption within database relations"
+PGAPPICON=win32
+
+EXTRA_INSTALL=contrib/amcheck contrib/pageinspect
+
+subdir = src/bin/pg_amcheck
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
+LDFLAGS_INTERNAL += -L$(top_builddir)/src/fe_utils -lpgfeutils $(libpq_pgport)
+
+OBJS = \
+	$(WIN32RES) \
+	pg_amcheck.o
+
+all: pg_amcheck
+
+pg_amcheck: $(OBJS) | submake-libpq submake-libpgport submake-libpgfeutils
+	$(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
+
+
+install: all installdirs
+	$(INSTALL_PROGRAM) pg_amcheck$(X) '$(DESTDIR)$(bindir)/pg_amcheck$(X)'
+
+installdirs:
+	$(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+uninstall:
+	rm -f '$(DESTDIR)$(bindir)/pg_amcheck$(X)'
+
+clean distclean maintainer-clean:
+	rm -f pg_amcheck$(X) $(OBJS)
+	rm -rf tmp_check
+
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
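Before the C source below, a note on the pattern-qualification rule it implements (and which the documentation describes): `a.b.c` selects database, schema, and relation; `a.b` selects schema and relation; a bare `a` selects just the relation. The following sketch is only an illustration of that splitting rule; the actual patch converts each part to a regex in C and handles quoted identifiers, which this deliberately ignores.

```python
# Illustrative sketch of pg_amcheck's pattern qualification rule:
# "db.schema.rel", "schema.rel", or "rel".  The real C code builds
# regexes per part and handles quoting; this sketch does not.

def split_pattern(pattern):
    """Split a dotted pattern into (database, schema, relation) parts."""
    parts = pattern.split(".")
    if len(parts) == 1:
        return (None, None, parts[0])
    if len(parts) == 2:
        return (None, parts[0], parts[1])
    if len(parts) == 3:
        return (parts[0], parts[1], parts[2])
    raise ValueError("improper qualified name: %s" % pattern)

print(split_pattern("mydb*.myschema*.myrel*"))
```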
diff --git a/src/bin/pg_amcheck/pg_amcheck.c b/src/bin/pg_amcheck/pg_amcheck.c
new file mode 100644
index 0000000000..59dbf9e9a0
--- /dev/null
+++ b/src/bin/pg_amcheck/pg_amcheck.c
@@ -0,0 +1,2134 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_amcheck.c
+ *		Detects corruption within database relations.
+ *
+ * Copyright (c) 2017-2021, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/bin/pg_amcheck/pg_amcheck.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres_fe.h"
+
+#include <time.h>
+
+#include "catalog/pg_am_d.h"
+#include "catalog/pg_namespace_d.h"
+#include "common/logging.h"
+#include "common/username.h"
+#include "fe_utils/cancel.h"
+#include "fe_utils/option_utils.h"
+#include "fe_utils/parallel_slot.h"
+#include "fe_utils/query_utils.h"
+#include "fe_utils/simple_list.h"
+#include "fe_utils/string_utils.h"
+#include "getopt_long.h"		/* pgrminclude ignore */
+#include "pgtime.h"
+#include "storage/block.h"
+
+typedef struct PatternInfo
+{
+	const char *pattern;		/* Unaltered pattern from the command line */
+	char	   *db_regex;		/* Database regexp parsed from pattern, or
+								 * NULL */
+	char	   *nsp_regex;		/* Schema regexp parsed from pattern, or NULL */
+	char	   *rel_regex;		/* Relation regexp parsed from pattern, or
+								 * NULL */
+	bool		heap_only;		/* true if rel_regex should only match heap
+								 * tables */
+	bool		btree_only;		/* true if rel_regex should only match btree
+								 * indexes */
+	bool		matched;		/* true if the pattern matched in any database */
+} PatternInfo;
+
+typedef struct PatternInfoArray
+{
+	PatternInfo *data;
+	size_t		len;
+} PatternInfoArray;
+
+/* pg_amcheck command line options controlled by user flags */
+typedef struct AmcheckOptions
+{
+	bool		dbpattern;
+	bool		alldb;
+	bool		echo;
+	bool		quiet;
+	bool		verbose;
+	bool		strict_names;
+	bool		show_progress;
+	int			jobs;
+
+	/* Objects to check or not to check, as lists of PatternInfo structs. */
+	PatternInfoArray include;
+	PatternInfoArray exclude;
+
+	/*
+	 * As an optimization, if any pattern in the exclude list applies to heap
+	 * tables, or similarly if any such pattern applies to btree indexes, or
+	 * to schemas, then these will be true, otherwise false.  These should
+	 * always agree with what you'd conclude by grep'ing through the exclude
+	 * list.
+	 */
+	bool		excludetbl;
+	bool		excludeidx;
+	bool		excludensp;
+
+	/*
+	 * If any inclusion pattern exists, then we should only be checking
+	 * matching relations rather than all relations, so this is true iff
+	 * include is empty.
+	 */
+	bool		allrel;
+
+	/* heap table checking options */
+	bool		no_toast_expansion;
+	bool		reconcile_toast;
+	bool		on_error_stop;
+	int64		startblock;
+	int64		endblock;
+	const char *skip;
+
+	/* btree index checking options */
+	bool		parent_check;
+	bool		rootdescend;
+	bool		heapallindexed;
+
+	/* heap and btree hybrid option */
+	bool		no_btree_expansion;
+} AmcheckOptions;
+
+static AmcheckOptions opts = {
+	.dbpattern = false,
+	.alldb = false,
+	.echo = false,
+	.quiet = false,
+	.verbose = false,
+	.strict_names = true,
+	.show_progress = false,
+	.jobs = 1,
+	.include = {NULL, 0},
+	.exclude = {NULL, 0},
+	.excludetbl = false,
+	.excludeidx = false,
+	.excludensp = false,
+	.allrel = true,
+	.no_toast_expansion = false,
+	.reconcile_toast = true,
+	.on_error_stop = false,
+	.startblock = -1,
+	.endblock = -1,
+	.skip = "none",
+	.parent_check = false,
+	.rootdescend = false,
+	.heapallindexed = false,
+	.no_btree_expansion = false
+};
+
+static const char *progname = NULL;
+
+/* Whether all relations have so far passed their corruption checks */
+static bool all_checks_pass = true;
+
+/* Time last progress report was displayed */
+static pg_time_t last_progress_report = 0;
+static bool progress_since_last_stderr = false;
+
+typedef struct DatabaseInfo
+{
+	char	   *datname;
+	char	   *amcheck_schema; /* escaped, quoted literal */
+} DatabaseInfo;
+
+typedef struct RelationInfo
+{
+	const DatabaseInfo *datinfo;	/* shared by other relinfos */
+	Oid			reloid;
+	bool		is_heap;		/* true if heap, false if btree */
+	char	   *nspname;
+	char	   *relname;
+	int			relpages;
+	int			blocks_to_check;
+	char	   *sql;			/* set during query run, pg_free'd after */
+} RelationInfo;
+
+/*
+ * Query for determining if contrib's amcheck is installed.  If so, selects the
+ * namespace name where amcheck's functions can be found.
+ */
+static const char *amcheck_sql =
+"SELECT n.nspname, x.extversion FROM pg_catalog.pg_extension x"
+"\nJOIN pg_catalog.pg_namespace n ON x.extnamespace = n.oid"
+"\nWHERE x.extname = 'amcheck'";
+
+static void prepare_heap_command(PQExpBuffer sql, RelationInfo *rel,
+								 PGconn *conn);
+static void prepare_btree_command(PQExpBuffer sql, RelationInfo *rel,
+								  PGconn *conn);
+static void run_command(ParallelSlot *slot, const char *sql);
+static bool verify_heap_slot_handler(PGresult *res, PGconn *conn,
+									 void *context);
+static bool verify_btree_slot_handler(PGresult *res, PGconn *conn, void *context);
+static void help(const char *progname);
+static void progress_report(uint64 relations_total, uint64 relations_checked,
+							uint64 relpages_total, uint64 relpages_checked,
+							const char *datname, bool force, bool finished);
+
+static void append_database_pattern(PatternInfoArray *pia, const char *pattern,
+									int encoding);
+static void append_schema_pattern(PatternInfoArray *pia, const char *pattern,
+								  int encoding);
+static void append_relation_pattern(PatternInfoArray *pia, const char *pattern,
+									int encoding);
+static void append_heap_pattern(PatternInfoArray *pia, const char *pattern,
+								int encoding);
+static void append_btree_pattern(PatternInfoArray *pia, const char *pattern,
+								 int encoding);
+static void compile_database_list(PGconn *conn, SimplePtrList *databases,
+								  const char *initial_dbname);
+static void compile_relation_list_one_db(PGconn *conn, SimplePtrList *relations,
+										 const DatabaseInfo *datinfo,
+										 uint64 *pagecount);
+
+#define log_no_match(...) do { \
+		if (opts.strict_names) \
+			pg_log_generic(PG_LOG_ERROR, __VA_ARGS__); \
+		else \
+			pg_log_generic(PG_LOG_WARNING, __VA_ARGS__); \
+	} while(0)
+
+#define FREE_AND_SET_NULL(x) do { \
+	pg_free(x); \
+	(x) = NULL; \
+	} while (0)
+
+int
+main(int argc, char *argv[])
+{
+	PGconn	   *conn = NULL;
+	SimplePtrListCell *cell;
+	SimplePtrList databases = {NULL, NULL};
+	SimplePtrList relations = {NULL, NULL};
+	bool		failed = false;
+	const char *latest_datname;
+	int			parallel_workers;
+	ParallelSlotArray *sa;
+	PQExpBufferData sql;
+	uint64		reltotal = 0;
+	uint64		pageschecked = 0;
+	uint64		pagestotal = 0;
+	uint64		relprogress = 0;
+	int			pattern_id;
+
+	static struct option long_options[] = {
+		/* Connection options */
+		{"host", required_argument, NULL, 'h'},
+		{"port", required_argument, NULL, 'p'},
+		{"username", required_argument, NULL, 'U'},
+		{"no-password", no_argument, NULL, 'w'},
+		{"password", no_argument, NULL, 'W'},
+		{"maintenance-db", required_argument, NULL, 1},
+
+		/* check options */
+		{"all", no_argument, NULL, 'a'},
+		{"database", required_argument, NULL, 'd'},
+		{"exclude-database", required_argument, NULL, 'D'},
+		{"echo", no_argument, NULL, 'e'},
+		{"index", required_argument, NULL, 'i'},
+		{"exclude-index", required_argument, NULL, 'I'},
+		{"jobs", required_argument, NULL, 'j'},
+		{"progress", no_argument, NULL, 'P'},
+		{"quiet", no_argument, NULL, 'q'},
+		{"relation", required_argument, NULL, 'r'},
+		{"exclude-relation", required_argument, NULL, 'R'},
+		{"schema", required_argument, NULL, 's'},
+		{"exclude-schema", required_argument, NULL, 'S'},
+		{"table", required_argument, NULL, 't'},
+		{"exclude-table", required_argument, NULL, 'T'},
+		{"verbose", no_argument, NULL, 'v'},
+		{"no-dependent-indexes", no_argument, NULL, 2},
+		{"no-dependent-toast", no_argument, NULL, 3},
+		{"exclude-toast-pointers", no_argument, NULL, 4},
+		{"on-error-stop", no_argument, NULL, 5},
+		{"skip", required_argument, NULL, 6},
+		{"startblock", required_argument, NULL, 7},
+		{"endblock", required_argument, NULL, 8},
+		{"rootdescend", no_argument, NULL, 9},
+		{"no-strict-names", no_argument, NULL, 10},
+		{"heapallindexed", no_argument, NULL, 11},
+		{"parent-check", no_argument, NULL, 12},
+
+		{NULL, 0, NULL, 0}
+	};
+
+	int			optindex;
+	int			c;
+
+	const char *db = NULL;
+	const char *maintenance_db = NULL;
+
+	const char *host = NULL;
+	const char *port = NULL;
+	const char *username = NULL;
+	enum trivalue prompt_password = TRI_DEFAULT;
+	int			encoding = pg_get_encoding_from_locale(NULL, false);
+	ConnParams	cparams;
+
+	pg_logging_init(argv[0]);
+	progname = get_progname(argv[0]);
+	set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_amcheck"));
+
+	handle_help_version_opts(argc, argv, progname, help);
+
+	/* process command-line options */
+	while ((c = getopt_long(argc, argv, "ad:D:eh:Hi:I:j:p:Pqr:R:s:S:t:T:U:wWv",
+							long_options, &optindex)) != -1)
+	{
+		char	   *endptr;
+
+		switch (c)
+		{
+			case 'a':
+				opts.alldb = true;
+				break;
+			case 'd':
+				opts.dbpattern = true;
+				append_database_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'D':
+				opts.dbpattern = true;
+				append_database_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 'e':
+				opts.echo = true;
+				break;
+			case 'h':
+				host = pg_strdup(optarg);
+				break;
+			case 'i':
+				opts.allrel = false;
+				append_btree_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'I':
+				opts.excludeidx = true;
+				append_btree_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 'j':
+				opts.jobs = atoi(optarg);
+				if (opts.jobs < 1)
+				{
+					fprintf(stderr,
+							"number of parallel jobs must be at least 1\n");
+					exit(1);
+				}
+				break;
+			case 'p':
+				port = pg_strdup(optarg);
+				break;
+			case 'P':
+				opts.show_progress = true;
+				break;
+			case 'q':
+				opts.quiet = true;
+				break;
+			case 'r':
+				opts.allrel = false;
+				append_relation_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'R':
+				opts.excludeidx = true;
+				opts.excludetbl = true;
+				append_relation_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 's':
+				opts.allrel = false;
+				append_schema_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'S':
+				opts.excludensp = true;
+				append_schema_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 't':
+				opts.allrel = false;
+				append_heap_pattern(&opts.include, optarg, encoding);
+				break;
+			case 'T':
+				opts.excludetbl = true;
+				append_heap_pattern(&opts.exclude, optarg, encoding);
+				break;
+			case 'U':
+				username = pg_strdup(optarg);
+				break;
+			case 'w':
+				prompt_password = TRI_NO;
+				break;
+			case 'W':
+				prompt_password = TRI_YES;
+				break;
+			case 'v':
+				opts.verbose = true;
+				pg_logging_increase_verbosity();
+				break;
+			case 1:
+				maintenance_db = pg_strdup(optarg);
+				break;
+			case 2:
+				opts.no_btree_expansion = true;
+				break;
+			case 3:
+				opts.no_toast_expansion = true;
+				break;
+			case 4:
+				opts.reconcile_toast = false;
+				break;
+			case 5:
+				opts.on_error_stop = true;
+				break;
+			case 6:
+				if (pg_strcasecmp(optarg, "all-visible") == 0)
+					opts.skip = "all visible";
+				else if (pg_strcasecmp(optarg, "all-frozen") == 0)
+					opts.skip = "all frozen";
+				else if (pg_strcasecmp(optarg, "none") == 0)
+					opts.skip = "none";
+				else
+				{
+					fprintf(stderr, "invalid skip option: \"%s\"\n", optarg);
+					exit(1);
+				}
+				break;
+			case 7:
+				opts.startblock = strtol(optarg, &endptr, 10);
+				if (*endptr != '\0')
+				{
+					fprintf(stderr,
+							"invalid start block\n");
+					exit(1);
+				}
+				if (opts.startblock > MaxBlockNumber || opts.startblock < 0)
+				{
+					fprintf(stderr,
+							"start block out of bounds\n");
+					exit(1);
+				}
+				break;
+			case 8:
+				opts.endblock = strtol(optarg, &endptr, 10);
+				if (*endptr != '\0')
+				{
+					fprintf(stderr,
+							"invalid end block\n");
+					exit(1);
+				}
+				if (opts.endblock > MaxBlockNumber || opts.endblock < 0)
+				{
+					fprintf(stderr,
+							"end block out of bounds\n");
+					exit(1);
+				}
+				break;
+			case 9:
+				opts.rootdescend = true;
+				opts.parent_check = true;
+				break;
+			case 10:
+				opts.strict_names = false;
+				break;
+			case 11:
+				opts.heapallindexed = true;
+				break;
+			case 12:
+				opts.parent_check = true;
+				break;
+			default:
+				fprintf(stderr,
+						"Try \"%s --help\" for more information.\n",
+						progname);
+				exit(1);
+		}
+	}
+
+	if (opts.endblock >= 0 && opts.endblock < opts.startblock)
+	{
+		fprintf(stderr,
+				"end block precedes start block\n");
+		exit(1);
+	}
+
+	/*
+	 * A single non-option argument specifies a database name or connection
+	 * string.
+	 */
+	if (optind < argc)
+	{
+		db = argv[optind];
+		optind++;
+	}
+
+	if (optind < argc)
+	{
+		pg_log_error("too many command-line arguments (first is \"%s\")",
+					 argv[optind]);
+		fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+		exit(1);
+	}
+
+	/* fill cparams except for dbname, which is set below */
+	cparams.pghost = host;
+	cparams.pgport = port;
+	cparams.pguser = username;
+	cparams.prompt_password = prompt_password;
+	cparams.dbname = NULL;
+	cparams.override_dbname = NULL;
+
+	setup_cancel_handler(NULL);
+
+	/* choose the database for our initial connection */
+	if (opts.alldb)
+	{
+		if (db != NULL)
+		{
+			pg_log_error("cannot check all databases and a specific one at the same time");
+			exit(1);
+		}
+		cparams.dbname = maintenance_db;
+	}
+	else if (db != NULL)
+	{
+		if (opts.dbpattern)
+		{
+			pg_log_error("cannot check a specific database and specify database patterns at the same time");
+			exit(1);
+		}
+		cparams.dbname = db;
+	}
+
+	if (opts.alldb || opts.dbpattern)
+	{
+		conn = connectMaintenanceDatabase(&cparams, progname, opts.echo);
+		compile_database_list(conn, &databases, NULL);
+	}
+	else
+	{
+		if (cparams.dbname == NULL)
+		{
+			if (getenv("PGDATABASE"))
+				cparams.dbname = getenv("PGDATABASE");
+			else if (getenv("PGUSER"))
+				cparams.dbname = getenv("PGUSER");
+			else
+				cparams.dbname = get_user_name_or_exit(progname);
+		}
+		conn = connectDatabase(&cparams, progname, opts.echo, false, true);
+		compile_database_list(conn, &databases, PQdb(conn));
+	}
+
+	if (databases.head == NULL)
+	{
+		if (conn != NULL)
+			disconnectDatabase(conn);
+		pg_log_error("no databases to check");
+		exit(0);
+	}
+
+	/*
+	 * Compile a list of all relations spanning all databases to be checked.
+	 */
+	for (cell = databases.head; cell; cell = cell->next)
+	{
+		PGresult   *result;
+		int			ntups;
+		const char *amcheck_schema = NULL;
+		DatabaseInfo *dat = (DatabaseInfo *) cell->ptr;
+
+		cparams.override_dbname = dat->datname;
+		if (conn == NULL || strcmp(PQdb(conn), dat->datname) != 0)
+		{
+			if (conn != NULL)
+				disconnectDatabase(conn);
+			conn = connectDatabase(&cparams, progname, opts.echo, false, true);
+		}
+
+		/*
+		 * Verify that amcheck is installed in the current database.  User
+		 * error could result in a database not having amcheck that should
+		 * have it, but we also could be iterating over multiple databases
+		 * where not all of them have amcheck installed (for example,
+		 * 'template1').
+		 */
+		result = executeQuery(conn, amcheck_sql, opts.echo);
+		if (PQresultStatus(result) != PGRES_TUPLES_OK)
+		{
+			/* Querying the catalog failed. */
+			pg_log_error("database \"%s\": %s",
+						 PQdb(conn), PQerrorMessage(conn));
+			pg_log_info("query was: %s", amcheck_sql);
+			PQclear(result);
+			disconnectDatabase(conn);
+			exit(1);
+		}
+		ntups = PQntuples(result);
+		if (ntups == 0)
+		{
+			/* Querying the catalog succeeded, but amcheck is missing. */
+			pg_log_warning("skipping database \"%s\": amcheck is not installed",
+						   PQdb(conn));
+			disconnectDatabase(conn);
+			conn = NULL;
+			continue;
+		}
+		amcheck_schema = PQgetvalue(result, 0, 0);
+		if (opts.verbose)
+			pg_log_info("in database \"%s\": using amcheck version \"%s\" in schema \"%s\"",
+						PQdb(conn), PQgetvalue(result, 0, 1), amcheck_schema);
+		dat->amcheck_schema = PQescapeIdentifier(conn, amcheck_schema,
+												 strlen(amcheck_schema));
+		PQclear(result);
+
+		compile_relation_list_one_db(conn, &relations, dat, &pagestotal);
+	}
+
+	/*
+	 * Check that all inclusion patterns matched at least one schema or
+	 * relation that we can check.
+	 */
+	for (pattern_id = 0; pattern_id < opts.include.len; pattern_id++)
+	{
+		PatternInfo *pat = &opts.include.data[pattern_id];
+
+		if (!pat->matched && (pat->nsp_regex != NULL || pat->rel_regex != NULL))
+		{
+			failed = opts.strict_names;
+
+			if (!opts.quiet || failed)
+			{
+				if (pat->heap_only)
+					log_no_match("no heap tables to check matching \"%s\"",
+								 pat->pattern);
+				else if (pat->btree_only)
+					log_no_match("no btree indexes to check matching \"%s\"",
+								 pat->pattern);
+				else if (pat->rel_regex == NULL)
+					log_no_match("no relations to check in schemas matching \"%s\"",
+								 pat->pattern);
+				else
+					log_no_match("no relations to check matching \"%s\"",
+								 pat->pattern);
+			}
+		}
+	}
+
+	if (failed)
+	{
+		if (conn != NULL)
+			disconnectDatabase(conn);
+		exit(1);
+	}
+
+	/*
+	 * Set parallel_workers to the lesser of opts.jobs and the number of
+	 * relations.
+	 */
+	parallel_workers = 0;
+	for (cell = relations.head; cell; cell = cell->next)
+	{
+		reltotal++;
+		if (parallel_workers < opts.jobs)
+			parallel_workers++;
+	}
+
+	if (reltotal == 0)
+	{
+		if (conn != NULL)
+			disconnectDatabase(conn);
+		pg_log_error("no relations to check");
+		exit(1);
+	}
+	progress_report(reltotal, relprogress, pagestotal, pageschecked, NULL, true, false);
+
+	/*
+	 * Main event loop.
+	 *
+	 * We use server-side parallelism to check up to parallel_workers
+	 * relations in parallel.  The list of relations was computed in database
+	 * order, which minimizes the number of connects and disconnects as we
+	 * process the list.
+	 */
+	latest_datname = NULL;
+	sa = ParallelSlotsSetup(parallel_workers, &cparams, progname, opts.echo,
+							NULL);
+	if (conn != NULL)
+	{
+		ParallelSlotsAdoptConn(sa, conn);
+		conn = NULL;
+	}
+
+	initPQExpBuffer(&sql);
+	for (relprogress = 0, cell = relations.head; cell; cell = cell->next)
+	{
+		ParallelSlot *free_slot;
+		RelationInfo *rel;
+
+		rel = (RelationInfo *) cell->ptr;
+
+		if (CancelRequested)
+		{
+			failed = true;
+			break;
+		}
+
+		/*
+		 * The list of relations is in database sorted order.  If this next
+		 * relation is in a different database than the last one seen, we are
+		 * about to start checking this database.  Note that other slots may
+		 * still be working on relations from prior databases.
+		 */
+		latest_datname = rel->datinfo->datname;
+
+		progress_report(reltotal, relprogress, pagestotal, pageschecked, latest_datname, false, false);
+
+		relprogress++;
+		pageschecked += rel->blocks_to_check;
+
+		/*
+		 * Get a parallel slot for the next amcheck command, blocking if
+		 * necessary until one is available, or until a previously issued slot
+		 * command fails, indicating that we should abort checking the
+		 * remaining objects.
+		 */
+		free_slot = ParallelSlotsGetIdle(sa, rel->datinfo->datname);
+		if (!free_slot)
+		{
+			/*
+			 * Something failed.  We don't need to know what it was, because
+			 * the handler should already have emitted the necessary error
+			 * messages.
+			 */
+			failed = true;
+			break;
+		}
+
+		if (opts.verbose)
+			PQsetErrorVerbosity(free_slot->connection, PQERRORS_VERBOSE);
+		else if (opts.quiet)
+			PQsetErrorVerbosity(free_slot->connection, PQERRORS_TERSE);
+
+		/*
+		 * Execute the appropriate amcheck command for this relation using our
+		 * slot's database connection.  We do not wait for the command to
+		 * complete, nor do we perform any error checking, as that is done by
+		 * the parallel slots and our handler callback functions.
+		 */
+		if (rel->is_heap)
+		{
+			if (opts.verbose)
+			{
+				if (opts.show_progress && progress_since_last_stderr)
+					fprintf(stderr, "\n");
+				pg_log_info("checking heap table \"%s\".\"%s\".\"%s\"",
+							rel->datinfo->datname, rel->nspname, rel->relname);
+				progress_since_last_stderr = false;
+			}
+			prepare_heap_command(&sql, rel, free_slot->connection);
+			rel->sql = pstrdup(sql.data);	/* pg_free'd after command */
+			ParallelSlotSetHandler(free_slot, verify_heap_slot_handler, rel);
+			run_command(free_slot, rel->sql);
+		}
+		else
+		{
+			if (opts.verbose)
+			{
+				if (opts.show_progress && progress_since_last_stderr)
+					fprintf(stderr, "\n");
+
+				pg_log_info("checking btree index \"%s\".\"%s\".\"%s\"",
+							rel->datinfo->datname, rel->nspname, rel->relname);
+				progress_since_last_stderr = false;
+			}
+			prepare_btree_command(&sql, rel, free_slot->connection);
+			rel->sql = pstrdup(sql.data);	/* pg_free'd after command */
+			ParallelSlotSetHandler(free_slot, verify_btree_slot_handler, rel);
+			run_command(free_slot, rel->sql);
+		}
+	}
+	termPQExpBuffer(&sql);
+
+	if (!failed)
+	{
+		/*
+		 * Wait for all slots to complete, or for one to indicate that an
+		 * error occurred.  Like above, we rely on the handler emitting the
+		 * necessary error messages.
+		 */
+		if (sa && !ParallelSlotsWaitCompletion(sa))
+			failed = true;
+
+		progress_report(reltotal, relprogress, pagestotal, pageschecked, NULL, true, true);
+	}
+
+	if (sa)
+	{
+		ParallelSlotsTerminate(sa);
+		FREE_AND_SET_NULL(sa);
+	}
+
+	if (failed)
+		exit(1);
+
+	if (!all_checks_pass)
+		exit(2);
+}
+
+/*
+ * prepare_heap_command
+ *
+ * Creates a SQL command for running amcheck checking on the given heap
+ * relation.  The command is phrased as a SQL query, with column order and
+ * names matching the expectations of verify_heap_slot_handler, which will
+ * receive and handle each row returned from the verify_heapam() function.
+ *
+ * sql: buffer into which the heap table checking command will be written
+ * rel: relation information for the heap table to be checked
+ * conn: the connection to be used, for string escaping purposes
+ */
+static void
+prepare_heap_command(PQExpBuffer sql, RelationInfo *rel, PGconn *conn)
+{
+	resetPQExpBuffer(sql);
+	appendPQExpBuffer(sql,
+					  "SELECT blkno, offnum, attnum, msg FROM %s.verify_heapam("
+					  "\nrelation := %u, on_error_stop := %s, check_toast := %s, skip := '%s'",
+					  rel->datinfo->amcheck_schema,
+					  rel->reloid,
+					  opts.on_error_stop ? "true" : "false",
+					  opts.reconcile_toast ? "true" : "false",
+					  opts.skip);
+
+	if (opts.startblock >= 0)
+		appendPQExpBuffer(sql, ", startblock := " INT64_FORMAT, opts.startblock);
+	if (opts.endblock >= 0)
+		appendPQExpBuffer(sql, ", endblock := " INT64_FORMAT, opts.endblock);
+
+	appendPQExpBuffer(sql, ")");
+}
+
+/*
+ * prepare_btree_command
+ *
+ * Creates a SQL command for running amcheck checking on the given btree index
+ * relation.  The command does not select any columns, as btree checking
+ * functions do not return any, but rather return corruption information by
+ * raising errors, which verify_btree_slot_handler expects.
+ *
+ * sql: buffer into which the btree index checking command will be written
+ * rel: relation information for the index to be checked
+ * conn: the connection to be used, for string escaping purposes
+ */
+static void
+prepare_btree_command(PQExpBuffer sql, RelationInfo *rel, PGconn *conn)
+{
+	resetPQExpBuffer(sql);
+
+	/*
+	 * Identify the index by OID rather than by name.  If the check throws an
+	 * error, verify_btree_slot_handler reports which relation the error came
+	 * from using the RelationInfo passed as its context.
+	 */
+	if (opts.parent_check)
+		appendPQExpBuffer(sql,
+						  "SELECT * FROM %s.bt_index_parent_check("
+						  "index := '%u'::regclass, heapallindexed := %s, "
+						  "rootdescend := %s)",
+						  rel->datinfo->amcheck_schema,
+						  rel->reloid,
+						  (opts.heapallindexed ? "true" : "false"),
+						  (opts.rootdescend ? "true" : "false"));
+	else
+		appendPQExpBuffer(sql,
+						  "SELECT * FROM %s.bt_index_check("
+						  "index := '%u'::regclass, heapallindexed := %s)",
+						  rel->datinfo->amcheck_schema,
+						  rel->reloid,
+						  (opts.heapallindexed ? "true" : "false"));
+}
+
+/*
+ * run_command
+ *
+ * Sends a command to the server without waiting for the command to complete.
+ * Logs an error if the command cannot be sent, but otherwise any errors are
+ * expected to be handled by a ParallelSlotHandler.
+ *
+ * slot: slot with connection to the server we should use for the command
+ * sql: query to send
+ */
+static void
+run_command(ParallelSlot *slot, const char *sql)
+{
+	if (opts.echo)
+		printf("%s\n", sql);
+
+	if (PQsendQuery(slot->connection, sql) == 0)
+	{
+		pg_log_error("error sending command to database \"%s\": %s",
+					 PQdb(slot->connection),
+					 PQerrorMessage(slot->connection));
+		pg_log_error("command was: %s", sql);
+		exit(1);
+	}
+}
+
+/*
+ * should_processing_continue
+ *
+ * Checks a query result returned from a query (presumably issued on a slot's
+ * connection) to determine if parallel slots should continue issuing further
+ * commands.
+ *
+ * Note: Heap relation corruption is reported by verify_heapam() via the result
+ * set, rather than an ERROR, but running verify_heapam() on a corrupted heap
+ * table may still result in an error being returned from the server due to
+ * missing relation files, bad checksums, etc.  The btree corruption checking
+ * functions always use errors to communicate corruption messages.  We can't
+ * just abort processing because we got a mere ERROR.
+ *
+ * res: result from an executed sql query
+ */
+static bool
+should_processing_continue(PGresult *res)
+{
+	const char *severity;
+
+	switch (PQresultStatus(res))
+	{
+			/* These are expected and ok */
+		case PGRES_COMMAND_OK:
+		case PGRES_TUPLES_OK:
+		case PGRES_NONFATAL_ERROR:
+			break;
+
+			/* This is expected but requires closer scrutiny */
+		case PGRES_FATAL_ERROR:
+			severity = PQresultErrorField(res, PG_DIAG_SEVERITY_NONLOCALIZED);
+			if (severity == NULL)
+				return false;	/* libpq failure; connection may be lost */
+			if (strcmp(severity, "FATAL") == 0)
+				return false;
+			if (strcmp(severity, "PANIC") == 0)
+				return false;
+			break;
+
+			/* These are unexpected */
+		case PGRES_BAD_RESPONSE:
+		case PGRES_EMPTY_QUERY:
+		case PGRES_COPY_OUT:
+		case PGRES_COPY_IN:
+		case PGRES_COPY_BOTH:
+		case PGRES_SINGLE_TUPLE:
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Returns a copy of the argument string with all lines indented four spaces.
+ *
+ * The caller should pg_free the result when finished with it.
+ */
+static char *
+indent_lines(const char *str)
+{
+	PQExpBufferData buf;
+	const char *c;
+	char	   *result;
+
+	initPQExpBuffer(&buf);
+	appendPQExpBufferStr(&buf, "    ");
+	for (c = str; *c; c++)
+	{
+		appendPQExpBufferChar(&buf, *c);
+		if (c[0] == '\n' && c[1] != '\0')
+			appendPQExpBufferStr(&buf, "    ");
+	}
+	result = pstrdup(buf.data);
+	termPQExpBuffer(&buf);
+
+	return result;
+}
+
+/*
+ * verify_heap_slot_handler
+ *
+ * ParallelSlotHandler that receives results from a heap table checking command
+ * created by prepare_heap_command and outputs the results for the user.
+ *
+ * res: result from an executed sql query
+ * conn: connection on which the sql query was executed
+ * context: the RelationInfo for the heap table being checked
+ */
+static bool
+verify_heap_slot_handler(PGresult *res, PGconn *conn, void *context)
+{
+	RelationInfo *rel = (RelationInfo *) context;
+
+	if (PQresultStatus(res) == PGRES_TUPLES_OK)
+	{
+		int			i;
+		int			ntups = PQntuples(res);
+
+		if (ntups > 0)
+			all_checks_pass = false;
+
+		for (i = 0; i < ntups; i++)
+		{
+			const char *msg;
+
+			/* The message string should never be null, but check */
+			if (PQgetisnull(res, i, 3))
+				msg = "NO MESSAGE";
+			else
+				msg = PQgetvalue(res, i, 3);
+
+			if (!PQgetisnull(res, i, 2))
+				printf("heap table \"%s\".\"%s\".\"%s\", block %s, offset %s, attribute %s:\n    %s\n",
+					   rel->datinfo->datname, rel->nspname, rel->relname,
+					   PQgetvalue(res, i, 0),	/* blkno */
+					   PQgetvalue(res, i, 1),	/* offnum */
+					   PQgetvalue(res, i, 2),	/* attnum */
+					   msg);
+
+			else if (!PQgetisnull(res, i, 1))
+				printf("heap table \"%s\".\"%s\".\"%s\", block %s, offset %s:\n    %s\n",
+					   rel->datinfo->datname, rel->nspname, rel->relname,
+					   PQgetvalue(res, i, 0),	/* blkno */
+					   PQgetvalue(res, i, 1),	/* offnum */
+					   msg);
+
+			else if (!PQgetisnull(res, i, 0))
+				printf("heap table \"%s\".\"%s\".\"%s\", block %s:\n    %s\n",
+					   rel->datinfo->datname, rel->nspname, rel->relname,
+					   PQgetvalue(res, i, 0),	/* blkno */
+					   msg);
+
+			else
+				printf("heap table \"%s\".\"%s\".\"%s\":\n    %s\n",
+					   rel->datinfo->datname, rel->nspname, rel->relname, msg);
+		}
+	}
+	else
+	{
+		char	   *msg = indent_lines(PQerrorMessage(conn));
+
+		all_checks_pass = false;
+		printf("heap table \"%s\".\"%s\".\"%s\":\n%s",
+			   rel->datinfo->datname, rel->nspname, rel->relname, msg);
+		if (opts.verbose)
+			printf("query was: %s\n", rel->sql);
+		FREE_AND_SET_NULL(msg);
+	}
+
+	FREE_AND_SET_NULL(rel->sql);
+	FREE_AND_SET_NULL(rel->nspname);
+	FREE_AND_SET_NULL(rel->relname);
+
+	return should_processing_continue(res);
+}
+
+/*
+ * verify_btree_slot_handler
+ *
+ * ParallelSlotHandler that receives results from a btree checking command
+ * created by prepare_btree_command and outputs them for the user.  The result
+ * set from the btree checking command is expected to be empty; when the
+ * command instead fails, the useful information about the corruption is
+ * expected in the connection's error message.
+ *
+ * res: result from an executed sql query
+ * conn: connection on which the sql query was executed
+ * context: the RelationInfo for the index being checked
+ */
+static bool
+verify_btree_slot_handler(PGresult *res, PGconn *conn, void *context)
+{
+	RelationInfo *rel = (RelationInfo *) context;
+
+	if (PQresultStatus(res) == PGRES_TUPLES_OK)
+	{
+		int			ntups = PQntuples(res);
+
+		if (ntups != 1)
+		{
+			/*
+			 * We expect the btree checking functions to return one void row
+			 * each, so we should output some sort of warning if we get
+			 * anything else, not because it indicates corruption, but because
+			 * it suggests a mismatch between amcheck and pg_amcheck versions.
+			 *
+			 * In conjunction with --progress, anything written to stderr at
+			 * this time would present strangely to the user without an extra
+			 * newline, so we print one.  If we were multithreaded, we'd have
+			 * to avoid splitting this across multiple calls, but we're in an
+			 * event loop, so it doesn't matter.
+			 */
+			if (opts.show_progress && progress_since_last_stderr)
+				fprintf(stderr, "\n");
+			pg_log_warning("btree index \"%s\".\"%s\".\"%s\": btree checking function returned unexpected number of rows: %d",
+						   rel->datinfo->datname, rel->nspname, rel->relname, ntups);
+			if (opts.verbose)
+				pg_log_info("query was: %s", rel->sql);
+			pg_log_warning("are %s's and amcheck's versions compatible?",
+						   progname);
+			progress_since_last_stderr = false;
+		}
+	}
+	else
+	{
+		char	   *msg = indent_lines(PQerrorMessage(conn));
+
+		all_checks_pass = false;
+		printf("btree index \"%s\".\"%s\".\"%s\":\n%s",
+			   rel->datinfo->datname, rel->nspname, rel->relname, msg);
+		if (opts.verbose)
+			printf("query was: %s\n", rel->sql);
+		FREE_AND_SET_NULL(msg);
+	}
+
+	FREE_AND_SET_NULL(rel->sql);
+	FREE_AND_SET_NULL(rel->nspname);
+	FREE_AND_SET_NULL(rel->relname);
+
+	return should_processing_continue(res);
+}
+
+/*
+ * help
+ *
+ * Prints help page for the program
+ *
+ * progname: the name of the executed program, such as "pg_amcheck"
+ */
+static void
+help(const char *progname)
+{
+	printf("%s uses the amcheck module to check objects in a PostgreSQL database for corruption.\n\n", progname);
+	printf("Usage:\n");
+	printf("  %s [OPTION]... [DBNAME]\n", progname);
+	printf("\nTarget Options:\n");
+	printf("  -a, --all                      check all databases\n");
+	printf("  -d, --database=PATTERN         check matching database(s)\n");
+	printf("  -D, --exclude-database=PATTERN do NOT check matching database(s)\n");
+	printf("  -i, --index=PATTERN            check matching index(es)\n");
+	printf("  -I, --exclude-index=PATTERN    do NOT check matching index(es)\n");
+	printf("  -r, --relation=PATTERN         check matching relation(s)\n");
+	printf("  -R, --exclude-relation=PATTERN do NOT check matching relation(s)\n");
+	printf("  -s, --schema=PATTERN           check matching schema(s)\n");
+	printf("  -S, --exclude-schema=PATTERN   do NOT check matching schema(s)\n");
+	printf("  -t, --table=PATTERN            check matching table(s)\n");
+	printf("  -T, --exclude-table=PATTERN    do NOT check matching table(s)\n");
+	printf("      --no-dependent-indexes     do NOT expand list of relations to include indexes\n");
+	printf("      --no-dependent-toast       do NOT expand list of relations to include toast\n");
+	printf("      --no-strict-names          do NOT require patterns to match objects\n");
+	printf("\nTable Checking Options:\n");
+	printf("      --exclude-toast-pointers   do NOT follow relation toast pointers\n");
+	printf("      --on-error-stop            stop checking at end of first corrupt page\n");
+	printf("      --skip=OPTION              do NOT check \"all-frozen\" or \"all-visible\" blocks\n");
+	printf("      --startblock=BLOCK         begin checking table(s) at the given block number\n");
+	printf("      --endblock=BLOCK           check table(s) only up to the given block number\n");
+	printf("\nBtree Index Checking Options:\n");
+	printf("      --heapallindexed           check all heap tuples are found within indexes\n");
+	printf("      --parent-check             check index parent/child relationships\n");
+	printf("      --rootdescend              search from root page to refind tuples\n");
+	printf("\nConnection options:\n");
+	printf("  -h, --host=HOSTNAME            database server host or socket directory\n");
+	printf("  -p, --port=PORT                database server port\n");
+	printf("  -U, --username=USERNAME        user name to connect as\n");
+	printf("  -w, --no-password              never prompt for password\n");
+	printf("  -W, --password                 force password prompt\n");
+	printf("      --maintenance-db=DBNAME    alternate maintenance database\n");
+	printf("\nOther Options:\n");
+	printf("  -e, --echo                     show the commands being sent to the server\n");
+	printf("  -j, --jobs=NUM                 use this many concurrent connections to the server\n");
+	printf("  -q, --quiet                    don't write any messages\n");
+	printf("  -v, --verbose                  write a lot of output\n");
+	printf("  -V, --version                  output version information, then exit\n");
+	printf("  -P, --progress                 show progress information\n");
+	printf("  -?, --help                     show this help, then exit\n");
+
+	printf("\nReport bugs to <%s>.\n", PACKAGE_BUGREPORT);
+	printf("%s home page: <%s>\n", PACKAGE_NAME, PACKAGE_URL);
+}
+
+/*
+ * Print a progress report based on the global variables.
+ *
+ * Progress report is written at maximum once per second, unless the force
+ * parameter is set to true.
+ *
+ * If finished is set to true, this is the last progress report. The cursor
+ * is moved to the next line.
+ */
+static void
+progress_report(uint64 relations_total, uint64 relations_checked,
+				uint64 relpages_total, uint64 relpages_checked,
+				const char *datname, bool force, bool finished)
+{
+	int			percent_rel = 0;
+	int			percent_pages = 0;
+	char		checked_rel[32];
+	char		total_rel[32];
+	char		checked_pages[32];
+	char		total_pages[32];
+	pg_time_t	now;
+
+	if (!opts.show_progress)
+		return;
+
+	now = time(NULL);
+	if (now == last_progress_report && !force && !finished)
+		return;					/* Max once per second */
+
+	last_progress_report = now;
+	if (relations_total)
+		percent_rel = (int) (relations_checked * 100 / relations_total);
+	if (relpages_total)
+		percent_pages = (int) (relpages_checked * 100 / relpages_total);
+
+	/*
+	 * Separate step to keep platform-dependent format code out of fprintf
+	 * calls.  We only test for INT64_FORMAT availability in snprintf, not
+	 * fprintf.
+	 */
+	snprintf(checked_rel, sizeof(checked_rel), INT64_FORMAT, relations_checked);
+	snprintf(total_rel, sizeof(total_rel), INT64_FORMAT, relations_total);
+	snprintf(checked_pages, sizeof(checked_pages), INT64_FORMAT, relpages_checked);
+	snprintf(total_pages, sizeof(total_pages), INT64_FORMAT, relpages_total);
+
+#define VERBOSE_DATNAME_LENGTH 35
+	if (opts.verbose)
+	{
+		if (!datname)
+
+			/*
+			 * No datname given, so clear the status line (used for first and
+			 * last call)
+			 */
+			fprintf(stderr,
+					"%*s/%s relations (%d%%) %*s/%s pages (%d%%) %*s",
+					(int) strlen(total_rel),
+					checked_rel, total_rel, percent_rel,
+					(int) strlen(total_pages),
+					checked_pages, total_pages, percent_pages,
+					VERBOSE_DATNAME_LENGTH + 2, "");
+		else
+		{
+			bool		truncate = (strlen(datname) > VERBOSE_DATNAME_LENGTH);
+
+			fprintf(stderr,
+					"%*s/%s relations (%d%%) %*s/%s pages (%d%%), (%s%-*.*s)",
+					(int) strlen(total_rel),
+					checked_rel, total_rel, percent_rel,
+					(int) strlen(total_pages),
+					checked_pages, total_pages, percent_pages,
+			/* Prefix with "..." if we do leading truncation */
+					truncate ? "..." : "",
+					truncate ? VERBOSE_DATNAME_LENGTH - 3 : VERBOSE_DATNAME_LENGTH,
+					truncate ? VERBOSE_DATNAME_LENGTH - 3 : VERBOSE_DATNAME_LENGTH,
+			/* Truncate datname at beginning if it's too long */
+					truncate ? datname + strlen(datname) - VERBOSE_DATNAME_LENGTH + 3 : datname);
+		}
+	}
+	else
+		fprintf(stderr,
+				"%*s/%s relations (%d%%) %*s/%s pages (%d%%)",
+				(int) strlen(total_rel),
+				checked_rel, total_rel, percent_rel,
+				(int) strlen(total_pages),
+				checked_pages, total_pages, percent_pages);
+
+	/*
+	 * Stay on the same line if reporting to a terminal and we're not done
+	 * yet.
+	 */
+	if (!finished && isatty(fileno(stderr)))
+	{
+		fputc('\r', stderr);
+		progress_since_last_stderr = true;
+	}
+	else
+		fputc('\n', stderr);
+}
+
+/*
+ * Extend the pattern info array to hold one additional initialized pattern
+ * info entry.
+ *
+ * Returns a pointer to the new entry.
+ */
+static PatternInfo *
+extend_pattern_info_array(PatternInfoArray *pia)
+{
+	PatternInfo *result;
+
+	pia->len++;
+	pia->data = (PatternInfo *) pg_realloc(pia->data, pia->len * sizeof(PatternInfo));
+	result = &pia->data[pia->len - 1];
+	memset(result, 0, sizeof(*result));
+
+	return result;
+}
+
+/*
+ * append_database_pattern
+ *
+ * Adds the given pattern interpreted as a database name pattern.
+ *
+ * pia: the pattern info array to be appended
+ * pattern: the database name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_database_pattern(PatternInfoArray *pia, const char *pattern, int encoding)
+{
+	PQExpBufferData buf;
+	PatternInfo *info = extend_pattern_info_array(pia);
+
+	initPQExpBuffer(&buf);
+	patternToSQLRegex(encoding, NULL, NULL, &buf, pattern, false);
+	info->pattern = pattern;
+	info->db_regex = pstrdup(buf.data);
+
+	termPQExpBuffer(&buf);
+}
+
+/*
+ * append_schema_pattern
+ *
+ * Adds the given pattern interpreted as a schema name pattern.
+ *
+ * pia: the pattern info array to be appended
+ * pattern: the schema name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_schema_pattern(PatternInfoArray *pia, const char *pattern, int encoding)
+{
+	PQExpBufferData dbbuf;
+	PQExpBufferData nspbuf;
+	PatternInfo *info = extend_pattern_info_array(pia);
+
+	initPQExpBuffer(&dbbuf);
+	initPQExpBuffer(&nspbuf);
+
+	patternToSQLRegex(encoding, NULL, &dbbuf, &nspbuf, pattern, false);
+	info->pattern = pattern;
+	if (dbbuf.data[0])
+	{
+		opts.dbpattern = true;
+		info->db_regex = pstrdup(dbbuf.data);
+	}
+	if (nspbuf.data[0])
+		info->nsp_regex = pstrdup(nspbuf.data);
+
+	termPQExpBuffer(&dbbuf);
+	termPQExpBuffer(&nspbuf);
+}
+
+/*
+ * append_relation_pattern_helper
+ *
+ * Adds to a list the given pattern interpreted as a relation pattern.
+ *
+ * pia: the pattern info array to be appended
+ * pattern: the relation name pattern
+ * encoding: client encoding for parsing the pattern
+ * heap_only: whether the pattern should only be matched against heap tables
+ * btree_only: whether the pattern should only be matched against btree indexes
+ */
+static void
+append_relation_pattern_helper(PatternInfoArray *pia, const char *pattern,
+							   int encoding, bool heap_only, bool btree_only)
+{
+	PQExpBufferData dbbuf;
+	PQExpBufferData nspbuf;
+	PQExpBufferData relbuf;
+	PatternInfo *info = extend_pattern_info_array(pia);
+
+	initPQExpBuffer(&dbbuf);
+	initPQExpBuffer(&nspbuf);
+	initPQExpBuffer(&relbuf);
+
+	patternToSQLRegex(encoding, &dbbuf, &nspbuf, &relbuf, pattern, false);
+	info->pattern = pattern;
+	if (dbbuf.data[0])
+	{
+		opts.dbpattern = true;
+		info->db_regex = pstrdup(dbbuf.data);
+	}
+	if (nspbuf.data[0])
+		info->nsp_regex = pstrdup(nspbuf.data);
+	if (relbuf.data[0])
+		info->rel_regex = pstrdup(relbuf.data);
+
+	termPQExpBuffer(&dbbuf);
+	termPQExpBuffer(&nspbuf);
+	termPQExpBuffer(&relbuf);
+
+	info->heap_only = heap_only;
+	info->btree_only = btree_only;
+}
+
+/*
+ * append_relation_pattern
+ *
+ * Adds the given pattern interpreted as a relation pattern, to be matched
+ * against both heap tables and btree indexes.
+ *
+ * pia: the pattern info array to be appended
+ * pattern: the relation name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_relation_pattern(PatternInfoArray *pia, const char *pattern, int encoding)
+{
+	append_relation_pattern_helper(pia, pattern, encoding, false, false);
+}
+
+/*
+ * append_heap_pattern
+ *
+ * Adds the given pattern interpreted as a relation pattern, to be matched only
+ * against heap tables.
+ *
+ * pia: the pattern info array to be appended
+ * pattern: the relation name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_heap_pattern(PatternInfoArray *pia, const char *pattern, int encoding)
+{
+	append_relation_pattern_helper(pia, pattern, encoding, true, false);
+}
+
+/*
+ * append_btree_pattern
+ *
+ * Adds the given pattern interpreted as a relation pattern, to be matched only
+ * against btree indexes.
+ *
+ * pia: the pattern info array to be appended
+ * pattern: the relation name pattern
+ * encoding: client encoding for parsing the pattern
+ */
+static void
+append_btree_pattern(PatternInfoArray *pia, const char *pattern, int encoding)
+{
+	append_relation_pattern_helper(pia, pattern, encoding, false, true);
+}
+
+/*
+ * append_db_pattern_cte
+ *
+ * Appends to the buffer the body of a Common Table Expression (CTE) containing
+ * the database portions filtered from the list of patterns expressed as two
+ * columns:
+ *
+ *     pattern_id: the index of this pattern in pia->data[]
+ *     rgx: the database regular expression parsed from the pattern
+ *
+ * Patterns without a database portion are skipped.  Patterns with more than
+ * just a database portion are optionally skipped, depending on argument
+ * 'inclusive'.
+ *
+ * buf: the buffer to be appended
+ * pia: the array of patterns to be inserted into the CTE
+ * conn: the database connection
+ * inclusive: whether to include patterns with schema and/or relation parts
+ *
+ * Returns whether any database patterns were appended.
+ */
+static bool
+append_db_pattern_cte(PQExpBuffer buf, const PatternInfoArray *pia,
+					  PGconn *conn, bool inclusive)
+{
+	int			pattern_id;
+	const char *comma;
+	bool		have_values;
+
+	comma = "";
+	have_values = false;
+	for (pattern_id = 0; pattern_id < pia->len; pattern_id++)
+	{
+		PatternInfo *info = &pia->data[pattern_id];
+
+		if (info->db_regex != NULL &&
+			(inclusive || (info->nsp_regex == NULL && info->rel_regex == NULL)))
+		{
+			if (!have_values)
+				appendPQExpBufferStr(buf, "\nVALUES");
+			have_values = true;
+			appendPQExpBuffer(buf, "%s\n(%d, ", comma, pattern_id);
+			appendStringLiteralConn(buf, info->db_regex, conn);
+			appendPQExpBufferStr(buf, ")");
+			comma = ",";
+		}
+	}
+
+	if (!have_values)
+		appendPQExpBufferStr(buf, "\nSELECT NULL, NULL WHERE false");
+
+	return have_values;
+}
+
+/*
+ * compile_database_list
+ *
+ * If any database patterns exist, or if --all was given, compiles a distinct
+ * list of databases to check using a SQL query based on the patterns plus the
+ * literal initial database name, if given.  If no database patterns exist and
+ * --all was not given, the query is not necessary, and only the initial
+ * database name (if any) is added to the list.
+ *
+ * conn: connection to the initial database
+ * databases: the list onto which databases should be appended
+ * initial_dbname: an optional extra database name to include in the list
+ */
+static void
+compile_database_list(PGconn *conn, SimplePtrList *databases,
+					  const char *initial_dbname)
+{
+	PGresult   *res;
+	PQExpBufferData sql;
+	int			ntups;
+	int			i;
+	bool		fatal;
+
+	if (initial_dbname)
+	{
+		DatabaseInfo *dat = (DatabaseInfo *) pg_malloc0(sizeof(DatabaseInfo));
+
+		/* This database is included.  Add to list */
+		if (opts.verbose)
+			pg_log_info("including database: \"%s\"", initial_dbname);
+
+		dat->datname = pstrdup(initial_dbname);
+		simple_ptr_list_append(databases, dat);
+	}
+
+	initPQExpBuffer(&sql);
+
+	/* Append the include patterns CTE. */
+	appendPQExpBufferStr(&sql, "WITH include_raw (pattern_id, rgx) AS (");
+	if (!append_db_pattern_cte(&sql, &opts.include, conn, true) &&
+		!opts.alldb)
+	{
+		/*
+		 * None of the inclusion patterns (if any) contain database portions,
+		 * so there is no need to query the database to resolve database
+		 * patterns.
+		 *
+		 * Since we're also not operating under --all, we don't need to query
+		 * the exhaustive list of connectable databases, either.
+		 */
+		termPQExpBuffer(&sql);
+		return;
+	}
+
+	/* Append the exclude patterns CTE. */
+	appendPQExpBufferStr(&sql, "),\nexclude_raw (pattern_id, rgx) AS (");
+	append_db_pattern_cte(&sql, &opts.exclude, conn, false);
+	appendPQExpBufferStr(&sql, "),");
+
+	/*
+	 * Append the database CTE, which includes whether each database is
+	 * connectable and also joins against exclude_raw to determine whether
+	 * each database is excluded.
+	 */
+	appendPQExpBufferStr(&sql,
+						 "\ndatabase (datname) AS ("
+						 "\nSELECT d.datname "
+						 "FROM pg_catalog.pg_database d "
+						 "LEFT OUTER JOIN exclude_raw e "
+						 "ON d.datname ~ e.rgx "
+						 "\nWHERE d.datallowconn "
+						 "AND e.pattern_id IS NULL"
+						 "),"
+
+	/*
+	 * Append the include_pat CTE, which joins the include_raw CTE against the
+	 * databases CTE to determine if all the inclusion patterns had matches,
+	 * and whether each matched pattern had the misfortune of only matching
+	 * excluded or unconnectable databases.
+	 */
+						 "\ninclude_pat (pattern_id, checkable) AS ("
+						 "\nSELECT i.pattern_id, "
+						 "COUNT(*) FILTER ("
+						 "WHERE d IS NOT NULL"
+						 ") AS checkable"
+						 "\nFROM include_raw i "
+						 "LEFT OUTER JOIN database d "
+						 "ON d.datname ~ i.rgx"
+						 "\nGROUP BY i.pattern_id"
+						 "),"
+
+	/*
+	 * Append the filtered_databases CTE, which selects from the database CTE
+	 * optionally joined against the include_raw CTE to only select databases
+	 * that match an inclusion pattern.  This appears to duplicate what the
+	 * include_pat CTE already did above, but here we want only databases, and
+	 * there we wanted patterns.
+	 */
+						 "\nfiltered_databases (datname) AS ("
+						 "\nSELECT DISTINCT d.datname "
+						 "FROM database d");
+	if (!opts.alldb)
+		appendPQExpBufferStr(&sql,
+							 " INNER JOIN include_raw i "
+							 "ON d.datname ~ i.rgx");
+	appendPQExpBufferStr(&sql,
+						 ")"
+
+	/*
+	 * Select the checkable databases and the unmatched inclusion patterns.
+	 */
+						 "\nSELECT pattern_id, datname FROM ("
+						 "\nSELECT pattern_id, NULL::TEXT AS datname "
+						 "FROM include_pat "
+						 "WHERE checkable = 0 "
+						 "UNION ALL"
+						 "\nSELECT NULL, datname "
+						 "FROM filtered_databases"
+						 ") AS combined_records"
+						 "\nORDER BY pattern_id NULLS LAST, datname");
+
+	res = executeQuery(conn, sql.data, opts.echo);
+	if (PQresultStatus(res) != PGRES_TUPLES_OK)
+	{
+		pg_log_error("query failed: %s", PQerrorMessage(conn));
+		pg_log_info("query was: %s", sql.data);
+		disconnectDatabase(conn);
+		exit(1);
+	}
+	termPQExpBuffer(&sql);
+
+	ntups = PQntuples(res);
+	for (fatal = false, i = 0; i < ntups; i++)
+	{
+		int			pattern_id = -1;
+		const char *datname = NULL;
+
+		if (!PQgetisnull(res, i, 0))
+			pattern_id = atoi(PQgetvalue(res, i, 0));
+		if (!PQgetisnull(res, i, 1))
+			datname = PQgetvalue(res, i, 1);
+
+		if (pattern_id >= 0)
+		{
+			/*
+			 * Current record pertains to an inclusion pattern that matched no
+			 * checkable databases.
+			 */
+			fatal = opts.strict_names;
+			if (pattern_id >= opts.include.len)
+			{
+				pg_log_error("internal error: received unexpected database pattern_id %d",
+							 pattern_id);
+				exit(1);
+			}
+			log_no_match("no connectable databases to check matching \"%s\"",
+						 opts.include.data[pattern_id].pattern);
+		}
+		else
+		{
+			/* Current record pertains to a database */
+			Assert(datname != NULL);
+
+			/* Avoid adding a duplicate entry for the initial_dbname */
+			if (initial_dbname != NULL && strcmp(initial_dbname, datname) == 0)
+				continue;
+
+			DatabaseInfo *dat = (DatabaseInfo *) pg_malloc0(sizeof(DatabaseInfo));
+
+			/* This database is included.  Add to list */
+			if (opts.verbose)
+				pg_log_info("including database: \"%s\"", datname);
+
+			dat->datname = pstrdup(datname);
+			simple_ptr_list_append(databases, dat);
+		}
+	}
+	PQclear(res);
+
+	if (fatal)
+	{
+		if (conn != NULL)
+			disconnectDatabase(conn);
+		exit(1);
+	}
+}
+
+/*
+ * append_rel_pattern_raw_cte
+ *
+ * Appends to the buffer the body of a Common Table Expression (CTE) containing
+ * the given patterns as six columns:
+ *
+ *     pattern_id: the index of this pattern in pia->data[]
+ *     db_regex: the database regexp parsed from the pattern, or NULL if the
+ *               pattern had no database part
+ *     nsp_regex: the namespace regexp parsed from the pattern, or NULL if the
+ *                pattern had no namespace part
+ *     rel_regex: the relname regexp parsed from the pattern, or NULL if the
+ *                pattern had no relname part
+ *     heap_only: true if the pattern applies only to heap tables (not indexes)
+ *     btree_only: true if the pattern applies only to btree indexes (not tables)
+ *
+ * buf: the buffer to be appended
+ * pia: the array of patterns to be inserted into the CTE
+ * conn: the database connection
+ */
+static void
+append_rel_pattern_raw_cte(PQExpBuffer buf, const PatternInfoArray *pia,
+						   PGconn *conn)
+{
+	int			pattern_id;
+	const char *comma;
+	bool		have_values;
+
+	comma = "";
+	have_values = false;
+	for (pattern_id = 0; pattern_id < pia->len; pattern_id++)
+	{
+		PatternInfo *info = &pia->data[pattern_id];
+
+		if (!have_values)
+			appendPQExpBufferStr(buf, "\nVALUES");
+		have_values = true;
+		appendPQExpBuffer(buf, "%s\n(%d::INTEGER, ", comma, pattern_id);
+		if (info->db_regex == NULL)
+			appendPQExpBufferStr(buf, "NULL");
+		else
+			appendStringLiteralConn(buf, info->db_regex, conn);
+		appendPQExpBufferStr(buf, "::TEXT, ");
+		if (info->nsp_regex == NULL)
+			appendPQExpBufferStr(buf, "NULL");
+		else
+			appendStringLiteralConn(buf, info->nsp_regex, conn);
+		appendPQExpBufferStr(buf, "::TEXT, ");
+		if (info->rel_regex == NULL)
+			appendPQExpBufferStr(buf, "NULL");
+		else
+			appendStringLiteralConn(buf, info->rel_regex, conn);
+		if (info->heap_only)
+			appendPQExpBufferStr(buf, "::TEXT, true::BOOLEAN");
+		else
+			appendPQExpBufferStr(buf, "::TEXT, false::BOOLEAN");
+		if (info->btree_only)
+			appendPQExpBufferStr(buf, ", true::BOOLEAN");
+		else
+			appendPQExpBufferStr(buf, ", false::BOOLEAN");
+		appendPQExpBufferStr(buf, ")");
+		comma = ",";
+	}
+
+	if (!have_values)
+		appendPQExpBufferStr(buf,
+							 "\nSELECT NULL::INTEGER, NULL::TEXT, NULL::TEXT, "
+							 "NULL::TEXT, NULL::BOOLEAN, NULL::BOOLEAN "
+							 "WHERE false");
+}
+
+/*
+ * append_rel_pattern_filtered_cte
+ *
+ * Appends to the buffer a Common Table Expression (CTE) which selects
+ * all patterns from the named raw CTE, filtered by database.  All patterns
+ * which have no database portion or whose database portion matches our
+ * connection's database name are selected, with other patterns excluded.
+ *
+ * The basic idea here is that if we're connected to database "foo" and we have
+ * patterns "foo.bar.baz", "alpha.beta" and "one.two.three", we only want to
+ * use the first two while processing relations in this database, as the third
+ * one is not relevant.
+ *
+ * buf: the buffer to be appended
+ * raw: the name of the CTE to select from
+ * filtered: the name of the CTE to create
+ * conn: the database connection
+ */
+static void
+append_rel_pattern_filtered_cte(PQExpBuffer buf, const char *raw,
+								const char *filtered, PGconn *conn)
+{
+	appendPQExpBuffer(buf,
+					  "\n%s (pattern_id, nsp_regex, rel_regex, heap_only, btree_only) AS ("
+					  "\nSELECT pattern_id, nsp_regex, rel_regex, heap_only, btree_only "
+					  "FROM %s r"
+					  "\nWHERE (r.db_regex IS NULL "
+					  "OR ",
+					  filtered, raw);
+	appendStringLiteralConn(buf, PQdb(conn), conn);
+	appendPQExpBufferStr(buf, " ~ r.db_regex)");
+	appendPQExpBufferStr(buf,
+						 " AND (r.nsp_regex IS NOT NULL"
+						 " OR r.rel_regex IS NOT NULL)"
+						 "),");
+}
+
+/*
+ * compile_relation_list_one_db
+ *
+ * Compiles a list of relations to check within the currently connected
+ * database based on the user supplied options, sorted by descending size,
+ * and appends them to the given list of relations.
+ *
+ * The cells of the constructed list contain all information about the relation
+ * necessary to connect to the database and check the object, including which
+ * database to connect to, where contrib/amcheck is installed, and the Oid and
+ * type of object (heap table vs. btree index).  Rather than duplicating the
+ * database details per relation, the relation structs use references to the
+ * same database object, provided by the caller.
+ *
+ * conn: connection to the database to be checked, which should be the same as in 'dat'
+ * relations: list onto which the relations information should be appended
+ * dat: the database info struct for use by each relation
+ * pagecount: gets incremented by the number of blocks to check in all
+ * relations added
+ */
+static void
+compile_relation_list_one_db(PGconn *conn, SimplePtrList *relations,
+							 const DatabaseInfo *dat,
+							 uint64 *pagecount)
+{
+	PGresult   *res;
+	PQExpBufferData sql;
+	int			ntups;
+	int			i;
+
+	initPQExpBuffer(&sql);
+	appendPQExpBufferStr(&sql, "WITH");
+
+	/* Append CTEs for the relation inclusion patterns, if any */
+	if (!opts.allrel)
+	{
+		appendPQExpBufferStr(&sql,
+							 " include_raw (pattern_id, db_regex, nsp_regex, rel_regex, heap_only, btree_only) AS (");
+		append_rel_pattern_raw_cte(&sql, &opts.include, conn);
+		appendPQExpBufferStr(&sql, "\n),");
+		append_rel_pattern_filtered_cte(&sql, "include_raw", "include_pat", conn);
+	}
+
+	/* Append CTEs for the relation exclusion patterns, if any */
+	if (opts.excludetbl || opts.excludeidx || opts.excludensp)
+	{
+		appendPQExpBufferStr(&sql,
+							 " exclude_raw (pattern_id, db_regex, nsp_regex, rel_regex, heap_only, btree_only) AS (");
+		append_rel_pattern_raw_cte(&sql, &opts.exclude, conn);
+		appendPQExpBufferStr(&sql, "\n),");
+		append_rel_pattern_filtered_cte(&sql, "exclude_raw", "exclude_pat", conn);
+	}
+
+	/* Append the relation CTE. */
+	appendPQExpBufferStr(&sql,
+						 " relation (pattern_id, oid, nspname, relname, reltoastrelid, relpages, is_heap, is_btree) AS ("
+						 "\nSELECT DISTINCT ON (c.oid");
+	if (!opts.allrel)
+		appendPQExpBufferStr(&sql, ", ip.pattern_id) ip.pattern_id,");
+	else
+		appendPQExpBufferStr(&sql, ") NULL::INTEGER AS pattern_id,");
+	appendPQExpBuffer(&sql,
+					  "\nc.oid, n.nspname, c.relname, c.reltoastrelid, c.relpages, "
+					  "c.relam = %u AS is_heap, "
+					  "c.relam = %u AS is_btree"
+					  "\nFROM pg_catalog.pg_class c "
+					  "INNER JOIN pg_catalog.pg_namespace n "
+					  "ON c.relnamespace = n.oid",
+					  HEAP_TABLE_AM_OID, BTREE_AM_OID);
+	if (!opts.allrel)
+		appendPQExpBuffer(&sql,
+						  "\nINNER JOIN include_pat ip"
+						  "\nON (n.nspname ~ ip.nsp_regex OR ip.nsp_regex IS NULL)"
+						  "\nAND (c.relname ~ ip.rel_regex OR ip.rel_regex IS NULL)"
+						  "\nAND (c.relam = %u OR NOT ip.heap_only)"
+						  "\nAND (c.relam = %u OR NOT ip.btree_only)",
+						  HEAP_TABLE_AM_OID, BTREE_AM_OID);
+	if (opts.excludetbl || opts.excludeidx || opts.excludensp)
+		appendPQExpBuffer(&sql,
+						  "\nLEFT OUTER JOIN exclude_pat ep"
+						  "\nON (n.nspname ~ ep.nsp_regex OR ep.nsp_regex IS NULL)"
+						  "\nAND (c.relname ~ ep.rel_regex OR ep.rel_regex IS NULL)"
+						  "\nAND (c.relam = %u OR NOT ep.heap_only OR ep.rel_regex IS NULL)"
+						  "\nAND (c.relam = %u OR NOT ep.btree_only OR ep.rel_regex IS NULL)",
+						  HEAP_TABLE_AM_OID, BTREE_AM_OID);
+
+	if (opts.excludetbl || opts.excludeidx || opts.excludensp)
+		appendPQExpBufferStr(&sql, "\nWHERE ep.pattern_id IS NULL");
+	else
+		appendPQExpBufferStr(&sql, "\nWHERE true");
+
+	/*
+	 * We need to be careful not to break the --no-dependent-toast and
+	 * --no-dependent-indexes options.  By default, the btree indexes, toast
+	 * tables, and toast table btree indexes associated with primary heap
+	 * tables are included, using their own CTEs below.  We implement the
+	 * --exclude-* options by not creating those CTEs, but that's no use if
+	 * we've already selected the toast and indexes here.  On the other hand,
+	 * we want inclusion patterns that match indexes or toast tables to be
+	 * honored.  So, if inclusion patterns were given, we want to select all
+	 * tables, toast tables, or indexes that match the patterns.  But if no
+	 * inclusion patterns were given, and we're simply matching all relations,
+	 * then we only want to match the primary tables here.
+	 */
+	if (opts.allrel)
+		appendPQExpBuffer(&sql,
+						  " AND c.relam = %u "
+						  "AND c.relkind IN ('r', 'm', 't') "
+						  "AND c.relnamespace != %u",
+						  HEAP_TABLE_AM_OID, PG_TOAST_NAMESPACE);
+	else
+		appendPQExpBuffer(&sql,
+						  " AND c.relam IN (%u, %u)"
+						  "AND c.relkind IN ('r', 'm', 't', 'i') "
+						  "AND ((c.relam = %u AND c.relkind IN ('r', 'm', 't')) OR "
+						  "(c.relam = %u AND c.relkind = 'i'))",
+						  HEAP_TABLE_AM_OID, BTREE_AM_OID,
+						  HEAP_TABLE_AM_OID, BTREE_AM_OID);
+
+	appendPQExpBufferStr(&sql,
+						 "\nORDER BY c.oid)");
+
+	if (!opts.no_toast_expansion)
+	{
+		/*
+		 * Include a CTE for toast tables associated with primary heap tables
+		 * selected above, filtering by exclusion patterns (if any) that match
+		 * toast table names.
+		 */
+		appendPQExpBufferStr(&sql,
+							 ", toast (oid, nspname, relname, relpages) AS ("
+							 "\nSELECT t.oid, 'pg_toast', t.relname, t.relpages"
+							 "\nFROM pg_catalog.pg_class t "
+							 "INNER JOIN relation r "
+							 "ON r.reltoastrelid = t.oid");
+		if (opts.excludetbl || opts.excludensp)
+			appendPQExpBufferStr(&sql,
+								 "\nLEFT OUTER JOIN exclude_pat ep"
+								 "\nON ('pg_toast' ~ ep.nsp_regex OR ep.nsp_regex IS NULL)"
+								 "\nAND (t.relname ~ ep.rel_regex OR ep.rel_regex IS NULL)"
+								 "\nAND ep.heap_only"
+								 "\nWHERE ep.pattern_id IS NULL");
+		appendPQExpBufferStr(&sql,
+							 "\n)");
+	}
+	if (!opts.no_btree_expansion)
+	{
+		/*
+		 * Include a CTE for btree indexes associated with primary heap tables
+		 * selected above, filtering by exclusion patterns (if any) that match
+		 * btree index names.
+		 */
+		appendPQExpBuffer(&sql,
+						  ", index (oid, nspname, relname, relpages) AS ("
+						  "\nSELECT c.oid, r.nspname, c.relname, c.relpages "
+						  "FROM relation r"
+						  "\nINNER JOIN pg_catalog.pg_index i "
+						  "ON r.oid = i.indrelid "
+						  "INNER JOIN pg_catalog.pg_class c "
+						  "ON i.indexrelid = c.oid");
+		if (opts.excludeidx || opts.excludensp)
+			appendPQExpBufferStr(&sql,
+								 "\nINNER JOIN pg_catalog.pg_namespace n "
+								 "ON c.relnamespace = n.oid"
+								 "\nLEFT OUTER JOIN exclude_pat ep "
+								 "ON (n.nspname ~ ep.nsp_regex OR ep.nsp_regex IS NULL) "
+								 "AND (c.relname ~ ep.rel_regex OR ep.rel_regex IS NULL) "
+								 "AND ep.btree_only"
+								 "\nWHERE ep.pattern_id IS NULL");
+		else
+			appendPQExpBufferStr(&sql,
+								 "\nWHERE true");
+		appendPQExpBuffer(&sql,
+						  " AND c.relam = %u "
+						  "AND c.relkind = 'i'",
+						  BTREE_AM_OID);
+		if (opts.no_toast_expansion)
+			appendPQExpBuffer(&sql,
+							  " AND c.relnamespace != %u",
+							  PG_TOAST_NAMESPACE);
+		appendPQExpBufferStr(&sql, "\n)");
+	}
+
+	if (!opts.no_toast_expansion && !opts.no_btree_expansion)
+	{
+		/*
+		 * Include a CTE for btree indexes associated with toast tables of
+		 * primary heap tables selected above, filtering by exclusion patterns
+		 * (if any) that match the toast index names.
+		 */
+		appendPQExpBuffer(&sql,
+						  ", toast_index (oid, nspname, relname, relpages) AS ("
+						  "\nSELECT c.oid, 'pg_toast', c.relname, c.relpages "
+						  "FROM toast t "
+						  "INNER JOIN pg_catalog.pg_index i "
+						  "ON t.oid = i.indrelid"
+						  "\nINNER JOIN pg_catalog.pg_class c "
+						  "ON i.indexrelid = c.oid");
+		if (opts.excludeidx)
+			appendPQExpBufferStr(&sql,
+								 "\nLEFT OUTER JOIN exclude_pat ep "
+								 "ON ('pg_toast' ~ ep.nsp_regex OR ep.nsp_regex IS NULL) "
+								 "AND (c.relname ~ ep.rel_regex OR ep.rel_regex IS NULL) "
+								 "AND ep.btree_only "
+								 "WHERE ep.pattern_id IS NULL");
+		else
+			appendPQExpBufferStr(&sql,
+								 "\nWHERE true");
+		appendPQExpBuffer(&sql,
+						  " AND c.relam = %u"
+						  " AND c.relkind = 'i')",
+						  BTREE_AM_OID);
+	}
+
+	/*
+	 * Roll-up distinct rows from CTEs.
+	 *
+	 * Relations that match more than one pattern may occur more than once in
+	 * the list, and the indexes and toast tables of primary relations may also have
+	 * matched in their own right, so we rely on UNION to deduplicate the
+	 * list.
+	 */
+	appendPQExpBuffer(&sql,
+					  "\nSELECT pattern_id, is_heap, is_btree, oid, nspname, relname, relpages "
+					  "FROM (");
+	appendPQExpBufferStr(&sql,
+	/* Inclusion patterns that failed to match */
+						 "\nSELECT pattern_id, is_heap, is_btree, "
+						 "NULL::OID AS oid, "
+						 "NULL::TEXT AS nspname, "
+						 "NULL::TEXT AS relname, "
+						 "NULL::INTEGER AS relpages"
+						 "\nFROM relation "
+						 "WHERE pattern_id IS NOT NULL "
+						 "UNION"
+	/* Primary relations */
+						 "\nSELECT NULL::INTEGER AS pattern_id, "
+						 "is_heap, is_btree, oid, nspname, relname, relpages "
+						 "FROM relation");
+	if (!opts.no_toast_expansion)
+		appendPQExpBufferStr(&sql,
+							 " UNION"
+		/* Toast tables for primary relations */
+							 "\nSELECT NULL::INTEGER AS pattern_id, TRUE AS is_heap, "
+							 "FALSE AS is_btree, oid, nspname, relname, relpages "
+							 "FROM toast");
+	if (!opts.no_btree_expansion)
+		appendPQExpBufferStr(&sql,
+							 " UNION"
+		/* Indexes for primary relations */
+							 "\nSELECT NULL::INTEGER AS pattern_id, FALSE AS is_heap, "
+							 "TRUE AS is_btree, oid, nspname, relname, relpages "
+							 "FROM index");
+	if (!opts.no_toast_expansion && !opts.no_btree_expansion)
+		appendPQExpBufferStr(&sql,
+							 " UNION"
+		/* Indexes for toast relations */
+							 "\nSELECT NULL::INTEGER AS pattern_id, FALSE AS is_heap, "
+							 "TRUE AS is_btree, oid, nspname, relname, relpages "
+							 "FROM toast_index");
+	appendPQExpBufferStr(&sql,
+						 "\n) AS combined_records "
+						 "ORDER BY relpages DESC NULLS FIRST, oid");
+
+	res = executeQuery(conn, sql.data, opts.echo);
+	if (PQresultStatus(res) != PGRES_TUPLES_OK)
+	{
+		pg_log_error("query failed: %s", PQerrorMessage(conn));
+		pg_log_info("query was: %s", sql.data);
+		disconnectDatabase(conn);
+		exit(1);
+	}
+	termPQExpBuffer(&sql);
+
+	ntups = PQntuples(res);
+	for (i = 0; i < ntups; i++)
+	{
+		int			pattern_id = -1;
+		bool		is_heap = false;
+		bool		is_btree = false;
+		Oid			oid = InvalidOid;
+		const char *nspname = NULL;
+		const char *relname = NULL;
+		int			relpages = 0;
+
+		if (!PQgetisnull(res, i, 0))
+			pattern_id = atoi(PQgetvalue(res, i, 0));
+		if (!PQgetisnull(res, i, 1))
+			is_heap = (PQgetvalue(res, i, 1)[0] == 't');
+		if (!PQgetisnull(res, i, 2))
+			is_btree = (PQgetvalue(res, i, 2)[0] == 't');
+		if (!PQgetisnull(res, i, 3))
+			oid = atooid(PQgetvalue(res, i, 3));
+		if (!PQgetisnull(res, i, 4))
+			nspname = PQgetvalue(res, i, 4);
+		if (!PQgetisnull(res, i, 5))
+			relname = PQgetvalue(res, i, 5);
+		if (!PQgetisnull(res, i, 6))
+			relpages = atoi(PQgetvalue(res, i, 6));
+
+		if (pattern_id >= 0)
+		{
+			/*
+			 * Current record pertains to an inclusion pattern.  Record that
+			 * it matched.
+			 */
+
+			if (pattern_id >= opts.include.len)
+			{
+				pg_log_error("internal error: received unexpected relation pattern_id %d",
+							 pattern_id);
+				exit(1);
+			}
+
+			opts.include.data[pattern_id].matched = true;
+		}
+		else
+		{
+			/* Current record pertains to a relation */
+
+			RelationInfo *rel = (RelationInfo *) pg_malloc0(sizeof(RelationInfo));
+
+			Assert(OidIsValid(oid));
+			Assert((is_heap && !is_btree) || (is_btree && !is_heap));
+
+			rel->datinfo = dat;
+			rel->reloid = oid;
+			rel->is_heap = is_heap;
+			rel->nspname = pstrdup(nspname);
+			rel->relname = pstrdup(relname);
+			rel->relpages = relpages;
+			rel->blocks_to_check = relpages;
+			if (is_heap && (opts.startblock >= 0 || opts.endblock >= 0))
+			{
+				/*
+				 * We apply --startblock and --endblock to heap tables, but
+				 * not btree indexes, and for progress purposes we need to
+				 * track how many blocks we expect to check.
+				 */
+				if (opts.endblock >= 0 && rel->blocks_to_check > opts.endblock)
+					rel->blocks_to_check = opts.endblock + 1;
+				if (opts.startblock >= 0)
+				{
+					if (rel->blocks_to_check > opts.startblock)
+						rel->blocks_to_check -= opts.startblock;
+					else
+						rel->blocks_to_check = 0;
+				}
+			}
+			*pagecount += rel->blocks_to_check;
+
+			simple_ptr_list_append(relations, rel);
+		}
+	}
+	PQclear(res);
+}
diff --git a/src/bin/pg_amcheck/t/001_basic.pl b/src/bin/pg_amcheck/t/001_basic.pl
new file mode 100644
index 0000000000..dfa0ae9e06
--- /dev/null
+++ b/src/bin/pg_amcheck/t/001_basic.pl
@@ -0,0 +1,9 @@
+use strict;
+use warnings;
+
+use TestLib;
+use Test::More tests => 8;
+
+program_help_ok('pg_amcheck');
+program_version_ok('pg_amcheck');
+program_options_handling_ok('pg_amcheck');
diff --git a/src/bin/pg_amcheck/t/002_nonesuch.pl b/src/bin/pg_amcheck/t/002_nonesuch.pl
new file mode 100644
index 0000000000..b7d41c9b49
--- /dev/null
+++ b/src/bin/pg_amcheck/t/002_nonesuch.pl
@@ -0,0 +1,248 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 76;
+
+# Test set-up
+my ($node, $port);
+$node = get_new_node('test');
+$node->init;
+$node->start;
+$port = $node->port;
+
+# Load the amcheck extension, upon which pg_amcheck depends
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+
+#########################################
+# Test non-existent databases
+
+# Failing to connect to the initial database is an error.
+$node->command_checks_all(
+	[ 'pg_amcheck', 'qqq' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/FATAL:  database "qqq" does not exist/ ],
+	'checking a non-existent database');
+
+# Failing to resolve a database pattern is an error by default.
+$node->command_checks_all(
+	[ 'pg_amcheck', '-d', 'qqq', '-d', 'postgres' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no connectable databases to check matching "qqq"/ ],
+	'checking an unresolvable database pattern');
+
+# But only a warning under --no-strict-names
+$node->command_checks_all(
+	[ 'pg_amcheck', '--no-strict-names', '-d', 'qqq', '-d', 'postgres' ],
+	0,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: no connectable databases to check matching "qqq"/ ],
+	'checking an unresolvable database pattern under --no-strict-names');
+
+# Check that a substring of an existent database name does not get interpreted
+# as a matching pattern.
+$node->command_checks_all(
+	[ 'pg_amcheck', '-d', 'post', '-d', 'postgres' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no connectable databases to check matching "post"/ ],
+	'checking an unresolvable database pattern (substring of existent database)');
+
+# Check that a superstring of an existent database name does not get interpreted
+# as a matching pattern.
+$node->command_checks_all(
+	[ 'pg_amcheck', '-d', 'postgresql', '-d', 'postgres' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no connectable databases to check matching "postgresql"/ ],
+	'checking an unresolvable database pattern (superstring of existent database)');
+
+#########################################
+# Test connecting with a non-existent user
+
+# Failing to connect to the initial database due to bad username is an error.
+$node->command_checks_all(
+	[ 'pg_amcheck', '-U', 'no_such_user', 'postgres' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/role "no_such_user" does not exist/ ],
+	'checking with a non-existent user');
+
+# Failing to connect to the initial database due to a bad username is still an
+# error under --no-strict-names.
+$node->command_checks_all(
+	[ 'pg_amcheck', '--no-strict-names', '-U', 'no_such_user', 'postgres' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/role "no_such_user" does not exist/ ],
+	'checking with a non-existent user under --no-strict-names');
+
+#########################################
+# Test checking databases without amcheck installed
+
+# Attempting to check a database by name where amcheck is not installed should
+# raise a warning.  If all databases are skipped, having no relations to check
+# raises an error.
+$node->command_checks_all(
+	[ 'pg_amcheck', 'template1' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/,
+	  qr/pg_amcheck: error: no relations to check/ ],
+	'checking a database by name without amcheck installed, no other databases');
+
+# Again, but this time with another database to check, so no error is raised.
+$node->command_checks_all(
+	[ 'pg_amcheck', '-d', 'template1', '-d', 'postgres' ],
+	0,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/ ],
+	'checking a database by name without amcheck installed, with other databases');
+
+# Again, but by way of checking all databases
+$node->command_checks_all(
+	[ 'pg_amcheck', '--all' ],
+	0,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/ ],
+	'checking a database by pattern without amcheck installed, with other databases');
+
+#########################################
+# Test unreasonable patterns
+
+# Check three-part unreasonable pattern that has zero-length names
+$node->command_checks_all(
+	[ 'pg_amcheck', '-d', 'postgres', '-t', '..' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no connectable databases to check matching "\.\."/ ],
+	'checking table pattern ".."');
+
+# Again, but with non-trivial schema and relation parts
+$node->command_checks_all(
+	[ 'pg_amcheck', '-d', 'postgres', '-t', '.foo.bar' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no connectable databases to check matching "\.foo\.bar"/ ],
+	'checking table pattern ".foo.bar"');
+
+# Check two-part unreasonable pattern that has zero-length names
+$node->command_checks_all(
+	[ 'pg_amcheck', '-d', 'postgres', '-t', '.' ],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: error: no heap tables to check matching "\."/ ],
+	'checking table pattern "."');
+
+#########################################
+# Test checking non-existent databases, schemas, tables, and indexes
+
+# Use --no-strict-names and a single existent table so we only get warnings
+# about the failed pattern matches
+$node->command_checks_all(
+	[ 'pg_amcheck', '--no-strict-names',
+		'-t', 'no_such_table',
+		'-t', 'no*such*table',
+		'-i', 'no_such_index',
+		'-i', 'no*such*index',
+		'-r', 'no_such_relation',
+		'-r', 'no*such*relation',
+		'-d', 'no_such_database',
+		'-d', 'no*such*database',
+		'-r', 'none.none',
+		'-r', 'none.none.none',
+		'-r', 'this.is.a.really.long.dotted.string',
+		'-r', 'postgres.none.none',
+		'-r', 'postgres.long.dotted.string',
+		'-r', 'postgres.pg_catalog.none',
+		'-r', 'postgres.none.pg_class',
+		'-t', 'postgres.pg_catalog.pg_class',	# This exists
+	],
+	0,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: no heap tables to check matching "no_such_table"/,
+	  qr/pg_amcheck: warning: no heap tables to check matching "no\*such\*table"/,
+	  qr/pg_amcheck: warning: no btree indexes to check matching "no_such_index"/,
+	  qr/pg_amcheck: warning: no btree indexes to check matching "no\*such\*index"/,
+	  qr/pg_amcheck: warning: no relations to check matching "no_such_relation"/,
+	  qr/pg_amcheck: warning: no relations to check matching "no\*such\*relation"/,
+	  qr/pg_amcheck: warning: no heap tables to check matching "no\*such\*table"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "no_such_database"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "no\*such\*database"/,
+	  qr/pg_amcheck: warning: no relations to check matching "none\.none"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "none\.none\.none"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "this\.is\.a\.really\.long\.dotted\.string"/,
+	  qr/pg_amcheck: warning: no relations to check matching "postgres\.none\.none"/,
+	  qr/pg_amcheck: warning: no relations to check matching "postgres\.long\.dotted\.string"/,
+	  qr/pg_amcheck: warning: no relations to check matching "postgres\.pg_catalog\.none"/,
+	  qr/pg_amcheck: warning: no relations to check matching "postgres\.none\.pg_class"/,
+	],
+	'many unmatched patterns and one matched pattern under --no-strict-names');
+
+#########################################
+# Test checking otherwise existent objects but in databases where they do not exist
+
+$node->safe_psql('postgres', q(
+	CREATE TABLE public.foo (f integer);
+	CREATE INDEX foo_idx ON foo(f);
+));
+$node->safe_psql('postgres', q(CREATE DATABASE another_db));
+
+$node->command_checks_all(
+	[ 'pg_amcheck', '-d', 'postgres', '--no-strict-names',
+		'-t', 'template1.public.foo',
+		'-t', 'another_db.public.foo',
+		'-t', 'no_such_database.public.foo',
+		'-i', 'template1.public.foo_idx',
+		'-i', 'another_db.public.foo_idx',
+		'-i', 'no_such_database.public.foo_idx',
+	],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/,
+	  qr/pg_amcheck: warning: no heap tables to check matching "template1\.public\.foo"/,
+	  qr/pg_amcheck: warning: no heap tables to check matching "another_db\.public\.foo"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "no_such_database\.public\.foo"/,
+	  qr/pg_amcheck: warning: no btree indexes to check matching "template1\.public\.foo_idx"/,
+	  qr/pg_amcheck: warning: no btree indexes to check matching "another_db\.public\.foo_idx"/,
+	  qr/pg_amcheck: warning: no connectable databases to check matching "no_such_database\.public\.foo_idx"/,
+	  qr/pg_amcheck: error: no relations to check/,
+	],
+	'checking otherwise existent objects in the wrong databases');
+
+
+#########################################
+# Test schema exclusion patterns
+
+# Check with only schema exclusion patterns
+$node->command_checks_all(
+	[ 'pg_amcheck', '--all', '--no-strict-names',
+		'-S', 'public',
+		'-S', 'pg_catalog',
+		'-S', 'pg_toast',
+		'-S', 'information_schema',
+	],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/,
+	  qr/pg_amcheck: error: no relations to check/ ],
+	'schema exclusion patterns exclude all relations');
+
+# Check with schema exclusion patterns overriding relation and schema inclusion patterns
+$node->command_checks_all(
+	[ 'pg_amcheck', '--all', '--no-strict-names',
+		'-s', 'public',
+		'-s', 'pg_catalog',
+		'-s', 'pg_toast',
+		'-s', 'information_schema',
+		'-t', 'pg_catalog.pg_class',
+		'-S', '*'
+	],
+	1,
+	[ qr/^$/ ],
+	[ qr/pg_amcheck: warning: skipping database "template1": amcheck is not installed/,
+	  qr/pg_amcheck: error: no relations to check/ ],
+	'schema exclusion pattern overrides all inclusion patterns');
diff --git a/src/bin/pg_amcheck/t/003_check.pl b/src/bin/pg_amcheck/t/003_check.pl
new file mode 100644
index 0000000000..e43ffe7ed6
--- /dev/null
+++ b/src/bin/pg_amcheck/t/003_check.pl
@@ -0,0 +1,504 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 60;
+
+my ($node, $port, %corrupt_page, %remove_relation);
+
+# Returns the filesystem path for the named relation.
+#
+# Assumes the test node is running
+sub relation_filepath($$)
+{
+	my ($dbname, $relname) = @_;
+
+	my $pgdata = $node->data_dir;
+	my $rel = $node->safe_psql($dbname,
+							   qq(SELECT pg_relation_filepath('$relname')));
+	die "path not found for relation $relname" unless defined $rel;
+	return "$pgdata/$rel";
+}
+
+# Returns the name of the toast relation associated with the named relation.
+#
+# Assumes the test node is running
+sub relation_toast($$)
+{
+	my ($dbname, $relname) = @_;
+
+	my $rel = $node->safe_psql($dbname, qq(
+		SELECT c.reltoastrelid::regclass
+			FROM pg_catalog.pg_class c
+			WHERE c.oid = '$relname'::regclass
+			  AND c.reltoastrelid != 0
+			));
+	return undef unless defined $rel;
+	return $rel;
+}
+
+# Adds the relation file for the given (dbname, relname) to the list
+# to be corrupted by means of overwriting junk in the first page.
+#
+# Assumes the test node is running.
+sub plan_to_corrupt_first_page($$)
+{
+	my ($dbname, $relname) = @_;
+	my $relpath = relation_filepath($dbname, $relname);
+	$corrupt_page{$relpath} = 1;
+}
+
+# Adds the relation file for the given (dbname, relname) to the list
+# to be corrupted by means of removing the file.
+#
+# Assumes the test node is running
+sub plan_to_remove_relation_file($$)
+{
+	my ($dbname, $relname) = @_;
+	my $relpath = relation_filepath($dbname, $relname);
+	$remove_relation{$relpath} = 1;
+}
+
+# For the given (dbname, relname), if a corresponding toast table
+# exists, adds that toast table's relation file to the list to be
+# corrupted by means of removing the file.
+#
+# Assumes the test node is running.
+sub plan_to_remove_toast_file($$)
+{
+	my ($dbname, $relname) = @_;
+	my $toastname = relation_toast($dbname, $relname);
+	plan_to_remove_relation_file($dbname, $toastname) if ($toastname);
+}
+
+# Corrupts the first page of the given file path
+sub corrupt_first_page($)
+{
+	my ($relpath) = @_;
+
+	my $fh;
+	open($fh, '+<', $relpath)
+		or BAIL_OUT("open failed: $!");
+	binmode $fh;
+
+	# Corrupt some line pointers.  The values are chosen to hit the
+	# various line-pointer-corruption checks in verify_heapam.c
+	# on both little-endian and big-endian architectures.
+	seek($fh, 32, 0)
+		or BAIL_OUT("seek failed: $!");
+	syswrite(
+		$fh,
+		pack("L*",
+			0xAAA15550, 0xAAA0D550, 0x00010000,
+			0x00008000, 0x0000800F, 0x001e8000,
+			0xFFFFFFFF)
+	) or BAIL_OUT("syswrite failed: $!");
+	close($fh)
+		or BAIL_OUT("close failed: $!");
+}
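The corruption values written by corrupt_first_page() are platform-sensitive because pack's "L" template encodes in the machine's native byte order, which is why the values are chosen to trip line-pointer checks on both little- and big-endian architectures. A standalone sketch (not part of the patch) showing the native encoding of the first corruption value:

```perl
# Illustrative snippet, separate from the test: show how pack('L', ...)
# encodes one of the corruption constants in native byte order.
use strict;
use warnings;
use Config;

# 0xAAA15550 encodes as "5055a1aa" on little-endian and "aaa15550" on
# big-endian machines; the chosen constants corrupt line pointers either way.
my $bytes = unpack('H*', pack('L', 0xAAA15550));
print "byteorder=$Config{byteorder} first_value_bytes=$bytes\n";
```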
+
+# Stops the node, performs all the corruptions previously planned, and
+# starts the node again.
+#
+sub perform_all_corruptions()
+{
+	$node->stop();
+	for my $relpath (keys %corrupt_page)
+	{
+		corrupt_first_page($relpath);
+	}
+	for my $relpath (keys %remove_relation)
+	{
+		unlink($relpath);
+	}
+	$node->start;
+}
+
+# Test set-up
+$node = get_new_node('test');
+$node->init;
+$node->start;
+$port = $node->port;
+
+for my $dbname (qw(db1 db2 db3))
+{
+	# Create the database
+	$node->safe_psql('postgres', qq(CREATE DATABASE $dbname));
+
+	# Load the amcheck extension, upon which pg_amcheck depends.  Put the
+	# extension in an unexpected location to test that pg_amcheck finds it
+	# correctly.  Create tables with names that look like pg_catalog names to
+	# check that pg_amcheck does not get confused by them.  Create functions in
+	# schema public that look like amcheck functions to check that pg_amcheck
+	# does not use them.
+	$node->safe_psql($dbname, q(
+		CREATE SCHEMA amcheck_schema;
+		CREATE EXTENSION amcheck WITH SCHEMA amcheck_schema;
+		CREATE TABLE amcheck_schema.pg_database (junk text);
+		CREATE TABLE amcheck_schema.pg_namespace (junk text);
+		CREATE TABLE amcheck_schema.pg_class (junk text);
+		CREATE TABLE amcheck_schema.pg_operator (junk text);
+		CREATE TABLE amcheck_schema.pg_proc (junk text);
+		CREATE TABLE amcheck_schema.pg_tablespace (junk text);
+
+		CREATE FUNCTION public.bt_index_check(index regclass,
+											  heapallindexed boolean default false)
+		RETURNS VOID AS $$
+		BEGIN
+			RAISE EXCEPTION 'Invoked wrong bt_index_check!';
+		END;
+		$$ LANGUAGE plpgsql;
+
+		CREATE FUNCTION public.bt_index_parent_check(index regclass,
+													 heapallindexed boolean default false,
+													 rootdescend boolean default false)
+		RETURNS VOID AS $$
+		BEGIN
+			RAISE EXCEPTION 'Invoked wrong bt_index_parent_check!';
+		END;
+		$$ LANGUAGE plpgsql;
+
+		CREATE FUNCTION public.verify_heapam(relation regclass,
+											 on_error_stop boolean default false,
+											 check_toast boolean default false,
+											 skip text default 'none',
+											 startblock bigint default null,
+											 endblock bigint default null,
+											 blkno OUT bigint,
+											 offnum OUT integer,
+											 attnum OUT integer,
+											 msg OUT text)
+		RETURNS SETOF record AS $$
+		BEGIN
+			RAISE EXCEPTION 'Invoked wrong verify_heapam!';
+		END;
+		$$ LANGUAGE plpgsql;
+	));
+
+	# Create schemas, tables and indexes in five separate
+	# schemas.  The schemas are all identical to start, but
+	# we will corrupt them differently later.
+	#
+	for my $schema (qw(s1 s2 s3 s4 s5))
+	{
+		$node->safe_psql($dbname, qq(
+			CREATE SCHEMA $schema;
+			CREATE SEQUENCE $schema.seq1;
+			CREATE SEQUENCE $schema.seq2;
+			CREATE TABLE $schema.t1 (
+				i INTEGER,
+				b BOX,
+				ia int4[],
+				ir int4range,
+				t TEXT
+			);
+			CREATE TABLE $schema.t2 (
+				i INTEGER,
+				b BOX,
+				ia int4[],
+				ir int4range,
+				t TEXT
+			);
+			CREATE VIEW $schema.t2_view AS (
+				SELECT i*2, t FROM $schema.t2
+			);
+			ALTER TABLE $schema.t2
+				ALTER COLUMN t
+				SET STORAGE EXTERNAL;
+
+			INSERT INTO $schema.t1 (i, b, ia, ir, t)
+				(SELECT gs::INTEGER AS i,
+						box(point(gs,gs+5),point(gs*2,gs*3)) AS b,
+						array[gs, gs + 1]::int4[] AS ia,
+						int4range(gs, gs+100) AS ir,
+						repeat('foo', gs) AS t
+					 FROM generate_series(1,10000,3000) AS gs);
+
+			INSERT INTO $schema.t2 (i, b, ia, ir, t)
+				(SELECT gs::INTEGER AS i,
+						box(point(gs,gs+5),point(gs*2,gs*3)) AS b,
+						array[gs, gs + 1]::int4[] AS ia,
+						int4range(gs, gs+100) AS ir,
+						repeat('foo', gs) AS t
+					 FROM generate_series(1,10000,3000) AS gs);
+
+			CREATE MATERIALIZED VIEW $schema.t1_mv AS SELECT * FROM $schema.t1;
+			CREATE MATERIALIZED VIEW $schema.t2_mv AS SELECT * FROM $schema.t2;
+
			CREATE TABLE $schema.p1 (a int, b int) PARTITION BY LIST (a);
			CREATE TABLE $schema.p2 (a int, b int) PARTITION BY LIST (a);
+
			CREATE TABLE $schema.p1_1 PARTITION OF $schema.p1 FOR VALUES IN (1, 2, 3);
			CREATE TABLE $schema.p1_2 PARTITION OF $schema.p1 FOR VALUES IN (4, 5, 6);
			CREATE TABLE $schema.p2_1 PARTITION OF $schema.p2 FOR VALUES IN (1, 2, 3);
			CREATE TABLE $schema.p2_2 PARTITION OF $schema.p2 FOR VALUES IN (4, 5, 6);
+
+			CREATE INDEX t1_btree ON $schema.t1 USING BTREE (i);
+			CREATE INDEX t2_btree ON $schema.t2 USING BTREE (i);
+
+			CREATE INDEX t1_hash ON $schema.t1 USING HASH (i);
+			CREATE INDEX t2_hash ON $schema.t2 USING HASH (i);
+
+			CREATE INDEX t1_brin ON $schema.t1 USING BRIN (i);
+			CREATE INDEX t2_brin ON $schema.t2 USING BRIN (i);
+
+			CREATE INDEX t1_gist ON $schema.t1 USING GIST (b);
+			CREATE INDEX t2_gist ON $schema.t2 USING GIST (b);
+
+			CREATE INDEX t1_gin ON $schema.t1 USING GIN (ia);
+			CREATE INDEX t2_gin ON $schema.t2 USING GIN (ia);
+
+			CREATE INDEX t1_spgist ON $schema.t1 USING SPGIST (ir);
+			CREATE INDEX t2_spgist ON $schema.t2 USING SPGIST (ir);
+		));
+	}
+}
+
+# Database 'db1' corruptions
+#
+
+# Corrupt indexes in schema "s1"
+plan_to_remove_relation_file('db1', 's1.t1_btree');
+plan_to_corrupt_first_page('db1', 's1.t2_btree');
+
+# Corrupt tables in schema "s2"
+plan_to_remove_relation_file('db1', 's2.t1');
+plan_to_corrupt_first_page('db1', 's2.t2');
+
+# Corrupt tables, partitions, matviews, and btrees in schema "s3"
+plan_to_remove_relation_file('db1', 's3.t1');
+plan_to_corrupt_first_page('db1', 's3.t2');
+
+plan_to_remove_relation_file('db1', 's3.t1_mv');
+plan_to_remove_relation_file('db1', 's3.p1_1');
+
+plan_to_corrupt_first_page('db1', 's3.t2_mv');
+plan_to_corrupt_first_page('db1', 's3.p2_1');
+
+plan_to_remove_relation_file('db1', 's3.t1_btree');
+plan_to_corrupt_first_page('db1', 's3.t2_btree');
+
+# Corrupt the toast table backing s4.t2 in schema "s4"
+plan_to_remove_toast_file('db1', 's4.t2');
+
+# Corrupt all other object types in schema "s5".  We don't have amcheck support
+# for these types, but we check that their corruption does not trigger any
+# errors in pg_amcheck
+plan_to_remove_relation_file('db1', 's5.seq1');
+plan_to_remove_relation_file('db1', 's5.t1_hash');
+plan_to_remove_relation_file('db1', 's5.t1_gist');
+plan_to_remove_relation_file('db1', 's5.t1_gin');
+plan_to_remove_relation_file('db1', 's5.t1_brin');
+plan_to_remove_relation_file('db1', 's5.t1_spgist');
+
+plan_to_corrupt_first_page('db1', 's5.seq2');
+plan_to_corrupt_first_page('db1', 's5.t2_hash');
+plan_to_corrupt_first_page('db1', 's5.t2_gist');
+plan_to_corrupt_first_page('db1', 's5.t2_gin');
+plan_to_corrupt_first_page('db1', 's5.t2_brin');
+plan_to_corrupt_first_page('db1', 's5.t2_spgist');
+
+
+# Database 'db2' corruptions
+#
+plan_to_remove_relation_file('db2', 's1.t1');
+plan_to_remove_relation_file('db2', 's1.t1_btree');
+
+
+# Leave 'db3' uncorrupted
+#
+
+# Perform the corruptions we planned above using only a single database restart.
+#
+perform_all_corruptions();
+
+
+# Standard first arguments to TestLib functions
+my @cmd = ('pg_amcheck', '--quiet', '-p', $port);
+
+# Regular expressions to match various expected output
+my $no_output_re = qr/^$/;
+my $line_pointer_corruption_re = qr/line pointer/;
+my $missing_file_re = qr/could not open file ".*": No such file or directory/;
+my $index_missing_relation_fork_re = qr/index ".*" lacks a main relation fork/;
+
+# Checking databases with amcheck installed and corrupt relations, pg_amcheck
+# command itself should return exit status = 2, because tables and indexes are
+# corrupt, not exit status = 1, which would mean the pg_amcheck command itself
+# failed.  Corruption messages should go to stdout, and nothing to stderr.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1' ],
+	2,
+	[ $index_missing_relation_fork_re,
+	  $line_pointer_corruption_re,
+	  $missing_file_re,
+	],
+	[ $no_output_re ],
+	'pg_amcheck all schemas, tables and indexes in database db1');
+
+$node->command_checks_all(
+	[ @cmd, '-d', 'db1', '-d', 'db2', '-d', 'db3' ],
+	2,
+	[ $index_missing_relation_fork_re,
+	  $line_pointer_corruption_re,
+	  $missing_file_re,
+	],
+	[ $no_output_re ],
+	'pg_amcheck all schemas, tables and indexes in databases db1, db2, and db3');
+
+# Scans of indexes in s1 should detect the specific corruption that we created
+# above.  For missing relation forks, we know what the error message looks
+# like.  For corrupted index pages, the error might vary depending on how the
+# page was formatted on disk, including variations due to alignment differences
+# between platforms, so we accept any non-empty error message.
+#
+# If we don't limit the check to databases with amcheck installed, we expect
+# complaint on stderr, but otherwise stderr should be quiet.
+#
+$node->command_checks_all(
+	[ @cmd, '--all', '-s', 's1', '-i', 't1_btree' ],
+	2,
+	[ $index_missing_relation_fork_re ],
+	[ qr/pg_amcheck: warning: skipping database "postgres": amcheck is not installed/ ],
+	'pg_amcheck index s1.t1_btree reports missing main relation fork');
+
+$node->command_checks_all(
+	[ @cmd, '-d', 'db1', '-s', 's1', '-i', 't2_btree' ],
+	2,
+	[ qr/.+/ ],			# Any non-empty error message is acceptable
+	[ $no_output_re ],
+	'pg_amcheck index s1.t2_btree reports index corruption');
+
+# Checking db1.s1 with indexes excluded should show no corruptions because we
+# did not corrupt any tables in db1.s1.  Verify that both stdout and stderr
+# are quiet.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-t', 's1.*', '--no-dependent-indexes' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck of db1.s1 excluding indexes');
+
+# Checking db2.s1 should show table corruptions if indexes are excluded
+#
+$node->command_checks_all(
+	[ @cmd, 'db2', '-t', 's1.*', '--no-dependent-indexes' ],
+	2,
+	[ $missing_file_re ],
+	[ $no_output_re ],
+	'pg_amcheck of db2.s1 excluding indexes');
+
+# In schema db1.s3, the tables and indexes are both corrupt.  We should see
+# corruption messages on stdout, and nothing on stderr.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's3' ],
+	2,
+	[ $index_missing_relation_fork_re,
+	  $line_pointer_corruption_re,
+	  $missing_file_re,
+	],
+	[ $no_output_re ],
+	'pg_amcheck schema s3 reports table and index errors');
+
+# In schema db1.s4, only toast tables are corrupt.  Check that under default
+# options the toast corruption is reported, but when excluding toast we get no
+# error reports.
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's4' ],
+	2,
+	[ $missing_file_re ],
+	[ $no_output_re ],
+	'pg_amcheck in schema s4 reports toast corruption');
+
+$node->command_checks_all(
+	[ @cmd, '--no-dependent-toast', '--exclude-toast-pointers', 'db1', '-s', 's4' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck in schema s4 excluding toast reports no corruption');
+
+# Check that no corruption is reported in schema db1.s5
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's5' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck over schema s5 reports no corruption');
+
+# In schema db1.s1, only indexes are corrupt.  Verify that when we exclude
+# the indexes, no corruption is reported about the schema.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's1', '-I', 't1_btree', '-I', 't2_btree' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck over schema s1 with corrupt indexes excluded reports no corruption');
+
+# In schema db1.s1, only indexes are corrupt.  Verify that when we provide only
+# table inclusions, and disable index expansion, no corruption is reported
+# about the schema.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-t', 's1.*', '--no-dependent-indexes' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck over schema s1 with all indexes excluded reports no corruption');
+
+# In schema db1.s2, only tables are corrupt.  Verify that when we exclude those
+# tables that no corruption is reported.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's2', '-T', 't1', '-T', 't2' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck over schema s2 with corrupt tables excluded reports no corruption');
+
+# Check errors about bad block range command line arguments.  We use schema s5
+# to avoid getting messages about corrupt tables or indexes.
+#
+command_fails_like(
+	[ @cmd, 'db1', '-s', 's5', '--startblock', 'junk' ],
+	qr/invalid start block/,
+	'pg_amcheck rejects garbage startblock');
+
+command_fails_like(
+	[ @cmd, 'db1', '-s', 's5', '--endblock', '1234junk' ],
+	qr/invalid end block/,
+	'pg_amcheck rejects garbage endblock');
+
+command_fails_like(
+	[ @cmd, 'db1', '-s', 's5', '--startblock', '5', '--endblock', '4' ],
+	qr/end block precedes start block/,
+	'pg_amcheck rejects invalid block range');
+
+# Check bt_index_parent_check alternates.  We don't create any index corruption
+# that would behave differently under these modes, so just smoke test that the
+# arguments are handled sensibly.
+#
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's1', '-i', 't1_btree', '--parent-check' ],
+	2,
+	[ $index_missing_relation_fork_re ],
+	[ $no_output_re ],
+	'pg_amcheck smoke test --parent-check');
+
+$node->command_checks_all(
+	[ @cmd, 'db1', '-s', 's1', '-i', 't1_btree', '--heapallindexed', '--rootdescend' ],
+	2,
+	[ $index_missing_relation_fork_re ],
+	[ $no_output_re ],
+	'pg_amcheck smoke test --heapallindexed --rootdescend');
+
+$node->command_checks_all(
+	[ @cmd, '-d', 'db1', '-d', 'db2', '-d', 'db3', '-S', 's*' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck excluding all corrupt schemas');
diff --git a/src/bin/pg_amcheck/t/004_verify_heapam.pl b/src/bin/pg_amcheck/t/004_verify_heapam.pl
new file mode 100644
index 0000000000..48dfbef145
--- /dev/null
+++ b/src/bin/pg_amcheck/t/004_verify_heapam.pl
@@ -0,0 +1,516 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+
+use Test::More;
+
+# This regression test demonstrates that the pg_amcheck binary correctly
+# identifies specific kinds of corruption within pages.  To test this, we need
+# a mechanism to create corrupt pages with predictable, repeatable corruption.
+# The postgres backend cannot be expected to help us with this, as its design
+# is not consistent with the goal of intentionally corrupting pages.
+#
+# Instead, we create a table to corrupt, and with careful consideration of how
+# postgresql lays out heap pages, we seek to offsets within the page and
+# overwrite deliberately chosen bytes with specific values calculated to
+# corrupt the page in expected ways.  We then verify that pg_amcheck reports
+# the corruption, and that it runs without crashing.  Note that the backend
+# cannot simply be started to run queries against the corrupt table, as the
+# backend will crash, at least for some of the corruption types we generate.
+#
+# Autovacuum potentially touching the table in the background makes the exact
+# behavior of this test harder to reason about.  We turn it off to keep things
+# simpler.  We use a "belt and suspenders" approach, turning it off for the
+# system generally in postgresql.conf, and turning it off specifically for the
+# test table.
+#
+# This test depends on the table being written to the heap file exactly as we
+# expect it to be, so we take care to arrange the table's columns, and to
+# insert rows, in ways that give predictable sizes and locations within the
+# table page.
+#
+# The HeapTupleHeaderData has 23 bytes of fixed size fields before the variable
+# length t_bits[] array.  We have exactly 3 columns in the table, so natts = 3,
+# t_bits is 1 byte long, and t_hoff = MAXALIGN(23 + 1) = 24.
+#
+# We're not too fussy about which datatypes we use for the test, but we do care
+# about some specific properties.  We'd like to test both fixed size and
+# varlena types.  We'd like some varlena data inline and some toasted.  And
+# we'd like the layout of the table such that the datums land at predictable
+# offsets within the tuple.  We choose a structure without padding on all
+# supported architectures:
+#
+# 	a BIGINT
+#	b TEXT
+#	c TEXT
+#
+# We always insert a 7-ascii character string into field 'b', which with a
+# 1-byte varlena header gives an 8 byte inline value.  We always insert a long
+# text string in field 'c', long enough to force toast storage.
+#
+# We choose to read and write binary copies of our table's tuples, using perl's
+# pack() and unpack() functions.  Perl uses a packing code system in which:
+#
+#	L = "Unsigned 32-bit Long",
+#	S = "Unsigned 16-bit Short",
+#	C = "Unsigned 8-bit Octet",
+#	c = "signed 8-bit octet",
+#	q = "signed 64-bit quadword"
+#
+# Each tuple in our table has a layout as follows:
+#
+#    xx xx xx xx            t_xmin: xxxx		offset = 0		L
+#    xx xx xx xx            t_xmax: xxxx		offset = 4		L
+#    xx xx xx xx          t_field3: xxxx		offset = 8		L
+#    xx xx                   bi_hi: xx			offset = 12		S
+#    xx xx                   bi_lo: xx			offset = 14		S
+#    xx xx                ip_posid: xx			offset = 16		S
+#    xx xx             t_infomask2: xx			offset = 18		S
+#    xx xx              t_infomask: xx			offset = 20		S
+#    xx                     t_hoff: x			offset = 22		C
+#    xx                     t_bits: x			offset = 23		C
+#    xx xx xx xx xx xx xx xx   'a': xxxxxxxx	offset = 24		q
+#    xx xx xx xx xx xx xx xx   'b': xxxxxxxx	offset = 32		Cccccccc
+#    xx xx xx xx xx xx xx xx   'c': xxxxxxxx	offset = 40		SSSS
+#    xx xx xx xx xx xx xx xx      : xxxxxxxx	 ...continued	SSSS
+#    xx xx                        : xx      	 ...continued	S
+#
+# We could choose to read and write columns 'b' and 'c' in other ways, but
+# it is convenient enough to do it this way.  We define packing code
+# constants here, where they can be compared easily against the layout.
+
+use constant HEAPTUPLE_PACK_CODE => 'LLLSSSSSCCqCcccccccSSSSSSSSS';
+use constant HEAPTUPLE_PACK_LENGTH => 58;     # Total size
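As a sanity check on the layout table above, the pack template can be verified to encode exactly HEAPTUPLE_PACK_LENGTH bytes: 3*L + 5*S + 2*C + q + C + 7*c + 9*S = 12 + 10 + 2 + 8 + 1 + 7 + 18 = 58. A standalone sketch, not part of the patch:

```perl
# Illustrative snippet, separate from the test: confirm that the pack
# template used by read_tuple()/write_tuple() produces 58 bytes, keeping
# the code constants in sync with the layout table in the comments.
use strict;
use warnings;

my $code = 'LLLSSSSSCCqCcccccccSSSSSSSSS';
# Count the template fields (every character is one field here), then pack
# that many zero values and measure the resulting byte length.
my $nfields = () = $code =~ /[LSCcq]/g;
my $len = length(pack($code, (0) x $nfields));
print "fields=$nfields packed_length=$len\n";
```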
+
+# Read a tuple of our table from a heap page.
+#
+# Takes an open filehandle to the heap file, and the offset of the tuple.
+#
+# Rather than returning the binary data from the file, unpacks the data into a
+# perl hash with named fields.  These fields exactly match the ones understood
+# by write_tuple(), below.  Returns a reference to this hash.
+#
+sub read_tuple ($$)
+{
+	my ($fh, $offset) = @_;
+	my ($buffer, %tup);
+	seek($fh, $offset, 0)
+		or BAIL_OUT("seek failed: $!");
+	defined(sysread($fh, $buffer, HEAPTUPLE_PACK_LENGTH))
+		or BAIL_OUT("sysread failed: $!");
+
+	@_ = unpack(HEAPTUPLE_PACK_CODE, $buffer);
+	%tup = (t_xmin => shift,
+			t_xmax => shift,
+			t_field3 => shift,
+			bi_hi => shift,
+			bi_lo => shift,
+			ip_posid => shift,
+			t_infomask2 => shift,
+			t_infomask => shift,
+			t_hoff => shift,
+			t_bits => shift,
+			a => shift,
+			b_header => shift,
+			b_body1 => shift,
+			b_body2 => shift,
+			b_body3 => shift,
+			b_body4 => shift,
+			b_body5 => shift,
+			b_body6 => shift,
+			b_body7 => shift,
+			c1 => shift,
+			c2 => shift,
+			c3 => shift,
+			c4 => shift,
+			c5 => shift,
+			c6 => shift,
+			c7 => shift,
+			c8 => shift,
+			c9 => shift);
+	# Stitch together the text for column 'b'
+	$tup{b} = join('', map { chr($tup{"b_body$_"}) } (1..7));
+	return \%tup;
+}
+
+# Write a tuple of our table to a heap page.
+#
+# Takes an open filehandle to the heap file, the offset of the tuple, and a
+# reference to a hash with the tuple values, as returned by read_tuple().
+# Writes the tuple fields from the hash into the heap file.
+#
+# The purpose of this function is to write a tuple back to disk with some
+# subset of fields modified.  The function does no error checking.  Use
+# cautiously.
+#
+sub write_tuple($$$)
+{
+	my ($fh, $offset, $tup) = @_;
+	my $buffer = pack(HEAPTUPLE_PACK_CODE,
+					$tup->{t_xmin},
+					$tup->{t_xmax},
+					$tup->{t_field3},
+					$tup->{bi_hi},
+					$tup->{bi_lo},
+					$tup->{ip_posid},
+					$tup->{t_infomask2},
+					$tup->{t_infomask},
+					$tup->{t_hoff},
+					$tup->{t_bits},
+					$tup->{a},
+					$tup->{b_header},
+					$tup->{b_body1},
+					$tup->{b_body2},
+					$tup->{b_body3},
+					$tup->{b_body4},
+					$tup->{b_body5},
+					$tup->{b_body6},
+					$tup->{b_body7},
+					$tup->{c1},
+					$tup->{c2},
+					$tup->{c3},
+					$tup->{c4},
+					$tup->{c5},
+					$tup->{c6},
+					$tup->{c7},
+					$tup->{c8},
+					$tup->{c9});
+	seek($fh, $offset, 0)
+		or BAIL_OUT("seek failed: $!");
+	defined(syswrite($fh, $buffer, HEAPTUPLE_PACK_LENGTH))
+		or BAIL_OUT("syswrite failed: $!");
+	return;
+}
+
+# Set umask so test directories and files are created with default permissions
+umask(0077);
+
+# Set up the node.  Once we create and corrupt the table,
+# autovacuum workers visiting the table could crash the backend.
+# Disable autovacuum so that won't happen.
+my $node = get_new_node('test');
+$node->init;
+$node->append_conf('postgresql.conf', 'autovacuum=off');
+
+# Start the node and load the extensions.  We depend on both
+# amcheck and pageinspect for this test.
+$node->start;
+my $port = $node->port;
+my $pgdata = $node->data_dir;
+$node->safe_psql('postgres', "CREATE EXTENSION amcheck");
+$node->safe_psql('postgres', "CREATE EXTENSION pageinspect");
+
+# Get a non-zero datfrozenxid
+$node->safe_psql('postgres', qq(VACUUM FREEZE));
+
+# Create the test table with precisely the schema that our corruption function
+# expects.
+$node->safe_psql(
+	'postgres', qq(
+		CREATE TABLE public.test (a BIGINT, b TEXT, c TEXT);
+		ALTER TABLE public.test SET (autovacuum_enabled=false);
+		ALTER TABLE public.test ALTER COLUMN c SET STORAGE EXTERNAL;
+		CREATE INDEX test_idx ON public.test(a, b);
+	));
+
+# We want (0 < datfrozenxid < test.relfrozenxid).  To achieve this, we freeze
+# an otherwise unused table, public.junk, prior to inserting data and freezing
+# public.test
+$node->safe_psql(
+	'postgres', qq(
+		CREATE TABLE public.junk AS SELECT 'junk'::TEXT AS junk_column;
+		ALTER TABLE public.junk SET (autovacuum_enabled=false);
+		VACUUM FREEZE public.junk
+	));
+
+my $rel = $node->safe_psql('postgres', qq(SELECT pg_relation_filepath('public.test')));
+my $relpath = "$pgdata/$rel";
+
+# Insert data and freeze public.test
+use constant ROWCOUNT => 16;
+$node->safe_psql('postgres', qq(
+	INSERT INTO public.test (a, b, c)
+		VALUES (
+			12345678,
+			'abcdefg',
+			repeat('w', 10000)
+		);
+	VACUUM FREEZE public.test
+	)) for (1..ROWCOUNT);
+
+my $relfrozenxid = $node->safe_psql('postgres',
+	q(select relfrozenxid from pg_class where relname = 'test'));
+my $datfrozenxid = $node->safe_psql('postgres',
+	q(select datfrozenxid from pg_database where datname = 'postgres'));
+
+# Sanity check that our 'test' table has a relfrozenxid newer than the
+# datfrozenxid for the database, and that the datfrozenxid is greater than the
+# first normal xid.  We rely on these invariants in some of our tests.
+if ($datfrozenxid <= 3 || $datfrozenxid >= $relfrozenxid)
+{
+	$node->clean_node;
+	plan skip_all => "Xid thresholds not as expected: got datfrozenxid = $datfrozenxid, relfrozenxid = $relfrozenxid";
+	exit;
+}
+
+# Find where each of the tuples is located on the page.
+my @lp_off;
+for my $tup (0..ROWCOUNT-1)
+{
+	push (@lp_off, $node->safe_psql('postgres', qq(
+select lp_off from heap_page_items(get_raw_page('test', 'main', 0))
+	offset $tup limit 1)));
+}
+
+# Sanity check that our 'test' table on disk layout matches expectations.  If
+# this is not so, we will have to skip the test until somebody updates the test
+# to work on this platform.
+$node->stop;
+my $file;
+open($file, '+<', $relpath)
+	or BAIL_OUT("open failed: $!");
+binmode $file;
+
+for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
+{
+	my $offnum = $tupidx + 1;  # offnum is 1-based, not zero-based
+	my $offset = $lp_off[$tupidx];
+	my $tup = read_tuple($file, $offset);
+
+	# Sanity-check that the data appears on the page where we expect.
+	my $a = $tup->{a};
+	my $b = $tup->{b};
+	if ($a ne '12345678' || $b ne 'abcdefg')
+	{
+		close($file);  # ignore errors on close; we're exiting anyway
+		$node->clean_node;
+		plan skip_all => qq(Page layout differs from our expectations: expected (12345678, "abcdefg"), got ($a, "$b"));
+		exit;
+	}
+}
+close($file)
+	or BAIL_OUT("close failed: $!");
+$node->start;
+
+# OK, xids and page layout are as expected.  We can run corruption tests.
+plan tests => 20;
+
+# Check that pg_amcheck runs against the uncorrupted table without error.
+$node->command_ok(['pg_amcheck', '--no-dependent-indexes', '-p', $port, 'postgres'],
+				  'pg_amcheck test table, prior to corruption');
+
+# Check that pg_amcheck runs against the uncorrupted table and index without error.
+$node->command_ok(['pg_amcheck', '-p', $port, 'postgres'],
+				  'pg_amcheck test table and index, prior to corruption');
+
+$node->stop;
+
+# Some #define constants from access/htup_details.h for use while corrupting.
+use constant HEAP_HASNULL            => 0x0001;
+use constant HEAP_XMAX_LOCK_ONLY     => 0x0080;
+use constant HEAP_XMIN_COMMITTED     => 0x0100;
+use constant HEAP_XMIN_INVALID       => 0x0200;
+use constant HEAP_XMAX_COMMITTED     => 0x0400;
+use constant HEAP_XMAX_INVALID       => 0x0800;
+use constant HEAP_NATTS_MASK         => 0x07FF;
+use constant HEAP_XMAX_IS_MULTI      => 0x1000;
+use constant HEAP_KEYS_UPDATED       => 0x2000;
+
+# Helper function to generate a regular expression matching the header that we
+# expect verify_heapam() to emit, depending on which of blkno, offnum, and
+# attnum are defined.
+{
+	my ($blkno, $offnum, $attnum) = @_;
+	return qr/heap table "postgres"\."public"\."test", block $blkno, offset $offnum, attribute $attnum:\s+/ms
+		if (defined $attnum);
+	return qr/heap table "postgres"\."public"\."test", block $blkno, offset $offnum:\s+/ms
+		if (defined $offnum);
+	return qr/heap table "postgres"\."public"\."test", block $blkno:\s+/ms
+		if (defined $blkno);
+	return qr/heap table "postgres"\."public"\."test":\s+/ms;
+}
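To make the behavior of header() concrete, here is a standalone sketch (not part of the patch; the helper is re-declared verbatim so the snippet runs on its own). It shows that a corruption message matches the most specific prefix but not a block-only prefix:

```perl
# Illustrative snippet, separate from the test: exercise the header()
# prefix-building helper against a sample corruption message.
use strict;
use warnings;

sub header
{
	my ($blkno, $offnum, $attnum) = @_;
	return qr/heap table "postgres"\."public"\."test", block $blkno, offset $offnum, attribute $attnum:\s+/ms
		if (defined $attnum);
	return qr/heap table "postgres"\."public"\."test", block $blkno, offset $offnum:\s+/ms
		if (defined $offnum);
	return qr/heap table "postgres"\."public"\."test", block $blkno:\s+/ms
		if (defined $blkno);
	return qr/heap table "postgres"\."public"\."test":\s+/ms;
}

# A message pinned to block 0, offset 12, attribute 1 matches the fully
# specified prefix, but not the block-only prefix (which requires the
# header to end right after "block 0:").
my $msg = 'heap table "postgres"."public"."test", block 0, offset 12, attribute 1: bad varlena';
my $full  = header(0, 12, 1);
my $loose = header(0, undef, undef);
print "full=", ($msg =~ $full ? 1 : 0), " loose=", ($msg =~ $loose ? 1 : 0), "\n";
```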
+
+# Corrupt the tuples, one type of corruption per tuple.  Some types of
+# corruption cause verify_heapam to skip to the next tuple without
+# performing any remaining checks, so we can't exercise the system properly if
+# we focus all our corruption on a single tuple.
+#
+my @expected;
+open($file, '+<', $relpath)
+	or BAIL_OUT("open failed: $!");
+binmode $file;
+
+for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
+{
+	my $offnum = $tupidx + 1;  # offnum is 1-based, not zero-based
+	my $offset = $lp_off[$tupidx];
+	my $tup = read_tuple($file, $offset);
+
+	my $header = header(0, $offnum, undef);
+	if ($offnum == 1)
+	{
+		# Corruptly set xmin < relfrozenxid
+		my $xmin = $relfrozenxid - 1;
+		$tup->{t_xmin} = $xmin;
+		$tup->{t_infomask} &= ~HEAP_XMIN_COMMITTED;
+		$tup->{t_infomask} &= ~HEAP_XMIN_INVALID;
+
+		# Expected corruption report
+		push @expected,
+			qr/${header}xmin $xmin precedes relation freeze threshold 0:\d+/;
+	}
+	elsif ($offnum == 2)
+	{
+		# Corruptly set xmin < datfrozenxid
+		my $xmin = 3;
+		$tup->{t_xmin} = $xmin;
+		$tup->{t_infomask} &= ~HEAP_XMIN_COMMITTED;
+		$tup->{t_infomask} &= ~HEAP_XMIN_INVALID;
+
+		push @expected,
+			qr/${header}xmin $xmin precedes oldest valid transaction ID 0:\d+/;
+	}
+	elsif ($offnum == 3)
+	{
+		# Corruptly set xmin < datfrozenxid, further back, noting circularity
+		# of xid comparison.  For a new cluster with epoch = 0, the corrupt
+		# xmin will be interpreted as in the future
+		$tup->{t_xmin} = 4026531839;
+		$tup->{t_infomask} &= ~HEAP_XMIN_COMMITTED;
+		$tup->{t_infomask} &= ~HEAP_XMIN_INVALID;
+
+		push @expected,
+			qr/${header}xmin 4026531839 equals or exceeds next valid transaction ID 0:\d+/;
+	}
+	elsif ($offnum == 4)
+	{
+		# Corruptly set xmax to a transaction ID in the future
+		$tup->{t_xmax} = 4026531839;
+		$tup->{t_infomask} &= ~HEAP_XMAX_INVALID;
+
+		push @expected,
+			qr/${header}xmax 4026531839 equals or exceeds next valid transaction ID 0:\d+/;
+	}
+	elsif ($offnum == 5)
+	{
+		# Corrupt the tuple t_hoff, but keep it aligned properly
+		$tup->{t_hoff} += 128;
+
+		push @expected,
+			qr/${header}data begins at offset 152 beyond the tuple length 58/,
+			qr/${header}tuple data should begin at byte 24, but actually begins at byte 152 \(3 attributes, no nulls\)/;
+	}
+	elsif ($offnum == 6)
+	{
+		# Corrupt the tuple t_hoff, wrong alignment
+		$tup->{t_hoff} += 3;
+
+		push @expected,
+			qr/${header}tuple data should begin at byte 24, but actually begins at byte 27 \(3 attributes, no nulls\)/;
+	}
+	elsif ($offnum == 7)
+	{
+		# Corrupt the tuple t_hoff, underflow but correct alignment
+		$tup->{t_hoff} -= 8;
+
+		push @expected,
+			qr/${header}tuple data should begin at byte 24, but actually begins at byte 16 \(3 attributes, no nulls\)/;
+	}
+	elsif ($offnum == 8)
+	{
+		# Corrupt the tuple t_hoff, underflow and wrong alignment
+		$tup->{t_hoff} -= 3;
+
+		push @expected,
+			qr/${header}tuple data should begin at byte 24, but actually begins at byte 21 \(3 attributes, no nulls\)/;
+	}
+	elsif ($offnum == 9)
+	{
+		# Corrupt the tuple to look like it has lots of attributes, not just 3
+		$tup->{t_infomask2} |= HEAP_NATTS_MASK;
+
+		push @expected,
+			qr/${header}number of attributes 2047 exceeds maximum expected for table 3/;
+	}
+	elsif ($offnum == 10)
+	{
+		# Corrupt the tuple to look like it has lots of attributes, some of
+		# them null.  This falsely creates the impression that the t_bits
+		# array is longer than just one byte, but t_hoff still says otherwise.
+		$tup->{t_infomask} |= HEAP_HASNULL;
+		$tup->{t_infomask2} |= HEAP_NATTS_MASK;
+		$tup->{t_bits} = 0xAA;
+
+		push @expected,
+			qr/${header}tuple data should begin at byte 280, but actually begins at byte 24 \(2047 attributes, has nulls\)/;
+	}
+	elsif ($offnum == 11)
+	{
+		# Same as above, but this time t_hoff plays along
+		$tup->{t_infomask} |= HEAP_HASNULL;
+		$tup->{t_infomask2} |= (HEAP_NATTS_MASK & 0x40);
+		$tup->{t_bits} = 0xAA;
+		$tup->{t_hoff} = 32;
+
+		push @expected,
+			qr/${header}number of attributes 67 exceeds maximum expected for table 3/;
+	}
+	elsif ($offnum == 12)
+	{
+		# Corrupt the bits in column 'b' 1-byte varlena header
+		$tup->{b_header} = 0x80;
+
+		$header = header(0, $offnum, 1);
+		push @expected,
+			qr/${header}attribute 1 with length 4294967295 ends at offset 416848000 beyond total tuple length 58/;
+	}
+	elsif ($offnum == 13)
+	{
+		# Corrupt the bits in column 'c' toast pointer
+		$tup->{c6} = 41;
+		$tup->{c7} = 41;
+
+		$header = header(0, $offnum, 2);
+		push @expected,
+			qr/${header}final toast chunk number 0 differs from expected value 6/,
+			qr/${header}toasted value for attribute 2 missing from toast table/;
+	}
+	elsif ($offnum == 14)
+	{
+		# Set both HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI
+		$tup->{t_infomask} |= HEAP_XMAX_COMMITTED;
+		$tup->{t_infomask} |= HEAP_XMAX_IS_MULTI;
+		$tup->{t_xmax} = 4;
+
+		push @expected,
+			qr/${header}multitransaction ID 4 equals or exceeds next valid multitransaction ID 1/;
+	}
+	elsif ($offnum == 15)	# offnum 16 (ROWCOUNT) is left uncorrupted
+	{
+		# Set both HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI
+		$tup->{t_infomask} |= HEAP_XMAX_COMMITTED;
+		$tup->{t_infomask} |= HEAP_XMAX_IS_MULTI;
+		$tup->{t_xmax} = 4000000000;
+
+		push @expected,
+			qr/${header}multitransaction ID 4000000000 precedes relation minimum multitransaction ID threshold 1/;
+	}
+	write_tuple($file, $offset, $tup);
+}
+close($file)
+	or BAIL_OUT("close failed: $!");
+$node->start;
+
+# Run pg_amcheck against the corrupt table with epoch=0, comparing actual
+# corruption messages against the expected messages
+$node->command_checks_all(
+	['pg_amcheck', '--no-dependent-indexes', '-p', $port, 'postgres'],
+	2,
+	[ @expected ],
+	[ ],
+	'Expected corruption message output');
+
+$node->teardown_node;
+$node->clean_node;
diff --git a/src/bin/pg_amcheck/t/005_opclass_damage.pl b/src/bin/pg_amcheck/t/005_opclass_damage.pl
new file mode 100644
index 0000000000..eba8ea9cae
--- /dev/null
+++ b/src/bin/pg_amcheck/t/005_opclass_damage.pl
@@ -0,0 +1,54 @@
+# This regression test checks the behavior of the btree validation in the
+# presence of breaking sort order changes.
+#
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 5;
+
+my $node = get_new_node('test');
+$node->init;
+$node->start;
+
+# Create a custom operator class and an index which uses it.
+$node->safe_psql('postgres', q(
+	CREATE EXTENSION amcheck;
+
+	CREATE FUNCTION int4_asc_cmp (a int4, b int4) RETURNS int LANGUAGE sql AS $$
+		SELECT CASE WHEN $1 = $2 THEN 0 WHEN $1 > $2 THEN 1 ELSE -1 END; $$;
+
+	CREATE OPERATOR CLASS int4_fickle_ops FOR TYPE int4 USING btree AS
+	    OPERATOR 1 < (int4, int4), OPERATOR 2 <= (int4, int4),
+	    OPERATOR 3 = (int4, int4), OPERATOR 4 >= (int4, int4),
+	    OPERATOR 5 > (int4, int4), FUNCTION 1 int4_asc_cmp(int4, int4);
+
+	CREATE TABLE int4tbl (i int4);
+	INSERT INTO int4tbl (SELECT * FROM generate_series(1,1000) gs);
+	CREATE INDEX fickleidx ON int4tbl USING btree (i int4_fickle_ops);
+));
+
+# We have not yet broken the index, so we should get no corruption
+$node->command_like(
+	[ 'pg_amcheck', '--quiet', '-p', $node->port, 'postgres' ],
+	qr/^$/,
+	'pg_amcheck all schemas, tables and indexes reports no corruption');
+
+# Change the operator class to use a function which sorts in a different
+# order to corrupt the btree index
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION int4_desc_cmp (int4, int4) RETURNS int LANGUAGE sql AS $$
+		SELECT CASE WHEN $1 = $2 THEN 0 WHEN $1 > $2 THEN -1 ELSE 1 END; $$;
+	UPDATE pg_catalog.pg_amproc
+		SET amproc = 'int4_desc_cmp'::regproc
+		WHERE amproc = 'int4_asc_cmp'::regproc
+));
+
+# Index corruption should now be reported
+$node->command_checks_all(
+	[ 'pg_amcheck', '-p', $node->port, 'postgres' ],
+	2,
+	[ qr/item order invariant violated for index "fickleidx"/ ],
+	[ ],
+	'pg_amcheck all schemas, tables and indexes reports fickleidx corruption'
+);
diff --git a/src/tools/msvc/Install.pm b/src/tools/msvc/Install.pm
index ea3af48777..ffcd0e5095 100644
--- a/src/tools/msvc/Install.pm
+++ b/src/tools/msvc/Install.pm
@@ -20,12 +20,12 @@ our (@ISA, @EXPORT_OK);
 my $insttype;
 my @client_contribs = ('oid2name', 'pgbench', 'vacuumlo');
 my @client_program_files = (
-	'clusterdb',      'createdb',   'createuser',    'dropdb',
-	'dropuser',       'ecpg',       'libecpg',       'libecpg_compat',
-	'libpgtypes',     'libpq',      'pg_basebackup', 'pg_config',
-	'pg_dump',        'pg_dumpall', 'pg_isready',    'pg_receivewal',
-	'pg_recvlogical', 'pg_restore', 'psql',          'reindexdb',
-	'vacuumdb',       @client_contribs);
+	'clusterdb',     'createdb',       'createuser', 'dropdb',
+	'dropuser',      'ecpg',           'libecpg',    'libecpg_compat',
+	'libpgtypes',    'libpq',          'pg_amcheck', 'pg_basebackup',
+	'pg_config',     'pg_dump',        'pg_dumpall', 'pg_isready',
+	'pg_receivewal', 'pg_recvlogical', 'pg_restore', 'psql',
+	'reindexdb',     'vacuumdb',       @client_contribs);
 
 sub lcopy
 {
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 49614106dc..ead6765f4e 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -54,17 +54,21 @@ my @contrib_excludes = (
 
 # Set of variables for frontend modules
 my $frontend_defines = { 'initdb' => 'FRONTEND' };
-my @frontend_uselibpq = ('pg_ctl', 'pg_upgrade', 'pgbench', 'psql', 'initdb');
+my @frontend_uselibpq = (
+	'pg_amcheck', 'pg_ctl', 'pg_upgrade', 'pgbench', 'psql', 'initdb');
 my @frontend_uselibpgport = (
-	'pg_archivecleanup', 'pg_test_fsync',
-	'pg_test_timing',    'pg_upgrade',
-	'pg_waldump',        'pgbench');
+	'pg_amcheck',    'pg_archivecleanup',
+	'pg_test_fsync', 'pg_test_timing',
+	'pg_upgrade',    'pg_waldump',
+	'pgbench');
 my @frontend_uselibpgcommon = (
-	'pg_archivecleanup', 'pg_test_fsync',
-	'pg_test_timing',    'pg_upgrade',
-	'pg_waldump',        'pgbench');
+	'pg_amcheck',        'pg_archivecleanup',
+	'pg_test_fsync',     'pg_test_timing',
+	'pg_upgrade',        'pg_waldump',
+	'pgbench');
 my $frontend_extralibs = {
 	'initdb'     => ['ws2_32.lib'],
+	'pg_amcheck' => ['ws2_32.lib'],
 	'pg_restore' => ['ws2_32.lib'],
 	'pgbench'    => ['ws2_32.lib'],
 	'psql'       => ['ws2_32.lib']
@@ -79,7 +83,7 @@ my $frontend_extrasource = {
 	  [ 'src/bin/pgbench/exprscan.l', 'src/bin/pgbench/exprparse.y' ]
 };
 my @frontend_excludes = (
-	'pgevent',    'pg_basebackup', 'pg_rewind', 'pg_dump',
+	'pgevent',    'pg_amcheck', 'pg_basebackup', 'pg_rewind', 'pg_dump',
 	'pg_waldump', 'scripts');
 
 sub mkvcbuild
@@ -366,6 +370,12 @@ sub mkvcbuild
 		AddSimpleFrontend($d);
 	}
 
+	my $pgamcheck = AddSimpleFrontend('pg_amcheck', 1);
+	$pgamcheck->{name} = 'pg_amcheck';
+	$pgamcheck->AddFile('src/bin/pg_amcheck/pg_amcheck.c');
+	$pgamcheck->AddLibrary('ws2_32.lib');
+	$pgamcheck->AddDefine('FRONTEND');
+
 	my $pgbasebackup = AddSimpleFrontend('pg_basebackup', 1);
 	$pgbasebackup->AddFile('src/bin/pg_basebackup/pg_basebackup.c');
 	$pgbasebackup->AddLibrary('ws2_32.lib');
-- 
2.21.1 (Apple Git-122.3)

#27Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#26)
Re: pg_amcheck contrib application

On Fri, Mar 12, 2021 at 11:41 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

In this next patch, your documentation patch has been applied, and the whole project has been relocated from contrib/pg_amcheck to src/bin/pg_amcheck.

Committed that way with some small adjustments. Let's see what the
buildfarm thinks.

--
Robert Haas
EDB: http://www.enterprisedb.com

#28Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#27)
Re: pg_amcheck contrib application

On Fri, Mar 12, 2021 at 10:10 AM Robert Haas <robertmhaas@gmail.com> wrote:

Committed that way with some small adjustments. Let's see what the
buildfarm thinks.

Thank you both, Mark and Robert. This is excellent work!

--
Peter Geoghegan

#29Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#28)
Re: pg_amcheck contrib application

On Fri, Mar 12, 2021 at 10:32 AM Peter Geoghegan <pg@bowt.ie> wrote:

Thank you both, Mark and Robert. This is excellent work!

FYI I see these compiler warnings just now:

pg_amcheck.c:1653:4: warning: ISO C90 forbids mixed declarations and
code [-Wdeclaration-after-statement]
1653 | DatabaseInfo *dat = (DatabaseInfo *)
pg_malloc0(sizeof(DatabaseInfo));
| ^~~~~~~~~~~~
pg_amcheck.c: In function ‘compile_relation_list_one_db’:
pg_amcheck.c:2060:9: warning: variable ‘is_btree’ set but not used
[-Wunused-but-set-variable]
2060 | bool is_btree = false;
| ^~~~~~~~

Looks like this 'is_btree' variable should be PG_USED_FOR_ASSERTS_ONLY.

--
Peter Geoghegan

#30Erik Rijkers
er@xs4all.nl
In reply to: Robert Haas (#27)
Re: pg_amcheck contrib application

On 2021.03.12. 19:10 Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Mar 12, 2021 at 11:41 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

In this next patch, your documentation patch has been applied, and the whole project has been relocated from contrib/pg_amcheck to src/bin/pg_amcheck.

Committed that way with some small adjustments. Let's see what the
buildfarm thinks.

Hi,

An output-formatting error, I think:

I ran pg_amcheck against a 1.5 GB table:

-- pg_amcheck --progress --on-error-stop --heapallindexed -vt myjsonfile100k

pg_amcheck: including database: "testdb"
pg_amcheck: in database "testdb": using amcheck version "1.3" in schema "public"
0/4 relations (0%) 0/187978 pages (0%)
pg_amcheck: checking heap table "testdb"."public"."myjsonfile100k"
pg_amcheck: checking btree index "testdb"."public"."myjsonfile100k_pkey"
2/4 relations (50%) 187977/187978 pages (99%), (testdb )
pg_amcheck: checking btree index "testdb"."pg_toast"."pg_toast_26110_index"
3/4 relations (75%) 187978/187978 pages (100%), (testdb )
pg_amcheck: checking heap table "testdb"."pg_toast"."pg_toast_26110"
4/4 relations (100%) 187978/187978 pages (100%)

I think there is a formatting glitch in lines like:

2/4 relations (50%) 187977/187978 pages (99%), (testdb )

I suppose that last part should show up trimmed as '(testdb)', right?

Thanks,

Erik Rijkers

#31Robert Haas
robertmhaas@gmail.com
In reply to: Noname (#30)
Re: pg_amcheck contrib application

On Fri, Mar 12, 2021 at 2:05 PM <er@xs4all.nl> wrote:

I think there is a formatting glitch in lines like:

2/4 relations (50%) 187977/187978 pages (99%), (testdb )

I suppose that last part should show up trimmed as '(testdb)', right?

Actually I think this is intentional. The idea is that as the line is
rewritten we don't want the close-paren to move around. It's the same
thing pg_basebackup does with its progress reporting.

Now that is not to say that some other behavior might not be better,
just that Mark was copying something that already exists, probably
because he knows that I'm finnicky about consistency....

--
Robert Haas
EDB: http://www.enterprisedb.com

#32Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#29)
Re: pg_amcheck contrib application

On Fri, Mar 12, 2021 at 1:35 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Fri, Mar 12, 2021 at 10:32 AM Peter Geoghegan <pg@bowt.ie> wrote:

Thank you both, Mark and Robert. This is excellent work!

Thanks.

FYI I see these compiler warnings just now:

pg_amcheck.c:1653:4: warning: ISO C90 forbids mixed declarations and
code [-Wdeclaration-after-statement]
1653 | DatabaseInfo *dat = (DatabaseInfo *)
pg_malloc0(sizeof(DatabaseInfo));
| ^~~~~~~~~~~~
pg_amcheck.c: In function ‘compile_relation_list_one_db’:
pg_amcheck.c:2060:9: warning: variable ‘is_btree’ set but not used
[-Wunused-but-set-variable]
2060 | bool is_btree = false;
| ^~~~~~~~

Looks like this 'is_btree' variable should be PG_USED_FOR_ASSERTS_ONLY.

I'll commit something shortly to address these.

--
Robert Haas
EDB: http://www.enterprisedb.com

#33Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#32)
Re: pg_amcheck contrib application

On Fri, Mar 12, 2021 at 2:31 PM Robert Haas <robertmhaas@gmail.com> wrote:

I'll commit something shortly to address these.

There are some interesting failures in the test cases on the
buildfarm. One of the tests ($offnum == 13) corrupts the TOAST pointer
with a garbage value, expecting to get the message "final toast chunk
number 0 differs from expected value 6". But on florican and maybe
other systems we instead get "final toast chunk number 0 differs from
expected value 5". That's because the value of TOAST_MAX_CHUNK_SIZE
depends on MAXIMUM_ALIGNOF. I think that on 4-byte alignment systems
it works out to 2000 and on 8-byte alignment systems it works out to
1996, and the value being stored is 10000 bytes, hence the problem.
The place where the calculation goes different seems to be in
MaximumBytesPerTuple(), where it uses MAXALIGN_DOWN() on a value that,
according to my calculations, will be 2038 on all platforms, but the
output of MAXALIGN_DOWN() will be 2032 or 2036 depending on the
platform. I think the solution to this is just to change the message
to match \d+ chunks instead of exactly 6. We should do that right away
to avoid having the buildfarm barf.

But, I also notice a couple of other things I think could be improved here:

1. amcheck is really reporting the complete absence of any TOAST rows
here due to a corrupted va_valueid. It could pick a better phrasing of
that message than "final toast chunk number 0 differs from expected
value XXX". I mean, there is no chunk 0. There are no chunks at all.

2. Using SSSSSSSSS as the perl unpack code for the varlena header is
not ideal, because it's really 2 1-byte fields followed by 4 4-byte
fields. So I think you should be using CCllLL, for two unsigned bytes
and then two signed 4-byte quantities and then two unsigned 4-byte
quantities. I think if you did that you'd be overwriting the
va_valueid with the *same* garbage value on every platform, which
would be better than different ones. Perhaps when we improve the
message as suggested in (1) this will become a live issue, since we
might choose to say something like "no TOAST entries for value %u".

--
Robert Haas
EDB: http://www.enterprisedb.com

#34Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#33)
Re: pg_amcheck contrib application

On Fri, Mar 12, 2021 at 1:43 PM Robert Haas <robertmhaas@gmail.com> wrote:

There are some interesting failures in the test cases on the
buildfarm.

I wonder if Andrew Dunstan (now CC'd) could configure his crake
buildfarm member to run pg_amcheck with the most expensive and
thorough options on the master branch (plus all new major version
branches going forward).

That would give us some degree of amcheck test coverage in the back
branches right away. It might even detect cross-version
inconsistencies. Or even pg_upgrade bugs.

--
Peter Geoghegan

#35Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#33)
1 attachment(s)
Re: pg_amcheck contrib application

On Mar 12, 2021, at 1:43 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Mar 12, 2021 at 2:31 PM Robert Haas <robertmhaas@gmail.com> wrote:

I'll commit something shortly to address these.

There are some interesting failures in the test cases on the
buildfarm. One of the tests ($offnum == 13) corrupts the TOAST pointer
with a garbage value, expecting to get the message "final toast chunk
number 0 differs from expected value 6". But on florican and maybe
other systems we instead get "final toast chunk number 0 differs from
expected value 5". That's because the value of TOAST_MAX_CHUNK_SIZE
depends on MAXIMUM_ALIGNOF. I think that on 4-byte alignment systems
it works out to 2000 and on 8-byte alignment systems it works out to
1996, and the value being stored is 10000 bytes, hence the problem.
The place where the calculation goes different seems to be in
MaximumBytesPerTuple(), where it uses MAXALIGN_DOWN() on a value that,
according to my calculations, will be 2038 on all platforms, but the
output of MAXALIGN_DOWN() will be 2032 or 2036 depending on the
platform. I think the solution to this is just to change the message
to match \d+ chunks instead of exactly 6. We should do that right away
to avoid having the buildfarm barf.

But, I also notice a couple of other things I think could be improved here:

1. amcheck is really reporting the complete absence of any TOAST rows
here due to a corrupted va_valueid. It could pick a better phrasing of
that message than "final toast chunk number 0 differs from expected
value XXX". I mean, there is no chunk 0. There are no chunks at all.

2. Using SSSSSSSSS as the perl unpack code for the varlena header is
not ideal, because it's really 2 1-byte fields followed by 4 4-byte
fields. So I think you should be using CCllLL, for two unsigned bytes
and then two signed 4-byte quantities and then two unsigned 4-byte
quantities. I think if you did that you'd be overwriting the
va_valueid with the *same* garbage value on every platform, which
would be better than different ones. Perhaps when we improve the
message as suggested in (1) this will become a live issue, since we
might choose to say something like "no TOAST entries for value %u".

--
Robert Haas
EDB: http://www.enterprisedb.com

This does nothing to change the verbiage from contrib/amcheck, but it should address the problems discussed here in pg_amcheck's regression tests.

Attachments:

v1-0001-Fixing-portability-issues-in-pg_amcheck-regressio.patch (application/octet-stream)
From 194a53e7b0a060425a49e4447b70a1663c3280d5 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Fri, 12 Mar 2021 13:49:02 -0800
Subject: [PATCH v1] Fixing portability issues in pg_amcheck regression test

The first broken test was overwriting a 1-byte varlena header to
make it look like the initial byte of a 4-byte varlena header, but
there were two problems with this.  First, the byte written, 0x80,
was only appropriate on little-endian machines.  That's the wrong
bit pattern for a big-endian machine.  The second problem is that,
even if you get the first byte correct, the first three bytes of the
payload of the text datum will be interpreted as part of the 4-byte
length field, but the test wasn't overwriting those, it was just
leaving them as the character string "abc".  That makes it hard to
think about what the 4-byte length will be on machines with
different endianness.  The cleanest solution seems to be to
overwrite the first four bytes of the datum with a 4-byte varlena
header with all 30 length bits set.  The purpose of the test is to
check what happens when the length is overlong, and this works just
as well without being as hard to read and understand.

The second broken test was assuming that a toasted uncompressed
string of length 10000 characters would occupy the same number of
toast chunks regardless of platform, but that is wrong.  It occupies
6 chunks on 64-bit systems and 5-chunks on 32-bit systems, owing to
TOAST_MAX_CHUNK_SIZE differing.  It just so happens that length
10000 is close to the cutoff between 5 chunks and 6 chunks, so we
could fix this by picking a different number.  But since the test is
hardcoding the number 6 in a pattern, it is less brittle to just use
\d+ and accept any number in this location.  The error text itself
is what we care about, not the number within that text.
---
 src/bin/pg_amcheck/t/004_verify_heapam.pl | 72 +++++++++++++----------
 1 file changed, 42 insertions(+), 30 deletions(-)

diff --git a/src/bin/pg_amcheck/t/004_verify_heapam.pl b/src/bin/pg_amcheck/t/004_verify_heapam.pl
index 48dfbef145..618890d7cf 100644
--- a/src/bin/pg_amcheck/t/004_verify_heapam.pl
+++ b/src/bin/pg_amcheck/t/004_verify_heapam.pl
@@ -53,10 +53,10 @@ use Test::More;
 # We choose to read and write binary copies of our table's tuples, using perl's
 # pack() and unpack() functions.  Perl uses a packing code system in which:
 #
+#	l = "signed 32-bit Long",
 #	L = "Unsigned 32-bit Long",
 #	S = "Unsigned 16-bit Short",
 #	C = "Unsigned 8-bit Octet",
-#	c = "signed 8-bit octet",
 #	q = "signed 64-bit quadword"
 #
 # Each tuple in our table has a layout as follows:
@@ -72,16 +72,16 @@ use Test::More;
 #    xx                     t_hoff: x			offset = 22		C
 #    xx                     t_bits: x			offset = 23		C
 #    xx xx xx xx xx xx xx xx   'a': xxxxxxxx	offset = 24		q
-#    xx xx xx xx xx xx xx xx   'b': xxxxxxxx	offset = 32		Cccccccc
-#    xx xx xx xx xx xx xx xx   'c': xxxxxxxx	offset = 40		SSSS
-#    xx xx xx xx xx xx xx xx      : xxxxxxxx	 ...continued	SSSS
-#    xx xx                        : xx      	 ...continued	S
+#    xx xx xx xx xx xx xx xx   'b': xxxxxxxx	offset = 32		CCCCCCCC
+#    xx xx xx xx xx xx xx xx   'c': xxxxxxxx	offset = 40		CCllLL
+#    xx xx xx xx xx xx xx xx      : xxxxxxxx	 ...continued
+#    xx xx                        : xx      	 ...continued
 #
 # We could choose to read and write columns 'b' and 'c' in other ways, but
 # it is convenient enough to do it this way.  We define packing code
 # constants here, where they can be compared easily against the layout.
 
-use constant HEAPTUPLE_PACK_CODE => 'LLLSSSSSCCqCcccccccSSSSSSSSS';
+use constant HEAPTUPLE_PACK_CODE => 'LLLSSSSSCCqCCCCCCCCCCllLL';
 use constant HEAPTUPLE_PACK_LENGTH => 58;     # Total size
 
 # Read a tuple of our table from a heap page.
@@ -121,15 +121,12 @@ sub read_tuple ($$)
 			b_body5 => shift,
 			b_body6 => shift,
 			b_body7 => shift,
-			c1 => shift,
-			c2 => shift,
-			c3 => shift,
-			c4 => shift,
-			c5 => shift,
-			c6 => shift,
-			c7 => shift,
-			c8 => shift,
-			c9 => shift);
+			c_va_header => shift,
+			c_va_vartag => shift,
+			c_va_rawsize => shift,
+			c_va_extsize => shift,
+			c_va_valueid => shift,
+			c_va_toastrelid => shift);
 	# Stitch together the text for column 'b'
 	$tup{b} = join('', map { chr($tup{"b_body$_"}) } (1..7));
 	return \%tup;
@@ -168,15 +165,12 @@ sub write_tuple($$$)
 					$tup->{b_body5},
 					$tup->{b_body6},
 					$tup->{b_body7},
-					$tup->{c1},
-					$tup->{c2},
-					$tup->{c3},
-					$tup->{c4},
-					$tup->{c5},
-					$tup->{c6},
-					$tup->{c7},
-					$tup->{c8},
-					$tup->{c9});
+					$tup->{c_va_header},
+					$tup->{c_va_vartag},
+					$tup->{c_va_rawsize},
+					$tup->{c_va_extsize},
+					$tup->{c_va_valueid},
+					$tup->{c_va_toastrelid});
 	seek($fh, $offset, 0)
 		or BAIL_OUT("seek failed: $!");
 	defined(syswrite($fh, $buffer, HEAPTUPLE_PACK_LENGTH))
@@ -273,6 +267,7 @@ open($file, '+<', $relpath)
 	or BAIL_OUT("open failed: $!");
 binmode $file;
 
+my $ENDIANNESS;
 for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 {
 	my $offnum = $tupidx + 1;  # offnum is 1-based, not zero-based
@@ -289,6 +284,9 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 		plan skip_all => qq(Page layout differs from our expectations: expected (12345678, "abcdefg"), got ($a, "$b"));
 		exit;
 	}
+
+	# Determine endianness of current platform from the 1-byte varlena header
+	$ENDIANNESS = $tup->{b_header} == 0x11 ? "little" : "big";
 }
 close($file)
 	or BAIL_OUT("close failed: $!");
@@ -459,22 +457,36 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 	}
 	elsif ($offnum == 12)
 	{
-		# Corrupt the bits in column 'b' 1-byte varlena header
-		$tup->{b_header} = 0x80;
+		# Overwrite column 'b' 1-byte varlena header and initial characters to
+		# look like a long 4-byte varlena
+		#
+		# On little endian machines, bytes ending in two zero bits (xxxxxx00 bytes)
+		# are 4-byte length word, aligned, uncompressed data (up to 1G).  We set the
+		# high six bits to 111111 and the lower two bits to 00, then the next three
+		# bytes with 0xFF using 0xFCFFFFFF.
+		#
+		# On big endian machines, bytes starting in two zero bits (00xxxxxx bytes)
+		# are 4-byte length word, aligned, uncompressed data (up to 1G).  We set the
+		# low six bits to 111111 and the high two bits to 00, then the next three
+		# bytes with 0xFF using 0x3FFFFFFF.
+		#
+		$tup->{b_header} = $ENDIANNESS eq 'little' ? 0xFC : 0x3F;
+		$tup->{b_body1} = 0xFF;
+		$tup->{b_body2} = 0xFF;
+		$tup->{b_body3} = 0xFF;
 
 		$header = header(0, $offnum, 1);
 		push @expected,
-			qr/${header}attribute 1 with length 4294967295 ends at offset 416848000 beyond total tuple length 58/;
+			qr/${header}attribute \d+ with length \d+ ends at offset \d+ beyond total tuple length \d+/;
 	}
 	elsif ($offnum == 13)
 	{
 		# Corrupt the bits in column 'c' toast pointer
-		$tup->{c6} = 41;
-		$tup->{c7} = 41;
+		$tup->{c_va_valueid} = 0xFFFFFFFF;
 
 		$header = header(0, $offnum, 2);
 		push @expected,
-			qr/${header}final toast chunk number 0 differs from expected value 6/,
+			qr/${header}final toast chunk number 0 differs from expected value \d+/,
 			qr/${header}toasted value for attribute 2 missing from toast table/;
 	}
 	elsif ($offnum == 14)
-- 
2.21.1 (Apple Git-122.3)

#36Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#35)
Re: pg_amcheck contrib application

On Fri, Mar 12, 2021 at 5:24 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

This does nothing to change the verbiage from contrib/amcheck, but it should address the problems discussed here in pg_amcheck's regression tests.

Committed.

--
Robert Haas
EDB: http://www.enterprisedb.com

#37Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#36)
2 attachment(s)
Re: pg_amcheck contrib application

On Mar 12, 2021, at 2:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Mar 12, 2021 at 5:24 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

This does nothing to change the verbiage from contrib/amcheck, but it should address the problems discussed here in pg_amcheck's regression tests.

Committed.

Thanks.

There are two more, attached here. The first deals with error message text that differs between buildfarm animals, and the second deals with an apparent problem with IPC::Run shell expanding an asterisk on some platforms but not others. That second one, if true, seems like a problem with scope beyond the pg_amcheck project, as TestLib::command_checks_all uses IPC::Run, and it would be desirable to have consistent behavior across platforms.

Attachments:

v2-0001-Fixing-pg_amcheck-regression-test-portability-iss.patch (application/octet-stream)
From 02a598d2fec04e5c5f5185bd07d1f19527900844 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Fri, 12 Mar 2021 14:43:19 -0800
Subject: [PATCH v2 1/2] Fixing pg_amcheck regression test portability issue

One of pg_amcheck's regression tests was failing because it was not
accounting for the fact that the exact error message for a
nonexistent role can differ.  The test was expecting a login attempt
with user "no_such_user" would draw an error matching the pattern
/role "no_such_user" does not exist/, but this ignores differences
across machines.  On fairywren, for example, the message actually
seen is:

	SSPI authentication failed for user "no_such_user"

Rather than try to update the test with an exhaustive list of all
possible failure messages, changing the test to merely verify that
the pg_amcheck command fails when given a nonexistent user.
---
 src/bin/pg_amcheck/t/002_nonesuch.pl | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_amcheck/t/002_nonesuch.pl b/src/bin/pg_amcheck/t/002_nonesuch.pl
index b7d41c9b49..fd5f637d7b 100644
--- a/src/bin/pg_amcheck/t/002_nonesuch.pl
+++ b/src/bin/pg_amcheck/t/002_nonesuch.pl
@@ -3,7 +3,7 @@ use warnings;
 
 use PostgresNode;
 use TestLib;
-use Test::More tests => 76;
+use Test::More tests => 75;
 
 # Test set-up
 my ($node, $port);
@@ -68,7 +68,7 @@ $node->command_checks_all(
 	[ 'pg_amcheck', '-U', 'no_such_user', 'postgres' ],
 	1,
 	[ qr/^$/ ],
-	[ qr/role "no_such_user" does not exist/ ],
+	[ ],
 	'checking with a non-existent user');
 
 # Failing to connect to the initial database due to bad username is an still an
-- 
2.21.1 (Apple Git-122.3)

v2-0002-Working-around-apparent-difficulty-in-IPC-Run.patch (application/octet-stream)
From be788467ae9161ae478e1b4ae351d509252e021e Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Fri, 12 Mar 2021 15:01:48 -0800
Subject: [PATCH v2 2/2] Working around apparent difficulty in IPC::Run

One of pg_amcheck's regression tests was passing an asterisk through
TestLib's command_checks_all() command, which gets through to
pg_amcheck without difficulty on most platforms, but appears to get
shell expanded on Windows (jacana) and AIX (hoverfly).

To fix this, passing '-S*' rather than the pair ('-S', '*').
---
 src/bin/pg_amcheck/t/002_nonesuch.pl | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/bin/pg_amcheck/t/002_nonesuch.pl b/src/bin/pg_amcheck/t/002_nonesuch.pl
index fd5f637d7b..4df17885f9 100644
--- a/src/bin/pg_amcheck/t/002_nonesuch.pl
+++ b/src/bin/pg_amcheck/t/002_nonesuch.pl
@@ -239,7 +239,7 @@ $node->command_checks_all(
 		'-s', 'pg_toast',
 		'-s', 'information_schema',
 		'-t', 'pg_catalog.pg_class',
-		'-S', '*'
+		'-S*'
 	],
 	1,
 	[ qr/^$/ ],
-- 
2.21.1 (Apple Git-122.3)

#38Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#31)
Re: pg_amcheck contrib application

On Mar 12, 2021, at 11:24 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Mar 12, 2021 at 2:05 PM <er@xs4all.nl> wrote:

I think there is a formatting glitch in lines like:

2/4 relations (50%) 187977/187978 pages (99%), (testdb )

I suppose that last part should show up trimmed as '(testdb)', right?

Actually I think this is intentional. The idea is that as the line is
rewritten we don't want the close-paren to move around. It's the same
thing pg_basebackup does with its progress reporting.

Now that is not to say that some other behavior might not be better,
just that Mark was copying something that already exists, probably
because he knows that I'm finnicky about consistency....

I think Erik's test case only checked one database, which might be why it looked odd to him. But consider:

pg_amcheck -d foo -d bar -d myreallylongdatabasename -d myshortername -d baz --progress

The tool will respect your database ordering, and check foo, then bar, etc. If you have --jobs greater than one, it will start checking some relations in bar before finishing all relations in foo, so pg_amcheck uses a fudge factor to decide when to report that processing has moved on to database bar, etc.

You wouldn't want the parens to jump around when the long database names get processed. So it keeps the parens in the same location, space pads shorter database names, and truncates overlong database names.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#39Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Mark Dilger (#37)
1 attachment(s)
Re: pg_amcheck contrib application

On Mar 12, 2021, at 3:24 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

and the second deals with an apparent problem with IPC::Run shell expanding an asterisk on some platforms but not others. That second one, if true, seems like a problem with scope beyond the pg_amcheck project, as TestLib::command_checks_all uses IPC::Run, and it would be desirable to have consistent behavior across platforms.

The problem with IPC::Run appears to be real, though I might just need to wait longer for the farm animals to prove me wrong about that. But there is a similar symptom caused by an unrelated problem, one entirely my fault and spotted by Robert. Here is a patch:

Attachments:

v3-0001-Avoid-use-of-non-portable-option-ordering-in-comm.patch (application/octet-stream)
From db0a3e9b43202aacf1be668e887cfb9f803a6ada Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Fri, 12 Mar 2021 16:56:59 -0800
Subject: [PATCH v3] Avoid use of non-portable option ordering in
 command_checks_all().

The use of bare command line arguments before switches is not
portable and caused failures in the buildfarm.  Reordering the
arguments to be portable.

Failures were observed on drongo and hoverfly.

Non-portable usage spotted by Robert Haas.
---
 src/bin/pg_amcheck/t/003_check.pl | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/src/bin/pg_amcheck/t/003_check.pl b/src/bin/pg_amcheck/t/003_check.pl
index e43ffe7ed6..875a675423 100644
--- a/src/bin/pg_amcheck/t/003_check.pl
+++ b/src/bin/pg_amcheck/t/003_check.pl
@@ -376,7 +376,7 @@ $node->command_checks_all(
 # are quiet.
 #
 $node->command_checks_all(
-	[ @cmd, 'db1', '-t', 's1.*', '--no-dependent-indexes' ],
+	[ @cmd, '-t', 's1.*', '--no-dependent-indexes', 'db1' ],
 	0,
 	[ $no_output_re ],
 	[ $no_output_re ],
@@ -385,7 +385,7 @@ $node->command_checks_all(
 # Checking db2.s1 should show table corruptions if indexes are excluded
 #
 $node->command_checks_all(
-	[ @cmd, 'db2', '-t', 's1.*', '--no-dependent-indexes' ],
+	[ @cmd, '-t', 's1.*', '--no-dependent-indexes', 'db2' ],
 	2,
 	[ $missing_file_re ],
 	[ $no_output_re ],
@@ -395,7 +395,7 @@ $node->command_checks_all(
 # corruption messages on stdout, and nothing on stderr.
 #
 $node->command_checks_all(
-	[ @cmd, 'db1', '-s', 's3' ],
+	[ @cmd, '-s', 's3', 'db1' ],
 	2,
 	[ $index_missing_relation_fork_re,
 	  $line_pointer_corruption_re,
@@ -408,14 +408,14 @@ $node->command_checks_all(
 # options the toast corruption is reported, but when excluding toast we get no
 # error reports.
 $node->command_checks_all(
-	[ @cmd, 'db1', '-s', 's4' ],
+	[ @cmd, '-s', 's4', 'db1' ],
 	2,
 	[ $missing_file_re ],
 	[ $no_output_re ],
 	'pg_amcheck in schema s4 reports toast corruption');
 
 $node->command_checks_all(
-	[ @cmd, '--no-dependent-toast', '--exclude-toast-pointers', 'db1', '-s', 's4' ],
+	[ @cmd, '--no-dependent-toast', '--exclude-toast-pointers', '-s', 's4', 'db1' ],
 	0,
 	[ $no_output_re ],
 	[ $no_output_re ],
@@ -423,7 +423,7 @@ $node->command_checks_all(
 
 # Check that no corruption is reported in schema db1.s5
 $node->command_checks_all(
-	[ @cmd, 'db1', '-s', 's5' ],
+	[ @cmd, '-s', 's5', 'db1' ],
 	0,
 	[ $no_output_re ],
 	[ $no_output_re ],
@@ -433,7 +433,7 @@ $node->command_checks_all(
 # the indexes, no corruption is reported about the schema.
 #
 $node->command_checks_all(
-	[ @cmd, 'db1', '-s', 's1', '-I', 't1_btree', '-I', 't2_btree' ],
+	[ @cmd, '-s', 's1', '-I', 't1_btree', '-I', 't2_btree', 'db1' ],
 	0,
 	[ $no_output_re ],
 	[ $no_output_re ],
@@ -444,7 +444,7 @@ $node->command_checks_all(
 # about the schema.
 #
 $node->command_checks_all(
-	[ @cmd, 'db1', '-t', 's1.*', '--no-dependent-indexes' ],
+	[ @cmd, '-t', 's1.*', '--no-dependent-indexes', 'db1' ],
 	0,
 	[ $no_output_re ],
 	[ $no_output_re ],
@@ -454,7 +454,7 @@ $node->command_checks_all(
 # tables that no corruption is reported.
 #
 $node->command_checks_all(
-	[ @cmd, 'db1', '-s', 's2', '-T', 't1', '-T', 't2' ],
+	[ @cmd, '-s', 's2', '-T', 't1', '-T', 't2', 'db1' ],
 	0,
 	[ $no_output_re ],
 	[ $no_output_re ],
@@ -464,7 +464,7 @@ $node->command_checks_all(
 # to avoid getting messages about corrupt tables or indexes.
 #
 command_fails_like(
-	[ @cmd, 'db1', '-s', 's5', '--startblock', 'junk' ],
+	[ @cmd, '-s', 's5', '--startblock', 'junk', 'db1' ],
 	qr/invalid start block/,
 	'pg_amcheck rejects garbage startblock');
 
@@ -474,7 +474,7 @@ command_fails_like(
 	'pg_amcheck rejects garbage endblock');
 
 command_fails_like(
-	[ @cmd, 'db1', '-s', 's5', '--startblock', '5', '--endblock', '4' ],
+	[ @cmd, '-s', 's5', '--startblock', '5', '--endblock', '4', 'db1' ],
 	qr/end block precedes start block/,
 	'pg_amcheck rejects invalid block range');
 
@@ -483,14 +483,14 @@ command_fails_like(
 # arguments are handled sensibly.
 #
 $node->command_checks_all(
-	[ @cmd, 'db1', '-s', 's1', '-i', 't1_btree', '--parent-check' ],
+	[ @cmd, '-s', 's1', '-i', 't1_btree', '--parent-check', 'db1' ],
 	2,
 	[ $index_missing_relation_fork_re ],
 	[ $no_output_re ],
 	'pg_amcheck smoke test --parent-check');
 
 $node->command_checks_all(
-	[ @cmd, 'db1', '-s', 's1', '-i', 't1_btree', '--heapallindexed', '--rootdescend' ],
+	[ @cmd, '-s', 's1', '-i', 't1_btree', '--heapallindexed', '--rootdescend', 'db1' ],
 	2,
 	[ $index_missing_relation_fork_re ],
 	[ $no_output_re ],
-- 
2.21.1 (Apple Git-122.3)

#40Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#39)
Re: pg_amcheck contrib application

On Fri, Mar 12, 2021 at 8:04 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

The problem with IPC::Run appears to be real, though I might just need to wait longer for the farm animals to prove me wrong about that. But there is a similar symptom caused by an unrelated problem, one entirely my fault and spotted by Robert. Here is a patch:

OK, I committed this too, along with the one I hadn't committed yet
from your previous email. Gah, tests are so annoying. :-)

--
Robert Haas
EDB: http://www.enterprisedb.com

#41Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#40)
1 attachment(s)
Re: pg_amcheck contrib application

On Mar 12, 2021, at 5:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Gah, tests are so annoying. :-)

There is another problem of non-portable option ordering in the tests.

Attachments:

v4-0001-pg_amcheck-Keep-trying-to-fix-the-tests.patch (application/octet-stream)
From 1f6cb42bd8edfee3f1770cb760ccb8b6ee9429ee Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Fri, 12 Mar 2021 20:33:16 -0800
Subject: [PATCH v4] pg_amcheck: Keep trying to fix the tests.

Fix another example of non-portable option ordering in the tests.
---
 src/bin/pg_amcheck/t/003_check.pl | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/bin/pg_amcheck/t/003_check.pl b/src/bin/pg_amcheck/t/003_check.pl
index 0a7795bb64..54b2b86a49 100644
--- a/src/bin/pg_amcheck/t/003_check.pl
+++ b/src/bin/pg_amcheck/t/003_check.pl
@@ -468,7 +468,7 @@ command_fails_like(
 	'pg_amcheck rejects garbage startblock');
 
 command_fails_like(
-	[ @cmd, 'db1', '-s', 's5', '--endblock', '1234junk' ],
+	[ @cmd, '-s', 's5', '--endblock', '1234junk', 'db1' ],
 	qr/invalid end block/,
 	'pg_amcheck rejects garbage endblock');
 
-- 
2.21.1 (Apple Git-122.3)

#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Dilger (#41)
Re: pg_amcheck contrib application

Mark Dilger <mark.dilger@enterprisedb.com> writes:

There is another problem of non-portable option ordering in the tests.

Don't almost all of the following tests have the same issue?

regards, tom lane

#43Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Dilger (#41)
Re: pg_amcheck contrib application

... btw, prairiedog (which has a rather old Perl) has a
different complaint:

Invalid type 'q' in unpack at t/004_verify_heapam.pl line 104.

regards, tom lane

#44Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#42)
Re: pg_amcheck contrib application

I wrote:

Don't almost all of the following tests have the same issue?

Ah, nevermind, I was looking at an older version of 003_check.pl.
I concur that 24189277f missed only one here.

Pushed your fix.

regards, tom lane

#45Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tom Lane (#44)
Re: pg_amcheck contrib application

On Mar 12, 2021, at 9:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

Don't almost all of the following tests have the same issue?

Ah, nevermind, I was looking at an older version of 003_check.pl.
I concur that 24189277f missed only one here.

Pushed your fix.

regards, tom lane

Thanks! Was just responding to your other email, but now I don't have to send it.

Sorry for painting so many farm animals red this evening.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#46Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#43)
Re: pg_amcheck contrib application

I wrote:

... btw, prairiedog (which has a rather old Perl) has a
different complaint:
Invalid type 'q' in unpack at t/004_verify_heapam.pl line 104.

Hmm ... "man perlfunc" on that system quoth

q A signed quad (64-bit) value.
Q An unsigned quad value.
(Quads are available only if your system supports 64-bit
integer values _and_ if Perl has been compiled to support those.
Causes a fatal error otherwise.)

It does not seem unreasonable for us to rely on Perl having that
in 2021, so I'll see about upgrading this perl installation.

(I suppose gaur will need it too, sigh.)

regards, tom lane

#47Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Dilger (#45)
Re: pg_amcheck contrib application

Mark Dilger <mark.dilger@enterprisedb.com> writes:

Sorry for painting so many farm animals red this evening.

Not to worry. We go through this sort of fire drill regularly
when somebody pushes a batch of brand new test code.

regards, tom lane

#48Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#46)
Re: pg_amcheck contrib application

I wrote:

... btw, prairiedog (which has a rather old Perl) has a
different complaint:
Invalid type 'q' in unpack at t/004_verify_heapam.pl line 104.

Hmm ... "man perlfunc" on that system quoth
q A signed quad (64-bit) value.
Q An unsigned quad value.
(Quads are available only if your system supports 64-bit
integer values _and_ if Perl has been compiled to support those.
Causes a fatal error otherwise.)
It does not seem unreasonable for us to rely on Perl having that
in 2021, so I'll see about upgrading this perl installation.

Hm, wait a minute: hoverfly is showing the same failure, even though
it claims to be running a 64-bit Perl. Now I'm confused.

regards, tom lane

#49Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#48)
Re: pg_amcheck contrib application

On Sat, Mar 13, 2021 at 01:07:15AM -0500, Tom Lane wrote:

I wrote:

... btw, prairiedog (which has a rather old Perl) has a
different complaint:
Invalid type 'q' in unpack at t/004_verify_heapam.pl line 104.

Hmm ... "man perlfunc" on that system quoth
q A signed quad (64-bit) value.
Q An unsigned quad value.
(Quads are available only if your system supports 64-bit
integer values _and_ if Perl has been compiled to support those.
Causes a fatal error otherwise.)
It does not seem unreasonable for us to rely on Perl having that
in 2021, so I'll see about upgrading this perl installation.

Hm, wait a minute: hoverfly is showing the same failure, even though
it claims to be running a 64-bit Perl. Now I'm confused.

On that machine:

[nm@power8-aix 7:0 2021-03-13T06:09:08 ~ 0]$ /usr/bin/perl64 -e 'unpack "q", ""'
[nm@power8-aix 7:0 2021-03-13T06:09:10 ~ 0]$ /usr/bin/perl -e 'unpack "q", ""'
Invalid type 'q' in unpack at -e line 1.

hoverfly does configure with PERL=perl64. /usr/bin/prove is from the 32-bit
Perl, so I suspect the TAP suites get 32-bit Perl that way. (There's no
"prove64".) This test should unpack the field as two 32-bit values, not a
64-bit value, to avoid requiring more from the Perl installation.

#50Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Noah Misch (#49)
Re: pg_amcheck contrib application

On Mar 12, 2021, at 10:16 PM, Noah Misch <noah@leadboat.com> wrote:

On Sat, Mar 13, 2021 at 01:07:15AM -0500, Tom Lane wrote:

I wrote:

... btw, prairiedog (which has a rather old Perl) has a
different complaint:
Invalid type 'q' in unpack at t/004_verify_heapam.pl line 104.

Hmm ... "man perlfunc" on that system quoth
q A signed quad (64-bit) value.
Q An unsigned quad value.
(Quads are available only if your system supports 64-bit
integer values _and_ if Perl has been compiled to support those.
Causes a fatal error otherwise.)
It does not seem unreasonable for us to rely on Perl having that
in 2021, so I'll see about upgrading this perl installation.

Hm, wait a minute: hoverfly is showing the same failure, even though
it claims to be running a 64-bit Perl. Now I'm confused.

On that machine:

[nm@power8-aix 7:0 2021-03-13T06:09:08 ~ 0]$ /usr/bin/perl64 -e 'unpack "q", ""'
[nm@power8-aix 7:0 2021-03-13T06:09:10 ~ 0]$ /usr/bin/perl -e 'unpack "q", ""'
Invalid type 'q' in unpack at -e line 1.

hoverfly does configure with PERL=perl64. /usr/bin/prove is from the 32-bit
Perl, so I suspect the TAP suites get 32-bit Perl that way. (There's no
"prove64".) This test should unpack the field as two 32-bit values, not a
64-bit value, to avoid requiring more from the Perl installation.

I will post a modified test in a bit that avoids using Q/q.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#51Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Dilger (#50)
Re: pg_amcheck contrib application

Mark Dilger <mark.dilger@enterprisedb.com> writes:

On Mar 12, 2021, at 10:16 PM, Noah Misch <noah@leadboat.com> wrote:

hoverfly does configure with PERL=perl64. /usr/bin/prove is from the 32-bit
Perl, so I suspect the TAP suites get 32-bit Perl that way. (There's no
"prove64".)

Oh, that's annoying.

This test should unpack the field as two 32-bit values, not a
64-bit value, to avoid requiring more from the Perl installation.

I will post a modified test in a bit that avoids using Q/q.

Coping with both endiannesses might be painful.

regards, tom lane

#52Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tom Lane (#51)
Re: pg_amcheck contrib application

On Mar 12, 2021, at 10:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Mark Dilger <mark.dilger@enterprisedb.com> writes:

On Mar 12, 2021, at 10:16 PM, Noah Misch <noah@leadboat.com> wrote:

hoverfly does configure with PERL=perl64. /usr/bin/prove is from the 32-bit
Perl, so I suspect the TAP suites get 32-bit Perl that way. (There's no
"prove64".)

Oh, that's annoying.

This test should unpack the field as two 32-bit values, not a
64-bit value, to avoid requiring more from the Perl installation.

I will post a modified test in a bit that avoids using Q/q.

Coping with both endiannesses might be painful.

Not too bad if the bigint value is zero, as both the low and high 32 bits will be zero, regardless of endianness. The question is whether that gives up too much in terms of what the test is trying to do. I'm not sure that it does, but if you'd rather solve this by upgrading perl, that's ok by me.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#53Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#51)
Re: pg_amcheck contrib application

Hi,

On 2021-03-13 01:22:54 -0500, Tom Lane wrote:

Mark Dilger <mark.dilger@enterprisedb.com> writes:

On Mar 12, 2021, at 10:16 PM, Noah Misch <noah@leadboat.com> wrote:

hoverfly does configure with PERL=perl64. /usr/bin/prove is from the 32-bit
Perl, so I suspect the TAP suites get 32-bit Perl that way. (There's no
"prove64".)

Oh, that's annoying.

I suspect we could solve that by changing our /usr/bin/prove
invocation to instead be PERL /usr/bin/prove? That might be a good thing
independent of this issue, because it's not unreasonable for a user to
expect that we'd actually use the perl installation they configured...

Although I do not know how prove determines the perl installation it's
going to use for the test scripts...

- Andres

#54Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Mark Dilger (#52)
1 attachment(s)
Re: pg_amcheck contrib application

On Mar 12, 2021, at 10:28 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

On Mar 12, 2021, at 10:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Mark Dilger <mark.dilger@enterprisedb.com> writes:

On Mar 12, 2021, at 10:16 PM, Noah Misch <noah@leadboat.com> wrote:

hoverfly does configure with PERL=perl64. /usr/bin/prove is from the 32-bit
Perl, so I suspect the TAP suites get 32-bit Perl that way. (There's no
"prove64".)

Oh, that's annoying.

This test should unpack the field as two 32-bit values, not a
64-bit value, to avoid requiring more from the Perl installation.

I will post a modified test in a bit that avoids using Q/q.

Coping with both endiannesses might be painful.

Not too bad if the bigint value is zero, as both the low and high 32 bits will be zero, regardless of endianness. The question is whether that gives up too much in terms of what the test is trying to do. I'm not sure that it does, but if you'd rather solve this by upgrading perl, that's ok by me.

I'm not advocating that this be the solution, but if we don't fix up the perl end of it, then this test change might be used instead.

Attachments:

v5-0001-pg_amcheck-continuing-to-fix-portability-problems.patch (application/octet-stream)
From 41a0b3155b27b8fa989ee53f1bcbc2faf867af6f Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Fri, 12 Mar 2021 22:30:48 -0800
Subject: [PATCH v5] pg_amcheck: continuing to fix portability problems

Fix a problem with the tests depending on the Q/q perl pack codes,
which do not work on all versions of perl.
---
 src/bin/pg_amcheck/t/004_verify_heapam.pl | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/src/bin/pg_amcheck/t/004_verify_heapam.pl b/src/bin/pg_amcheck/t/004_verify_heapam.pl
index 4cb34a0e01..6cff3b9343 100644
--- a/src/bin/pg_amcheck/t/004_verify_heapam.pl
+++ b/src/bin/pg_amcheck/t/004_verify_heapam.pl
@@ -57,7 +57,6 @@ use Test::More;
 #	L = "Unsigned 32-bit Long",
 #	S = "Unsigned 16-bit Short",
 #	C = "Unsigned 8-bit Octet",
-#	q = "signed 64-bit quadword"
 #
 # Each tuple in our table has a layout as follows:
 #
@@ -71,7 +70,7 @@ use Test::More;
 #    xx xx              t_infomask: xx			offset = 20		S
 #    xx                     t_hoff: x			offset = 22		C
 #    xx                     t_bits: x			offset = 23		C
-#    xx xx xx xx xx xx xx xx   'a': xxxxxxxx	offset = 24		q
+#    xx xx xx xx xx xx xx xx   'a': xxxxxxxx	offset = 24		LL
 #    xx xx xx xx xx xx xx xx   'b': xxxxxxxx	offset = 32		CCCCCCCC
 #    xx xx xx xx xx xx xx xx   'c': xxxxxxxx	offset = 40		CCllLL
 #    xx xx xx xx xx xx xx xx      : xxxxxxxx	 ...continued
@@ -81,7 +80,7 @@ use Test::More;
 # it is convenient enough to do it this way.  We define packing code
 # constants here, where they can be compared easily against the layout.
 
-use constant HEAPTUPLE_PACK_CODE => 'LLLSSSSSCCqCCCCCCCCCCllLL';
+use constant HEAPTUPLE_PACK_CODE => 'LLLSSSSSCCLLCCCCCCCCCCllLL';
 use constant HEAPTUPLE_PACK_LENGTH => 58;     # Total size
 
 # Read a tuple of our table from a heap page.
@@ -112,7 +111,8 @@ sub read_tuple
 			t_infomask => shift,
 			t_hoff => shift,
 			t_bits => shift,
-			a => shift,
+			a_1 => shift,
+			a_2 => shift,
 			b_header => shift,
 			b_body1 => shift,
 			b_body2 => shift,
@@ -156,7 +156,8 @@ sub write_tuple
 					$tup->{t_infomask},
 					$tup->{t_hoff},
 					$tup->{t_bits},
-					$tup->{a},
+					$tup->{a_1},
+					$tup->{a_2},
 					$tup->{b_header},
 					$tup->{b_body1},
 					$tup->{b_body2},
@@ -227,7 +228,7 @@ use constant ROWCOUNT => 16;
 $node->safe_psql('postgres', qq(
 	INSERT INTO public.test (a, b, c)
 		VALUES (
-			12345678,
+			0,
 			'abcdefg',
 			repeat('w', 10000)
 		);
@@ -275,13 +276,14 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 	my $tup = read_tuple($file, $offset);
 
 	# Sanity-check that the data appears on the page where we expect.
-	my $a = $tup->{a};
+	my $a_1 = $tup->{a_1};
+	my $a_2 = $tup->{a_2};
 	my $b = $tup->{b};
-	if ($a ne '12345678' || $b ne 'abcdefg')
+	if ($a_1 ne '0' || $a_2 ne '0' || $b ne 'abcdefg')
 	{
 		close($file);  # ignore errors on close; we're exiting anyway
 		$node->clean_node;
-		plan skip_all => qq(Page layout differs from our expectations: expected (12345678, "abcdefg"), got ($a, "$b"));
+		plan skip_all => qq(Page layout differs from our expectations: expected (0, 0, "abcdefg"), got ($a_1, $a_2, "$b"));
 		exit;
 	}
 
-- 
2.21.1 (Apple Git-122.3)

#55Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Dilger (#52)
Re: pg_amcheck contrib application

Mark Dilger <mark.dilger@enterprisedb.com> writes:

On Mar 12, 2021, at 10:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Coping with both endiannesses might be painful.

Not too bad if the bigint value is zero, as both the low and high 32 bits will be zero, regardless of endianness. The question is whether that gives up too much in terms of what the test is trying to do. I'm not sure that it does, but if you'd rather solve this by upgrading perl, that's ok by me.

I don't mind updating the perl installations on prairiedog and gaur,
but Noah might have some difficulty with his AIX flotilla, as I believe
he's not sysadmin there.

You might think about using some symmetric-but-not-zero value,
0x01010101 or the like.

regards, tom lane

#56Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tom Lane (#55)
Re: pg_amcheck contrib application

On Mar 12, 2021, at 10:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Mark Dilger <mark.dilger@enterprisedb.com> writes:

On Mar 12, 2021, at 10:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Coping with both endiannesses might be painful.

Not too bad if the bigint value is zero, as both the low and high 32bits will be zero, regardless of endianness. The question is whether that gives up too much in terms of what the test is trying to do. I'm not sure that it does, but if you'd rather solve this by upgrading perl, that's ok by me.

I don't mind updating the perl installations on prairiedog and gaur,
but Noah might have some difficulty with his AIX flotilla, as I believe
he's not sysadmin there.

You might think about using some symmetric-but-not-zero value,
0x01010101 or the like.

I thought about that, but I'm not sure that it proves much more than just using zero. The test doesn't really do much of interest with this value, and it doesn't seem worth complicating the test. The idea originally was that perl's "q" pack code would make reading/writing a number such as 12345678 easy, but since that turned out not to be portable, zero is the simple alternative.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#57Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Dilger (#56)
Re: pg_amcheck contrib application

Mark Dilger <mark.dilger@enterprisedb.com> writes:

On Mar 12, 2021, at 10:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

You might think about using some symmetric-but-not-zero value,
0x01010101 or the like.

I thought about that, but I'm not sure that it proves much more than just using zero.

Perhaps not. I haven't really looked at any of this code, so I'll defer
to Robert's judgment about whether this represents an interesting testing
issue.

regards, tom lane

#58Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#55)
Re: pg_amcheck contrib application

On Sat, Mar 13, 2021 at 01:36:11AM -0500, Tom Lane wrote:

Mark Dilger <mark.dilger@enterprisedb.com> writes:

On Mar 12, 2021, at 10:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Coping with both endiannesses might be painful.

Not too bad if the bigint value is zero, as both the low and high 32 bits will be zero, regardless of endianness. The question is whether that gives up too much in terms of what the test is trying to do. I'm not sure that it does, but if you'd rather solve this by upgrading perl, that's ok by me.

I don't mind updating the perl installations on prairiedog and gaur,
but Noah might have some difficulty with his AIX flotilla, as I believe
he's not sysadmin there.

The AIX animals have Perl v5.28.1. hoverfly, in particular, got a big update
package less than a month ago. Hence, I doubt it's a question of applying
routine updates. The puzzle would be to either (a) compile a 32-bit Perl that
handles unpack('q') or (b) try a PostgreSQL configuration like "./configure
... PROVE='perl64 /usr/bin/prove --'" to run the TAP suites under perl64.
(For hoverfly, it's enough to run "prove" under $PERL. mandrill, however,
needs a 32-bit $PERL for plperl, regardless of what it needs for "prove".)
Future AIX packagers would face doing the same.

With v5-0001-pg_amcheck-continuing-to-fix-portability-problems.patch being so
self-contained, something like it is the way to go.

#59Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#58)
Re: pg_amcheck contrib application

Noah Misch <noah@leadboat.com> writes:

On Sat, Mar 13, 2021 at 01:36:11AM -0500, Tom Lane wrote:

I don't mind updating the perl installations on prairiedog and gaur,
but Noah might have some difficulty with his AIX flotilla, as I believe
he's not sysadmin there.

The AIX animals have Perl v5.28.1. hoverfly, in particular, got a big update
package less than a month ago. Hence, I doubt it's a question of applying
routine updates. The puzzle would be to either (a) compile a 32-bit Perl that
handles unpack('q') or (b) try a PostgreSQL configuration like "./configure
... PROVE='perl64 /usr/bin/prove --'" to run the TAP suites under perl64.
(For hoverfly, it's enough to run "prove" under $PERL. mandrill, however,
needs a 32-bit $PERL for plperl, regardless of what it needs for "prove".)
Future AIX packagers would face doing the same.

Yeah. prairiedog and gaur are both frankenstein systems: some of the
software components are years newer than the underlying OS. So I don't
mind changing them further; in the end they're both in the buildfarm
for reasons of hardware diversity, not because they represent platforms
anyone would run production PG on. On the other hand, those AIX animals
represent systems that are still considered production grade (no?), so
we have to be willing to adapt to them not vice versa.

With v5-0001-pg_amcheck-continuing-to-fix-portability-problems.patch being so
self-contained, something like it is the way to go.

I'm only concerned about whether the all-zero value causes any
significant degradation in test quality. Probably Peter G. would
have the most informed opinion about that.

regards, tom lane

#60Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#56)
Re: pg_amcheck contrib application

On Sat, Mar 13, 2021 at 1:55 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

I thought about that, but I'm not sure that it proves much more than just using zero. The test doesn't really do much of interest with this value, and it doesn't seem worth complicating the test. The idea originally was that perl's "q" pack code would make reading/writing a number such as 12345678 easy, but since it's not easy, this is easy.

I think it would be good to use a non-zero value here. We're doing a
lot of poking into raw bytes here, and if something goes wrong, a zero
value is more likely to look like something normal than whatever else.
I suggest picking a value where all 8 bytes are the same, but not
zero, and ideally chosen so that they don't look much like any of the
surrounding bytes.

--
Robert Haas
EDB: http://www.enterprisedb.com

#61Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#60)
Re: pg_amcheck contrib application

Robert Haas <robertmhaas@gmail.com> writes:

I think it would be good to use a non-zero value here. We're doing a
lot of poking into raw bytes here, and if something goes wrong, a zero
value is more likely to look like something normal than whatever else.
I suggest picking a value where all 8 bytes are the same, but not
zero, and ideally chosen so that they don't look much like any of the
surrounding bytes.

Actually, it seems like we can let pack/unpack deal with byte-swapping
within 32-bit words; what we lose by not using 'q' format is just the
ability to correctly swap the two 32-bit words. Hence, any value in
which upper and lower halves are the same should work, say
0x1234567812345678.

regards, tom lane

#62Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tom Lane (#61)
1 attachment(s)
Re: pg_amcheck contrib application

On Mar 13, 2021, at 6:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

I think it would be good to use a non-zero value here. We're doing a
lot of poking into raw bytes here, and if something goes wrong, a zero
value is more likely to look like something normal than whatever else.
I suggest picking a value where all 8 bytes are the same, but not
zero, and ideally chosen so that they don't look much like any of the
surrounding bytes.

Actually, it seems like we can let pack/unpack deal with byte-swapping
within 32-bit words; what we lose by not using 'q' format is just the
ability to correctly swap the two 32-bit words. Hence, any value in
which upper and lower halves are the same should work, say
0x1234567812345678.

regards, tom lane

The heap tuple in question looks as follows, with ???????? in the spot we're debating:

Tuple Layout (values in hex):

t_xmin: 223
t_xmax: 0
t_field3: 0
bi_hi: 0
bi_lo: 0
ip_posid: 1
t_infomask2: 3
t_infomask: b06
t_hoff: 18
t_bits: 0
a_1: ????????
a_2: ????????
b_header: 11 # little-endian, will be 88 on big endian
b_body1: 61
b_body2: 62
b_body3: 63
b_body4: 64
b_body5: 65
b_body6: 66
b_body7: 67
c_va_header: 1
c_va_vartag: 12
c_va_rawsize: 2714
c_va_extsize: 2710

valueid and toastrelid are not shown, as they won't be stable. Relying on t_xmin to be stable makes the test brittle, but fortunately that is separated from a_1 and a_2 far enough that we should not need to worry about it.

We want to use values that don't look like any of the others. The complete set of nibbles in the values above is [012345678B], leaving the set [9ACDEF] unused. The attached patch uses the value DEADF9F9 as it seems a little easier to remember than other permutations of those nibbles.

Attachments:

v6-0001-pg_amcheck-continuing-to-fix-portability-problems.patch (application/octet-stream)
From be992e3f65fdc4d0ff0b80512734b25d78e1b0c3 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Fri, 12 Mar 2021 22:30:48 -0800
Subject: [PATCH v6] pg_amcheck: continuing to fix portability problems

Fixing a problem in the tests depending on Q/q perl pack codes,
which do not work on all versions of perl across all platforms.
---
 src/bin/pg_amcheck/t/004_verify_heapam.pl | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/src/bin/pg_amcheck/t/004_verify_heapam.pl b/src/bin/pg_amcheck/t/004_verify_heapam.pl
index 4cb34a0e01..9f7114cd62 100644
--- a/src/bin/pg_amcheck/t/004_verify_heapam.pl
+++ b/src/bin/pg_amcheck/t/004_verify_heapam.pl
@@ -57,7 +57,6 @@ use Test::More;
 #	L = "Unsigned 32-bit Long",
 #	S = "Unsigned 16-bit Short",
 #	C = "Unsigned 8-bit Octet",
-#	q = "signed 64-bit quadword"
 #
 # Each tuple in our table has a layout as follows:
 #
@@ -71,7 +70,7 @@ use Test::More;
 #    xx xx              t_infomask: xx			offset = 20		S
 #    xx                     t_hoff: x			offset = 22		C
 #    xx                     t_bits: x			offset = 23		C
-#    xx xx xx xx xx xx xx xx   'a': xxxxxxxx	offset = 24		q
+#    xx xx xx xx xx xx xx xx   'a': xxxxxxxx	offset = 24		LL
 #    xx xx xx xx xx xx xx xx   'b': xxxxxxxx	offset = 32		CCCCCCCC
 #    xx xx xx xx xx xx xx xx   'c': xxxxxxxx	offset = 40		CCllLL
 #    xx xx xx xx xx xx xx xx      : xxxxxxxx	 ...continued
@@ -81,7 +80,7 @@ use Test::More;
 # it is convenient enough to do it this way.  We define packing code
 # constants here, where they can be compared easily against the layout.
 
-use constant HEAPTUPLE_PACK_CODE => 'LLLSSSSSCCqCCCCCCCCCCllLL';
+use constant HEAPTUPLE_PACK_CODE => 'LLLSSSSSCCLLCCCCCCCCCCllLL';
 use constant HEAPTUPLE_PACK_LENGTH => 58;     # Total size
 
 # Read a tuple of our table from a heap page.
@@ -112,7 +111,8 @@ sub read_tuple
 			t_infomask => shift,
 			t_hoff => shift,
 			t_bits => shift,
-			a => shift,
+			a_1 => shift,
+			a_2 => shift,
 			b_header => shift,
 			b_body1 => shift,
 			b_body2 => shift,
@@ -156,7 +156,8 @@ sub write_tuple
 					$tup->{t_infomask},
 					$tup->{t_hoff},
 					$tup->{t_bits},
-					$tup->{a},
+					$tup->{a_1},
+					$tup->{a_2},
 					$tup->{b_header},
 					$tup->{b_body1},
 					$tup->{b_body2},
@@ -227,7 +228,7 @@ use constant ROWCOUNT => 16;
 $node->safe_psql('postgres', qq(
 	INSERT INTO public.test (a, b, c)
 		VALUES (
-			12345678,
+			x'DEADF9F9DEADF9F9'::bigint,
 			'abcdefg',
 			repeat('w', 10000)
 		);
@@ -275,13 +276,15 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 	my $tup = read_tuple($file, $offset);
 
 	# Sanity-check that the data appears on the page where we expect.
-	my $a = $tup->{a};
+	my $a_1 = $tup->{a_1};
+	my $a_2 = $tup->{a_2};
 	my $b = $tup->{b};
-	if ($a ne '12345678' || $b ne 'abcdefg')
+	if ($a_1 != 0xDEADF9F9 || $a_2 != 0xDEADF9F9 || $b ne 'abcdefg')
 	{
 		close($file);  # ignore errors on close; we're exiting anyway
 		$node->clean_node;
-		plan skip_all => qq(Page layout differs from our expectations: expected (12345678, "abcdefg"), got ($a, "$b"));
+		plan skip_all => sprintf("Page layout differs from our expectations: expected (%x, %x, \"%s\"), got (%x, %x, \"%s\")",
+								 0xDEADF9F9, 0xDEADF9F9, "abcdefg", $a_1, $a_2, $b);
 		exit;
 	}
 
-- 
2.21.1 (Apple Git-122.3)

#63Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#62)
Re: pg_amcheck contrib application

On Sat, Mar 13, 2021 at 10:20 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

We want to use values that don't look like any of the others. The complete set of nibbles in the values above is [012345678B], leaving the set [9ACDEF] unused. The attached patch uses the value DEADF9F9 as it seems a little easier to remember than other permutations of those nibbles.

OK, committed. The nibbles seem less relevant than the bytes as a
whole, but that's fine.

--
Robert Haas
EDB: http://www.enterprisedb.com

#64Noah Misch
noah@leadboat.com
In reply to: Mark Dilger (#39)
Re: pg_amcheck contrib application

On Fri, Mar 12, 2021 at 05:04:09PM -0800, Mark Dilger wrote:

On Mar 12, 2021, at 3:24 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

and the second deals with an apparent problem with IPC::Run shell expanding an asterisk on some platforms but not others. That second one, if true, seems like a problem with scope beyond the pg_amcheck project, as TestLib::command_checks_all uses IPC::Run, and it would be desirable to have consistent behavior across platforms.

One of pg_amcheck's regression tests was passing an asterisk through
TestLib's command_checks_all() command, which gets through to
pg_amcheck without difficulty on most platforms, but appears to get
shell expanded on Windows (jacana) and AIX (hoverfly).

For posterity, I can't reproduce this on hoverfly. The suite fails the same
way at f371a4c and f371a4c^. More recently (commit 58f5749), the suite passes
even after reverting f371a4c. Self-contained IPC::Run usage also does not
corroborate the theory:

[nm@power8-aix 8:0 2021-03-13T18:32:23 clean 0]$ perl -MIPC::Run -e 'IPC::Run::run "printf", "%s\n", "*"'
*
[nm@power8-aix 8:0 2021-03-13T18:32:29 clean 0]$ perl -MIPC::Run -e 'IPC::Run::run "sh", "-c", "printf %s\\\\n *"'
COPYRIGHT
GNUmakefile.in
HISTORY
Makefile
README
README.git
aclocal.m4
config
configure
configure.ac
contrib
doc
src

there is a similar symptom caused by an unrelated problem

Subject: [PATCH v3] Avoid use of non-portable option ordering in
command_checks_all().

On a glibc system, "env POSIXLY_CORRECT=1 make check ..." tests this.

#65Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Noah Misch (#64)
Re: pg_amcheck contrib application

On Mar 13, 2021, at 10:46 AM, Noah Misch <noah@leadboat.com> wrote:

On Fri, Mar 12, 2021 at 05:04:09PM -0800, Mark Dilger wrote:

On Mar 12, 2021, at 3:24 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

and the second deals with an apparent problem with IPC::Run shell expanding an asterisk on some platforms but not others. That second one, if true, seems like a problem with scope beyond the pg_amcheck project, as TestLib::command_checks_all uses IPC::Run, and it would be desirable to have consistent behavior across platforms.

One of pg_amcheck's regression tests was passing an asterisk through
TestLib's command_checks_all() command, which gets through to
pg_amcheck without difficulty on most platforms, but appears to get
shell expanded on Windows (jacana) and AIX (hoverfly).

For posterity, I can't reproduce this on hoverfly. The suite fails the same
way at f371a4c and f371a4c^. More recently (commit 58f5749), the suite passes
even after reverting f371a4c. Self-contained IPC::Run usage also does not
corroborate the theory:

[nm@power8-aix 8:0 2021-03-13T18:32:23 clean 0]$ perl -MIPC::Run -e 'IPC::Run::run "printf", "%s\n", "*"'
*
[nm@power8-aix 8:0 2021-03-13T18:32:29 clean 0]$ perl -MIPC::Run -e 'IPC::Run::run "sh", "-c", "printf %s\\\\n *"'
COPYRIGHT
GNUmakefile.in
HISTORY
Makefile
README
README.git
aclocal.m4
config
configure
configure.ac
contrib
doc
src

there is a similar symptom caused by an unrelated problem

Subject: [PATCH v3] Avoid use of non-portable option ordering in
command_checks_all().

On a glibc system, "env POSIXLY_CORRECT=1 make check ..." tests this.

Thanks for investigating!

The reason I suspected that the '*' was being expanded while passing through IPC::Run had to do with the error that pg_amcheck gave. It complained that too many arguments were being passed to it, and that the first such argument was "pg_amcheck.c". There's no reason pg_amcheck should know its source file name, nor that the regression test should know that, which suggested that the asterisk was being shell-expanded within the src/bin/pg_amcheck/ directory and the file listing was being passed into pg_amcheck as arguments.

That theory may have been wrong, but it was the only theory I had at the time. I don't have access to any of the machines where that happened, so it is hard for me to investigate further.
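The expansion behavior in question is easy to demonstrate in isolation with a minimal shell session (the scratch directory and the file names here are made up for illustration): globbing happens only when a shell gets a chance to interpret the argument, not when an argument list is handed to a program directly.

```shell
# Scratch directory with a couple of files, standing in for
# src/bin/pg_amcheck/ (names are hypothetical).
demo=$(mktemp -d)
cd "$demo"
touch pg_amcheck.c nodes.c

# Argument passed verbatim (quoted): the program receives the literal '*'.
printf '%s\n' '*'
# prints: *

# Argument handed to a shell unquoted: the glob is expanded against
# the current directory, and the file listing becomes the arguments.
sh -c 'printf "%s\n" *'
# prints: nodes.c
# prints: pg_amcheck.c
```

This matches the symptom described above: a command runner that interposes a shell on one platform but not another would produce exactly this divergence.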


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#66Noah Misch
noah@leadboat.com
In reply to: Mark Dilger (#65)
Re: pg_amcheck contrib application

On Sat, Mar 13, 2021 at 10:51:27AM -0800, Mark Dilger wrote:

On Mar 13, 2021, at 10:46 AM, Noah Misch <noah@leadboat.com> wrote:
On Fri, Mar 12, 2021 at 05:04:09PM -0800, Mark Dilger wrote:

On Mar 12, 2021, at 3:24 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
and the second deals with an apparent problem with IPC::Run shell expanding an asterisk on some platforms but not others. That second one, if true, seems like a problem with scope beyond the pg_amcheck project, as TestLib::command_checks_all uses IPC::Run, and it would be desirable to have consistent behavior across platforms.

One of pg_amcheck's regression tests was passing an asterisk through
TestLib's command_checks_all() command, which gets through to
pg_amcheck without difficulty on most platforms, but appears to get
shell expanded on Windows (jacana) and AIX (hoverfly).

For posterity, I can't reproduce this on hoverfly. The suite fails the same
way at f371a4c and f371a4c^. More recently (commit 58f5749), the suite passes
even after reverting f371a4c. Self-contained IPC::Run usage also does not
corroborate the theory:

[nm@power8-aix 8:0 2021-03-13T18:32:23 clean 0]$ perl -MIPC::Run -e 'IPC::Run::run "printf", "%s\n", "*"'
*
[nm@power8-aix 8:0 2021-03-13T18:32:29 clean 0]$ perl -MIPC::Run -e 'IPC::Run::run "sh", "-c", "printf %s\\\\n *"'
COPYRIGHT
GNUmakefile.in
HISTORY
Makefile
README
README.git
aclocal.m4
config
configure
configure.ac
contrib
doc
src

The reason I suspected that the '*' was being expanded while passing through IPC::Run had to do with the error that pg_amcheck gave. It complained that too many arguments were being passed to it, and that the first such argument was "pg_amcheck.c". There's no reason pg_amcheck should know its source file name, nor that the regression test should know that, which suggested that the asterisk was being shell-expanded within the src/bin/pg_amcheck/ directory and the file listing was being passed into pg_amcheck as arguments.

I agree. I can reproduce the problem on Windows. Commit f371a4c fixed
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-03-12%2020%3A12%3A44
and I see logs of that kind of failure only on fairywren and jacana.

#67Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noah Misch (#66)
Re: pg_amcheck contrib application

Looks like we're not quite out of the woods, as hornet and tern are
still unhappy:

# Failed test 'pg_amcheck excluding all corrupt schemas status (got 2 vs expected 0)'
# at t/003_check.pl line 498.

# Failed test 'pg_amcheck excluding all corrupt schemas stdout /(?^:^$)/'
# at t/003_check.pl line 498.
# 'heap table "db1"."pg_catalog"."pg_statistic", block 2, offset 1, attribute 27:
# final toast chunk number 0 differs from expected value 1
# heap table "db1"."pg_catalog"."pg_statistic", block 2, offset 1, attribute 27:
# toasted value for attribute 27 missing from toast table
# '
# doesn't match '(?^:^$)'
# Looks like you failed 2 tests of 60.
[12:18:06] t/003_check.pl ...........
Dubious, test returned 2 (wstat 512, 0x200)
Failed 2/60 subtests

These animals have somewhat weird alignment properties: MAXALIGN is 8
but ALIGNOF_DOUBLE is only 4. I speculate that that is affecting their
choices about whether an out-of-line TOAST value is needed, breaking
this test case.

regards, tom lane

#68Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tom Lane (#67)
Re: pg_amcheck contrib application

On Mar 15, 2021, at 10:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Looks like we're not quite out of the woods, as hornet and tern are
still unhappy:

# Failed test 'pg_amcheck excluding all corrupt schemas status (got 2 vs expected 0)'
# at t/003_check.pl line 498.

# Failed test 'pg_amcheck excluding all corrupt schemas stdout /(?^:^$)/'
# at t/003_check.pl line 498.
# 'heap table "db1"."pg_catalog"."pg_statistic", block 2, offset 1, attribute 27:
# final toast chunk number 0 differs from expected value 1
# heap table "db1"."pg_catalog"."pg_statistic", block 2, offset 1, attribute 27:
# toasted value for attribute 27 missing from toast table
# '
# doesn't match '(?^:^$)'
# Looks like you failed 2 tests of 60.
[12:18:06] t/003_check.pl ...........
Dubious, test returned 2 (wstat 512, 0x200)
Failed 2/60 subtests

These animals have somewhat weird alignment properties: MAXALIGN is 8
but ALIGNOF_DOUBLE is only 4. I speculate that that is affecting their
choices about whether an out-of-line TOAST value is needed, breaking
this test case.

The pg_amcheck test case is not corrupting any pg_catalog tables, but contrib/amcheck/verify_heapam is complaining about a corruption in pg_catalog.pg_statistic.

The logic in verify_heapam only looks for a value in the toast table if the tuple it gets from the main table (in this case, from pg_statistic) has an attribute that claims to be toasted. The error message we're seeing that you quoted above simply means that no entry exists in the toast table. The bit about "final toast chunk number 0 differs from expected value 1" is super unhelpful, as what it is really saying is that there were no chunks found. I should submit a patch to not print that message in cases where the attribute is missing from the toast table.

Is it possible that pg_statistic really is corrupt here, and that this is not a bug in pg_amcheck? It's not like we've been checking for corruption in the build farm up till now. I notice that this test, as well as test 005_opclass_damage.pl, neglects to turn off autovacuum for the test node. So maybe the corruptions are getting propagated during autovacuum? This is just a guess, but I will submit a patch that turns off autovacuum for the test node shortly.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#69Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Mark Dilger (#68)
1 attachment(s)
Re: pg_amcheck contrib application
Show quoted text

On Mar 15, 2021, at 11:11 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

I will submit a patch that turns off autovacuum for the test node shortly.

Attachments:

v5-0001-Turning-off-autovacuum-during-corruption-tests.patchapplication/octet-stream; name=v5-0001-Turning-off-autovacuum-during-corruption-tests.patch; x-unix-mode=0644Download
From 5cee9f41862df3acf914cdce200a51a67b322c44 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Mon, 15 Mar 2021 11:33:23 -0700
Subject: [PATCH v5] Turning off autovacuum during corruption tests.

Tests which intentionally corrupt relations should not run with
autovacuum turned on, as autovacuum may do unpredictable things when
processing corrupt tables.  Two such tests were committed with
autovacuum on, which was an oversight noticed while investigating
failures observed on build farm animals "hornet" and "tern".
---
 src/bin/pg_amcheck/t/003_check.pl          | 1 +
 src/bin/pg_amcheck/t/005_opclass_damage.pl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/src/bin/pg_amcheck/t/003_check.pl b/src/bin/pg_amcheck/t/003_check.pl
index 54b2b86a49..10a70b4ae3 100644
--- a/src/bin/pg_amcheck/t/003_check.pl
+++ b/src/bin/pg_amcheck/t/003_check.pl
@@ -117,6 +117,7 @@ sub perform_all_corruptions()
 # Test set-up
 $node = get_new_node('test');
 $node->init;
+$node->append_conf('postgresql.conf', 'autovacuum=off');
 $node->start;
 $port = $node->port;
 
diff --git a/src/bin/pg_amcheck/t/005_opclass_damage.pl b/src/bin/pg_amcheck/t/005_opclass_damage.pl
index eba8ea9cae..28a5a2d35f 100644
--- a/src/bin/pg_amcheck/t/005_opclass_damage.pl
+++ b/src/bin/pg_amcheck/t/005_opclass_damage.pl
@@ -9,6 +9,7 @@ use Test::More tests => 5;
 
 my $node = get_new_node('test');
 $node->init;
+$node->append_conf('postgresql.conf', 'autovacuum=off');
 $node->start;
 
 # Create a custom operator class and an index which uses it.
-- 
2.21.1 (Apple Git-122.3)

#70Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Dilger (#68)
Re: pg_amcheck contrib application

Mark Dilger <mark.dilger@enterprisedb.com> writes:

On Mar 15, 2021, at 10:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

These animals have somewhat weird alignment properties: MAXALIGN is 8
but ALIGNOF_DOUBLE is only 4. I speculate that that is affecting their
choices about whether an out-of-line TOAST value is needed, breaking
this test case.

The logic in verify_heapam only looks for a value in the toast table if
the tuple it gets from the main table (in this case, from pg_statistic)
has an attribute that claims to be toasted. The error message we're
seeing that you quoted above simply means that no entry exists in the
toast table.

Yeah, that could be phrased better. Do we have a strong enough lock on
the table under examination to be sure that autovacuum couldn't remove
a dead toast entry before we reach it? But this would only be an
issue if we are trying to check validity of toasted fields within
known-dead tuples, which I would argue we shouldn't, since lock or
no lock there's no guarantee the toast entry is still there.

Not sure that I believe the theory that this is from bad luck of
concurrent autovacuum timing, though. The fact that we're seeing
this on just those two animals suggests strongly to me that it's
architecture-correlated, instead.

regards, tom lane

#71Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tom Lane (#70)
Re: pg_amcheck contrib application

On Mar 15, 2021, at 11:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Do we have a strong enough lock on
the table under examination to be sure that autovacuum couldn't remove
a dead toast entry before we reach it?

The main table and the toast table are only locked with AccessShareLock. Each page in the main table is locked with BUFFER_LOCK_SHARE. Toast is not checked unless the tuple passes visibility checks verifying the tuple is not dead.

But this would only be an
issue if we are trying to check validity of toasted fields within
known-dead tuples, which I would argue we shouldn't, since lock or
no lock there's no guarantee the toast entry is still there.

It does not intentionally check toasted fields within dead tuples. If that is happening, it's a bug, possibly in the visibility function. But I'm not seeing a specific reason to assume that is the issue. If we still see the complaint on tern or hornet after committing the patch to turn off autovacuum, we will be able to rule out the theory that the toast was removed by autovacuum.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#72Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tom Lane (#70)
2 attachment(s)
Re: pg_amcheck contrib application

On Mar 15, 2021, at 11:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Yeah, that could be phrased better.

Attaching the 0001 patch submitted earlier, plus 0002 which fixes the confusing corruption message.

Attachments:

v6-0001-Turning-off-autovacuum-during-corruption-tests.patchapplication/octet-stream; name=v6-0001-Turning-off-autovacuum-during-corruption-tests.patch; x-unix-mode=0644Download
From 4a78efd8dbfec08bf2af6ce1ee0a5c1dcf86794d Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Mon, 15 Mar 2021 11:33:23 -0700
Subject: [PATCH v6 1/2] Turning off autovacuum during corruption tests.

Tests which intentionally corrupt relations should not run with
autovacuum turned on, as autovacuum may do unpredictable things when
processing corrupt tables.  Two such tests were committed with
autovacuum on, which was an oversight noticed while investigating
failures observed on build farm animals "hornet" and "tern".
---
 src/bin/pg_amcheck/t/003_check.pl          | 1 +
 src/bin/pg_amcheck/t/005_opclass_damage.pl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/src/bin/pg_amcheck/t/003_check.pl b/src/bin/pg_amcheck/t/003_check.pl
index 54b2b86a49..10a70b4ae3 100644
--- a/src/bin/pg_amcheck/t/003_check.pl
+++ b/src/bin/pg_amcheck/t/003_check.pl
@@ -117,6 +117,7 @@ sub perform_all_corruptions()
 # Test set-up
 $node = get_new_node('test');
 $node->init;
+$node->append_conf('postgresql.conf', 'autovacuum=off');
 $node->start;
 $port = $node->port;
 
diff --git a/src/bin/pg_amcheck/t/005_opclass_damage.pl b/src/bin/pg_amcheck/t/005_opclass_damage.pl
index eba8ea9cae..28a5a2d35f 100644
--- a/src/bin/pg_amcheck/t/005_opclass_damage.pl
+++ b/src/bin/pg_amcheck/t/005_opclass_damage.pl
@@ -9,6 +9,7 @@ use Test::More tests => 5;
 
 my $node = get_new_node('test');
 $node->init;
+$node->append_conf('postgresql.conf', 'autovacuum=off');
 $node->start;
 
 # Create a custom operator class and an index which uses it.
-- 
2.21.1 (Apple Git-122.3)

v6-0002-Fixing-a-confusing-amcheck-corruption-message.patchapplication/octet-stream; name=v6-0002-Fixing-a-confusing-amcheck-corruption-message.patch; x-unix-mode=0644Download
From 489087cc776f7166995b98245830af06aeeb9485 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Mon, 15 Mar 2021 12:22:22 -0700
Subject: [PATCH v6 2/2] Fixing a confusing amcheck corruption message.

The verify_heapam() function from contrib/amcheck reports a
corruption message when toasted values have the wrong number of
chunks in the toast table, but this was being reported even when the
toasted value is entirely missing from the toast table, which is
confusing.  Fixing it so it only prints a message about the toasted
value being missing from the table in such cases.
---
 contrib/amcheck/verify_heapam.c           | 8 ++++----
 src/bin/pg_amcheck/t/004_verify_heapam.pl | 3 +--
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 49f5ca0ef2..e614c12a14 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -1100,14 +1100,14 @@ check_tuple_attribute(HeapCheckContext *ctx)
 		check_toast_tuple(toasttup, ctx);
 		ctx->chunkno++;
 	}
-	if (ctx->chunkno != (ctx->endchunk + 1))
-		report_corruption(ctx,
-						  psprintf("final toast chunk number %u differs from expected value %u",
-								   ctx->chunkno, (ctx->endchunk + 1)));
 	if (!found_toasttup)
 		report_corruption(ctx,
 						  psprintf("toasted value for attribute %u missing from toast table",
 								   ctx->attnum));
+	else if (ctx->chunkno != (ctx->endchunk + 1))
+		report_corruption(ctx,
+						  psprintf("final toast chunk number %u differs from expected value %u",
+								   ctx->chunkno, (ctx->endchunk + 1)));
 	systable_endscan_ordered(toastscan);
 
 	return true;
diff --git a/src/bin/pg_amcheck/t/004_verify_heapam.pl b/src/bin/pg_amcheck/t/004_verify_heapam.pl
index 9f7114cd62..16574cb1f8 100644
--- a/src/bin/pg_amcheck/t/004_verify_heapam.pl
+++ b/src/bin/pg_amcheck/t/004_verify_heapam.pl
@@ -296,7 +296,7 @@ close($file)
 $node->start;
 
 # Ok, Xids and page layout look ok.  We can run corruption tests.
-plan tests => 20;
+plan tests => 19;
 
 # Check that pg_amcheck runs against the uncorrupted table without error.
 $node->command_ok(['pg_amcheck', '-p', $port, 'postgres'],
@@ -489,7 +489,6 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 
 		$header = header(0, $offnum, 2);
 		push @expected,
-			qr/${header}final toast chunk number 0 differs from expected value \d+/,
 			qr/${header}toasted value for attribute 2 missing from toast table/;
 	}
 	elsif ($offnum == 14)
-- 
2.21.1 (Apple Git-122.3)

#73Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tom Lane (#70)
4 attachment(s)
Re: pg_amcheck contrib application

On Mar 15, 2021, at 11:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Not sure that I believe the theory that this is from bad luck of
concurrent autovacuum timing, though. The fact that we're seeing
this on just those two animals suggests strongly to me that it's
architecture-correlated, instead.

I find it a little hard to see how amcheck is tripping over a toasted value just in this one table, pg_statistic, and not in any of the others. The error message says the problem is in attribute 27, which I believe means it is stavalues2. The comment in the header for this catalog is intriguing:

/*
* Values in these arrays are values of the column's data type, or of some
* related type such as an array element type. We presently have to cheat
* quite a bit to allow polymorphic arrays of this kind, but perhaps
* someday it'll be a less bogus facility.
*/
anyarray stavalues1;
anyarray stavalues2;
anyarray stavalues3;
anyarray stavalues4;
anyarray stavalues5;

This is hard to duplicate in a test, because you can't normally create tables with pseudo-type columns. However, if amcheck is walking the tuple and does not correctly update the offset with the length of attribute 26, it may try to read attribute 27 at the wrong offset, unsurprisingly leading to garbage, perhaps a garbage toast pointer. The attached patch v7-0004 adds a check to verify_heapam to see if the va_toastrelid matches the expected toast table oid for the table we're reading. That check almost certainly should have been included in the initial version of verify_heapam, so even if it does nothing to help us in this issue, it's good that it be committed, I think.

It is unfortunate that the failing test only runs pg_amcheck after creating numerous corruptions, as we can't know if pg_amcheck would have complained about pg_statistic before the corruptions were created in other tables, or if it only does so after. The attached patch v7-0003 adds a call to pg_amcheck after all tables are created and populated, but before any corruptions are caused. This should help narrow down what is happening, and doesn't hurt to leave in place long-term.

I don't immediately see anything wrong with how pg_statistic uses a pseudo-type, but it leads me to want to poke a bit more at pg_statistic on hornet and tern, though I don't have any regression tests specifically for doing so.

Patches v7-0001 and v7-0002 are just repeats of the patches posted previously.

Attachments:

v7-0001-Turning-off-autovacuum-during-corruption-tests.patchapplication/octet-stream; name=v7-0001-Turning-off-autovacuum-during-corruption-tests.patch; x-unix-mode=0644Download
From b83a20e1844fdae4da3c196a1f90dc73859ea314 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Mon, 15 Mar 2021 11:33:23 -0700
Subject: [PATCH v7 1/4] Turning off autovacuum during corruption tests.

Tests which intentionally corrupt relations should not run with
autovacuum turned on, as autovacuum may do unpredictable things when
processing corrupt tables.  Two such tests were committed with
autovacuum on, which was an oversight noticed while investigating
failures observed on build farm animals "hornet" and "tern".
---
 src/bin/pg_amcheck/t/003_check.pl          | 1 +
 src/bin/pg_amcheck/t/005_opclass_damage.pl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/src/bin/pg_amcheck/t/003_check.pl b/src/bin/pg_amcheck/t/003_check.pl
index 54b2b86a49..10a70b4ae3 100644
--- a/src/bin/pg_amcheck/t/003_check.pl
+++ b/src/bin/pg_amcheck/t/003_check.pl
@@ -117,6 +117,7 @@ sub perform_all_corruptions()
 # Test set-up
 $node = get_new_node('test');
 $node->init;
+$node->append_conf('postgresql.conf', 'autovacuum=off');
 $node->start;
 $port = $node->port;
 
diff --git a/src/bin/pg_amcheck/t/005_opclass_damage.pl b/src/bin/pg_amcheck/t/005_opclass_damage.pl
index eba8ea9cae..28a5a2d35f 100644
--- a/src/bin/pg_amcheck/t/005_opclass_damage.pl
+++ b/src/bin/pg_amcheck/t/005_opclass_damage.pl
@@ -9,6 +9,7 @@ use Test::More tests => 5;
 
 my $node = get_new_node('test');
 $node->init;
+$node->append_conf('postgresql.conf', 'autovacuum=off');
 $node->start;
 
 # Create a custom operator class and an index which uses it.
-- 
2.21.1 (Apple Git-122.3)

v7-0002-Fixing-a-confusing-amcheck-corruption-message.patchapplication/octet-stream; name=v7-0002-Fixing-a-confusing-amcheck-corruption-message.patch; x-unix-mode=0644Download
From 4a6eb2cd515dd7fb83990e8ce8897d22ce3eb186 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Mon, 15 Mar 2021 12:22:22 -0700
Subject: [PATCH v7 2/4] Fixing a confusing amcheck corruption message.

The verify_heapam() function from contrib/amcheck reports a
corruption message when toasted values have the wrong number of
chunks in the toast table, but this was being reported even when the
toasted value is entirely missing from the toast table, which is
confusing.  Fixing it so it only prints a message about the toasted
value being missing from the table in such cases.
---
 contrib/amcheck/verify_heapam.c           | 8 ++++----
 src/bin/pg_amcheck/t/004_verify_heapam.pl | 3 +--
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 49f5ca0ef2..e614c12a14 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -1100,14 +1100,14 @@ check_tuple_attribute(HeapCheckContext *ctx)
 		check_toast_tuple(toasttup, ctx);
 		ctx->chunkno++;
 	}
-	if (ctx->chunkno != (ctx->endchunk + 1))
-		report_corruption(ctx,
-						  psprintf("final toast chunk number %u differs from expected value %u",
-								   ctx->chunkno, (ctx->endchunk + 1)));
 	if (!found_toasttup)
 		report_corruption(ctx,
 						  psprintf("toasted value for attribute %u missing from toast table",
 								   ctx->attnum));
+	else if (ctx->chunkno != (ctx->endchunk + 1))
+		report_corruption(ctx,
+						  psprintf("final toast chunk number %u differs from expected value %u",
+								   ctx->chunkno, (ctx->endchunk + 1)));
 	systable_endscan_ordered(toastscan);
 
 	return true;
diff --git a/src/bin/pg_amcheck/t/004_verify_heapam.pl b/src/bin/pg_amcheck/t/004_verify_heapam.pl
index 9f7114cd62..16574cb1f8 100644
--- a/src/bin/pg_amcheck/t/004_verify_heapam.pl
+++ b/src/bin/pg_amcheck/t/004_verify_heapam.pl
@@ -296,7 +296,7 @@ close($file)
 $node->start;
 
 # Ok, Xids and page layout look ok.  We can run corruption tests.
-plan tests => 20;
+plan tests => 19;
 
 # Check that pg_amcheck runs against the uncorrupted table without error.
 $node->command_ok(['pg_amcheck', '-p', $port, 'postgres'],
@@ -489,7 +489,6 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 
 		$header = header(0, $offnum, 2);
 		push @expected,
-			qr/${header}final toast chunk number 0 differs from expected value \d+/,
 			qr/${header}toasted value for attribute 2 missing from toast table/;
 	}
 	elsif ($offnum == 14)
-- 
2.21.1 (Apple Git-122.3)

v7-0003-Extend-pg_amcheck-test-suite.patchapplication/octet-stream; name=v7-0003-Extend-pg_amcheck-test-suite.patch; x-unix-mode=0644Download
From 9c41e2a9fb9cc7cb95d1a144fe091ae411328e8b Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Mon, 15 Mar 2021 15:01:43 -0700
Subject: [PATCH v7 3/4] Extend pg_amcheck test suite

One of the pg_amcheck regression tests creates numerous corruptions
and only then verifies that pg_amcheck reports all expected
corruption messages and no others.  This overlooks that if a bug
exists such that corruption is reported where none exists, it will
be harder to determine what went wrong if pg_amcheck is only run
after corruptions do exist.  Therefore, check that pg_amcheck reports
no corruptions early in the test case.
---
 src/bin/pg_amcheck/t/003_check.pl | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_amcheck/t/003_check.pl b/src/bin/pg_amcheck/t/003_check.pl
index 10a70b4ae3..66dd14e498 100644
--- a/src/bin/pg_amcheck/t/003_check.pl
+++ b/src/bin/pg_amcheck/t/003_check.pl
@@ -3,7 +3,7 @@ use warnings;
 
 use PostgresNode;
 use TestLib;
-use Test::More tests => 60;
+use Test::More tests => 63;
 
 my ($node, $port, %corrupt_page, %remove_relation);
 
@@ -309,11 +309,6 @@ plan_to_remove_relation_file('db2', 's1.t1_btree');
 # Leave 'db3' uncorrupted
 #
 
-# Perform the corruptions we planned above using only a single database restart.
-#
-perform_all_corruptions();
-
-
 # Standard first arguments to TestLib functions
 my @cmd = ('pg_amcheck', '--quiet', '-p', $port);
 
@@ -323,6 +318,22 @@ my $line_pointer_corruption_re = qr/line pointer/;
 my $missing_file_re = qr/could not open file ".*": No such file or directory/;
 my $index_missing_relation_fork_re = qr/index ".*" lacks a main relation fork/;
 
+# We have created test databases with tables populated with data, but have not
+# yet corrupted anything.  As such, we expect no corruption and verify that
+# none is reported
+#
+$node->command_checks_all(
+	[ @cmd, '-d', 'db1', '-d', 'db2', '-d', 'db3' ],
+	0,
+	[ $no_output_re ],
+	[ $no_output_re ],
+	'pg_amcheck prior to corruption');
+
+# Perform the corruptions we planned above using only a single database restart.
+#
+perform_all_corruptions();
+
+
 # Checking databases with amcheck installed and corrupt relations, pg_amcheck
 # command itself should return exit status = 2, because tables and indexes are
 # corrupt, not exit status = 1, which would mean the pg_amcheck command itself
-- 
2.21.1 (Apple Git-122.3)

v7-0004-Add-extra-check-of-toast-pointer-in-amcheck.patchapplication/octet-stream; name=v7-0004-Add-extra-check-of-toast-pointer-in-amcheck.patch; x-unix-mode=0644Download
From ed93785d2c724fef2b4d5c43bbbe926455857367 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Mon, 15 Mar 2021 17:26:47 -0700
Subject: [PATCH v7 4/4] Add extra check of toast pointer in amcheck

When toast pointers are being verified in verify_heapam, add a check
that the va_toastrelid field matches the oid of the toast table.
---
 contrib/amcheck/verify_heapam.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index e614c12a14..5ccae46375 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -1069,6 +1069,15 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	 */
 	VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
 
+	if (toast_pointer.va_toastrelid != ctx->rel->rd_rel->reltoastrelid)
+	{
+		report_corruption(ctx,
+						  psprintf("attribute %u points to relation with oid %u not the toast relation with oid %u",
+								   ctx->attnum, toast_pointer.va_toastrelid,
+								   ctx->rel->rd_rel->reltoastrelid));
+		return true;
+	}
+
 	ctx->attrsize = toast_pointer.va_extsize;
 	ctx->endchunk = (ctx->attrsize - 1) / TOAST_MAX_CHUNK_SIZE;
 	ctx->totalchunks = ctx->endchunk + 1;
-- 
2.21.1 (Apple Git-122.3)

#74Noah Misch
noah@leadboat.com
In reply to: Tom Lane (#70)
Re: pg_amcheck contrib application

On Mon, Mar 15, 2021 at 02:57:20PM -0400, Tom Lane wrote:

Mark Dilger <mark.dilger@enterprisedb.com> writes:

On Mar 15, 2021, at 10:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

These animals have somewhat weird alignment properties: MAXALIGN is 8
but ALIGNOF_DOUBLE is only 4. I speculate that that is affecting their
choices about whether an out-of-line TOAST value is needed, breaking
this test case.

That machine also has awful performance for filesystem metadata operations,
like open(O_CREAT). Its CPU and read()/write() performance are normal.

The logic in verify_heapam only looks for a value in the toast table if
the tuple it gets from the main table (in this case, from pg_statistic)
has an attribute that claims to be toasted. The error message we're
seeing that you quoted above simply means that no entry exists in the
toast table.

Yeah, that could be phrased better. Do we have a strong enough lock on
the table under examination to be sure that autovacuum couldn't remove
a dead toast entry before we reach it? But this would only be an
issue if we are trying to check validity of toasted fields within
known-dead tuples, which I would argue we shouldn't, since lock or
no lock there's no guarantee the toast entry is still there.

Not sure that I believe the theory that this is from bad luck of
concurrent autovacuum timing, though.

With autovacuum_naptime=1s, on hornet, the failure reproduced twice in twelve
runs. With v6-0001-Turning-off-autovacuum-during-corruption-tests.patch
applied, 196 runs all succeeded.

The fact that we're seeing
this on just those two animals suggests strongly to me that it's
architecture-correlated, instead.

That is possible.

#75Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Noah Misch (#74)
Re: pg_amcheck contrib application

On Mar 15, 2021, at 11:09 PM, Noah Misch <noah@leadboat.com> wrote:

Not sure that I believe the theory that this is from bad luck of
concurrent autovacuum timing, though.

With autovacuum_naptime=1s, on hornet, the failure reproduced twice in twelve
runs. With v6-0001-Turning-off-autovacuum-during-corruption-tests.patch
applied, 196 runs all succeeded.

The fact that we're seeing
this on just those two animals suggests strongly to me that it's
architecture-correlated, instead.

That is possible.

I think autovacuum simply triggers the bug, and is not the cause of the bug. If I turn autovacuum off and instead do an ANALYZE in each test database rather than performing the corruptions, I get reports about problems in pg_statistic. This is on my mac laptop. This rules out the theory that autovacuum is propagating corruptions into pg_statistic, and also the theory that it is architecture-dependent.

I'll investigate further.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#76Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Dilger (#75)
Re: pg_amcheck contrib application

Mark Dilger <mark.dilger@enterprisedb.com> writes:

I think autovacuum simply triggers the bug, and is not the cause of the bug. If I turn autovacuum off and instead do an ANALYZE in each test database rather than performing the corruptions, I get reports about problems in pg_statistic. This is on my mac laptop. This rules out the theory that autovacuum is propagating corruptions into pg_statistic, and also the theory that it is architecture-dependent.

I wonder whether amcheck is confused by the declaration of those columns
as "anyarray".

regards, tom lane

#77Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tom Lane (#76)
Re: pg_amcheck contrib application

On Mar 16, 2021, at 9:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Mark Dilger <mark.dilger@enterprisedb.com> writes:

I think autovacuum simply triggers the bug, and is not the cause of the bug. If I turn autovacuum off and instead do an ANALYZE in each test database rather than performing the corruptions, I get reports about problems in pg_statistic. This is on my mac laptop. This rules out the theory that autovacuum is propagating corruptions into pg_statistic, and also the theory that it is architecture-dependent.

I wonder whether amcheck is confused by the declaration of those columns
as "anyarray".

It uses attlen and attalign for the attribute, so that idea does make sense. It gets that via TupleDescAttr(RelationGetDescr(rel), attnum).


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#78Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Mark Dilger (#77)
1 attachment(s)
Re: pg_amcheck contrib application

On Mar 16, 2021, at 9:30 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

On Mar 16, 2021, at 9:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Mark Dilger <mark.dilger@enterprisedb.com> writes:

I think autovacuum simply triggers the bug, and is not the cause of the bug. If I turn autovacuum off and instead do an ANALYZE in each test database rather than performing the corruptions, I get reports about problems in pg_statistic. This is on my mac laptop. This rules out the theory that autovacuum is propagating corruptions into pg_statistic, and also the theory that it is architecture-dependent.

I wonder whether amcheck is confused by the declaration of those columns
as "anyarray".

It uses attlen and attalign for the attribute, so that idea does make sense. It gets that via TupleDescAttr(RelationGetDescr(rel), attnum).

Yeah, that looks related:

regression=# select attname, attlen, attnum, attalign from pg_attribute where attrelid = 2619 and attname like 'stavalue%';
  attname   | attlen | attnum | attalign
------------+--------+--------+----------
 stavalues1 |     -1 |     27 | d
 stavalues2 |     -1 |     28 | d
 stavalues3 |     -1 |     29 | d
 stavalues4 |     -1 |     30 | d
 stavalues5 |     -1 |     31 | d
(5 rows)

It shows them all as having attalign = 'd', but for some array types the alignment will be 'i', not 'd'. So it's lying a bit about the contents. But I'm now confused about why this caused problems on the two hosts where integer and double have the same alignment. It seems like that would be the one place where the bug would not happen, not the one place where it does.

I'm attaching a test that reliably reproduces this for me:

Attachments:

v8-0005-Adding-pg_amcheck-special-test-for-pg_statistic.patchapplication/octet-stream; name=v8-0005-Adding-pg_amcheck-special-test-for-pg_statistic.patch; x-unix-mode=0644Download
From 74b483dc67458acddd137a2d89e3bfe82d466833 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Tue, 16 Mar 2021 09:43:44 -0700
Subject: [PATCH v8 5/5] Adding pg_amcheck special test for pg_statistic

The pg_statistic catalog contains columns of type anyarray, which is
unusual.  Failures have been observed on buildfarm animals hornet and
tern when pg_amcheck checks pg_statistic.  These types of failures
are reproduced with this new test.
---
 src/bin/pg_amcheck/t/006_pg_statistic.pl | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_pg_statistic.pl

diff --git a/src/bin/pg_amcheck/t/006_pg_statistic.pl b/src/bin/pg_amcheck/t/006_pg_statistic.pl
new file mode 100644
index 0000000000..0fa84e118a
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_pg_statistic.pl
@@ -0,0 +1,20 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use Test::More tests => 3;
+
+# Test set-up
+my $node = get_new_node('test');
+$node->init;
+$node->start;
+
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(ANALYZE));
+
+$node->command_checks_all(
+	[ 'pg_amcheck','-t', 'pg_catalog.pg_statistic', 'postgres' ],
+	0,
+	[ qr/^$/ ],
+	[ qr/^$/ ],
+	'pg_amcheck pg_statistic');
-- 
2.21.1 (Apple Git-122.3)

#79Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#78)
Re: pg_amcheck contrib application

On Tue, Mar 16, 2021 at 12:51 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

Yeah, that looks related:

regression=# select attname, attlen, attnum, attalign from pg_attribute where attrelid = 2619 and attname like 'stavalue%';
  attname   | attlen | attnum | attalign
------------+--------+--------+----------
 stavalues1 |     -1 |     27 | d
 stavalues2 |     -1 |     28 | d
 stavalues3 |     -1 |     29 | d
 stavalues4 |     -1 |     30 | d
 stavalues5 |     -1 |     31 | d
(5 rows)

It shows them all as having attalign = 'd', but for some array types the alignment will be 'i', not 'd'. So it's lying a bit about the contents. But I'm now confused about why this caused problems on the two hosts where integer and double have the same alignment. It seems like that would be the one place where the bug would not happen, not the one place where it does.

Wait, so the value in the attalign column isn't the alignment we're
actually using? I can understand how we might generate tuples like
that, if we pass the actual type to construct_array(), but shouldn't
we then get garbage when we deform the tuple? I mean,
heap_deform_tuple() is going to get the alignment from the tuple
descriptor, which is a table property. It doesn't (and can't) know
what type of value is stored inside any of these fields for real,
right?

In other words, isn't this actually corruption, caused by a bug in our
code, and how have we not noticed it before now?

--
Robert Haas
EDB: http://www.enterprisedb.com

#80Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#79)
Re: pg_amcheck contrib application

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Mar 16, 2021 at 12:51 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

It shows them all as having attalign = 'd', but for some array types the alignment will be 'i', not 'd'. So it's lying a bit about the contents. But I'm now confused about why this caused problems on the two hosts where integer and double have the same alignment. It seems like that would be the one place where the bug would not happen, not the one place where it does.

Wait, so the value in the attalign column isn't the alignment we're
actually using? I can understand how we might generate tuples like
that, if we pass the actual type to construct_array(), but shouldn't
we then get garbage when we deform the tuple?

No. What should be happening there is that some arrays in the column
get larger alignment than they actually need, but that shouldn't cause
a problem (and has not done so, AFAIK, in the decades that it's been
like this). As you say, deforming the tuple is going to rely on the
table's tupdesc for alignment; it can't know what is in the data.

I'm not entirely sure what's going on, but I think coming at this
with the mindset that "amcheck has detected some corruption" is
just going to lead you astray. Almost certainly, it's "amcheck
is incorrectly claiming corruption". That might be from mis-decoding
a TOAST-referencing datum. (Too bad the message doesn't report the
TOAST OID it probed for, so we can see if that's sane or not.)

regards, tom lane

#81Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#80)
Re: pg_amcheck contrib application

... btw, I now see that tern and hornet are passing this test
at least as much as they're failing, proving that there's some
timing or random chance involved. That doesn't completely
eliminate the idea that there may be an architecture component
to the issue, but it sure reduces its credibility. I now
believe the theory that the triggering condition is an auto-analyze
happening at the right time, and populating pg_statistic with
some data that other runs don't see.

regards, tom lane

#82Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tom Lane (#80)
Re: pg_amcheck contrib application

On Mar 16, 2021, at 10:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Mar 16, 2021 at 12:51 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

It shows them all as having attalign = 'd', but for some array types the alignment will be 'i', not 'd'. So it's lying a bit about the contents. But I'm now confused about why this caused problems on the two hosts where integer and double have the same alignment. It seems like that would be the one place where the bug would not happen, not the one place where it does.

Wait, so the value in the attalign column isn't the alignment we're
actually using? I can understand how we might generate tuples like
that, if we pass the actual type to construct_array(), but shouldn't
we then get garbage when we deform the tuple?

No. What should be happening there is that some arrays in the column
get larger alignment than they actually need, but that shouldn't cause
a problem (and has not done so, AFAIK, in the decades that it's been
like this). As you say, deforming the tuple is going to rely on the
table's tupdesc for alignment; it can't know what is in the data.

I'm not entirely sure what's going on, but I think coming at this
with the mindset that "amcheck has detected some corruption" is
just going to lead you astray. Almost certainly, it's "amcheck
is incorrectly claiming corruption". That might be from mis-decoding
a TOAST-referencing datum. (Too bad the message doesn't report the
TOAST OID it probed for, so we can see if that's sane or not.)

I've added that and now get the toast pointer's va_valueid in the message:

mark.dilger@laptop280-ma-us amcheck % pg_amcheck -t "pg_catalog.pg_statistic" postgres
heap table "postgres"."pg_catalog"."pg_statistic", block 4, offset 2, attribute 29:
    toasted value id 13227 for attribute 29 missing from toast table
heap table "postgres"."pg_catalog"."pg_statistic", block 4, offset 5, attribute 27:
    toasted value id 13228 for attribute 27 missing from toast table

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 5ccae46375..a0be71bb7f 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -1111,8 +1111,8 @@ check_tuple_attribute(HeapCheckContext *ctx)
        }
        if (!found_toasttup)
                report_corruption(ctx,
-                                                 psprintf("toasted value for attribute %u missing from toast table",
-                                                                  ctx->attnum));
+                                                 psprintf("toasted value id %u for attribute %u missing from toast table",
+                                                                  toast_pointer.va_valueid, ctx->attnum));
        else if (ctx->chunkno != (ctx->endchunk + 1))
                report_corruption(ctx,
                                                  psprintf("final toast chunk number %u differs from expected value %u",


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#83Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Dilger (#82)
Re: pg_amcheck contrib application

Mark Dilger <mark.dilger@enterprisedb.com> writes:

On Mar 16, 2021, at 10:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

(Too bad the message doesn't report the
TOAST OID it probed for, so we can see if that's sane or not.)

I've added that and now get the toast pointer's va_valueid in the message:

heap table "postgres"."pg_catalog"."pg_statistic", block 4, offset 2, attribute 29:
    toasted value id 13227 for attribute 29 missing from toast table
heap table "postgres"."pg_catalog"."pg_statistic", block 4, offset 5, attribute 27:
    toasted value id 13228 for attribute 27 missing from toast table

That's awfully interesting, because OIDs less than 16384 should
only be generated during initdb. So what we seem to be looking at
here is pg_statistic entries that were made during the ANALYZE
done by initdb (cf. vacuum_db()), and then were deleted during
a later auto-analyze (in the buildfarm) or deliberate analyze
(in your reproducer). But if they're deleted, why is amcheck
looking for them?

I'm circling back around to the idea that amcheck is trying to
validate TOAST references that are already dead, and it's getting
burnt because something-or-other has already removed the toast
rows, though not the referencing datums. That's legal behavior
once the rows are marked dead. Upthread it was claimed that
amcheck isn't doing that, but this looks like a smoking gun to me.

regards, tom lane

#84Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#80)
Re: pg_amcheck contrib application

On Tue, Mar 16, 2021 at 1:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

No. What should be happening there is that some arrays in the column
get larger alignment than they actually need, but that shouldn't cause
a problem (and has not done so, AFAIK, in the decades that it's been
like this). As you say, deforming the tuple is going to rely on the
table's tupdesc for alignment; it can't know what is in the data.

OK, I don't understand this. attalign is 'd', which is already the
maximum possible, and even if it weren't, individual rows can't decide
to use a larger OR smaller alignment than expected without breaking
stuff.

In what context is it OK to just add extra alignment padding?

--
Robert Haas
EDB: http://www.enterprisedb.com

#85Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tom Lane (#80)
Re: pg_amcheck contrib application

On Mar 16, 2021, at 10:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm not entirely sure what's going on, but I think coming at this
with the mindset that "amcheck has detected some corruption" is
just going to lead you astray. Almost certainly, it's "amcheck
is incorrectly claiming corruption". That might be from mis-decoding
a TOAST-referencing datum. (Too bad the message doesn't report the
TOAST OID it probed for, so we can see if that's sane or not.)

CopyStatistics seems to just copy Form_pg_statistic without regard for who owns the toast. Is this safe? Looking at RemoveStatistics, I'm not sure that it is. Anybody more familiar with this code care to give an opinion?


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#86Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#83)
Re: pg_amcheck contrib application

On Tue, Mar 16, 2021 at 2:22 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm circling back around to the idea that amcheck is trying to
validate TOAST references that are already dead, and it's getting
burnt because something-or-other has already removed the toast
rows, though not the referencing datums. That's legal behavior
once the rows are marked dead. Upthread it was claimed that
amcheck isn't doing that, but this looks like a smoking gun to me.

I think this theory has some legs. From check_tuple_header_and_visibilty():

        else if (!(infomask & HEAP_XMAX_COMMITTED))
            return true;        /* HEAPTUPLE_DELETE_IN_PROGRESS or
                                 * HEAPTUPLE_LIVE */
        else
            return false;       /* HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD */
    }
    return true;                /* not dead */
}

That first case looks wrong to me. Don't we need to call
get_xid_status() here, Mark? As coded, it seems that if the xmin is ok
and the xmax is not marked committed, we consider the tuple checkable. The
comment says it must be HEAPTUPLE_DELETE_IN_PROGRESS or
HEAPTUPLE_LIVE, but it seems to me that the actual answer could be
HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD, depending on whether the
xmax actually committed. And in the second case the comment says it's
either HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD, but I think in that
case it's either HEAPTUPLE_DELETE_IN_PROGRESS or
HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD, depending on the XID
status.

Another thought here is that it's probably not wicked smart to be
relying on the hint bits to match the actual status of the tuple in
cases of corruption. Maybe we should be warning about tuples that
have xmin or xmax flagged as committed or invalid when in fact clog
disagrees. That's not a particularly uncommon case, and it's hard to
check.

--
Robert Haas
EDB: http://www.enterprisedb.com

#87Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#84)
Re: pg_amcheck contrib application

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Mar 16, 2021 at 1:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

No. What should be happening there is that some arrays in the column
get larger alignment than they actually need, but that shouldn't cause
a problem (and has not done so, AFAIK, in the decades that it's been
like this). As you say, deforming the tuple is going to rely on the
table's tupdesc for alignment; it can't know what is in the data.

OK, I don't understand this. attalign is 'd', which is already the
maximum possible, and even if it weren't, individual rows can't decide
to use a larger OR smaller alignment than expected without breaking
stuff.

In what context is it OK to just add extra alignment padding?

It's *not* extra, according to pg_statistic's tuple descriptor.
Both forming and deforming of pg_statistic tuples should honor
that and locate stavaluesX values on d-aligned boundaries.

It could be that a particular entry is of an array type that
only requires i-alignment. But that doesn't break anything,
it just means we inserted more padding than an omniscient
implementation would do.

(I suppose in some parallel universe there could be a machine
where i-alignment is stricter than d-alignment, and then we'd
have trouble.)

regards, tom lane

#88Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Dilger (#85)
Re: pg_amcheck contrib application

Mark Dilger <mark.dilger@enterprisedb.com> writes:

CopyStatistics seems to just copy Form_pg_statistic without regard for
who owns the toast. Is this safe?

No less so than a ton of other places that insert values that might've
come from on-disk storage. heap_toast_insert_or_update() is responsible
for dealing with the problem. These days it looks like it's actually
toast_tuple_init() that takes care of dereferencing previously-toasted
values during an INSERT.

regards, tom lane

#89Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#87)
Re: pg_amcheck contrib application

On Tue, Mar 16, 2021 at 2:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

In what context is it OK to just add extra alignment padding?

It's *not* extra, according to pg_statistic's tuple descriptor.
Both forming and deforming of pg_statistic tuples should honor
that and locate stavaluesX values on d-aligned boundaries.

It could be that a particular entry is of an array type that
only requires i-alignment. But that doesn't break anything,
it just means we inserted more padding than an omniscient
implementation would do.

OK, yeah, I just misunderstood what you were saying.

--
Robert Haas
EDB: http://www.enterprisedb.com

#90Andrew Dunstan
andrew@dunslane.net
In reply to: Andres Freund (#53)
Re: pg_amcheck contrib application

On 3/13/21 1:30 AM, Andres Freund wrote:

Hi,

On 2021-03-13 01:22:54 -0500, Tom Lane wrote:

Mark Dilger <mark.dilger@enterprisedb.com> writes:

On Mar 12, 2021, at 10:16 PM, Noah Misch <noah@leadboat.com> wrote:

hoverfly does configure with PERL=perl64. /usr/bin/prove is from the 32-bit
Perl, so I suspect the TAP suites get 32-bit Perl that way. (There's no
"prove64".)

Oh, that's annoying.

I suspect we could solve that by changing our /usr/bin/prove
invocation to instead be PERL /usr/bin/prove? That might be a good thing
independent of this issue, because it's not unreasonable for a user to
expect that we'd actually use the perl installation they configured...

Although I do not know how prove determines the perl installation it's
going to use for the test scripts...

There's a very good chance this would break msys builds, which are
configured to build against a pure native (i.e. non-msys) perl such as
AS or Strawberry, but need to run msys perl for TAP tests, so it gets
the paths right.

(Don't get me started on the madness that can result from managing this.
I've lost weeks of my life to it ... If you add cygwin into the mix and
you're trying to coordinate builds among three buildfarm animals it's a
major pain.)

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#91Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#86)
Re: pg_amcheck contrib application

On Mar 16, 2021, at 11:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Mar 16, 2021 at 2:22 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm circling back around to the idea that amcheck is trying to
validate TOAST references that are already dead, and it's getting
burnt because something-or-other has already removed the toast
rows, though not the referencing datums. That's legal behavior
once the rows are marked dead. Upthread it was claimed that
amcheck isn't doing that, but this looks like a smoking gun to me.

I think this theory has some legs. From check_tuple_header_and_visibilty():

        else if (!(infomask & HEAP_XMAX_COMMITTED))
            return true;        /* HEAPTUPLE_DELETE_IN_PROGRESS or
                                 * HEAPTUPLE_LIVE */
        else
            return false;       /* HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD */
    }
    return true;                /* not dead */
}

That first case looks wrong to me. Don't we need to call
get_xid_status() here, Mark? As coded, it seems that if the xmin is ok
and the xmax is not marked committed, we consider the tuple checkable. The
comment says it must be HEAPTUPLE_DELETE_IN_PROGRESS or
HEAPTUPLE_LIVE, but it seems to me that the actual answer could be
HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD, depending on whether the
xmax actually committed. And in the second case the comment says it's
either HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD, but I think in that
case it's either HEAPTUPLE_DELETE_IN_PROGRESS or
HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD, depending on the XID
status.

Another thought here is that it's probably not wicked smart to be
relying on the hint bits to match the actual status of the tuple in
cases of corruption. Maybe we should be warning about tuples that
have xmin or xmax flagged as committed or invalid when in fact clog
disagrees. That's not a particularly uncommon case, and it's hard to
check.

This code was not committed as part of the recent pg_amcheck work, but longer ago, and I'm having trouble reconstructing exactly why it was written that way.

Changing check_tuple_header_and_visibilty() fixes the regression test and also manual tests against the "regression" database that I've been using. I'd like to ponder the changes a while longer before I post, but the fact that these changes fix the tests seems to add credibility to this theory.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#92Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#73)
Re: pg_amcheck contrib application

On Mon, Mar 15, 2021 at 10:10 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

It is unfortunate that the failing test only runs pg_amcheck after creating numerous corruptions, as we can't know if pg_amcheck would have complained about pg_statistic before the corruptions were created in other tables, or if it only does so after. The attached patch v7-0003 adds a call to pg_amcheck after all tables are created and populated, but before any corruptions are caused. This should help narrow down what is happening, and doesn't hurt to leave in place long-term.

I don't immediately see anything wrong with how pg_statistic uses a pseudo-type, but it leads me to want to poke a bit more at pg_statistic on hornet and tern, though I don't have any regression tests specifically for doing so.

Tests v7-0001 and v7-0002 are just repeats of the tests posted previously.

Since we now know that shutting autovacuum off makes the problem go
away, I don't see a reason to commit 0001. We should fix pg_amcheck
instead, if, as presently seems to be the case, that's where the
problem is.

I just committed 0002.

I think 0003 is probably a good idea, but I haven't committed it yet.

As for 0004, it seems to me that we might want to do a little more
rewording of these messages and perhaps we should try to do it all at
once. Like, for example, your other change to print out the toast
value ID seems like a good idea, and could apply to any new messages
as well as some existing ones. Maybe there are also more fields in the
TOAST pointer for which we could add checks.

--
Robert Haas
EDB: http://www.enterprisedb.com

#93Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#92)
2 attachment(s)
Re: pg_amcheck contrib application

On Mar 16, 2021, at 12:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Mar 15, 2021 at 10:10 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

It is unfortunate that the failing test only runs pg_amcheck after creating numerous corruptions, as we can't know if pg_amcheck would have complained about pg_statistic before the corruptions were created in other tables, or if it only does so after. The attached patch v7-0003 adds a call to pg_amcheck after all tables are created and populated, but before any corruptions are caused. This should help narrow down what is happening, and doesn't hurt to leave in place long-term.

I don't immediately see anything wrong with how pg_statistic uses a pseudo-type, but it leads me to want to poke a bit more at pg_statistic on hornet and tern, though I don't have any regression tests specifically for doing so.

Tests v7-0001 and v7-0002 are just repeats of the tests posted previously.

Since we now know that shutting autovacuum off makes the problem go
away, I don't see a reason to commit 0001. We should fix pg_amcheck
instead, if, as presently seems to be the case, that's where the
problem is.

If you get unlucky, autovacuum will process one of the tables that the test intentionally corrupted, with bad consequences, ultimately causing intermittent buildfarm test failures. We could wait to see if this ever happens before fixing it, if you prefer. I'm not sure what that buys us, though.

The right approach, I think, is to extend the contrib/amcheck tests to include regressing this particular case to see if it fails. I've written that test and verified that it fails against the old code and passes against the new.

I just committed 0002.

Thanks!

I think 0003 is probably a good idea, but I haven't committed it yet.

It won't do anything for us in this particular case, but it might make debugging failures quicker in the future.

As for 0004, it seems to me that we might want to do a little more
rewording of these messages and perhaps we should try to do it all at
once. Like, for example, your other change to print out the toast
value ID seems like a good idea, and could apply to any new messages
as well as some existing ones. Maybe there are also more fields in the
TOAST pointer for which we could add checks.

Of the toast pointer fields:

int32 va_rawsize; /* Original data size (includes header) */
int32 va_extsize; /* External saved size (doesn't) */
Oid va_valueid; /* Unique ID of value within TOAST table */
Oid va_toastrelid; /* RelID of TOAST table containing it */

all seem worth getting as part of any toast error message, even if these fields themselves are not corrupt. It just makes it easier to understand the context of the error you're looking at. At first I tried putting these into each message, but it is very wordy to say things like "toast pointer with rawsize %u and extsize %u pointing at relation with oid %u" and such. It made more sense to just add these four fields to the verify_heapam tuple format. That saves putting them in the message text itself, and has the benefit that you could filter the rows coming from verify_heapam() for ones where valueid is or is not null, for example. This changes the external interface of verify_heapam, but I didn't bother with an amcheck--1.3--1.4.sql because amcheck--1.2--1.3.sql was added as part of the v14 development work and has not yet been released. My assumption is that I can just change it, rather than making a new upgrade file.

These patches fix the visibility rules and add extra toast checking. Some of the previous patch material is not included, since it is not clear to me if you wanted to commit any of it.

Attachments:

v9-0001-Fixing-amcheck-tuple-visibility-rules.patchapplication/octet-stream; name=v9-0001-Fixing-amcheck-tuple-visibility-rules.patch; x-unix-mode=0644Download
From 18b00da2e9c918baa5b970c755088e409f70e709 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Tue, 16 Mar 2021 12:32:07 -0700
Subject: [PATCH v9 1/2] Fixing amcheck tuple visibility rules

amcheck was considering a tuple as visible when it should have
considered it dead, leading to checking of dead tuples and
complaints when their toast was missing from the toast table, and
perhaps to other problems.

Extend amcheck's regression tests with a test case that reliably
reproduces the buggy behavior fixed in this commit, to be sure it
does not come back.
---
 contrib/amcheck/t/001_verify_heapam.pl | 13 +++-
 contrib/amcheck/verify_heapam.c        | 93 ++++++++++++--------------
 2 files changed, 56 insertions(+), 50 deletions(-)

diff --git a/contrib/amcheck/t/001_verify_heapam.pl b/contrib/amcheck/t/001_verify_heapam.pl
index 6050feb712..b6fc640a53 100644
--- a/contrib/amcheck/t/001_verify_heapam.pl
+++ b/contrib/amcheck/t/001_verify_heapam.pl
@@ -4,7 +4,7 @@ use warnings;
 use PostgresNode;
 use TestLib;
 
-use Test::More tests => 80;
+use Test::More tests => 128;
 
 my ($node, $result);
 
@@ -17,6 +17,17 @@ $node->append_conf('postgresql.conf', 'autovacuum=off');
 $node->start;
 $node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
 
+#
+# Check for false positives against pg_statistic.  There was a bug in the
+# visibility checking logic that resulted in a consistently reproducible
+# complaint about missing toast table entries for table pg_statistic.  The
+# problem was that main table entries were being checked despite being dead,
+# which is wrong, and though the main table entries were not corrupt, the
+# missing toast was reported.
+#
+$node->safe_psql('postgres', q(ANALYZE));
+check_all_options_uncorrupted('pg_catalog.pg_statistic', 'plain');
+
 #
 # Check a table with data loaded but no corruption, freezing, etc.
 #
diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index e614c12a14..dc57fe5774 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -758,58 +758,53 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
 
 	if (!(infomask & HEAP_XMAX_INVALID) && !HEAP_XMAX_IS_LOCKED_ONLY(infomask))
 	{
-		if (infomask & HEAP_XMAX_IS_MULTI)
-		{
-			XidCommitStatus status;
-			TransactionId xmax = HeapTupleGetUpdateXid(tuphdr);
+		XidCommitStatus status;
+		TransactionId xmax;
 
-			switch (get_xid_status(xmax, ctx, &status))
-			{
-					/* not LOCKED_ONLY, so it has to have an xmax */
-				case XID_INVALID:
-					report_corruption(ctx,
-									  pstrdup("xmax is invalid"));
-					return false;	/* corrupt */
-				case XID_IN_FUTURE:
-					report_corruption(ctx,
-									  psprintf("xmax %u equals or exceeds next valid transaction ID %u:%u",
-											   xmax,
-											   EpochFromFullTransactionId(ctx->next_fxid),
-											   XidFromFullTransactionId(ctx->next_fxid)));
-					return false;	/* corrupt */
-				case XID_PRECEDES_RELMIN:
-					report_corruption(ctx,
-									  psprintf("xmax %u precedes relation freeze threshold %u:%u",
-											   xmax,
-											   EpochFromFullTransactionId(ctx->relfrozenfxid),
-											   XidFromFullTransactionId(ctx->relfrozenfxid)));
-					return false;	/* corrupt */
-				case XID_PRECEDES_CLUSTERMIN:
-					report_corruption(ctx,
-									  psprintf("xmax %u precedes oldest valid transaction ID %u:%u",
-											   xmax,
-											   EpochFromFullTransactionId(ctx->oldest_fxid),
-											   XidFromFullTransactionId(ctx->oldest_fxid)));
-					return false;	/* corrupt */
-				case XID_BOUNDS_OK:
-					switch (status)
-					{
-						case XID_IN_PROGRESS:
-							return true;	/* HEAPTUPLE_DELETE_IN_PROGRESS */
-						case XID_COMMITTED:
-						case XID_ABORTED:
-							return false;	/* HEAPTUPLE_RECENTLY_DEAD or
-											 * HEAPTUPLE_DEAD */
-					}
-			}
+		if (infomask & HEAP_XMAX_IS_MULTI)
+			xmax = HeapTupleGetUpdateXid(tuphdr);
+		else
+			xmax = HeapTupleHeaderGetRawXmax(tuphdr);
 
-			/* Ok, the tuple is live */
+		switch (get_xid_status(xmax, ctx, &status))
+		{
+				/* not LOCKED_ONLY, so it has to have an xmax */
+			case XID_INVALID:
+				report_corruption(ctx,
+								  pstrdup("xmax is invalid"));
+				return false;	/* corrupt */
+			case XID_IN_FUTURE:
+				report_corruption(ctx,
+								  psprintf("xmax %u equals or exceeds next valid transaction ID %u:%u",
+										   xmax,
+										   EpochFromFullTransactionId(ctx->next_fxid),
+										   XidFromFullTransactionId(ctx->next_fxid)));
+				return false;	/* corrupt */
+			case XID_PRECEDES_RELMIN:
+				report_corruption(ctx,
+								  psprintf("xmax %u precedes relation freeze threshold %u:%u",
+										   xmax,
+										   EpochFromFullTransactionId(ctx->relfrozenfxid),
+										   XidFromFullTransactionId(ctx->relfrozenfxid)));
+				return false;	/* corrupt */
+			case XID_PRECEDES_CLUSTERMIN:
+				report_corruption(ctx,
+								  psprintf("xmax %u precedes oldest valid transaction ID %u:%u",
+										   xmax,
+										   EpochFromFullTransactionId(ctx->oldest_fxid),
+										   XidFromFullTransactionId(ctx->oldest_fxid)));
+				return false;	/* corrupt */
+			case XID_BOUNDS_OK:
+				switch (status)
+				{
+					case XID_IN_PROGRESS:
+						return true;	/* HEAPTUPLE_DELETE_IN_PROGRESS */
+					case XID_COMMITTED:
+					case XID_ABORTED:
+						return false;	/* HEAPTUPLE_RECENTLY_DEAD or
+										 * HEAPTUPLE_DEAD */
+				}
 		}
-		else if (!(infomask & HEAP_XMAX_COMMITTED))
-			return true;		/* HEAPTUPLE_DELETE_IN_PROGRESS or
-								 * HEAPTUPLE_LIVE */
-		else
-			return false;		/* HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD */
 	}
 	return true;				/* not dead */
 }
-- 
2.21.1 (Apple Git-122.3)

v9-0002-pg_amcheck-provide-additional-toast-corruption-in.patch (application/octet-stream)
From a6bbab22a5958e13203c63b70189a7bd4ba9484d Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Mon, 15 Mar 2021 17:26:47 -0700
Subject: [PATCH v9 2/2] pg_amcheck: provide additional toast corruption
 information

Modify amcheck to provide additional information about corrupted
toast, and adjust pg_amcheck to expect the new amcheck output
format.  Part of the additional toast information was known to
amcheck all along but unhelpfully omitted from the corruption
reports.  This commit also adds more thorough checking, including
whether toasted data matches the rawsize and extsize claimed in the
main table's toast pointer, and whether compressed toasted data is
corrupted.  These additional checks were quite intentionally not
performed in the original amcheck version due to performance and
stability concerns, but on reflection those concerns are better
addressed by adding options to turn the checks off if desired.

While at it, change some amcheck messages to omit the attribute
number.  Including it is redundant, given that attnum is already one
of the returned columns, and it just makes the message text
needlessly longer.
---
 contrib/amcheck/amcheck--1.2--1.3.sql     |   4 +
 contrib/amcheck/expected/check_heap.out   |  88 +++++------
 contrib/amcheck/verify_heapam.c           | 175 ++++++++++++++++++----
 src/bin/pg_amcheck/pg_amcheck.c           |  20 ++-
 src/bin/pg_amcheck/t/004_verify_heapam.pl | 131 ++++++++++++++--
 5 files changed, 322 insertions(+), 96 deletions(-)

diff --git a/contrib/amcheck/amcheck--1.2--1.3.sql b/contrib/amcheck/amcheck--1.2--1.3.sql
index 7237ab738c..c77090b5e9 100644
--- a/contrib/amcheck/amcheck--1.2--1.3.sql
+++ b/contrib/amcheck/amcheck--1.2--1.3.sql
@@ -15,6 +15,10 @@ CREATE FUNCTION verify_heapam(relation regclass,
 							  blkno OUT bigint,
 							  offnum OUT integer,
 							  attnum OUT integer,
+							  rawsize OUT integer,
+							  extsize OUT integer,
+							  valueid OUT oid,
+							  toastrelid OUT oid,
 							  msg OUT text)
 RETURNS SETOF record
 AS 'MODULE_PATHNAME', 'verify_heapam'
diff --git a/contrib/amcheck/expected/check_heap.out b/contrib/amcheck/expected/check_heap.out
index 1fb3823142..54c327583e 100644
--- a/contrib/amcheck/expected/check_heap.out
+++ b/contrib/amcheck/expected/check_heap.out
@@ -6,60 +6,60 @@ ERROR:  invalid skip option
 HINT:  Valid skip options are "all-visible", "all-frozen", and "none".
 -- Check specifying invalid block ranges when verifying an empty table
 SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock := 0);
- blkno | offnum | attnum | msg 
--------+--------+--------+-----
+ blkno | offnum | attnum | rawsize | extsize | valueid | toastrelid | msg 
+-------+--------+--------+---------+---------+---------+------------+-----
 (0 rows)
 
 SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 5, endblock := 8);
- blkno | offnum | attnum | msg 
--------+--------+--------+-----
+ blkno | offnum | attnum | rawsize | extsize | valueid | toastrelid | msg 
+-------+--------+--------+---------+---------+---------+------------+-----
 (0 rows)
 
 -- Check that valid options are not rejected nor corruption reported
 -- for an empty table, and that skip enum-like parameter is case-insensitive
 SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
- blkno | offnum | attnum | msg 
--------+--------+--------+-----
+ blkno | offnum | attnum | rawsize | extsize | valueid | toastrelid | msg 
+-------+--------+--------+---------+---------+---------+------------+-----
 (0 rows)
 
 SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-frozen');
- blkno | offnum | attnum | msg 
--------+--------+--------+-----
+ blkno | offnum | attnum | rawsize | extsize | valueid | toastrelid | msg 
+-------+--------+--------+---------+---------+---------+------------+-----
 (0 rows)
 
 SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-visible');
- blkno | offnum | attnum | msg 
--------+--------+--------+-----
+ blkno | offnum | attnum | rawsize | extsize | valueid | toastrelid | msg 
+-------+--------+--------+---------+---------+---------+------------+-----
 (0 rows)
 
 SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'None');
- blkno | offnum | attnum | msg 
--------+--------+--------+-----
+ blkno | offnum | attnum | rawsize | extsize | valueid | toastrelid | msg 
+-------+--------+--------+---------+---------+---------+------------+-----
 (0 rows)
 
 SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'All-Frozen');
- blkno | offnum | attnum | msg 
--------+--------+--------+-----
+ blkno | offnum | attnum | rawsize | extsize | valueid | toastrelid | msg 
+-------+--------+--------+---------+---------+---------+------------+-----
 (0 rows)
 
 SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'All-Visible');
- blkno | offnum | attnum | msg 
--------+--------+--------+-----
+ blkno | offnum | attnum | rawsize | extsize | valueid | toastrelid | msg 
+-------+--------+--------+---------+---------+---------+------------+-----
 (0 rows)
 
 SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'NONE');
- blkno | offnum | attnum | msg 
--------+--------+--------+-----
+ blkno | offnum | attnum | rawsize | extsize | valueid | toastrelid | msg 
+-------+--------+--------+---------+---------+---------+------------+-----
 (0 rows)
 
 SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-FROZEN');
- blkno | offnum | attnum | msg 
--------+--------+--------+-----
+ blkno | offnum | attnum | rawsize | extsize | valueid | toastrelid | msg 
+-------+--------+--------+---------+---------+---------+------------+-----
 (0 rows)
 
 SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'ALL-VISIBLE');
- blkno | offnum | attnum | msg 
--------+--------+--------+-----
+ blkno | offnum | attnum | rawsize | extsize | valueid | toastrelid | msg 
+-------+--------+--------+---------+---------+---------+------------+-----
 (0 rows)
 
 -- Add some data so subsequent tests are not entirely trivial
@@ -69,23 +69,23 @@ INSERT INTO heaptest (a, b)
 -- Check that valid options are not rejected nor corruption reported
 -- for a non-empty table
 SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
- blkno | offnum | attnum | msg 
--------+--------+--------+-----
+ blkno | offnum | attnum | rawsize | extsize | valueid | toastrelid | msg 
+-------+--------+--------+---------+---------+---------+------------+-----
 (0 rows)
 
 SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-frozen');
- blkno | offnum | attnum | msg 
--------+--------+--------+-----
+ blkno | offnum | attnum | rawsize | extsize | valueid | toastrelid | msg 
+-------+--------+--------+---------+---------+---------+------------+-----
 (0 rows)
 
 SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-visible');
- blkno | offnum | attnum | msg 
--------+--------+--------+-----
+ blkno | offnum | attnum | rawsize | extsize | valueid | toastrelid | msg 
+-------+--------+--------+---------+---------+---------+------------+-----
 (0 rows)
 
 SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock := 0);
- blkno | offnum | attnum | msg 
--------+--------+--------+-----
+ blkno | offnum | attnum | rawsize | extsize | valueid | toastrelid | msg 
+-------+--------+--------+---------+---------+---------+------------+-----
 (0 rows)
 
 CREATE ROLE regress_heaptest_role;
@@ -98,8 +98,8 @@ GRANT EXECUTE ON FUNCTION verify_heapam(regclass, boolean, boolean, text, bigint
 -- verify permissions are now sufficient
 SET ROLE regress_heaptest_role;
 SELECT * FROM verify_heapam(relation := 'heaptest');
- blkno | offnum | attnum | msg 
--------+--------+--------+-----
+ blkno | offnum | attnum | rawsize | extsize | valueid | toastrelid | msg 
+-------+--------+--------+---------+---------+---------+------------+-----
 (0 rows)
 
 RESET ROLE;
@@ -113,23 +113,23 @@ VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) heaptest;
 -- Check that valid options are not rejected nor corruption reported
 -- for a non-empty frozen table
 SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'none');
- blkno | offnum | attnum | msg 
--------+--------+--------+-----
+ blkno | offnum | attnum | rawsize | extsize | valueid | toastrelid | msg 
+-------+--------+--------+---------+---------+---------+------------+-----
 (0 rows)
 
 SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-frozen');
- blkno | offnum | attnum | msg 
--------+--------+--------+-----
+ blkno | offnum | attnum | rawsize | extsize | valueid | toastrelid | msg 
+-------+--------+--------+---------+---------+---------+------------+-----
 (0 rows)
 
 SELECT * FROM verify_heapam(relation := 'heaptest', skip := 'all-visible');
- blkno | offnum | attnum | msg 
--------+--------+--------+-----
+ blkno | offnum | attnum | rawsize | extsize | valueid | toastrelid | msg 
+-------+--------+--------+---------+---------+---------+------------+-----
 (0 rows)
 
 SELECT * FROM verify_heapam(relation := 'heaptest', startblock := 0, endblock := 0);
- blkno | offnum | attnum | msg 
--------+--------+--------+-----
+ blkno | offnum | attnum | rawsize | extsize | valueid | toastrelid | msg 
+-------+--------+--------+---------+---------+---------+------------+-----
 (0 rows)
 
 -- Check that partitioned tables (the parent ones) which don't have visibility
@@ -146,8 +146,8 @@ CREATE TABLE test_partition partition OF test_partitioned FOR VALUES IN (1);
 SELECT * FROM verify_heapam('test_partition',
 							startblock := NULL,
 							endblock := NULL);
- blkno | offnum | attnum | msg 
--------+--------+--------+-----
+ blkno | offnum | attnum | rawsize | extsize | valueid | toastrelid | msg 
+-------+--------+--------+---------+---------+---------+------------+-----
 (0 rows)
 
 -- Check that valid options are not rejected nor corruption reported
@@ -156,8 +156,8 @@ INSERT INTO test_partitioned (a) (SELECT 1 FROM generate_series(1,1000) gs);
 SELECT * FROM verify_heapam('test_partition',
 							startblock := NULL,
 							endblock := NULL);
- blkno | offnum | attnum | msg 
--------+--------+--------+-----
+ blkno | offnum | attnum | rawsize | extsize | valueid | toastrelid | msg 
+-------+--------+--------+---------+---------+---------+------------+-----
 (0 rows)
 
 -- Check that indexes are rejected
diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index dc57fe5774..50edc2d6a8 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -18,6 +18,7 @@
 #include "access/toast_internals.h"
 #include "access/visibilitymap.h"
 #include "catalog/pg_am.h"
+#include "common/pg_lzcompress.h"
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -28,7 +29,7 @@
 PG_FUNCTION_INFO_V1(verify_heapam);
 
 /* The number of columns in tuples returned by verify_heapam */
-#define HEAPCHECK_RELATION_COLS 4
+#define HEAPCHECK_RELATION_COLS 8
 
 /*
  * Despite the name, we use this for reporting problems with both XIDs and
@@ -114,6 +115,8 @@ typedef struct HeapCheckContext
 	AttrNumber	attnum;
 
 	/* Values for iterating over toast for the attribute */
+	struct varatt_external toast_pointer;
+	bool		checking_toastptr;
 	int32		chunkno;
 	int32		attrsize;
 	int32		endchunk;
@@ -130,7 +133,7 @@ typedef struct HeapCheckContext
 /* Internal implementation */
 static void sanity_check_relation(Relation rel);
 static void check_tuple(HeapCheckContext *ctx);
-static void check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx);
+static int32 check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx);
 
 static bool check_tuple_attribute(HeapCheckContext *ctx);
 static bool check_tuple_header_and_visibilty(HeapTupleHeader tuphdr,
@@ -343,6 +346,7 @@ verify_heapam(PG_FUNCTION_ARGS)
 	if (TransactionIdIsNormal(ctx.relfrozenxid))
 		ctx.oldest_xid = ctx.relfrozenxid;
 
+	ctx.checking_toastptr = false;
 	for (ctx.blkno = first_block; ctx.blkno <= last_block; ctx.blkno++)
 	{
 		OffsetNumber maxoff;
@@ -517,7 +521,21 @@ report_corruption(HeapCheckContext *ctx, char *msg)
 	values[1] = Int32GetDatum(ctx->offnum);
 	values[2] = Int32GetDatum(ctx->attnum);
 	nulls[2] = (ctx->attnum < 0);
-	values[3] = CStringGetTextDatum(msg);
+	if (ctx->checking_toastptr)
+	{
+		values[3] = Int32GetDatum(ctx->toast_pointer.va_rawsize);
+		values[4] = Int32GetDatum(ctx->toast_pointer.va_extsize);
+		values[5] = ObjectIdGetDatum(ctx->toast_pointer.va_valueid);
+		values[6] = ObjectIdGetDatum(ctx->toast_pointer.va_toastrelid);
+	}
+	else
+	{
+		nulls[3] = true;
+		nulls[4] = true;
+		nulls[5] = true;
+		nulls[6] = true;
+	}
+	values[7] = CStringGetTextDatum(msg);
 
 	/*
 	 * In principle, there is nothing to prevent a scan over a large, highly
@@ -548,6 +566,10 @@ verify_heapam_tupdesc(void)
 	TupleDescInitEntry(tupdesc, ++a, "blkno", INT8OID, -1, 0);
 	TupleDescInitEntry(tupdesc, ++a, "offnum", INT4OID, -1, 0);
 	TupleDescInitEntry(tupdesc, ++a, "attnum", INT4OID, -1, 0);
+	TupleDescInitEntry(tupdesc, ++a, "rawsize", INT4OID, -1, 0);
+	TupleDescInitEntry(tupdesc, ++a, "extsize", INT4OID, -1, 0);
+	TupleDescInitEntry(tupdesc, ++a, "valueid", OIDOID, -1, 0);
+	TupleDescInitEntry(tupdesc, ++a, "toastrelid", OIDOID, -1, 0);
 	TupleDescInitEntry(tupdesc, ++a, "msg", TEXTOID, -1, 0);
 	Assert(a == HEAPCHECK_RELATION_COLS);
 
@@ -819,8 +841,11 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
  * tuples that store the toasted value are retrieved and checked in order, with
  * each toast tuple being checked against where we are in the sequence, as well
  * as each toast tuple having its varlena structure sanity checked.
+ *
+ * Returns the size of the chunk, not including the header, or zero if it
+ * cannot be determined due to corruption.
  */
-static void
+static int32
 check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 {
 	int32		curchunk;
@@ -838,7 +863,7 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 	{
 		report_corruption(ctx,
 						  pstrdup("toast chunk sequence number is null"));
-		return;
+		return 0;
 	}
 	chunk = DatumGetPointer(fastgetattr(toasttup, 3,
 										ctx->toast_rel->rd_att, &isnull));
@@ -846,7 +871,7 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 	{
 		report_corruption(ctx,
 						  pstrdup("toast chunk data is null"));
-		return;
+		return 0;
 	}
 	if (!VARATT_IS_EXTENDED(chunk))
 		chunksize = VARSIZE(chunk) - VARHDRSZ;
@@ -865,7 +890,7 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 		report_corruption(ctx,
 						  psprintf("corrupt extended toast chunk has invalid varlena header: %0x (sequence number %d)",
 								   header, curchunk));
-		return;
+		return 0;
 	}
 
 	/*
@@ -876,14 +901,14 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 		report_corruption(ctx,
 						  psprintf("toast chunk sequence number %u does not match the expected sequence number %u",
 								   curchunk, ctx->chunkno));
-		return;
+		return chunksize;
 	}
 	if (curchunk > ctx->endchunk)
 	{
 		report_corruption(ctx,
 						  psprintf("toast chunk sequence number %u exceeds the end chunk sequence number %u",
 								   curchunk, ctx->endchunk));
-		return;
+		return chunksize;
 	}
 
 	expected_size = curchunk < ctx->totalchunks - 1 ? TOAST_MAX_CHUNK_SIZE
@@ -893,8 +918,10 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 		report_corruption(ctx,
 						  psprintf("toast chunk size %u differs from the expected size %u",
 								   chunksize, expected_size));
-		return;
+		return chunksize;
 	}
+
+	return chunksize;
 }
 
 /*
@@ -920,11 +947,11 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 static bool
 check_tuple_attribute(HeapCheckContext *ctx)
 {
-	struct varatt_external toast_pointer;
 	ScanKeyData toastkey;
 	SysScanDesc toastscan;
 	SnapshotData SnapshotToast;
 	HeapTuple	toasttup;
+	int64		toastsize;		/* corrupt toast could overflow 32 bits */
 	bool		found_toasttup;
 	Datum		attdatum;
 	struct varlena *attr;
@@ -932,6 +959,8 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	uint16		infomask;
 	Form_pg_attribute thisatt;
 
+	Assert(!ctx->checking_toastptr);
+
 	infomask = ctx->tuphdr->t_infomask;
 	thisatt = TupleDescAttr(RelationGetDescr(ctx->rel), ctx->attnum);
 
@@ -940,8 +969,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
 	{
 		report_corruption(ctx,
-						  psprintf("attribute %u with length %u starts at offset %u beyond total tuple length %u",
-								   ctx->attnum,
+						  psprintf("attribute with length %u starts at offset %u beyond total tuple length %u",
 								   thisatt->attlen,
 								   ctx->tuphdr->t_hoff + ctx->offset,
 								   ctx->lp_len));
@@ -961,8 +989,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 		if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
 		{
 			report_corruption(ctx,
-							  psprintf("attribute %u with length %u ends at offset %u beyond total tuple length %u",
-									   ctx->attnum,
+							  psprintf("attribute with length %u ends at offset %u beyond total tuple length %u",
 									   thisatt->attlen,
 									   ctx->tuphdr->t_hoff + ctx->offset,
 									   ctx->lp_len));
@@ -994,8 +1021,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 		if (va_tag != VARTAG_ONDISK)
 		{
 			report_corruption(ctx,
-							  psprintf("toasted attribute %u has unexpected TOAST tag %u",
-									   ctx->attnum,
+							  psprintf("toasted attribute has unexpected TOAST tag %u",
 									   va_tag));
 			/* We can't know where the next attribute begins */
 			return false;
@@ -1009,8 +1035,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
 	{
 		report_corruption(ctx,
-						  psprintf("attribute %u with length %u ends at offset %u beyond total tuple length %u",
-								   ctx->attnum,
+						  psprintf("attribute with length %u ends at offset %u beyond total tuple length %u",
 								   thisatt->attlen,
 								   ctx->tuphdr->t_hoff + ctx->offset,
 								   ctx->lp_len));
@@ -1037,12 +1062,18 @@ check_tuple_attribute(HeapCheckContext *ctx)
 
 	/* It is external, and we're looking at a page on disk */
 
+	/*
+	 * Must copy attr into toast_pointer for alignment considerations
+	 */
+	VARATT_EXTERNAL_GET_POINTER(ctx->toast_pointer, attr);
+	ctx->checking_toastptr = true;
+
 	/* The tuple header better claim to contain toasted values */
 	if (!(infomask & HEAP_HASEXTERNAL))
 	{
 		report_corruption(ctx,
-						  psprintf("attribute %u is external but tuple header flag HEAP_HASEXTERNAL not set",
-								   ctx->attnum));
+						  pstrdup("attribute is external but tuple header flag HEAP_HASEXTERNAL not set"));
+		ctx->checking_toastptr = false;
 		return true;
 	}
 
@@ -1050,21 +1081,33 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	if (!ctx->rel->rd_rel->reltoastrelid)
 	{
 		report_corruption(ctx,
-						  psprintf("attribute %u is external but relation has no toast relation",
-								   ctx->attnum));
+						  pstrdup("attribute is external but relation has no toast relation"));
+		ctx->checking_toastptr = false;
 		return true;
 	}
 
 	/* If we were told to skip toast checking, then we're done. */
 	if (ctx->toast_rel == NULL)
+	{
+		ctx->checking_toastptr = false;
 		return true;
+	}
 
-	/*
-	 * Must copy attr into toast_pointer for alignment considerations
-	 */
-	VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+	if (ctx->toast_pointer.va_extsize > ctx->toast_pointer.va_rawsize - VARHDRSZ)
+		report_corruption(ctx,
+						  pstrdup("toast pointer external size exceeds maximum expected for rawsize"));
+
+
+	if (ctx->toast_pointer.va_toastrelid != ctx->rel->rd_rel->reltoastrelid)
+	{
+		report_corruption(ctx,
+						  psprintf("toast pointer relation oid differs from expected value %u",
+								   ctx->rel->rd_rel->reltoastrelid));
+		ctx->checking_toastptr = false;
+		return true;
+	}
 
-	ctx->attrsize = toast_pointer.va_extsize;
+	ctx->attrsize = ctx->toast_pointer.va_extsize;
 	ctx->endchunk = (ctx->attrsize - 1) / TOAST_MAX_CHUNK_SIZE;
 	ctx->totalchunks = ctx->endchunk + 1;
 
@@ -1074,7 +1117,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	ScanKeyInit(&toastkey,
 				(AttrNumber) 1,
 				BTEqualStrategyNumber, F_OIDEQ,
-				ObjectIdGetDatum(toast_pointer.va_valueid));
+				ObjectIdGetDatum(ctx->toast_pointer.va_valueid));
 
 	/*
 	 * Check if any chunks for this toasted object exist in the toast table,
@@ -1087,24 +1130,92 @@ check_tuple_attribute(HeapCheckContext *ctx)
 										   &toastkey);
 	ctx->chunkno = 0;
 	found_toasttup = false;
+	toastsize = 0;
 	while ((toasttup =
 			systable_getnext_ordered(toastscan,
 									 ForwardScanDirection)) != NULL)
 	{
 		found_toasttup = true;
-		check_toast_tuple(toasttup, ctx);
+		toastsize += check_toast_tuple(toasttup, ctx);
 		ctx->chunkno++;
 	}
 	if (!found_toasttup)
 		report_corruption(ctx,
-						  psprintf("toasted value for attribute %u missing from toast table",
-								   ctx->attnum));
+						  pstrdup("toasted value missing from toast table"));
 	else if (ctx->chunkno != (ctx->endchunk + 1))
 		report_corruption(ctx,
 						  psprintf("final toast chunk number %u differs from expected value %u",
 								   ctx->chunkno, (ctx->endchunk + 1)));
 	systable_endscan_ordered(toastscan);
+	if (toastsize != ctx->toast_pointer.va_extsize)
+		report_corruption(ctx,
+						  psprintf("total toast size " INT64_FORMAT " differs from expected extsize",
+								   toastsize));
+	else
+	{
+		Size			allocsize;
+
+		if (!AllocSizeIsValid(ctx->toast_pointer.va_rawsize))
+			report_corruption(ctx,
+							  pstrdup("rawsize too large for attribute to be allocated"));
+
+		allocsize = ctx->toast_pointer.va_extsize + VARHDRSZ;
+		if (!AllocSizeIsValid(allocsize))
+			report_corruption(ctx,
+							  pstrdup("extsize too large for attribute to be allocated"));
+		else
+		{
+			struct varlena *attr;
+
+			/* Fetch all chunks */
+			attr = (struct varlena *) palloc(allocsize);
+			if (VARATT_EXTERNAL_IS_COMPRESSED(ctx->toast_pointer))
+				SET_VARSIZE_COMPRESSED(attr, allocsize);
+			else
+				SET_VARSIZE(attr, allocsize);
+
+			table_relation_fetch_toast_slice(ctx->toast_rel, ctx->toast_pointer.va_valueid,
+											 toastsize, 0, toastsize, attr);
+
+			if (VARATT_IS_COMPRESSED(attr))
+			{
+				struct varlena *uncompressed;
+
+				allocsize = TOAST_COMPRESS_RAWSIZE(attr) + VARHDRSZ;
+				if (allocsize != ctx->toast_pointer.va_rawsize)
+					report_corruption(ctx,
+									  psprintf("toast data rawsize %zu differs from expected rawsize",
+									  allocsize));
+				else if (AllocSizeIsValid(allocsize))
+				{
+					uncompressed = (struct varlena *) palloc(allocsize);
+					SET_VARSIZE(uncompressed, allocsize);
+					if (pglz_decompress(TOAST_COMPRESS_RAWDATA(attr),
+										TOAST_COMPRESS_SIZE(attr),
+										VARDATA(uncompressed),
+										TOAST_COMPRESS_RAWSIZE(attr), true) < 0)
+					{
+						report_corruption(ctx,
+										  pstrdup("compressed toast data is corrupted"));
+					}
+					else if (VARSIZE(uncompressed) != ctx->toast_pointer.va_rawsize)
+						report_corruption(ctx,
+										  psprintf("decompressed toast size %u differs from expected rawsize",
+												   VARSIZE(uncompressed)));
+
+					pfree(uncompressed);
+				}
+			}
+			else if (VARSIZE(attr) != ctx->toast_pointer.va_rawsize)
+				report_corruption(ctx,
+								  psprintf("detoasted attribute size %u differs from expected rawsize",
+								  VARSIZE(attr)));
+
+			pfree(attr);
+		}
+	}
 
+	ctx->checking_toastptr = false;
 	return true;
 }
 
diff --git a/src/bin/pg_amcheck/pg_amcheck.c b/src/bin/pg_amcheck/pg_amcheck.c
index c9d9900693..e5ec7bf2e9 100644
--- a/src/bin/pg_amcheck/pg_amcheck.c
+++ b/src/bin/pg_amcheck/pg_amcheck.c
@@ -799,7 +799,7 @@ prepare_heap_command(PQExpBuffer sql, RelationInfo *rel, PGconn *conn)
 {
 	resetPQExpBuffer(sql);
 	appendPQExpBuffer(sql,
-					  "SELECT blkno, offnum, attnum, msg FROM %s.verify_heapam("
+					  "SELECT blkno, offnum, attnum, rawsize, extsize, valueid, toastrelid, msg FROM %s.verify_heapam("
 					  "\nrelation := %u, on_error_stop := %s, check_toast := %s, skip := '%s'",
 					  rel->datinfo->amcheck_schema,
 					  rel->reloid,
@@ -990,12 +990,24 @@ verify_heap_slot_handler(PGresult *res, PGconn *conn, void *context)
 			const char *msg;
 
 			/* The message string should never be null, but check */
-			if (PQgetisnull(res, i, 3))
+			if (PQgetisnull(res, i, 7))
 				msg = "NO MESSAGE";
 			else
-				msg = PQgetvalue(res, i, 3);
+				msg = PQgetvalue(res, i, 7);
 
-			if (!PQgetisnull(res, i, 2))
+			if (!PQgetisnull(res, i, 6))
+				printf("heap table \"%s\".\"%s\".\"%s\", block %s, offset %s, attribute %s, rawsize %s, extsize %s, valueid %s, toastrelid %s:\n    %s\n",
+					   rel->datinfo->datname, rel->nspname, rel->relname,
+					   PQgetvalue(res, i, 0),	/* blkno */
+					   PQgetvalue(res, i, 1),	/* offnum */
+					   PQgetvalue(res, i, 2),	/* attnum */
+					   PQgetvalue(res, i, 3),	/* toast rawsize */
+					   PQgetvalue(res, i, 4),	/* toast extsize */
+					   PQgetvalue(res, i, 5),	/* toast valueid */
+					   PQgetvalue(res, i, 6),	/* toast relid */
+					   msg);
+
+			else if (!PQgetisnull(res, i, 2))
 				printf("heap table \"%s\".\"%s\".\"%s\", block %s, offset %s, attribute %s:\n    %s\n",
 					   rel->datinfo->datname, rel->nspname, rel->relname,
 					   PQgetvalue(res, i, 0),	/* blkno */
diff --git a/src/bin/pg_amcheck/t/004_verify_heapam.pl b/src/bin/pg_amcheck/t/004_verify_heapam.pl
index 16574cb1f8..243b8fc71d 100644
--- a/src/bin/pg_amcheck/t/004_verify_heapam.pl
+++ b/src/bin/pg_amcheck/t/004_verify_heapam.pl
@@ -224,7 +224,7 @@ my $rel = $node->safe_psql('postgres', qq(SELECT pg_relation_filepath('public.te
 my $relpath = "$pgdata/$rel";
 
 # Insert data and freeze public.test
-use constant ROWCOUNT => 16;
+use constant ROWCOUNT => 21;
 $node->safe_psql('postgres', qq(
 	INSERT INTO public.test (a, b, c)
 		VALUES (
@@ -259,6 +259,13 @@ select lp_off from heap_page_items(get_raw_page('test', 'main', 0))
 	offset $tup limit 1)));
 }
 
+# Find our toast relation id
+my $toastrelid = $node->safe_psql('postgres', qq(
+	SELECT c.reltoastrelid
+		FROM pg_catalog.pg_class c
+		WHERE c.oid = 'public.test'::regclass
+		));
+
 # Sanity check that our 'test' table on disk layout matches expectations.  If
 # this is not so, we will have to skip the test until somebody updates the test
 # to work on this platform.
@@ -296,7 +303,7 @@ close($file)
 $node->start;
 
 # Ok, Xids and page layout look ok.  We can run corruption tests.
-plan tests => 19;
+plan tests => 31;
 
 # Check that pg_amcheck runs against the uncorrupted table without error.
 $node->command_ok(['pg_amcheck', '-p', $port, 'postgres'],
@@ -310,6 +317,7 @@ $node->stop;
 
 # Some #define constants from access/htup_details.h for use while corrupting.
 use constant HEAP_HASNULL            => 0x0001;
+use constant HEAP_HASEXTERNAL        => 0x0004;
 use constant HEAP_XMAX_LOCK_ONLY     => 0x0080;
 use constant HEAP_XMIN_COMMITTED     => 0x0100;
 use constant HEAP_XMIN_INVALID       => 0x0200;
@@ -323,7 +331,22 @@ use constant HEAP_KEYS_UPDATED       => 0x2000;
 # expect verify_heapam() to return given which fields we expect to be non-null.
 sub header
 {
-	my ($blkno, $offnum, $attnum) = @_;
+	my %fields = @_;
+	my $blkno = $fields{blkno};
+	my $offnum = $fields{offnum};
+	my $attnum = $fields{attnum};
+
+	if (exists $fields{rawsize} ||
+		exists $fields{extsize} ||
+		exists $fields{valueid} ||
+		exists $fields{toastrelid})
+	{
+		my $rawsize = defined $fields{rawsize} ? $fields{rawsize} : '\d+';
+		my $extsize = defined $fields{extsize} ? $fields{extsize} : '\d+';
+		my $valueid = defined $fields{valueid} ? $fields{valueid} : '\d+';
+		my $toastrelid = defined $fields{toastrelid} ? $fields{toastrelid} : '\d+';
+		return qr/heap table "postgres"\."public"\."test", block $blkno, offset $offnum, attribute $attnum, rawsize $rawsize, extsize $extsize, valueid $valueid, toastrelid $toastrelid:\s+/ms;
+	}
 	return qr/heap table "postgres"\."public"\."test", block $blkno, offset $offnum, attribute $attnum:\s+/ms
 		if (defined $attnum);
 	return qr/heap table "postgres"\."public"\."test", block $blkno, offset $offnum:\s+/ms
@@ -349,7 +372,7 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 	my $offset = $lp_off[$tupidx];
 	my $tup = read_tuple($file, $offset);
 
-	my $header = header(0, $offnum, undef);
+	my $header = header(blkno => 0, offnum => $offnum);
 	if ($offnum == 1)
 	{
 		# Corruptly set xmin < relfrozenxid
@@ -459,6 +482,20 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 			qr/${$header}number of attributes 67 exceeds maximum expected for table 3/;
 	}
 	elsif ($offnum == 12)
+	{
+		# Corrupt infomask to claim there are no external attributes, which conflicts
+		# with column 'c' which is toasted
+		$tup->{t_infomask} &= ~HEAP_HASEXTERNAL;
+		$header = header(blkno => 0,
+						 offnum => $offnum,
+						 attnum => 2,
+						 rawsize => 10004,
+						 extsize => 10000,
+						 toastrelid => $toastrelid);
+		push @expected,
+			qr/${header}attribute is external but tuple header flag HEAP_HASEXTERNAL not set/;
+	}
+	elsif ($offnum == 13)
 	{
 		# Overwrite column 'b' 1-byte varlena header and initial characters to
 		# look like a long 4-byte varlena
@@ -478,18 +515,9 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 		$tup->{b_body2} = 0xFF;
 		$tup->{b_body3} = 0xFF;
 
-		$header = header(0, $offnum, 1);
+		$header = header(blkno => 0, offnum => $offnum, attnum => 1);
 		push @expected,
-			qr/${header}attribute \d+ with length \d+ ends at offset \d+ beyond total tuple length \d+/;
-	}
-	elsif ($offnum == 13)
-	{
-		# Corrupt the bits in column 'c' toast pointer
-		$tup->{c_va_valueid} = 0xFFFFFFFF;
-
-		$header = header(0, $offnum, 2);
-		push @expected,
-			qr/${header}toasted value for attribute 2 missing from toast table/;
+			qr/${header}attribute with length \d+ ends at offset \d+ beyond total tuple length \d+/;
 	}
 	elsif ($offnum == 14)
 	{
@@ -501,7 +529,7 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 		push @expected,
 			qr/${header}multitransaction ID 4 equals or exceeds next valid multitransaction ID 1/;
 	}
-	elsif ($offnum == 15)	# Last offnum must equal ROWCOUNT
+	elsif ($offnum == 15)
 	{
 		# Set both HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI
 		$tup->{t_infomask} |= HEAP_XMAX_COMMITTED;
@@ -511,6 +539,77 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 		push @expected,
 			qr/${header}multitransaction ID 4000000000 precedes relation minimum multitransaction ID threshold 1/;
 	}
+	elsif ($offnum == 16)
+	{
+		# Corrupt column c's toast pointer va_vartag field
+		$tup->{c_va_vartag} = 42;
+		$header = header(blkno => 0,
+						 offnum => $offnum,
+						 attnum => 2);
+		push @expected,
+			qr/$header/,
+			qr/toasted attribute has unexpected TOAST tag 42/;
+	}
+	elsif ($offnum == 17)
+	{
+		# Corrupt column c's toast pointer va_rawsize field to corruptly
+		# make the toast data appear to be compressed, though it is not.
+		$tup->{c_va_rawsize} = 10005;
+		$header = header(blkno => 0,
+						 offnum => $offnum,
+						 attnum => 2,
+						 rawsize => 10005,
+						 extsize => 10000,
+						 toastrelid => $toastrelid);
+		push @expected,
+			qr/${header}/,
+			qr/toast data rawsize \d+ differs from expected rawsize/;
+	}
+	elsif ($offnum == 18)
+	{
+		# Corrupt column c's toast pointer va_extsize field
+		$tup->{c_va_extsize} = 9999999;
+		$header = header(blkno => 0,
+						 offnum => $offnum,
+						 attnum => 2,
+						 rawsize => 10004,
+						 extsize => $tup->{c_va_extsize},
+						 toastrelid => $toastrelid);
+		push @expected,
+			qr/$header/,
+			qr/toast pointer external size exceeds maximum expected for rawsize/,
+			qr/toast chunk size \d+ differs from the expected size \d+/,
+			qr/toast chunk number \d+ differs from expected value \d+/,
+			qr/total toast size 10000 differs from expected extsize/;
+
+	}
+	elsif ($offnum == 19)
+	{
+		# Corrupt column c's toast pointer va_valueid field.  We have not
+		# consumed enough oids for any valueid in the toast table to be large.
+		# Use a large oid for the corruption to avoid colliding with an
+	# existing entry in the toast table.
+		my $corrupt = $tup->{c_va_valueid} + 100000000;
+		$tup->{c_va_valueid} = $corrupt;
+		$header = header(blkno => 0,
+						 offnum => $offnum,
+						 attnum => 2,
+						 rawsize => 10004,
+						 extsize => 10000,
+						 valueid => $corrupt,
+						 toastrelid => $toastrelid);
+		push @expected,
+			qr/${header}/,
+			qr/toasted value missing from toast table/;
+	}
+	elsif ($offnum == 20)	# Last offnum must be less than or equal to ROWCOUNT-1
+	{
+		# Corrupt column c's toast pointer va_toastrelid field
+		my $otherid = $toastrelid + 1;
+		$tup->{c_va_toastrelid} = $otherid;
+		push @expected,
+			qr/toast pointer relation oid differs from expected value $toastrelid/;
+	}
 	write_tuple($file, $offset, $tup);
 }
 close($file)
-- 
2.21.1 (Apple Git-122.3)

#94Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Dilger (#93)
Re: pg_amcheck contrib application

Mark Dilger <mark.dilger@enterprisedb.com> writes:

On Mar 16, 2021, at 12:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Since we now know that shutting autovacuum off makes the problem go
away, I don't see a reason to commit 0001. We should fix pg_amcheck
instead, if, as presently seems to be the case, that's where the
problem is.

If you get unlucky, autovacuum will process one of the tables that the test intentionally corrupted, with bad consequences, ultimately causing build farm intermittent test failures.

Um, yeah, the test had better shut off autovacuum on any table that
it intentionally corrupts.

regards, tom lane

#95Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#94)
Re: pg_amcheck contrib application

On Thu, Mar 18, 2021 at 12:12 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Mark Dilger <mark.dilger@enterprisedb.com> writes:

On Mar 16, 2021, at 12:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
Since we now know that shutting autovacuum off makes the problem go
away, I don't see a reason to commit 0001. We should fix pg_amcheck
instead, if, as presently seems to be the case, that's where the
problem is.

If you get unlucky, autovacuum will process one of the tables that the test intentionally corrupted, with bad consequences, ultimately causing build farm intermittent test failures.

Um, yeah, the test had better shut off autovacuum on any table that
it intentionally corrupts.

Right, good point. But... does that really apply to
005_opclass_damage.pl? I feel like with the kind of physical damage
we're doing in 003_check.pl, it makes a lot of sense to stop vacuum
from crashing headlong into that table. But, 005 is doing "logical"
damage rather than "physical" damage, and I don't see why autovacuum
should misbehave in that kind of case. In fact, the fact that
autovacuum can handle such cases is one of the selling points for the
whole design of vacuum, as opposed to, for example, retail index
lookups.

Pending resolution of that question, I've committed the change to
disable autovacuum in 003, and also Mark's changes to have it also run
pg_amcheck BEFORE corrupting anything, so the post-corruption tests
fail - say by finding the wrong kind of corruption - we can see
whether it was also failing before the corruption was even introduced.

--
Robert Haas
EDB: http://www.enterprisedb.com

#96Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#95)
Re: pg_amcheck contrib application

On Mar 23, 2021, at 12:05 PM, Robert Haas <robertmhaas@gmail.com> wrote:

005 is doing "logical"
damage rather than "physical" damage, and I don't see why autovacuum
should misbehave in that kind of case. In fact, the fact that
autovacuum can handle such cases is one of the selling points for the
whole design of vacuum, as opposed to, for example, retail index
lookups.

That is a good point. Checking that autovacuum behaves sensibly despite sort order breakage sounds reasonable, but test 005 doesn't do that reliably, because it does nothing to make sure that autovacuum runs against the affected table during the short window when the affected table exists. All the same, I don't see that turning autovacuum off is required. If autovacuum is broken in this regard, we may get occasional, hard to reproduce build farm failures, but that would be more informative than no failures at all.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#97Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#95)
Re: pg_amcheck contrib application

On Tue, Mar 23, 2021 at 12:05 PM Robert Haas <robertmhaas@gmail.com> wrote:

Right, good point. But... does that really apply to
005_opclass_damage.pl? I feel like with the kind of physical damage
we're doing in 003_check.pl, it makes a lot of sense to stop vacuum
from crashing headlong into that table. But, 005 is doing "logical"
damage rather than "physical" damage, and I don't see why autovacuum
should misbehave in that kind of case. In fact, the fact that
autovacuum can handle such cases is one of the selling points for the
whole design of vacuum, as opposed to, for example, retail index
lookups.

FWIW that is only 99.9% true (contrary to what README.HOT says). This
is the case because nbtree page deletion will in fact search the tree
to find a downlink to the target page, which must be removed at the
same time -- see the call to _bt_search() made within nbtpage.c.

This is much less of a problem than you'd think, though, even with an
opclass that gives wrong answers all the time. Because it's also true
that _bt_getstackbuf() is remarkably tolerant when it actually goes to
locate the downlink -- because that happens via a linear search that
matches on downlink block number (it doesn't use the opclass for that
part). This means that we'll accidentally fail to fail if the page is
*somewhere* to the right in the "true" key space. Which probably means
that it has a greater than 50% chance of not failing with a 100%
broken opclass. Which probably makes our odds better with more
plausible levels of misbehavior (e.g. collation changes).
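The block-number matching described above can be illustrated with a toy C sketch. This is not PostgreSQL code and every name in it is hypothetical; it only demonstrates why a linear scan keyed on block number, as in `_bt_getstackbuf()`, is immune to a broken opclass comparator.

```c
#include <assert.h>

/*
 * Toy sketch (all names hypothetical): an internal "page" is an array of
 * items, each pairing a separator key (opclass-compared during descent)
 * with a child block number (compared directly during downlink lookup).
 */
typedef struct Downlink
{
    int key;                    /* separator key; opclass-compared */
    int blkno;                  /* child block number */
} Downlink;

/*
 * Locate the downlink for child_blkno by linear scan.  A broken opclass
 * cannot make this match the wrong item, because the opclass is never
 * consulted here; at worst the earlier descent landed on the wrong page,
 * in which case we return -1 so the caller can back out rather than error.
 */
int
locate_downlink(const Downlink *items, int nitems, int child_blkno)
{
    for (int i = 0; i < nitems; i++)
    {
        if (items[i].blkno == child_blkno)
            return i;
    }
    return -1;                  /* not found: let the caller give up */
}
```

The "return -1" path corresponds to backing out of page deletion without raising an error, as proposed for `_bt_lock_subtree_parent()`.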

That being said, I should make _bt_lock_subtree_parent() return false
and back out of page deletion without raising an error in the case
where we really cannot locate a valid downlink. We really ought to
soldier on when that happens, since we'll do that for a bunch of other
reasons already. I believe that the only reason we throw an error
today is for parity with the page split case (the main
_bt_getstackbuf() call). But this isn't the same situation at all --
this is VACUUM.

I will make this change to HEAD soon, barring objections.

--
Peter Geoghegan

#98Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#97)
Re: pg_amcheck contrib application

Peter Geoghegan <pg@bowt.ie> writes:

That being said, I should make _bt_lock_subtree_parent() return false
and back out of page deletion without raising an error in the case
where we really cannot locate a valid downlink. We really ought to
soldier on when that happens, since we'll do that for a bunch of other
reasons already. I believe that the only reason we throw an error
today is for parity with the page split case (the main
_bt_getstackbuf() call). But this isn't the same situation at all --
this is VACUUM.

I will make this change to HEAD soon, barring objections.

+1. Not deleting the upper page seems better than the alternatives.

regards, tom lane

#99Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#98)
Re: pg_amcheck contrib application

On Tue, Mar 23, 2021 at 12:44 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I will make this change to HEAD soon, barring objections.

+1. Not deleting the upper page seems better than the alternatives.

FWIW it might also work that way as a holdover from the old page
deletion algorithm. These days we decide exactly which pages (leaf
page plus possible internal pages) are to be deleted as a whole up
front (these are a subtree, though usually just a degenerate
single-leaf-page subtree -- internal page deletions are rare).

One of the advantages of this design is that we verify practically all
of the work involved in deleting an entire subtree up-front, inside
_bt_lock_subtree_parent(). It's clearly safe to back out of it if it
looks dicey.

--
Peter Geoghegan

#100Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#99)
Re: pg_amcheck contrib application

On Tue, Mar 23, 2021 at 12:53 PM Peter Geoghegan <pg@bowt.ie> wrote:

One of the advantages of this design is that we verify practically all
of the work involved in deleting an entire subtree up-front, inside
_bt_lock_subtree_parent(). It's clearly safe to back out of it if it
looks dicey.

That's taken care of. I just pushed a commit that teaches
_bt_lock_subtree_parent() to press on.

--
Peter Geoghegan

#101Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Mark Dilger (#93)
2 attachment(s)
Re: pg_amcheck contrib application

On Mar 17, 2021, at 9:00 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

Of the toast pointer fields:

int32 va_rawsize; /* Original data size (includes header) */
int32 va_extsize; /* External saved size (doesn't) */
Oid va_valueid; /* Unique ID of value within TOAST table */
Oid va_toastrelid; /* RelID of TOAST table containing it */

all seem worth getting as part of any toast error message, even if these fields themselves are not corrupt. It just makes it easier to understand the context of the error you're looking at. At first I tried putting these into each message, but it is very wordy to say things like "toast pointer with rawsize %u and extsize %u pointing at relation with oid %u" and such. It made more sense to just add these four fields to the verify_heapam tuple format. That saves putting them in the message text itself, and has the benefit that you could filter the rows coming from verify_heapam() for ones where valueid is or is not null, for example. This changes the external interface of verify_heapam, but I didn't bother with an amcheck--1.3--1.4.sql because amcheck--1.2--1.3.sql was added as part of the v14 development work and has not yet been released. My assumption is that I can just change it, rather than making a new upgrade file.
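The kind of cross-check these fields enable can be sketched in a few lines of C. The struct mirrors the four fields quoted above, but the chunk size and varlena header size are assumed illustration constants, not PostgreSQL's actual TOAST_MAX_CHUNK_SIZE or VARHDRSZ, and neither function is real amcheck code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for varatt_external.  Field names follow the
 * struct quoted above; the two constants below are assumptions made for
 * this sketch only.
 */
#define SKETCH_CHUNK_SIZE 2000
#define SKETCH_VARHDRSZ   4

typedef struct ToastPointer
{
    int32_t  va_rawsize;        /* original data size (includes header) */
    int32_t  va_extsize;        /* external saved size (doesn't) */
    uint32_t va_valueid;        /* unique ID of value within TOAST table */
    uint32_t va_toastrelid;     /* OID of TOAST table containing it */
} ToastPointer;

/* How many chunks the external value should occupy in the toast table. */
int32_t
expected_chunks(const ToastPointer *tp)
{
    return (tp->va_extsize + SKETCH_CHUNK_SIZE - 1) / SKETCH_CHUNK_SIZE;
}

/*
 * The saved (external) size can never legitimately exceed the original
 * size minus the varlena header the external copy no longer carries; a
 * pointer violating this would be reported as corrupt.
 */
bool
extsize_plausible(const ToastPointer *tp)
{
    return tp->va_extsize <= tp->va_rawsize - SKETCH_VARHDRSZ;
}
```

A pointer like the one corrupted in the tests (rawsize 10004, extsize 9999999) fails the plausibility check, while the uncorrupted rawsize 10004 / extsize 10000 pair passes.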

These patches fix the visibility rules and add extra toast checking.

These new patches address the same issues as v9 (which was never committed), and v10 (which was never even posted to this list), with some changes.

Rather than print out all four toast pointer fields for each toast failure, va_rawsize, va_extsize, and va_toastrelid are only mentioned in the corruption message if they are related to the specific corruption. Otherwise, just the va_valueid is mentioned in the corruption message.

The visibility rules fix is different in v11, relying on a visibility check which more closely follows the implementation of HeapTupleSatisfiesVacuumHorizon.

Attachments:

v11-0001-Fixing-amcheck-tuple-visibility-rules.patchapplication/octet-stream; name=v11-0001-Fixing-amcheck-tuple-visibility-rules.patch; x-unix-mode=0644Download
From ccb6f0146445b2fd2205b228c247f0ef2a515dfc Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Tue, 16 Mar 2021 12:32:07 -0700
Subject: [PATCH v11 1/2] Fixing amcheck tuple visibility rules

The implementation of visibility rules in the heap checking code had
diverged considerably from HeapTupleSatisfiesVacuum, upon which it
was based, and with which it was supposed to be compatible.  In
fact, the two functions were not compatible.

Removing the divergent implementation from the heap checking code,
instead copying HeapTupleSatisfiesVacuumHorizon, renaming it,
modifying it to return boolean, to not call SetHintBits, and to not
perform work to distinguish between cases where the same boolean
result will be returned anyway.

Extending amcheck's regression tests with a test case that reliably
reproduces the buggy behavior fixed in this commit, to be sure it
does not come back.
---
 contrib/amcheck/t/001_verify_heapam.pl |  13 +-
 contrib/amcheck/verify_heapam.c        | 292 ++++++++++++-------------
 2 files changed, 150 insertions(+), 155 deletions(-)

diff --git a/contrib/amcheck/t/001_verify_heapam.pl b/contrib/amcheck/t/001_verify_heapam.pl
index 6050feb712..b6fc640a53 100644
--- a/contrib/amcheck/t/001_verify_heapam.pl
+++ b/contrib/amcheck/t/001_verify_heapam.pl
@@ -4,7 +4,7 @@ use warnings;
 use PostgresNode;
 use TestLib;
 
-use Test::More tests => 80;
+use Test::More tests => 128;
 
 my ($node, $result);
 
@@ -17,6 +17,17 @@ $node->append_conf('postgresql.conf', 'autovacuum=off');
 $node->start;
 $node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
 
+#
+# Check for false positives against pg_statistic.  There was a bug in the
+# visibility checking logic that resulted in a consistently reproducible
+# complaint about missing toast table entries for table pg_statistic.  The
+# problem was that main table entries were being checked despite being dead,
+# which is wrong, and though the main table entries were not corrupt, the
+# missing toast was reported.
+#
+$node->safe_psql('postgres', q(ANALYZE));
+check_all_options_uncorrupted('pg_catalog.pg_statistic', 'plain');
+
 #
 # Check a table with data loaded but no corruption, freezing, etc.
 #
diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 6f972e630a..e377b9ab8e 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -150,6 +150,8 @@ static XidBoundsViolation get_xid_status(TransactionId xid,
 										 HeapCheckContext *ctx,
 										 XidCommitStatus *status);
 
+static bool heap_tuple_satisfies_corruption_checking(HeapTupleHeader tuphdr);
+
 /*
  * Scan and report corruption in heap pages, optionally reconciling toasted
  * attributes with entries in the associated toast table.  Intended to be
@@ -658,160 +660,7 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
 	 * HeapTupleSatisfiesVacuum.  Where possible the comments indicate which
 	 * HTSV_Result we think that function might return for this tuple.
 	 */
-	if (!HeapTupleHeaderXminCommitted(tuphdr))
-	{
-		TransactionId raw_xmin = HeapTupleHeaderGetRawXmin(tuphdr);
-
-		if (HeapTupleHeaderXminInvalid(tuphdr))
-			return false;		/* HEAPTUPLE_DEAD */
-		/* Used by pre-9.0 binary upgrades */
-		else if (infomask & HEAP_MOVED_OFF ||
-				 infomask & HEAP_MOVED_IN)
-		{
-			XidCommitStatus status;
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuphdr);
-
-			switch (get_xid_status(xvac, ctx, &status))
-			{
-				case XID_INVALID:
-					report_corruption(ctx,
-									  pstrdup("old-style VACUUM FULL transaction ID is invalid"));
-					return false;	/* corrupt */
-				case XID_IN_FUTURE:
-					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u equals or exceeds next valid transaction ID %u:%u",
-											   xvac,
-											   EpochFromFullTransactionId(ctx->next_fxid),
-											   XidFromFullTransactionId(ctx->next_fxid)));
-					return false;	/* corrupt */
-				case XID_PRECEDES_RELMIN:
-					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u precedes relation freeze threshold %u:%u",
-											   xvac,
-											   EpochFromFullTransactionId(ctx->relfrozenfxid),
-											   XidFromFullTransactionId(ctx->relfrozenfxid)));
-					return false;	/* corrupt */
-					break;
-				case XID_PRECEDES_CLUSTERMIN:
-					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u precedes oldest valid transaction ID %u:%u",
-											   xvac,
-											   EpochFromFullTransactionId(ctx->oldest_fxid),
-											   XidFromFullTransactionId(ctx->oldest_fxid)));
-					return false;	/* corrupt */
-					break;
-				case XID_BOUNDS_OK:
-					switch (status)
-					{
-						case XID_IN_PROGRESS:
-							return true;	/* HEAPTUPLE_DELETE_IN_PROGRESS */
-						case XID_COMMITTED:
-						case XID_ABORTED:
-							return false;	/* HEAPTUPLE_DEAD */
-					}
-			}
-		}
-		else
-		{
-			XidCommitStatus status;
-
-			switch (get_xid_status(raw_xmin, ctx, &status))
-			{
-				case XID_INVALID:
-					report_corruption(ctx,
-									  pstrdup("raw xmin is invalid"));
-					return false;
-				case XID_IN_FUTURE:
-					report_corruption(ctx,
-									  psprintf("raw xmin %u equals or exceeds next valid transaction ID %u:%u",
-											   raw_xmin,
-											   EpochFromFullTransactionId(ctx->next_fxid),
-											   XidFromFullTransactionId(ctx->next_fxid)));
-					return false;	/* corrupt */
-				case XID_PRECEDES_RELMIN:
-					report_corruption(ctx,
-									  psprintf("raw xmin %u precedes relation freeze threshold %u:%u",
-											   raw_xmin,
-											   EpochFromFullTransactionId(ctx->relfrozenfxid),
-											   XidFromFullTransactionId(ctx->relfrozenfxid)));
-					return false;	/* corrupt */
-				case XID_PRECEDES_CLUSTERMIN:
-					report_corruption(ctx,
-									  psprintf("raw xmin %u precedes oldest valid transaction ID %u:%u",
-											   raw_xmin,
-											   EpochFromFullTransactionId(ctx->oldest_fxid),
-											   XidFromFullTransactionId(ctx->oldest_fxid)));
-					return false;	/* corrupt */
-				case XID_BOUNDS_OK:
-					switch (status)
-					{
-						case XID_COMMITTED:
-							break;
-						case XID_IN_PROGRESS:
-							return true;	/* insert or delete in progress */
-						case XID_ABORTED:
-							return false;	/* HEAPTUPLE_DEAD */
-					}
-			}
-		}
-	}
-
-	if (!(infomask & HEAP_XMAX_INVALID) && !HEAP_XMAX_IS_LOCKED_ONLY(infomask))
-	{
-		if (infomask & HEAP_XMAX_IS_MULTI)
-		{
-			XidCommitStatus status;
-			TransactionId xmax = HeapTupleGetUpdateXid(tuphdr);
-
-			switch (get_xid_status(xmax, ctx, &status))
-			{
-					/* not LOCKED_ONLY, so it has to have an xmax */
-				case XID_INVALID:
-					report_corruption(ctx,
-									  pstrdup("xmax is invalid"));
-					return false;	/* corrupt */
-				case XID_IN_FUTURE:
-					report_corruption(ctx,
-									  psprintf("xmax %u equals or exceeds next valid transaction ID %u:%u",
-											   xmax,
-											   EpochFromFullTransactionId(ctx->next_fxid),
-											   XidFromFullTransactionId(ctx->next_fxid)));
-					return false;	/* corrupt */
-				case XID_PRECEDES_RELMIN:
-					report_corruption(ctx,
-									  psprintf("xmax %u precedes relation freeze threshold %u:%u",
-											   xmax,
-											   EpochFromFullTransactionId(ctx->relfrozenfxid),
-											   XidFromFullTransactionId(ctx->relfrozenfxid)));
-					return false;	/* corrupt */
-				case XID_PRECEDES_CLUSTERMIN:
-					report_corruption(ctx,
-									  psprintf("xmax %u precedes oldest valid transaction ID %u:%u",
-											   xmax,
-											   EpochFromFullTransactionId(ctx->oldest_fxid),
-											   XidFromFullTransactionId(ctx->oldest_fxid)));
-					return false;	/* corrupt */
-				case XID_BOUNDS_OK:
-					switch (status)
-					{
-						case XID_IN_PROGRESS:
-							return true;	/* HEAPTUPLE_DELETE_IN_PROGRESS */
-						case XID_COMMITTED:
-						case XID_ABORTED:
-							return false;	/* HEAPTUPLE_RECENTLY_DEAD or
-											 * HEAPTUPLE_DEAD */
-					}
-			}
-
-			/* Ok, the tuple is live */
-		}
-		else if (!(infomask & HEAP_XMAX_COMMITTED))
-			return true;		/* HEAPTUPLE_DELETE_IN_PROGRESS or
-								 * HEAPTUPLE_LIVE */
-		else
-			return false;		/* HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD */
-	}
-	return true;				/* not dead */
+	return heap_tuple_satisfies_corruption_checking(tuphdr);
 }
 
 /*
@@ -1461,3 +1310,138 @@ get_xid_status(TransactionId xid, HeapCheckContext *ctx,
 	ctx->cached_status = *status;
 	return XID_BOUNDS_OK;
 }
+
+/*
+ * heap_tuple_satisfies_corruption_checking
+ *
+ * Determine the visibility of tuples for corruption checking purposes.  If a
+ * tuple might not be visible to any running transaction, then we must not
+ * check it.  This function is based heavily on
+ * HeapTupleSatisfiesVacuumHorizon, with comments indicating what we think that
+ * function would return for the tuple.
+ *
+ * Callers should first check for tuple header corruption that might cause this
+ * function to error or assert prior to calling.  Changing this function to be
+ * more robust against errors is less desirable than hardening the code prior
+ * to this function, as we don't want this function to diverge from
+ * HeapTupleSatisfiesVacuumHorizon more than necessary.
+ *
+ * Reliance on hint bits is a bit dubious during corruption checking, as the
+ * hint bits in question may themselves be corrupt.  We rely on them only to
+ * the extent that other visibility functions do, as if other functions say a
+ * tuple is visible, then it had better not be corrupt, else regular code may
+ * hit the corruption.
+ *
+ * Returns whether the tuple may be checked.
+ */
+bool
+heap_tuple_satisfies_corruption_checking(HeapTupleHeader tuphdr)
+{
+	/*
+	 * Has inserting transaction committed?
+	 *
+	 * If the inserting transaction aborted, then the tuple was never visible
+	 * to any other transaction.
+	 */
+	if (!HeapTupleHeaderXminCommitted(tuphdr))
+	{
+		if (HeapTupleHeaderXminInvalid(tuphdr))
+			return false;		/* HEAPTUPLE_DEAD */
+		/* Used by pre-9.0 binary upgrades */
+		else if (tuphdr->t_infomask & HEAP_MOVED_OFF)
+		{
+			TransactionId xvac = HeapTupleHeaderGetXvac(tuphdr);
+
+			if (TransactionIdIsCurrentTransactionId(xvac))
+				return true;	/* HEAPTUPLE_DELETE_IN_PROGRESS */
+			if (TransactionIdIsInProgress(xvac))
+				return true;	/* HEAPTUPLE_DELETE_IN_PROGRESS */
+			if (TransactionIdDidCommit(xvac))
+				return false;	/* HEAPTUPLE_DEAD */
+		}
+		/* Used by pre-9.0 binary upgrades */
+		else if (tuphdr->t_infomask & HEAP_MOVED_IN)
+		{
+			TransactionId xvac = HeapTupleHeaderGetXvac(tuphdr);
+
+			if (TransactionIdIsCurrentTransactionId(xvac))
+				return false;	/* HEAPTUPLE_INSERT_IN_PROGRESS */
+			if (TransactionIdIsInProgress(xvac))
+				return false;	/* HEAPTUPLE_INSERT_IN_PROGRESS */
+			if (!TransactionIdDidCommit(xvac))
+				return false;	/* HEAPTUPLE_DEAD */
+		}
+		else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuphdr)))
+		{
+			if (tuphdr->t_infomask & HEAP_XMAX_INVALID) /* xid invalid */
+				return false;	/* HEAPTUPLE_INSERT_IN_PROGRESS */
+			/* only locked? run infomask-only check first, for performance */
+			if (HEAP_XMAX_IS_LOCKED_ONLY(tuphdr->t_infomask) ||
+				HeapTupleHeaderIsOnlyLocked(tuphdr))
+				return false;	/* HEAPTUPLE_INSERT_IN_PROGRESS */
+			/* inserted and then deleted by same xact */
+			if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tuphdr)))
+				return true;	/* HEAPTUPLE_DELETE_IN_PROGRESS */
+			/* deleting subtransaction must have aborted */
+			return false;		/* HEAPTUPLE_INSERT_IN_PROGRESS */
+		}
+		else if (TransactionIdIsInProgress(HeapTupleHeaderGetRawXmin(tuphdr)))
+		{
+			return false;		/* HEAPTUPLE_INSERT_IN_PROGRESS */
+		}
+		else if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuphdr)))
+			return false;		/* HEAPTUPLE_DEAD */
+
+		/* At this point the xmin is known committed */
+	}
+
+	/*
+	 * Okay, the inserter committed, so it was good at some point.  Now what
+	 * about the deleting transaction?
+	 */
+	if (tuphdr->t_infomask & HEAP_XMAX_INVALID)
+		return true;			/* HEAPTUPLE_LIVE */
+
+	if (HEAP_XMAX_IS_LOCKED_ONLY(tuphdr->t_infomask))
+		return true;			/* HEAPTUPLE_LIVE */
+
+	if (tuphdr->t_infomask & HEAP_XMAX_IS_MULTI)
+	{
+		TransactionId xmax = HeapTupleGetUpdateXid(tuphdr);
+
+		/* already checked above */
+		Assert(!HEAP_XMAX_IS_LOCKED_ONLY(tuphdr->t_infomask));
+
+		/* not LOCKED_ONLY, so it has to have an xmax */
+		Assert(TransactionIdIsValid(xmax));
+
+		if (TransactionIdIsInProgress(xmax))
+			return true;		/* HEAPTUPLE_DELETE_IN_PROGRESS */
+		else if (TransactionIdDidCommit(xmax))
+		{
+			return false;		/* HEAPTUPLE_RECENTLY_DEAD */
+		}
+
+		return true;			/* HEAPTUPLE_LIVE */
+	}
+
+	if (!(tuphdr->t_infomask & HEAP_XMAX_COMMITTED))
+	{
+		if (TransactionIdIsInProgress(HeapTupleHeaderGetRawXmax(tuphdr)))
+			return true;		/* HEAPTUPLE_DELETE_IN_PROGRESS */
+		else if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmax(tuphdr)))
+
+			/*
+			 * Not in Progress, Not Committed, so either Aborted or crashed
+			 */
+			return true;		/* HEAPTUPLE_LIVE */
+
+		/* At this point the xmax is known committed */
+	}
+
+	/*
+	 * Deleter committed, and it may have been recent enough that some open
+	 * transactions could still see the tuple, but don't count on that.
+	 */
+	return false;				/* HEAPTUPLE_RECENTLY_DEAD */
+}
-- 
2.21.1 (Apple Git-122.3)

v11-0002-pg_amcheck-extend-toast-corruption-reports.patchapplication/octet-stream; name=v11-0002-pg_amcheck-extend-toast-corruption-reports.patch; x-unix-mode=0644Download
From d3872df7b117425e21c055e4459de1bfd24ddeaa Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Tue, 23 Mar 2021 22:57:50 -0700
Subject: [PATCH v11 2/2] pg_amcheck: extend toast corruption reports

Modify amcheck to perform additional toast corruption checks.  Fix
some amcheck messages not to include the attribute number, which is
redundant given that attnum is already one of the returned columns.
Fix others to include the toast value ID in the message text.

Commit bbe0a81db69bd10bd166907c3701492a29aca294 introduced lz4 as a
toast compression method.  Add checks for the new compression method
ID field and report corruption on unexpected methods and for
compression method TOAST_LZ4_COMPRESSION_ID if encountered on a
server built without lz4.  Stop short of calling the appropriate
decompression function on the toasted attribute, as users may not
appreciate the extra CPU overhead that entails, unless compiled with
new symbol DECOMPRESSION_CORRUPTION_CHECKING defined, in which case
decompression failures are reported as corruption.
---
 contrib/amcheck/Makefile                  |   4 +
 contrib/amcheck/verify_heapam.c           | 327 ++++++++++++++++------
 src/bin/pg_amcheck/t/004_verify_heapam.pl |  70 ++++-
 3 files changed, 312 insertions(+), 89 deletions(-)

diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index b82f221e50..78c4051096 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -12,6 +12,10 @@ PGFILEDESC = "amcheck - function for verifying relation integrity"
 
 REGRESS = check check_btree check_heap
 
+## Uncomment if compiling with DECOMPRESSION_CORRUPTION_CHECKING
+#
+# SHLIB_LINK += $(filter -llz4, $(LIBS))
+
 TAP_TESTS = 1
 
 ifdef USE_PGXS
diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index e377b9ab8e..1f80db82f4 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -10,6 +10,10 @@
  */
 #include "postgres.h"
 
+#ifdef USE_LZ4
+#include <lz4.h>
+#endif
+
 #include "access/detoast.h"
 #include "access/genam.h"
 #include "access/heapam.h"
@@ -18,6 +22,7 @@
 #include "access/toast_internals.h"
 #include "access/visibilitymap.h"
 #include "catalog/pg_am.h"
+#include "common/pg_lzcompress.h"
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
@@ -114,6 +119,8 @@ typedef struct HeapCheckContext
 	AttrNumber	attnum;
 
 	/* Values for iterating over toast for the attribute */
+	struct varatt_external toast_pointer;
+	bool		checking_toastptr;
 	int32		chunkno;
 	int32		attrsize;
 	int32		endchunk;
@@ -130,11 +137,11 @@ typedef struct HeapCheckContext
 /* Internal implementation */
 static void sanity_check_relation(Relation rel);
 static void check_tuple(HeapCheckContext *ctx);
-static void check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx);
+static int32 check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
+							   bool *toast_error);
 
 static bool check_tuple_attribute(HeapCheckContext *ctx);
-static bool check_tuple_header_and_visibilty(HeapTupleHeader tuphdr,
-											 HeapCheckContext *ctx);
+static bool check_tuple_header(HeapTupleHeader tuphdr, HeapCheckContext *ctx);
 
 static void report_corruption(HeapCheckContext *ctx, char *msg);
 static TupleDesc verify_heapam_tupdesc(void);
@@ -345,6 +352,7 @@ verify_heapam(PG_FUNCTION_ARGS)
 	if (TransactionIdIsNormal(ctx.relfrozenxid))
 		ctx.oldest_xid = ctx.relfrozenxid;
 
+	ctx.checking_toastptr = false;
 	for (ctx.blkno = first_block; ctx.blkno <= last_block; ctx.blkno++)
 	{
 		OffsetNumber maxoff;
@@ -557,16 +565,11 @@ verify_heapam_tupdesc(void)
 }
 
 /*
- * Check for tuple header corruption and tuple visibility.
- *
- * Since we do not hold a snapshot, tuple visibility is not a question of
- * whether we should be able to see the tuple relative to any particular
- * snapshot, but rather a question of whether it is safe and reasonable to
- * check the tuple attributes.
+ * Check for tuple header corruption.
  *
  * Some kinds of corruption make it unsafe to check the tuple attributes, for
  * example when the line pointer refers to a range of bytes outside the page.
- * In such cases, we return false (not visible) after recording appropriate
+ * In such cases, we return false (not checkable) after recording appropriate
  * corruption messages.
  *
  * Some other kinds of tuple header corruption confuse the question of where
@@ -581,23 +584,11 @@ verify_heapam_tupdesc(void)
  * messages for them but do not base our visibility determination on them.  (In
  * other words, we do not return false merely because we detected them.)
  *
- * For visibility determination not specifically related to corruption, what we
- * want to know is if a tuple is potentially visible to any running
- * transaction.  If you are tempted to replace this function's visibility logic
- * with a call to another visibility checking function, keep in mind that this
- * function does not update hint bits, as it seems imprudent to write hint bits
- * (or anything at all) to a table during a corruption check.  Nor does this
- * function bother classifying tuple visibility beyond a boolean visible vs.
- * not visible.
- *
- * The caller should already have checked that xmin and xmax are not out of
- * bounds for the relation.
- *
- * Returns whether the tuple is both visible and sufficiently sensible to
- * undergo attribute checks.
+ * Returns whether the tuple is sufficiently sensible to undergo attribute
+ * checks.
  */
 static bool
-check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
+check_tuple_header(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
 {
 	uint16		infomask = tuphdr->t_infomask;
 	bool		header_garbled = false;
@@ -653,14 +644,7 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
 	if (header_garbled)
 		return false;			/* checking of this tuple should not continue */
 
-	/*
-	 * Ok, we can examine the header for tuple visibility purposes, though we
-	 * still need to be careful about a few remaining types of header
-	 * corruption.  This logic roughly follows that of
-	 * HeapTupleSatisfiesVacuum.  Where possible the comments indicate which
-	 * HTSV_Result we think that function might return for this tuple.
-	 */
-	return heap_tuple_satisfies_corruption_checking(tuphdr);
+	return true;
 }
 
 /*
@@ -673,9 +657,12 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
  * tuples that store the toasted value are retrieved and checked in order, with
  * each toast tuple being checked against where we are in the sequence, as well
  * as each toast tuple having its varlena structure sanity checked.
+ *
+ * Returns the size of the chunk, not including the header, or zero if it
+ * cannot be determined due to corruption.
  */
-static void
-check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
+static int32
+check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx, bool *toast_error)
 {
 	int32		curchunk;
 	Pointer		chunk;
@@ -692,7 +679,8 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 	{
 		report_corruption(ctx,
 						  pstrdup("toast chunk sequence number is null"));
-		return;
+		*toast_error = true;
+		return 0;
 	}
 	chunk = DatumGetPointer(fastgetattr(toasttup, 3,
 										ctx->toast_rel->rd_att, &isnull));
@@ -700,7 +688,8 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 	{
 		report_corruption(ctx,
 						  pstrdup("toast chunk data is null"));
-		return;
+		*toast_error = true;
+		return 0;
 	}
 	if (!VARATT_IS_EXTENDED(chunk))
 		chunksize = VARSIZE(chunk) - VARHDRSZ;
@@ -717,9 +706,11 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 		uint32		header = ((varattrib_4b *) chunk)->va_4byte.va_header;
 
 		report_corruption(ctx,
-						  psprintf("corrupt extended toast chunk has invalid varlena header: %0x (sequence number %d)",
-								   header, curchunk));
-		return;
+						  psprintf("toast value ID %u corrupt extended chunk has invalid varlena header: %0x (sequence number %d)",
+								   ctx->toast_pointer.va_valueid, header,
+								   curchunk));
+		*toast_error = true;
+		return 0;
 	}
 
 	/*
@@ -728,16 +719,20 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 	if (curchunk != ctx->chunkno)
 	{
 		report_corruption(ctx,
-						  psprintf("toast chunk sequence number %u does not match the expected sequence number %u",
-								   curchunk, ctx->chunkno));
-		return;
+						  psprintf("toast value ID %u chunk sequence number %u does not match the expected sequence number %u",
+								   ctx->toast_pointer.va_valueid, curchunk,
+								   ctx->chunkno));
+		*toast_error = true;
+		return chunksize;
 	}
 	if (curchunk > ctx->endchunk)
 	{
 		report_corruption(ctx,
-						  psprintf("toast chunk sequence number %u exceeds the end chunk sequence number %u",
-								   curchunk, ctx->endchunk));
-		return;
+						  psprintf("toast value ID %u chunk sequence number %u exceeds the end chunk sequence number %u",
+								   ctx->toast_pointer.va_valueid, curchunk,
+								   ctx->endchunk));
+		*toast_error = true;
+		return chunksize;
 	}
 
 	expected_size = curchunk < ctx->totalchunks - 1 ? TOAST_MAX_CHUNK_SIZE
@@ -745,10 +740,14 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 	if (chunksize != expected_size)
 	{
 		report_corruption(ctx,
-						  psprintf("toast chunk size %u differs from the expected size %u",
-								   chunksize, expected_size));
-		return;
+						  psprintf("toast value ID %u chunk size %u differs from the expected size %u",
+								   ctx->toast_pointer.va_valueid, chunksize,
+								   expected_size));
+		*toast_error = true;
+		return chunksize;
 	}
+
+	return chunksize;
 }
 
 /*
@@ -774,18 +773,21 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 static bool
 check_tuple_attribute(HeapCheckContext *ctx)
 {
-	struct varatt_external toast_pointer;
 	ScanKeyData toastkey;
 	SysScanDesc toastscan;
 	SnapshotData SnapshotToast;
 	HeapTuple	toasttup;
+	int64		toastsize;		/* corrupt toast could overflow 32 bits */
 	bool		found_toasttup;
+	bool		toast_error;
 	Datum		attdatum;
 	struct varlena *attr;
 	char	   *tp;				/* pointer to the tuple data */
 	uint16		infomask;
 	Form_pg_attribute thisatt;
 
+	Assert(!ctx->checking_toastptr);
+
 	infomask = ctx->tuphdr->t_infomask;
 	thisatt = TupleDescAttr(RelationGetDescr(ctx->rel), ctx->attnum);
 
@@ -794,8 +796,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
 	{
 		report_corruption(ctx,
-						  psprintf("attribute %u with length %u starts at offset %u beyond total tuple length %u",
-								   ctx->attnum,
+						  psprintf("attribute with length %u starts at offset %u beyond total tuple length %u",
 								   thisatt->attlen,
 								   ctx->tuphdr->t_hoff + ctx->offset,
 								   ctx->lp_len));
@@ -815,8 +816,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 		if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
 		{
 			report_corruption(ctx,
-							  psprintf("attribute %u with length %u ends at offset %u beyond total tuple length %u",
-									   ctx->attnum,
+							  psprintf("attribute with length %u ends at offset %u beyond total tuple length %u",
 									   thisatt->attlen,
 									   ctx->tuphdr->t_hoff + ctx->offset,
 									   ctx->lp_len));
@@ -848,8 +848,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 		if (va_tag != VARTAG_ONDISK)
 		{
 			report_corruption(ctx,
-							  psprintf("toasted attribute %u has unexpected TOAST tag %u",
-									   ctx->attnum,
+							  psprintf("toasted attribute has unexpected TOAST tag %u",
 									   va_tag));
 			/* We can't know where the next attribute begins */
 			return false;
@@ -863,8 +862,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
 	{
 		report_corruption(ctx,
-						  psprintf("attribute %u with length %u ends at offset %u beyond total tuple length %u",
-								   ctx->attnum,
+						  psprintf("attribute with length %u ends at offset %u beyond total tuple length %u",
 								   thisatt->attlen,
 								   ctx->tuphdr->t_hoff + ctx->offset,
 								   ctx->lp_len));
@@ -891,12 +889,20 @@ check_tuple_attribute(HeapCheckContext *ctx)
 
 	/* It is external, and we're looking at a page on disk */
 
+	/*
+	 * Must copy attr into toast_pointer for alignment considerations
+	 */
+	VARATT_EXTERNAL_GET_POINTER(ctx->toast_pointer, attr);
+	ctx->checking_toastptr = true;
+	toast_error = false;
+
 	/* The tuple header better claim to contain toasted values */
 	if (!(infomask & HEAP_HASEXTERNAL))
 	{
 		report_corruption(ctx,
-						  psprintf("attribute %u is external but tuple header flag HEAP_HASEXTERNAL not set",
-								   ctx->attnum));
+						  psprintf("toast value ID %u is external but tuple header flag HEAP_HASEXTERNAL not set",
+								   ctx->toast_pointer.va_valueid));
+		ctx->checking_toastptr = false;
 		return true;
 	}
 
@@ -904,21 +910,41 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	if (!ctx->rel->rd_rel->reltoastrelid)
 	{
 		report_corruption(ctx,
-						  psprintf("attribute %u is external but relation has no toast relation",
-								   ctx->attnum));
+						  psprintf("toast value ID %u is external but relation has no toast relation",
+								   ctx->toast_pointer.va_valueid));
+		ctx->checking_toastptr = false;
+		return true;
+	}
+
+	if (VARATT_EXTERNAL_GET_EXTSIZE(ctx->toast_pointer) > ctx->toast_pointer.va_rawsize - VARHDRSZ)
+	{
+		report_corruption(ctx,
+						  psprintf("toast value ID %u external size %u exceeds maximum expected for rawsize %u",
+								   ctx->toast_pointer.va_valueid,
+								   VARATT_EXTERNAL_GET_EXTSIZE(ctx->toast_pointer),
+								   ctx->toast_pointer.va_rawsize));
+		toast_error = true;
+	}
+
+	if (ctx->toast_pointer.va_toastrelid != ctx->rel->rd_rel->reltoastrelid)
+	{
+		report_corruption(ctx,
+						  psprintf("toast value ID %u toast relation oid %u differs from expected oid %u",
+								   ctx->toast_pointer.va_valueid,
+								   ctx->toast_pointer.va_toastrelid,
+								   ctx->rel->rd_rel->reltoastrelid));
+		ctx->checking_toastptr = false;
 		return true;
 	}
 
 	/* If we were told to skip toast checking, then we're done. */
 	if (ctx->toast_rel == NULL)
+	{
+		ctx->checking_toastptr = false;
 		return true;
+	}
 
-	/*
-	 * Must copy attr into toast_pointer for alignment considerations
-	 */
-	VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
-
-	ctx->attrsize = VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer);
+	ctx->attrsize = VARATT_EXTERNAL_GET_EXTSIZE(ctx->toast_pointer);
 	ctx->endchunk = (ctx->attrsize - 1) / TOAST_MAX_CHUNK_SIZE;
 	ctx->totalchunks = ctx->endchunk + 1;
 
@@ -928,7 +954,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	ScanKeyInit(&toastkey,
 				(AttrNumber) 1,
 				BTEqualStrategyNumber, F_OIDEQ,
-				ObjectIdGetDatum(toast_pointer.va_valueid));
+				ObjectIdGetDatum(ctx->toast_pointer.va_valueid));
 
 	/*
 	 * Check if any chunks for this toasted object exist in the toast table,
@@ -941,24 +967,140 @@ check_tuple_attribute(HeapCheckContext *ctx)
 										   &toastkey);
 	ctx->chunkno = 0;
 	found_toasttup = false;
+	toastsize = 0;
 	while ((toasttup =
 			systable_getnext_ordered(toastscan,
 									 ForwardScanDirection)) != NULL)
 	{
 		found_toasttup = true;
-		check_toast_tuple(toasttup, ctx);
+		toastsize += check_toast_tuple(toasttup, ctx, &toast_error);
 		ctx->chunkno++;
 	}
+	systable_endscan_ordered(toastscan);
+
 	if (!found_toasttup)
 		report_corruption(ctx,
-						  psprintf("toasted value for attribute %u missing from toast table",
-								   ctx->attnum));
+						  psprintf("toasted value ID %u missing from toast table",
+								   ctx->toast_pointer.va_valueid));
 	else if (ctx->chunkno != (ctx->endchunk + 1))
 		report_corruption(ctx,
-						  psprintf("final toast chunk number %u differs from expected value %u",
-								   ctx->chunkno, (ctx->endchunk + 1)));
-	systable_endscan_ordered(toastscan);
+						  psprintf("toast value ID %u final chunk number %u differs from expected value %u",
+								   ctx->toast_pointer.va_valueid, ctx->chunkno,
+								   (ctx->endchunk + 1)));
+	else if (toastsize != VARATT_EXTERNAL_GET_EXTSIZE(ctx->toast_pointer))
+		report_corruption(ctx,
+						  psprintf("toast value ID %u total toast size " INT64_FORMAT " differs from expected size %u",
+								   ctx->toast_pointer.va_valueid, toastsize,
+								   VARATT_EXTERNAL_GET_EXTSIZE(ctx->toast_pointer)));
+	else if (!toast_error)
+	{
+		if (!AllocSizeIsValid(ctx->toast_pointer.va_rawsize))
+		{
+			report_corruption(ctx,
+							  psprintf("toast value ID %u rawsize %u too large to be allocated",
+									   ctx->toast_pointer.va_valueid,
+									   ctx->toast_pointer.va_rawsize));
+			toast_error = true;
+		}
+
+		if (!AllocSizeIsValid(VARATT_EXTERNAL_GET_EXTSIZE(ctx->toast_pointer)))
+		{
+			report_corruption(ctx,
+							  psprintf("toast value ID %u extsize %u too large to be allocated",
+									   ctx->toast_pointer.va_valueid,
+									   VARATT_EXTERNAL_GET_EXTSIZE(ctx->toast_pointer)));
+			toast_error = true;
+		}
+
+		if (!toast_error)
+		{
+			Size		allocsize;
+			struct varlena *attr;
+
+			/* Fetch all chunks */
+			allocsize = VARATT_EXTERNAL_GET_EXTSIZE(ctx->toast_pointer) + VARHDRSZ;
+			attr = (struct varlena *) palloc(allocsize);
+			if (VARATT_EXTERNAL_IS_COMPRESSED(ctx->toast_pointer))
+				SET_VARSIZE_COMPRESSED(attr, allocsize);
+			else
+				SET_VARSIZE(attr, allocsize);
+
+			table_relation_fetch_toast_slice(ctx->toast_rel, ctx->toast_pointer.va_valueid,
+											 toastsize, 0, toastsize, attr);
 
+			if (VARATT_IS_COMPRESSED(attr))
+			{
+#ifdef DECOMPRESSION_CORRUPTION_CHECKING
+				struct varlena *uncompressed;
+				int32		rawsize;
+#endif
+				Size		allocsize;
+				ToastCompressionId cmid;
+
+				/* allocate memory for the uncompressed data */
+				allocsize = VARDATA_COMPRESSED_GET_EXTSIZE(attr) + VARHDRSZ;
+				if (!AllocSizeIsValid(allocsize))
+					report_corruption(ctx,
+									  psprintf("toast value ID %u invalid uncompressed size %zu",
+											   ctx->toast_pointer.va_valueid,
+											   allocsize));
+				cmid = TOAST_COMPRESS_METHOD(attr);
+				switch (cmid)
+				{
+					case TOAST_PGLZ_COMPRESSION_ID:
+#ifdef DECOMPRESSION_CORRUPTION_CHECKING
+						/* decompress the data */
+						uncompressed = (struct varlena *) palloc(allocsize);
+						rawsize = pglz_decompress((char *) attr + VARHDRSZ_COMPRESSED,
+												  VARSIZE(attr) - VARHDRSZ_COMPRESSED,
+												  VARDATA(uncompressed),
+												  VARDATA_COMPRESSED_GET_EXTSIZE(attr), true);
+						if (rawsize < 0)
+							report_corruption(ctx,
+											  psprintf("toast value ID %u compressed pglz data is corrupt",
+													   ctx->toast_pointer.va_valueid));
+						pfree(uncompressed);
+#endif
+						break;
+					case TOAST_LZ4_COMPRESSION_ID:
+#ifndef USE_LZ4
+						report_corruption(ctx,
+										  psprintf("toast value ID %u unsupported LZ4 compression method",
+												   ctx->toast_pointer.va_valueid));
+#else
+#ifdef DECOMPRESSION_CORRUPTION_CHECKING
+						/* decompress the data */
+						uncompressed = (struct varlena *) palloc(allocsize);
+						rawsize = LZ4_decompress_safe((char *) attr + VARHDRSZ_COMPRESSED,
+													  VARDATA(uncompressed),
+													  VARSIZE(attr) - VARHDRSZ_COMPRESSED,
+													  VARDATA_COMPRESSED_GET_EXTSIZE(attr));
+						if (rawsize < 0)
+							report_corruption(ctx,
+											  psprintf("toast value ID %u compressed lz4 data is corrupt",
+													   ctx->toast_pointer.va_valueid));
+						pfree(uncompressed);
+#endif
+#endif
+						break;
+					default:
+						report_corruption(ctx,
+										  psprintf("toast value ID %u invalid compression method id %d",
+												   ctx->toast_pointer.va_valueid,
+												   cmid));
+				}
+			}
+			else if (VARSIZE(attr) != ctx->toast_pointer.va_rawsize)
+				report_corruption(ctx,
+								  psprintf("toast value ID %u detoasted attribute size %u differs from expected rawsize %u",
+										   ctx->toast_pointer.va_valueid,
+										   VARSIZE(attr),
+										   ctx->toast_pointer.va_rawsize));
+			pfree(attr);
+		}
+	}
+
+	ctx->checking_toastptr = false;
 	return true;
 }
 
@@ -1093,10 +1235,16 @@ check_tuple(HeapCheckContext *ctx)
 
 	/*
 	 * Check various forms of tuple header corruption.  If the header is too
-	 * corrupt to continue checking, or if the tuple is not visible to anyone,
-	 * we cannot continue with other checks.
+	 * corrupt to continue checking, we cannot continue with other checks.
 	 */
-	if (!check_tuple_header_and_visibilty(ctx->tuphdr, ctx))
+	if (!check_tuple_header(ctx->tuphdr, ctx))
+		return;
+
+	/*
+	 * If we know the tuple is visible to at least some running transactions,
+	 * we can check it.  Otherwise, we are done with this tuple.
+	 */
+	if (!heap_tuple_satisfies_corruption_checking(ctx->tuphdr))
 		return;
 
 	/*
@@ -1314,12 +1462,25 @@ get_xid_status(TransactionId xid, HeapCheckContext *ctx,
 /*
  * heap_tuple_satisfies_corruption_checking
  *
- * Determine the visibility of tuples for corruption checking purposes.  If a
- * tuple might not be visible to any running transaction, then we must not
- * check it.  This function is based heavily on
- * HeapTupleSatisfiesVacuumHorizon, with comments indicating what we think that
- * function would return for the tuple.
+ * Since we do not hold a snapshot, tuple visibility is not a question of
+ * whether we should be able to see the tuple relative to any particular
+ * snapshot, but rather a question of whether it is safe and reasonable to
+ * check the tuple attributes.
+ *
+ * For visibility determination, what we want to know is if a tuple is
+ * potentially visible to any running transaction.  If you are tempted to
+ * replace this function's visibility logic with a call to another visibility
+ * checking function, keep in mind that this function does not update hint
+ * bits, as it seems imprudent to write hint bits (or anything at all) to a
+ * table during a corruption check.  Nor does this function bother classifying
+ * tuple visibility beyond a boolean visible vs. not visible.
+ *
+ * The caller should already have checked that xmin and xmax are not out of
+ * bounds for the relation, and that the tuple header is not too garbled for
+ * header fields to be consulted.
  *
+ * This function is based heavily on HeapTupleSatisfiesVacuumHorizon, with
+ * comments indicating what we think that function would return for the tuple.
  * Callers should first check for tuple header corruption that might cause this
  * function to error or assert prior to calling.  Changing this function to be
  * more robust against errors is less desireable than hardening the code prior
diff --git a/src/bin/pg_amcheck/t/004_verify_heapam.pl b/src/bin/pg_amcheck/t/004_verify_heapam.pl
index 36607596b1..ef5043ba30 100644
--- a/src/bin/pg_amcheck/t/004_verify_heapam.pl
+++ b/src/bin/pg_amcheck/t/004_verify_heapam.pl
@@ -224,7 +224,7 @@ my $rel = $node->safe_psql('postgres', qq(SELECT pg_relation_filepath('public.te
 my $relpath = "$pgdata/$rel";
 
 # Insert data and freeze public.test
-use constant ROWCOUNT => 16;
+use constant ROWCOUNT => 21;
 $node->safe_psql('postgres', qq(
 	INSERT INTO public.test (a, b, c)
 		VALUES (
@@ -259,6 +259,13 @@ select lp_off from heap_page_items(get_raw_page('test', 'main', 0))
 	offset $tup limit 1)));
 }
 
+# Find our toast relation id
+my $toastrelid = $node->safe_psql('postgres', qq(
+	SELECT c.reltoastrelid
+		FROM pg_catalog.pg_class c
+		WHERE c.oid = 'public.test'::regclass
+		));
+
 # Sanity check that our 'test' table on disk layout matches expectations.  If
 # this is not so, we will have to skip the test until somebody updates the test
 # to work on this platform.
@@ -296,7 +303,7 @@ close($file)
 $node->start;
 
 # Ok, Xids and page layout look ok.  We can run corruption tests.
-plan tests => 19;
+plan tests => 29;
 
 # Check that pg_amcheck runs against the uncorrupted table without error.
 $node->command_ok(['pg_amcheck', '-p', $port, 'postgres'],
@@ -310,6 +317,7 @@ $node->stop;
 
 # Some #define constants from access/htup_details.h for use while corrupting.
 use constant HEAP_HASNULL            => 0x0001;
+use constant HEAP_HASEXTERNAL        => 0x0004;
 use constant HEAP_XMAX_LOCK_ONLY     => 0x0080;
 use constant HEAP_XMIN_COMMITTED     => 0x0100;
 use constant HEAP_XMIN_INVALID       => 0x0200;
@@ -362,7 +370,7 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 		push @expected,
 			qr/${header}xmin $xmin precedes relation freeze threshold 0:\d+/;
 	}
-	if ($offnum == 2)
+	elsif ($offnum == 2)
 	{
 		# Corruptly set xmin < datfrozenxid
 		my $xmin = 3;
@@ -480,7 +488,7 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 
 		$header = header(0, $offnum, 1);
 		push @expected,
-			qr/${header}attribute \d+ with length \d+ ends at offset \d+ beyond total tuple length \d+/;
+			qr/${header}attribute with length \d+ ends at offset \d+ beyond total tuple length \d+/;
 	}
 	elsif ($offnum == 13)
 	{
@@ -489,9 +497,18 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 
 		$header = header(0, $offnum, 2);
 		push @expected,
-			qr/${header}toasted value for attribute 2 missing from toast table/;
+			qr/${header}toasted value ID \d+ missing from toast table/;
 	}
 	elsif ($offnum == 14)
+	{
+		# Corrupt infomask to claim there are no external attributes, which conflicts
+		# with column 'c' which is toasted
+		$tup->{t_infomask} &= ~HEAP_HASEXTERNAL;
+		$header = header(0, $offnum, 2);
+		push @expected,
+			qr/${header}toast value ID \d+ is external but tuple header flag HEAP_HASEXTERNAL not set/;
+	}
+	elsif ($offnum == 15)
 	{
 		# Set both HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI
 		$tup->{t_infomask} |= HEAP_XMAX_COMMITTED;
@@ -501,7 +518,7 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 		push @expected,
 			qr/${header}multitransaction ID 4 equals or exceeds next valid multitransaction ID 1/;
 	}
-	elsif ($offnum == 15)	# Last offnum must equal ROWCOUNT
+	elsif ($offnum == 16)
 	{
 		# Set both HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI
 		$tup->{t_infomask} |= HEAP_XMAX_COMMITTED;
@@ -511,6 +528,47 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 		push @expected,
 			qr/${header}multitransaction ID 4000000000 precedes relation minimum multitransaction ID threshold 1/;
 	}
+	elsif ($offnum == 17)
+	{
+		# Corrupt column c's toast pointer va_vartag field
+		$tup->{c_va_vartag} = 42;
+		$header = header(0, $offnum, 2);
+		push @expected,
+			qr/$header/,
+			qr/toasted attribute has unexpected TOAST tag 42/;
+	}
+	elsif ($offnum == 18)
+	{
+		# Corrupt column c's toast pointer va_extinfo field
+		$tup->{c_va_extinfo} = 7654321;
+		$header = header(0, $offnum, 2);
+		push @expected,
+			qr/$header/,
+			qr/toast value ID \d+ external size 7654321 exceeds maximum expected for rawsize 10004/,
+			qr/toast value ID \d+ chunk size \d+ differs from the expected size \d+/,
+			qr/toast value ID \d+ final chunk number \d+ differs from expected value \d+/;
+	}
+	elsif ($offnum == 19)
+	{
+		# Corrupt column c's toast pointer va_valueid field.  We have not
+		# consumed enough oids for any valueid in the toast table to be large.
+		# Use a large oid for the corruption to avoid colliding with an
+		# existent entry in the toast.
+		my $corrupt = $tup->{c_va_valueid} + 100000000;
+		$tup->{c_va_valueid} = $corrupt;
+		$header = header(0, $offnum, 2);
+		push @expected,
+			qr/${header}/,
+			qr/toasted value ID \d+ missing from toast table/;
+	}
+	elsif ($offnum == 20)	# Last offnum must be less than or equal to ROWCOUNT-1
+	{
+		# Corrupt column c's toast pointer va_toastrelid field
+		my $otherid = $toastrelid + 1;
+		$tup->{c_va_toastrelid} = $otherid;
+		push @expected,
+			qr/toast value ID \d+ toast relation oid $otherid differs from expected oid $toastrelid/;
+	}
 	write_tuple($file, $offset, $tup);
 }
 close($file)
-- 
2.21.1 (Apple Git-122.3)

#102Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#101)
Re: pg_amcheck contrib application

On Wed, Mar 24, 2021 at 2:13 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

The visibility rules fix is different in v11, relying on a visibility check which more closely follows the implementation of HeapTupleSatisfiesVacuumHorizon.

Hmm. The header comment you wrote says "If a tuple might not be
visible to any running transaction, then we must not check it." But, I
don't find that statement very clear: does it mean "if there could be
even one transaction to which this tuple is not visible, we must not
check it"? Or does it mean "if the number of transactions that can see
this tuple could potentially be zero, then we must not check it"? I
don't think either of those is actually what we care about. I think
what we should be saying is "if the tuple could have been inserted by
a transaction that also added a column to the table, but which
ultimately did not commit, then the table's current TupleDesc might
differ from the one used to construct this tuple, so we must not check
it."

--
Robert Haas
EDB: http://www.enterprisedb.com

#103Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#102)
Re: pg_amcheck contrib application

On Wed, Mar 24, 2021 at 9:12 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 24, 2021 at 2:13 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

The visibility rules fix is different in v11, relying on a visibility check which more closely follows the implementation of HeapTupleSatisfiesVacuumHorizon.

Hmm. The header comment you wrote says "If a tuple might not be
visible to any running transaction, then we must not check it." But, I
don't find that statement very clear: does it mean "if there could be
even one transaction to which this tuple is not visible, we must not
check it"? Or does it mean "if the number of transactions that can see
this tuple could potentially be zero, then we must not check it"? I
don't think either of those is actually what we care about. I think
what we should be saying is "if the tuple could have been inserted by
a transaction that also added a column to the table, but which
ultimately did not commit, then the table's current TupleDesc might
differ from the one used to construct this tuple, so we must not check
it."

Hit send too soon. And I was wrong, too. Wahoo. Thinking about the
buildfarm failure, I realized that there's a second danger here,
unrelated to the possibility of different TupleDescs, which we talked
about before: if the tuple is dead, we can't safely follow any TOAST
pointers, because the TOAST chunks might disappear at any time. So
technically we could split the return value up into something
three-way: if the inserter is known to have committed, we can check
the tuple itself, because the TupleDesc has to be reasonable. And, if
the tuple is known not to be dead already, and known not to be in a
state where it could become dead while we're doing stuff, we can
follow the TOAST pointer. I'm not sure whether it's worth trying to be
that fancy or not.

If we were only concerned about the mismatched-TupleDesc problem, this
function could return true in a lot more cases. Once we get to the
comment that says "Okay, the inserter committed..." we could just
return true. Similarly, the HEAP_MOVED_IN and HEAP_MOVED_OFF cases
could just skip all the interior test and return true, because if the
tuple is being moved, the original inserter has to have committed.
Conversely, however, the !HeapTupleHeaderXminCommitted ->
TransactionIdIsCurrentTransactionId case probably ought to always
return false. One could argue otherwise: if we're the inserter, then
the only in-progress transaction that might have changed the TupleDesc
is us, so we could just consider this case to be a true return value
also, regardless of what's going on with xmax. In that case, we're not
asking "did the inserter definitely commit?" but "are the inserter's
possible DDL changes definitely visible to us?" which might be an OK
definition too.

However, the could-the-TOAST-data-disappear problem is another story.
I don't see how we can answer that question correctly with the logic
you've got here, because you have no XID threshold. Consider the case
where we reach this code:

+       if (!(tuphdr->t_infomask & HEAP_XMAX_COMMITTED))
+       {
+               if
(TransactionIdIsInProgress(HeapTupleHeaderGetRawXmax(tuphdr)))
+                       return true;            /*
HEAPTUPLE_DELETE_IN_PROGRESS */
+               else if
(!TransactionIdDidCommit(HeapTupleHeaderGetRawXmax(tuphdr)))
+
+                       /*
+                        * Not in Progress, Not Committed, so either
Aborted or crashed
+                        */
+                       return true;            /* HEAPTUPLE_LIVE */
+
+               /* At this point the xmax is known committed */
+       }

If we reach the case where the code comment says
HEAPTUPLE_DELETE_IN_PROGRESS, we know that the tuple isn't dead right
now, and so the TOAST tuples aren't dead either. But, by the time we
go try to look at the TOAST tuples, they might have become dead and
been pruned away, because the deleting transaction can commit at any
time, and after that pruning can happen at any time. Our only
guarantee that that won't happen is if the deleting XID is new enough
that it's invisible to some snapshot that our backend has registered.
That's approximately why HeapTupleSatisfiesVacuumHorizon needs to set
*dead_after in this case and one other, and I think you have the same
requirement.

I just noticed that this whole thing has another, related problem:
check_tuple_header_and_visibilty() and check_tuple_attribute() are
called from within check_tuple(), which is called while we hold a
buffer lock on the heap page. We should not be going and doing complex
operations that might take their own buffer locks - like TOAST index
checks - while we're holding an lwlock. That's going to have to be
changed so that the TOAST pointer checking happens after
UnlockReleaseBuffer(); in other words, we'll need to remember the
TOAST pointers to go look up and actually look them up after
UnlockReleaseBuffer(). But, when we do that, then the HEAPTUPLE_LIVE
case above has the same race condition that is already present in the
HEAPTUPLE_DELETE_IN_PROGRESS case: after we release the buffer pin,
some other transaction might delete the tuple and the associated TOAST
tuples, and they might then commit, and the tuple might become dead
and get pruned away before we check the TOAST table.

On a related note, I notice that your latest patch removes all the
logic that complains about XIDs being out of bounds. I don't think
that's good. Those seem like important checks. They're important for
finding problems with the relation, and I think we probably also need
them because of the XID-horizon issue mentioned above. One possible
way of looking at it is to say that the XID_BOUNDS_OK case has two
sub-cases: either the XID is within bounds and is one that cannot
become all-visible concurrently because it's not visible to all of our
backend's registered snapshots, or it's within bounds but does have
the possibility of becoming all-visible. In the former case, if it
appears as XMAX we can safely follow TOAST pointers found within the
tuple; in the latter case, we can't.
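One way to picture those two sub-cases is as a tiny predicate: a committed, bounds-ok xmax permits following the tuple's TOAST pointers only if it cannot become all-visible while we run, i.e. only if it does not precede the xmin of a snapshot our backend has registered. This is a hypothetical sketch, not the patch's actual code; safe_xmin here stands in for that registered snapshot's xmin:

```c
#include <stdint.h>
#include <stdbool.h>

/* Wraparound-aware XID comparison, as in TransactionIdPrecedes(). */
static inline bool
xid_precedes(uint32_t id1, uint32_t id2)
{
    return (int32_t) (id1 - id2) < 0;
}

/*
 * Hypothetical helper: given a committed, bounds-ok xmax, may we still
 * follow TOAST pointers in this tuple?  If xmax precedes safe_xmin (the
 * xmin of a snapshot we hold), the deletion may already be visible to
 * every snapshot, so the TOAST chunks could be pruned under us.
 */
static inline bool
toast_pointers_safe(uint32_t xmax, uint32_t safe_xmin)
{
    return !xid_precedes(xmax, safe_xmin);
}
```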

--
Robert Haas
EDB: http://www.enterprisedb.com

#104Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#103)
1 attachment(s)
Re: pg_amcheck contrib application

Mark,

Here's a quick and very dirty sketch of what I think perhaps this
logic could look like. This is pretty much untested and it might be
buggy, but at least you can see whether we're thinking at all in the
same direction.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

very-rough-visibility-ideas.patch (application/octet-stream)
diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 6f972e630a..066e13a1e3 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -72,6 +72,8 @@ typedef struct HeapCheckContext
 	TransactionId oldest_xid;	/* ShmemVariableCache->oldestXid */
 	FullTransactionId oldest_fxid;	/* 64-bit version of oldest_xid, computed
 									 * relative to next_fxid */
+	TransactionId safe_xmin;		/* this XID and newer ones can't become
+									 * all-visible while we're running */
 
 	/*
 	 * Cached copy of value from MultiXactState
@@ -133,8 +135,10 @@ static void check_tuple(HeapCheckContext *ctx);
 static void check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx);
 
 static bool check_tuple_attribute(HeapCheckContext *ctx);
-static bool check_tuple_header_and_visibilty(HeapTupleHeader tuphdr,
-											 HeapCheckContext *ctx);
+static void check_tuple_header_and_visibilty(HeapTupleHeader tuphdr,
+											 HeapCheckContext *ctx,
+											 bool *tuple_is_readable,
+											 bool *tuple_cannot_die_now);
 
 static void report_corruption(HeapCheckContext *ctx, char *msg);
 static TupleDesc verify_heapam_tupdesc(void);
@@ -248,6 +252,12 @@ verify_heapam(PG_FUNCTION_ARGS)
 	memset(&ctx, 0, sizeof(HeapCheckContext));
 	ctx.cached_xid = InvalidTransactionId;
 
+	/*
+	 * Any xmin newer than the xmin of our snapshot can't become all-visible
+	 * while we're running.
+	 */
+	ctx.safe_xmin = GetTransactionSnapshot()->xmin;
+
 	/*
 	 * If we report corruption when not examining some individual attribute,
 	 * we need attnum to be reported as NULL.  Set that up before any
@@ -580,27 +590,33 @@ verify_heapam_tupdesc(void)
  * other words, we do not return false merely because we detected them.)
  *
  * For visibility determination not specifically related to corruption, what we
- * want to know is if a tuple is potentially visible to any running
- * transaction.  If you are tempted to replace this function's visibility logic
- * with a call to another visibility checking function, keep in mind that this
- * function does not update hint bits, as it seems imprudent to write hint bits
- * (or anything at all) to a table during a corruption check.  Nor does this
- * function bother classifying tuple visibility beyond a boolean visible vs.
- * not visible.
- *
- * The caller should already have checked that xmin and xmax are not out of
- * bounds for the relation.
+ * want to know is (1) whether the original inserter committed and (2)
+ * whether it's possible for the tuple to be pruned while we're still checking
+ * the relation. If (1) is not the case, then the tuple descriptor used to
+ * construct the table might include additional columns that are we don't
+ * know about, so we don't try to decode the tuple. If (2) is not the case,
+ * it's OK to check the tuple, but it's not safe to follow any TOAST pointers,
+ * because if this tuple can be pruned away at any time, the same is true
+ * of its TOAST chunks.
  *
- * Returns whether the tuple is both visible and sufficiently sensible to
- * undergo attribute checks.
+ * Unlike other visibility-checking functions, this does not update hint bits,
+ * as it seems imprudent to write hint bits (or anything at all) to a table
+ * during a corruption check.  Nor does this function bother classifying tuple
+ * visibility beyond answering the questions mentioned above.
  */
-static bool
-check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
+static void
+check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx,
+								 bool *tuple_is_readable,
+								 bool *tuple_cannot_die_now)
 {
 	uint16		infomask = tuphdr->t_infomask;
 	bool		header_garbled = false;
 	unsigned	expected_hoff;
 
+	/* We haven't proven anything yet. */
+	*tuple_is_readable = false;
+	*tuple_cannot_die_now = false;
+
 	if (ctx->tuphdr->t_hoff > ctx->lp_len)
 	{
 		report_corruption(ctx,
@@ -649,169 +665,153 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
 	}
 
 	if (header_garbled)
-		return false;			/* checking of this tuple should not continue */
+		return;			/* checking of this tuple should not continue */
+
+	/*
+	 * XXX check whether raw xmin is ok by calling get_xid_status, unless
+	 * HeapTupleHeaderXminFrozen in which case we should skip it
+	 */
 
 	/*
 	 * Ok, we can examine the header for tuple visibility purposes, though we
 	 * still need to be careful about a few remaining types of header
 	 * corruption.  This logic roughly follows that of
-	 * HeapTupleSatisfiesVacuum.  Where possible the comments indicate which
-	 * HTSV_Result we think that function might return for this tuple.
+	 * HeapTupleSatisfiesVacuum and similar functions, but we don't need to
+	 * distinguish quite as many cases.
 	 */
 	if (!HeapTupleHeaderXminCommitted(tuphdr))
 	{
-		TransactionId raw_xmin = HeapTupleHeaderGetRawXmin(tuphdr);
-
 		if (HeapTupleHeaderXminInvalid(tuphdr))
-			return false;		/* HEAPTUPLE_DEAD */
-		/* Used by pre-9.0 binary upgrades */
-		else if (infomask & HEAP_MOVED_OFF ||
-				 infomask & HEAP_MOVED_IN)
+			return;
+		else if (tuphdr->t_infomask & HEAP_MOVED_OFF)
 		{
-			XidCommitStatus status;
 			TransactionId xvac = HeapTupleHeaderGetXvac(tuphdr);
 
-			switch (get_xid_status(xvac, ctx, &status))
+			/* XXX sanity check xvac with get_xid_status */
+
+			/*
+			 * Used by pre-9.0 binary upgrades. It should be impossible for
+			 * xvac to still be running, since we've removed all that code,
+			 * but even if it were, it ought to be safe to read the tuple,
+			 * since the original inserter must have committed. But, if the
+			 * xvac transaction committed, this tuple (and its associated
+			 * TOAST tuples) could be pruned at any time.
+			 */
+			if (TransactionIdDidCommit(xvac))
 			{
-				case XID_INVALID:
-					report_corruption(ctx,
-									  pstrdup("old-style VACUUM FULL transaction ID is invalid"));
-					return false;	/* corrupt */
-				case XID_IN_FUTURE:
-					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u equals or exceeds next valid transaction ID %u:%u",
-											   xvac,
-											   EpochFromFullTransactionId(ctx->next_fxid),
-											   XidFromFullTransactionId(ctx->next_fxid)));
-					return false;	/* corrupt */
-				case XID_PRECEDES_RELMIN:
-					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u precedes relation freeze threshold %u:%u",
-											   xvac,
-											   EpochFromFullTransactionId(ctx->relfrozenfxid),
-											   XidFromFullTransactionId(ctx->relfrozenfxid)));
-					return false;	/* corrupt */
-					break;
-				case XID_PRECEDES_CLUSTERMIN:
-					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u precedes oldest valid transaction ID %u:%u",
-											   xvac,
-											   EpochFromFullTransactionId(ctx->oldest_fxid),
-											   XidFromFullTransactionId(ctx->oldest_fxid)));
-					return false;	/* corrupt */
-					break;
-				case XID_BOUNDS_OK:
-					switch (status)
-					{
-						case XID_IN_PROGRESS:
-							return true;	/* HEAPTUPLE_DELETE_IN_PROGRESS */
-						case XID_COMMITTED:
-						case XID_ABORTED:
-							return false;	/* HEAPTUPLE_DEAD */
-					}
+				*tuple_is_readable = true;
+				return;
 			}
 		}
-		else
+		else if (tuphdr->t_infomask & HEAP_MOVED_IN)
 		{
-			XidCommitStatus status;
+			TransactionId xvac = HeapTupleHeaderGetXvac(tuphdr);
+
+			/* XXX sanity check xvac with get_xid_status */
 
-			switch (get_xid_status(raw_xmin, ctx, &status))
+			/*
+			 * Same as above, but now pruning can happen if xvac did not
+			 * commit.
+			 */
+			if (!TransactionIdDidCommit(xvac))
 			{
-				case XID_INVALID:
-					report_corruption(ctx,
-									  pstrdup("raw xmin is invalid"));
-					return false;
-				case XID_IN_FUTURE:
-					report_corruption(ctx,
-									  psprintf("raw xmin %u equals or exceeds next valid transaction ID %u:%u",
-											   raw_xmin,
-											   EpochFromFullTransactionId(ctx->next_fxid),
-											   XidFromFullTransactionId(ctx->next_fxid)));
-					return false;	/* corrupt */
-				case XID_PRECEDES_RELMIN:
-					report_corruption(ctx,
-									  psprintf("raw xmin %u precedes relation freeze threshold %u:%u",
-											   raw_xmin,
-											   EpochFromFullTransactionId(ctx->relfrozenfxid),
-											   XidFromFullTransactionId(ctx->relfrozenfxid)));
-					return false;	/* corrupt */
-				case XID_PRECEDES_CLUSTERMIN:
-					report_corruption(ctx,
-									  psprintf("raw xmin %u precedes oldest valid transaction ID %u:%u",
-											   raw_xmin,
-											   EpochFromFullTransactionId(ctx->oldest_fxid),
-											   XidFromFullTransactionId(ctx->oldest_fxid)));
-					return false;	/* corrupt */
-				case XID_BOUNDS_OK:
-					switch (status)
-					{
-						case XID_COMMITTED:
-							break;
-						case XID_IN_PROGRESS:
-							return true;	/* insert or delete in progress */
-						case XID_ABORTED:
-							return false;	/* HEAPTUPLE_DEAD */
-					}
+				*tuple_is_readable = true;
+				return;
 			}
 		}
+		else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuphdr)))
+		{
+			/*
+			 * Don't try to check tuples from uncommitted transactions, even
+			 * though technically it might be safe when it's our own
+			 * transaction (since we can see any DDL we did ourselves).
+			 */
+			return;
+		}
+		else if (TransactionIdIsInProgress(HeapTupleHeaderGetRawXmin(tuphdr)))
+		{
+			/* Still running, not us, definitely can't check. */
+			return;
+		}
+		else if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuphdr)))
+		{
+			/* Inserter aborted or crashed, definitely can't check. */
+			return;
+		}
+	}
+
+	/*
+	 * Okay, the inserter committed, so it was good at some point.  Now what
+	 * about the deleting transaction?
+	 */
+	*tuple_is_readable = true;
+
+	if ((tuphdr->t_infomask & HEAP_XMAX_INVALID) ||
+		HEAP_XMAX_IS_LOCKED_ONLY(tuphdr->t_infomask))
+	{
+		/*
+		 * The tuple is not deleted yet. Even if it gets deleted in the near
+		 * future, it can't be pruned while we're still running because
+		 * it must still be visible to our snapshot.
+		 */
+		*tuple_cannot_die_now = true;
+		return;
 	}
 
-	if (!(infomask & HEAP_XMAX_INVALID) && !HEAP_XMAX_IS_LOCKED_ONLY(infomask))
+	if (tuphdr->t_infomask & HEAP_XMAX_IS_MULTI)
 	{
-		if (infomask & HEAP_XMAX_IS_MULTI)
+		TransactionId xmax = HeapTupleGetUpdateXid(tuphdr);
+
+		/* XXX sanity check the update XID with get_xid_status */
+
+		if (TransactionIdIsInProgress(xmax))
 		{
-			XidCommitStatus status;
-			TransactionId xmax = HeapTupleGetUpdateXid(tuphdr);
+			/*
+			 * Since the deleting transaction is still in progress, the
+			 * delete can't be visible to our snapshot.
+			 */
+			*tuple_cannot_die_now = true;
+			return;
+		}
+		else if (TransactionIdDidCommit(xmax))
+		{
+			/*
+			 * The update XID is no longer running, and it did commit. So the
+			 * tuple could be pruned if the XID is old enough.
+			 */
+			if (TransactionIdPrecedes(HeapTupleHeaderGetRawXmax(tuphdr),
+									  ctx->safe_xmin))
+				return;
+		}
 
-			switch (get_xid_status(xmax, ctx, &status))
-			{
-					/* not LOCKED_ONLY, so it has to have an xmax */
-				case XID_INVALID:
-					report_corruption(ctx,
-									  pstrdup("xmax is invalid"));
-					return false;	/* corrupt */
-				case XID_IN_FUTURE:
-					report_corruption(ctx,
-									  psprintf("xmax %u equals or exceeds next valid transaction ID %u:%u",
-											   xmax,
-											   EpochFromFullTransactionId(ctx->next_fxid),
-											   XidFromFullTransactionId(ctx->next_fxid)));
-					return false;	/* corrupt */
-				case XID_PRECEDES_RELMIN:
-					report_corruption(ctx,
-									  psprintf("xmax %u precedes relation freeze threshold %u:%u",
-											   xmax,
-											   EpochFromFullTransactionId(ctx->relfrozenfxid),
-											   XidFromFullTransactionId(ctx->relfrozenfxid)));
-					return false;	/* corrupt */
-				case XID_PRECEDES_CLUSTERMIN:
-					report_corruption(ctx,
-									  psprintf("xmax %u precedes oldest valid transaction ID %u:%u",
-											   xmax,
-											   EpochFromFullTransactionId(ctx->oldest_fxid),
-											   XidFromFullTransactionId(ctx->oldest_fxid)));
-					return false;	/* corrupt */
-				case XID_BOUNDS_OK:
-					switch (status)
-					{
-						case XID_IN_PROGRESS:
-							return true;	/* HEAPTUPLE_DELETE_IN_PROGRESS */
-						case XID_COMMITTED:
-						case XID_ABORTED:
-							return false;	/* HEAPTUPLE_RECENTLY_DEAD or
-											 * HEAPTUPLE_DEAD */
-					}
-			}
+		*tuple_cannot_die_now = true;
+		return;
+	}
+
+	/* XXX sanity check xmax with get_xid_status */
 
-			/* Ok, the tuple is live */
+	/*
+	 * Test for cases where there's been an update or delete, but recently
+	 * enough that we don't have to worry about pruning yet.
+	 */
+	if (!(tuphdr->t_infomask & HEAP_XMAX_COMMITTED))
+	{
+		if (TransactionIdIsInProgress(HeapTupleHeaderGetRawXmax(tuphdr)))
+		{
+			*tuple_cannot_die_now = true;
+			return;
+		}
+		else if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmax(tuphdr)))
+		{
+			*tuple_cannot_die_now = true;
+			return;
 		}
-		else if (!(infomask & HEAP_XMAX_COMMITTED))
-			return true;		/* HEAPTUPLE_DELETE_IN_PROGRESS or
-								 * HEAPTUPLE_LIVE */
-		else
-			return false;		/* HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD */
 	}
-	return true;				/* not dead */
+
+	/* The delete might be old enough that we have to worry about pruning. */
+	if (!TransactionIdPrecedes(HeapTupleHeaderGetRawXmax(tuphdr),
+							   ctx->safe_xmin))
+		*tuple_cannot_die_now = true;
 }
 
 /*
@@ -1124,6 +1124,8 @@ check_tuple(HeapCheckContext *ctx)
 	TransactionId xmax;
 	bool		fatal = false;
 	uint16		infomask = ctx->tuphdr->t_infomask;
+	bool		tuple_is_readable;
+	bool		tuple_cannot_die_now;
 
 	/* If xmin is normal, it should be within valid range */
 	xmin = HeapTupleHeaderGetXmin(ctx->tuphdr);
@@ -1244,11 +1246,15 @@ check_tuple(HeapCheckContext *ctx)
 
 	/*
 	 * Check various forms of tuple header corruption.  If the header is too
-	 * corrupt to continue checking, or if the tuple is not visible to anyone,
-	 * we cannot continue with other checks.
+	 * corrupt to continue checking, or if the inserter aborted, we cannot
+	 * continue with other checks.
 	 */
-	if (!check_tuple_header_and_visibilty(ctx->tuphdr, ctx))
+	check_tuple_header_and_visibilty(ctx->tuphdr, ctx,
+									 &tuple_is_readable,
+									 &tuple_cannot_die_now);
+	if (!tuple_is_readable)
 		return;
+	/* XXX skip TOAST checks if tuple_cannot_die_now is false */
 
 	/*
 	 * The tuple is visible, so it must be compatible with the current version
#105Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#104)
1 attachment(s)
Re: pg_amcheck contrib application

On Mar 24, 2021, at 1:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Mark,

Here's a quick and very dirty sketch of what I think perhaps this
logic could look like. This is pretty much untested and it might be
buggy, but at least you can see whether we're thinking at all in the
same direction.

Thanks! The attached patch addresses your comments here and in your prior email. In particular, this patch changes the tuple visibility logic to not check tuples for which the inserting transaction aborted or is still in progress, and to not check toast for tuples deleted in transactions older than our transaction snapshot's xmin. A list of toasted attributes which are safe to check is compiled per main table page during the scan of the page, then checked after the buffer lock on the main page is released.
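The per-page flow described here (collect the safe TOAST pointers while the buffer lock is held, then chase them only after the lock is released) can be sketched roughly as below; the names and fixed-size list are illustrative stand-ins, not the patch's actual data structures:

```c
#include <stdio.h>

#define MAX_PENDING 64

/* Illustrative stand-in for the patch's per-attribute bookkeeping. */
typedef struct
{
    unsigned    blkno;          /* block in main table */
    unsigned    offnum;         /* offset in main table */
    unsigned    attnum;         /* attribute in main table */
} PendingToast;

static PendingToast pending[MAX_PENDING];
static int  npending = 0;

/*
 * Called while the (simulated) buffer lock is held: record where the
 * toasted attribute lives, but take no other buffer locks yet.
 */
static void
defer_toast_check(unsigned blkno, unsigned offnum, unsigned attnum)
{
    if (npending < MAX_PENDING)
    {
        pending[npending].blkno = blkno;
        pending[npending].offnum = offnum;
        pending[npending].attnum = attnum;
        npending++;
    }
}

/*
 * Called after the (simulated) UnlockReleaseBuffer(): now it is safe to
 * lock TOAST table buffers and verify each remembered pointer.  Returns
 * the number of deferred checks performed and resets the list.
 */
static int
run_deferred_toast_checks(void)
{
    int         checked = npending;

    for (int i = 0; i < npending; i++)
        printf("checking toast for (%u,%u) attnum %u\n",
               pending[i].blkno, pending[i].offnum, pending[i].attnum);
    npending = 0;
    return checked;
}
```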

In the perhaps unusual case where verify_heapam() is called in a transaction which has also added tuples to the table being checked, this patch's visibility logic chooses not to check such tuples. I'm on the fence about this choice, and am mostly following your lead. I like that this decision maintains the invariant that we never check tuples which have not yet been committed.

The patch includes a bit of refactoring. In the old code, heap_check() performed clog bounds checking on xmin and xmax prior to calling check_tuple_header_and_visibilty(), but I think that's not such a great choice. If the tuple header is garbled to have random bytes in the xmin and xmax fields, and we can detect that situation because other tuple header fields are garbled in detectable ways, I'd rather get a report about the header being garbled than a report about the xmin or xmax being out of bounds. In the new code, the tuple header is checked first, then the visibility is checked, then the tuple is checked against the current relation description, then the tuple attributes are checked. I think the layout is easier to follow, too.

Attachments:

v12-0001-Fix-visibility-and-locking-issues.patch (application/octet-stream)
From 0b202c41a189ff23bb2a73cce9b5c59fa9490b38 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Wed, 24 Mar 2021 18:18:56 -0700
Subject: [PATCH v12] Fix visibility and locking issues.

Fix amcheck's verify_heapam() visibility rules and refactor toast
checking code to be performed after releasing the buffer lock on the
main table page.
---
 contrib/amcheck/verify_heapam.c  | 1123 ++++++++++++++++++------------
 src/tools/pgindent/typedefs.list |    1 +
 2 files changed, 693 insertions(+), 431 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 6f972e630a..e3163e4bfa 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -46,6 +46,7 @@ typedef enum XidBoundsViolation
 typedef enum XidCommitStatus
 {
 	XID_COMMITTED,
+	XID_IS_CURRENT_XID,
 	XID_IN_PROGRESS,
 	XID_ABORTED
 } XidCommitStatus;
@@ -57,6 +58,26 @@ typedef enum SkipPages
 	SKIP_PAGES_NONE
 } SkipPages;
 
+/*
+ * Struct holding information necessary to check a toasted attribute, including
+ * the toast pointer, state about the current toast chunk being checked, and
+ * the location in the main table of the toasted attribute.  We have to track
+ * the tuple's location in the main table for reporting purposes because by the
+ * time the toast is checked our HeapCheckContext will no longer be pointing to
+ * the relevant tuple.
+ */
+typedef struct ToastCheckContext
+{
+	struct varatt_external toast_pointer;
+	BlockNumber blkno;			/* block in main table */
+	OffsetNumber offnum;		/* offset in main table */
+	AttrNumber	attnum;			/* attribute in main table */
+	int32		chunkno;		/* chunk number in toast table */
+	int32		attrsize;		/* size of toasted attribute */
+	int32		endchunk;		/* last chunk number in toast table */
+	int32		totalchunks;	/* total chunks in toast table */
+} ToastCheckContext;
+
 /*
  * Struct holding the running context information during
  * a lifetime of a verify_heapam execution.
@@ -72,6 +93,8 @@ typedef struct HeapCheckContext
 	TransactionId oldest_xid;	/* ShmemVariableCache->oldestXid */
 	FullTransactionId oldest_fxid;	/* 64-bit version of oldest_xid, computed
 									 * relative to next_fxid */
+	TransactionId safe_xmin;	/* this XID and newer ones can't become
+								 * all-visible while we're running */
 
 	/*
 	 * Cached copy of value from MultiXactState
@@ -113,11 +136,14 @@ typedef struct HeapCheckContext
 	uint32		offset;			/* offset in tuple data */
 	AttrNumber	attnum;
 
-	/* Values for iterating over toast for the attribute */
-	int32		chunkno;
-	int32		attrsize;
-	int32		endchunk;
-	int32		totalchunks;
+	/* True if toast for this tuple could be vacuumed away */
+	bool		tuple_is_volatile;
+
+	/*
+	 * List of ToastCheckContext structs for toasted attributes which are not
+	 * in danger of being vacuumed away and should be checked
+	 */
+	List	   *toasted_attributes;
 
 	/* Whether verify_heapam has yet encountered any corrupt tuples */
 	bool		is_corrupt;
@@ -130,13 +156,18 @@ typedef struct HeapCheckContext
 /* Internal implementation */
 static void sanity_check_relation(Relation rel);
 static void check_tuple(HeapCheckContext *ctx);
-static void check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx);
+static void check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
+							  ToastCheckContext *tctx);
+
+static bool check_tuple_header(HeapCheckContext *ctx);
+static bool check_tuple_visibility(HeapCheckContext *ctx);
 
 static bool check_tuple_attribute(HeapCheckContext *ctx);
-static bool check_tuple_header_and_visibilty(HeapTupleHeader tuphdr,
-											 HeapCheckContext *ctx);
+static void check_toasted_attributes(HeapCheckContext *ctx);
 
-static void report_corruption(HeapCheckContext *ctx, char *msg);
+static void report_main_corruption(HeapCheckContext *ctx, char *msg);
+static void report_toast_corruption(HeapCheckContext *ctx,
+									ToastCheckContext *tctx, char *msg);
 static TupleDesc verify_heapam_tupdesc(void);
 static FullTransactionId FullTransactionIdFromXidAndCtx(TransactionId xid,
 														const HeapCheckContext *ctx);
@@ -247,6 +278,13 @@ verify_heapam(PG_FUNCTION_ARGS)
 
 	memset(&ctx, 0, sizeof(HeapCheckContext));
 	ctx.cached_xid = InvalidTransactionId;
+	ctx.toasted_attributes = NIL;
+
+	/*
+	 * Any xmin newer than the xmin of our snapshot can't become all-visible
+	 * while we're running.
+	 */
+	ctx.safe_xmin = GetTransactionSnapshot()->xmin;
 
 	/*
 	 * If we report corruption when not examining some individual attribute,
@@ -395,25 +433,25 @@ verify_heapam(PG_FUNCTION_ARGS)
 
 				if (rdoffnum < FirstOffsetNumber)
 				{
-					report_corruption(&ctx,
-									  psprintf("line pointer redirection to item at offset %u precedes minimum offset %u",
-											   (unsigned) rdoffnum,
-											   (unsigned) FirstOffsetNumber));
+					report_main_corruption(&ctx,
+										   psprintf("line pointer redirection to item at offset %u precedes minimum offset %u",
+													(unsigned) rdoffnum,
+													(unsigned) FirstOffsetNumber));
 					continue;
 				}
 				if (rdoffnum > maxoff)
 				{
-					report_corruption(&ctx,
-									  psprintf("line pointer redirection to item at offset %u exceeds maximum offset %u",
-											   (unsigned) rdoffnum,
-											   (unsigned) maxoff));
+					report_main_corruption(&ctx,
+										   psprintf("line pointer redirection to item at offset %u exceeds maximum offset %u",
+													(unsigned) rdoffnum,
+													(unsigned) maxoff));
 					continue;
 				}
 				rditem = PageGetItemId(ctx.page, rdoffnum);
 				if (!ItemIdIsUsed(rditem))
-					report_corruption(&ctx,
-									  psprintf("line pointer redirection to unused item at offset %u",
-											   (unsigned) rdoffnum));
+					report_main_corruption(&ctx,
+										   psprintf("line pointer redirection to unused item at offset %u",
+													(unsigned) rdoffnum));
 				continue;
 			}
 
@@ -423,26 +461,26 @@ verify_heapam(PG_FUNCTION_ARGS)
 
 			if (ctx.lp_off != MAXALIGN(ctx.lp_off))
 			{
-				report_corruption(&ctx,
-								  psprintf("line pointer to page offset %u is not maximally aligned",
-										   ctx.lp_off));
+				report_main_corruption(&ctx,
+									   psprintf("line pointer to page offset %u is not maximally aligned",
+												ctx.lp_off));
 				continue;
 			}
 			if (ctx.lp_len < MAXALIGN(SizeofHeapTupleHeader))
 			{
-				report_corruption(&ctx,
-								  psprintf("line pointer length %u is less than the minimum tuple header size %u",
-										   ctx.lp_len,
-										   (unsigned) MAXALIGN(SizeofHeapTupleHeader)));
+				report_main_corruption(&ctx,
+									   psprintf("line pointer length %u is less than the minimum tuple header size %u",
+												ctx.lp_len,
+												(unsigned) MAXALIGN(SizeofHeapTupleHeader)));
 				continue;
 			}
 			if (ctx.lp_off + ctx.lp_len > BLCKSZ)
 			{
-				report_corruption(&ctx,
-								  psprintf("line pointer to page offset %u with length %u ends beyond maximum page offset %u",
-										   ctx.lp_off,
-										   ctx.lp_len,
-										   (unsigned) BLCKSZ));
+				report_main_corruption(&ctx,
+									   psprintf("line pointer to page offset %u with length %u ends beyond maximum page offset %u",
+												ctx.lp_off,
+												ctx.lp_len,
+												(unsigned) BLCKSZ));
 				continue;
 			}
 
@@ -457,6 +495,14 @@ verify_heapam(PG_FUNCTION_ARGS)
 		/* clean up */
 		UnlockReleaseBuffer(ctx.buffer);
 
+		/*
+		 * Check any toast pointers from the page whose lock we just released
+		 * and reset the list to NIL.
+		 */
+		if (ctx.toasted_attributes != NIL)
+			check_toasted_attributes(&ctx);
+		Assert(ctx.toasted_attributes == NIL);
+
 		if (on_error_stop && ctx.is_corrupt)
 			break;
 	}
@@ -498,14 +544,13 @@ sanity_check_relation(Relation rel)
 }
 
 /*
- * Record a single corruption found in the table.  The values in ctx should
- * reflect the location of the corruption, and the msg argument should contain
- * a human-readable description of the corruption.
- *
- * The msg argument is pfree'd by this function.
+ * Shared internal implementation for report_main_corruption and
+ * report_toast_corruption.
  */
 static void
-report_corruption(HeapCheckContext *ctx, char *msg)
+report_corruption(Tuplestorestate *tupstore, TupleDesc tupdesc,
+				  BlockNumber blkno, OffsetNumber offnum, AttrNumber attnum,
+				  char *msg)
 {
 	Datum		values[HEAPCHECK_RELATION_COLS];
 	bool		nulls[HEAPCHECK_RELATION_COLS];
@@ -513,10 +558,10 @@ report_corruption(HeapCheckContext *ctx, char *msg)
 
 	MemSet(values, 0, sizeof(values));
 	MemSet(nulls, 0, sizeof(nulls));
-	values[0] = Int64GetDatum(ctx->blkno);
-	values[1] = Int32GetDatum(ctx->offnum);
-	values[2] = Int32GetDatum(ctx->attnum);
-	nulls[2] = (ctx->attnum < 0);
+	values[0] = Int64GetDatum(blkno);
+	values[1] = Int32GetDatum(offnum);
+	values[2] = Int32GetDatum(attnum);
+	nulls[2] = (attnum < 0);
 	values[3] = CStringGetTextDatum(msg);
 
 	/*
@@ -529,8 +574,39 @@ report_corruption(HeapCheckContext *ctx, char *msg)
 	 */
 	pfree(msg);
 
-	tuple = heap_form_tuple(ctx->tupdesc, values, nulls);
-	tuplestore_puttuple(ctx->tupstore, tuple);
+	tuple = heap_form_tuple(tupdesc, values, nulls);
+	tuplestore_puttuple(tupstore, tuple);
+}
+
+/*
+ * Record a single corruption found in the main table.  The values in ctx should
+ * indicate the location of the corruption, and the msg argument should contain
+ * a human-readable description of the corruption.
+ *
+ * The msg argument is pfree'd by this function.
+ */
+static void
+report_main_corruption(HeapCheckContext *ctx, char *msg)
+{
+	report_corruption(ctx->tupstore, ctx->tupdesc, ctx->blkno, ctx->offnum,
+					  ctx->attnum, msg);
+	ctx->is_corrupt = true;
+}
+
+/*
+ * Record corruption found in the toast table.  The values in tctx should
+ * indicate the location in the main table where the toast pointer was
+ * encountered, and the msg argument should contain a human-readable
+ * description of the toast table corruption.
+ *
+ * As above, the msg argument is pfree'd by this function.
+ */
+static void
+report_toast_corruption(HeapCheckContext *ctx, ToastCheckContext *tctx,
+						char *msg)
+{
+	report_corruption(ctx->tupstore, ctx->tupdesc, tctx->blkno, tctx->offnum,
+					  tctx->attnum, msg);
 	ctx->is_corrupt = true;
 }
 
@@ -555,16 +631,11 @@ verify_heapam_tupdesc(void)
 }
 
 /*
- * Check for tuple header corruption and tuple visibility.
- *
- * Since we do not hold a snapshot, tuple visibility is not a question of
- * whether we should be able to see the tuple relative to any particular
- * snapshot, but rather a question of whether it is safe and reasonable to
- * check the tuple attributes.
+ * Check for tuple header corruption.
  *
  * Some kinds of corruption make it unsafe to check the tuple attributes, for
  * example when the line pointer refers to a range of bytes outside the page.
- * In such cases, we return false (not visible) after recording appropriate
+ * In such cases, we return false (not checkable) after recording appropriate
  * corruption messages.
  *
  * Some other kinds of tuple header corruption confuse the question of where
@@ -576,44 +647,33 @@ verify_heapam_tupdesc(void)
  *
  * Other kinds of tuple header corruption do not bear on the question of
  * whether the tuple attributes can be checked, so we record corruption
- * messages for them but do not base our visibility determination on them.  (In
- * other words, we do not return false merely because we detected them.)
- *
- * For visibility determination not specifically related to corruption, what we
- * want to know is if a tuple is potentially visible to any running
- * transaction.  If you are tempted to replace this function's visibility logic
- * with a call to another visibility checking function, keep in mind that this
- * function does not update hint bits, as it seems imprudent to write hint bits
- * (or anything at all) to a table during a corruption check.  Nor does this
- * function bother classifying tuple visibility beyond a boolean visible vs.
- * not visible.
- *
- * The caller should already have checked that xmin and xmax are not out of
- * bounds for the relation.
+ * messages for them but we do not return false merely because we detected
+ * them.
  *
- * Returns whether the tuple is both visible and sufficiently sensible to
- * undergo attribute checks.
+ * Returns whether the tuple is sufficiently sensible to undergo visibility and
+ * attribute checks.
  */
 static bool
-check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
+check_tuple_header(HeapCheckContext *ctx)
 {
+	HeapTupleHeader tuphdr = ctx->tuphdr;
 	uint16		infomask = tuphdr->t_infomask;
 	bool		header_garbled = false;
 	unsigned	expected_hoff;
 
 	if (ctx->tuphdr->t_hoff > ctx->lp_len)
 	{
-		report_corruption(ctx,
-						  psprintf("data begins at offset %u beyond the tuple length %u",
-								   ctx->tuphdr->t_hoff, ctx->lp_len));
+		report_main_corruption(ctx,
+							   psprintf("data begins at offset %u beyond the tuple length %u",
+										ctx->tuphdr->t_hoff, ctx->lp_len));
 		header_garbled = true;
 	}
 
 	if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) &&
 		(ctx->tuphdr->t_infomask & HEAP_XMAX_IS_MULTI))
 	{
-		report_corruption(ctx,
-						  pstrdup("multixact should not be marked committed"));
+		report_main_corruption(ctx,
+							   pstrdup("multixact should not be marked committed"));
 
 		/*
 		 * This condition is clearly wrong, but we do not consider the header
@@ -630,188 +690,464 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
 	if (ctx->tuphdr->t_hoff != expected_hoff)
 	{
 		if ((infomask & HEAP_HASNULL) && ctx->natts == 1)
-			report_corruption(ctx,
-							  psprintf("tuple data should begin at byte %u, but actually begins at byte %u (1 attribute, has nulls)",
-									   expected_hoff, ctx->tuphdr->t_hoff));
+			report_main_corruption(ctx,
+								   psprintf("tuple data should begin at byte %u, but actually begins at byte %u (1 attribute, has nulls)",
+											expected_hoff, ctx->tuphdr->t_hoff));
 		else if ((infomask & HEAP_HASNULL))
-			report_corruption(ctx,
-							  psprintf("tuple data should begin at byte %u, but actually begins at byte %u (%u attributes, has nulls)",
-									   expected_hoff, ctx->tuphdr->t_hoff, ctx->natts));
+			report_main_corruption(ctx,
+								   psprintf("tuple data should begin at byte %u, but actually begins at byte %u (%u attributes, has nulls)",
+											expected_hoff, ctx->tuphdr->t_hoff, ctx->natts));
 		else if (ctx->natts == 1)
-			report_corruption(ctx,
-							  psprintf("tuple data should begin at byte %u, but actually begins at byte %u (1 attribute, no nulls)",
-									   expected_hoff, ctx->tuphdr->t_hoff));
+			report_main_corruption(ctx,
+								   psprintf("tuple data should begin at byte %u, but actually begins at byte %u (1 attribute, no nulls)",
+											expected_hoff, ctx->tuphdr->t_hoff));
 		else
-			report_corruption(ctx,
-							  psprintf("tuple data should begin at byte %u, but actually begins at byte %u (%u attributes, no nulls)",
-									   expected_hoff, ctx->tuphdr->t_hoff, ctx->natts));
+			report_main_corruption(ctx,
+								   psprintf("tuple data should begin at byte %u, but actually begins at byte %u (%u attributes, no nulls)",
+											expected_hoff, ctx->tuphdr->t_hoff, ctx->natts));
 		header_garbled = true;
 	}
 
 	if (header_garbled)
 		return false;			/* checking of this tuple should not continue */
 
+	return true;				/* header ok */
+}
+
+/*
+ * Checks whether a tuple is "visible" for checking purposes.  This is not a
+ * question of whether the tuple should be visible relative to any particular
+ * snapshot, but rather of whether it is safe and reasonable to check the
+ * tuple attributes.  The caller should already have checked that the tuple is
+ * sufficiently sensible for us to evaluate.
+ *
+ * If a tuple could have been inserted by a transaction that also added a
+ * column to the table, but which ultimately did not commit, or which has not
+ * yet committed, then the table's current TupleDesc might differ from the one
+ * used to construct this tuple, so we must not check it.
+ *
+ * As a special case, if our own transaction inserted the tuple, even if we
+ * added a column to the table, our TupleDesc should match.  We could check the
+ * tuple, but choose not to do so.
+ *
+ * If a tuple has been updated or deleted, we can still read the old tuple for
+ * corruption checking purposes, as long as we are careful about concurrent
+ * vacuums.  The main table tuple itself cannot be vacuumed away because we
+ * hold a buffer lock on the page, but if the deleting transaction is older
+ * than our transaction snapshot's xmin, then vacuum could remove the toast at
+ * any time, so we must not check the toast.
+ *
+ * If xmin or xmax values are older than can be checked against clog, or appear
+ * to be in the future (possibly due to wrap-around), then we cannot make a
+ * determination about the visibility of the tuple, so we must not check it.
+ *
+ * Returns true if the tuple should be checked, false otherwise.  Sets
+ * ctx->tuple_is_volatile true if the toast might be vacuumed away, false
+ * otherwise.
+ */
+static bool
+check_tuple_visibility(HeapCheckContext *ctx)
+{
+	TransactionId xmin;
+	TransactionId xvac;
+	TransactionId xmax;
+	XidCommitStatus xmin_status;
+	XidCommitStatus xvac_status;
+	XidCommitStatus xmax_status;
+	HeapTupleHeader tuphdr = ctx->tuphdr;
+
+	ctx->tuple_is_volatile = true;	/* have not yet proven otherwise */
+
+	/* If xmin is normal, it should be within valid range */
+	xmin = HeapTupleHeaderGetXmin(tuphdr);
+	switch (get_xid_status(xmin, ctx, &xmin_status))
+	{
+		case XID_INVALID:
+		case XID_BOUNDS_OK:
+			break;
+		case XID_IN_FUTURE:
+			report_main_corruption(ctx,
+								   psprintf("xmin %u equals or exceeds next valid transaction ID %u:%u",
+											xmin,
+											EpochFromFullTransactionId(ctx->next_fxid),
+											XidFromFullTransactionId(ctx->next_fxid)));
+			return false;		/* corrupt */
+		case XID_PRECEDES_CLUSTERMIN:
+			report_main_corruption(ctx,
+								   psprintf("xmin %u precedes oldest valid transaction ID %u:%u",
+											xmin,
+											EpochFromFullTransactionId(ctx->oldest_fxid),
+											XidFromFullTransactionId(ctx->oldest_fxid)));
+			return false;		/* corrupt */
+		case XID_PRECEDES_RELMIN:
+			report_main_corruption(ctx,
+								   psprintf("xmin %u precedes relation freeze threshold %u:%u",
+											xmin,
+											EpochFromFullTransactionId(ctx->relfrozenfxid),
+											XidFromFullTransactionId(ctx->relfrozenfxid)));
+			return false;		/* corrupt */
+	}
+
 	/*
-	 * Ok, we can examine the header for tuple visibility purposes, though we
-	 * still need to be careful about a few remaining types of header
-	 * corruption.  This logic roughly follows that of
-	 * HeapTupleSatisfiesVacuum.  Where possible the comments indicate which
-	 * HTSV_Result we think that function might return for this tuple.
+	 * Has inserting transaction committed?
 	 */
 	if (!HeapTupleHeaderXminCommitted(tuphdr))
 	{
-		TransactionId raw_xmin = HeapTupleHeaderGetRawXmin(tuphdr);
-
 		if (HeapTupleHeaderXminInvalid(tuphdr))
-			return false;		/* HEAPTUPLE_DEAD */
+
+			/*
+			 * The inserting transaction aborted.  The structure of the tuple
+			 * may not match our relation description, so we cannot check it.
+			 */
+			return false;		/* uncheckable */
 		/* Used by pre-9.0 binary upgrades */
-		else if (infomask & HEAP_MOVED_OFF ||
-				 infomask & HEAP_MOVED_IN)
+		else if (tuphdr->t_infomask & HEAP_MOVED_OFF)
 		{
-			XidCommitStatus status;
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuphdr);
+			xvac = HeapTupleHeaderGetXvac(tuphdr);
 
-			switch (get_xid_status(xvac, ctx, &status))
+			switch (get_xid_status(xvac, ctx, &xvac_status))
 			{
 				case XID_INVALID:
-					report_corruption(ctx,
-									  pstrdup("old-style VACUUM FULL transaction ID is invalid"));
+					report_main_corruption(ctx,
+										   pstrdup("old-style VACUUM FULL transaction ID for moved off tuple is invalid"));
 					return false;	/* corrupt */
 				case XID_IN_FUTURE:
-					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u equals or exceeds next valid transaction ID %u:%u",
-											   xvac,
-											   EpochFromFullTransactionId(ctx->next_fxid),
-											   XidFromFullTransactionId(ctx->next_fxid)));
+					report_main_corruption(ctx,
+										   psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple equals or exceeds next valid transaction ID %u:%u",
+													xvac,
+													EpochFromFullTransactionId(ctx->next_fxid),
+													XidFromFullTransactionId(ctx->next_fxid)));
 					return false;	/* corrupt */
 				case XID_PRECEDES_RELMIN:
-					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u precedes relation freeze threshold %u:%u",
-											   xvac,
-											   EpochFromFullTransactionId(ctx->relfrozenfxid),
-											   XidFromFullTransactionId(ctx->relfrozenfxid)));
+					report_main_corruption(ctx,
+										   psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple precedes relation freeze threshold %u:%u",
+													xvac,
+													EpochFromFullTransactionId(ctx->relfrozenfxid),
+													XidFromFullTransactionId(ctx->relfrozenfxid)));
 					return false;	/* corrupt */
-					break;
 				case XID_PRECEDES_CLUSTERMIN:
-					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u precedes oldest valid transaction ID %u:%u",
-											   xvac,
-											   EpochFromFullTransactionId(ctx->oldest_fxid),
-											   XidFromFullTransactionId(ctx->oldest_fxid)));
+					report_main_corruption(ctx,
+										   psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple precedes oldest valid transaction ID %u:%u",
+													xvac,
+													EpochFromFullTransactionId(ctx->oldest_fxid),
+													XidFromFullTransactionId(ctx->oldest_fxid)));
 					return false;	/* corrupt */
-					break;
 				case XID_BOUNDS_OK:
-					switch (status)
-					{
-						case XID_IN_PROGRESS:
-							return true;	/* HEAPTUPLE_DELETE_IN_PROGRESS */
-						case XID_COMMITTED:
-						case XID_ABORTED:
-							return false;	/* HEAPTUPLE_DEAD */
-					}
+					break;
 			}
-		}
-		else
-		{
-			XidCommitStatus status;
 
-			switch (get_xid_status(raw_xmin, ctx, &status))
+			switch (xvac_status)
 			{
-				case XID_INVALID:
-					report_corruption(ctx,
-									  pstrdup("raw xmin is invalid"));
-					return false;
-				case XID_IN_FUTURE:
-					report_corruption(ctx,
-									  psprintf("raw xmin %u equals or exceeds next valid transaction ID %u:%u",
-											   raw_xmin,
-											   EpochFromFullTransactionId(ctx->next_fxid),
-											   XidFromFullTransactionId(ctx->next_fxid)));
+				case XID_IS_CURRENT_XID:
+					report_main_corruption(ctx,
+										   psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple matches our current transaction ID",
+													xvac));
 					return false;	/* corrupt */
-				case XID_PRECEDES_RELMIN:
-					report_corruption(ctx,
-									  psprintf("raw xmin %u precedes relation freeze threshold %u:%u",
-											   raw_xmin,
-											   EpochFromFullTransactionId(ctx->relfrozenfxid),
-											   XidFromFullTransactionId(ctx->relfrozenfxid)));
-					return false;	/* corrupt */
-				case XID_PRECEDES_CLUSTERMIN:
-					report_corruption(ctx,
-									  psprintf("raw xmin %u precedes oldest valid transaction ID %u:%u",
-											   raw_xmin,
-											   EpochFromFullTransactionId(ctx->oldest_fxid),
-											   XidFromFullTransactionId(ctx->oldest_fxid)));
+				case XID_IN_PROGRESS:
+					report_main_corruption(ctx,
+										   psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple appears to be in progress",
+													xvac));
 					return false;	/* corrupt */
-				case XID_BOUNDS_OK:
-					switch (status)
-					{
-						case XID_COMMITTED:
-							break;
-						case XID_IN_PROGRESS:
-							return true;	/* insert or delete in progress */
-						case XID_ABORTED:
-							return false;	/* HEAPTUPLE_DEAD */
-					}
+
+				case XID_COMMITTED:
+
+					/*
+					 * The VACUUM FULL committed, so this tuple is dead and
+					 * could be vacuumed away at any time.  It's ok to check
+					 * the tuple because we have a buffer lock for the page,
+					 * but not safe to check the toast.  We don't bother
+					 * comparing against safe_xmin because the VACUUM FULL
+					 * must have committed prior to an upgrade and can't still
+					 * be running.
+					 */
+					return true;	/* checkable */
+
+				case XID_ABORTED:
+					break;
 			}
 		}
-	}
-
-	if (!(infomask & HEAP_XMAX_INVALID) && !HEAP_XMAX_IS_LOCKED_ONLY(infomask))
-	{
-		if (infomask & HEAP_XMAX_IS_MULTI)
+		/* Used by pre-9.0 binary upgrades */
+		else if (tuphdr->t_infomask & HEAP_MOVED_IN)
 		{
-			XidCommitStatus status;
-			TransactionId xmax = HeapTupleGetUpdateXid(tuphdr);
+			xvac = HeapTupleHeaderGetXvac(tuphdr);
 
-			switch (get_xid_status(xmax, ctx, &status))
+			switch (get_xid_status(xvac, ctx, &xvac_status))
 			{
-					/* not LOCKED_ONLY, so it has to have an xmax */
 				case XID_INVALID:
-					report_corruption(ctx,
-									  pstrdup("xmax is invalid"));
+					report_main_corruption(ctx,
+										   pstrdup("old-style VACUUM FULL transaction ID for moved in tuple is invalid"));
 					return false;	/* corrupt */
 				case XID_IN_FUTURE:
-					report_corruption(ctx,
-									  psprintf("xmax %u equals or exceeds next valid transaction ID %u:%u",
-											   xmax,
-											   EpochFromFullTransactionId(ctx->next_fxid),
-											   XidFromFullTransactionId(ctx->next_fxid)));
+					report_main_corruption(ctx,
+										   psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple equals or exceeds next valid transaction ID %u:%u",
+													xvac,
+													EpochFromFullTransactionId(ctx->next_fxid),
+													XidFromFullTransactionId(ctx->next_fxid)));
 					return false;	/* corrupt */
 				case XID_PRECEDES_RELMIN:
-					report_corruption(ctx,
-									  psprintf("xmax %u precedes relation freeze threshold %u:%u",
-											   xmax,
-											   EpochFromFullTransactionId(ctx->relfrozenfxid),
-											   XidFromFullTransactionId(ctx->relfrozenfxid)));
+					report_main_corruption(ctx,
+										   psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple precedes relation freeze threshold %u:%u",
+													xvac,
+													EpochFromFullTransactionId(ctx->relfrozenfxid),
+													XidFromFullTransactionId(ctx->relfrozenfxid)));
 					return false;	/* corrupt */
 				case XID_PRECEDES_CLUSTERMIN:
-					report_corruption(ctx,
-									  psprintf("xmax %u precedes oldest valid transaction ID %u:%u",
-											   xmax,
-											   EpochFromFullTransactionId(ctx->oldest_fxid),
-											   XidFromFullTransactionId(ctx->oldest_fxid)));
+					report_main_corruption(ctx,
+										   psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple precedes oldest valid transaction ID %u:%u",
+													xvac,
+													EpochFromFullTransactionId(ctx->oldest_fxid),
+													XidFromFullTransactionId(ctx->oldest_fxid)));
 					return false;	/* corrupt */
 				case XID_BOUNDS_OK:
-					switch (status)
-					{
-						case XID_IN_PROGRESS:
-							return true;	/* HEAPTUPLE_DELETE_IN_PROGRESS */
-						case XID_COMMITTED:
-						case XID_ABORTED:
-							return false;	/* HEAPTUPLE_RECENTLY_DEAD or
-											 * HEAPTUPLE_DEAD */
-					}
+					break;
 			}
 
-			/* Ok, the tuple is live */
+			switch (xvac_status)
+			{
+				case XID_IS_CURRENT_XID:
+					report_main_corruption(ctx,
+										   psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple matches our current transaction ID",
+													xvac));
+					return false;	/* corrupt */
+				case XID_IN_PROGRESS:
+					report_main_corruption(ctx,
+										   psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple appears to be in progress",
+													xvac));
+					return false;	/* corrupt */
+
+				case XID_COMMITTED:
+					break;
+
+				case XID_ABORTED:
+
+					/*
+					 * The VACUUM FULL aborted, so this tuple is dead and
+					 * could be vacuumed away at any time.  It's ok to check
+					 * the tuple because we have a buffer lock for the page,
+					 * but not safe to check the toast.
+					 */
+					return true;	/* checkable */
+			}
+		}
+		else if (xmin_status == XID_IS_CURRENT_XID)
+		{
+			/*
+			 * Don't check tuples from currently running transactions, not
+			 * even our own.
+			 */
+			return false;		/* checkable, but don't check */
+		}
+		else if (xmin_status == XID_IN_PROGRESS)
+		{
+			/* Don't check tuples from currently running transactions */
+			return false;		/* uncheckable */
+		}
+		else if (xmin_status != XID_COMMITTED)
+		{
+			/*
+			 * Inserting transaction is not in progress, and not committed, so
+			 * it either aborted or crashed. We cannot check.
+			 */
+			return false;		/* uncheckable */
 		}
-		else if (!(infomask & HEAP_XMAX_COMMITTED))
-			return true;		/* HEAPTUPLE_DELETE_IN_PROGRESS or
-								 * HEAPTUPLE_LIVE */
-		else
-			return false;		/* HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD */
 	}
-	return true;				/* not dead */
+
+	/*
+	 * Okay, the inserter committed, so it was good at some point.  Now what
+	 * about the deleting transaction?
+	 */
+
+	if (tuphdr->t_infomask & HEAP_XMAX_IS_MULTI)
+	{
+		/*
+		 * xmax is a multixact, so it should be within valid MXID range.  We
+		 * cannot safely look up the update xid if the multixact is out of
+		 * bounds, and must stop checking this tuple.
+		 */
+		xmax = HeapTupleHeaderGetRawXmax(tuphdr);
+		switch (check_mxid_valid_in_rel(xmax, ctx))
+		{
+			case XID_INVALID:
+				report_main_corruption(ctx,
+									   pstrdup("multitransaction ID is invalid"));
+				return false;	/* corrupt */
+			case XID_PRECEDES_RELMIN:
+				report_main_corruption(ctx,
+									   psprintf("multitransaction ID %u precedes relation minimum multitransaction ID threshold %u",
+												xmax, ctx->relminmxid));
+				return false;	/* corrupt */
+			case XID_PRECEDES_CLUSTERMIN:
+				report_main_corruption(ctx,
+									   psprintf("multitransaction ID %u precedes oldest valid multitransaction ID threshold %u",
+												xmax, ctx->oldest_mxact));
+				return false;	/* corrupt */
+			case XID_IN_FUTURE:
+				report_main_corruption(ctx,
+									   psprintf("multitransaction ID %u equals or exceeds next valid multitransaction ID %u",
+												xmax,
+												ctx->next_mxact));
+				return false;	/* corrupt */
+			case XID_BOUNDS_OK:
+				break;
+		}
+	}
+
+	if (tuphdr->t_infomask & HEAP_XMAX_INVALID)
+	{
+		/*
+		 * This tuple is live.  A concurrently running transaction could
+		 * delete it before we get around to checking the toast, but any such
+		 * running transaction is surely not less than our safe_xmin, so the
+		 * toast cannot be vacuumed out from under us.
+		 */
+		ctx->tuple_is_volatile = false;
+		return true;			/* checkable */
+	}
+
+	if (HEAP_XMAX_IS_LOCKED_ONLY(tuphdr->t_infomask))
+	{
+		/*
+		 * "Deleting" xact really only locked it, so the tuple is live in any
+		 * case.  As above, a concurrently running transaction could delete
+		 * it, but it cannot be vacuumed out from under us.
+		 */
+		ctx->tuple_is_volatile = false;
+		return true;			/* checkable */
+	}
+
+	if (tuphdr->t_infomask & HEAP_XMAX_IS_MULTI)
+	{
+		/*
+		 * We already checked above that this multixact is within limits for
+		 * this table.  Now check the update xid from this multixact.
+		 */
+		xmax = HeapTupleGetUpdateXid(tuphdr);
+		switch (get_xid_status(xmax, ctx, &xmax_status))
+		{
+				/* not LOCKED_ONLY, so it has to have an xmax */
+			case XID_INVALID:
+				report_main_corruption(ctx,
+									   pstrdup("update xid is invalid"));
+				return false;	/* corrupt */
+			case XID_IN_FUTURE:
+				report_main_corruption(ctx,
+									   psprintf("update xid %u equals or exceeds next valid transaction ID %u:%u",
+												xmax,
+												EpochFromFullTransactionId(ctx->next_fxid),
+												XidFromFullTransactionId(ctx->next_fxid)));
+				return false;	/* corrupt */
+			case XID_PRECEDES_RELMIN:
+				report_main_corruption(ctx,
+									   psprintf("update xid %u precedes relation freeze threshold %u:%u",
+												xmax,
+												EpochFromFullTransactionId(ctx->relfrozenfxid),
+												XidFromFullTransactionId(ctx->relfrozenfxid)));
+				return false;	/* corrupt */
+			case XID_PRECEDES_CLUSTERMIN:
+				report_main_corruption(ctx,
+									   psprintf("update xid %u precedes oldest valid transaction ID %u:%u",
+												xmax,
+												EpochFromFullTransactionId(ctx->oldest_fxid),
+												XidFromFullTransactionId(ctx->oldest_fxid)));
+				return false;	/* corrupt */
+			case XID_BOUNDS_OK:
+				break;
+		}
+
+		switch (xmax_status)
+		{
+			case XID_IS_CURRENT_XID:
+			case XID_IN_PROGRESS:
+
+				/*
+				 * The delete is in progress, so it cannot be visible to our
+				 * snapshot.
+				 */
+				ctx->tuple_is_volatile = false;
+				return true;	/* checkable */
+			case XID_COMMITTED:
+
+				/*
+				 * The delete committed.  Whether the toast can be vacuumed
+				 * away depends on how old the deleting transaction is.
+				 */
+				ctx->tuple_is_volatile = TransactionIdPrecedes(xmax,
+															   ctx->safe_xmin);
+				return true;	/* checkable */
+			case XID_ABORTED:
+
+				/*
+				 * The delete aborted or crashed.  The tuple is still live.
+				 */
+				ctx->tuple_is_volatile = false;
+				return true;	/* checkable */
+		}
+	}
+
+	/*
+	 * The tuple is deleted.  Whether the toast can be vacuumed away depends
+	 * on how old the deleting transaction is.
+	 */
+	xmax = HeapTupleHeaderGetRawXmax(tuphdr);
+
+	switch (get_xid_status(xmax, ctx, &xmax_status))
+	{
+		case XID_IN_FUTURE:
+			report_main_corruption(ctx,
+								   psprintf("xmax %u equals or exceeds next valid transaction ID %u:%u",
+											xmax,
+											EpochFromFullTransactionId(ctx->next_fxid),
+											XidFromFullTransactionId(ctx->next_fxid)));
+			return false;		/* corrupt */
+		case XID_PRECEDES_RELMIN:
+			report_main_corruption(ctx,
+								   psprintf("xmax %u precedes relation freeze threshold %u:%u",
+											xmax,
+											EpochFromFullTransactionId(ctx->relfrozenfxid),
+											XidFromFullTransactionId(ctx->relfrozenfxid)));
+			return false;		/* corrupt */
+		case XID_PRECEDES_CLUSTERMIN:
+			report_main_corruption(ctx,
+								   psprintf("xmax %u precedes oldest valid transaction ID %u:%u",
+											xmax,
+											EpochFromFullTransactionId(ctx->oldest_fxid),
+											XidFromFullTransactionId(ctx->oldest_fxid)));
+			return false;		/* corrupt */
+		case XID_BOUNDS_OK:
+		case XID_INVALID:
+			break;
+	}
+
+	switch (xmax_status)
+	{
+		case XID_IS_CURRENT_XID:
+		case XID_IN_PROGRESS:
+
+			/*
+			 * The delete is in progress, so it cannot be visible to our
+			 * snapshot.
+			 */
+			ctx->tuple_is_volatile = false;
+			return true;		/* checkable */
+		case XID_COMMITTED:
+
+			/*
+			 * The delete committed.  Whether the toast can be vacuumed away
+			 * depends on how old the deleting transaction is.
+			 */
+			ctx->tuple_is_volatile = TransactionIdPrecedes(xmax,
+														   ctx->safe_xmin);
+			return true;		/* checkable */
+		case XID_ABORTED:
+
+			/*
+			 * The delete aborted or crashed.  The tuple is still live.
+			 */
+			ctx->tuple_is_volatile = false;
+			return true;		/* checkable */
+	}
+
+	return false;				/* not reached */
 }
 
 /*
@@ -826,7 +1162,8 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
  * as each toast tuple having its varlena structure sanity checked.
  */
 static void
-check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
+check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
+				  ToastCheckContext *tctx)
 {
 	int32		curchunk;
 	Pointer		chunk;
@@ -841,16 +1178,16 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 										 ctx->toast_rel->rd_att, &isnull));
 	if (isnull)
 	{
-		report_corruption(ctx,
-						  pstrdup("toast chunk sequence number is null"));
+		report_toast_corruption(ctx, tctx,
+								pstrdup("toast chunk sequence number is null"));
 		return;
 	}
 	chunk = DatumGetPointer(fastgetattr(toasttup, 3,
 										ctx->toast_rel->rd_att, &isnull));
 	if (isnull)
 	{
-		report_corruption(ctx,
-						  pstrdup("toast chunk data is null"));
+		report_toast_corruption(ctx, tctx,
+								pstrdup("toast chunk data is null"));
 		return;
 	}
 	if (!VARATT_IS_EXTENDED(chunk))
@@ -867,37 +1204,37 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 		/* should never happen */
 		uint32		header = ((varattrib_4b *) chunk)->va_4byte.va_header;
 
-		report_corruption(ctx,
-						  psprintf("corrupt extended toast chunk has invalid varlena header: %0x (sequence number %d)",
-								   header, curchunk));
+		report_toast_corruption(ctx, tctx,
+								psprintf("corrupt extended toast chunk has invalid varlena header: %0x (sequence number %d)",
+										 header, curchunk));
 		return;
 	}
 
 	/*
 	 * Some checks on the data we've found
 	 */
-	if (curchunk != ctx->chunkno)
+	if (curchunk != tctx->chunkno)
 	{
-		report_corruption(ctx,
-						  psprintf("toast chunk sequence number %u does not match the expected sequence number %u",
-								   curchunk, ctx->chunkno));
+		report_toast_corruption(ctx, tctx,
+								psprintf("toast chunk sequence number %u does not match the expected sequence number %u",
+										 curchunk, tctx->chunkno));
 		return;
 	}
-	if (curchunk > ctx->endchunk)
+	if (curchunk > tctx->endchunk)
 	{
-		report_corruption(ctx,
-						  psprintf("toast chunk sequence number %u exceeds the end chunk sequence number %u",
-								   curchunk, ctx->endchunk));
+		report_toast_corruption(ctx, tctx,
+								psprintf("toast chunk sequence number %u exceeds the end chunk sequence number %u",
+										 curchunk, tctx->endchunk));
 		return;
 	}
 
-	expected_size = curchunk < ctx->totalchunks - 1 ? TOAST_MAX_CHUNK_SIZE
-		: ctx->attrsize - ((ctx->totalchunks - 1) * TOAST_MAX_CHUNK_SIZE);
+	expected_size = curchunk < tctx->totalchunks - 1 ? TOAST_MAX_CHUNK_SIZE
+		: tctx->attrsize - ((tctx->totalchunks - 1) * TOAST_MAX_CHUNK_SIZE);
 	if (chunksize != expected_size)
 	{
-		report_corruption(ctx,
-						  psprintf("toast chunk size %u differs from the expected size %u",
-								   chunksize, expected_size));
+		report_toast_corruption(ctx, tctx,
+								psprintf("toast chunk size %u differs from the expected size %u",
+										 chunksize, expected_size));
 		return;
 	}
 }
@@ -907,17 +1244,17 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
  * found in ctx->tupstore.
  *
  * This function follows the logic performed by heap_deform_tuple(), and in the
- * case of a toasted value, optionally continues along the logic of
- * detoast_external_attr(), checking for any conditions that would result in
- * either of those functions Asserting or crashing the backend.  The checks
- * performed by Asserts present in those two functions are also performed here.
- * In cases where those two functions are a bit cavalier in their assumptions
- * about data being correct, we perform additional checks not present in either
- * of those two functions.  Where some condition is checked in both of those
- * functions, we perform it here twice, as we parallel the logical flow of
- * those two functions.  The presence of duplicate checks seems a reasonable
- * price to pay for keeping this code tightly coupled with the code it
- * protects.
+ * case of a toasted value, optionally stores the toast pointer so that it can
+ * be checked later following the logic of detoast_external_attr(), checking for any
+ * conditions that would result in either of those functions Asserting or
+ * crashing the backend.  The checks performed by Asserts present in those two
+ * functions are also performed here and in check_toasted_attributes.  In cases
+ * where those two functions are a bit cavalier in their assumptions about data
+ * being correct, we perform additional checks not present in either of those
+ * two functions.  Where some condition is checked in both of those functions,
+ * we perform it here twice, as we parallel the logical flow of those two
+ * functions.  The presence of duplicate checks seems a reasonable price to pay
+ * for keeping this code tightly coupled with the code it protects.
  *
  * Returns true if the tuple attribute is sane enough for processing to
  * continue on to the next attribute, false otherwise.
@@ -925,12 +1262,6 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 static bool
 check_tuple_attribute(HeapCheckContext *ctx)
 {
-	struct varatt_external toast_pointer;
-	ScanKeyData toastkey;
-	SysScanDesc toastscan;
-	SnapshotData SnapshotToast;
-	HeapTuple	toasttup;
-	bool		found_toasttup;
 	Datum		attdatum;
 	struct varlena *attr;
 	char	   *tp;				/* pointer to the tuple data */
@@ -944,12 +1275,12 @@ check_tuple_attribute(HeapCheckContext *ctx)
 
 	if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
 	{
-		report_corruption(ctx,
-						  psprintf("attribute %u with length %u starts at offset %u beyond total tuple length %u",
-								   ctx->attnum,
-								   thisatt->attlen,
-								   ctx->tuphdr->t_hoff + ctx->offset,
-								   ctx->lp_len));
+		report_main_corruption(ctx,
+							   psprintf("attribute %u with length %u starts at offset %u beyond total tuple length %u",
+										ctx->attnum,
+										thisatt->attlen,
+										ctx->tuphdr->t_hoff + ctx->offset,
+										ctx->lp_len));
 		return false;
 	}
 
@@ -965,12 +1296,12 @@ check_tuple_attribute(HeapCheckContext *ctx)
 											tp + ctx->offset);
 		if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
 		{
-			report_corruption(ctx,
-							  psprintf("attribute %u with length %u ends at offset %u beyond total tuple length %u",
-									   ctx->attnum,
-									   thisatt->attlen,
-									   ctx->tuphdr->t_hoff + ctx->offset,
-									   ctx->lp_len));
+			report_main_corruption(ctx,
+								   psprintf("attribute %u with length %u ends at offset %u beyond total tuple length %u",
+											ctx->attnum,
+											thisatt->attlen,
+											ctx->tuphdr->t_hoff + ctx->offset,
+											ctx->lp_len));
 			return false;
 		}
 		return true;
@@ -998,10 +1329,10 @@ check_tuple_attribute(HeapCheckContext *ctx)
 
 		if (va_tag != VARTAG_ONDISK)
 		{
-			report_corruption(ctx,
-							  psprintf("toasted attribute %u has unexpected TOAST tag %u",
-									   ctx->attnum,
-									   va_tag));
+			report_main_corruption(ctx,
+								   psprintf("toasted attribute %u has unexpected TOAST tag %u",
+											ctx->attnum,
+											va_tag));
 			/* We can't know where the next attribute begins */
 			return false;
 		}
@@ -1013,12 +1344,12 @@ check_tuple_attribute(HeapCheckContext *ctx)
 
 	if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
 	{
-		report_corruption(ctx,
-						  psprintf("attribute %u with length %u ends at offset %u beyond total tuple length %u",
-								   ctx->attnum,
-								   thisatt->attlen,
-								   ctx->tuphdr->t_hoff + ctx->offset,
-								   ctx->lp_len));
+		report_main_corruption(ctx,
+							   psprintf("attribute %u with length %u ends at offset %u beyond total tuple length %u",
+										ctx->attnum,
+										thisatt->attlen,
+										ctx->tuphdr->t_hoff + ctx->offset,
+										ctx->lp_len));
 
 		return false;
 	}
@@ -1045,18 +1376,18 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	/* The tuple header better claim to contain toasted values */
 	if (!(infomask & HEAP_HASEXTERNAL))
 	{
-		report_corruption(ctx,
-						  psprintf("attribute %u is external but tuple header flag HEAP_HASEXTERNAL not set",
-								   ctx->attnum));
+		report_main_corruption(ctx,
+							   psprintf("attribute %u is external but tuple header flag HEAP_HASEXTERNAL not set",
+										ctx->attnum));
 		return true;
 	}
 
 	/* The relation better have a toast table */
 	if (!ctx->rel->rd_rel->reltoastrelid)
 	{
-		report_corruption(ctx,
-						  psprintf("attribute %u is external but relation has no toast relation",
-								   ctx->attnum));
+		report_main_corruption(ctx,
+							   psprintf("attribute %u is external but relation has no toast relation",
+										ctx->attnum));
 		return true;
 	}
 
@@ -1065,189 +1396,115 @@ check_tuple_attribute(HeapCheckContext *ctx)
 		return true;
 
 	/*
-	 * Must copy attr into toast_pointer for alignment considerations
+	 * If this tuple is at risk of being vacuumed away, we cannot check the
+	 * toast.  Otherwise, we save a copy of the toast pointer so we can check
+	 * it after releasing the main table buffer lock.
 	 */
-	VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
-
-	ctx->attrsize = VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer);
-	ctx->endchunk = (ctx->attrsize - 1) / TOAST_MAX_CHUNK_SIZE;
-	ctx->totalchunks = ctx->endchunk + 1;
+	if (!ctx->tuple_is_volatile)
+	{
+		ToastCheckContext *tctx;
 
-	/*
-	 * Setup a scan key to find chunks in toast table with matching va_valueid
-	 */
-	ScanKeyInit(&toastkey,
-				(AttrNumber) 1,
-				BTEqualStrategyNumber, F_OIDEQ,
-				ObjectIdGetDatum(toast_pointer.va_valueid));
+		tctx = (ToastCheckContext *) palloc0fast(sizeof(ToastCheckContext));
 
-	/*
-	 * Check if any chunks for this toasted object exist in the toast table,
-	 * accessible via the index.
-	 */
-	init_toast_snapshot(&SnapshotToast);
-	toastscan = systable_beginscan_ordered(ctx->toast_rel,
-										   ctx->valid_toast_index,
-										   &SnapshotToast, 1,
-										   &toastkey);
-	ctx->chunkno = 0;
-	found_toasttup = false;
-	while ((toasttup =
-			systable_getnext_ordered(toastscan,
-									 ForwardScanDirection)) != NULL)
-	{
-		found_toasttup = true;
-		check_toast_tuple(toasttup, ctx);
-		ctx->chunkno++;
+		VARATT_EXTERNAL_GET_POINTER(tctx->toast_pointer, attr);
+		tctx->blkno = ctx->blkno;
+		tctx->offnum = ctx->offnum;
+		tctx->attnum = ctx->attnum;
+		ctx->toasted_attributes = lappend(ctx->toasted_attributes, tctx);
 	}
-	if (!found_toasttup)
-		report_corruption(ctx,
-						  psprintf("toasted value for attribute %u missing from toast table",
-								   ctx->attnum));
-	else if (ctx->chunkno != (ctx->endchunk + 1))
-		report_corruption(ctx,
-						  psprintf("final toast chunk number %u differs from expected value %u",
-								   ctx->chunkno, (ctx->endchunk + 1)));
-	systable_endscan_ordered(toastscan);
 
 	return true;
 }
 
 /*
- * Check the current tuple as tracked in ctx, recording any corruption found in
- * ctx->tupstore.
+ * For each attribute collected in ctx->toasted_attributes, look up the value
+ * in the toast table and perform checks on it.  This function should only be
+ * called on toast pointers which cannot be vacuumed away during our
+ * processing.
  */
 static void
-check_tuple(HeapCheckContext *ctx)
+check_toasted_attributes(HeapCheckContext *ctx)
 {
-	TransactionId xmin;
-	TransactionId xmax;
-	bool		fatal = false;
-	uint16		infomask = ctx->tuphdr->t_infomask;
+	ListCell   *cell;
 
-	/* If xmin is normal, it should be within valid range */
-	xmin = HeapTupleHeaderGetXmin(ctx->tuphdr);
-	switch (get_xid_status(xmin, ctx, NULL))
+	foreach(cell, ctx->toasted_attributes)
 	{
-		case XID_INVALID:
-		case XID_BOUNDS_OK:
-			break;
-		case XID_IN_FUTURE:
-			report_corruption(ctx,
-							  psprintf("xmin %u equals or exceeds next valid transaction ID %u:%u",
-									   xmin,
-									   EpochFromFullTransactionId(ctx->next_fxid),
-									   XidFromFullTransactionId(ctx->next_fxid)));
-			fatal = true;
-			break;
-		case XID_PRECEDES_CLUSTERMIN:
-			report_corruption(ctx,
-							  psprintf("xmin %u precedes oldest valid transaction ID %u:%u",
-									   xmin,
-									   EpochFromFullTransactionId(ctx->oldest_fxid),
-									   XidFromFullTransactionId(ctx->oldest_fxid)));
-			fatal = true;
-			break;
-		case XID_PRECEDES_RELMIN:
-			report_corruption(ctx,
-							  psprintf("xmin %u precedes relation freeze threshold %u:%u",
-									   xmin,
-									   EpochFromFullTransactionId(ctx->relfrozenfxid),
-									   XidFromFullTransactionId(ctx->relfrozenfxid)));
-			fatal = true;
-			break;
-	}
+		ToastCheckContext *tctx;
+		SnapshotData SnapshotToast;
+		ScanKeyData toastkey;
+		SysScanDesc toastscan;
+		bool		found_toasttup;
+		HeapTuple	toasttup;
+
+		tctx = lfirst(cell);
+		tctx->attrsize = VARATT_EXTERNAL_GET_EXTSIZE(tctx->toast_pointer);
+		tctx->endchunk = (tctx->attrsize - 1) / TOAST_MAX_CHUNK_SIZE;
+		tctx->totalchunks = tctx->endchunk + 1;
 
-	xmax = HeapTupleHeaderGetRawXmax(ctx->tuphdr);
+		/*
+		 * Setup a scan key to find chunks in toast table with matching
+		 * va_valueid
+		 */
+		ScanKeyInit(&toastkey,
+					(AttrNumber) 1,
+					BTEqualStrategyNumber, F_OIDEQ,
+					ObjectIdGetDatum(tctx->toast_pointer.va_valueid));
 
-	if (infomask & HEAP_XMAX_IS_MULTI)
-	{
-		/* xmax is a multixact, so it should be within valid MXID range */
-		switch (check_mxid_valid_in_rel(xmax, ctx))
-		{
-			case XID_INVALID:
-				report_corruption(ctx,
-								  pstrdup("multitransaction ID is invalid"));
-				fatal = true;
-				break;
-			case XID_PRECEDES_RELMIN:
-				report_corruption(ctx,
-								  psprintf("multitransaction ID %u precedes relation minimum multitransaction ID threshold %u",
-										   xmax, ctx->relminmxid));
-				fatal = true;
-				break;
-			case XID_PRECEDES_CLUSTERMIN:
-				report_corruption(ctx,
-								  psprintf("multitransaction ID %u precedes oldest valid multitransaction ID threshold %u",
-										   xmax, ctx->oldest_mxact));
-				fatal = true;
-				break;
-			case XID_IN_FUTURE:
-				report_corruption(ctx,
-								  psprintf("multitransaction ID %u equals or exceeds next valid multitransaction ID %u",
-										   xmax,
-										   ctx->next_mxact));
-				fatal = true;
-				break;
-			case XID_BOUNDS_OK:
-				break;
-		}
-	}
-	else
-	{
 		/*
-		 * xmax is not a multixact and is normal, so it should be within the
-		 * valid XID range.
+		 * Check if any chunks for this toasted object exist in the toast
+		 * table, accessible via the index.
 		 */
-		switch (get_xid_status(xmax, ctx, NULL))
+		init_toast_snapshot(&SnapshotToast);
+		toastscan = systable_beginscan_ordered(ctx->toast_rel,
+											   ctx->valid_toast_index,
+											   &SnapshotToast, 1,
+											   &toastkey);
+		tctx->chunkno = 0;
+		found_toasttup = false;
+		while ((toasttup =
+				systable_getnext_ordered(toastscan,
+										 ForwardScanDirection)) != NULL)
 		{
-			case XID_INVALID:
-			case XID_BOUNDS_OK:
-				break;
-			case XID_IN_FUTURE:
-				report_corruption(ctx,
-								  psprintf("xmax %u equals or exceeds next valid transaction ID %u:%u",
-										   xmax,
-										   EpochFromFullTransactionId(ctx->next_fxid),
-										   XidFromFullTransactionId(ctx->next_fxid)));
-				fatal = true;
-				break;
-			case XID_PRECEDES_CLUSTERMIN:
-				report_corruption(ctx,
-								  psprintf("xmax %u precedes oldest valid transaction ID %u:%u",
-										   xmax,
-										   EpochFromFullTransactionId(ctx->oldest_fxid),
-										   XidFromFullTransactionId(ctx->oldest_fxid)));
-				fatal = true;
-				break;
-			case XID_PRECEDES_RELMIN:
-				report_corruption(ctx,
-								  psprintf("xmax %u precedes relation freeze threshold %u:%u",
-										   xmax,
-										   EpochFromFullTransactionId(ctx->relfrozenfxid),
-										   XidFromFullTransactionId(ctx->relfrozenfxid)));
-				fatal = true;
+			found_toasttup = true;
+			check_toast_tuple(toasttup, ctx, tctx);
+			tctx->chunkno++;
 		}
+		if (!found_toasttup)
+			report_toast_corruption(ctx, tctx,
+									psprintf("toasted value for attribute %u missing from toast table",
+											 tctx->attnum));
+		else if (tctx->chunkno != (tctx->endchunk + 1))
+			report_toast_corruption(ctx, tctx,
+									psprintf("final toast chunk number %u differs from expected value %u",
+											 tctx->chunkno, (tctx->endchunk + 1)));
+		systable_endscan_ordered(toastscan);
+
+		pfree(tctx);
 	}
+	list_free(ctx->toasted_attributes);
+	ctx->toasted_attributes = NIL;
+}
 
+/*
+ * Check the current tuple as tracked in ctx, recording any corruption found in
+ * ctx->tupstore.
+ */
+static void
+check_tuple(HeapCheckContext *ctx)
+{
 	/*
-	 * Cannot process tuple data if tuple header was corrupt, as the offsets
-	 * within the page cannot be trusted, leaving too much risk of reading
-	 * garbage if we continue.
-	 *
-	 * We also cannot process the tuple if the xmin or xmax were invalid
-	 * relative to relfrozenxid or relminmxid, as clog entries for the xids
-	 * may already be gone.
+	 * Check various forms of tuple header corruption.  If the header is too
+	 * corrupt to continue checking, we cannot continue with other checks.
 	 */
-	if (fatal)
+	if (!check_tuple_header(ctx))
 		return;
 
 	/*
-	 * Check various forms of tuple header corruption.  If the header is too
-	 * corrupt to continue checking, or if the tuple is not visible to anyone,
-	 * we cannot continue with other checks.
+	 * Check tuple visibility.  If the inserting transaction aborted, we
+	 * cannot assume our relation description matches the tuple structure, and
+	 * therefore cannot check it.
 	 */
-	if (!check_tuple_header_and_visibilty(ctx->tuphdr, ctx))
+	if (!check_tuple_visibility(ctx))
 		return;
 
 	/*
@@ -1257,10 +1514,10 @@ check_tuple(HeapCheckContext *ctx)
 	 */
 	if (RelationGetDescr(ctx->rel)->natts < ctx->natts)
 	{
-		report_corruption(ctx,
-						  psprintf("number of attributes %u exceeds maximum expected for table %u",
-								   ctx->natts,
-								   RelationGetDescr(ctx->rel)->natts));
+		report_main_corruption(ctx,
+							   psprintf("number of attributes %u exceeds maximum expected for table %u",
+										ctx->natts,
+										RelationGetDescr(ctx->rel)->natts));
 		return;
 	}
 
@@ -1269,6 +1526,10 @@ check_tuple(HeapCheckContext *ctx)
 	 * next, at which point we abort further attribute checks for this tuple.
 	 * Note that we don't abort for all types of corruption, only for those
 	 * types where we don't know how to continue.
+	 *
+	 * While checking the tuple attributes, we build a list of toast pointers
+	 * we encounter, to be checked later.  If further attribute checking is
+	 * aborted, we still have the pointers collected prior to aborting.
 	 */
 	ctx->offset = 0;
 	for (ctx->attnum = 0; ctx->attnum < ctx->natts; ctx->attnum++)
@@ -1448,7 +1709,7 @@ get_xid_status(TransactionId xid, HeapCheckContext *ctx,
 	if (FullTransactionIdPrecedesOrEquals(clog_horizon, fxid))
 	{
 		if (TransactionIdIsCurrentTransactionId(xid))
-			*status = XID_IN_PROGRESS;
+			*status = XID_IS_CURRENT_XID;
 		else if (TransactionIdDidCommit(xid))
 			*status = XID_COMMITTED;
 		else if (TransactionIdDidAbort(xid))
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9e6777e9d0..0ce261e2a2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2557,6 +2557,7 @@ TimestampTz
 TmFromChar
 TmToChar
 ToastAttrInfo
+ToastCheckContext
 ToastTupleContext
 TocEntry
 TokenAuxData
-- 
2.21.1 (Apple Git-122.3)

#106Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#105)
Re: pg_amcheck contrib application

On Mon, Mar 29, 2021 at 1:45 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

Thanks! The attached patch addresses your comments here and in your prior email. In particular, this patch changes the tuple visibility logic to not check tuples for which the inserting transaction aborted or is still in progress, and to not check toast for tuples deleted in transactions older than our transaction snapshot's xmin. A list of toasted attributes which are safe to check is compiled per main table page during the scan of the page, then checked after the buffer lock on the main page is released.
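
The deferral described above — record the toast pointers while the buffer lock is held, then look them up only after the lock is released — can be sketched in miniature as below. This is not the patch's actual code: names like DeferredCheck are invented for illustration, and a plain linked list stands in for the patch's List of ToastCheckContext entries and its toast-table scan.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Simplified stand-in for the patch's ToastCheckContext: just enough
 * state to remember which attribute to check later.
 */
typedef struct DeferredCheck
{
	int			blkno;
	int			attnum;
	struct DeferredCheck *next;
} DeferredCheck;

/*
 * Phase 1: while the (conceptual) buffer lock is held, only record the
 * toast pointer; do not touch the toast table yet.
 */
static DeferredCheck *
defer_toast_check(DeferredCheck *head, int blkno, int attnum)
{
	DeferredCheck *d = malloc(sizeof(DeferredCheck));

	d->blkno = blkno;
	d->attnum = attnum;
	d->next = head;
	return d;
}

/*
 * Phase 2: after releasing the lock, walk the list and check each
 * entry, freeing as we go.  Returns the number of checks performed.
 * The real code scans the toast table via its index here.
 */
static int
run_deferred_checks(DeferredCheck *head)
{
	int			n = 0;

	while (head != NULL)
	{
		DeferredCheck *next = head->next;

		n++;
		free(head);
		head = next;
	}
	return n;
}
```

The point of the two-phase shape is that no toast-table access (which may itself block or take time) ever happens while the main-table buffer lock is held.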

In the perhaps unusual case where verify_heapam() is called in a transaction which has also added tuples to the table being checked, this patch's visibility logic chooses not to check such tuples. I'm on the fence about this choice, and am mostly following your lead. I like that this decision maintains the invariant that we never check tuples which have not yet been committed.

The patch includes a bit of refactoring. In the old code, heap_check() performed clog bounds checking on xmin and xmax prior to calling check_tuple_header_and_visibilty(), but I think that's not such a great choice. If the tuple header is garbled to have random bytes in the xmin and xmax fields, and we can detect that situation because other tuple header fields are garbled in detectable ways, I'd rather get a report about the header being garbled than a report about the xmin or xmax being out of bounds. In the new code, the tuple header is checked first, then the visibility is checked, then the tuple is checked against the current relation description, then the tuple attributes are checked. I think the layout is easier to follow, too.
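
The staged ordering described above — header, then visibility, then relation description, then attributes, with each stage able to abort the rest — can be sketched as follows. This is a toy model, not the patch's code: the ToyTuple flags stand in for the conditions the real check_tuple_header() and check_tuple_visibility() detect by examining the tuple header and clog.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy tuple with flags standing in for the conditions the real checks detect. */
typedef struct ToyTuple
{
	bool		header_ok;		/* header sane enough to trust offsets */
	bool		visible;		/* inserter committed; structure trustworthy */
	int			natts;			/* attribute count claimed by the tuple */
} ToyTuple;

static bool toy_check_header(const ToyTuple *t)     { return t->header_ok; }
static bool toy_check_visibility(const ToyTuple *t) { return t->visible; }

/*
 * Mirrors the staged structure of check_tuple(): each stage returns
 * early when later stages would be unsafe.  Returns the number of
 * attributes actually examined.
 */
static int
toy_check_tuple(const ToyTuple *t, int rel_natts)
{
	if (!toy_check_header(t))
		return 0;				/* offsets cannot be trusted */
	if (!toy_check_visibility(t))
		return 0;				/* tuple may not match relation descriptor */
	if (t->natts > rel_natts)
		return 0;				/* claims more attributes than the relation has */
	return t->natts;			/* the real code checks each attribute here */
}
```

Ordering the stages this way means a garbled header is reported as such, rather than producing a misleading secondary report about an out-of-bounds xmin or xmax read from garbage bytes.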

Hmm, so this got ~10x bigger from my version. Could you perhaps
separate it out into a series of patches for easier review? Say, one
that just fixes the visibility logic, and then a second to avoid doing
the TOAST check with a buffer lock held, and then more than that if
there are other pieces that make sense to separate out?

--
Robert Haas
EDB: http://www.enterprisedb.com

#107Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#106)
4 attachment(s)
Re: pg_amcheck contrib application

On Mar 29, 2021, at 1:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Hmm, so this got ~10x bigger from my version. Could you perhaps
separate it out into a series of patches for easier review? Say, one
that just fixes the visibility logic, and then a second to avoid doing
the TOAST check with a buffer lock held, and then more than that if
there are other pieces that make sense to separate out?

Sure, here are four patches which do the same as the single v12 patch did.

Attachments:

v13-0001-Refactoring-function-check_tuple_header_and_visi.patch (application/octet-stream)
From acee08646354ddb001d7b31a1b8932237bd40405 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Wed, 24 Mar 2021 18:18:56 -0700
Subject: [PATCH v13 1/4] Refactoring function
 check_tuple_header_and_visibility

Extending enum XidCommitStatus to include XID_IS_CURRENT_XID.  The
visibility code for verify_heapam() was conflating XID_IN_PROGRESS
and XID_IS_CURRENT_XID under just one enum, making it harder to
compare the logic to that used by vacuum's visibility function,
which treats those two cases separately.

Simplifying check_tuple_header_and_visibilty signature.  It was
taking both tuphdr and ctx arguments, but the tuphdr is just
ctx->tuphdr, so it is a bit absurd to pass two arguments for this.

Splitting check_tuple_header_and_visibilty() into two functions.
check_tuple_header() and check_tuple_visibility() are split out as
separate functions, but otherwise behave exactly as before.
---
 contrib/amcheck/verify_heapam.c | 82 +++++++++++++++++++--------------
 1 file changed, 47 insertions(+), 35 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 6f972e630a..9172b5fd81 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -46,6 +46,7 @@ typedef enum XidBoundsViolation
 typedef enum XidCommitStatus
 {
 	XID_COMMITTED,
+	XID_IS_CURRENT_XID,
 	XID_IN_PROGRESS,
 	XID_ABORTED
 } XidCommitStatus;
@@ -133,8 +134,8 @@ static void check_tuple(HeapCheckContext *ctx);
 static void check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx);
 
 static bool check_tuple_attribute(HeapCheckContext *ctx);
-static bool check_tuple_header_and_visibilty(HeapTupleHeader tuphdr,
-											 HeapCheckContext *ctx);
+static bool check_tuple_header(HeapCheckContext *ctx);
+static bool check_tuple_visibility(HeapCheckContext *ctx);
 
 static void report_corruption(HeapCheckContext *ctx, char *msg);
 static TupleDesc verify_heapam_tupdesc(void);
@@ -555,16 +556,11 @@ verify_heapam_tupdesc(void)
 }
 
 /*
- * Check for tuple header corruption and tuple visibility.
- *
- * Since we do not hold a snapshot, tuple visibility is not a question of
- * whether we should be able to see the tuple relative to any particular
- * snapshot, but rather a question of whether it is safe and reasonable to
- * check the tuple attributes.
+ * Check for tuple header corruption.
  *
  * Some kinds of corruption make it unsafe to check the tuple attributes, for
  * example when the line pointer refers to a range of bytes outside the page.
- * In such cases, we return false (not visible) after recording appropriate
+ * In such cases, we return false (not checkable) after recording appropriate
  * corruption messages.
  *
  * Some other kinds of tuple header corruption confuse the question of where
@@ -576,27 +572,16 @@ verify_heapam_tupdesc(void)
  *
  * Other kinds of tuple header corruption do not bear on the question of
  * whether the tuple attributes can be checked, so we record corruption
- * messages for them but do not base our visibility determination on them.  (In
- * other words, we do not return false merely because we detected them.)
- *
- * For visibility determination not specifically related to corruption, what we
- * want to know is if a tuple is potentially visible to any running
- * transaction.  If you are tempted to replace this function's visibility logic
- * with a call to another visibility checking function, keep in mind that this
- * function does not update hint bits, as it seems imprudent to write hint bits
- * (or anything at all) to a table during a corruption check.  Nor does this
- * function bother classifying tuple visibility beyond a boolean visible vs.
- * not visible.
+ * messages for them but we do not return false merely because we detected
+ * them.
  *
- * The caller should already have checked that xmin and xmax are not out of
- * bounds for the relation.
- *
- * Returns whether the tuple is both visible and sufficiently sensible to
- * undergo attribute checks.
+ * Returns whether the tuple is sufficiently sensible to undergo visibility and
+ * attribute checks.
  */
 static bool
-check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
+check_tuple_header(HeapCheckContext *ctx)
 {
+	HeapTupleHeader tuphdr = ctx->tuphdr;
 	uint16		infomask = tuphdr->t_infomask;
 	bool		header_garbled = false;
 	unsigned	expected_hoff;
@@ -651,13 +636,34 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
 	if (header_garbled)
 		return false;			/* checking of this tuple should not continue */
 
-	/*
-	 * Ok, we can examine the header for tuple visibility purposes, though we
-	 * still need to be careful about a few remaining types of header
-	 * corruption.  This logic roughly follows that of
-	 * HeapTupleSatisfiesVacuum.  Where possible the comments indicate which
-	 * HTSV_Result we think that function might return for this tuple.
-	 */
+	return true;				/* header ok */
+}
+
+/*
+ * Checks whether a tuple is visible for checking.
+ *
+ * Since we do not hold a snapshot, tuple visibility is not a question of
+ * whether we should be able to see the tuple relative to any particular
+ * snapshot, but rather a question of whether it is safe and reasonable to
+ * check the tuple attributes.
+ *
+ * For visibility determination not specifically related to corruption, what we
+ * want to know is if a tuple is potentially visible to any running
+ * transaction.  If you are tempted to replace this function's visibility logic
+ * with a call to another visibility checking function, keep in mind that this
+ * function does not update hint bits, as it seems imprudent to write hint bits
+ * (or anything at all) to a table during a corruption check.  Nor does this
+ * function bother classifying tuple visibility beyond a boolean visible vs.
+ * not visible.
+ *
+ * Returns whether the tuple is visible for checking.
+ */
+static bool
+check_tuple_visibility(HeapCheckContext *ctx)
+{
+	HeapTupleHeader tuphdr = ctx->tuphdr;
+	uint16		infomask = tuphdr->t_infomask;
+
 	if (!HeapTupleHeaderXminCommitted(tuphdr))
 	{
 		TransactionId raw_xmin = HeapTupleHeaderGetRawXmin(tuphdr);
@@ -704,6 +710,7 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
 					switch (status)
 					{
 						case XID_IN_PROGRESS:
+						case XID_IS_CURRENT_XID:
 							return true;	/* HEAPTUPLE_DELETE_IN_PROGRESS */
 						case XID_COMMITTED:
 						case XID_ABORTED:
@@ -748,6 +755,7 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
 						case XID_COMMITTED:
 							break;
 						case XID_IN_PROGRESS:
+						case XID_IS_CURRENT_XID:
 							return true;	/* insert or delete in progress */
 						case XID_ABORTED:
 							return false;	/* HEAPTUPLE_DEAD */
@@ -795,6 +803,7 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
 					switch (status)
 					{
 						case XID_IN_PROGRESS:
+						case XID_IS_CURRENT_XID:
 							return true;	/* HEAPTUPLE_DELETE_IN_PROGRESS */
 						case XID_COMMITTED:
 						case XID_ABORTED:
@@ -1247,7 +1256,10 @@ check_tuple(HeapCheckContext *ctx)
 	 * corrupt to continue checking, or if the tuple is not visible to anyone,
 	 * we cannot continue with other checks.
 	 */
-	if (!check_tuple_header_and_visibilty(ctx->tuphdr, ctx))
+	if (!check_tuple_header(ctx))
+		return;
+
+	if (!check_tuple_visibility(ctx))
 		return;
 
 	/*
@@ -1448,7 +1460,7 @@ get_xid_status(TransactionId xid, HeapCheckContext *ctx,
 	if (FullTransactionIdPrecedesOrEquals(clog_horizon, fxid))
 	{
 		if (TransactionIdIsCurrentTransactionId(xid))
-			*status = XID_IN_PROGRESS;
+			*status = XID_IS_CURRENT_XID;
 		else if (TransactionIdDidCommit(xid))
 			*status = XID_COMMITTED;
 		else if (TransactionIdDidAbort(xid))
-- 
2.21.1 (Apple Git-122.3)

v13-0002-Replacing-implementation-of-check_tuple_visibili.patch (application/octet-stream)
From 4838129ccdf3917f71f52035330752e3b5d7416a Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Mon, 29 Mar 2021 14:31:13 -0700
Subject: [PATCH v13 2/4] Replacing implementation of check_tuple_visibility

Using a modified version of HeapTupleSatisfiesVacuumHorizon.
---
 contrib/amcheck/verify_heapam.c | 480 +++++++++++++++++++++++++-------
 1 file changed, 372 insertions(+), 108 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 9172b5fd81..59b13180d9 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -73,6 +73,8 @@ typedef struct HeapCheckContext
 	TransactionId oldest_xid;	/* ShmemVariableCache->oldestXid */
 	FullTransactionId oldest_fxid;	/* 64-bit version of oldest_xid, computed
 									 * relative to next_fxid */
+	TransactionId safe_xmin;	/* this XID and newer ones can't become
+								 * all-visible while we're running */
 
 	/*
 	 * Cached copy of value from MultiXactState
@@ -114,6 +116,9 @@ typedef struct HeapCheckContext
 	uint32		offset;			/* offset in tuple data */
 	AttrNumber	attnum;
 
+	/* True if toast for this tuple could be vacuumed away */
+	bool		tuple_is_volatile;
+
 	/* Values for iterating over toast for the attribute */
 	int32		chunkno;
 	int32		attrsize;
@@ -249,6 +254,12 @@ verify_heapam(PG_FUNCTION_ARGS)
 	memset(&ctx, 0, sizeof(HeapCheckContext));
 	ctx.cached_xid = InvalidTransactionId;
 
+	/*
+	 * Any xmin newer than the xmin of our snapshot can't become all-visible
+	 * while we're running.
+	 */
+	ctx.safe_xmin = GetTransactionSnapshot()->xmin;
+
 	/*
 	 * If we report corruption when not examining some individual attribute,
 	 * we need attnum to be reported as NULL.  Set that up before any
@@ -640,189 +651,442 @@ check_tuple_header(HeapCheckContext *ctx)
 }
 
 /*
- * Checks whether a tuple is visible for checking.
+ * Checks whether a tuple is visible to our transaction for checking, which is
+ * not a question of whether we should be able to see the tuple relative to any
+ * particular snapshot, but rather a question of whether it is safe and
+ * reasonable to check the tuple attributes.  The caller should already have
+ * checked that the tuple is sufficiently sensible for us to evaluate.
  *
- * Since we do not hold a snapshot, tuple visibility is not a question of
- * whether we should be able to see the tuple relative to any particular
- * snapshot, but rather a question of whether it is safe and reasonable to
- * check the tuple attributes.
+ * If a tuple could have been inserted by a transaction that also added a
+ * column to the table, but which ultimately did not commit, or which has not
+ * yet committed, then the table's current TupleDesc might differ from the one
+ * used to construct this tuple, so we must not check it.
  *
- * For visibility determination not specifically related to corruption, what we
- * want to know is if a tuple is potentially visible to any running
- * transaction.  If you are tempted to replace this function's visibility logic
- * with a call to another visibility checking function, keep in mind that this
- * function does not update hint bits, as it seems imprudent to write hint bits
- * (or anything at all) to a table during a corruption check.  Nor does this
- * function bother classifying tuple visibility beyond a boolean visible vs.
- * not visible.
+ * As a special case, if our own transaction inserted the tuple, even if we
+ * added a column to the table, our TupleDesc should match.  We could check the
+ * tuple, but choose not to do so.
  *
- * Returns whether the tuple is visible for checking.
+ * If a tuple has been updated or deleted, we can still read the old tuple for
+ * corruption checking purposes, as long as we are careful about concurrent
+ * vacuums.  The main table tuple itself cannot be vacuumed away because we
+ * hold a buffer lock on the page, but if the deleting transaction is older
+ * than our transaction snapshot's xmin, then vacuum could remove the toast at
+ * any time, so we must not check the toast.
+ *
+ * If xmin or xmax values are older than can be checked against clog, or appear
+ * to be in the future (possibly due to wrap-around), then we cannot make a
+ * determination about the visibility of the tuple, so we must not check it.
+ *
+ * Returns true if the tuple should be checked, false otherwise.  Sets
+ * ctx->toast_is_volatile true if the toast might be vacuumed away, false
+ * otherwise.
  */
 static bool
 check_tuple_visibility(HeapCheckContext *ctx)
 {
+	TransactionId xmin;
+	TransactionId xvac;
+	TransactionId xmax;
+	XidCommitStatus xmin_status;
+	XidCommitStatus xvac_status;
+	XidCommitStatus xmax_status;
 	HeapTupleHeader tuphdr = ctx->tuphdr;
-	uint16		infomask = tuphdr->t_infomask;
 
-	if (!HeapTupleHeaderXminCommitted(tuphdr))
+	ctx->tuple_is_volatile = true;	/* have not yet proven otherwise */
+
+	/* If xmin is normal, it should be within valid range */
+	xmin = HeapTupleHeaderGetXmin(tuphdr);
+	switch (get_xid_status(xmin, ctx, &xmin_status))
 	{
-		TransactionId raw_xmin = HeapTupleHeaderGetRawXmin(tuphdr);
+		case XID_INVALID:
+		case XID_BOUNDS_OK:
+			break;
+		case XID_IN_FUTURE:
+			report_corruption(ctx,
+							  psprintf("xmin %u equals or exceeds next valid transaction ID %u:%u",
+									   xmin,
+									   EpochFromFullTransactionId(ctx->next_fxid),
+									   XidFromFullTransactionId(ctx->next_fxid)));
+			return false;		/* corrupt */
+		case XID_PRECEDES_CLUSTERMIN:
+			report_corruption(ctx,
+							  psprintf("xmin %u precedes oldest valid transaction ID %u:%u",
+									   xmin,
+									   EpochFromFullTransactionId(ctx->oldest_fxid),
+									   XidFromFullTransactionId(ctx->oldest_fxid)));
+			return false;		/* corrupt */
+		case XID_PRECEDES_RELMIN:
+			report_corruption(ctx,
+							  psprintf("xmin %u precedes relation freeze threshold %u:%u",
+									   xmin,
+									   EpochFromFullTransactionId(ctx->relfrozenfxid),
+									   XidFromFullTransactionId(ctx->relfrozenfxid)));
+			return false;		/* corrupt */
+	}
 
+	/*
+	 * Has inserting transaction committed?
+	 */
+	if (!HeapTupleHeaderXminCommitted(tuphdr))
+	{
 		if (HeapTupleHeaderXminInvalid(tuphdr))
-			return false;		/* HEAPTUPLE_DEAD */
+
+			/*
+			 * The inserting transaction aborted.  The structure of the tuple
+			 * may not match our relation description, so we cannot check it.
+			 */
+			return false;		/* uncheckable */
 		/* Used by pre-9.0 binary upgrades */
-		else if (infomask & HEAP_MOVED_OFF ||
-				 infomask & HEAP_MOVED_IN)
+		else if (tuphdr->t_infomask & HEAP_MOVED_OFF)
 		{
-			XidCommitStatus status;
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuphdr);
+			xvac = HeapTupleHeaderGetXvac(tuphdr);
 
-			switch (get_xid_status(xvac, ctx, &status))
+			switch (get_xid_status(xvac, ctx, &xvac_status))
 			{
 				case XID_INVALID:
 					report_corruption(ctx,
-									  pstrdup("old-style VACUUM FULL transaction ID is invalid"));
+									  pstrdup("old-style VACUUM FULL transaction ID for moved off tuple is invalid"));
 					return false;	/* corrupt */
 				case XID_IN_FUTURE:
 					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u equals or exceeds next valid transaction ID %u:%u",
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple equals or exceeds next valid transaction ID %u:%u",
 											   xvac,
 											   EpochFromFullTransactionId(ctx->next_fxid),
 											   XidFromFullTransactionId(ctx->next_fxid)));
 					return false;	/* corrupt */
 				case XID_PRECEDES_RELMIN:
 					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u precedes relation freeze threshold %u:%u",
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple precedes relation freeze threshold %u:%u",
 											   xvac,
 											   EpochFromFullTransactionId(ctx->relfrozenfxid),
 											   XidFromFullTransactionId(ctx->relfrozenfxid)));
 					return false;	/* corrupt */
-					break;
 				case XID_PRECEDES_CLUSTERMIN:
 					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u precedes oldest valid transaction ID %u:%u",
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple precedes oldest valid transaction ID %u:%u",
 											   xvac,
 											   EpochFromFullTransactionId(ctx->oldest_fxid),
 											   XidFromFullTransactionId(ctx->oldest_fxid)));
 					return false;	/* corrupt */
-					break;
 				case XID_BOUNDS_OK:
-					switch (status)
-					{
-						case XID_IN_PROGRESS:
-						case XID_IS_CURRENT_XID:
-							return true;	/* HEAPTUPLE_DELETE_IN_PROGRESS */
-						case XID_COMMITTED:
-						case XID_ABORTED:
-							return false;	/* HEAPTUPLE_DEAD */
-					}
+					break;
 			}
-		}
-		else
-		{
-			XidCommitStatus status;
 
-			switch (get_xid_status(raw_xmin, ctx, &status))
+			switch (xvac_status)
 			{
-				case XID_INVALID:
-					report_corruption(ctx,
-									  pstrdup("raw xmin is invalid"));
-					return false;
-				case XID_IN_FUTURE:
+				case XID_IS_CURRENT_XID:
 					report_corruption(ctx,
-									  psprintf("raw xmin %u equals or exceeds next valid transaction ID %u:%u",
-											   raw_xmin,
-											   EpochFromFullTransactionId(ctx->next_fxid),
-											   XidFromFullTransactionId(ctx->next_fxid)));
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple matches our current transaction ID",
+											   xvac));
 					return false;	/* corrupt */
-				case XID_PRECEDES_RELMIN:
+				case XID_IN_PROGRESS:
 					report_corruption(ctx,
-									  psprintf("raw xmin %u precedes relation freeze threshold %u:%u",
-											   raw_xmin,
-											   EpochFromFullTransactionId(ctx->relfrozenfxid),
-											   XidFromFullTransactionId(ctx->relfrozenfxid)));
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple appears to be in progress",
+											   xvac));
 					return false;	/* corrupt */
-				case XID_PRECEDES_CLUSTERMIN:
-					report_corruption(ctx,
-									  psprintf("raw xmin %u precedes oldest valid transaction ID %u:%u",
-											   raw_xmin,
-											   EpochFromFullTransactionId(ctx->oldest_fxid),
-											   XidFromFullTransactionId(ctx->oldest_fxid)));
-					return false;	/* corrupt */
-				case XID_BOUNDS_OK:
-					switch (status)
-					{
-						case XID_COMMITTED:
-							break;
-						case XID_IN_PROGRESS:
-						case XID_IS_CURRENT_XID:
-							return true;	/* insert or delete in progress */
-						case XID_ABORTED:
-							return false;	/* HEAPTUPLE_DEAD */
-					}
+
+				case XID_COMMITTED:
+
+					/*
+					 * The VACUUM FULL committed, so this tuple is dead and
+					 * could be vacuumed away at any time.  It's ok to check
+					 * the tuple because we have a buffer lock for the page,
+					 * but not safe to check the toast.  We don't bother
+					 * comparing against safe_xmin because the VACUUM FULL
+					 * must have committed prior to an upgrade and can't still
+					 * be running.
+					 */
+					return true;	/* checkable */
+
+				case XID_ABORTED:
+					break;
 			}
 		}
-	}
-
-	if (!(infomask & HEAP_XMAX_INVALID) && !HEAP_XMAX_IS_LOCKED_ONLY(infomask))
-	{
-		if (infomask & HEAP_XMAX_IS_MULTI)
+		/* Used by pre-9.0 binary upgrades */
+		else if (tuphdr->t_infomask & HEAP_MOVED_IN)
 		{
-			XidCommitStatus status;
-			TransactionId xmax = HeapTupleGetUpdateXid(tuphdr);
+			xvac = HeapTupleHeaderGetXvac(tuphdr);
 
-			switch (get_xid_status(xmax, ctx, &status))
+			switch (get_xid_status(xvac, ctx, &xvac_status))
 			{
-					/* not LOCKED_ONLY, so it has to have an xmax */
 				case XID_INVALID:
 					report_corruption(ctx,
-									  pstrdup("xmax is invalid"));
+									  pstrdup("old-style VACUUM FULL transaction ID for moved in tuple is invalid"));
 					return false;	/* corrupt */
 				case XID_IN_FUTURE:
 					report_corruption(ctx,
-									  psprintf("xmax %u equals or exceeds next valid transaction ID %u:%u",
-											   xmax,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple equals or exceeds next valid transaction ID %u:%u",
+											   xvac,
 											   EpochFromFullTransactionId(ctx->next_fxid),
 											   XidFromFullTransactionId(ctx->next_fxid)));
 					return false;	/* corrupt */
 				case XID_PRECEDES_RELMIN:
 					report_corruption(ctx,
-									  psprintf("xmax %u precedes relation freeze threshold %u:%u",
-											   xmax,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple precedes relation freeze threshold %u:%u",
+											   xvac,
 											   EpochFromFullTransactionId(ctx->relfrozenfxid),
 											   XidFromFullTransactionId(ctx->relfrozenfxid)));
 					return false;	/* corrupt */
 				case XID_PRECEDES_CLUSTERMIN:
 					report_corruption(ctx,
-									  psprintf("xmax %u precedes oldest valid transaction ID %u:%u",
-											   xmax,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple precedes oldest valid transaction ID %u:%u",
+											   xvac,
 											   EpochFromFullTransactionId(ctx->oldest_fxid),
 											   XidFromFullTransactionId(ctx->oldest_fxid)));
 					return false;	/* corrupt */
 				case XID_BOUNDS_OK:
-					switch (status)
-					{
-						case XID_IN_PROGRESS:
-						case XID_IS_CURRENT_XID:
-							return true;	/* HEAPTUPLE_DELETE_IN_PROGRESS */
-						case XID_COMMITTED:
-						case XID_ABORTED:
-							return false;	/* HEAPTUPLE_RECENTLY_DEAD or
-											 * HEAPTUPLE_DEAD */
-					}
+					break;
 			}
 
-			/* Ok, the tuple is live */
+			switch (xvac_status)
+			{
+				case XID_IS_CURRENT_XID:
+					report_corruption(ctx,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple matches our current transaction ID",
+											   xvac));
+					return false;	/* corrupt */
+				case XID_IN_PROGRESS:
+					report_corruption(ctx,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple appears to be in progress",
+											   xvac));
+					return false;	/* corrupt */
+
+				case XID_COMMITTED:
+					break;
+
+				case XID_ABORTED:
+
+					/*
+					 * The VACUUM FULL aborted, so this tuple is dead and
+					 * could be vacuumed away at any time.  It's ok to check
+					 * the tuple because we have a buffer lock for the page,
+					 * but not safe to check the toast.
+					 */
+					return true;	/* checkable */
+			}
+		}
+		else if (xmin_status == XID_IS_CURRENT_XID)
+		{
+			/*
+			 * Don't check tuples from currently running transactions, not
+			 * even our own.
+			 */
+			return false;		/* checkable, but don't check */
+		}
+		else if (xmin_status == XID_IN_PROGRESS)
+		{
+			/* Don't check tuples from currently running transactions */
+			return false;		/* uncheckable */
+		}
+		else if (xmin_status != XID_COMMITTED)
+		{
+			/*
+			 * Inserting transaction is not in progress, and not committed, so
+			 * it either aborted or crashed. We cannot check.
+			 */
+			return false;		/* uncheckable */
 		}
-		else if (!(infomask & HEAP_XMAX_COMMITTED))
-			return true;		/* HEAPTUPLE_DELETE_IN_PROGRESS or
-								 * HEAPTUPLE_LIVE */
-		else
-			return false;		/* HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD */
 	}
-	return true;				/* not dead */
+
+	/*
+	 * Okay, the inserter committed, so it was good at some point.  Now what
+	 * about the deleting transaction?
+	 */
+
+	if (tuphdr->t_infomask & HEAP_XMAX_IS_MULTI)
+	{
+		/*
+		 * xmax is a multixact, so it should be within valid MXID range.  We
+		 * cannot safely look up the update xid if the multixact is out of
+		 * bounds, and must stop checking this tuple.
+		 */
+		xmax = HeapTupleHeaderGetRawXmax(tuphdr);
+		switch (check_mxid_valid_in_rel(xmax, ctx))
+		{
+			case XID_INVALID:
+				report_corruption(ctx,
+								  pstrdup("multitransaction ID is invalid"));
+				return false;	/* corrupt */
+			case XID_PRECEDES_RELMIN:
+				report_corruption(ctx,
+								  psprintf("multitransaction ID %u precedes relation minimum multitransaction ID threshold %u",
+										   xmax, ctx->relminmxid));
+				return false;	/* corrupt */
+			case XID_PRECEDES_CLUSTERMIN:
+				report_corruption(ctx,
+								  psprintf("multitransaction ID %u precedes oldest valid multitransaction ID threshold %u",
+										   xmax, ctx->oldest_mxact));
+				return false;	/* corrupt */
+			case XID_IN_FUTURE:
+				report_corruption(ctx,
+								  psprintf("multitransaction ID %u equals or exceeds next valid multitransaction ID %u",
+										   xmax,
+										   ctx->next_mxact));
+				return false;	/* corrupt */
+			case XID_BOUNDS_OK:
+				break;
+		}
+	}
+
+	if (tuphdr->t_infomask & HEAP_XMAX_INVALID)
+	{
+		/*
+		 * This tuple is live.  A concurrently running transaction could
+		 * delete it before we get around to checking the toast, but any such
+		 * running transaction is surely not less than our safe_xmin, so the
+		 * toast cannot be vacuumed out from under us.
+		 */
+		ctx->tuple_is_volatile = false;
+		return true;			/* checkable */
+	}
+
+	if (HEAP_XMAX_IS_LOCKED_ONLY(tuphdr->t_infomask))
+	{
+		/*
+		 * "Deleting" xact really only locked it, so the tuple is live in any
+		 * case.  As above, a concurrently running transaction could delete
+		 * it, but it cannot be vacuumed out from under us.
+		 */
+		ctx->tuple_is_volatile = false;
+		return true;			/* checkable */
+	}
+
+	if (tuphdr->t_infomask & HEAP_XMAX_IS_MULTI)
+	{
+		/*
+		 * We already checked above that this multixact is within limits for
+		 * this table.  Now check the update xid from this multixact.
+		 */
+		xmax = HeapTupleGetUpdateXid(tuphdr);
+		switch (get_xid_status(xmax, ctx, &xmax_status))
+		{
+				/* not LOCKED_ONLY, so it has to have an xmax */
+			case XID_INVALID:
+				report_corruption(ctx,
+								  pstrdup("update xid is invalid"));
+				return false;	/* corrupt */
+			case XID_IN_FUTURE:
+				report_corruption(ctx,
+								  psprintf("update xid %u equals or exceeds next valid transaction ID %u:%u",
+										   xmax,
+										   EpochFromFullTransactionId(ctx->next_fxid),
+										   XidFromFullTransactionId(ctx->next_fxid)));
+				return false;	/* corrupt */
+			case XID_PRECEDES_RELMIN:
+				report_corruption(ctx,
+								  psprintf("update xid %u precedes relation freeze threshold %u:%u",
+										   xmax,
+										   EpochFromFullTransactionId(ctx->relfrozenfxid),
+										   XidFromFullTransactionId(ctx->relfrozenfxid)));
+				return false;	/* corrupt */
+			case XID_PRECEDES_CLUSTERMIN:
+				report_corruption(ctx,
+								  psprintf("update xid %u precedes oldest valid transaction ID %u:%u",
+										   xmax,
+										   EpochFromFullTransactionId(ctx->oldest_fxid),
+										   XidFromFullTransactionId(ctx->oldest_fxid)));
+				return false;	/* corrupt */
+			case XID_BOUNDS_OK:
+				break;
+		}
+
+		switch (xmax_status)
+		{
+			case XID_IS_CURRENT_XID:
+			case XID_IN_PROGRESS:
+
+				/*
+				 * The delete is in progress, so it cannot be visible to our
+				 * snapshot.
+				 */
+				ctx->tuple_is_volatile = false;
+				return true;	/* checkable */
+			case XID_COMMITTED:
+
+				/*
+				 * The delete committed.  Whether the toast can be vacuumed
+				 * away depends on how old the deleting transaction is.
+				 */
+				ctx->tuple_is_volatile = TransactionIdPrecedes(xmax,
+															   ctx->safe_xmin);
+				return true;	/* checkable */
+			case XID_ABORTED:
+
+				/*
+				 * The delete aborted or crashed.  The tuple is still live.
+				 */
+				ctx->tuple_is_volatile = false;
+				return true;	/* checkable */
+		}
+	}
+
+	/*
+	 * The tuple is deleted.  Whether the toast can be vacuumed away depends
+	 * on how old the deleting transaction is.
+	 */
+	xmax = HeapTupleHeaderGetRawXmax(tuphdr);
+
+	switch (get_xid_status(xmax, ctx, &xmax_status))
+	{
+		case XID_IN_FUTURE:
+			report_corruption(ctx,
+							  psprintf("xmax %u equals or exceeds next valid transaction ID %u:%u",
+									   xmax,
+									   EpochFromFullTransactionId(ctx->next_fxid),
+									   XidFromFullTransactionId(ctx->next_fxid)));
+			return false;		/* corrupt */
+		case XID_PRECEDES_RELMIN:
+			report_corruption(ctx,
+							  psprintf("xmax %u precedes relation freeze threshold %u:%u",
+									   xmax,
+									   EpochFromFullTransactionId(ctx->relfrozenfxid),
+									   XidFromFullTransactionId(ctx->relfrozenfxid)));
+			return false;		/* corrupt */
+		case XID_PRECEDES_CLUSTERMIN:
+			report_corruption(ctx,
+							  psprintf("xmax %u precedes oldest valid transaction ID %u:%u",
+									   xmax,
+									   EpochFromFullTransactionId(ctx->oldest_fxid),
+									   XidFromFullTransactionId(ctx->oldest_fxid)));
+			return false;		/* corrupt */
+		case XID_BOUNDS_OK:
+		case XID_INVALID:
+			break;
+	}
+
+	switch (xmax_status)
+	{
+		case XID_IS_CURRENT_XID:
+		case XID_IN_PROGRESS:
+
+			/*
+			 * The delete is in progress, so it cannot be visible to our
+			 * snapshot.
+			 */
+			ctx->tuple_is_volatile = false;
+			return true;		/* checkable */
+		case XID_COMMITTED:
+
+			/*
+			 * The delete committed.  Whether the toast can be vacuumed away
+			 * depends on how old the deleting transaction is.
+			 */
+			ctx->tuple_is_volatile = TransactionIdPrecedes(xmax,
+														   ctx->safe_xmin);
+			return true;		/* checkable */
+		case XID_ABORTED:
+
+			/*
+			 * The delete aborted or crashed.  The tuple is still live.
+			 */
+			ctx->tuple_is_volatile = false;
+			return true;		/* checkable */
+	}
+
+	return false;				/* not reached */
 }
 
+
 /*
  * Check the current toast tuple against the state tracked in ctx, recording
  * any corruption found in ctx->tupstore.
-- 
2.21.1 (Apple Git-122.3)

v13-0003-Renaming-report_corruption-as-report_main_corrup.patch (application/octet-stream)
From 293c1e8c1a398bdea10a1e9d769e08177a0b75dd Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Mon, 29 Mar 2021 14:37:20 -0700
Subject: [PATCH v13 3/4] Renaming report_corruption as report_main_corruption

In preparation for checking toast as a separate pass from checking
the main heap, rename report_corruption so that the name
report_toast_corruption can be added in the next commit and fit
nicely with this name.

This patch can probably be left out if the committer believes it
creates more git churn than it is worth.
---
 contrib/amcheck/verify_heapam.c | 486 ++++++++++++++++----------------
 1 file changed, 243 insertions(+), 243 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 59b13180d9..c5bde63ea7 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -142,7 +142,7 @@ static bool check_tuple_attribute(HeapCheckContext *ctx);
 static bool check_tuple_header(HeapCheckContext *ctx);
 static bool check_tuple_visibility(HeapCheckContext *ctx);
 
-static void report_corruption(HeapCheckContext *ctx, char *msg);
+static void report_main_corruption(HeapCheckContext *ctx, char *msg);
 static TupleDesc verify_heapam_tupdesc(void);
 static FullTransactionId FullTransactionIdFromXidAndCtx(TransactionId xid,
 														const HeapCheckContext *ctx);
@@ -407,25 +407,25 @@ verify_heapam(PG_FUNCTION_ARGS)
 
 				if (rdoffnum < FirstOffsetNumber)
 				{
-					report_corruption(&ctx,
-									  psprintf("line pointer redirection to item at offset %u precedes minimum offset %u",
-											   (unsigned) rdoffnum,
-											   (unsigned) FirstOffsetNumber));
+					report_main_corruption(&ctx,
+										   psprintf("line pointer redirection to item at offset %u precedes minimum offset %u",
+													(unsigned) rdoffnum,
+													(unsigned) FirstOffsetNumber));
 					continue;
 				}
 				if (rdoffnum > maxoff)
 				{
-					report_corruption(&ctx,
-									  psprintf("line pointer redirection to item at offset %u exceeds maximum offset %u",
-											   (unsigned) rdoffnum,
-											   (unsigned) maxoff));
+					report_main_corruption(&ctx,
+										   psprintf("line pointer redirection to item at offset %u exceeds maximum offset %u",
+													(unsigned) rdoffnum,
+													(unsigned) maxoff));
 					continue;
 				}
 				rditem = PageGetItemId(ctx.page, rdoffnum);
 				if (!ItemIdIsUsed(rditem))
-					report_corruption(&ctx,
-									  psprintf("line pointer redirection to unused item at offset %u",
-											   (unsigned) rdoffnum));
+					report_main_corruption(&ctx,
+										   psprintf("line pointer redirection to unused item at offset %u",
+													(unsigned) rdoffnum));
 				continue;
 			}
 
@@ -435,26 +435,26 @@ verify_heapam(PG_FUNCTION_ARGS)
 
 			if (ctx.lp_off != MAXALIGN(ctx.lp_off))
 			{
-				report_corruption(&ctx,
-								  psprintf("line pointer to page offset %u is not maximally aligned",
-										   ctx.lp_off));
+				report_main_corruption(&ctx,
+									   psprintf("line pointer to page offset %u is not maximally aligned",
+												ctx.lp_off));
 				continue;
 			}
 			if (ctx.lp_len < MAXALIGN(SizeofHeapTupleHeader))
 			{
-				report_corruption(&ctx,
-								  psprintf("line pointer length %u is less than the minimum tuple header size %u",
-										   ctx.lp_len,
-										   (unsigned) MAXALIGN(SizeofHeapTupleHeader)));
+				report_main_corruption(&ctx,
+									   psprintf("line pointer length %u is less than the minimum tuple header size %u",
+												ctx.lp_len,
+												(unsigned) MAXALIGN(SizeofHeapTupleHeader)));
 				continue;
 			}
 			if (ctx.lp_off + ctx.lp_len > BLCKSZ)
 			{
-				report_corruption(&ctx,
-								  psprintf("line pointer to page offset %u with length %u ends beyond maximum page offset %u",
-										   ctx.lp_off,
-										   ctx.lp_len,
-										   (unsigned) BLCKSZ));
+				report_main_corruption(&ctx,
+									   psprintf("line pointer to page offset %u with length %u ends beyond maximum page offset %u",
+												ctx.lp_off,
+												ctx.lp_len,
+												(unsigned) BLCKSZ));
 				continue;
 			}
 
@@ -517,7 +517,7 @@ sanity_check_relation(Relation rel)
  * The msg argument is pfree'd by this function.
  */
 static void
-report_corruption(HeapCheckContext *ctx, char *msg)
+report_main_corruption(HeapCheckContext *ctx, char *msg)
 {
 	Datum		values[HEAPCHECK_RELATION_COLS];
 	bool		nulls[HEAPCHECK_RELATION_COLS];
@@ -599,17 +599,17 @@ check_tuple_header(HeapCheckContext *ctx)
 
 	if (ctx->tuphdr->t_hoff > ctx->lp_len)
 	{
-		report_corruption(ctx,
-						  psprintf("data begins at offset %u beyond the tuple length %u",
-								   ctx->tuphdr->t_hoff, ctx->lp_len));
+		report_main_corruption(ctx,
+							   psprintf("data begins at offset %u beyond the tuple length %u",
+										ctx->tuphdr->t_hoff, ctx->lp_len));
 		header_garbled = true;
 	}
 
 	if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) &&
 		(ctx->tuphdr->t_infomask & HEAP_XMAX_IS_MULTI))
 	{
-		report_corruption(ctx,
-						  pstrdup("multixact should not be marked committed"));
+		report_main_corruption(ctx,
+							   pstrdup("multixact should not be marked committed"));
 
 		/*
 		 * This condition is clearly wrong, but we do not consider the header
@@ -626,21 +626,21 @@ check_tuple_header(HeapCheckContext *ctx)
 	if (ctx->tuphdr->t_hoff != expected_hoff)
 	{
 		if ((infomask & HEAP_HASNULL) && ctx->natts == 1)
-			report_corruption(ctx,
-							  psprintf("tuple data should begin at byte %u, but actually begins at byte %u (1 attribute, has nulls)",
-									   expected_hoff, ctx->tuphdr->t_hoff));
+			report_main_corruption(ctx,
+								   psprintf("tuple data should begin at byte %u, but actually begins at byte %u (1 attribute, has nulls)",
+											expected_hoff, ctx->tuphdr->t_hoff));
 		else if ((infomask & HEAP_HASNULL))
-			report_corruption(ctx,
-							  psprintf("tuple data should begin at byte %u, but actually begins at byte %u (%u attributes, has nulls)",
-									   expected_hoff, ctx->tuphdr->t_hoff, ctx->natts));
+			report_main_corruption(ctx,
+								   psprintf("tuple data should begin at byte %u, but actually begins at byte %u (%u attributes, has nulls)",
+											expected_hoff, ctx->tuphdr->t_hoff, ctx->natts));
 		else if (ctx->natts == 1)
-			report_corruption(ctx,
-							  psprintf("tuple data should begin at byte %u, but actually begins at byte %u (1 attribute, no nulls)",
-									   expected_hoff, ctx->tuphdr->t_hoff));
+			report_main_corruption(ctx,
+								   psprintf("tuple data should begin at byte %u, but actually begins at byte %u (1 attribute, no nulls)",
+											expected_hoff, ctx->tuphdr->t_hoff));
 		else
-			report_corruption(ctx,
-							  psprintf("tuple data should begin at byte %u, but actually begins at byte %u (%u attributes, no nulls)",
-									   expected_hoff, ctx->tuphdr->t_hoff, ctx->natts));
+			report_main_corruption(ctx,
+								   psprintf("tuple data should begin at byte %u, but actually begins at byte %u (%u attributes, no nulls)",
+											expected_hoff, ctx->tuphdr->t_hoff, ctx->natts));
 		header_garbled = true;
 	}
 
@@ -702,25 +702,25 @@ check_tuple_visibility(HeapCheckContext *ctx)
 		case XID_BOUNDS_OK:
 			break;
 		case XID_IN_FUTURE:
-			report_corruption(ctx,
-							  psprintf("xmin %u equals or exceeds next valid transaction ID %u:%u",
-									   xmin,
-									   EpochFromFullTransactionId(ctx->next_fxid),
-									   XidFromFullTransactionId(ctx->next_fxid)));
+			report_main_corruption(ctx,
+								   psprintf("xmin %u equals or exceeds next valid transaction ID %u:%u",
+											xmin,
+											EpochFromFullTransactionId(ctx->next_fxid),
+											XidFromFullTransactionId(ctx->next_fxid)));
 			return false;		/* corrupt */
 		case XID_PRECEDES_CLUSTERMIN:
-			report_corruption(ctx,
-							  psprintf("xmin %u precedes oldest valid transaction ID %u:%u",
-									   xmin,
-									   EpochFromFullTransactionId(ctx->oldest_fxid),
-									   XidFromFullTransactionId(ctx->oldest_fxid)));
+			report_main_corruption(ctx,
+								   psprintf("xmin %u precedes oldest valid transaction ID %u:%u",
+											xmin,
+											EpochFromFullTransactionId(ctx->oldest_fxid),
+											XidFromFullTransactionId(ctx->oldest_fxid)));
 			return false;		/* corrupt */
 		case XID_PRECEDES_RELMIN:
-			report_corruption(ctx,
-							  psprintf("xmin %u precedes relation freeze threshold %u:%u",
-									   xmin,
-									   EpochFromFullTransactionId(ctx->relfrozenfxid),
-									   XidFromFullTransactionId(ctx->relfrozenfxid)));
+			report_main_corruption(ctx,
+								   psprintf("xmin %u precedes relation freeze threshold %u:%u",
+											xmin,
+											EpochFromFullTransactionId(ctx->relfrozenfxid),
+											XidFromFullTransactionId(ctx->relfrozenfxid)));
 			return false;		/* corrupt */
 	}
 
@@ -744,29 +744,29 @@ check_tuple_visibility(HeapCheckContext *ctx)
 			switch (get_xid_status(xvac, ctx, &xvac_status))
 			{
 				case XID_INVALID:
-					report_corruption(ctx,
-									  pstrdup("old-style VACUUM FULL transaction ID for moved off tuple is invalid"));
+					report_main_corruption(ctx,
+										   pstrdup("old-style VACUUM FULL transaction ID for moved off tuple is invalid"));
 					return false;	/* corrupt */
 				case XID_IN_FUTURE:
-					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple equals or exceeds next valid transaction ID %u:%u",
-											   xvac,
-											   EpochFromFullTransactionId(ctx->next_fxid),
-											   XidFromFullTransactionId(ctx->next_fxid)));
+					report_main_corruption(ctx,
+										   psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple equals or exceeds next valid transaction ID %u:%u",
+													xvac,
+													EpochFromFullTransactionId(ctx->next_fxid),
+													XidFromFullTransactionId(ctx->next_fxid)));
 					return false;	/* corrupt */
 				case XID_PRECEDES_RELMIN:
-					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple precedes relation freeze threshold %u:%u",
-											   xvac,
-											   EpochFromFullTransactionId(ctx->relfrozenfxid),
-											   XidFromFullTransactionId(ctx->relfrozenfxid)));
+					report_main_corruption(ctx,
+										   psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple precedes relation freeze threshold %u:%u",
+													xvac,
+													EpochFromFullTransactionId(ctx->relfrozenfxid),
+													XidFromFullTransactionId(ctx->relfrozenfxid)));
 					return false;	/* corrupt */
 				case XID_PRECEDES_CLUSTERMIN:
-					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple precedes oldest valid transaction ID %u:%u",
-											   xvac,
-											   EpochFromFullTransactionId(ctx->oldest_fxid),
-											   XidFromFullTransactionId(ctx->oldest_fxid)));
+					report_main_corruption(ctx,
+										   psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple precedes oldest valid transaction ID %u:%u",
+													xvac,
+													EpochFromFullTransactionId(ctx->oldest_fxid),
+													XidFromFullTransactionId(ctx->oldest_fxid)));
 					return false;	/* corrupt */
 				case XID_BOUNDS_OK:
 					break;
@@ -775,14 +775,14 @@ check_tuple_visibility(HeapCheckContext *ctx)
 			switch (xvac_status)
 			{
 				case XID_IS_CURRENT_XID:
-					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple matches our current transaction ID",
-											   xvac));
+					report_main_corruption(ctx,
+										   psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple matches our current transaction ID",
+													xvac));
 					return false;	/* corrupt */
 				case XID_IN_PROGRESS:
-					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple appears to be in progress",
-											   xvac));
+					report_main_corruption(ctx,
+										   psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple appears to be in progress",
+													xvac));
 					return false;	/* corrupt */
 
 				case XID_COMMITTED:
@@ -810,29 +810,29 @@ check_tuple_visibility(HeapCheckContext *ctx)
 			switch (get_xid_status(xvac, ctx, &xvac_status))
 			{
 				case XID_INVALID:
-					report_corruption(ctx,
-									  pstrdup("old-style VACUUM FULL transaction ID for moved in tuple is invalid"));
+					report_main_corruption(ctx,
+										   pstrdup("old-style VACUUM FULL transaction ID for moved in tuple is invalid"));
 					return false;	/* corrupt */
 				case XID_IN_FUTURE:
-					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple equals or exceeds next valid transaction ID %u:%u",
-											   xvac,
-											   EpochFromFullTransactionId(ctx->next_fxid),
-											   XidFromFullTransactionId(ctx->next_fxid)));
+					report_main_corruption(ctx,
+										   psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple equals or exceeds next valid transaction ID %u:%u",
+													xvac,
+													EpochFromFullTransactionId(ctx->next_fxid),
+													XidFromFullTransactionId(ctx->next_fxid)));
 					return false;	/* corrupt */
 				case XID_PRECEDES_RELMIN:
-					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple precedes relation freeze threshold %u:%u",
-											   xvac,
-											   EpochFromFullTransactionId(ctx->relfrozenfxid),
-											   XidFromFullTransactionId(ctx->relfrozenfxid)));
+					report_main_corruption(ctx,
+										   psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple precedes relation freeze threshold %u:%u",
+													xvac,
+													EpochFromFullTransactionId(ctx->relfrozenfxid),
+													XidFromFullTransactionId(ctx->relfrozenfxid)));
 					return false;	/* corrupt */
 				case XID_PRECEDES_CLUSTERMIN:
-					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple precedes oldest valid transaction ID %u:%u",
-											   xvac,
-											   EpochFromFullTransactionId(ctx->oldest_fxid),
-											   XidFromFullTransactionId(ctx->oldest_fxid)));
+					report_main_corruption(ctx,
+										   psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple precedes oldest valid transaction ID %u:%u",
+													xvac,
+													EpochFromFullTransactionId(ctx->oldest_fxid),
+													XidFromFullTransactionId(ctx->oldest_fxid)));
 					return false;	/* corrupt */
 				case XID_BOUNDS_OK:
 					break;
@@ -841,14 +841,14 @@ check_tuple_visibility(HeapCheckContext *ctx)
 			switch (xvac_status)
 			{
 				case XID_IS_CURRENT_XID:
-					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple matches our current transaction ID",
-											   xvac));
+					report_main_corruption(ctx,
+										   psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple matches our current transaction ID",
+													xvac));
 					return false;	/* corrupt */
 				case XID_IN_PROGRESS:
-					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple appears to be in progress",
-											   xvac));
+					report_main_corruption(ctx,
+										   psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple appears to be in progress",
+													xvac));
 					return false;	/* corrupt */
 
 				case XID_COMMITTED:
@@ -904,24 +904,24 @@ check_tuple_visibility(HeapCheckContext *ctx)
 		switch (check_mxid_valid_in_rel(xmax, ctx))
 		{
 			case XID_INVALID:
-				report_corruption(ctx,
-								  pstrdup("multitransaction ID is invalid"));
+				report_main_corruption(ctx,
+									   pstrdup("multitransaction ID is invalid"));
 				return false;	/* corrupt */
 			case XID_PRECEDES_RELMIN:
-				report_corruption(ctx,
-								  psprintf("multitransaction ID %u precedes relation minimum multitransaction ID threshold %u",
-										   xmax, ctx->relminmxid));
+				report_main_corruption(ctx,
+									   psprintf("multitransaction ID %u precedes relation minimum multitransaction ID threshold %u",
+												xmax, ctx->relminmxid));
 				return false;	/* corrupt */
 			case XID_PRECEDES_CLUSTERMIN:
-				report_corruption(ctx,
-								  psprintf("multitransaction ID %u precedes oldest valid multitransaction ID threshold %u",
-										   xmax, ctx->oldest_mxact));
+				report_main_corruption(ctx,
+									   psprintf("multitransaction ID %u precedes oldest valid multitransaction ID threshold %u",
+												xmax, ctx->oldest_mxact));
 				return false;	/* corrupt */
 			case XID_IN_FUTURE:
-				report_corruption(ctx,
-								  psprintf("multitransaction ID %u equals or exceeds next valid multitransaction ID %u",
-										   xmax,
-										   ctx->next_mxact));
+				report_main_corruption(ctx,
+									   psprintf("multitransaction ID %u equals or exceeds next valid multitransaction ID %u",
+												xmax,
+												ctx->next_mxact));
 				return false;	/* corrupt */
 			case XID_BOUNDS_OK:
 				break;
@@ -962,29 +962,29 @@ check_tuple_visibility(HeapCheckContext *ctx)
 		{
 				/* not LOCKED_ONLY, so it has to have an xmax */
 			case XID_INVALID:
-				report_corruption(ctx,
-								  pstrdup("update xid is invalid"));
+				report_main_corruption(ctx,
+									   pstrdup("update xid is invalid"));
 				return false;	/* corrupt */
 			case XID_IN_FUTURE:
-				report_corruption(ctx,
-								  psprintf("update xid %u equals or exceeds next valid transaction ID %u:%u",
-										   xmax,
-										   EpochFromFullTransactionId(ctx->next_fxid),
-										   XidFromFullTransactionId(ctx->next_fxid)));
+				report_main_corruption(ctx,
+									   psprintf("update xid %u equals or exceeds next valid transaction ID %u:%u",
+												xmax,
+												EpochFromFullTransactionId(ctx->next_fxid),
+												XidFromFullTransactionId(ctx->next_fxid)));
 				return false;	/* corrupt */
 			case XID_PRECEDES_RELMIN:
-				report_corruption(ctx,
-								  psprintf("update xid %u precedes relation freeze threshold %u:%u",
-										   xmax,
-										   EpochFromFullTransactionId(ctx->relfrozenfxid),
-										   XidFromFullTransactionId(ctx->relfrozenfxid)));
+				report_main_corruption(ctx,
+									   psprintf("update xid %u precedes relation freeze threshold %u:%u",
+												xmax,
+												EpochFromFullTransactionId(ctx->relfrozenfxid),
+												XidFromFullTransactionId(ctx->relfrozenfxid)));
 				return false;	/* corrupt */
 			case XID_PRECEDES_CLUSTERMIN:
-				report_corruption(ctx,
-								  psprintf("update xid %u precedes oldest valid transaction ID %u:%u",
-										   xmax,
-										   EpochFromFullTransactionId(ctx->oldest_fxid),
-										   XidFromFullTransactionId(ctx->oldest_fxid)));
+				report_main_corruption(ctx,
+									   psprintf("update xid %u precedes oldest valid transaction ID %u:%u",
+												xmax,
+												EpochFromFullTransactionId(ctx->oldest_fxid),
+												XidFromFullTransactionId(ctx->oldest_fxid)));
 				return false;	/* corrupt */
 			case XID_BOUNDS_OK:
 				break;
@@ -1029,25 +1029,25 @@ check_tuple_visibility(HeapCheckContext *ctx)
 	switch (get_xid_status(xmax, ctx, &xmax_status))
 	{
 		case XID_IN_FUTURE:
-			report_corruption(ctx,
-							  psprintf("xmax %u equals or exceeds next valid transaction ID %u:%u",
-									   xmax,
-									   EpochFromFullTransactionId(ctx->next_fxid),
-									   XidFromFullTransactionId(ctx->next_fxid)));
+			report_main_corruption(ctx,
+								   psprintf("xmax %u equals or exceeds next valid transaction ID %u:%u",
+											xmax,
+											EpochFromFullTransactionId(ctx->next_fxid),
+											XidFromFullTransactionId(ctx->next_fxid)));
 			return false;		/* corrupt */
 		case XID_PRECEDES_RELMIN:
-			report_corruption(ctx,
-							  psprintf("xmax %u precedes relation freeze threshold %u:%u",
-									   xmax,
-									   EpochFromFullTransactionId(ctx->relfrozenfxid),
-									   XidFromFullTransactionId(ctx->relfrozenfxid)));
+			report_main_corruption(ctx,
+								   psprintf("xmax %u precedes relation freeze threshold %u:%u",
+											xmax,
+											EpochFromFullTransactionId(ctx->relfrozenfxid),
+											XidFromFullTransactionId(ctx->relfrozenfxid)));
 			return false;		/* corrupt */
 		case XID_PRECEDES_CLUSTERMIN:
-			report_corruption(ctx,
-							  psprintf("xmax %u precedes oldest valid transaction ID %u:%u",
-									   xmax,
-									   EpochFromFullTransactionId(ctx->oldest_fxid),
-									   XidFromFullTransactionId(ctx->oldest_fxid)));
+			report_main_corruption(ctx,
+								   psprintf("xmax %u precedes oldest valid transaction ID %u:%u",
+											xmax,
+											EpochFromFullTransactionId(ctx->oldest_fxid),
+											XidFromFullTransactionId(ctx->oldest_fxid)));
 			return false;		/* corrupt */
 		case XID_BOUNDS_OK:
 		case XID_INVALID:
@@ -1114,16 +1114,16 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 										 ctx->toast_rel->rd_att, &isnull));
 	if (isnull)
 	{
-		report_corruption(ctx,
-						  pstrdup("toast chunk sequence number is null"));
+		report_main_corruption(ctx,
+							   pstrdup("toast chunk sequence number is null"));
 		return;
 	}
 	chunk = DatumGetPointer(fastgetattr(toasttup, 3,
 										ctx->toast_rel->rd_att, &isnull));
 	if (isnull)
 	{
-		report_corruption(ctx,
-						  pstrdup("toast chunk data is null"));
+		report_main_corruption(ctx,
+							   pstrdup("toast chunk data is null"));
 		return;
 	}
 	if (!VARATT_IS_EXTENDED(chunk))
@@ -1140,9 +1140,9 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 		/* should never happen */
 		uint32		header = ((varattrib_4b *) chunk)->va_4byte.va_header;
 
-		report_corruption(ctx,
-						  psprintf("corrupt extended toast chunk has invalid varlena header: %0x (sequence number %d)",
-								   header, curchunk));
+		report_main_corruption(ctx,
+							   psprintf("corrupt extended toast chunk has invalid varlena header: %0x (sequence number %d)",
+										header, curchunk));
 		return;
 	}
 
@@ -1151,16 +1151,16 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 	 */
 	if (curchunk != ctx->chunkno)
 	{
-		report_corruption(ctx,
-						  psprintf("toast chunk sequence number %u does not match the expected sequence number %u",
-								   curchunk, ctx->chunkno));
+		report_main_corruption(ctx,
+							   psprintf("toast chunk sequence number %u does not match the expected sequence number %u",
+										curchunk, ctx->chunkno));
 		return;
 	}
 	if (curchunk > ctx->endchunk)
 	{
-		report_corruption(ctx,
-						  psprintf("toast chunk sequence number %u exceeds the end chunk sequence number %u",
-								   curchunk, ctx->endchunk));
+		report_main_corruption(ctx,
+							   psprintf("toast chunk sequence number %u exceeds the end chunk sequence number %u",
+										curchunk, ctx->endchunk));
 		return;
 	}
 
@@ -1168,9 +1168,9 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 		: ctx->attrsize - ((ctx->totalchunks - 1) * TOAST_MAX_CHUNK_SIZE);
 	if (chunksize != expected_size)
 	{
-		report_corruption(ctx,
-						  psprintf("toast chunk size %u differs from the expected size %u",
-								   chunksize, expected_size));
+		report_main_corruption(ctx,
+							   psprintf("toast chunk size %u differs from the expected size %u",
+										chunksize, expected_size));
 		return;
 	}
 }
@@ -1217,12 +1217,12 @@ check_tuple_attribute(HeapCheckContext *ctx)
 
 	if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
 	{
-		report_corruption(ctx,
-						  psprintf("attribute %u with length %u starts at offset %u beyond total tuple length %u",
-								   ctx->attnum,
-								   thisatt->attlen,
-								   ctx->tuphdr->t_hoff + ctx->offset,
-								   ctx->lp_len));
+		report_main_corruption(ctx,
+							   psprintf("attribute %u with length %u starts at offset %u beyond total tuple length %u",
+										ctx->attnum,
+										thisatt->attlen,
+										ctx->tuphdr->t_hoff + ctx->offset,
+										ctx->lp_len));
 		return false;
 	}
 
@@ -1238,12 +1238,12 @@ check_tuple_attribute(HeapCheckContext *ctx)
 											tp + ctx->offset);
 		if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
 		{
-			report_corruption(ctx,
-							  psprintf("attribute %u with length %u ends at offset %u beyond total tuple length %u",
-									   ctx->attnum,
-									   thisatt->attlen,
-									   ctx->tuphdr->t_hoff + ctx->offset,
-									   ctx->lp_len));
+			report_main_corruption(ctx,
+								   psprintf("attribute %u with length %u ends at offset %u beyond total tuple length %u",
+											ctx->attnum,
+											thisatt->attlen,
+											ctx->tuphdr->t_hoff + ctx->offset,
+											ctx->lp_len));
 			return false;
 		}
 		return true;
@@ -1271,10 +1271,10 @@ check_tuple_attribute(HeapCheckContext *ctx)
 
 		if (va_tag != VARTAG_ONDISK)
 		{
-			report_corruption(ctx,
-							  psprintf("toasted attribute %u has unexpected TOAST tag %u",
-									   ctx->attnum,
-									   va_tag));
+			report_main_corruption(ctx,
+								   psprintf("toasted attribute %u has unexpected TOAST tag %u",
+											ctx->attnum,
+											va_tag));
 			/* We can't know where the next attribute begins */
 			return false;
 		}
@@ -1286,12 +1286,12 @@ check_tuple_attribute(HeapCheckContext *ctx)
 
 	if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
 	{
-		report_corruption(ctx,
-						  psprintf("attribute %u with length %u ends at offset %u beyond total tuple length %u",
-								   ctx->attnum,
-								   thisatt->attlen,
-								   ctx->tuphdr->t_hoff + ctx->offset,
-								   ctx->lp_len));
+		report_main_corruption(ctx,
+							   psprintf("attribute %u with length %u ends at offset %u beyond total tuple length %u",
+										ctx->attnum,
+										thisatt->attlen,
+										ctx->tuphdr->t_hoff + ctx->offset,
+										ctx->lp_len));
 
 		return false;
 	}
@@ -1318,18 +1318,18 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	/* The tuple header better claim to contain toasted values */
 	if (!(infomask & HEAP_HASEXTERNAL))
 	{
-		report_corruption(ctx,
-						  psprintf("attribute %u is external but tuple header flag HEAP_HASEXTERNAL not set",
-								   ctx->attnum));
+		report_main_corruption(ctx,
+							   psprintf("attribute %u is external but tuple header flag HEAP_HASEXTERNAL not set",
+										ctx->attnum));
 		return true;
 	}
 
 	/* The relation better have a toast table */
 	if (!ctx->rel->rd_rel->reltoastrelid)
 	{
-		report_corruption(ctx,
-						  psprintf("attribute %u is external but relation has no toast relation",
-								   ctx->attnum));
+		report_main_corruption(ctx,
+							   psprintf("attribute %u is external but relation has no toast relation",
+										ctx->attnum));
 		return true;
 	}
 
@@ -1374,13 +1374,13 @@ check_tuple_attribute(HeapCheckContext *ctx)
 		ctx->chunkno++;
 	}
 	if (!found_toasttup)
-		report_corruption(ctx,
-						  psprintf("toasted value for attribute %u missing from toast table",
-								   ctx->attnum));
+		report_main_corruption(ctx,
+							   psprintf("toasted value for attribute %u missing from toast table",
+										ctx->attnum));
 	else if (ctx->chunkno != (ctx->endchunk + 1))
-		report_corruption(ctx,
-						  psprintf("final toast chunk number %u differs from expected value %u",
-								   ctx->chunkno, (ctx->endchunk + 1)));
+		report_main_corruption(ctx,
+							   psprintf("final toast chunk number %u differs from expected value %u",
+										ctx->chunkno, (ctx->endchunk + 1)));
 	systable_endscan_ordered(toastscan);
 
 	return true;
@@ -1406,27 +1406,27 @@ check_tuple(HeapCheckContext *ctx)
 		case XID_BOUNDS_OK:
 			break;
 		case XID_IN_FUTURE:
-			report_corruption(ctx,
-							  psprintf("xmin %u equals or exceeds next valid transaction ID %u:%u",
-									   xmin,
-									   EpochFromFullTransactionId(ctx->next_fxid),
-									   XidFromFullTransactionId(ctx->next_fxid)));
+			report_main_corruption(ctx,
+								   psprintf("xmin %u equals or exceeds next valid transaction ID %u:%u",
+											xmin,
+											EpochFromFullTransactionId(ctx->next_fxid),
+											XidFromFullTransactionId(ctx->next_fxid)));
 			fatal = true;
 			break;
 		case XID_PRECEDES_CLUSTERMIN:
-			report_corruption(ctx,
-							  psprintf("xmin %u precedes oldest valid transaction ID %u:%u",
-									   xmin,
-									   EpochFromFullTransactionId(ctx->oldest_fxid),
-									   XidFromFullTransactionId(ctx->oldest_fxid)));
+			report_main_corruption(ctx,
+								   psprintf("xmin %u precedes oldest valid transaction ID %u:%u",
+											xmin,
+											EpochFromFullTransactionId(ctx->oldest_fxid),
+											XidFromFullTransactionId(ctx->oldest_fxid)));
 			fatal = true;
 			break;
 		case XID_PRECEDES_RELMIN:
-			report_corruption(ctx,
-							  psprintf("xmin %u precedes relation freeze threshold %u:%u",
-									   xmin,
-									   EpochFromFullTransactionId(ctx->relfrozenfxid),
-									   XidFromFullTransactionId(ctx->relfrozenfxid)));
+			report_main_corruption(ctx,
+								   psprintf("xmin %u precedes relation freeze threshold %u:%u",
+											xmin,
+											EpochFromFullTransactionId(ctx->relfrozenfxid),
+											XidFromFullTransactionId(ctx->relfrozenfxid)));
 			fatal = true;
 			break;
 	}
@@ -1439,27 +1439,27 @@ check_tuple(HeapCheckContext *ctx)
 		switch (check_mxid_valid_in_rel(xmax, ctx))
 		{
 			case XID_INVALID:
-				report_corruption(ctx,
-								  pstrdup("multitransaction ID is invalid"));
+				report_main_corruption(ctx,
+									   pstrdup("multitransaction ID is invalid"));
 				fatal = true;
 				break;
 			case XID_PRECEDES_RELMIN:
-				report_corruption(ctx,
-								  psprintf("multitransaction ID %u precedes relation minimum multitransaction ID threshold %u",
-										   xmax, ctx->relminmxid));
+				report_main_corruption(ctx,
+									   psprintf("multitransaction ID %u precedes relation minimum multitransaction ID threshold %u",
+												xmax, ctx->relminmxid));
 				fatal = true;
 				break;
 			case XID_PRECEDES_CLUSTERMIN:
-				report_corruption(ctx,
-								  psprintf("multitransaction ID %u precedes oldest valid multitransaction ID threshold %u",
-										   xmax, ctx->oldest_mxact));
+				report_main_corruption(ctx,
+									   psprintf("multitransaction ID %u precedes oldest valid multitransaction ID threshold %u",
+												xmax, ctx->oldest_mxact));
 				fatal = true;
 				break;
 			case XID_IN_FUTURE:
-				report_corruption(ctx,
-								  psprintf("multitransaction ID %u equals or exceeds next valid multitransaction ID %u",
-										   xmax,
-										   ctx->next_mxact));
+				report_main_corruption(ctx,
+									   psprintf("multitransaction ID %u equals or exceeds next valid multitransaction ID %u",
+												xmax,
+												ctx->next_mxact));
 				fatal = true;
 				break;
 			case XID_BOUNDS_OK:
@@ -1478,27 +1478,27 @@ check_tuple(HeapCheckContext *ctx)
 			case XID_BOUNDS_OK:
 				break;
 			case XID_IN_FUTURE:
-				report_corruption(ctx,
-								  psprintf("xmax %u equals or exceeds next valid transaction ID %u:%u",
-										   xmax,
-										   EpochFromFullTransactionId(ctx->next_fxid),
-										   XidFromFullTransactionId(ctx->next_fxid)));
+				report_main_corruption(ctx,
+									   psprintf("xmax %u equals or exceeds next valid transaction ID %u:%u",
+												xmax,
+												EpochFromFullTransactionId(ctx->next_fxid),
+												XidFromFullTransactionId(ctx->next_fxid)));
 				fatal = true;
 				break;
 			case XID_PRECEDES_CLUSTERMIN:
-				report_corruption(ctx,
-								  psprintf("xmax %u precedes oldest valid transaction ID %u:%u",
-										   xmax,
-										   EpochFromFullTransactionId(ctx->oldest_fxid),
-										   XidFromFullTransactionId(ctx->oldest_fxid)));
+				report_main_corruption(ctx,
+									   psprintf("xmax %u precedes oldest valid transaction ID %u:%u",
+												xmax,
+												EpochFromFullTransactionId(ctx->oldest_fxid),
+												XidFromFullTransactionId(ctx->oldest_fxid)));
 				fatal = true;
 				break;
 			case XID_PRECEDES_RELMIN:
-				report_corruption(ctx,
-								  psprintf("xmax %u precedes relation freeze threshold %u:%u",
-										   xmax,
-										   EpochFromFullTransactionId(ctx->relfrozenfxid),
-										   XidFromFullTransactionId(ctx->relfrozenfxid)));
+				report_main_corruption(ctx,
+									   psprintf("xmax %u precedes relation freeze threshold %u:%u",
+												xmax,
+												EpochFromFullTransactionId(ctx->relfrozenfxid),
+												XidFromFullTransactionId(ctx->relfrozenfxid)));
 				fatal = true;
 		}
 	}
@@ -1533,10 +1533,10 @@ check_tuple(HeapCheckContext *ctx)
 	 */
 	if (RelationGetDescr(ctx->rel)->natts < ctx->natts)
 	{
-		report_corruption(ctx,
-						  psprintf("number of attributes %u exceeds maximum expected for table %u",
-								   ctx->natts,
-								   RelationGetDescr(ctx->rel)->natts));
+		report_main_corruption(ctx,
+							   psprintf("number of attributes %u exceeds maximum expected for table %u",
+										ctx->natts,
+										RelationGetDescr(ctx->rel)->natts));
 		return;
 	}
 
-- 
2.21.1 (Apple Git-122.3)

v13-0004-Checking-toast-separately-from-the-main-table.patch (application/octet-stream)
From f6399dcd6e37cd29d3df03eeca5c557843d97db5 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Mon, 29 Mar 2021 14:40:45 -0700
Subject: [PATCH v13 4/4] Checking toast separately from the main table.

Rather than checking toasted attributes as we find them, create a
list of them and check all the toast in the list after releasing the
buffer lock on each main table page.
---
 contrib/amcheck/verify_heapam.c  | 415 +++++++++++++++----------------
 src/tools/pgindent/typedefs.list |   1 +
 2 files changed, 201 insertions(+), 215 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index c5bde63ea7..e3163e4bfa 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -58,6 +58,26 @@ typedef enum SkipPages
 	SKIP_PAGES_NONE
 } SkipPages;
 
+/*
+ * Struct holding information necessary to check a toasted attribute, including
+ * the toast pointer, state about the current toast chunk being checked, and
+ * the location in the main table of the toasted attribute.  We have to track
+ * the tuple's location in the main table for reporting purposes because by the
+ * time the toast is checked our HeapCheckContext will no longer be pointing to
+ * the relevant tuple.
+ */
+typedef struct ToastCheckContext
+{
+	struct varatt_external toast_pointer;
+	BlockNumber blkno;			/* block in main table */
+	OffsetNumber offnum;		/* offset in main table */
+	AttrNumber	attnum;			/* attribute in main table */
+	int32		chunkno;		/* chunk number in toast table */
+	int32		attrsize;		/* size of toasted attribute */
+	int32		endchunk;		/* last chunk number in toast table */
+	int32		totalchunks;	/* total chunks in toast table */
+} ToastCheckContext;
+
 /*
  * Struct holding the running context information during
  * a lifetime of a verify_heapam execution.
@@ -119,11 +139,11 @@ typedef struct HeapCheckContext
 	/* True if toast for this tuple could be vacuumed away */
 	bool		tuple_is_volatile;
 
-	/* Values for iterating over toast for the attribute */
-	int32		chunkno;
-	int32		attrsize;
-	int32		endchunk;
-	int32		totalchunks;
+	/*
+	 * List of ToastCheckContext structs for toasted attributes which are not
+	 * in danger of being vacuumed way and should be checked
+	 */
+	List	   *toasted_attributes;
 
 	/* Whether verify_heapam has yet encountered any corrupt tuples */
 	bool		is_corrupt;
@@ -136,13 +156,18 @@ typedef struct HeapCheckContext
 /* Internal implementation */
 static void sanity_check_relation(Relation rel);
 static void check_tuple(HeapCheckContext *ctx);
-static void check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx);
+static void check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
+							  ToastCheckContext *tctx);
 
-static bool check_tuple_attribute(HeapCheckContext *ctx);
 static bool check_tuple_header(HeapCheckContext *ctx);
 static bool check_tuple_visibility(HeapCheckContext *ctx);
 
+static bool check_tuple_attribute(HeapCheckContext *ctx);
+static void check_toasted_attributes(HeapCheckContext *ctx);
+
 static void report_main_corruption(HeapCheckContext *ctx, char *msg);
+static void report_toast_corruption(HeapCheckContext *ctx,
+									ToastCheckContext *tctx, char *msg);
 static TupleDesc verify_heapam_tupdesc(void);
 static FullTransactionId FullTransactionIdFromXidAndCtx(TransactionId xid,
 														const HeapCheckContext *ctx);
@@ -253,6 +278,7 @@ verify_heapam(PG_FUNCTION_ARGS)
 
 	memset(&ctx, 0, sizeof(HeapCheckContext));
 	ctx.cached_xid = InvalidTransactionId;
+	ctx.toasted_attributes = NIL;
 
 	/*
 	 * Any xmin newer than the xmin of our snapshot can't become all-visible
@@ -469,6 +495,14 @@ verify_heapam(PG_FUNCTION_ARGS)
 		/* clean up */
 		UnlockReleaseBuffer(ctx.buffer);
 
+		/*
+		 * Check any toast pointers from the page whose lock we just released
+		 * and reset the list to NIL.
+		 */
+		if (ctx.toasted_attributes != NIL)
+			check_toasted_attributes(&ctx);
+		Assert(ctx.toasted_attributes == NIL);
+
 		if (on_error_stop && ctx.is_corrupt)
 			break;
 	}
@@ -510,14 +544,13 @@ sanity_check_relation(Relation rel)
 }
 
 /*
- * Record a single corruption found in the table.  The values in ctx should
- * reflect the location of the corruption, and the msg argument should contain
- * a human-readable description of the corruption.
- *
- * The msg argument is pfree'd by this function.
+ * Shared internal implementation for report_main_corruption and
+ * report_toast_corruption.
  */
 static void
-report_main_corruption(HeapCheckContext *ctx, char *msg)
+report_corruption(Tuplestorestate *tupstore, TupleDesc tupdesc,
+				  BlockNumber blkno, OffsetNumber offnum, AttrNumber attnum,
+				  char *msg)
 {
 	Datum		values[HEAPCHECK_RELATION_COLS];
 	bool		nulls[HEAPCHECK_RELATION_COLS];
@@ -525,10 +558,10 @@ report_main_corruption(HeapCheckContext *ctx, char *msg)
 
 	MemSet(values, 0, sizeof(values));
 	MemSet(nulls, 0, sizeof(nulls));
-	values[0] = Int64GetDatum(ctx->blkno);
-	values[1] = Int32GetDatum(ctx->offnum);
-	values[2] = Int32GetDatum(ctx->attnum);
-	nulls[2] = (ctx->attnum < 0);
+	values[0] = Int64GetDatum(blkno);
+	values[1] = Int32GetDatum(offnum);
+	values[2] = Int32GetDatum(attnum);
+	nulls[2] = (attnum < 0);
 	values[3] = CStringGetTextDatum(msg);
 
 	/*
@@ -541,8 +574,39 @@ report_main_corruption(HeapCheckContext *ctx, char *msg)
 	 */
 	pfree(msg);
 
-	tuple = heap_form_tuple(ctx->tupdesc, values, nulls);
-	tuplestore_puttuple(ctx->tupstore, tuple);
+	tuple = heap_form_tuple(tupdesc, values, nulls);
+	tuplestore_puttuple(tupstore, tuple);
+}
+
+/*
+ * Record a single corruption found in the main table.  The values in ctx should
+ * indicate the location of the corruption, and the msg argument should contain
+ * a human-readable description of the corruption.
+ *
+ * The msg argument is pfree'd by this function.
+ */
+static void
+report_main_corruption(HeapCheckContext *ctx, char *msg)
+{
+	report_corruption(ctx->tupstore, ctx->tupdesc, ctx->blkno, ctx->offnum,
+					  ctx->attnum, msg);
+	ctx->is_corrupt = true;
+}
+
+/*
+ * Record corruption found in the toast table.  The values in tctx should
+ * indicate the location in the main table where the toast pointer was
+ * encountered, and the msg argument should contain a human-readable
+ * description of the toast table corruption.
+ *
+ * As above, the msg argument is pfree'd by this function.
+ */
+static void
+report_toast_corruption(HeapCheckContext *ctx, ToastCheckContext *tctx,
+						char *msg)
+{
+	report_corruption(ctx->tupstore, ctx->tupdesc, tctx->blkno, tctx->offnum,
+					  tctx->attnum, msg);
 	ctx->is_corrupt = true;
 }
 
@@ -1086,7 +1150,6 @@ check_tuple_visibility(HeapCheckContext *ctx)
 	return false;				/* not reached */
 }
 
-
 /*
  * Check the current toast tuple against the state tracked in ctx, recording
  * any corruption found in ctx->tupstore.
@@ -1099,7 +1162,8 @@ check_tuple_visibility(HeapCheckContext *ctx)
  * as each toast tuple having its varlena structure sanity checked.
  */
 static void
-check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
+check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
+				  ToastCheckContext *tctx)
 {
 	int32		curchunk;
 	Pointer		chunk;
@@ -1114,16 +1178,16 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 										 ctx->toast_rel->rd_att, &isnull));
 	if (isnull)
 	{
-		report_main_corruption(ctx,
-							   pstrdup("toast chunk sequence number is null"));
+		report_toast_corruption(ctx, tctx,
+								pstrdup("toast chunk sequence number is null"));
 		return;
 	}
 	chunk = DatumGetPointer(fastgetattr(toasttup, 3,
 										ctx->toast_rel->rd_att, &isnull));
 	if (isnull)
 	{
-		report_main_corruption(ctx,
-							   pstrdup("toast chunk data is null"));
+		report_toast_corruption(ctx, tctx,
+								pstrdup("toast chunk data is null"));
 		return;
 	}
 	if (!VARATT_IS_EXTENDED(chunk))
@@ -1140,37 +1204,37 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 		/* should never happen */
 		uint32		header = ((varattrib_4b *) chunk)->va_4byte.va_header;
 
-		report_main_corruption(ctx,
-							   psprintf("corrupt extended toast chunk has invalid varlena header: %0x (sequence number %d)",
-										header, curchunk));
+		report_toast_corruption(ctx, tctx,
+								psprintf("corrupt extended toast chunk has invalid varlena header: %0x (sequence number %d)",
+										 header, curchunk));
 		return;
 	}
 
 	/*
 	 * Some checks on the data we've found
 	 */
-	if (curchunk != ctx->chunkno)
+	if (curchunk != tctx->chunkno)
 	{
-		report_main_corruption(ctx,
-							   psprintf("toast chunk sequence number %u does not match the expected sequence number %u",
-										curchunk, ctx->chunkno));
+		report_toast_corruption(ctx, tctx,
+								psprintf("toast chunk sequence number %u does not match the expected sequence number %u",
+										 curchunk, tctx->chunkno));
 		return;
 	}
-	if (curchunk > ctx->endchunk)
+	if (curchunk > tctx->endchunk)
 	{
-		report_main_corruption(ctx,
-							   psprintf("toast chunk sequence number %u exceeds the end chunk sequence number %u",
-										curchunk, ctx->endchunk));
+		report_toast_corruption(ctx, tctx,
+								psprintf("toast chunk sequence number %u exceeds the end chunk sequence number %u",
+										 curchunk, tctx->endchunk));
 		return;
 	}
 
-	expected_size = curchunk < ctx->totalchunks - 1 ? TOAST_MAX_CHUNK_SIZE
-		: ctx->attrsize - ((ctx->totalchunks - 1) * TOAST_MAX_CHUNK_SIZE);
+	expected_size = curchunk < tctx->totalchunks - 1 ? TOAST_MAX_CHUNK_SIZE
+		: tctx->attrsize - ((tctx->totalchunks - 1) * TOAST_MAX_CHUNK_SIZE);
 	if (chunksize != expected_size)
 	{
-		report_main_corruption(ctx,
-							   psprintf("toast chunk size %u differs from the expected size %u",
-										chunksize, expected_size));
+		report_toast_corruption(ctx, tctx,
+								psprintf("toast chunk size %u differs from the expected size %u",
+										 chunksize, expected_size));
 		return;
 	}
 }
@@ -1180,17 +1244,17 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
  * found in ctx->tupstore.
  *
  * This function follows the logic performed by heap_deform_tuple(), and in the
- * case of a toasted value, optionally continues along the logic of
- * detoast_external_attr(), checking for any conditions that would result in
- * either of those functions Asserting or crashing the backend.  The checks
- * performed by Asserts present in those two functions are also performed here.
- * In cases where those two functions are a bit cavalier in their assumptions
- * about data being correct, we perform additional checks not present in either
- * of those two functions.  Where some condition is checked in both of those
- * functions, we perform it here twice, as we parallel the logical flow of
- * those two functions.  The presence of duplicate checks seems a reasonable
- * price to pay for keeping this code tightly coupled with the code it
- * protects.
+ * case of a toasted value, optionally stores the toast pointer so later it can
+ * be checked following the logic of detoast_external_attr(), checking for any
+ * conditions that would result in either of those functions Asserting or
+ * crashing the backend.  The checks performed by Asserts present in those two
+ * functions are also performed here and in check_toasted_attributes.  In cases
+ * where those two functions are a bit cavalier in their assumptions about data
+ * being correct, we perform additional checks not present in either of those
+ * two functions.  Where some condition is checked in both of those functions,
+ * we perform it here twice, as we parallel the logical flow of those two
+ * functions.  The presence of duplicate checks seems a reasonable price to pay
+ * for keeping this code tightly coupled with the code it protects.
  *
  * Returns true if the tuple attribute is sane enough for processing to
  * continue on to the next attribute, false otherwise.
@@ -1198,12 +1262,6 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 static bool
 check_tuple_attribute(HeapCheckContext *ctx)
 {
-	struct varatt_external toast_pointer;
-	ScanKeyData toastkey;
-	SysScanDesc toastscan;
-	SnapshotData SnapshotToast;
-	HeapTuple	toasttup;
-	bool		found_toasttup;
 	Datum		attdatum;
 	struct varlena *attr;
 	char	   *tp;				/* pointer to the tuple data */
@@ -1338,191 +1396,114 @@ check_tuple_attribute(HeapCheckContext *ctx)
 		return true;
 
 	/*
-	 * Must copy attr into toast_pointer for alignment considerations
+	 * If this tuple is at risk of being vacuumed away, we cannot check the
+	 * toast.  Otherwise, we push a copy of the toast pointer so we can check it
+	 * after releasing the main table buffer lock.
 	 */
-	VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+	if (!ctx->tuple_is_volatile)
+	{
+		ToastCheckContext *tctx;
 
-	ctx->attrsize = VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer);
-	ctx->endchunk = (ctx->attrsize - 1) / TOAST_MAX_CHUNK_SIZE;
-	ctx->totalchunks = ctx->endchunk + 1;
+		tctx = (ToastCheckContext *) palloc0fast(sizeof(ToastCheckContext));
 
-	/*
-	 * Setup a scan key to find chunks in toast table with matching va_valueid
-	 */
-	ScanKeyInit(&toastkey,
-				(AttrNumber) 1,
-				BTEqualStrategyNumber, F_OIDEQ,
-				ObjectIdGetDatum(toast_pointer.va_valueid));
-
-	/*
-	 * Check if any chunks for this toasted object exist in the toast table,
-	 * accessible via the index.
-	 */
-	init_toast_snapshot(&SnapshotToast);
-	toastscan = systable_beginscan_ordered(ctx->toast_rel,
-										   ctx->valid_toast_index,
-										   &SnapshotToast, 1,
-										   &toastkey);
-	ctx->chunkno = 0;
-	found_toasttup = false;
-	while ((toasttup =
-			systable_getnext_ordered(toastscan,
-									 ForwardScanDirection)) != NULL)
-	{
-		found_toasttup = true;
-		check_toast_tuple(toasttup, ctx);
-		ctx->chunkno++;
+		VARATT_EXTERNAL_GET_POINTER(tctx->toast_pointer, attr);
+		tctx->blkno = ctx->blkno;
+		tctx->offnum = ctx->offnum;
+		tctx->attnum = ctx->attnum;
+		ctx->toasted_attributes = lappend(ctx->toasted_attributes, tctx);
 	}
-	if (!found_toasttup)
-		report_main_corruption(ctx,
-							   psprintf("toasted value for attribute %u missing from toast table",
-										ctx->attnum));
-	else if (ctx->chunkno != (ctx->endchunk + 1))
-		report_main_corruption(ctx,
-							   psprintf("final toast chunk number %u differs from expected value %u",
-										ctx->chunkno, (ctx->endchunk + 1)));
-	systable_endscan_ordered(toastscan);
 
 	return true;
 }
 
 /*
- * Check the current tuple as tracked in ctx, recording any corruption found in
- * ctx->tupstore.
+ * For each attribute collected in ctx->toasted_attributes, look up the value
+ * in the toast table and perform checks on it.  This function should only be
+ * called on toast pointers which cannot be vacuumed away during our
+ * processing.
  */
 static void
-check_tuple(HeapCheckContext *ctx)
+check_toasted_attributes(HeapCheckContext *ctx)
 {
-	TransactionId xmin;
-	TransactionId xmax;
-	bool		fatal = false;
-	uint16		infomask = ctx->tuphdr->t_infomask;
+	ListCell   *cell;
 
-	/* If xmin is normal, it should be within valid range */
-	xmin = HeapTupleHeaderGetXmin(ctx->tuphdr);
-	switch (get_xid_status(xmin, ctx, NULL))
+	foreach(cell, ctx->toasted_attributes)
 	{
-		case XID_INVALID:
-		case XID_BOUNDS_OK:
-			break;
-		case XID_IN_FUTURE:
-			report_main_corruption(ctx,
-								   psprintf("xmin %u equals or exceeds next valid transaction ID %u:%u",
-											xmin,
-											EpochFromFullTransactionId(ctx->next_fxid),
-											XidFromFullTransactionId(ctx->next_fxid)));
-			fatal = true;
-			break;
-		case XID_PRECEDES_CLUSTERMIN:
-			report_main_corruption(ctx,
-								   psprintf("xmin %u precedes oldest valid transaction ID %u:%u",
-											xmin,
-											EpochFromFullTransactionId(ctx->oldest_fxid),
-											XidFromFullTransactionId(ctx->oldest_fxid)));
-			fatal = true;
-			break;
-		case XID_PRECEDES_RELMIN:
-			report_main_corruption(ctx,
-								   psprintf("xmin %u precedes relation freeze threshold %u:%u",
-											xmin,
-											EpochFromFullTransactionId(ctx->relfrozenfxid),
-											XidFromFullTransactionId(ctx->relfrozenfxid)));
-			fatal = true;
-			break;
-	}
+		ToastCheckContext *tctx;
+		SnapshotData SnapshotToast;
+		ScanKeyData toastkey;
+		SysScanDesc toastscan;
+		bool		found_toasttup;
+		HeapTuple	toasttup;
+
+		tctx = lfirst(cell);
+		tctx->attrsize = VARATT_EXTERNAL_GET_EXTSIZE(tctx->toast_pointer);
+		tctx->endchunk = (tctx->attrsize - 1) / TOAST_MAX_CHUNK_SIZE;
+		tctx->totalchunks = tctx->endchunk + 1;
 
-	xmax = HeapTupleHeaderGetRawXmax(ctx->tuphdr);
+		/*
+		 * Setup a scan key to find chunks in toast table with matching
+		 * va_valueid
+		 */
+		ScanKeyInit(&toastkey,
+					(AttrNumber) 1,
+					BTEqualStrategyNumber, F_OIDEQ,
+					ObjectIdGetDatum(tctx->toast_pointer.va_valueid));
 
-	if (infomask & HEAP_XMAX_IS_MULTI)
-	{
-		/* xmax is a multixact, so it should be within valid MXID range */
-		switch (check_mxid_valid_in_rel(xmax, ctx))
-		{
-			case XID_INVALID:
-				report_main_corruption(ctx,
-									   pstrdup("multitransaction ID is invalid"));
-				fatal = true;
-				break;
-			case XID_PRECEDES_RELMIN:
-				report_main_corruption(ctx,
-									   psprintf("multitransaction ID %u precedes relation minimum multitransaction ID threshold %u",
-												xmax, ctx->relminmxid));
-				fatal = true;
-				break;
-			case XID_PRECEDES_CLUSTERMIN:
-				report_main_corruption(ctx,
-									   psprintf("multitransaction ID %u precedes oldest valid multitransaction ID threshold %u",
-												xmax, ctx->oldest_mxact));
-				fatal = true;
-				break;
-			case XID_IN_FUTURE:
-				report_main_corruption(ctx,
-									   psprintf("multitransaction ID %u equals or exceeds next valid multitransaction ID %u",
-												xmax,
-												ctx->next_mxact));
-				fatal = true;
-				break;
-			case XID_BOUNDS_OK:
-				break;
-		}
-	}
-	else
-	{
 		/*
-		 * xmax is not a multixact and is normal, so it should be within the
-		 * valid XID range.
+		 * Check if any chunks for this toasted object exist in the toast
+		 * table, accessible via the index.
 		 */
-		switch (get_xid_status(xmax, ctx, NULL))
+		init_toast_snapshot(&SnapshotToast);
+		toastscan = systable_beginscan_ordered(ctx->toast_rel,
+											   ctx->valid_toast_index,
+											   &SnapshotToast, 1,
+											   &toastkey);
+		tctx->chunkno = 0;
+		found_toasttup = false;
+		while ((toasttup =
+				systable_getnext_ordered(toastscan,
+										 ForwardScanDirection)) != NULL)
 		{
-			case XID_INVALID:
-			case XID_BOUNDS_OK:
-				break;
-			case XID_IN_FUTURE:
-				report_main_corruption(ctx,
-									   psprintf("xmax %u equals or exceeds next valid transaction ID %u:%u",
-												xmax,
-												EpochFromFullTransactionId(ctx->next_fxid),
-												XidFromFullTransactionId(ctx->next_fxid)));
-				fatal = true;
-				break;
-			case XID_PRECEDES_CLUSTERMIN:
-				report_main_corruption(ctx,
-									   psprintf("xmax %u precedes oldest valid transaction ID %u:%u",
-												xmax,
-												EpochFromFullTransactionId(ctx->oldest_fxid),
-												XidFromFullTransactionId(ctx->oldest_fxid)));
-				fatal = true;
-				break;
-			case XID_PRECEDES_RELMIN:
-				report_main_corruption(ctx,
-									   psprintf("xmax %u precedes relation freeze threshold %u:%u",
-												xmax,
-												EpochFromFullTransactionId(ctx->relfrozenfxid),
-												XidFromFullTransactionId(ctx->relfrozenfxid)));
-				fatal = true;
+			found_toasttup = true;
+			check_toast_tuple(toasttup, ctx, tctx);
+			tctx->chunkno++;
 		}
+		if (!found_toasttup)
+			report_toast_corruption(ctx, tctx,
+									psprintf("toasted value for attribute %u missing from toast table",
+											 tctx->attnum));
+		else if (tctx->chunkno != (tctx->endchunk + 1))
+			report_toast_corruption(ctx, tctx,
+									psprintf("final toast chunk number %u differs from expected value %u",
+											 tctx->chunkno, (tctx->endchunk + 1)));
+		systable_endscan_ordered(toastscan);
+
+		pfree(tctx);
 	}
+	list_free(ctx->toasted_attributes);
+	ctx->toasted_attributes = NIL;
+}
 
-	/*
-	 * Cannot process tuple data if tuple header was corrupt, as the offsets
-	 * within the page cannot be trusted, leaving too much risk of reading
-	 * garbage if we continue.
-	 *
-	 * We also cannot process the tuple if the xmin or xmax were invalid
-	 * relative to relfrozenxid or relminmxid, as clog entries for the xids
-	 * may already be gone.
-	 */
-	if (fatal)
-		return;
-
+/*
+ * Check the current tuple as tracked in ctx, recording any corruption found in
+ * ctx->tupstore.
+ */
+static void
+check_tuple(HeapCheckContext *ctx)
+{
 	/*
 	 * Check various forms of tuple header corruption.  If the header is too
-	 * corrupt to continue checking, or if the tuple is not visible to anyone,
-	 * we cannot continue with other checks.
+	 * corrupt to continue checking, we cannot continue with other checks.
 	 */
 	if (!check_tuple_header(ctx))
 		return;
 
+	/*
+	 * Check tuple visibility.  If the inserting transaction aborted, we
+	 * cannot assume our relation description matches the tuple structure, and
+	 * therefore cannot check it.
+	 */
 	if (!check_tuple_visibility(ctx))
 		return;
 
@@ -1545,6 +1526,10 @@ check_tuple(HeapCheckContext *ctx)
 	 * next, at which point we abort further attribute checks for this tuple.
 	 * Note that we don't abort for all types of corruption, only for those
 	 * types where we don't know how to continue.
+	 *
+	 * While checking the tuple attributes, we build a list of toast pointers
+	 * we encounter, to be checked later.  If further attribute checking is
+	 * aborted, we still have the pointers collected prior to aborting.
 	 */
 	ctx->offset = 0;
 	for (ctx->attnum = 0; ctx->attnum < ctx->natts; ctx->attnum++)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9e6777e9d0..0ce261e2a2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2557,6 +2557,7 @@ TimestampTz
 TmFromChar
 TmToChar
 ToastAttrInfo
+ToastCheckContext
 ToastTupleContext
 TocEntry
 TokenAuxData
-- 
2.21.1 (Apple Git-122.3)

#108Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#107)
Re: pg_amcheck contrib application

On Mon, Mar 29, 2021 at 7:16 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

Sure, here are four patches which do the same as the single v12 patch did.

Thanks. Here are some comments on 0003 and 0004:

When you posted v11, you said that "Rather than print out all four
toast pointer fields for each toast failure, va_rawsize, va_extsize,
and va_toastrelid are only mentioned in the corruption message if they
are related to the specific corruption. Otherwise, just the
va_valueid is mentioned in the corruption message." I like that
principle; in fact, as you know, I suggested it. But, with the v13
patches applied, exactly zero of the callers to
report_toast_corruption() appear to be following it, because none of
them include the value ID. I think you need to revise the messages,
e.g. "toasted value for attribute %u missing from toast table" ->
"toast value %u not found in toast table"; "final toast chunk number
%u differs from expected value %u" -> "toast value %u was expected to
end at chunk %u, but ended at chunk %u"; "toast chunk sequence number
is null" -> "toast value %u has toast chunk with null sequence
number". In the first of those example cases, I think you need not
mention the attribute number because it's already there in its own
column.

On a related note, it doesn't look like you are actually checking
va_toastrelid here. Doing so seems like it would be a good idea. It
also seems like it would be good to check that the compressed size is
less than or equal to the uncompressed size.
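
The two extra checks suggested here can be sketched in standalone C. This is a simplified stand-in, not the amcheck implementation: the struct models Postgres's external toast pointer (`va_rawsize`, `va_extsize`, `va_valueid`, `va_toastrelid`), but real code must also account for the varlena header in the raw size, and would report via `report_toast_corruption()` rather than `snprintf()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t Oid;

/* Simplified stand-in for Postgres's varatt_external */
typedef struct ToastPointer
{
	int32_t		va_rawsize;		/* original (uncompressed) data size */
	int32_t		va_extsize;		/* external, possibly compressed, size */
	Oid			va_valueid;		/* value ID within the toast table */
	Oid			va_toastrelid;	/* OID of the toast table holding the value */
} ToastPointer;

/*
 * Apply the two sanity checks discussed above: the pointer must reference
 * the owning table's toast relation, and the compressed (external) size
 * must not exceed the uncompressed size.  On failure, fill *msg with a
 * corruption message that leads with the value ID, in the style suggested
 * earlier in the thread.
 */
static bool
toast_pointer_sane(const ToastPointer *tp, Oid expected_toastrelid,
				   char *msg, size_t msglen)
{
	if (tp->va_toastrelid != expected_toastrelid)
	{
		snprintf(msg, msglen,
				 "toast value %u toast relation OID %u differs from expected OID %u",
				 tp->va_valueid, tp->va_toastrelid, expected_toastrelid);
		return false;
	}
	if (tp->va_extsize > tp->va_rawsize)
	{
		snprintf(msg, msglen,
				 "toast value %u external size %d exceeds raw size %d",
				 tp->va_valueid, tp->va_extsize, tp->va_rawsize);
		return false;
	}
	return true;
}
```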

I do not like the name tuple_is_volatile, because volatile has a
couple of meanings already, and this isn't one of them. A
SQL-callable function is volatile if it might return different outputs
given the same inputs, even within the same SQL statement. A C
variable is volatile if it might be magically modified in ways not
known to the compiler. I had suggested tuple_cannot_die_now, which is
closer to the mark. If you want to be even more precise, you could
talk about whether the tuple is potentially prunable (e.g.
tuple_can_be_pruned, which inverts the sense). That's really what
we're worried about: whether MVCC rules would permit the tuple to be
pruned after we release the buffer lock and before we check TOAST.

I would ideally prefer not to rename report_corruption(). The old name
is clearer, and changing it produces a bunch of churn that I'd rather
avoid. Perhaps the common helper function could be called
report_corruption_internal(), and the callers could be
report_corruption() and report_toast_corruption().

Regarding 0001 and 0002, I think the logic in 0002 looks a lot closer
to correct now, but I want to go through it in more detail. I think,
though, that you've made some of my comments worse. For example, I
wrote: "It should be impossible for xvac to still be running, since
we've removed all that code, but even if it were, it ought to be safe
to read the tuple, since the original inserter must have committed.
But, if the xvac transaction committed, this tuple (and its associated
TOAST tuples) could be pruned at any time." You changed that to read
"We don't bother comparing against safe_xmin because the VACUUM FULL
must have committed prior to an upgrade and can't still be running."
Your comment is shorter, which is a point in its favor, but what I was
trying to emphasize is that the logic would be correct EVEN IF we
again started to use HEAP_MOVED_OFF and HEAP_MOVED_IN again. Your
version makes it sound like the code would need to be revised in that
case. If that's true, then my comment was wrong, but I didn't think it
was true, or I wouldn't have written the comment in that way.

Also, and maybe this is a job for a separate patch, but then again
maybe not, I wonder if it's really a good idea for get_xid_status to
return both a XidBoundsViolation and an XidCommitStatus. It seems to
me (without checking all that carefully) that it might be better to
just flatten all of that into a single enum, because right now it
seems like you often end up with two consecutive switch statements
where, perhaps, just one would suffice.
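
A minimal sketch of what the flattening might look like, assuming hypothetical `XID_STATUS_*` names (these are not in any posted patch): with bounds violations and commit status merged into one enum, a caller can classify an XID with a single switch instead of two nested ones.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flattened result, combining bounds and commit status */
typedef enum XidStatus
{
	XID_STATUS_INVALID,
	XID_STATUS_IN_FUTURE,
	XID_STATUS_PRECEDES_CLUSTERMIN,
	XID_STATUS_PRECEDES_RELMIN,
	XID_STATUS_COMMITTED,
	XID_STATUS_IS_CURRENT_XID,
	XID_STATUS_IN_PROGRESS,
	XID_STATUS_ABORTED
} XidStatus;

/* One switch handles every outcome; no second switch on a by-reference
 * commit status is needed. */
static const char *
classify(XidStatus st)
{
	switch (st)
	{
		case XID_STATUS_INVALID:
		case XID_STATUS_IN_FUTURE:
		case XID_STATUS_PRECEDES_CLUSTERMIN:
		case XID_STATUS_PRECEDES_RELMIN:
			return "corrupt";
		case XID_STATUS_COMMITTED:
			return "committed";
		case XID_STATUS_IS_CURRENT_XID:
		case XID_STATUS_IN_PROGRESS:
			return "in progress";
		case XID_STATUS_ABORTED:
			return "aborted";
	}
	return "unreachable";
}
```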

--
Robert Haas
EDB: http://www.enterprisedb.com

#109Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#108)
3 attachment(s)
Re: pg_amcheck contrib application

On Mar 30, 2021, at 12:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Mar 29, 2021 at 7:16 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

Sure, here are four patches which do the same as the single v12 patch did.

Thanks. Here are some comments on 0003 and 0004:

When you posted v11, you said that "Rather than print out all four
toast pointer fields for each toast failure, va_rawsize, va_extsize,
and va_toastrelid are only mentioned in the corruption message if they
are related to the specific corruption. Otherwise, just the
va_valueid is mentioned in the corruption message." I like that
principle; in fact, as you know, I suggested it. But, with the v13
patches applied, exactly zero of the callers to
report_toast_corruption() appear to be following it, because none of
them include the value ID. I think you need to revise the messages,
e.g.

These changes got lost between v11 and v12. I've put them back, as well as updating to use your language.

"toasted value for attribute %u missing from toast table" ->
"toast value %u not found in toast table";

Changed.

"final toast chunk number
%u differs from expected value %u" -> "toast value %u was expected to
end at chunk %u, but ended at chunk %u";

Changed.

"toast chunk sequence number
is null" -> "toast value %u has toast chunk with null sequence
number".

Changed.

In the first of those example cases, I think you need not
mention the attribute number because it's already there in its own
column.

Correct. I'd removed that but lost that work in v12.

On a related note, it doesn't look like you are actually checking
va_toastrelid here. Doing so seems like it would be a good idea. It
also seems like it would be good to check that the compressed size is
less than or equal to the uncompressed size.

Yeah, those checks were in v11 but got lost when I changed things for v12. They are back in v14.

I do not like the name tuple_is_volatile, because volatile has a
couple of meanings already, and this isn't one of them. A
SQL-callable function is volatile if it might return different outputs
given the same inputs, even within the same SQL statement. A C
variable is volatile if it might be magically modified in ways not
known to the compiler. I had suggested tuple_cannot_die_now, which is
closer to the mark. If you want to be even more precise, you could
talk about whether the tuple is potentially prunable (e.g.
tuple_can_be_pruned, which inverts the sense). That's really what
we're worried about: whether MVCC rules would permit the tuple to be
pruned after we release the buffer lock and before we check TOAST.

I used "tuple_can_be_pruned". I didn't like "tuple_cannot_die_now", and still don't like that name, as it has several wrong interpretations. One meaning of "cannot die now" is that it has become immortal. Another is "cannot be deleted from the table".

I would ideally prefer not to rename report_corruption(). The old name
is clearer, and changing it produces a bunch of churn that I'd rather
avoid. Perhaps the common helper function could be called
report_corruption_internal(), and the callers could be
report_corruption() and report_toast_corruption().

Yes, hence the commit message in the previous patch set, "This patch can probably be left out if the committer believes it creates more git churn than it is worth." I've removed this patch from this next patch set, and used the function names you suggest.

Regarding 0001 and 0002, I think the logic in 0002 looks a lot closer
to correct now, but I want to go through it in more detail. I think,
though, that you've made some of my comments worse. For example, I
wrote: "It should be impossible for xvac to still be running, since
we've removed all that code, but even if it were, it ought to be safe
to read the tuple, since the original inserter must have committed.
But, if the xvac transaction committed, this tuple (and its associated
TOAST tuples) could be pruned at any time." You changed that to read
"We don't bother comparing against safe_xmin because the VACUUM FULL
must have committed prior to an upgrade and can't still be running."
Your comment is shorter, which is a point in its favor, but what I was
trying to emphasize is that the logic would be correct EVEN IF we
again started to use HEAP_MOVED_OFF and HEAP_MOVED_IN again. Your
version makes it sound like the code would need to be revised in that
case. If that's true, then my comment was wrong, but I didn't think it
was true, or I wouldn't have written the comment in that way.

I think the logic would have to change if we brought back the old VACUUM FULL behavior.

I'm not looking at the old VACUUM FULL code, but my assumption is that if the xvac code were resurrected, then when a tuple is moved off by a VACUUM FULL, the old tuple and associated toast cannot be pruned until concurrent transactions end. So, if amcheck is running more-or-less concurrently with the VACUUM FULL and has a snapshot xmin no newer than the xid of the VACUUM FULL's xid, it can check the toast associated with the moved off tuple after the VACUUM FULL commits. If instead the VACUUM FULL xid was older than amcheck's xmin, then the toast is in danger of being vacuumed away. So the logic in verify_heapam would need to change to think about this distinction. We don't have to concern ourselves about that, because VACUUM FULL cannot be running, and so the xid for it must be older than our xmin, and hence the toast is unconditionally not safe to check.

I'm changing the comments back to how you had them, but I'd like to know why my reasoning is wrong.
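
The "older than our xmin" test underlying this reasoning can be illustrated with the circular XID comparison Postgres uses (in the style of `TransactionIdPrecedes`, simplified here to ignore permanent XIDs). The `toast_unsafe_to_check` helper is a made-up name restating the rule discussed above, not actual amcheck code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Modular 32-bit comparison: id1 is "older" than id2 if it is behind it
 * on the wrapping XID circle.  Works across wraparound because the
 * subtraction is done in unsigned arithmetic and reinterpreted as signed.
 */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
	return (int32_t) (id1 - id2) < 0;
}

/*
 * The rule at issue: if the deleting XID precedes our snapshot's xmin,
 * vacuum may already be free to remove the tuple's toast, so the toast
 * must not be checked.
 */
static bool
toast_unsafe_to_check(TransactionId deleter_xid, TransactionId snapshot_xmin)
{
	return xid_precedes(deleter_xid, snapshot_xmin);
}
```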

Also, and maybe this is a job for a separate patch, but then again
maybe not, I wonder if it's really a good idea for get_xid_status to
return both a XidBoundsViolation and an XidCommitStatus. It seems to
me (without checking all that carefully) that it might be better to
just flatten all of that into a single enum, because right now it
seems like you often end up with two consecutive switch statements
where, perhaps, just one would suffice.

get_xid_status was written to return XidBoundsViolation separately from returning by reference an XidCommitStatus because, if you pass null for the XidCommitStatus parameter, the function can return earlier without taking the XactTruncationLock and checking clog. I think that design made a lot of sense at the time get_xid_status was written, but there are no longer any callers passing null, so the function never returns early.

I am hesitant to refactor get_xid_status as you suggest until we're sure no such callers who pass null are needed. So perhaps your idea of having that change as a separate patch for after this patch series is done and committed is the right strategy.
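
The early-return design being defended here can be sketched as follows. The bounds check and clog lookup are stubbed with assumed shapes (a counter stands in for the lock-and-clog work); only the control flow is the point: a caller passing NULL for the status out-parameter never pays for the expensive lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

typedef enum XidBoundsViolation
{
	XID_INVALID,
	XID_IN_FUTURE,
	XID_PRECEDES_CLUSTERMIN,
	XID_PRECEDES_RELMIN,
	XID_BOUNDS_OK
} XidBoundsViolation;

typedef enum XidCommitStatus
{
	XID_COMMITTED,
	XID_IN_PROGRESS,
	XID_ABORTED
} XidCommitStatus;

/* Counter lets a caller observe whether the expensive path ran. */
static int	clog_lookups = 0;

static XidBoundsViolation
check_bounds(TransactionId xid)
{
	return (xid == 0) ? XID_INVALID : XID_BOUNDS_OK;
}

static XidCommitStatus
expensive_clog_lookup(TransactionId xid)
{
	(void) xid;
	clog_lookups++;				/* stands in for lock + clog access */
	return XID_COMMITTED;		/* placeholder result */
}

/*
 * Mirrors the design described above: callers passing NULL for status
 * skip the lock-and-clog work entirely.
 */
static XidBoundsViolation
get_xid_status_sketch(TransactionId xid, XidCommitStatus *status)
{
	XidBoundsViolation bounds = check_bounds(xid);

	if (bounds != XID_BOUNDS_OK || status == NULL)
		return bounds;			/* early return: no clog access needed */
	*status = expensive_clog_lookup(xid);
	return bounds;
}
```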

Also, even now, there are some places where the returned XidBoundsViolation is used right away, but some other processing happens before the XidCommitStatus is finally used. If they were one value in a merged enum, there would still be two switches at least in the location I'm thinking of.

Attachments:

v14-0001-Refactoring-function-check_tuple_header_and_visi.patch (application/octet-stream)
From b48c48825d6e14ba0c2ec90609308cffa98ce424 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Wed, 24 Mar 2021 18:18:56 -0700
Subject: [PATCH v14 1/3] Refactoring function
 check_tuple_header_and_visibility

Extending enum XidCommitStatus to include XID_IS_CURRENT_XID.  The
visibility code for verify_heapam() was conflating XID_IN_PROGRESS
and XID_IS_CURRENT_XID under just one enum, making it harder to
compare the logic to that used by vacuum's visibility function,
which treats those two cases separately.

Simplifying check_tuple_header_and_visibilty signature.  It was
taking both tuphdr and ctx arguments, but the tuphdr is just
ctx->tuphdr, so it is a bit absurd to pass two arguments for this.

Splitting check_tuple_header_and_visibilty() into two functions.
check_tuple_header() and check_tuple_visibility() are split out as
separate functions, but otherwise behave exactly as before.
---
 contrib/amcheck/verify_heapam.c | 82 +++++++++++++++++++--------------
 1 file changed, 47 insertions(+), 35 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 6f972e630a..9172b5fd81 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -46,6 +46,7 @@ typedef enum XidBoundsViolation
 typedef enum XidCommitStatus
 {
 	XID_COMMITTED,
+	XID_IS_CURRENT_XID,
 	XID_IN_PROGRESS,
 	XID_ABORTED
 } XidCommitStatus;
@@ -133,8 +134,8 @@ static void check_tuple(HeapCheckContext *ctx);
 static void check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx);
 
 static bool check_tuple_attribute(HeapCheckContext *ctx);
-static bool check_tuple_header_and_visibilty(HeapTupleHeader tuphdr,
-											 HeapCheckContext *ctx);
+static bool check_tuple_header(HeapCheckContext *ctx);
+static bool check_tuple_visibility(HeapCheckContext *ctx);
 
 static void report_corruption(HeapCheckContext *ctx, char *msg);
 static TupleDesc verify_heapam_tupdesc(void);
@@ -555,16 +556,11 @@ verify_heapam_tupdesc(void)
 }
 
 /*
- * Check for tuple header corruption and tuple visibility.
- *
- * Since we do not hold a snapshot, tuple visibility is not a question of
- * whether we should be able to see the tuple relative to any particular
- * snapshot, but rather a question of whether it is safe and reasonable to
- * check the tuple attributes.
+ * Check for tuple header corruption.
  *
  * Some kinds of corruption make it unsafe to check the tuple attributes, for
  * example when the line pointer refers to a range of bytes outside the page.
- * In such cases, we return false (not visible) after recording appropriate
+ * In such cases, we return false (not checkable) after recording appropriate
  * corruption messages.
  *
  * Some other kinds of tuple header corruption confuse the question of where
@@ -576,27 +572,16 @@ verify_heapam_tupdesc(void)
  *
  * Other kinds of tuple header corruption do not bear on the question of
  * whether the tuple attributes can be checked, so we record corruption
- * messages for them but do not base our visibility determination on them.  (In
- * other words, we do not return false merely because we detected them.)
- *
- * For visibility determination not specifically related to corruption, what we
- * want to know is if a tuple is potentially visible to any running
- * transaction.  If you are tempted to replace this function's visibility logic
- * with a call to another visibility checking function, keep in mind that this
- * function does not update hint bits, as it seems imprudent to write hint bits
- * (or anything at all) to a table during a corruption check.  Nor does this
- * function bother classifying tuple visibility beyond a boolean visible vs.
- * not visible.
+ * messages for them but we do not return false merely because we detected
+ * them.
  *
- * The caller should already have checked that xmin and xmax are not out of
- * bounds for the relation.
- *
- * Returns whether the tuple is both visible and sufficiently sensible to
- * undergo attribute checks.
+ * Returns whether the tuple is sufficiently sensible to undergo visibility and
+ * attribute checks.
  */
 static bool
-check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
+check_tuple_header(HeapCheckContext *ctx)
 {
+	HeapTupleHeader tuphdr = ctx->tuphdr;
 	uint16		infomask = tuphdr->t_infomask;
 	bool		header_garbled = false;
 	unsigned	expected_hoff;
@@ -651,13 +636,34 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
 	if (header_garbled)
 		return false;			/* checking of this tuple should not continue */
 
-	/*
-	 * Ok, we can examine the header for tuple visibility purposes, though we
-	 * still need to be careful about a few remaining types of header
-	 * corruption.  This logic roughly follows that of
-	 * HeapTupleSatisfiesVacuum.  Where possible the comments indicate which
-	 * HTSV_Result we think that function might return for this tuple.
-	 */
+	return true;				/* header ok */
+}
+
+/*
+ * Checks whether a tuple is visible for checking.
+ *
+ * Since we do not hold a snapshot, tuple visibility is not a question of
+ * whether we should be able to see the tuple relative to any particular
+ * snapshot, but rather a question of whether it is safe and reasonable to
+ * check the tuple attributes.
+ *
+ * For visibility determination not specifically related to corruption, what we
+ * want to know is if a tuple is potentially visible to any running
+ * transaction.  If you are tempted to replace this function's visibility logic
+ * with a call to another visibility checking function, keep in mind that this
+ * function does not update hint bits, as it seems imprudent to write hint bits
+ * (or anything at all) to a table during a corruption check.  Nor does this
+ * function bother classifying tuple visibility beyond a boolean visible vs.
+ * not visible.
+ *
+ * Returns whether the tuple is visible for checking.
+ */
+static bool
+check_tuple_visibility(HeapCheckContext *ctx)
+{
+	HeapTupleHeader tuphdr = ctx->tuphdr;
+	uint16		infomask = tuphdr->t_infomask;
+
 	if (!HeapTupleHeaderXminCommitted(tuphdr))
 	{
 		TransactionId raw_xmin = HeapTupleHeaderGetRawXmin(tuphdr);
@@ -704,6 +710,7 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
 					switch (status)
 					{
 						case XID_IN_PROGRESS:
+						case XID_IS_CURRENT_XID:
 							return true;	/* HEAPTUPLE_DELETE_IN_PROGRESS */
 						case XID_COMMITTED:
 						case XID_ABORTED:
@@ -748,6 +755,7 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
 						case XID_COMMITTED:
 							break;
 						case XID_IN_PROGRESS:
+						case XID_IS_CURRENT_XID:
 							return true;	/* insert or delete in progress */
 						case XID_ABORTED:
 							return false;	/* HEAPTUPLE_DEAD */
@@ -795,6 +803,7 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
 					switch (status)
 					{
 						case XID_IN_PROGRESS:
+						case XID_IS_CURRENT_XID:
 							return true;	/* HEAPTUPLE_DELETE_IN_PROGRESS */
 						case XID_COMMITTED:
 						case XID_ABORTED:
@@ -1247,7 +1256,10 @@ check_tuple(HeapCheckContext *ctx)
 	 * corrupt to continue checking, or if the tuple is not visible to anyone,
 	 * we cannot continue with other checks.
 	 */
-	if (!check_tuple_header_and_visibilty(ctx->tuphdr, ctx))
+	if (!check_tuple_header(ctx))
+		return;
+
+	if (!check_tuple_visibility(ctx))
 		return;
 
 	/*
@@ -1448,7 +1460,7 @@ get_xid_status(TransactionId xid, HeapCheckContext *ctx,
 	if (FullTransactionIdPrecedesOrEquals(clog_horizon, fxid))
 	{
 		if (TransactionIdIsCurrentTransactionId(xid))
-			*status = XID_IN_PROGRESS;
+			*status = XID_IS_CURRENT_XID;
 		else if (TransactionIdDidCommit(xid))
 			*status = XID_COMMITTED;
 		else if (TransactionIdDidAbort(xid))
-- 
2.21.1 (Apple Git-122.3)

v14-0002-Replacing-implementation-of-check_tuple_visibili.patch (application/octet-stream)
From 1c4781ff30b5f4d9e10c6dc03c755d34881338b1 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Mon, 29 Mar 2021 14:31:13 -0700
Subject: [PATCH v14 2/3] Replacing implementation of check_tuple_visibility

Using a modified version of HeapTupleSatisfiesVacuumHorizon.
---
 contrib/amcheck/verify_heapam.c | 479 +++++++++++++++++++++++++-------
 1 file changed, 371 insertions(+), 108 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 9172b5fd81..be22b491d6 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -73,6 +73,8 @@ typedef struct HeapCheckContext
 	TransactionId oldest_xid;	/* ShmemVariableCache->oldestXid */
 	FullTransactionId oldest_fxid;	/* 64-bit version of oldest_xid, computed
 									 * relative to next_fxid */
+	TransactionId safe_xmin;	/* this XID and newer ones can't become
+								 * all-visible while we're running */
 
 	/*
 	 * Cached copy of value from MultiXactState
@@ -114,6 +116,9 @@ typedef struct HeapCheckContext
 	uint32		offset;			/* offset in tuple data */
 	AttrNumber	attnum;
 
+	/* True if toast for this tuple could be vacuumed away */
+	bool		tuple_can_be_pruned;
+
 	/* Values for iterating over toast for the attribute */
 	int32		chunkno;
 	int32		attrsize;
@@ -249,6 +254,12 @@ verify_heapam(PG_FUNCTION_ARGS)
 	memset(&ctx, 0, sizeof(HeapCheckContext));
 	ctx.cached_xid = InvalidTransactionId;
 
+	/*
+	 * Any xmin newer than the xmin of our snapshot can't become all-visible
+	 * while we're running.
+	 */
+	ctx.safe_xmin = GetTransactionSnapshot()->xmin;
+
 	/*
 	 * If we report corruption when not examining some individual attribute,
 	 * we need attnum to be reported as NULL.  Set that up before any
@@ -640,189 +651,441 @@ check_tuple_header(HeapCheckContext *ctx)
 }
 
 /*
- * Checks whether a tuple is visible for checking.
+ * Checks whether a tuple is visible to our transaction for checking, which is
+ * not a question of whether we should be able to see the tuple relative to any
+ * particular snapshot, but rather a question of whether it is safe and
+ * reasonable to check the tuple attributes.  The caller should already have
+ * checked that the tuple is sufficiently sensible for us to evaluate.
  *
- * Since we do not hold a snapshot, tuple visibility is not a question of
- * whether we should be able to see the tuple relative to any particular
- * snapshot, but rather a question of whether it is safe and reasonable to
- * check the tuple attributes.
+ * If a tuple could have been inserted by a transaction that also added a
+ * column to the table, but which ultimately did not commit, or which has not
+ * yet committed, then the table's current TupleDesc might differ from the one
+ * used to construct this tuple, so we must not check it.
  *
- * For visibility determination not specifically related to corruption, what we
- * want to know is if a tuple is potentially visible to any running
- * transaction.  If you are tempted to replace this function's visibility logic
- * with a call to another visibility checking function, keep in mind that this
- * function does not update hint bits, as it seems imprudent to write hint bits
- * (or anything at all) to a table during a corruption check.  Nor does this
- * function bother classifying tuple visibility beyond a boolean visible vs.
- * not visible.
+ * As a special case, if our own transaction inserted the tuple, even if we
+ * added a column to the table, our TupleDesc should match.  We could check the
+ * tuple, but choose not to do so.
  *
- * Returns whether the tuple is visible for checking.
+ * If a tuple has been updated or deleted, we can still read the old tuple for
+ * corruption checking purposes, as long as we are careful about concurrent
+ * vacuums.  The main table tuple itself cannot be vacuumed away because we
+ * hold a buffer lock on the page, but if the deleting transaction is older
+ * than our transaction snapshot's xmin, then vacuum could remove the toast at
+ * any time, so we must not check the toast.
+ *
+ * If xmin or xmax values are older than can be checked against clog, or appear
+ * to be in the future (possibly due to wrap-around), then we cannot make a
+ * determination about the visibility of the tuple, so we must not check it.
+ *
+ * Returns true if the tuple should be checked, false otherwise.  Sets
+ * ctx->tuple_can_be_pruned true if the toast might be vacuumed away, false
+ * otherwise.
  */
 static bool
 check_tuple_visibility(HeapCheckContext *ctx)
 {
+	TransactionId xmin;
+	TransactionId xvac;
+	TransactionId xmax;
+	XidCommitStatus xmin_status;
+	XidCommitStatus xvac_status;
+	XidCommitStatus xmax_status;
 	HeapTupleHeader tuphdr = ctx->tuphdr;
-	uint16		infomask = tuphdr->t_infomask;
 
-	if (!HeapTupleHeaderXminCommitted(tuphdr))
+	ctx->tuple_can_be_pruned = true;	/* have not yet proven otherwise */
+
+	/* If xmin is normal, it should be within valid range */
+	xmin = HeapTupleHeaderGetXmin(tuphdr);
+	switch (get_xid_status(xmin, ctx, &xmin_status))
 	{
-		TransactionId raw_xmin = HeapTupleHeaderGetRawXmin(tuphdr);
+		case XID_INVALID:
+		case XID_BOUNDS_OK:
+			break;
+		case XID_IN_FUTURE:
+			report_corruption(ctx,
+							  psprintf("xmin %u equals or exceeds next valid transaction ID %u:%u",
+									   xmin,
+									   EpochFromFullTransactionId(ctx->next_fxid),
+									   XidFromFullTransactionId(ctx->next_fxid)));
+			return false;		/* corrupt */
+		case XID_PRECEDES_CLUSTERMIN:
+			report_corruption(ctx,
+							  psprintf("xmin %u precedes oldest valid transaction ID %u:%u",
+									   xmin,
+									   EpochFromFullTransactionId(ctx->oldest_fxid),
+									   XidFromFullTransactionId(ctx->oldest_fxid)));
+			return false;		/* corrupt */
+		case XID_PRECEDES_RELMIN:
+			report_corruption(ctx,
+							  psprintf("xmin %u precedes relation freeze threshold %u:%u",
+									   xmin,
+									   EpochFromFullTransactionId(ctx->relfrozenfxid),
+									   XidFromFullTransactionId(ctx->relfrozenfxid)));
+			return false;		/* corrupt */
+	}
 
+	/*
+	 * Has inserting transaction committed?
+	 */
+	if (!HeapTupleHeaderXminCommitted(tuphdr))
+	{
 		if (HeapTupleHeaderXminInvalid(tuphdr))
-			return false;		/* HEAPTUPLE_DEAD */
+
+			/*
+			 * The inserting transaction aborted.  The structure of the tuple
+			 * may not match our relation description, so we cannot check it.
+			 */
+			return false;		/* uncheckable */
 		/* Used by pre-9.0 binary upgrades */
-		else if (infomask & HEAP_MOVED_OFF ||
-				 infomask & HEAP_MOVED_IN)
+		else if (tuphdr->t_infomask & HEAP_MOVED_OFF)
 		{
-			XidCommitStatus status;
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuphdr);
+			xvac = HeapTupleHeaderGetXvac(tuphdr);
 
-			switch (get_xid_status(xvac, ctx, &status))
+			switch (get_xid_status(xvac, ctx, &xvac_status))
 			{
 				case XID_INVALID:
 					report_corruption(ctx,
-									  pstrdup("old-style VACUUM FULL transaction ID is invalid"));
+									  pstrdup("old-style VACUUM FULL transaction ID for moved off tuple is invalid"));
 					return false;	/* corrupt */
 				case XID_IN_FUTURE:
 					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u equals or exceeds next valid transaction ID %u:%u",
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple equals or exceeds next valid transaction ID %u:%u",
 											   xvac,
 											   EpochFromFullTransactionId(ctx->next_fxid),
 											   XidFromFullTransactionId(ctx->next_fxid)));
 					return false;	/* corrupt */
 				case XID_PRECEDES_RELMIN:
 					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u precedes relation freeze threshold %u:%u",
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple precedes relation freeze threshold %u:%u",
 											   xvac,
 											   EpochFromFullTransactionId(ctx->relfrozenfxid),
 											   XidFromFullTransactionId(ctx->relfrozenfxid)));
 					return false;	/* corrupt */
-					break;
 				case XID_PRECEDES_CLUSTERMIN:
 					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u precedes oldest valid transaction ID %u:%u",
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple precedes oldest valid transaction ID %u:%u",
 											   xvac,
 											   EpochFromFullTransactionId(ctx->oldest_fxid),
 											   XidFromFullTransactionId(ctx->oldest_fxid)));
 					return false;	/* corrupt */
-					break;
 				case XID_BOUNDS_OK:
-					switch (status)
-					{
-						case XID_IN_PROGRESS:
-						case XID_IS_CURRENT_XID:
-							return true;	/* HEAPTUPLE_DELETE_IN_PROGRESS */
-						case XID_COMMITTED:
-						case XID_ABORTED:
-							return false;	/* HEAPTUPLE_DEAD */
-					}
+					break;
 			}
-		}
-		else
-		{
-			XidCommitStatus status;
 
-			switch (get_xid_status(raw_xmin, ctx, &status))
+			switch (xvac_status)
 			{
-				case XID_INVALID:
-					report_corruption(ctx,
-									  pstrdup("raw xmin is invalid"));
-					return false;
-				case XID_IN_FUTURE:
+				case XID_IS_CURRENT_XID:
 					report_corruption(ctx,
-									  psprintf("raw xmin %u equals or exceeds next valid transaction ID %u:%u",
-											   raw_xmin,
-											   EpochFromFullTransactionId(ctx->next_fxid),
-											   XidFromFullTransactionId(ctx->next_fxid)));
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple matches our current transaction ID",
+											   xvac));
 					return false;	/* corrupt */
-				case XID_PRECEDES_RELMIN:
+				case XID_IN_PROGRESS:
 					report_corruption(ctx,
-									  psprintf("raw xmin %u precedes relation freeze threshold %u:%u",
-											   raw_xmin,
-											   EpochFromFullTransactionId(ctx->relfrozenfxid),
-											   XidFromFullTransactionId(ctx->relfrozenfxid)));
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple appears to be in progress",
+											   xvac));
 					return false;	/* corrupt */
-				case XID_PRECEDES_CLUSTERMIN:
-					report_corruption(ctx,
-									  psprintf("raw xmin %u precedes oldest valid transaction ID %u:%u",
-											   raw_xmin,
-											   EpochFromFullTransactionId(ctx->oldest_fxid),
-											   XidFromFullTransactionId(ctx->oldest_fxid)));
-					return false;	/* corrupt */
-				case XID_BOUNDS_OK:
-					switch (status)
-					{
-						case XID_COMMITTED:
-							break;
-						case XID_IN_PROGRESS:
-						case XID_IS_CURRENT_XID:
-							return true;	/* insert or delete in progress */
-						case XID_ABORTED:
-							return false;	/* HEAPTUPLE_DEAD */
-					}
+
+				case XID_COMMITTED:
+
+					/*
+					 * It should be impossible for xvac to still be running,
+					 * since we've removed all that code, but even if it were,
+					 * it ought to be safe to read the tuple, since the
+					 * original inserter must have committed.  But, if the
+					 * xvac transaction committed, this tuple (and its
+					 * associated TOAST tuples) could be pruned at any time.
+					 */
+					return true;	/* checkable */
+
+				case XID_ABORTED:
+					break;
 			}
 		}
-	}
-
-	if (!(infomask & HEAP_XMAX_INVALID) && !HEAP_XMAX_IS_LOCKED_ONLY(infomask))
-	{
-		if (infomask & HEAP_XMAX_IS_MULTI)
+		/* Used by pre-9.0 binary upgrades */
+		else if (tuphdr->t_infomask & HEAP_MOVED_IN)
 		{
-			XidCommitStatus status;
-			TransactionId xmax = HeapTupleGetUpdateXid(tuphdr);
+			xvac = HeapTupleHeaderGetXvac(tuphdr);
 
-			switch (get_xid_status(xmax, ctx, &status))
+			switch (get_xid_status(xvac, ctx, &xvac_status))
 			{
-					/* not LOCKED_ONLY, so it has to have an xmax */
 				case XID_INVALID:
 					report_corruption(ctx,
-									  pstrdup("xmax is invalid"));
+									  pstrdup("old-style VACUUM FULL transaction ID for moved in tuple is invalid"));
 					return false;	/* corrupt */
 				case XID_IN_FUTURE:
 					report_corruption(ctx,
-									  psprintf("xmax %u equals or exceeds next valid transaction ID %u:%u",
-											   xmax,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple equals or exceeds next valid transaction ID %u:%u",
+											   xvac,
 											   EpochFromFullTransactionId(ctx->next_fxid),
 											   XidFromFullTransactionId(ctx->next_fxid)));
 					return false;	/* corrupt */
 				case XID_PRECEDES_RELMIN:
 					report_corruption(ctx,
-									  psprintf("xmax %u precedes relation freeze threshold %u:%u",
-											   xmax,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple precedes relation freeze threshold %u:%u",
+											   xvac,
 											   EpochFromFullTransactionId(ctx->relfrozenfxid),
 											   XidFromFullTransactionId(ctx->relfrozenfxid)));
 					return false;	/* corrupt */
 				case XID_PRECEDES_CLUSTERMIN:
 					report_corruption(ctx,
-									  psprintf("xmax %u precedes oldest valid transaction ID %u:%u",
-											   xmax,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple precedes oldest valid transaction ID %u:%u",
+											   xvac,
 											   EpochFromFullTransactionId(ctx->oldest_fxid),
 											   XidFromFullTransactionId(ctx->oldest_fxid)));
 					return false;	/* corrupt */
 				case XID_BOUNDS_OK:
-					switch (status)
-					{
-						case XID_IN_PROGRESS:
-						case XID_IS_CURRENT_XID:
-							return true;	/* HEAPTUPLE_DELETE_IN_PROGRESS */
-						case XID_COMMITTED:
-						case XID_ABORTED:
-							return false;	/* HEAPTUPLE_RECENTLY_DEAD or
-											 * HEAPTUPLE_DEAD */
-					}
+					break;
 			}
 
-			/* Ok, the tuple is live */
+			switch (xvac_status)
+			{
+				case XID_IS_CURRENT_XID:
+					report_corruption(ctx,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple matches our current transaction ID",
+											   xvac));
+					return false;	/* corrupt */
+				case XID_IN_PROGRESS:
+					report_corruption(ctx,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple appears to be in progress",
+											   xvac));
+					return false;	/* corrupt */
+
+				case XID_COMMITTED:
+					break;
+
+				case XID_ABORTED:
+
+					/*
+					 * The VACUUM FULL aborted, so this tuple is dead and
+					 * could be vacuumed away at any time.  It's ok to check
+					 * the tuple because we have a buffer lock for the page,
+					 * but not safe to check the toast.
+					 */
+					return true;	/* checkable */
+			}
+		}
+		else if (xmin_status == XID_IS_CURRENT_XID)
+		{
+			/*
+			 * Don't check tuples from currently running transactions, not
+			 * even our own.
+			 */
+			return false;		/* checkable, but don't check */
+		}
+		else if (xmin_status == XID_IN_PROGRESS)
+		{
+			/* Don't check tuples from currently running transactions */
+			return false;		/* uncheckable */
+		}
+		else if (xmin_status != XID_COMMITTED)
+		{
+			/*
+			 * Inserting transaction is not in progress, and not committed, so
+			 * it either aborted or crashed. We cannot check.
+			 */
+			return false;		/* uncheckable */
 		}
-		else if (!(infomask & HEAP_XMAX_COMMITTED))
-			return true;		/* HEAPTUPLE_DELETE_IN_PROGRESS or
-								 * HEAPTUPLE_LIVE */
-		else
-			return false;		/* HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD */
 	}
-	return true;				/* not dead */
+
+	/*
+	 * Okay, the inserter committed, so it was good at some point.  Now what
+	 * about the deleting transaction?
+	 */
+
+	if (tuphdr->t_infomask & HEAP_XMAX_IS_MULTI)
+	{
+		/*
+		 * xmax is a multixact, so it should be within valid MXID range.  We
+		 * cannot safely look up the update xid if the multixact is out of
+		 * bounds, and must stop checking this tuple.
+		 */
+		xmax = HeapTupleHeaderGetRawXmax(tuphdr);
+		switch (check_mxid_valid_in_rel(xmax, ctx))
+		{
+			case XID_INVALID:
+				report_corruption(ctx,
+								  pstrdup("multitransaction ID is invalid"));
+				return false;	/* corrupt */
+			case XID_PRECEDES_RELMIN:
+				report_corruption(ctx,
+								  psprintf("multitransaction ID %u precedes relation minimum multitransaction ID threshold %u",
+										   xmax, ctx->relminmxid));
+				return false;	/* corrupt */
+			case XID_PRECEDES_CLUSTERMIN:
+				report_corruption(ctx,
+								  psprintf("multitransaction ID %u precedes oldest valid multitransaction ID threshold %u",
+										   xmax, ctx->oldest_mxact));
+				return false;	/* corrupt */
+			case XID_IN_FUTURE:
+				report_corruption(ctx,
+								  psprintf("multitransaction ID %u equals or exceeds next valid multitransaction ID %u",
+										   xmax,
+										   ctx->next_mxact));
+				return false;	/* corrupt */
+			case XID_BOUNDS_OK:
+				break;
+		}
+	}
+
+	if (tuphdr->t_infomask & HEAP_XMAX_INVALID)
+	{
+		/*
+		 * This tuple is live.  A concurrently running transaction could
+		 * delete it before we get around to checking the toast, but any such
+		 * running transaction is surely not less than our safe_xmin, so the
+		 * toast cannot be vacuumed out from under us.
+		 */
+		ctx->tuple_can_be_pruned = false;
+		return true;			/* checkable */
+	}
+
+	if (HEAP_XMAX_IS_LOCKED_ONLY(tuphdr->t_infomask))
+	{
+		/*
+		 * "Deleting" xact really only locked it, so the tuple is live in any
+		 * case.  As above, a concurrently running transaction could delete
+		 * it, but it cannot be vacuumed out from under us.
+		 */
+		ctx->tuple_can_be_pruned = false;
+		return true;			/* checkable */
+	}
+
+	if (tuphdr->t_infomask & HEAP_XMAX_IS_MULTI)
+	{
+		/*
+		 * We already checked above that this multixact is within limits for
+		 * this table.  Now check the update xid from this multixact.
+		 */
+		xmax = HeapTupleGetUpdateXid(tuphdr);
+		switch (get_xid_status(xmax, ctx, &xmax_status))
+		{
+				/* not LOCKED_ONLY, so it has to have an xmax */
+			case XID_INVALID:
+				report_corruption(ctx,
+								  pstrdup("update xid is invalid"));
+				return false;	/* corrupt */
+			case XID_IN_FUTURE:
+				report_corruption(ctx,
+								  psprintf("update xid %u equals or exceeds next valid transaction ID %u:%u",
+										   xmax,
+										   EpochFromFullTransactionId(ctx->next_fxid),
+										   XidFromFullTransactionId(ctx->next_fxid)));
+				return false;	/* corrupt */
+			case XID_PRECEDES_RELMIN:
+				report_corruption(ctx,
+								  psprintf("update xid %u precedes relation freeze threshold %u:%u",
+										   xmax,
+										   EpochFromFullTransactionId(ctx->relfrozenfxid),
+										   XidFromFullTransactionId(ctx->relfrozenfxid)));
+				return false;	/* corrupt */
+			case XID_PRECEDES_CLUSTERMIN:
+				report_corruption(ctx,
+								  psprintf("update xid %u precedes oldest valid transaction ID %u:%u",
+										   xmax,
+										   EpochFromFullTransactionId(ctx->oldest_fxid),
+										   XidFromFullTransactionId(ctx->oldest_fxid)));
+				return false;	/* corrupt */
+			case XID_BOUNDS_OK:
+				break;
+		}
+
+		switch (xmax_status)
+		{
+			case XID_IS_CURRENT_XID:
+			case XID_IN_PROGRESS:
+
+				/*
+				 * The delete is in progress, so it cannot be visible to our
+				 * snapshot.
+				 */
+				ctx->tuple_can_be_pruned = false;
+				return true;	/* checkable */
+			case XID_COMMITTED:
+
+				/*
+				 * The delete committed.  Whether the toast can be vacuumed
+				 * away depends on how old the deleting transaction is.
+				 */
+				ctx->tuple_can_be_pruned = TransactionIdPrecedes(xmax,
+																 ctx->safe_xmin);
+				return true;	/* checkable */
+			case XID_ABORTED:
+
+				/*
+				 * The delete aborted or crashed.  The tuple is still live.
+				 */
+				ctx->tuple_can_be_pruned = false;
+				return true;	/* checkable */
+		}
+	}
+
+	/*
+	 * The tuple is deleted.  Whether the toast can be vacuumed away depends
+	 * on how old the deleting transaction is.
+	 */
+	xmax = HeapTupleHeaderGetRawXmax(tuphdr);
+
+	switch (get_xid_status(xmax, ctx, &xmax_status))
+	{
+		case XID_IN_FUTURE:
+			report_corruption(ctx,
+							  psprintf("xmax %u equals or exceeds next valid transaction ID %u:%u",
+									   xmax,
+									   EpochFromFullTransactionId(ctx->next_fxid),
+									   XidFromFullTransactionId(ctx->next_fxid)));
+			return false;		/* corrupt */
+		case XID_PRECEDES_RELMIN:
+			report_corruption(ctx,
+							  psprintf("xmax %u precedes relation freeze threshold %u:%u",
+									   xmax,
+									   EpochFromFullTransactionId(ctx->relfrozenfxid),
+									   XidFromFullTransactionId(ctx->relfrozenfxid)));
+			return false;		/* corrupt */
+		case XID_PRECEDES_CLUSTERMIN:
+			report_corruption(ctx,
+							  psprintf("xmax %u precedes oldest valid transaction ID %u:%u",
+									   xmax,
+									   EpochFromFullTransactionId(ctx->oldest_fxid),
+									   XidFromFullTransactionId(ctx->oldest_fxid)));
+			return false;		/* corrupt */
+		case XID_BOUNDS_OK:
+		case XID_INVALID:
+			break;
+	}
+
+	switch (xmax_status)
+	{
+		case XID_IS_CURRENT_XID:
+		case XID_IN_PROGRESS:
+
+			/*
+			 * The delete is in progress, so it cannot be visible to our
+			 * snapshot.
+			 */
+			ctx->tuple_can_be_pruned = false;
+			return true;		/* checkable */
+		case XID_COMMITTED:
+
+			/*
+			 * The delete committed.  Whether the toast can be vacuumed away
+			 * depends on how old the deleting transaction is.
+			 */
+			ctx->tuple_can_be_pruned = TransactionIdPrecedes(xmax,
+															 ctx->safe_xmin);
+			return true;		/* checkable */
+		case XID_ABORTED:
+
+			/*
+			 * The delete aborted or crashed.  The tuple is still live.
+			 */
+			ctx->tuple_can_be_pruned = false;
+			return true;		/* checkable */
+	}
+
+	return false;				/* not reached */
 }
 
+
 /*
  * Check the current toast tuple against the state tracked in ctx, recording
  * any corruption found in ctx->tupstore.
-- 
2.21.1 (Apple Git-122.3)

Attachment: v14-0003-Checking-toast-separately-from-the-main-table.patch
From d412782c24fb42b8c3cd626823ff9e9ff1523092 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Tue, 30 Mar 2021 21:00:34 -0700
Subject: [PATCH v14 3/3] Checking toast separately from the main table.

Rather than checking toasted attributes as we find them, create a
list of them and check all the toast in the list after releasing
the buffer lock for each main table page.
---
 contrib/amcheck/verify_heapam.c           | 598 +++++++++++++---------
 src/bin/pg_amcheck/t/004_verify_heapam.pl |  66 ++-
 src/tools/pgindent/typedefs.list          |   1 +
 3 files changed, 430 insertions(+), 235 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index be22b491d6..254dc9f3a5 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -58,6 +58,26 @@ typedef enum SkipPages
 	SKIP_PAGES_NONE
 } SkipPages;
 
+/*
+ * Struct holding information necessary to check a toasted attribute, including
+ * the toast pointer, state about the current toast chunk being checked, and
+ * the location in the main table of the toasted attribute.  We have to track
+ * the tuple's location in the main table for reporting purposes because by the
+ * time the toast is checked our HeapCheckContext will no longer be pointing to
+ * the relevant tuple.
+ */
+typedef struct ToastCheckContext
+{
+	struct varatt_external toast_pointer;
+	BlockNumber blkno;			/* block in main table */
+	OffsetNumber offnum;		/* offset in main table */
+	AttrNumber	attnum;			/* attribute in main table */
+	int32		chunkno;		/* chunk number in toast table */
+	int32		attrsize;		/* size of toasted attribute */
+	int32		endchunk;		/* last chunk number in toast table */
+	int32		totalchunks;	/* total chunks in toast table */
+} ToastCheckContext;
+
 /*
  * Struct holding the running context information during
  * a lifetime of a verify_heapam execution.
@@ -119,11 +139,11 @@ typedef struct HeapCheckContext
 	/* True if toast for this tuple could be vacuumed away */
 	bool		tuple_can_be_pruned;
 
-	/* Values for iterating over toast for the attribute */
-	int32		chunkno;
-	int32		attrsize;
-	int32		endchunk;
-	int32		totalchunks;
+	/*
+	 * List of ToastCheckContext structs for toasted attributes which are not
+	 * in danger of being vacuumed away and should be checked
+	 */
+	List	   *toasted_attributes;
 
 	/* Whether verify_heapam has yet encountered any corrupt tuples */
 	bool		is_corrupt;
@@ -136,13 +156,18 @@ typedef struct HeapCheckContext
 /* Internal implementation */
 static void sanity_check_relation(Relation rel);
 static void check_tuple(HeapCheckContext *ctx);
-static void check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx);
+static int32 check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
+							   ToastCheckContext *tctx, bool *error);
 
-static bool check_tuple_attribute(HeapCheckContext *ctx);
 static bool check_tuple_header(HeapCheckContext *ctx);
 static bool check_tuple_visibility(HeapCheckContext *ctx);
 
+static bool check_tuple_attribute(HeapCheckContext *ctx);
+static void check_toasted_attributes(HeapCheckContext *ctx);
+
 static void report_corruption(HeapCheckContext *ctx, char *msg);
+static void report_toast_corruption(HeapCheckContext *ctx,
+									ToastCheckContext *tctx, char *msg);
 static TupleDesc verify_heapam_tupdesc(void);
 static FullTransactionId FullTransactionIdFromXidAndCtx(TransactionId xid,
 														const HeapCheckContext *ctx);
@@ -253,6 +278,7 @@ verify_heapam(PG_FUNCTION_ARGS)
 
 	memset(&ctx, 0, sizeof(HeapCheckContext));
 	ctx.cached_xid = InvalidTransactionId;
+	ctx.toasted_attributes = NIL;
 
 	/*
 	 * Any xmin newer than the xmin of our snapshot can't become all-visible
@@ -469,6 +495,14 @@ verify_heapam(PG_FUNCTION_ARGS)
 		/* clean up */
 		UnlockReleaseBuffer(ctx.buffer);
 
+		/*
+		 * Check any toast pointers from the page whose lock we just released
+		 * and reset the list to NIL.
+		 */
+		if (ctx.toasted_attributes != NIL)
+			check_toasted_attributes(&ctx);
+		Assert(ctx.toasted_attributes == NIL);
+
 		if (on_error_stop && ctx.is_corrupt)
 			break;
 	}
@@ -510,14 +544,13 @@ sanity_check_relation(Relation rel)
 }
 
 /*
- * Record a single corruption found in the table.  The values in ctx should
- * reflect the location of the corruption, and the msg argument should contain
- * a human-readable description of the corruption.
- *
- * The msg argument is pfree'd by this function.
+ * Shared internal implementation for report_corruption and
+ * report_toast_corruption.
  */
 static void
-report_corruption(HeapCheckContext *ctx, char *msg)
+report_corruption_internal(Tuplestorestate *tupstore, TupleDesc tupdesc,
+						   BlockNumber blkno, OffsetNumber offnum,
+						   AttrNumber attnum, char *msg)
 {
 	Datum		values[HEAPCHECK_RELATION_COLS];
 	bool		nulls[HEAPCHECK_RELATION_COLS];
@@ -525,10 +558,10 @@ report_corruption(HeapCheckContext *ctx, char *msg)
 
 	MemSet(values, 0, sizeof(values));
 	MemSet(nulls, 0, sizeof(nulls));
-	values[0] = Int64GetDatum(ctx->blkno);
-	values[1] = Int32GetDatum(ctx->offnum);
-	values[2] = Int32GetDatum(ctx->attnum);
-	nulls[2] = (ctx->attnum < 0);
+	values[0] = Int64GetDatum(blkno);
+	values[1] = Int32GetDatum(offnum);
+	values[2] = Int32GetDatum(attnum);
+	nulls[2] = (attnum < 0);
 	values[3] = CStringGetTextDatum(msg);
 
 	/*
@@ -541,8 +574,39 @@ report_corruption(HeapCheckContext *ctx, char *msg)
 	 */
 	pfree(msg);
 
-	tuple = heap_form_tuple(ctx->tupdesc, values, nulls);
-	tuplestore_puttuple(ctx->tupstore, tuple);
+	tuple = heap_form_tuple(tupdesc, values, nulls);
+	tuplestore_puttuple(tupstore, tuple);
+}
+
+/*
+ * Record a single corruption found in the main table.  The values in ctx should
+ * indicate the location of the corruption, and the msg argument should contain
+ * a human-readable description of the corruption.
+ *
+ * The msg argument is pfree'd by this function.
+ */
+static void
+report_corruption(HeapCheckContext *ctx, char *msg)
+{
+	report_corruption_internal(ctx->tupstore, ctx->tupdesc, ctx->blkno,
+							   ctx->offnum, ctx->attnum, msg);
+	ctx->is_corrupt = true;
+}
+
+/*
+ * Record corruption found in the toast table.  The values in tctx should
+ * indicate the location in the main table where the toast pointer was
+ * encountered, and the msg argument should contain a human-readable
+ * description of the toast table corruption.
+ *
+ * As above, the msg argument is pfree'd by this function.
+ */
+static void
+report_toast_corruption(HeapCheckContext *ctx, ToastCheckContext *tctx,
+						char *msg)
+{
+	report_corruption_internal(ctx->tupstore, ctx->tupdesc, tctx->blkno,
+							   tctx->offnum, tctx->attnum, msg);
 	ctx->is_corrupt = true;
 }
 
@@ -1085,7 +1149,6 @@ check_tuple_visibility(HeapCheckContext *ctx)
 	return false;				/* not reached */
 }
 
-
 /*
  * Check the current toast tuple against the state tracked in ctx, recording
  * any corruption found in ctx->tupstore.
@@ -1097,8 +1160,9 @@ check_tuple_visibility(HeapCheckContext *ctx)
  * each toast tuple being checked against where we are in the sequence, as well
  * as each toast tuple having its varlena structure sanity checked.
  */
-static void
-check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
+static int32
+check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
+				  ToastCheckContext *tctx, bool *error)
 {
 	int32		curchunk;
 	Pointer		chunk;
@@ -1113,17 +1177,21 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 										 ctx->toast_rel->rd_att, &isnull));
 	if (isnull)
 	{
-		report_corruption(ctx,
-						  pstrdup("toast chunk sequence number is null"));
-		return;
+		report_toast_corruption(ctx, tctx,
+								psprintf("toast value %u has toast chunk with null sequence number",
+										 tctx->toast_pointer.va_valueid));
+		*error = true;
+		return 0;
 	}
 	chunk = DatumGetPointer(fastgetattr(toasttup, 3,
 										ctx->toast_rel->rd_att, &isnull));
 	if (isnull)
 	{
-		report_corruption(ctx,
-						  pstrdup("toast chunk data is null"));
-		return;
+		report_toast_corruption(ctx, tctx,
+								psprintf("toast value %u chunk data is null",
+										 tctx->toast_pointer.va_valueid));
+		*error = true;
+		return 0;
 	}
 	if (!VARATT_IS_EXTENDED(chunk))
 		chunksize = VARSIZE(chunk) - VARHDRSZ;
@@ -1139,39 +1207,49 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 		/* should never happen */
 		uint32		header = ((varattrib_4b *) chunk)->va_4byte.va_header;
 
-		report_corruption(ctx,
-						  psprintf("corrupt extended toast chunk has invalid varlena header: %0x (sequence number %d)",
-								   header, curchunk));
-		return;
+		report_toast_corruption(ctx, tctx,
+								psprintf("toast value %u corrupt extended chunk has invalid varlena header: %0x (sequence number %d)",
+										 tctx->toast_pointer.va_valueid,
+										 header, curchunk));
+		*error = true;
+		return 0;
 	}
 
 	/*
 	 * Some checks on the data we've found
 	 */
-	if (curchunk != ctx->chunkno)
+	if (curchunk != tctx->chunkno)
 	{
-		report_corruption(ctx,
-						  psprintf("toast chunk sequence number %u does not match the expected sequence number %u",
-								   curchunk, ctx->chunkno));
-		return;
+		report_toast_corruption(ctx, tctx,
+								psprintf("toast value %u chunk sequence number %u does not match the expected sequence number %u",
+										 tctx->toast_pointer.va_valueid,
+										 curchunk, tctx->chunkno));
+		*error = true;
+		return chunksize;
 	}
-	if (curchunk > ctx->endchunk)
+	if (curchunk > tctx->endchunk)
 	{
-		report_corruption(ctx,
-						  psprintf("toast chunk sequence number %u exceeds the end chunk sequence number %u",
-								   curchunk, ctx->endchunk));
-		return;
+		report_toast_corruption(ctx, tctx,
+								psprintf("toast value %u chunk sequence number %u exceeds the end chunk sequence number %u",
+										 tctx->toast_pointer.va_valueid,
+										 curchunk, tctx->endchunk));
+		*error = true;
+		return chunksize;
 	}
 
-	expected_size = curchunk < ctx->totalchunks - 1 ? TOAST_MAX_CHUNK_SIZE
-		: ctx->attrsize - ((ctx->totalchunks - 1) * TOAST_MAX_CHUNK_SIZE);
+	expected_size = curchunk < tctx->totalchunks - 1 ? TOAST_MAX_CHUNK_SIZE
+		: tctx->attrsize - ((tctx->totalchunks - 1) * TOAST_MAX_CHUNK_SIZE);
 	if (chunksize != expected_size)
 	{
-		report_corruption(ctx,
-						  psprintf("toast chunk size %u differs from the expected size %u",
-								   chunksize, expected_size));
-		return;
+		report_toast_corruption(ctx, tctx,
+								psprintf("toast value %u chunk size %u differs from the expected size %u",
+										 tctx->toast_pointer.va_valueid,
+										 chunksize, expected_size));
+		*error = true;
+		return chunksize;
 	}
+
+	return chunksize;
 }
 
 /*
@@ -1179,17 +1257,17 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
  * found in ctx->tupstore.
  *
  * This function follows the logic performed by heap_deform_tuple(), and in the
- * case of a toasted value, optionally continues along the logic of
- * detoast_external_attr(), checking for any conditions that would result in
- * either of those functions Asserting or crashing the backend.  The checks
- * performed by Asserts present in those two functions are also performed here.
- * In cases where those two functions are a bit cavalier in their assumptions
- * about data being correct, we perform additional checks not present in either
- * of those two functions.  Where some condition is checked in both of those
- * functions, we perform it here twice, as we parallel the logical flow of
- * those two functions.  The presence of duplicate checks seems a reasonable
- * price to pay for keeping this code tightly coupled with the code it
- * protects.
+ * case of a toasted value, optionally stores the toast pointer so later it can
+ * be checked following the logic of detoast_external_attr(), checking for any
+ * conditions that would result in either of those functions Asserting or
+ * crashing the backend.  The checks performed by Asserts present in those two
+ * functions are also performed here and in check_toasted_attributes.  In cases
+ * where those two functions are a bit cavalier in their assumptions about data
+ * being correct, we perform additional checks not present in either of those
+ * two functions.  Where some condition is checked in both of those functions,
+ * we perform it here twice, as we parallel the logical flow of those two
+ * functions.  The presence of duplicate checks seems a reasonable price to pay
+ * for keeping this code tightly coupled with the code it protects.
  *
  * Returns true if the tuple attribute is sane enough for processing to
  * continue on to the next attribute, false otherwise.
@@ -1197,17 +1275,12 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 static bool
 check_tuple_attribute(HeapCheckContext *ctx)
 {
-	struct varatt_external toast_pointer;
-	ScanKeyData toastkey;
-	SysScanDesc toastscan;
-	SnapshotData SnapshotToast;
-	HeapTuple	toasttup;
-	bool		found_toasttup;
 	Datum		attdatum;
 	struct varlena *attr;
 	char	   *tp;				/* pointer to the tuple data */
 	uint16		infomask;
 	Form_pg_attribute thisatt;
+	struct varatt_external toast_pointer;
 
 	infomask = ctx->tuphdr->t_infomask;
 	thisatt = TupleDescAttr(RelationGetDescr(ctx->rel), ctx->attnum);
@@ -1271,8 +1344,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 		if (va_tag != VARTAG_ONDISK)
 		{
 			report_corruption(ctx,
-							  psprintf("toasted attribute %u has unexpected TOAST tag %u",
-									   ctx->attnum,
+							  psprintf("toasted attribute has unexpected TOAST tag %u",
 									   va_tag));
 			/* We can't know where the next attribute begins */
 			return false;
@@ -1286,8 +1358,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
 	{
 		report_corruption(ctx,
-						  psprintf("attribute %u with length %u ends at offset %u beyond total tuple length %u",
-								   ctx->attnum,
+						  psprintf("attribute with length %u ends at offset %u beyond total tuple length %u",
 								   thisatt->attlen,
 								   ctx->tuphdr->t_hoff + ctx->offset,
 								   ctx->lp_len));
@@ -1314,12 +1385,17 @@ check_tuple_attribute(HeapCheckContext *ctx)
 
 	/* It is external, and we're looking at a page on disk */
 
+	/*
+	 * Must copy attr into toast_pointer for alignment considerations
+	 */
+	VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+
 	/* The tuple header better claim to contain toasted values */
 	if (!(infomask & HEAP_HASEXTERNAL))
 	{
 		report_corruption(ctx,
-						  psprintf("attribute %u is external but tuple header flag HEAP_HASEXTERNAL not set",
-								   ctx->attnum));
+						  psprintf("toast value %u is external but tuple header flag HEAP_HASEXTERNAL not set",
+								   toast_pointer.va_valueid));
 		return true;
 	}
 
@@ -1327,8 +1403,28 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	if (!ctx->rel->rd_rel->reltoastrelid)
 	{
 		report_corruption(ctx,
-						  psprintf("attribute %u is external but relation has no toast relation",
-								   ctx->attnum));
+						  psprintf("toast value %u is external but relation has no toast relation",
+								   toast_pointer.va_valueid));
+		return true;
+	}
+
+	if (VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer) > toast_pointer.va_rawsize - VARHDRSZ)
+	{
+		report_corruption(ctx,
+						  psprintf("toast value %u external size %u exceeds maximum expected for rawsize %u",
+								   toast_pointer.va_valueid,
+								   VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer),
+								   toast_pointer.va_rawsize));
+		return true;
+	}
+
+	if (toast_pointer.va_toastrelid != ctx->rel->rd_rel->reltoastrelid)
+	{
+		report_corruption(ctx,
+						  psprintf("toast value %u toast relation oid %u differs from expected oid %u",
+								   toast_pointer.va_valueid,
+								   toast_pointer.va_toastrelid,
+								   ctx->rel->rd_rel->reltoastrelid));
 		return true;
 	}
 
@@ -1337,191 +1433,231 @@ check_tuple_attribute(HeapCheckContext *ctx)
 		return true;
 
 	/*
-	 * Must copy attr into toast_pointer for alignment considerations
+	 * If this tuple is at risk of being vacuumed away, we cannot check the
+	 * toast.  Otherwise, we push a copy of the toast pointer so we can
+	 * check it after releasing the main table buffer lock.
 	 */
-	VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
-
-	ctx->attrsize = VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer);
-	ctx->endchunk = (ctx->attrsize - 1) / TOAST_MAX_CHUNK_SIZE;
-	ctx->totalchunks = ctx->endchunk + 1;
+	if (!ctx->tuple_can_be_pruned)
+	{
+		ToastCheckContext *tctx;
 
-	/*
-	 * Setup a scan key to find chunks in toast table with matching va_valueid
-	 */
-	ScanKeyInit(&toastkey,
-				(AttrNumber) 1,
-				BTEqualStrategyNumber, F_OIDEQ,
-				ObjectIdGetDatum(toast_pointer.va_valueid));
+		tctx = (ToastCheckContext *) palloc0fast(sizeof(ToastCheckContext));
 
-	/*
-	 * Check if any chunks for this toasted object exist in the toast table,
-	 * accessible via the index.
-	 */
-	init_toast_snapshot(&SnapshotToast);
-	toastscan = systable_beginscan_ordered(ctx->toast_rel,
-										   ctx->valid_toast_index,
-										   &SnapshotToast, 1,
-										   &toastkey);
-	ctx->chunkno = 0;
-	found_toasttup = false;
-	while ((toasttup =
-			systable_getnext_ordered(toastscan,
-									 ForwardScanDirection)) != NULL)
-	{
-		found_toasttup = true;
-		check_toast_tuple(toasttup, ctx);
-		ctx->chunkno++;
+		VARATT_EXTERNAL_GET_POINTER(tctx->toast_pointer, attr);
+		tctx->blkno = ctx->blkno;
+		tctx->offnum = ctx->offnum;
+		tctx->attnum = ctx->attnum;
+		ctx->toasted_attributes = lappend(ctx->toasted_attributes, tctx);
 	}
-	if (!found_toasttup)
-		report_corruption(ctx,
-						  psprintf("toasted value for attribute %u missing from toast table",
-								   ctx->attnum));
-	else if (ctx->chunkno != (ctx->endchunk + 1))
-		report_corruption(ctx,
-						  psprintf("final toast chunk number %u differs from expected value %u",
-								   ctx->chunkno, (ctx->endchunk + 1)));
-	systable_endscan_ordered(toastscan);
 
 	return true;
 }
 
 /*
- * Check the current tuple as tracked in ctx, recording any corruption found in
- * ctx->tupstore.
+ * For each attribute collected in ctx->toasted_attributes, look up the value
+ * in the toast table and perform checks on it.  This function should only be
+ * called on toast pointers which cannot be vacuumed away during our
+ * processing.
  */
 static void
-check_tuple(HeapCheckContext *ctx)
+check_toasted_attributes(HeapCheckContext *ctx)
 {
-	TransactionId xmin;
-	TransactionId xmax;
-	bool		fatal = false;
-	uint16		infomask = ctx->tuphdr->t_infomask;
+	ListCell   *cell;
 
-	/* If xmin is normal, it should be within valid range */
-	xmin = HeapTupleHeaderGetXmin(ctx->tuphdr);
-	switch (get_xid_status(xmin, ctx, NULL))
+	foreach(cell, ctx->toasted_attributes)
 	{
-		case XID_INVALID:
-		case XID_BOUNDS_OK:
-			break;
-		case XID_IN_FUTURE:
-			report_corruption(ctx,
-							  psprintf("xmin %u equals or exceeds next valid transaction ID %u:%u",
-									   xmin,
-									   EpochFromFullTransactionId(ctx->next_fxid),
-									   XidFromFullTransactionId(ctx->next_fxid)));
-			fatal = true;
-			break;
-		case XID_PRECEDES_CLUSTERMIN:
-			report_corruption(ctx,
-							  psprintf("xmin %u precedes oldest valid transaction ID %u:%u",
-									   xmin,
-									   EpochFromFullTransactionId(ctx->oldest_fxid),
-									   XidFromFullTransactionId(ctx->oldest_fxid)));
-			fatal = true;
-			break;
-		case XID_PRECEDES_RELMIN:
-			report_corruption(ctx,
-							  psprintf("xmin %u precedes relation freeze threshold %u:%u",
-									   xmin,
-									   EpochFromFullTransactionId(ctx->relfrozenfxid),
-									   XidFromFullTransactionId(ctx->relfrozenfxid)));
-			fatal = true;
-			break;
-	}
+		ToastCheckContext *tctx;
+		SnapshotData SnapshotToast;
+		ScanKeyData toastkey;
+		SysScanDesc toastscan;
+		int64		toastsize;	/* corrupt toast could overflow 32 bits */
+		bool		found_toasttup;
+		bool		toast_error;
+		HeapTuple	toasttup;
+
+		tctx = lfirst(cell);
+		tctx->attrsize = VARATT_EXTERNAL_GET_EXTSIZE(tctx->toast_pointer);
+		tctx->endchunk = (tctx->attrsize - 1) / TOAST_MAX_CHUNK_SIZE;
+		tctx->totalchunks = tctx->endchunk + 1;
 
-	xmax = HeapTupleHeaderGetRawXmax(ctx->tuphdr);
+		/*
+		 * Setup a scan key to find chunks in toast table with matching
+		 * va_valueid
+		 */
+		ScanKeyInit(&toastkey,
+					(AttrNumber) 1,
+					BTEqualStrategyNumber, F_OIDEQ,
+					ObjectIdGetDatum(tctx->toast_pointer.va_valueid));
 
-	if (infomask & HEAP_XMAX_IS_MULTI)
-	{
-		/* xmax is a multixact, so it should be within valid MXID range */
-		switch (check_mxid_valid_in_rel(xmax, ctx))
-		{
-			case XID_INVALID:
-				report_corruption(ctx,
-								  pstrdup("multitransaction ID is invalid"));
-				fatal = true;
-				break;
-			case XID_PRECEDES_RELMIN:
-				report_corruption(ctx,
-								  psprintf("multitransaction ID %u precedes relation minimum multitransaction ID threshold %u",
-										   xmax, ctx->relminmxid));
-				fatal = true;
-				break;
-			case XID_PRECEDES_CLUSTERMIN:
-				report_corruption(ctx,
-								  psprintf("multitransaction ID %u precedes oldest valid multitransaction ID threshold %u",
-										   xmax, ctx->oldest_mxact));
-				fatal = true;
-				break;
-			case XID_IN_FUTURE:
-				report_corruption(ctx,
-								  psprintf("multitransaction ID %u equals or exceeds next valid multitransaction ID %u",
-										   xmax,
-										   ctx->next_mxact));
-				fatal = true;
-				break;
-			case XID_BOUNDS_OK:
-				break;
-		}
-	}
-	else
-	{
 		/*
-		 * xmax is not a multixact and is normal, so it should be within the
-		 * valid XID range.
+		 * Check if any chunks for this toasted object exist in the toast
+		 * table, accessible via the index.
 		 */
-		switch (get_xid_status(xmax, ctx, NULL))
+		init_toast_snapshot(&SnapshotToast);
+		toastscan = systable_beginscan_ordered(ctx->toast_rel,
+											   ctx->valid_toast_index,
+											   &SnapshotToast, 1,
+											   &toastkey);
+		tctx->chunkno = 0;
+		found_toasttup = false;
+		toastsize = 0;
+		while ((toasttup =
+				systable_getnext_ordered(toastscan,
+										 ForwardScanDirection)) != NULL)
 		{
-			case XID_INVALID:
-			case XID_BOUNDS_OK:
-				break;
-			case XID_IN_FUTURE:
-				report_corruption(ctx,
-								  psprintf("xmax %u equals or exceeds next valid transaction ID %u:%u",
-										   xmax,
-										   EpochFromFullTransactionId(ctx->next_fxid),
-										   XidFromFullTransactionId(ctx->next_fxid)));
-				fatal = true;
-				break;
-			case XID_PRECEDES_CLUSTERMIN:
-				report_corruption(ctx,
-								  psprintf("xmax %u precedes oldest valid transaction ID %u:%u",
-										   xmax,
-										   EpochFromFullTransactionId(ctx->oldest_fxid),
-										   XidFromFullTransactionId(ctx->oldest_fxid)));
-				fatal = true;
-				break;
-			case XID_PRECEDES_RELMIN:
-				report_corruption(ctx,
-								  psprintf("xmax %u precedes relation freeze threshold %u:%u",
-										   xmax,
-										   EpochFromFullTransactionId(ctx->relfrozenfxid),
-										   XidFromFullTransactionId(ctx->relfrozenfxid)));
-				fatal = true;
+			found_toasttup = true;
+			toastsize += check_toast_tuple(toasttup, ctx, tctx, &toast_error);
+			tctx->chunkno++;
+		}
+		systable_endscan_ordered(toastscan);
+
+		if (!found_toasttup)
+			report_toast_corruption(ctx, tctx,
+									psprintf("toast value %u not found in toast table",
+											 tctx->toast_pointer.va_valueid));
+		else if (tctx->chunkno != (tctx->endchunk + 1))
+			report_toast_corruption(ctx, tctx,
+									psprintf("toast value %u was expected to end at chunk %u, but ended at chunk %u",
+											 tctx->toast_pointer.va_valueid,
+											 (tctx->endchunk + 1), tctx->chunkno));
+		else if (toastsize != VARATT_EXTERNAL_GET_EXTSIZE(tctx->toast_pointer))
+			report_toast_corruption(ctx, tctx,
+									psprintf("toast value %u total size " INT64_FORMAT " differs from expected size %u",
+											 tctx->toast_pointer.va_valueid, toastsize,
+											 VARATT_EXTERNAL_GET_EXTSIZE(tctx->toast_pointer)));
+		else if (!toast_error)
+		{
+			if (!AllocSizeIsValid(tctx->toast_pointer.va_rawsize))
+			{
+				report_toast_corruption(ctx, tctx,
+										psprintf("toast value %u rawsize %u too large to be allocated",
+												 tctx->toast_pointer.va_valueid,
+												 tctx->toast_pointer.va_rawsize));
+				toast_error = true;
+			}
+
+			if (!AllocSizeIsValid(VARATT_EXTERNAL_GET_EXTSIZE(tctx->toast_pointer)))
+			{
+				report_toast_corruption(ctx, tctx,
+										psprintf("toast value %u extsize %u too large to be allocated",
+												 tctx->toast_pointer.va_valueid,
+												 VARATT_EXTERNAL_GET_EXTSIZE(tctx->toast_pointer)));
+				toast_error = true;
+			}
+
+			if (!toast_error)
+			{
+				Size		allocsize;
+				struct varlena *attr;
+
+				/* Fetch all chunks */
+				allocsize = VARATT_EXTERNAL_GET_EXTSIZE(tctx->toast_pointer) + VARHDRSZ;
+				attr = (struct varlena *) palloc(allocsize);
+				if (VARATT_EXTERNAL_IS_COMPRESSED(tctx->toast_pointer))
+					SET_VARSIZE_COMPRESSED(attr, allocsize);
+				else
+					SET_VARSIZE(attr, allocsize);
+
+				table_relation_fetch_toast_slice(ctx->toast_rel, tctx->toast_pointer.va_valueid,
+												 toastsize, 0, toastsize, attr);
+
+				if (VARATT_IS_COMPRESSED(attr))
+				{
+#ifdef DECOMPRESSION_CORRUPTION_CHECKING
+					struct varlena *uncompressed;
+					int32		rawsize;
+#endif
+					Size		allocsize;
+					ToastCompressionId cmid;
+
+					/* allocate memory for the uncompressed data */
+					allocsize = VARDATA_COMPRESSED_GET_EXTSIZE(attr) + VARHDRSZ;
+					if (!AllocSizeIsValid(allocsize))
+						report_toast_corruption(ctx, tctx,
+												psprintf("toast value %u invalid uncompressed size %zu",
+														 tctx->toast_pointer.va_valueid,
+														 allocsize));
+					cmid = TOAST_COMPRESS_METHOD(attr);
+					switch (cmid)
+					{
+						case TOAST_PGLZ_COMPRESSION_ID:
+#ifdef DECOMPRESSION_CORRUPTION_CHECKING
+							/* decompress the data */
+							uncompressed = (struct varlena *) palloc(allocsize);
+							rawsize = pglz_decompress((char *) attr + VARHDRSZ_COMPRESSED,
+													  VARSIZE(attr) - VARHDRSZ_COMPRESSED,
+													  VARDATA(uncompressed),
+													  VARDATA_COMPRESSED_GET_EXTSIZE(attr), true);
+							if (rawsize < 0)
+								report_toast_corruption(ctx, tctx,
+														psprintf("toast value %u compressed pglz data is corrupt",
+																 tctx->toast_pointer.va_valueid));
+							pfree(uncompressed);
+#endif
+							break;
+						case TOAST_LZ4_COMPRESSION_ID:
+#ifndef USE_LZ4
+							report_toast_corruption(ctx, tctx,
+													psprintf("toast value %u unsupported LZ4 compression method",
+															 tctx->toast_pointer.va_valueid));
+#else
+#ifdef DECOMPRESSION_CORRUPTION_CHECKING
+							/* decompress the data */
+							uncompressed = (struct varlena *) palloc(allocsize);
+							rawsize = LZ4_decompress_safe((char *) attr + VARHDRSZ_COMPRESSED,
+														  VARDATA(uncompressed),
+														  VARSIZE(attr) - VARHDRSZ_COMPRESSED,
+														  VARDATA_COMPRESSED_GET_EXTSIZE(attr));
+							if (rawsize < 0)
+								report_toast_corruption(ctx, tctx,
+														psprintf("toast value %u compressed lz4 data is corrupt",
+																 tctx->toast_pointer.va_valueid));
+							pfree(uncompressed);
+#endif
+#endif
+							break;
+						default:
+							report_toast_corruption(ctx, tctx,
+													psprintf("toast value %u invalid compression method id %d",
+															 tctx->toast_pointer.va_valueid,
+															 cmid));
+					}
+				}
+				else if (VARSIZE(attr) != tctx->toast_pointer.va_rawsize)
+					report_toast_corruption(ctx, tctx,
+											psprintf("toast value %u detoasted attribute size %u differs from expected rawsize %u",
+													 tctx->toast_pointer.va_valueid,
+													 VARSIZE(attr),
+													 tctx->toast_pointer.va_rawsize));
+				pfree(attr);
+			}
 		}
+		pfree(tctx);
 	}
 
-	/*
-	 * Cannot process tuple data if tuple header was corrupt, as the offsets
-	 * within the page cannot be trusted, leaving too much risk of reading
-	 * garbage if we continue.
-	 *
-	 * We also cannot process the tuple if the xmin or xmax were invalid
-	 * relative to relfrozenxid or relminmxid, as clog entries for the xids
-	 * may already be gone.
-	 */
-	if (fatal)
-		return;
+	list_free(ctx->toasted_attributes);
+	ctx->toasted_attributes = NIL;
+}
 
+/*
+ * Check the current tuple as tracked in ctx, recording any corruption found in
+ * ctx->tupstore.
+ */
+static void
+check_tuple(HeapCheckContext *ctx)
+{
 	/*
 	 * Check various forms of tuple header corruption.  If the header is too
-	 * corrupt to continue checking, or if the tuple is not visible to anyone,
-	 * we cannot continue with other checks.
+	 * corrupt to continue checking, we cannot continue with other checks.
 	 */
 	if (!check_tuple_header(ctx))
 		return;
 
+	/*
+	 * Check tuple visibility.  If the inserting transaction aborted, we
+	 * cannot assume our relation description matches the tuple structure, and
+	 * therefore cannot check it.
+	 */
 	if (!check_tuple_visibility(ctx))
 		return;
 
@@ -1544,6 +1680,10 @@ check_tuple(HeapCheckContext *ctx)
 	 * next, at which point we abort further attribute checks for this tuple.
 	 * Note that we don't abort for all types of corruption, only for those
 	 * types where we don't know how to continue.
+	 *
+	 * While checking the tuple attributes, we build a list of toast pointers
+	 * we encounter, to be checked later.  If further attribute checking is
+	 * aborted, we still have the pointers collected prior to aborting.
 	 */
 	ctx->offset = 0;
 	for (ctx->attnum = 0; ctx->attnum < ctx->natts; ctx->attnum++)
diff --git a/src/bin/pg_amcheck/t/004_verify_heapam.pl b/src/bin/pg_amcheck/t/004_verify_heapam.pl
index 36607596b1..33e5de51bf 100644
--- a/src/bin/pg_amcheck/t/004_verify_heapam.pl
+++ b/src/bin/pg_amcheck/t/004_verify_heapam.pl
@@ -224,7 +224,7 @@ my $rel = $node->safe_psql('postgres', qq(SELECT pg_relation_filepath('public.te
 my $relpath = "$pgdata/$rel";
 
 # Insert data and freeze public.test
-use constant ROWCOUNT => 16;
+use constant ROWCOUNT => 21;
 $node->safe_psql('postgres', qq(
 	INSERT INTO public.test (a, b, c)
 		VALUES (
@@ -259,6 +259,13 @@ select lp_off from heap_page_items(get_raw_page('test', 'main', 0))
 	offset $tup limit 1)));
 }
 
+# Find our toast relation id
+my $toastrelid = $node->safe_psql('postgres', qq(
+	SELECT c.reltoastrelid
+		FROM pg_catalog.pg_class c
+		WHERE c.oid = 'public.test'::regclass
+		));
+
 # Sanity check that our 'test' table on disk layout matches expectations.  If
 # this is not so, we will have to skip the test until somebody updates the test
 # to work on this platform.
@@ -296,7 +303,7 @@ close($file)
 $node->start;
 
 # Ok, Xids and page layout look ok.  We can run corruption tests.
-plan tests => 19;
+plan tests => 24;
 
 # Check that pg_amcheck runs against the uncorrupted table without error.
 $node->command_ok(['pg_amcheck', '-p', $port, 'postgres'],
@@ -310,6 +317,7 @@ $node->stop;
 
 # Some #define constants from access/htup_details.h for use while corrupting.
 use constant HEAP_HASNULL            => 0x0001;
+use constant HEAP_HASEXTERNAL        => 0x0004;
 use constant HEAP_XMAX_LOCK_ONLY     => 0x0080;
 use constant HEAP_XMIN_COMMITTED     => 0x0100;
 use constant HEAP_XMIN_INVALID       => 0x0200;
@@ -362,7 +370,7 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 		push @expected,
 			qr/${header}xmin $xmin precedes relation freeze threshold 0:\d+/;
 	}
-	if ($offnum == 2)
+	elsif ($offnum == 2)
 	{
 		# Corruptly set xmin < datfrozenxid
 		my $xmin = 3;
@@ -480,7 +488,7 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 
 		$header = header(0, $offnum, 1);
 		push @expected,
-			qr/${header}attribute \d+ with length \d+ ends at offset \d+ beyond total tuple length \d+/;
+			qr/${header}attribute with length \d+ ends at offset \d+ beyond total tuple length \d+/;
 	}
 	elsif ($offnum == 13)
 	{
@@ -489,9 +497,18 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 
 		$header = header(0, $offnum, 2);
 		push @expected,
-			qr/${header}toasted value for attribute 2 missing from toast table/;
+			qr/${header}toast value \d+ not found in toast table/;
 	}
 	elsif ($offnum == 14)
+	{
+		# Corrupt infomask to claim there are no external attributes, which conflicts
+		# with column 'c' which is toasted
+		$tup->{t_infomask} &= ~HEAP_HASEXTERNAL;
+		$header = header(0, $offnum, 2);
+		push @expected,
+			qr/${header}toast value \d+ is external but tuple header flag HEAP_HASEXTERNAL not set/;
+	}
+	elsif ($offnum == 15)
 	{
 		# Set both HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI
 		$tup->{t_infomask} |= HEAP_XMAX_COMMITTED;
@@ -501,7 +518,7 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 		push @expected,
 			qr/${header}multitransaction ID 4 equals or exceeds next valid multitransaction ID 1/;
 	}
-	elsif ($offnum == 15)	# Last offnum must equal ROWCOUNT
+	elsif ($offnum == 16)
 	{
 		# Set both HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI
 		$tup->{t_infomask} |= HEAP_XMAX_COMMITTED;
@@ -511,6 +528,43 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 		push @expected,
 			qr/${header}multitransaction ID 4000000000 precedes relation minimum multitransaction ID threshold 1/;
 	}
+	elsif ($offnum == 17)
+	{
+		# Corrupt column c's toast pointer va_vartag field
+		$tup->{c_va_vartag} = 42;
+		$header = header(0, $offnum, 2);
+		push @expected,
+			qr/${header}toasted attribute has unexpected TOAST tag 42/;
+	}
+	elsif ($offnum == 18)
+	{
+		# Corrupt column c's toast pointer va_extinfo field
+		$tup->{c_va_extinfo} = 7654321;
+		$header = header(0, $offnum, 2);
+		push @expected,
+			qr/${header}toast value \d+ external size 7654321 exceeds maximum expected for rawsize 10004/;
+	}
+	elsif ($offnum == 19)
+	{
+		# Corrupt column c's toast pointer va_valueid field.  We have not
+		# consumed enough oids for any valueid in the toast table to be large.
+		# Use a large oid for the corruption to avoid colliding with an
+		# existing entry in the toast table.
+		my $corrupt = $tup->{c_va_valueid} + 100000000;
+		$tup->{c_va_valueid} = $corrupt;
+		$header = header(0, $offnum, 2);
+		push @expected,
+			qr/${header}toast value \d+ not found in toast table/;
+	}
+	elsif ($offnum == 20)	# Last offnum must be less than or equal to ROWCOUNT-1
+	{
+		# Corrupt column c's toast pointer va_toastrelid field
+		my $otherid = $toastrelid + 1;
+		$tup->{c_va_toastrelid} = $otherid;
+		$header = header(0, $offnum, 2);
+		push @expected,
+			qr/${header}toast value \d+ toast relation oid $otherid differs from expected oid $toastrelid/;
+	}
 	write_tuple($file, $offset, $tup);
 }
 close($file)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9e6777e9d0..0ce261e2a2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2557,6 +2557,7 @@ TimestampTz
 TmFromChar
 TmToChar
 ToastAttrInfo
+ToastCheckContext
 ToastTupleContext
 TocEntry
 TokenAuxData
-- 
2.21.1 (Apple Git-122.3)

#110 Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#109)
Re: pg_amcheck contrib application

On Wed, Mar 31, 2021 at 12:34 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

I'm not looking at the old VACUUM FULL code, but my assumption is that if the xvac code were resurrected, then when a tuple is moved off by a VACUUM FULL, the old tuple and associated toast cannot be pruned until concurrent transactions end. So, if amcheck is running more-or-less concurrently with the VACUUM FULL and has a snapshot xmin no newer than the xid of the VACUUM FULL's xid, it can check the toast associated with the moved off tuple after the VACUUM FULL commits. If instead the VACUUM FULL xid was older than amcheck's xmin, then the toast is in danger of being vacuumed away. So the logic in verify_heapam would need to change to think about this distinction. We don't have to concern ourselves about that, because VACUUM FULL cannot be running, and so the xid for it must be older than our xmin, and hence the toast is unconditionally not safe to check.

I'm changing the comments back to how you had them, but I'd like to know why my reasoning is wrong.

Let's start by figuring out *whether* your reasoning is wrong. My
assumption was that old-style VACUUM FULL would move tuples without
retoasting. That is, if we decided to move a tuple from page 2 of the
main table to page 1, we would just write the tuple into page 1,
marking it moved-in, and on page 2 we would mark it moved-off. And
that we would not examine or follow any TOAST pointers at all, so
whatever TOAST entries existed would be shared between the two tuples.
One tuple or the other would eventually die, depending on whether xvac
went on to commit or abort, but either way the TOAST doesn't need
updating because there's always exactly 1 remaining tuple using
pointers to those TOAST values.

Your assumption seems to be the opposite, that the TOASTed values
would be retoasted as part of VF. If that is true, then your idea is
right.

Do you agree with this analysis? If so, we can check the code and find
out which way it actually worked.

--
Robert Haas
EDB: http://www.enterprisedb.com

#111 Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#110)
Re: pg_amcheck contrib application

On Mar 31, 2021, at 10:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 31, 2021 at 12:34 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

I'm not looking at the old VACUUM FULL code, but my assumption is that if the xvac code were resurrected, then when a tuple is moved off by a VACUUM FULL, the old tuple and associated toast cannot be pruned until concurrent transactions end. So, if amcheck is running more-or-less concurrently with the VACUUM FULL and has a snapshot xmin no newer than the xid of the VACUUM FULL's xid, it can check the toast associated with the moved off tuple after the VACUUM FULL commits. If instead the VACUUM FULL xid was older than amcheck's xmin, then the toast is in danger of being vacuumed away. So the logic in verify_heapam would need to change to think about this distinction. We don't have to concern ourselves about that, because VACUUM FULL cannot be running, and so the xid for it must be older than our xmin, and hence the toast is unconditionally not safe to check.

I'm changing the comments back to how you had them, but I'd like to know why my reasoning is wrong.

Let's start by figuring out *whether* your reasoning is wrong. My
assumption was that old-style VACUUM FULL would move tuples without
retoasting. That is, if we decided to move a tuple from page 2 of the
main table to page 1, we would just write the tuple into page 1,
marking it moved-in, and on page 2 we would mark it moved-off. And
that we would not examine or follow any TOAST pointers at all, so
whatever TOAST entries existed would be shared between the two tuples.
One tuple or the other would eventually die, depending on whether xvac
went on to commit or abort, but either way the TOAST doesn't need
updating because there's always exactly 1 remaining tuple using
pointers to those TOAST values.

Your assumption seems to be the opposite, that the TOASTed values
would be retoasted as part of VF. If that is true, then your idea is
right.

Do you agree with this analysis? If so, we can check the code and find
out which way it actually worked.

Actually, that makes a lot of sense without even looking at the old code. I was implicitly assuming that the toast table was undergoing a VF also, and that the toast pointers in the main table tuples would be updated to point to the new location, so we'd be unable to follow the pointers to the old location without danger of the old location entries being vacuumed away. But if the main table tuples get moved while keeping their toast pointers unaltered, then you don't have to worry about that, although you do have to worry that a VF of the main table doesn't help so much with toast table bloat.

We're only discussing this in order to craft the right comment for a bit of code with respect to a hypothetical situation in which VF gets resurrected, so I'm not sure this should be top priority, but I'm curious enough now to go read the old code....


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#112 Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#111)
Re: pg_amcheck contrib application

On Wed, Mar 31, 2021 at 1:31 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

Actually, that makes a lot of sense without even looking at the old code. I was implicitly assuming that the toast table was undergoing a VF also, and that the toast pointers in the main table tuples would be updated to point to the new location, so we'd be unable to follow the pointers to the old location without danger of the old location entries being vacuumed away. But if the main table tuples get moved while keeping their toast pointers unaltered, then you don't have to worry about that, although you do have to worry that a VF of the main table doesn't help so much with toast table bloat.

We're only discussing this in order to craft the right comment for a bit of code with respect to a hypothetical situation in which VF gets resurrected, so I'm not sure this should be top priority, but I'm curious enough now to go read the old code....

Right, well, we wouldn't be PostgreSQL hackers if we didn't spend lots
of time worrying about obscure details. Whether that's good software
engineering or mere pedantry is sometimes debatable.

I took a look at commit 0a469c87692d15a22eaa69d4b3a43dd8e278dd64,
which removed old-style VACUUM FULL, and AFAICS, it doesn't contain
any references to tuple deforming, varlena, HeapTupleHasExternal, or
anything else that would make me think it has the foggiest idea
whether the tuples it's moving around contain TOAST pointers, so I
think I had the right idea.

--
Robert Haas
EDB: http://www.enterprisedb.com

#113 Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Mark Dilger (#111)
Re: pg_amcheck contrib application

On Mar 31, 2021, at 10:31 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

On Mar 31, 2021, at 10:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 31, 2021 at 12:34 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

I'm not looking at the old VACUUM FULL code, but my assumption is that if the xvac code were resurrected, then when a tuple is moved off by a VACUUM FULL, the old tuple and associated toast cannot be pruned until concurrent transactions end. So, if amcheck is running more-or-less concurrently with the VACUUM FULL and has a snapshot xmin no newer than the xid of the VACUUM FULL's xid, it can check the toast associated with the moved off tuple after the VACUUM FULL commits. If instead the VACUUM FULL xid was older than amcheck's xmin, then the toast is in danger of being vacuumed away. So the logic in verify_heapam would need to change to think about this distinction. We don't have to concern ourselves about that, because VACUUM FULL cannot be running, and so the xid for it must be older than our xmin, and hence the toast is unconditionally not safe to check.

I'm changing the comments back to how you had them, but I'd like to know why my reasoning is wrong.

Let's start by figuring out *whether* your reasoning is wrong. My
assumption was that old-style VACUUM FULL would move tuples without
retoasting. That is, if we decided to move a tuple from page 2 of the
main table to page 1, we would just write the tuple into page 1,
marking it moved-in, and on page 2 we would mark it moved-off. And
that we would not examine or follow any TOAST pointers at all, so
whatever TOAST entries existed would be shared between the two tuples.
One tuple or the other would eventually die, depending on whether xvac
went on to commit or abort, but either way the TOAST doesn't need
updating because there's always exactly 1 remaining tuple using
pointers to those TOAST values.

Your assumption seems to be the opposite, that the TOASTed values
would be retoasted as part of VF. If that is true, then your idea is
right.

Do you agree with this analysis? If so, we can check the code and find
out which way it actually worked.

Actually, that makes a lot of sense without even looking at the old code. I was implicitly assuming that the toast table was undergoing a VF also, and that the toast pointers in the main table tuples would be updated to point to the new location, so we'd be unable to follow the pointers to the old location without danger of the old location entries being vacuumed away. But if the main table tuples get moved while keeping their toast pointers unaltered, then you don't have to worry about that, although you do have to worry that a VF of the main table doesn't help so much with toast table bloat.

We're only discussing this in order to craft the right comment for a bit of code with respect to a hypothetical situation in which VF gets resurrected, so I'm not sure this should be top priority, but I'm curious enough now to go read the old code....

Well, that's annoying. The documentation of postgres 8.2 for vacuum full [1] says,

Selects "full" vacuum, which may reclaim more space, but takes much longer and exclusively locks the table.

I read "exclusively locks" as meaning it takes an ExclusiveLock, but the code shows that it takes an AccessExclusiveLock. I think the docs are pretty misleading here, though I understand that grammatically it is hard to say "accessively exclusively locks" or such. But a part of my analysis was based on the reasoning that if VF only takes an ExclusiveLock, then there must be concurrent readers possible. VF went away long enough ago that I had forgotten exactly how inconvenient it was.

[1]: https://www.postgresql.org/docs/8.2/sql-vacuum.html


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#114 Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#113)
Re: pg_amcheck contrib application

On Wed, Mar 31, 2021 at 1:44 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

I read "exclusively locks" as meaning it takes an ExclusiveLock, but the code shows that it takes an AccessExclusiveLock. I think the docs are pretty misleading here, though I understand that grammatically it is hard to say "accessively exclusively locks" or such. But a part of my analysis was based on the reasoning that if VF only takes an ExclusiveLock, then there must be concurrent readers possible. VF went away long enough ago that I had forgotten exactly how inconvenient it was.

It kinda depends on what you mean by concurrent readers, because a
transaction could start on Monday and acquire an XID, and then on
Tuesday you could run VACUUM FULL on relation "foo", and then on
Wednesday the transaction from before could get around to reading some
data from "foo". The two transactions are concurrent, in the sense
that the 3-day transaction was running before the VACUUM FULL, was
still running after VACUUM FULL, read the same pages that the VACUUM
FULL modified, and cares whether the XID from the VACUUM FULL
committed or aborted. But, it's not concurrent in the sense that you
never have a situation where the VACUUM FULL does some of its
modifications, then an overlapping transaction sees them, and then it
does the rest of them.

--
Robert Haas
EDB: http://www.enterprisedb.com

#115 Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#109)
1 attachment(s)
Re: pg_amcheck contrib application

On Wed, Mar 31, 2021 at 12:34 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

These changes got lost between v11 and v12. I've put them back, as well as updating to use your language.

Here's an updated patch that includes your 0001 and 0002 plus a bunch
of changes by me. I intend to commit this version unless somebody
spots a problem with it.

Here are the things I changed:

- I renamed tuple_can_be_pruned to tuple_could_be_pruned because I
think it does a better job suggesting that we're uncertain about what
will happen.

- I got rid of bool header_garbled and changed it to bool result,
inverting the sense, because it didn't seem useful to have a function
that ended with if (some_boolean) return false; return true when I
could end the function with return some_other_boolean.

- I removed all the one-word comments that said /* corrupt */ or /*
checkable */ because they seemed redundant.

- In the xmin_status section of check_tuple_visibility(), I got rid of
the xmin_status == XID_IS_CURRENT_XID and xmin_status ==
XID_IN_PROGRESS cases because they were redundant with the xmin_status
!= XID_COMMITTED case.

- If xmax is a multi but seems to be garbled, I changed it to return
true rather than false. The inserter is known to have committed by
that point, so I think it's OK to try to deform the tuple. We just
shouldn't try to check TOAST.

- I changed both switches over xmax_status to break in each case and
then return true after instead of returning true for each case. I
think this is clearer.

- I changed get_xid_status() to perform the TransactionIdIs... checks
in the same order that HeapTupleSatisfies...() does them. I believe
that it's incorrect to conclude that the transaction must be in
progress because it neither IsCurrentTransaction nor DidCommit nor
DidAbort, because all of those things will be false for a transaction
that is running at the time of a system crash. The correct rule is
that if it neither IsCurrentTransaction nor IsInProgress nor DidCommit
then it's aborted.

- I moved a few comments and rewrote some others, including some of
the ones that you took from my earlier draft patch. The thing is, the
comment needs to be adjusted based on where you put it. Like, I had a
comment that says"It should be impossible for xvac to still be
running, since we've removed all that code, but even if it were, it
ought to be safe to read the tuple, since the original inserter must
have committed. But, if the xvac transaction committed, this tuple
(and its associated TOAST tuples) could be pruned at any time." which
in my version was right before a TransactionIdDidCommit() test, and
explains why that test is there and why the code does what it does as
a result. But in your version you've moved it to a place where we've
already tested that the transaction has committed, and more
importantly, where we've already tested that it's not still running.
Saying that it "should" be impossible for it to still be running when
we've *actually checked that it isn't* doesn't make nearly as much
sense as it does when we haven't checked that and aren't going to do so.
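The get_xid_status() ordering point above can be illustrated with a standalone sketch. The boolean arguments stand in for TransactionIdIsCurrentTransactionId(), TransactionIdIsInProgress(), and TransactionIdDidCommit(); this is only a picture of the ordering rule, not the actual amcheck code:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
    XID_COMMITTED,
    XID_IS_CURRENT_XID,
    XID_IN_PROGRESS,
    XID_ABORTED
} XidCommitStatus;

/*
 * Classify a transaction in the same order HeapTupleSatisfies...() does.
 * The order matters: a transaction that was running at the time of a
 * system crash is neither current, nor in progress, nor committed, and
 * clog never records an explicit abort for it.  So "aborted" must be the
 * fall-through case, never something concluded from a failed DidAbort test.
 */
static XidCommitStatus
classify_xid(bool is_current, bool in_progress, bool did_commit)
{
    if (is_current)
        return XID_IS_CURRENT_XID;
    if (in_progress)
        return XID_IN_PROGRESS;
    if (did_commit)
        return XID_COMMITTED;
    return XID_ABORTED;
}
```

A crashed transaction answers false to all three predicates and correctly falls through to XID_ABORTED.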

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

v15-0001-amcheck-Fix-verify_heapam-s-tuple-visibility-che.patch (application/octet-stream)
From 0106c7907be0db0040f144d080a8ef0b4ff8b5e0 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 1 Apr 2021 11:04:47 -0400
Subject: [PATCH v15] amcheck: Fix verify_heapam's tuple visibility checking
 rules.

We now follow the order of checks from HeapTupleSatisfies* more
closely to avoid coming to erroneous conclusions.

Mark Dilger and Robert Haas
---
 contrib/amcheck/verify_heapam.c | 555 ++++++++++++++++++++++++--------
 1 file changed, 414 insertions(+), 141 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 6f972e630a..ac898cf53a 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -46,6 +46,7 @@ typedef enum XidBoundsViolation
 typedef enum XidCommitStatus
 {
 	XID_COMMITTED,
+	XID_IS_CURRENT_XID,
 	XID_IN_PROGRESS,
 	XID_ABORTED
 } XidCommitStatus;
@@ -72,6 +73,8 @@ typedef struct HeapCheckContext
 	TransactionId oldest_xid;	/* ShmemVariableCache->oldestXid */
 	FullTransactionId oldest_fxid;	/* 64-bit version of oldest_xid, computed
 									 * relative to next_fxid */
+	TransactionId safe_xmin;	/* this XID and newer ones can't become
+								 * all-visible while we're running */
 
 	/*
 	 * Cached copy of value from MultiXactState
@@ -113,6 +116,9 @@ typedef struct HeapCheckContext
 	uint32		offset;			/* offset in tuple data */
 	AttrNumber	attnum;
 
+	/* True if tuple's xmax makes it eligible for pruning */
+	bool		tuple_could_be_pruned;
+
 	/* Values for iterating over toast for the attribute */
 	int32		chunkno;
 	int32		attrsize;
@@ -133,8 +139,8 @@ static void check_tuple(HeapCheckContext *ctx);
 static void check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx);
 
 static bool check_tuple_attribute(HeapCheckContext *ctx);
-static bool check_tuple_header_and_visibilty(HeapTupleHeader tuphdr,
-											 HeapCheckContext *ctx);
+static bool check_tuple_header(HeapCheckContext *ctx);
+static bool check_tuple_visibility(HeapCheckContext *ctx);
 
 static void report_corruption(HeapCheckContext *ctx, char *msg);
 static TupleDesc verify_heapam_tupdesc(void);
@@ -248,6 +254,12 @@ verify_heapam(PG_FUNCTION_ARGS)
 	memset(&ctx, 0, sizeof(HeapCheckContext));
 	ctx.cached_xid = InvalidTransactionId;
 
+	/*
+	 * Any xmin newer than the xmin of our snapshot can't become all-visible
+	 * while we're running.
+	 */
+	ctx.safe_xmin = GetTransactionSnapshot()->xmin;
+
 	/*
 	 * If we report corruption when not examining some individual attribute,
 	 * we need attnum to be reported as NULL.  Set that up before any
@@ -555,16 +567,11 @@ verify_heapam_tupdesc(void)
 }
 
 /*
- * Check for tuple header corruption and tuple visibility.
- *
- * Since we do not hold a snapshot, tuple visibility is not a question of
- * whether we should be able to see the tuple relative to any particular
- * snapshot, but rather a question of whether it is safe and reasonable to
- * check the tuple attributes.
+ * Check for tuple header corruption.
  *
  * Some kinds of corruption make it unsafe to check the tuple attributes, for
  * example when the line pointer refers to a range of bytes outside the page.
- * In such cases, we return false (not visible) after recording appropriate
+ * In such cases, we return false (not checkable) after recording appropriate
  * corruption messages.
  *
  * Some other kinds of tuple header corruption confuse the question of where
@@ -576,29 +583,18 @@ verify_heapam_tupdesc(void)
  *
  * Other kinds of tuple header corruption do not bear on the question of
  * whether the tuple attributes can be checked, so we record corruption
- * messages for them but do not base our visibility determination on them.  (In
- * other words, we do not return false merely because we detected them.)
- *
- * For visibility determination not specifically related to corruption, what we
- * want to know is if a tuple is potentially visible to any running
- * transaction.  If you are tempted to replace this function's visibility logic
- * with a call to another visibility checking function, keep in mind that this
- * function does not update hint bits, as it seems imprudent to write hint bits
- * (or anything at all) to a table during a corruption check.  Nor does this
- * function bother classifying tuple visibility beyond a boolean visible vs.
- * not visible.
- *
- * The caller should already have checked that xmin and xmax are not out of
- * bounds for the relation.
+ * messages for them but we do not return false merely because we detected
+ * them.
  *
- * Returns whether the tuple is both visible and sufficiently sensible to
- * undergo attribute checks.
+ * Returns whether the tuple is sufficiently sensible to undergo visibility and
+ * attribute checks.
  */
 static bool
-check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
+check_tuple_header(HeapCheckContext *ctx)
 {
+	HeapTupleHeader tuphdr = ctx->tuphdr;
 	uint16		infomask = tuphdr->t_infomask;
-	bool		header_garbled = false;
+	bool		result = true;
 	unsigned	expected_hoff;
 
 	if (ctx->tuphdr->t_hoff > ctx->lp_len)
@@ -606,7 +602,7 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
 		report_corruption(ctx,
 						  psprintf("data begins at offset %u beyond the tuple length %u",
 								   ctx->tuphdr->t_hoff, ctx->lp_len));
-		header_garbled = true;
+		result = false;
 	}
 
 	if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) &&
@@ -616,9 +612,9 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
 						  pstrdup("multixact should not be marked committed"));
 
 		/*
-		 * This condition is clearly wrong, but we do not consider the header
-		 * garbled, because we don't rely on this property for determining if
-		 * the tuple is visible or for interpreting other relevant header
+		 * This condition is clearly wrong, but it's not enough to justify
+		 * skipping further checks, because we don't rely on this to determine
+		 * whether the tuple is visible or to interpret other relevant header
 		 * fields.
 		 */
 	}
@@ -645,175 +641,449 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
 			report_corruption(ctx,
 							  psprintf("tuple data should begin at byte %u, but actually begins at byte %u (%u attributes, no nulls)",
 									   expected_hoff, ctx->tuphdr->t_hoff, ctx->natts));
-		header_garbled = true;
+		result = false;
 	}
 
-	if (header_garbled)
-		return false;			/* checking of this tuple should not continue */
+	return result;
+}
+
+/*
+ * Checks tuple visibility so we know which further checks are safe to
+ * perform.
+ *
+ * If a tuple could have been inserted by a transaction that also added a
+ * column to the table, but which ultimately did not commit, or which has not
+ * yet committed, then the table's current TupleDesc might differ from the one
+ * used to construct this tuple, so we must not check it.
+ *
+ * As a special case, if our own transaction inserted the tuple, even if we
+ * added a column to the table, our TupleDesc should match.  We could check the
+ * tuple, but choose not to do so.
+ *
+ * If a tuple has been updated or deleted, we can still read the old tuple for
+ * corruption checking purposes, as long as we are careful about concurrent
+ * vacuums.  The main table tuple itself cannot be vacuumed away because we
+ * hold a buffer lock on the page, but if the deleting transaction is older
+ * than our transaction snapshot's xmin, then vacuum could remove the toast at
+ * any time, so we must not try to follow TOAST pointers.
+ *
+ * If xmin or xmax values are older than can be checked against clog, or appear
+ * to be in the future (possibly due to wrap-around), then we cannot make a
+ * determination about the visibility of the tuple, so we skip further checks.
+ *
+ * Returns true if the tuple itself should be checked, false otherwise.  Sets
+ * ctx->tuple_could_be_pruned if the tuple -- and thus also any associated
+ * TOAST tuples -- are eligible for pruning.
+ */
+static bool
+check_tuple_visibility(HeapCheckContext *ctx)
+{
+	TransactionId xmin;
+	TransactionId xvac;
+	TransactionId xmax;
+	XidCommitStatus xmin_status;
+	XidCommitStatus xvac_status;
+	XidCommitStatus xmax_status;
+	HeapTupleHeader tuphdr = ctx->tuphdr;
+
+	ctx->tuple_could_be_pruned = true;	/* have not yet proven otherwise */
+
+	/* If xmin is normal, it should be within valid range */
+	xmin = HeapTupleHeaderGetXmin(tuphdr);
+	switch (get_xid_status(xmin, ctx, &xmin_status))
+	{
+		case XID_INVALID:
+		case XID_BOUNDS_OK:
+			break;
+		case XID_IN_FUTURE:
+			report_corruption(ctx,
+							  psprintf("xmin %u equals or exceeds next valid transaction ID %u:%u",
+									   xmin,
+									   EpochFromFullTransactionId(ctx->next_fxid),
+									   XidFromFullTransactionId(ctx->next_fxid)));
+			return false;
+		case XID_PRECEDES_CLUSTERMIN:
+			report_corruption(ctx,
+							  psprintf("xmin %u precedes oldest valid transaction ID %u:%u",
+									   xmin,
+									   EpochFromFullTransactionId(ctx->oldest_fxid),
+									   XidFromFullTransactionId(ctx->oldest_fxid)));
+			return false;
+		case XID_PRECEDES_RELMIN:
+			report_corruption(ctx,
+							  psprintf("xmin %u precedes relation freeze threshold %u:%u",
+									   xmin,
+									   EpochFromFullTransactionId(ctx->relfrozenfxid),
+									   XidFromFullTransactionId(ctx->relfrozenfxid)));
+			return false;
+	}
 
 	/*
-	 * Ok, we can examine the header for tuple visibility purposes, though we
-	 * still need to be careful about a few remaining types of header
-	 * corruption.  This logic roughly follows that of
-	 * HeapTupleSatisfiesVacuum.  Where possible the comments indicate which
-	 * HTSV_Result we think that function might return for this tuple.
+	 * Has inserting transaction committed?
 	 */
 	if (!HeapTupleHeaderXminCommitted(tuphdr))
 	{
-		TransactionId raw_xmin = HeapTupleHeaderGetRawXmin(tuphdr);
-
 		if (HeapTupleHeaderXminInvalid(tuphdr))
-			return false;		/* HEAPTUPLE_DEAD */
+			return false;		/* inserter aborted, don't check */
 		/* Used by pre-9.0 binary upgrades */
-		else if (infomask & HEAP_MOVED_OFF ||
-				 infomask & HEAP_MOVED_IN)
+		else if (tuphdr->t_infomask & HEAP_MOVED_OFF)
 		{
-			XidCommitStatus status;
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuphdr);
+			xvac = HeapTupleHeaderGetXvac(tuphdr);
 
-			switch (get_xid_status(xvac, ctx, &status))
+			switch (get_xid_status(xvac, ctx, &xvac_status))
 			{
 				case XID_INVALID:
 					report_corruption(ctx,
-									  pstrdup("old-style VACUUM FULL transaction ID is invalid"));
-					return false;	/* corrupt */
+									  pstrdup("old-style VACUUM FULL transaction ID for moved off tuple is invalid"));
+					return false;
 				case XID_IN_FUTURE:
 					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u equals or exceeds next valid transaction ID %u:%u",
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple equals or exceeds next valid transaction ID %u:%u",
 											   xvac,
 											   EpochFromFullTransactionId(ctx->next_fxid),
 											   XidFromFullTransactionId(ctx->next_fxid)));
-					return false;	/* corrupt */
+					return false;
 				case XID_PRECEDES_RELMIN:
 					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u precedes relation freeze threshold %u:%u",
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple precedes relation freeze threshold %u:%u",
 											   xvac,
 											   EpochFromFullTransactionId(ctx->relfrozenfxid),
 											   XidFromFullTransactionId(ctx->relfrozenfxid)));
-					return false;	/* corrupt */
-					break;
+					return false;
 				case XID_PRECEDES_CLUSTERMIN:
 					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u precedes oldest valid transaction ID %u:%u",
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple precedes oldest valid transaction ID %u:%u",
 											   xvac,
 											   EpochFromFullTransactionId(ctx->oldest_fxid),
 											   XidFromFullTransactionId(ctx->oldest_fxid)));
-					return false;	/* corrupt */
-					break;
+					return false;
 				case XID_BOUNDS_OK:
-					switch (status)
-					{
-						case XID_IN_PROGRESS:
-							return true;	/* HEAPTUPLE_DELETE_IN_PROGRESS */
-						case XID_COMMITTED:
-						case XID_ABORTED:
-							return false;	/* HEAPTUPLE_DEAD */
-					}
+					break;
+			}
+
+			switch (xvac_status)
+			{
+				case XID_IS_CURRENT_XID:
+					report_corruption(ctx,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple matches our current transaction ID",
+											   xvac));
+					return false;
+				case XID_IN_PROGRESS:
+					report_corruption(ctx,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple appears to be in progress",
+											   xvac));
+					return false;
+
+				case XID_COMMITTED:
+					/*
+					 * The tuple is dead, because the xvac transaction moved
+					 * it off and committed. It's checkable, but also prunable.
+					 */
+					return true;
+
+				case XID_ABORTED:
+					/*
+					 * The original xmin must have committed, because the xvac
+					 * transaction tried to move it later. Since xvac is
+					 * aborted, whether it's still alive now depends on the
+					 * status of xmax.
+					 */
+					break;
 			}
 		}
-		else
+		/* Used by pre-9.0 binary upgrades */
+		else if (tuphdr->t_infomask & HEAP_MOVED_IN)
 		{
-			XidCommitStatus status;
+			xvac = HeapTupleHeaderGetXvac(tuphdr);
 
-			switch (get_xid_status(raw_xmin, ctx, &status))
+			switch (get_xid_status(xvac, ctx, &xvac_status))
 			{
 				case XID_INVALID:
 					report_corruption(ctx,
-									  pstrdup("raw xmin is invalid"));
+									  pstrdup("old-style VACUUM FULL transaction ID for moved in tuple is invalid"));
 					return false;
 				case XID_IN_FUTURE:
 					report_corruption(ctx,
-									  psprintf("raw xmin %u equals or exceeds next valid transaction ID %u:%u",
-											   raw_xmin,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple equals or exceeds next valid transaction ID %u:%u",
+											   xvac,
 											   EpochFromFullTransactionId(ctx->next_fxid),
 											   XidFromFullTransactionId(ctx->next_fxid)));
-					return false;	/* corrupt */
+					return false;
 				case XID_PRECEDES_RELMIN:
 					report_corruption(ctx,
-									  psprintf("raw xmin %u precedes relation freeze threshold %u:%u",
-											   raw_xmin,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple precedes relation freeze threshold %u:%u",
+											   xvac,
 											   EpochFromFullTransactionId(ctx->relfrozenfxid),
 											   XidFromFullTransactionId(ctx->relfrozenfxid)));
-					return false;	/* corrupt */
+					return false;
 				case XID_PRECEDES_CLUSTERMIN:
 					report_corruption(ctx,
-									  psprintf("raw xmin %u precedes oldest valid transaction ID %u:%u",
-											   raw_xmin,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple precedes oldest valid transaction ID %u:%u",
+											   xvac,
 											   EpochFromFullTransactionId(ctx->oldest_fxid),
 											   XidFromFullTransactionId(ctx->oldest_fxid)));
-					return false;	/* corrupt */
+					return false;
 				case XID_BOUNDS_OK:
-					switch (status)
-					{
-						case XID_COMMITTED:
-							break;
-						case XID_IN_PROGRESS:
-							return true;	/* insert or delete in progress */
-						case XID_ABORTED:
-							return false;	/* HEAPTUPLE_DEAD */
-					}
+					break;
 			}
+
+			switch (xvac_status)
+			{
+				case XID_IS_CURRENT_XID:
+					report_corruption(ctx,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple matches our current transaction ID",
+											   xvac));
+					return false;
+				case XID_IN_PROGRESS:
+					report_corruption(ctx,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple appears to be in progress",
+											   xvac));
+					return false;
+
+				case XID_COMMITTED:
+					/*
+					 * The original xmin must have committed, because the xvac
+					 * transaction moved it later. Whether it's still alive
+					 * now depends on the status of xmax.
+					 */
+					break;
+
+				case XID_ABORTED:
+					/*
+					 * The tuple is dead, because the xvac transaction that
+					 * moved it in aborted. It's checkable, but also prunable.
+					 */
+					return true;
+			}
+		}
+		else if (xmin_status != XID_COMMITTED)
+		{
+			/*
+			 * Inserting transaction is not in progress, and not committed, so
+			 * it might have changed the TupleDesc in ways we don't know about.
+			 * Thus, don't try to check the tuple structure.
+			 *
+			 * If xmin_status happens to be XID_IN_PROGRESS, then in theory
+			 * any such DDL changes ought to be visible to us, so perhaps
+			 * we could check anyway in that case. But, for now, let's be
+			 * conservative and treat this like any other uncommitted insert.
+			 */
+			return false;
 		}
 	}
 
-	if (!(infomask & HEAP_XMAX_INVALID) && !HEAP_XMAX_IS_LOCKED_ONLY(infomask))
+	/*
+	 * Okay, the inserter committed, so it was good at some point.  Now what
+	 * about the deleting transaction?
+	 */
+
+	if (tuphdr->t_infomask & HEAP_XMAX_IS_MULTI)
 	{
-		if (infomask & HEAP_XMAX_IS_MULTI)
+		/*
+		 * xmax is a multixact, so sanity-check the MXID. Note that we do this
+		 * prior to checking for HEAP_XMAX_INVALID or HEAP_XMAX_IS_LOCKED_ONLY.
+		 * This might therefore complain about things that wouldn't actually
+		 * be a problem during a normal scan, but eventually we're going to
+		 * have to freeze, and that process will ignore hint bits.
+		 *
+		 * Even if the MXID is out of range, we still know that the original
+		 * insert committed, so we can check the tuple itself. However, we
+		 * can't rule out the possibility that this tuple is dead, so don't
+		 * clear ctx->tuple_could_be_pruned. Possibly we should go ahead and
+		 * clear that flag anyway if HEAP_XMAX_INVALID is set or if
+		 * HEAP_XMAX_IS_LOCKED_ONLY is true, but for now we err on the side
+		 * of avoiding possibly-bogus complaints about missing TOAST entries.
+		 */
+		xmax = HeapTupleHeaderGetRawXmax(tuphdr);
+		switch (check_mxid_valid_in_rel(xmax, ctx))
 		{
-			XidCommitStatus status;
-			TransactionId xmax = HeapTupleGetUpdateXid(tuphdr);
+			case XID_INVALID:
+				report_corruption(ctx,
+								  pstrdup("multitransaction ID is invalid"));
+				return true;
+			case XID_PRECEDES_RELMIN:
+				report_corruption(ctx,
+								  psprintf("multitransaction ID %u precedes relation minimum multitransaction ID threshold %u",
+										   xmax, ctx->relminmxid));
+				return true;
+			case XID_PRECEDES_CLUSTERMIN:
+				report_corruption(ctx,
+								  psprintf("multitransaction ID %u precedes oldest valid multitransaction ID threshold %u",
+										   xmax, ctx->oldest_mxact));
+				return true;
+			case XID_IN_FUTURE:
+				report_corruption(ctx,
+								  psprintf("multitransaction ID %u equals or exceeds next valid multitransaction ID %u",
+										   xmax,
+										   ctx->next_mxact));
+				return true;
+			case XID_BOUNDS_OK:
+				break;
+		}
+	}
 
-			switch (get_xid_status(xmax, ctx, &status))
-			{
-					/* not LOCKED_ONLY, so it has to have an xmax */
-				case XID_INVALID:
-					report_corruption(ctx,
-									  pstrdup("xmax is invalid"));
-					return false;	/* corrupt */
-				case XID_IN_FUTURE:
-					report_corruption(ctx,
-									  psprintf("xmax %u equals or exceeds next valid transaction ID %u:%u",
-											   xmax,
-											   EpochFromFullTransactionId(ctx->next_fxid),
-											   XidFromFullTransactionId(ctx->next_fxid)));
-					return false;	/* corrupt */
-				case XID_PRECEDES_RELMIN:
-					report_corruption(ctx,
-									  psprintf("xmax %u precedes relation freeze threshold %u:%u",
-											   xmax,
-											   EpochFromFullTransactionId(ctx->relfrozenfxid),
-											   XidFromFullTransactionId(ctx->relfrozenfxid)));
-					return false;	/* corrupt */
-				case XID_PRECEDES_CLUSTERMIN:
-					report_corruption(ctx,
-									  psprintf("xmax %u precedes oldest valid transaction ID %u:%u",
-											   xmax,
-											   EpochFromFullTransactionId(ctx->oldest_fxid),
-											   XidFromFullTransactionId(ctx->oldest_fxid)));
-					return false;	/* corrupt */
-				case XID_BOUNDS_OK:
-					switch (status)
-					{
-						case XID_IN_PROGRESS:
-							return true;	/* HEAPTUPLE_DELETE_IN_PROGRESS */
-						case XID_COMMITTED:
-						case XID_ABORTED:
-							return false;	/* HEAPTUPLE_RECENTLY_DEAD or
-											 * HEAPTUPLE_DEAD */
-					}
-			}
+	if (tuphdr->t_infomask & HEAP_XMAX_INVALID)
+	{
+		/*
+		 * This tuple is live.  A concurrently running transaction could
+		 * delete it before we get around to checking the toast, but any such
+		 * running transaction is surely not less than our safe_xmin, so the
+		 * toast cannot be vacuumed out from under us.
+		 */
+		ctx->tuple_could_be_pruned = false;
+		return true;
+	}
 
-			/* Ok, the tuple is live */
+	if (HEAP_XMAX_IS_LOCKED_ONLY(tuphdr->t_infomask))
+	{
+		/*
+		 * "Deleting" xact really only locked it, so the tuple is live in any
+		 * case.  As above, a concurrently running transaction could delete
+		 * it, but it cannot be vacuumed out from under us.
+		 */
+		ctx->tuple_could_be_pruned = false;
+		return true;
+	}
+
+	if (tuphdr->t_infomask & HEAP_XMAX_IS_MULTI)
+	{
+		/*
+		 * We already checked above that this multixact is within limits for
+		 * this table.  Now check the update xid from this multixact.
+		 */
+		xmax = HeapTupleGetUpdateXid(tuphdr);
+		switch (get_xid_status(xmax, ctx, &xmax_status))
+		{
+			case XID_INVALID:
+				/* not LOCKED_ONLY, so it has to have an xmax */
+				report_corruption(ctx,
+								  pstrdup("update xid is invalid"));
+				return true;
+			case XID_IN_FUTURE:
+				report_corruption(ctx,
+								  psprintf("update xid %u equals or exceeds next valid transaction ID %u:%u",
+										   xmax,
+										   EpochFromFullTransactionId(ctx->next_fxid),
+										   XidFromFullTransactionId(ctx->next_fxid)));
+				return true;
+			case XID_PRECEDES_RELMIN:
+				report_corruption(ctx,
+								  psprintf("update xid %u precedes relation freeze threshold %u:%u",
+										   xmax,
+										   EpochFromFullTransactionId(ctx->relfrozenfxid),
+										   XidFromFullTransactionId(ctx->relfrozenfxid)));
+				return true;
+			case XID_PRECEDES_CLUSTERMIN:
+				report_corruption(ctx,
+								  psprintf("update xid %u precedes oldest valid transaction ID %u:%u",
+										   xmax,
+										   EpochFromFullTransactionId(ctx->oldest_fxid),
+										   XidFromFullTransactionId(ctx->oldest_fxid)));
+				return true;
+			case XID_BOUNDS_OK:
+				break;
 		}
-		else if (!(infomask & HEAP_XMAX_COMMITTED))
-			return true;		/* HEAPTUPLE_DELETE_IN_PROGRESS or
-								 * HEAPTUPLE_LIVE */
-		else
-			return false;		/* HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD */
+
+		switch (xmax_status)
+		{
+			case XID_IS_CURRENT_XID:
+			case XID_IN_PROGRESS:
+
+				/*
+				 * The delete is in progress, so it cannot be visible to our
+				 * snapshot.
+				 */
+				ctx->tuple_could_be_pruned = false;
+				break;
+			case XID_COMMITTED:
+
+				/*
+				 * The delete committed.  Whether the toast can be vacuumed
+				 * away depends on how old the deleting transaction is.
+				 */
+				ctx->tuple_could_be_pruned = TransactionIdPrecedes(xmax,
+																 ctx->safe_xmin);
+				break;
+			case XID_ABORTED:
+				/*
+				 * The delete aborted or crashed.  The tuple is still live.
+				 */
+				ctx->tuple_could_be_pruned = false;
+				break;
+		}
+
+		/* Tuple itself is checkable even if it's dead. */
+		return true;
+	}
+
+	/* xmax is an MXID, not an MXID. Sanity check it. */
+	xmax = HeapTupleHeaderGetRawXmax(tuphdr);
+	switch (get_xid_status(xmax, ctx, &xmax_status))
+	{
+		case XID_IN_FUTURE:
+			report_corruption(ctx,
+							  psprintf("xmax %u equals or exceeds next valid transaction ID %u:%u",
+									   xmax,
+									   EpochFromFullTransactionId(ctx->next_fxid),
+									   XidFromFullTransactionId(ctx->next_fxid)));
+			return false;		/* corrupt */
+		case XID_PRECEDES_RELMIN:
+			report_corruption(ctx,
+							  psprintf("xmax %u precedes relation freeze threshold %u:%u",
+									   xmax,
+									   EpochFromFullTransactionId(ctx->relfrozenfxid),
+									   XidFromFullTransactionId(ctx->relfrozenfxid)));
+			return false;		/* corrupt */
+		case XID_PRECEDES_CLUSTERMIN:
+			report_corruption(ctx,
+							  psprintf("xmax %u precedes oldest valid transaction ID %u:%u",
+									   xmax,
+									   EpochFromFullTransactionId(ctx->oldest_fxid),
+									   XidFromFullTransactionId(ctx->oldest_fxid)));
+			return false;		/* corrupt */
+		case XID_BOUNDS_OK:
+		case XID_INVALID:
+			break;
 	}
-	return true;				/* not dead */
+
+	/*
+	 * Whether the toast can be vacuumed away depends on how old the deleting
+	 * transaction is.
+	 */
+	switch (xmax_status)
+	{
+		case XID_IS_CURRENT_XID:
+		case XID_IN_PROGRESS:
+
+			/*
+			 * The delete is in progress, so it cannot be visible to our
+			 * snapshot.
+			 */
+			ctx->tuple_could_be_pruned = false;
+			break;
+
+		case XID_COMMITTED:
+			/*
+			 * The delete committed.  Whether the toast can be vacuumed away
+			 * depends on how old the deleting transaction is.
+			 */
+			ctx->tuple_could_be_pruned = TransactionIdPrecedes(xmax,
+															 ctx->safe_xmin);
+			break;
+
+		case XID_ABORTED:
+			/*
+			 * The delete aborted or crashed.  The tuple is still live.
+			 */
+			ctx->tuple_could_be_pruned = false;
+			break;
+	}
+
+	/* Tuple itself is checkable even if it's dead. */
+	return true;
 }
 
+
 /*
  * Check the current toast tuple against the state tracked in ctx, recording
  * any corruption found in ctx->tupstore.
@@ -1247,7 +1517,10 @@ check_tuple(HeapCheckContext *ctx)
 	 * corrupt to continue checking, or if the tuple is not visible to anyone,
 	 * we cannot continue with other checks.
 	 */
-	if (!check_tuple_header_and_visibilty(ctx->tuphdr, ctx))
+	if (!check_tuple_header(ctx))
+		return;
+
+	if (!check_tuple_visibility(ctx))
 		return;
 
 	/*
@@ -1448,13 +1721,13 @@ get_xid_status(TransactionId xid, HeapCheckContext *ctx,
 	if (FullTransactionIdPrecedesOrEquals(clog_horizon, fxid))
 	{
 		if (TransactionIdIsCurrentTransactionId(xid))
+			*status = XID_IS_CURRENT_XID;
+		else if (TransactionIdIsInProgress(xid))
 			*status = XID_IN_PROGRESS;
 		else if (TransactionIdDidCommit(xid))
 			*status = XID_COMMITTED;
-		else if (TransactionIdDidAbort(xid))
-			*status = XID_ABORTED;
 		else
-			*status = XID_IN_PROGRESS;
+			*status = XID_ABORTED;
 	}
 	LWLockRelease(XactTruncationLock);
 	ctx->cached_xid = xid;
-- 
2.24.3 (Apple Git-128)

#116Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#115)
Re: pg_amcheck contrib application

On Apr 1, 2021, at 8:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Mar 31, 2021 at 12:34 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

These changes got lost between v11 and v12. I've put them back, as well as updating to use your language.

Here's an updated patch that includes your 0001 and 0002 plus a bunch
of changes by me. I intend to commit this version unless somebody
spots a problem with it.

Here are the things I changed:

- I renamed tuple_can_be_pruned to tuple_could_be_pruned because I
think it does a better job suggesting that we're uncertain about what
will happen.

+1

- I got rid of bool header_garbled and changed it to bool result,
inverting the sense, because it didn't seem useful to have a function
that ended with if (some_boolean) return false; return true when I
could end the function with return some_other_boolean.

+1

- I removed all the one-word comments that said /* corrupt */ or /*
checkable */ because they seemed redundant.

Ok.

- In the xmin_status section of check_tuple_visibility(), I got rid of
the xmin_status == XID_IS_CURRENT_XID and xmin_status ==
XID_IN_PROGRESS cases because they were redundant with the xmin_status
!= XID_COMMITTED case.

Ok.

- If xmax is a multi but seems to be garbled, I changed it to return
true rather than false. The inserter is known to have committed by
that point, so I think it's OK to try to deform the tuple. We just
shouldn't try to check TOAST.

It is hard to know what to do when at least one tuple header field is corrupt. You don't necessarily know which one it is. For example, if HEAP_XMAX_IS_MULTI is set, we try to interpret the xmax as a mxid, and if it is out of bounds, we report it as corrupt. But was the xmax corrupt? Or was the HEAP_XMAX_IS_MULTI bit corrupt? It's not clear. I took the view that if either xmin or xmax appears to be corrupt when interpreted in light of the various tuple header bits, all we really know is that the set of fields/bits don't make sense as a whole, so we report corruption, don't trust any of them, and abort further checking of the tuple. You place the burden of proof the other way around: if the xmin appears fine and the xmax appears corrupt, then we only know that xmax is corrupt, so the tuple is checkable because according to the xmin it committed.

I don't think the way you have it causes undue problems, since deforming the tuple when you shouldn't merely risks a bunch of extra not-so-helpful corruption messages. And hey, maybe they're helpful to somebody clever enough to diagnose why that particular bit of noise was generated.
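The policy being debated is easier to see as two separate outputs: whether the tuple body may be deformed at all, and whether its TOAST may be followed. Here is a hedged sketch of the garbled-multi case; the struct, constants, and bounds test are made up for illustration (the real code compares against relminmxid and the next multixact ID):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bounds; verify_heapam uses the relation's relminmxid
 * and the cluster's next multixact ID. */
#define OLDEST_VALID_MXID 100u
#define NEXT_MXID        5000u

typedef struct
{
    bool     xmax_is_multi;     /* stands in for HEAP_XMAX_IS_MULTI */
    uint32_t raw_xmax;
} TupleHeaderSketch;

/*
 * The policy in the v15 patch: once xmin is known committed, a garbled
 * multi xmax does not retract that fact, so the tuple body stays
 * checkable.  We merely refuse to follow TOAST pointers, because we can
 * no longer prove the TOAST entries are safe from pruning.
 */
static bool
tuple_checkable(const TupleHeaderSketch *hdr, bool *check_toast)
{
    if (hdr->xmax_is_multi &&
        (hdr->raw_xmax < OLDEST_VALID_MXID || hdr->raw_xmax >= NEXT_MXID))
    {
        *check_toast = false;   /* report corruption; skip TOAST only */
        return true;            /* ...but still deform the tuple */
    }
    *check_toast = true;
    return true;
}
```

The alternative view described above would instead return false from this function whenever any header field looks corrupt.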

- I changed both switches over xmax_status to break in each case and
then return true after instead of returning true for each case. I
think this is clearer.

Ok.

- I changed get_xid_status() to perform the TransactionIdIs... checks
in the same order that HeapTupleSatisfies...() does them. I believe
that it's incorrect to conclude that the transaction must be in
progress because it neither IsCurrentTransaction nor DidCommit nor
DidAbort, because all of those things will be false for a transaction
that is running at the time of a system crash. The correct rule is
that if it neither IsCurrentTransaction nor IsInProgress nor DidCommit
then it's aborted.

Ok.

- I moved a few comments and rewrote some others, including some of
the ones that you took from my earlier draft patch. The thing is, the
comment needs to be adjusted based on where you put it. Like, I had a
comment that says "It should be impossible for xvac to still be
running, since we've removed all that code, but even if it were, it
ought to be safe to read the tuple, since the original inserter must
have committed. But, if the xvac transaction committed, this tuple
(and its associated TOAST tuples) could be pruned at any time." which
in my version was right before a TransactionIdDidCommit() test, and
explains why that test is there and why the code does what it does as
a result. But in your version you've moved it to a place where we've
already tested that the transaction has committed, and more
importantly, where we've already tested that it's not still running.
Saying that it "should" be impossible for it to still be running when
we've *actually checked that it isn't* doesn't make nearly as much
sense as it does when we haven't checked that and aren't going to do so.

* If xmin_status happens to be XID_IN_PROGRESS, then in theory

Did you mean to say XID_IS_CURRENT_XID here?

/* xmax is an MXID, not an MXID. Sanity check it. */

Is it an MXID or isn't it?


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#117Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#116)
1 attachment(s)
Re: pg_amcheck contrib application

On Thu, Apr 1, 2021 at 12:32 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

- If xmax is a multi but seems to be garbled, I changed it to return
true rather than false. The inserter is known to have committed by
that point, so I think it's OK to try to deform the tuple. We just
shouldn't try to check TOAST.

It is hard to know what to do when at least one tuple header field is corrupt. You don't necessarily know which one it is. For example, if HEAP_XMAX_IS_MULTI is set, we try to interpret the xmax as a mxid, and if it is out of bounds, we report it as corrupt. But was the xmax corrupt? Or was the HEAP_XMAX_IS_MULTI bit corrupt? It's not clear. I took the view that if either xmin or xmax appears to be corrupt when interpreted in light of the various tuple header bits, all we really know is that the set of fields/bits don't make sense as a whole, so we report corruption, don't trust any of them, and abort further checking of the tuple. You have the burden of proof the other way around: if the xmin appears fine and the xmax appears corrupt, then we only know that xmax is corrupt, so the tuple is checkable because, according to the xmin, it committed.

I agree that it's hard to be sure what's gone once we start finding
corrupted data, but deciding that maybe xmin didn't really commit
because we see that there's something wrong with xmax seems excessive
to me. I thought about a related case: if xmax is a bad multi but is
also hinted invalid, should we try to follow TOAST pointers? I think
that's hard to say, because we don't know whether (1) the invalid
marking is in error, (2) it's wrong to consider it a multi rather than
an XID, (3) the stored multi got overwritten with a garbage value, or
(4) the stored multi got removed before the tuple was frozen. Not
knowing which of those is the case, how are we supposed to decide
whether the TOAST tuples might have been (or be about to get) pruned?

But, in the case we're talking about here, I don't think it's a
particularly close decision. All we need to say is that if xmax or the
infomask bits pertaining to it are corrupted, we're still going to
suppose that xmin and the infomask bits pertaining to it, which are
all different bytes and bits, are OK. To me, the contrary decision,
namely that a bogus xmax means xmin was probably lying about the
transaction having been committed in the first place, seems like a
serious overreaction. As you say:

I don't think how you have it causes undue problems, since deforming the tuple when you shouldn't merely risks a bunch of extra not-so-helpful corruption messages. And hey, maybe they're helpful to somebody clever enough to diagnose why that particular bit of noise was generated.

I agree. The biggest risk here is that we might emit >0 complaints
when only 0 are justified. That will panic users. The possibility that
we might emit >x complaints when only x are justified, for some x>0,
is also a risk, but it's not nearly as bad, because there's definitely
something wrong, and it's just a question of what it is exactly. So we
have to be really conservative about saying that X is corruption if
there's any possibility that it might be fine. But once we've
complained about one thing, we can take a more balanced approach about
whether to risk issuing more complaints. The possibility that
suppressing the additional complaints might complicate resolution of
the issue also needs to be considered.

* If xmin_status happens to be XID_IN_PROGRESS, then in theory

Did you mean to say XID_IS_CURRENT_XID here?

Yes, I did, thanks.

/* xmax is an MXID, not an MXID. Sanity check it. */

Is it an MXID or isn't it?

Good catch.

New patch attached.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

v16-0001-amcheck-Fix-verify_heapam-s-tuple-visibility-che.patch
From b6b1dc8b14c2013239ccf4410e68875919ae8752 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 1 Apr 2021 12:40:31 -0400
Subject: [PATCH v16] amcheck: Fix verify_heapam's tuple visibility checking
 rules.

We now follow the order of checks from HeapTupleSatisfies* more
closely to avoid coming to erroneous conclusions.

Mark Dilger and Robert Haas
---
 contrib/amcheck/verify_heapam.c | 555 ++++++++++++++++++++++++--------
 1 file changed, 414 insertions(+), 141 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 6f972e630a..2e0bc5f059 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -46,6 +46,7 @@ typedef enum XidBoundsViolation
 typedef enum XidCommitStatus
 {
 	XID_COMMITTED,
+	XID_IS_CURRENT_XID,
 	XID_IN_PROGRESS,
 	XID_ABORTED
 } XidCommitStatus;
@@ -72,6 +73,8 @@ typedef struct HeapCheckContext
 	TransactionId oldest_xid;	/* ShmemVariableCache->oldestXid */
 	FullTransactionId oldest_fxid;	/* 64-bit version of oldest_xid, computed
 									 * relative to next_fxid */
+	TransactionId safe_xmin;	/* this XID and newer ones can't become
+								 * all-visible while we're running */
 
 	/*
 	 * Cached copy of value from MultiXactState
@@ -113,6 +116,9 @@ typedef struct HeapCheckContext
 	uint32		offset;			/* offset in tuple data */
 	AttrNumber	attnum;
 
+	/* True if tuple's xmax makes it eligible for pruning */
+	bool		tuple_could_be_pruned;
+
 	/* Values for iterating over toast for the attribute */
 	int32		chunkno;
 	int32		attrsize;
@@ -133,8 +139,8 @@ static void check_tuple(HeapCheckContext *ctx);
 static void check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx);
 
 static bool check_tuple_attribute(HeapCheckContext *ctx);
-static bool check_tuple_header_and_visibilty(HeapTupleHeader tuphdr,
-											 HeapCheckContext *ctx);
+static bool check_tuple_header(HeapCheckContext *ctx);
+static bool check_tuple_visibility(HeapCheckContext *ctx);
 
 static void report_corruption(HeapCheckContext *ctx, char *msg);
 static TupleDesc verify_heapam_tupdesc(void);
@@ -248,6 +254,12 @@ verify_heapam(PG_FUNCTION_ARGS)
 	memset(&ctx, 0, sizeof(HeapCheckContext));
 	ctx.cached_xid = InvalidTransactionId;
 
+	/*
+	 * Any xmin newer than the xmin of our snapshot can't become all-visible
+	 * while we're running.
+	 */
+	ctx.safe_xmin = GetTransactionSnapshot()->xmin;
+
 	/*
 	 * If we report corruption when not examining some individual attribute,
 	 * we need attnum to be reported as NULL.  Set that up before any
@@ -555,16 +567,11 @@ verify_heapam_tupdesc(void)
 }
 
 /*
- * Check for tuple header corruption and tuple visibility.
- *
- * Since we do not hold a snapshot, tuple visibility is not a question of
- * whether we should be able to see the tuple relative to any particular
- * snapshot, but rather a question of whether it is safe and reasonable to
- * check the tuple attributes.
+ * Check for tuple header corruption.
  *
  * Some kinds of corruption make it unsafe to check the tuple attributes, for
  * example when the line pointer refers to a range of bytes outside the page.
- * In such cases, we return false (not visible) after recording appropriate
+ * In such cases, we return false (not checkable) after recording appropriate
  * corruption messages.
  *
  * Some other kinds of tuple header corruption confuse the question of where
@@ -576,29 +583,18 @@ verify_heapam_tupdesc(void)
  *
  * Other kinds of tuple header corruption do not bear on the question of
  * whether the tuple attributes can be checked, so we record corruption
- * messages for them but do not base our visibility determination on them.  (In
- * other words, we do not return false merely because we detected them.)
- *
- * For visibility determination not specifically related to corruption, what we
- * want to know is if a tuple is potentially visible to any running
- * transaction.  If you are tempted to replace this function's visibility logic
- * with a call to another visibility checking function, keep in mind that this
- * function does not update hint bits, as it seems imprudent to write hint bits
- * (or anything at all) to a table during a corruption check.  Nor does this
- * function bother classifying tuple visibility beyond a boolean visible vs.
- * not visible.
- *
- * The caller should already have checked that xmin and xmax are not out of
- * bounds for the relation.
+ * messages for them but we do not return false merely because we detected
+ * them.
  *
- * Returns whether the tuple is both visible and sufficiently sensible to
- * undergo attribute checks.
+ * Returns whether the tuple is sufficiently sensible to undergo visibility and
+ * attribute checks.
  */
 static bool
-check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
+check_tuple_header(HeapCheckContext *ctx)
 {
+	HeapTupleHeader tuphdr = ctx->tuphdr;
 	uint16		infomask = tuphdr->t_infomask;
-	bool		header_garbled = false;
+	bool		result = true;
 	unsigned	expected_hoff;
 
 	if (ctx->tuphdr->t_hoff > ctx->lp_len)
@@ -606,7 +602,7 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
 		report_corruption(ctx,
 						  psprintf("data begins at offset %u beyond the tuple length %u",
 								   ctx->tuphdr->t_hoff, ctx->lp_len));
-		header_garbled = true;
+		result = false;
 	}
 
 	if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) &&
@@ -616,9 +612,9 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
 						  pstrdup("multixact should not be marked committed"));
 
 		/*
-		 * This condition is clearly wrong, but we do not consider the header
-		 * garbled, because we don't rely on this property for determining if
-		 * the tuple is visible or for interpreting other relevant header
+		 * This condition is clearly wrong, but it's not enough to justify
+		 * skipping further checks, because we don't rely on this to determine
+		 * whether the tuple is visible or to interpret other relevant header
 		 * fields.
 		 */
 	}
@@ -645,175 +641,449 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
 			report_corruption(ctx,
 							  psprintf("tuple data should begin at byte %u, but actually begins at byte %u (%u attributes, no nulls)",
 									   expected_hoff, ctx->tuphdr->t_hoff, ctx->natts));
-		header_garbled = true;
+		result = false;
 	}
 
-	if (header_garbled)
-		return false;			/* checking of this tuple should not continue */
+	return result;
+}
+
+/*
+ * Checks tuple visibility so we know which further checks are safe to
+ * perform.
+ *
+ * If a tuple could have been inserted by a transaction that also added a
+ * column to the table, but which ultimately did not commit, or which has not
+ * yet committed, then the table's current TupleDesc might differ from the one
+ * used to construct this tuple, so we must not check it.
+ *
+ * As a special case, if our own transaction inserted the tuple, even if we
+ * added a column to the table, our TupleDesc should match.  We could check the
+ * tuple, but choose not to do so.
+ *
+ * If a tuple has been updated or deleted, we can still read the old tuple for
+ * corruption checking purposes, as long as we are careful about concurrent
+ * vacuums.  The main table tuple itself cannot be vacuumed away because we
+ * hold a buffer lock on the page, but if the deleting transaction is older
+ * than our transaction snapshot's xmin, then vacuum could remove the toast at
+ * any time, so we must not try to follow TOAST pointers.
+ *
+ * If xmin or xmax values are older than can be checked against clog, or appear
+ * to be in the future (possibly due to wrap-around), then we cannot make a
+ * determination about the visibility of the tuple, so we skip further checks.
+ *
+ * Returns true if the tuple itself should be checked, false otherwise.  Sets
+ * ctx->tuple_could_be_pruned if the tuple -- and thus also any associated
+ * TOAST tuples -- are eligible for pruning.
+ */
+static bool
+check_tuple_visibility(HeapCheckContext *ctx)
+{
+	TransactionId xmin;
+	TransactionId xvac;
+	TransactionId xmax;
+	XidCommitStatus xmin_status;
+	XidCommitStatus xvac_status;
+	XidCommitStatus xmax_status;
+	HeapTupleHeader tuphdr = ctx->tuphdr;
+
+	ctx->tuple_could_be_pruned = true;	/* have not yet proven otherwise */
+
+	/* If xmin is normal, it should be within valid range */
+	xmin = HeapTupleHeaderGetXmin(tuphdr);
+	switch (get_xid_status(xmin, ctx, &xmin_status))
+	{
+		case XID_INVALID:
+		case XID_BOUNDS_OK:
+			break;
+		case XID_IN_FUTURE:
+			report_corruption(ctx,
+							  psprintf("xmin %u equals or exceeds next valid transaction ID %u:%u",
+									   xmin,
+									   EpochFromFullTransactionId(ctx->next_fxid),
+									   XidFromFullTransactionId(ctx->next_fxid)));
+			return false;
+		case XID_PRECEDES_CLUSTERMIN:
+			report_corruption(ctx,
+							  psprintf("xmin %u precedes oldest valid transaction ID %u:%u",
+									   xmin,
+									   EpochFromFullTransactionId(ctx->oldest_fxid),
+									   XidFromFullTransactionId(ctx->oldest_fxid)));
+			return false;
+		case XID_PRECEDES_RELMIN:
+			report_corruption(ctx,
+							  psprintf("xmin %u precedes relation freeze threshold %u:%u",
+									   xmin,
+									   EpochFromFullTransactionId(ctx->relfrozenfxid),
+									   XidFromFullTransactionId(ctx->relfrozenfxid)));
+			return false;
+	}
 
 	/*
-	 * Ok, we can examine the header for tuple visibility purposes, though we
-	 * still need to be careful about a few remaining types of header
-	 * corruption.  This logic roughly follows that of
-	 * HeapTupleSatisfiesVacuum.  Where possible the comments indicate which
-	 * HTSV_Result we think that function might return for this tuple.
+	 * Has inserting transaction committed?
 	 */
 	if (!HeapTupleHeaderXminCommitted(tuphdr))
 	{
-		TransactionId raw_xmin = HeapTupleHeaderGetRawXmin(tuphdr);
-
 		if (HeapTupleHeaderXminInvalid(tuphdr))
-			return false;		/* HEAPTUPLE_DEAD */
+			return false;		/* inserter aborted, don't check */
 		/* Used by pre-9.0 binary upgrades */
-		else if (infomask & HEAP_MOVED_OFF ||
-				 infomask & HEAP_MOVED_IN)
+		else if (tuphdr->t_infomask & HEAP_MOVED_OFF)
 		{
-			XidCommitStatus status;
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuphdr);
+			xvac = HeapTupleHeaderGetXvac(tuphdr);
 
-			switch (get_xid_status(xvac, ctx, &status))
+			switch (get_xid_status(xvac, ctx, &xvac_status))
 			{
 				case XID_INVALID:
 					report_corruption(ctx,
-									  pstrdup("old-style VACUUM FULL transaction ID is invalid"));
-					return false;	/* corrupt */
+									  pstrdup("old-style VACUUM FULL transaction ID for moved off tuple is invalid"));
+					return false;
 				case XID_IN_FUTURE:
 					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u equals or exceeds next valid transaction ID %u:%u",
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple equals or exceeds next valid transaction ID %u:%u",
 											   xvac,
 											   EpochFromFullTransactionId(ctx->next_fxid),
 											   XidFromFullTransactionId(ctx->next_fxid)));
-					return false;	/* corrupt */
+					return false;
 				case XID_PRECEDES_RELMIN:
 					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u precedes relation freeze threshold %u:%u",
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple precedes relation freeze threshold %u:%u",
 											   xvac,
 											   EpochFromFullTransactionId(ctx->relfrozenfxid),
 											   XidFromFullTransactionId(ctx->relfrozenfxid)));
-					return false;	/* corrupt */
-					break;
+					return false;
 				case XID_PRECEDES_CLUSTERMIN:
 					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u precedes oldest valid transaction ID %u:%u",
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple precedes oldest valid transaction ID %u:%u",
 											   xvac,
 											   EpochFromFullTransactionId(ctx->oldest_fxid),
 											   XidFromFullTransactionId(ctx->oldest_fxid)));
-					return false;	/* corrupt */
-					break;
+					return false;
 				case XID_BOUNDS_OK:
-					switch (status)
-					{
-						case XID_IN_PROGRESS:
-							return true;	/* HEAPTUPLE_DELETE_IN_PROGRESS */
-						case XID_COMMITTED:
-						case XID_ABORTED:
-							return false;	/* HEAPTUPLE_DEAD */
-					}
+					break;
+			}
+
+			switch (xvac_status)
+			{
+				case XID_IS_CURRENT_XID:
+					report_corruption(ctx,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple matches our current transaction ID",
+											   xvac));
+					return false;
+				case XID_IN_PROGRESS:
+					report_corruption(ctx,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple appears to be in progress",
+											   xvac));
+					return false;
+
+				case XID_COMMITTED:
+					/*
+					 * The tuple is dead, because the xvac transaction moved
+					 * it off and committed. It's checkable, but also prunable.
+					 */
+					return true;
+
+				case XID_ABORTED:
+					/*
+					 * The original xmin must have committed, because the xvac
+					 * transaction tried to move it later. Since xvac is
+					 * aborted, whether it's still alive now depends on the
+					 * status of xmax.
+					 */
+					break;
 			}
 		}
-		else
+		/* Used by pre-9.0 binary upgrades */
+		else if (tuphdr->t_infomask & HEAP_MOVED_IN)
 		{
-			XidCommitStatus status;
+			xvac = HeapTupleHeaderGetXvac(tuphdr);
 
-			switch (get_xid_status(raw_xmin, ctx, &status))
+			switch (get_xid_status(xvac, ctx, &xvac_status))
 			{
 				case XID_INVALID:
 					report_corruption(ctx,
-									  pstrdup("raw xmin is invalid"));
+									  pstrdup("old-style VACUUM FULL transaction ID for moved in tuple is invalid"));
 					return false;
 				case XID_IN_FUTURE:
 					report_corruption(ctx,
-									  psprintf("raw xmin %u equals or exceeds next valid transaction ID %u:%u",
-											   raw_xmin,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple equals or exceeds next valid transaction ID %u:%u",
+											   xvac,
 											   EpochFromFullTransactionId(ctx->next_fxid),
 											   XidFromFullTransactionId(ctx->next_fxid)));
-					return false;	/* corrupt */
+					return false;
 				case XID_PRECEDES_RELMIN:
 					report_corruption(ctx,
-									  psprintf("raw xmin %u precedes relation freeze threshold %u:%u",
-											   raw_xmin,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple precedes relation freeze threshold %u:%u",
+											   xvac,
 											   EpochFromFullTransactionId(ctx->relfrozenfxid),
 											   XidFromFullTransactionId(ctx->relfrozenfxid)));
-					return false;	/* corrupt */
+					return false;
 				case XID_PRECEDES_CLUSTERMIN:
 					report_corruption(ctx,
-									  psprintf("raw xmin %u precedes oldest valid transaction ID %u:%u",
-											   raw_xmin,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple precedes oldest valid transaction ID %u:%u",
+											   xvac,
 											   EpochFromFullTransactionId(ctx->oldest_fxid),
 											   XidFromFullTransactionId(ctx->oldest_fxid)));
-					return false;	/* corrupt */
+					return false;
 				case XID_BOUNDS_OK:
-					switch (status)
-					{
-						case XID_COMMITTED:
-							break;
-						case XID_IN_PROGRESS:
-							return true;	/* insert or delete in progress */
-						case XID_ABORTED:
-							return false;	/* HEAPTUPLE_DEAD */
-					}
+					break;
 			}
+
+			switch (xvac_status)
+			{
+				case XID_IS_CURRENT_XID:
+					report_corruption(ctx,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple matches our current transaction ID",
+											   xvac));
+					return false;
+				case XID_IN_PROGRESS:
+					report_corruption(ctx,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple appears to be in progress",
+											   xvac));
+					return false;
+
+				case XID_COMMITTED:
+					/*
+					 * The original xmin must have committed, because the xvac
+					 * transaction moved it later. Whether it's still alive
+					 * now depends on the status of xmax.
+					 */
+					break;
+
+				case XID_ABORTED:
+					/*
+					 * The tuple is dead, because the xvac transaction moved
+					 * it in but aborted. It's checkable, but also prunable.
+					 */
+					return true;
+			}
+		}
+		else if (xmin_status != XID_COMMITTED)
+		{
+			/*
+			 * Inserting transaction is not in progress, and not committed, so
+			 * it might have changed the TupleDesc in ways we don't know about.
+			 * Thus, don't try to check the tuple structure.
+			 *
+			 * If xmin_status happens to be XID_IS_CURRENT_XID, then in theory
+			 * any such DDL changes ought to be visible to us, so perhaps
+			 * we could check anyway in that case. But, for now, let's be
+			 * conservative and treat this like any other uncommitted insert.
+			 */
+			return false;
 		}
 	}
 
-	if (!(infomask & HEAP_XMAX_INVALID) && !HEAP_XMAX_IS_LOCKED_ONLY(infomask))
+	/*
+	 * Okay, the inserter committed, so it was good at some point.  Now what
+	 * about the deleting transaction?
+	 */
+
+	if (tuphdr->t_infomask & HEAP_XMAX_IS_MULTI)
 	{
-		if (infomask & HEAP_XMAX_IS_MULTI)
+		/*
+		 * xmax is a multixact, so sanity-check the MXID. Note that we do this
+		 * prior to checking for HEAP_XMAX_INVALID or HEAP_XMAX_IS_LOCKED_ONLY.
+		 * This might therefore complain about things that wouldn't actually
+		 * be a problem during a normal scan, but eventually we're going to
+		 * have to freeze, and that process will ignore hint bits.
+		 *
+		 * Even if the MXID is out of range, we still know that the original
+		 * insert committed, so we can check the tuple itself. However, we
+		 * can't rule out the possibility that this tuple is dead, so don't
+		 * clear ctx->tuple_could_be_pruned. Possibly we should go ahead and
+		 * clear that flag anyway if HEAP_XMAX_INVALID is set or if
+		 * HEAP_XMAX_IS_LOCKED_ONLY is true, but for now we err on the side
+		 * of avoiding possibly-bogus complaints about missing TOAST entries.
+		 */
+		xmax = HeapTupleHeaderGetRawXmax(tuphdr);
+		switch (check_mxid_valid_in_rel(xmax, ctx))
 		{
-			XidCommitStatus status;
-			TransactionId xmax = HeapTupleGetUpdateXid(tuphdr);
+			case XID_INVALID:
+				report_corruption(ctx,
+								  pstrdup("multitransaction ID is invalid"));
+				return true;
+			case XID_PRECEDES_RELMIN:
+				report_corruption(ctx,
+								  psprintf("multitransaction ID %u precedes relation minimum multitransaction ID threshold %u",
+										   xmax, ctx->relminmxid));
+				return true;
+			case XID_PRECEDES_CLUSTERMIN:
+				report_corruption(ctx,
+								  psprintf("multitransaction ID %u precedes oldest valid multitransaction ID threshold %u",
+										   xmax, ctx->oldest_mxact));
+				return true;
+			case XID_IN_FUTURE:
+				report_corruption(ctx,
+								  psprintf("multitransaction ID %u equals or exceeds next valid multitransaction ID %u",
+										   xmax,
+										   ctx->next_mxact));
+				return true;
+			case XID_BOUNDS_OK:
+				break;
+		}
+	}
 
-			switch (get_xid_status(xmax, ctx, &status))
-			{
-					/* not LOCKED_ONLY, so it has to have an xmax */
-				case XID_INVALID:
-					report_corruption(ctx,
-									  pstrdup("xmax is invalid"));
-					return false;	/* corrupt */
-				case XID_IN_FUTURE:
-					report_corruption(ctx,
-									  psprintf("xmax %u equals or exceeds next valid transaction ID %u:%u",
-											   xmax,
-											   EpochFromFullTransactionId(ctx->next_fxid),
-											   XidFromFullTransactionId(ctx->next_fxid)));
-					return false;	/* corrupt */
-				case XID_PRECEDES_RELMIN:
-					report_corruption(ctx,
-									  psprintf("xmax %u precedes relation freeze threshold %u:%u",
-											   xmax,
-											   EpochFromFullTransactionId(ctx->relfrozenfxid),
-											   XidFromFullTransactionId(ctx->relfrozenfxid)));
-					return false;	/* corrupt */
-				case XID_PRECEDES_CLUSTERMIN:
-					report_corruption(ctx,
-									  psprintf("xmax %u precedes oldest valid transaction ID %u:%u",
-											   xmax,
-											   EpochFromFullTransactionId(ctx->oldest_fxid),
-											   XidFromFullTransactionId(ctx->oldest_fxid)));
-					return false;	/* corrupt */
-				case XID_BOUNDS_OK:
-					switch (status)
-					{
-						case XID_IN_PROGRESS:
-							return true;	/* HEAPTUPLE_DELETE_IN_PROGRESS */
-						case XID_COMMITTED:
-						case XID_ABORTED:
-							return false;	/* HEAPTUPLE_RECENTLY_DEAD or
-											 * HEAPTUPLE_DEAD */
-					}
-			}
+	if (tuphdr->t_infomask & HEAP_XMAX_INVALID)
+	{
+		/*
+		 * This tuple is live.  A concurrently running transaction could
+		 * delete it before we get around to checking the toast, but any such
+		 * running transaction is surely not less than our safe_xmin, so the
+		 * toast cannot be vacuumed out from under us.
+		 */
+		ctx->tuple_could_be_pruned = false;
+		return true;
+	}
 
-			/* Ok, the tuple is live */
+	if (HEAP_XMAX_IS_LOCKED_ONLY(tuphdr->t_infomask))
+	{
+		/*
+		 * "Deleting" xact really only locked it, so the tuple is live in any
+		 * case.  As above, a concurrently running transaction could delete
+		 * it, but it cannot be vacuumed out from under us.
+		 */
+		ctx->tuple_could_be_pruned = false;
+		return true;
+	}
+
+	if (tuphdr->t_infomask & HEAP_XMAX_IS_MULTI)
+	{
+		/*
+		 * We already checked above that this multixact is within limits for
+		 * this table.  Now check the update xid from this multixact.
+		 */
+		xmax = HeapTupleGetUpdateXid(tuphdr);
+		switch (get_xid_status(xmax, ctx, &xmax_status))
+		{
+			case XID_INVALID:
+				/* not LOCKED_ONLY, so it has to have an xmax */
+				report_corruption(ctx,
+								  pstrdup("update xid is invalid"));
+				return true;
+			case XID_IN_FUTURE:
+				report_corruption(ctx,
+								  psprintf("update xid %u equals or exceeds next valid transaction ID %u:%u",
+										   xmax,
+										   EpochFromFullTransactionId(ctx->next_fxid),
+										   XidFromFullTransactionId(ctx->next_fxid)));
+				return true;
+			case XID_PRECEDES_RELMIN:
+				report_corruption(ctx,
+								  psprintf("update xid %u precedes relation freeze threshold %u:%u",
+										   xmax,
+										   EpochFromFullTransactionId(ctx->relfrozenfxid),
+										   XidFromFullTransactionId(ctx->relfrozenfxid)));
+				return true;
+			case XID_PRECEDES_CLUSTERMIN:
+				report_corruption(ctx,
+								  psprintf("update xid %u precedes oldest valid transaction ID %u:%u",
+										   xmax,
+										   EpochFromFullTransactionId(ctx->oldest_fxid),
+										   XidFromFullTransactionId(ctx->oldest_fxid)));
+				return true;
+			case XID_BOUNDS_OK:
+				break;
 		}
-		else if (!(infomask & HEAP_XMAX_COMMITTED))
-			return true;		/* HEAPTUPLE_DELETE_IN_PROGRESS or
-								 * HEAPTUPLE_LIVE */
-		else
-			return false;		/* HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD */
+
+		switch (xmax_status)
+		{
+			case XID_IS_CURRENT_XID:
+			case XID_IN_PROGRESS:
+
+				/*
+				 * The delete is in progress, so it cannot be visible to our
+				 * snapshot.
+				 */
+				ctx->tuple_could_be_pruned = false;
+				break;
+			case XID_COMMITTED:
+
+				/*
+				 * The delete committed.  Whether the toast can be vacuumed
+				 * away depends on how old the deleting transaction is.
+				 */
+				ctx->tuple_could_be_pruned = TransactionIdPrecedes(xmax,
+																 ctx->safe_xmin);
+				break;
+			case XID_ABORTED:
+				/*
+				 * The delete aborted or crashed.  The tuple is still live.
+				 */
+				ctx->tuple_could_be_pruned = false;
+				break;
+		}
+
+		/* Tuple itself is checkable even if it's dead. */
+		return true;
+	}
+
+	/* xmax is an XID, not an MXID. Sanity check it. */
+	xmax = HeapTupleHeaderGetRawXmax(tuphdr);
+	switch (get_xid_status(xmax, ctx, &xmax_status))
+	{
+		case XID_IN_FUTURE:
+			report_corruption(ctx,
+							  psprintf("xmax %u equals or exceeds next valid transaction ID %u:%u",
+									   xmax,
+									   EpochFromFullTransactionId(ctx->next_fxid),
+									   XidFromFullTransactionId(ctx->next_fxid)));
+			return false;		/* corrupt */
+		case XID_PRECEDES_RELMIN:
+			report_corruption(ctx,
+							  psprintf("xmax %u precedes relation freeze threshold %u:%u",
+									   xmax,
+									   EpochFromFullTransactionId(ctx->relfrozenfxid),
+									   XidFromFullTransactionId(ctx->relfrozenfxid)));
+			return false;		/* corrupt */
+		case XID_PRECEDES_CLUSTERMIN:
+			report_corruption(ctx,
+							  psprintf("xmax %u precedes oldest valid transaction ID %u:%u",
+									   xmax,
+									   EpochFromFullTransactionId(ctx->oldest_fxid),
+									   XidFromFullTransactionId(ctx->oldest_fxid)));
+			return false;		/* corrupt */
+		case XID_BOUNDS_OK:
+		case XID_INVALID:
+			break;
 	}
-	return true;				/* not dead */
+
+	/*
+	 * Whether the toast can be vacuumed away depends on how old the deleting
+	 * transaction is.
+	 */
+	switch (xmax_status)
+	{
+		case XID_IS_CURRENT_XID:
+		case XID_IN_PROGRESS:
+
+			/*
+			 * The delete is in progress, so it cannot be visible to our
+			 * snapshot.
+			 */
+			ctx->tuple_could_be_pruned = false;
+			break;
+
+		case XID_COMMITTED:
+			/*
+			 * The delete committed.  Whether the toast can be vacuumed away
+			 * depends on how old the deleting transaction is.
+			 */
+			ctx->tuple_could_be_pruned = TransactionIdPrecedes(xmax,
+															 ctx->safe_xmin);
+			break;
+
+		case XID_ABORTED:
+			/*
+			 * The delete aborted or crashed.  The tuple is still live.
+			 */
+			ctx->tuple_could_be_pruned = false;
+			break;
+	}
+
+	/* Tuple itself is checkable even if it's dead. */
+	return true;
 }
 
+
 /*
  * Check the current toast tuple against the state tracked in ctx, recording
  * any corruption found in ctx->tupstore.
@@ -1247,7 +1517,10 @@ check_tuple(HeapCheckContext *ctx)
 	 * corrupt to continue checking, or if the tuple is not visible to anyone,
 	 * we cannot continue with other checks.
 	 */
-	if (!check_tuple_header_and_visibilty(ctx->tuphdr, ctx))
+	if (!check_tuple_header(ctx))
+		return;
+
+	if (!check_tuple_visibility(ctx))
 		return;
 
 	/*
@@ -1448,13 +1721,13 @@ get_xid_status(TransactionId xid, HeapCheckContext *ctx,
 	if (FullTransactionIdPrecedesOrEquals(clog_horizon, fxid))
 	{
 		if (TransactionIdIsCurrentTransactionId(xid))
+			*status = XID_IS_CURRENT_XID;
+		else if (TransactionIdIsInProgress(xid))
 			*status = XID_IN_PROGRESS;
 		else if (TransactionIdDidCommit(xid))
 			*status = XID_COMMITTED;
-		else if (TransactionIdDidAbort(xid))
-			*status = XID_ABORTED;
 		else
-			*status = XID_IN_PROGRESS;
+			*status = XID_ABORTED;
 	}
 	LWLockRelease(XactTruncationLock);
 	ctx->cached_xid = xid;
-- 
2.24.3 (Apple Git-128)

#118Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#117)
Re: pg_amcheck contrib application

On Apr 1, 2021, at 9:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Apr 1, 2021 at 12:32 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

- If xmax is a multi but seems to be garbled, I changed it to return
true rather than false. The inserter is known to have committed by
that point, so I think it's OK to try to deform the tuple. We just
shouldn't try to check TOAST.

It is hard to know what to do when at least one tuple header field is corrupt. You don't necessarily know which one it is. For example, if HEAP_XMAX_IS_MULTI is set, we try to interpret the xmax as a mxid, and if it is out of bounds, we report it as corrupt. But was the xmax corrupt? Or was the HEAP_XMAX_IS_MULTI bit corrupt? It's not clear. I took the view that if either xmin or xmax appears to be corrupt when interpreted in light of the various tuple header bits, all we really know is that the set of fields/bits don't make sense as a whole, so we report corruption, don't trust any of them, and abort further checking of the tuple. You have the burden of proof the other way around: if the xmin appears fine and the xmax appears corrupt, then we only know that xmax is corrupt, so the tuple is checkable because, according to the xmin, it committed.

I agree that it's hard to be sure what's gone once we start finding
corrupted data, but deciding that maybe xmin didn't really commit
because we see that there's something wrong with xmax seems excessive
to me. I thought about a related case: if xmax is a bad multi but is
also hinted invalid, should we try to follow TOAST pointers? I think
that's hard to say, because we don't know whether (1) the invalid
marking is in error, (2) it's wrong to consider it a multi rather than
an XID, (3) the stored multi got overwritten with a garbage value, or
(4) the stored multi got removed before the tuple was frozen. Not
knowing which of those is the case, how are we supposed to decide
whether the TOAST tuples might have been (or be about to get) pruned?

But, in the case we're talking about here, I don't think it's a
particularly close decision. All we need to say is that if xmax or the
infomask bits pertaining to it are corrupted, we're still going to
suppose that xmin and the infomask bits pertaining to it, which are
all different bytes and bits, are OK. To me, the contrary decision,
namely that a bogus xmax means xmin was probably lying about the
transaction having been committed in the first place, seems like a
serious overreaction. As you say:

I don't think how you have it causes undue problems, since deforming the tuple when you shouldn't merely risks a bunch of extra not-so-helpful corruption messages. And hey, maybe they're helpful to somebody clever enough to diagnose why that particular bit of noise was generated.

I agree. The biggest risk here is that we might emit >0 complaints
when only 0 are justified. That will panic users. The possibility that
we might emit >x complaints when only x are justified, for some x>0,
is also a risk, but it's not nearly as bad, because there's definitely
something wrong, and it's just a question of what it is exactly. So we
have to be really conservative about saying that X is corruption if
there's any possibility that it might be fine. But once we've
complained about one thing, we can take a more balanced approach about
whether to risk issuing more complaints. The possibility that
suppressing the additional complaints might complicate resolution of
the issue also needs to be considered.

This all seems fine to me. The main thing is that we don't go on to check the toast, which we don't.

* If xmin_status happens to be XID_IN_PROGRESS, then in theory

Did you mean to say XID_IS_CURRENT_XID here?

Yes, I did, thanks.

Ouch. You've got a typo: s/XID_IN_CURRENT_XID/XID_IS_CURRENT_XID/

/* xmax is an MXID, not an MXID. Sanity check it. */

Is it an MXID or isn't it?

Good catch.

New patch attached.

Seems fine other than the typo.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#119 Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#118)
1 attachment(s)
Re: pg_amcheck contrib application

On Thu, Apr 1, 2021 at 1:06 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

Seems fine other than the typo.

OK, let's try that again.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

v17-0001-amcheck-Fix-verify_heapam-s-tuple-visibility-che.patch (application/octet-stream)
From 1ea29a39988f0767596fee5a618d08e4cc93319e Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 1 Apr 2021 13:19:57 -0400
Subject: [PATCH v17] amcheck: Fix verify_heapam's tuple visibility checking
 rules.

We now follow the order of checks from HeapTupleSatisfies* more
closely to avoid coming to erroneous conclusions.

Mark Dilger and Robert Haas
---
 contrib/amcheck/verify_heapam.c | 555 ++++++++++++++++++++++++--------
 1 file changed, 414 insertions(+), 141 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 6f972e630a..3fb709b842 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -46,6 +46,7 @@ typedef enum XidBoundsViolation
 typedef enum XidCommitStatus
 {
 	XID_COMMITTED,
+	XID_IS_CURRENT_XID,
 	XID_IN_PROGRESS,
 	XID_ABORTED
 } XidCommitStatus;
@@ -72,6 +73,8 @@ typedef struct HeapCheckContext
 	TransactionId oldest_xid;	/* ShmemVariableCache->oldestXid */
 	FullTransactionId oldest_fxid;	/* 64-bit version of oldest_xid, computed
 									 * relative to next_fxid */
+	TransactionId safe_xmin;	/* this XID and newer ones can't become
+								 * all-visible while we're running */
 
 	/*
 	 * Cached copy of value from MultiXactState
@@ -113,6 +116,9 @@ typedef struct HeapCheckContext
 	uint32		offset;			/* offset in tuple data */
 	AttrNumber	attnum;
 
+	/* True if tuple's xmax makes it eligible for pruning */
+	bool		tuple_could_be_pruned;
+
 	/* Values for iterating over toast for the attribute */
 	int32		chunkno;
 	int32		attrsize;
@@ -133,8 +139,8 @@ static void check_tuple(HeapCheckContext *ctx);
 static void check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx);
 
 static bool check_tuple_attribute(HeapCheckContext *ctx);
-static bool check_tuple_header_and_visibilty(HeapTupleHeader tuphdr,
-											 HeapCheckContext *ctx);
+static bool check_tuple_header(HeapCheckContext *ctx);
+static bool check_tuple_visibility(HeapCheckContext *ctx);
 
 static void report_corruption(HeapCheckContext *ctx, char *msg);
 static TupleDesc verify_heapam_tupdesc(void);
@@ -248,6 +254,12 @@ verify_heapam(PG_FUNCTION_ARGS)
 	memset(&ctx, 0, sizeof(HeapCheckContext));
 	ctx.cached_xid = InvalidTransactionId;
 
+	/*
+	 * Any xmin newer than the xmin of our snapshot can't become all-visible
+	 * while we're running.
+	 */
+	ctx.safe_xmin = GetTransactionSnapshot()->xmin;
+
 	/*
 	 * If we report corruption when not examining some individual attribute,
 	 * we need attnum to be reported as NULL.  Set that up before any
@@ -555,16 +567,11 @@ verify_heapam_tupdesc(void)
 }
 
 /*
- * Check for tuple header corruption and tuple visibility.
- *
- * Since we do not hold a snapshot, tuple visibility is not a question of
- * whether we should be able to see the tuple relative to any particular
- * snapshot, but rather a question of whether it is safe and reasonable to
- * check the tuple attributes.
+ * Check for tuple header corruption.
  *
  * Some kinds of corruption make it unsafe to check the tuple attributes, for
  * example when the line pointer refers to a range of bytes outside the page.
- * In such cases, we return false (not visible) after recording appropriate
+ * In such cases, we return false (not checkable) after recording appropriate
  * corruption messages.
  *
  * Some other kinds of tuple header corruption confuse the question of where
@@ -576,29 +583,18 @@ verify_heapam_tupdesc(void)
  *
  * Other kinds of tuple header corruption do not bear on the question of
  * whether the tuple attributes can be checked, so we record corruption
- * messages for them but do not base our visibility determination on them.  (In
- * other words, we do not return false merely because we detected them.)
- *
- * For visibility determination not specifically related to corruption, what we
- * want to know is if a tuple is potentially visible to any running
- * transaction.  If you are tempted to replace this function's visibility logic
- * with a call to another visibility checking function, keep in mind that this
- * function does not update hint bits, as it seems imprudent to write hint bits
- * (or anything at all) to a table during a corruption check.  Nor does this
- * function bother classifying tuple visibility beyond a boolean visible vs.
- * not visible.
- *
- * The caller should already have checked that xmin and xmax are not out of
- * bounds for the relation.
+ * messages for them but we do not return false merely because we detected
+ * them.
  *
- * Returns whether the tuple is both visible and sufficiently sensible to
- * undergo attribute checks.
+ * Returns whether the tuple is sufficiently sensible to undergo visibility and
+ * attribute checks.
  */
 static bool
-check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
+check_tuple_header(HeapCheckContext *ctx)
 {
+	HeapTupleHeader tuphdr = ctx->tuphdr;
 	uint16		infomask = tuphdr->t_infomask;
-	bool		header_garbled = false;
+	bool		result = true;
 	unsigned	expected_hoff;
 
 	if (ctx->tuphdr->t_hoff > ctx->lp_len)
@@ -606,7 +602,7 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
 		report_corruption(ctx,
 						  psprintf("data begins at offset %u beyond the tuple length %u",
 								   ctx->tuphdr->t_hoff, ctx->lp_len));
-		header_garbled = true;
+		result = false;
 	}
 
 	if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) &&
@@ -616,9 +612,9 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
 						  pstrdup("multixact should not be marked committed"));
 
 		/*
-		 * This condition is clearly wrong, but we do not consider the header
-		 * garbled, because we don't rely on this property for determining if
-		 * the tuple is visible or for interpreting other relevant header
+		 * This condition is clearly wrong, but it's not enough to justify
+		 * skipping further checks, because we don't rely on this to determine
+		 * whether the tuple is visible or to interpret other relevant header
 		 * fields.
 		 */
 	}
@@ -645,175 +641,449 @@ check_tuple_header_and_visibilty(HeapTupleHeader tuphdr, HeapCheckContext *ctx)
 			report_corruption(ctx,
 							  psprintf("tuple data should begin at byte %u, but actually begins at byte %u (%u attributes, no nulls)",
 									   expected_hoff, ctx->tuphdr->t_hoff, ctx->natts));
-		header_garbled = true;
+		result = false;
 	}
 
-	if (header_garbled)
-		return false;			/* checking of this tuple should not continue */
+	return result;
+}
+
+/*
+ * Checks tuple visibility so we know which further checks are safe to
+ * perform.
+ *
+ * If a tuple could have been inserted by a transaction that also added a
+ * column to the table, but which ultimately did not commit, or which has not
+ * yet committed, then the table's current TupleDesc might differ from the one
+ * used to construct this tuple, so we must not check it.
+ *
+ * As a special case, if our own transaction inserted the tuple, even if we
+ * added a column to the table, our TupleDesc should match.  We could check the
+ * tuple, but choose not to do so.
+ *
+ * If a tuple has been updated or deleted, we can still read the old tuple for
+ * corruption checking purposes, as long as we are careful about concurrent
+ * vacuums.  The main table tuple itself cannot be vacuumed away because we
+ * hold a buffer lock on the page, but if the deleting transaction is older
+ * than our transaction snapshot's xmin, then vacuum could remove the toast at
+ * any time, so we must not try to follow TOAST pointers.
+ *
+ * If xmin or xmax values are older than can be checked against clog, or appear
+ * to be in the future (possibly due to wrap-around), then we cannot make a
+ * determination about the visibility of the tuple, so we skip further checks.
+ *
+ * Returns true if the tuple itself should be checked, false otherwise.  Sets
+ * ctx->tuple_could_be_pruned if the tuple -- and thus also any associated
+ * TOAST tuples -- are eligible for pruning.
+ */
+static bool
+check_tuple_visibility(HeapCheckContext *ctx)
+{
+	TransactionId xmin;
+	TransactionId xvac;
+	TransactionId xmax;
+	XidCommitStatus xmin_status;
+	XidCommitStatus xvac_status;
+	XidCommitStatus xmax_status;
+	HeapTupleHeader tuphdr = ctx->tuphdr;
+
+	ctx->tuple_could_be_pruned = true;	/* have not yet proven otherwise */
+
+	/* If xmin is normal, it should be within valid range */
+	xmin = HeapTupleHeaderGetXmin(tuphdr);
+	switch (get_xid_status(xmin, ctx, &xmin_status))
+	{
+		case XID_INVALID:
+		case XID_BOUNDS_OK:
+			break;
+		case XID_IN_FUTURE:
+			report_corruption(ctx,
+							  psprintf("xmin %u equals or exceeds next valid transaction ID %u:%u",
+									   xmin,
+									   EpochFromFullTransactionId(ctx->next_fxid),
+									   XidFromFullTransactionId(ctx->next_fxid)));
+			return false;
+		case XID_PRECEDES_CLUSTERMIN:
+			report_corruption(ctx,
+							  psprintf("xmin %u precedes oldest valid transaction ID %u:%u",
+									   xmin,
+									   EpochFromFullTransactionId(ctx->oldest_fxid),
+									   XidFromFullTransactionId(ctx->oldest_fxid)));
+			return false;
+		case XID_PRECEDES_RELMIN:
+			report_corruption(ctx,
+							  psprintf("xmin %u precedes relation freeze threshold %u:%u",
+									   xmin,
+									   EpochFromFullTransactionId(ctx->relfrozenfxid),
+									   XidFromFullTransactionId(ctx->relfrozenfxid)));
+			return false;
+	}
 
 	/*
-	 * Ok, we can examine the header for tuple visibility purposes, though we
-	 * still need to be careful about a few remaining types of header
-	 * corruption.  This logic roughly follows that of
-	 * HeapTupleSatisfiesVacuum.  Where possible the comments indicate which
-	 * HTSV_Result we think that function might return for this tuple.
+	 * Has inserting transaction committed?
 	 */
 	if (!HeapTupleHeaderXminCommitted(tuphdr))
 	{
-		TransactionId raw_xmin = HeapTupleHeaderGetRawXmin(tuphdr);
-
 		if (HeapTupleHeaderXminInvalid(tuphdr))
-			return false;		/* HEAPTUPLE_DEAD */
+			return false;		/* inserter aborted, don't check */
 		/* Used by pre-9.0 binary upgrades */
-		else if (infomask & HEAP_MOVED_OFF ||
-				 infomask & HEAP_MOVED_IN)
+		else if (tuphdr->t_infomask & HEAP_MOVED_OFF)
 		{
-			XidCommitStatus status;
-			TransactionId xvac = HeapTupleHeaderGetXvac(tuphdr);
+			xvac = HeapTupleHeaderGetXvac(tuphdr);
 
-			switch (get_xid_status(xvac, ctx, &status))
+			switch (get_xid_status(xvac, ctx, &xvac_status))
 			{
 				case XID_INVALID:
 					report_corruption(ctx,
-									  pstrdup("old-style VACUUM FULL transaction ID is invalid"));
-					return false;	/* corrupt */
+									  pstrdup("old-style VACUUM FULL transaction ID for moved off tuple is invalid"));
+					return false;
 				case XID_IN_FUTURE:
 					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u equals or exceeds next valid transaction ID %u:%u",
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple equals or exceeds next valid transaction ID %u:%u",
 											   xvac,
 											   EpochFromFullTransactionId(ctx->next_fxid),
 											   XidFromFullTransactionId(ctx->next_fxid)));
-					return false;	/* corrupt */
+					return false;
 				case XID_PRECEDES_RELMIN:
 					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u precedes relation freeze threshold %u:%u",
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple precedes relation freeze threshold %u:%u",
 											   xvac,
 											   EpochFromFullTransactionId(ctx->relfrozenfxid),
 											   XidFromFullTransactionId(ctx->relfrozenfxid)));
-					return false;	/* corrupt */
-					break;
+					return false;
 				case XID_PRECEDES_CLUSTERMIN:
 					report_corruption(ctx,
-									  psprintf("old-style VACUUM FULL transaction ID %u precedes oldest valid transaction ID %u:%u",
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple precedes oldest valid transaction ID %u:%u",
 											   xvac,
 											   EpochFromFullTransactionId(ctx->oldest_fxid),
 											   XidFromFullTransactionId(ctx->oldest_fxid)));
-					return false;	/* corrupt */
-					break;
+					return false;
 				case XID_BOUNDS_OK:
-					switch (status)
-					{
-						case XID_IN_PROGRESS:
-							return true;	/* HEAPTUPLE_DELETE_IN_PROGRESS */
-						case XID_COMMITTED:
-						case XID_ABORTED:
-							return false;	/* HEAPTUPLE_DEAD */
-					}
+					break;
+			}
+
+			switch (xvac_status)
+			{
+				case XID_IS_CURRENT_XID:
+					report_corruption(ctx,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple matches our current transaction ID",
+											   xvac));
+					return false;
+				case XID_IN_PROGRESS:
+					report_corruption(ctx,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved off tuple appears to be in progress",
+											   xvac));
+					return false;
+
+				case XID_COMMITTED:
+					/*
+					 * The tuple is dead, because the xvac transaction moved
+					 * it off and committed. It's checkable, but also prunable.
+					 */
+					return true;
+
+				case XID_ABORTED:
+					/*
+					 * The original xmin must have committed, because the xvac
+					 * transaction tried to move it later. Since xvac is
+					 * aborted, whether it's still alive now depends on the
+					 * status of xmax.
+					 */
+					break;
 			}
 		}
-		else
+		/* Used by pre-9.0 binary upgrades */
+		else if (tuphdr->t_infomask & HEAP_MOVED_IN)
 		{
-			XidCommitStatus status;
+			xvac = HeapTupleHeaderGetXvac(tuphdr);
 
-			switch (get_xid_status(raw_xmin, ctx, &status))
+			switch (get_xid_status(xvac, ctx, &xvac_status))
 			{
 				case XID_INVALID:
 					report_corruption(ctx,
-									  pstrdup("raw xmin is invalid"));
+									  pstrdup("old-style VACUUM FULL transaction ID for moved in tuple is invalid"));
 					return false;
 				case XID_IN_FUTURE:
 					report_corruption(ctx,
-									  psprintf("raw xmin %u equals or exceeds next valid transaction ID %u:%u",
-											   raw_xmin,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple equals or exceeds next valid transaction ID %u:%u",
+											   xvac,
 											   EpochFromFullTransactionId(ctx->next_fxid),
 											   XidFromFullTransactionId(ctx->next_fxid)));
-					return false;	/* corrupt */
+					return false;
 				case XID_PRECEDES_RELMIN:
 					report_corruption(ctx,
-									  psprintf("raw xmin %u precedes relation freeze threshold %u:%u",
-											   raw_xmin,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple precedes relation freeze threshold %u:%u",
+											   xvac,
 											   EpochFromFullTransactionId(ctx->relfrozenfxid),
 											   XidFromFullTransactionId(ctx->relfrozenfxid)));
-					return false;	/* corrupt */
+					return false;
 				case XID_PRECEDES_CLUSTERMIN:
 					report_corruption(ctx,
-									  psprintf("raw xmin %u precedes oldest valid transaction ID %u:%u",
-											   raw_xmin,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple precedes oldest valid transaction ID %u:%u",
+											   xvac,
 											   EpochFromFullTransactionId(ctx->oldest_fxid),
 											   XidFromFullTransactionId(ctx->oldest_fxid)));
-					return false;	/* corrupt */
+					return false;
 				case XID_BOUNDS_OK:
-					switch (status)
-					{
-						case XID_COMMITTED:
-							break;
-						case XID_IN_PROGRESS:
-							return true;	/* insert or delete in progress */
-						case XID_ABORTED:
-							return false;	/* HEAPTUPLE_DEAD */
-					}
+					break;
 			}
+
+			switch (xvac_status)
+			{
+				case XID_IS_CURRENT_XID:
+					report_corruption(ctx,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple matches our current transaction ID",
+											   xvac));
+					return false;
+				case XID_IN_PROGRESS:
+					report_corruption(ctx,
+									  psprintf("old-style VACUUM FULL transaction ID %u for moved in tuple appears to be in progress",
+											   xvac));
+					return false;
+
+				case XID_COMMITTED:
+					/*
+					 * The original xmin must have committed, because the xvac
+					 * transaction moved it later. Whether it's still alive
+					 * now depends on the status of xmax.
+					 */
+					break;
+
+				case XID_ABORTED:
+					/*
+					 * The tuple is dead, because the xvac transaction moved
+					 * The tuple is dead, because the xvac transaction that
+					 * moved it in aborted. It's checkable, but also prunable.
+					return true;
+			}
+		}
+		else if (xmin_status != XID_COMMITTED)
+		{
+			/*
+			 * Inserting transaction is not in progress, and not committed, so
+			 * it might have changed the TupleDesc in ways we don't know about.
+			 * Thus, don't try to check the tuple structure.
+			 *
+			 * If xmin_status happens to be XID_IS_CURRENT_XID, then in theory
+			 * any such DDL changes ought to be visible to us, so perhaps
+			 * we could check anyway in that case. But, for now, let's be
+			 * conservative and treat this like any other uncommitted insert.
+			 */
+			return false;
 		}
 	}
 
-	if (!(infomask & HEAP_XMAX_INVALID) && !HEAP_XMAX_IS_LOCKED_ONLY(infomask))
+	/*
+	 * Okay, the inserter committed, so it was good at some point.  Now what
+	 * about the deleting transaction?
+	 */
+
+	if (tuphdr->t_infomask & HEAP_XMAX_IS_MULTI)
 	{
-		if (infomask & HEAP_XMAX_IS_MULTI)
+		/*
+		 * xmax is a multixact, so sanity-check the MXID. Note that we do this
+		 * prior to checking for HEAP_XMAX_INVALID or HEAP_XMAX_IS_LOCKED_ONLY.
+		 * This might therefore complain about things that wouldn't actually
+		 * be a problem during a normal scan, but eventually we're going to
+		 * have to freeze, and that process will ignore hint bits.
+		 *
+		 * Even if the MXID is out of range, we still know that the original
+		 * insert committed, so we can check the tuple itself. However, we
+		 * can't rule out the possibility that this tuple is dead, so don't
+		 * clear ctx->tuple_could_be_pruned. Possibly we should go ahead and
+		 * clear that flag anyway if HEAP_XMAX_INVALID is set or if
+		 * HEAP_XMAX_IS_LOCKED_ONLY is true, but for now we err on the side
+		 * of avoiding possibly-bogus complaints about missing TOAST entries.
+		 */
+		xmax = HeapTupleHeaderGetRawXmax(tuphdr);
+		switch (check_mxid_valid_in_rel(xmax, ctx))
 		{
-			XidCommitStatus status;
-			TransactionId xmax = HeapTupleGetUpdateXid(tuphdr);
+			case XID_INVALID:
+				report_corruption(ctx,
+								  pstrdup("multitransaction ID is invalid"));
+				return true;
+			case XID_PRECEDES_RELMIN:
+				report_corruption(ctx,
+								  psprintf("multitransaction ID %u precedes relation minimum multitransaction ID threshold %u",
+										   xmax, ctx->relminmxid));
+				return true;
+			case XID_PRECEDES_CLUSTERMIN:
+				report_corruption(ctx,
+								  psprintf("multitransaction ID %u precedes oldest valid multitransaction ID threshold %u",
+										   xmax, ctx->oldest_mxact));
+				return true;
+			case XID_IN_FUTURE:
+				report_corruption(ctx,
+								  psprintf("multitransaction ID %u equals or exceeds next valid multitransaction ID %u",
+										   xmax,
+										   ctx->next_mxact));
+				return true;
+			case XID_BOUNDS_OK:
+				break;
+		}
+	}
 
-			switch (get_xid_status(xmax, ctx, &status))
-			{
-					/* not LOCKED_ONLY, so it has to have an xmax */
-				case XID_INVALID:
-					report_corruption(ctx,
-									  pstrdup("xmax is invalid"));
-					return false;	/* corrupt */
-				case XID_IN_FUTURE:
-					report_corruption(ctx,
-									  psprintf("xmax %u equals or exceeds next valid transaction ID %u:%u",
-											   xmax,
-											   EpochFromFullTransactionId(ctx->next_fxid),
-											   XidFromFullTransactionId(ctx->next_fxid)));
-					return false;	/* corrupt */
-				case XID_PRECEDES_RELMIN:
-					report_corruption(ctx,
-									  psprintf("xmax %u precedes relation freeze threshold %u:%u",
-											   xmax,
-											   EpochFromFullTransactionId(ctx->relfrozenfxid),
-											   XidFromFullTransactionId(ctx->relfrozenfxid)));
-					return false;	/* corrupt */
-				case XID_PRECEDES_CLUSTERMIN:
-					report_corruption(ctx,
-									  psprintf("xmax %u precedes oldest valid transaction ID %u:%u",
-											   xmax,
-											   EpochFromFullTransactionId(ctx->oldest_fxid),
-											   XidFromFullTransactionId(ctx->oldest_fxid)));
-					return false;	/* corrupt */
-				case XID_BOUNDS_OK:
-					switch (status)
-					{
-						case XID_IN_PROGRESS:
-							return true;	/* HEAPTUPLE_DELETE_IN_PROGRESS */
-						case XID_COMMITTED:
-						case XID_ABORTED:
-							return false;	/* HEAPTUPLE_RECENTLY_DEAD or
-											 * HEAPTUPLE_DEAD */
-					}
-			}
+	if (tuphdr->t_infomask & HEAP_XMAX_INVALID)
+	{
+		/*
+		 * This tuple is live.  A concurrently running transaction could
+		 * delete it before we get around to checking the toast, but any such
+		 * running transaction is surely not less than our safe_xmin, so the
+		 * toast cannot be vacuumed out from under us.
+		 */
+		ctx->tuple_could_be_pruned = false;
+		return true;
+	}
 
-			/* Ok, the tuple is live */
+	if (HEAP_XMAX_IS_LOCKED_ONLY(tuphdr->t_infomask))
+	{
+		/*
+		 * "Deleting" xact really only locked it, so the tuple is live in any
+		 * case.  As above, a concurrently running transaction could delete
+		 * it, but it cannot be vacuumed out from under us.
+		 */
+		ctx->tuple_could_be_pruned = false;
+		return true;
+	}
+
+	if (tuphdr->t_infomask & HEAP_XMAX_IS_MULTI)
+	{
+		/*
+		 * We already checked above that this multixact is within limits for
+		 * this table.  Now check the update xid from this multixact.
+		 */
+		xmax = HeapTupleGetUpdateXid(tuphdr);
+		switch (get_xid_status(xmax, ctx, &xmax_status))
+		{
+			case XID_INVALID:
+				/* not LOCKED_ONLY, so it has to have an xmax */
+				report_corruption(ctx,
+								  pstrdup("update xid is invalid"));
+				return true;
+			case XID_IN_FUTURE:
+				report_corruption(ctx,
+								  psprintf("update xid %u equals or exceeds next valid transaction ID %u:%u",
+										   xmax,
+										   EpochFromFullTransactionId(ctx->next_fxid),
+										   XidFromFullTransactionId(ctx->next_fxid)));
+				return true;
+			case XID_PRECEDES_RELMIN:
+				report_corruption(ctx,
+								  psprintf("update xid %u precedes relation freeze threshold %u:%u",
+										   xmax,
+										   EpochFromFullTransactionId(ctx->relfrozenfxid),
+										   XidFromFullTransactionId(ctx->relfrozenfxid)));
+				return true;
+			case XID_PRECEDES_CLUSTERMIN:
+				report_corruption(ctx,
+								  psprintf("update xid %u precedes oldest valid transaction ID %u:%u",
+										   xmax,
+										   EpochFromFullTransactionId(ctx->oldest_fxid),
+										   XidFromFullTransactionId(ctx->oldest_fxid)));
+				return true;
+			case XID_BOUNDS_OK:
+				break;
 		}
-		else if (!(infomask & HEAP_XMAX_COMMITTED))
-			return true;		/* HEAPTUPLE_DELETE_IN_PROGRESS or
-								 * HEAPTUPLE_LIVE */
-		else
-			return false;		/* HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD */
+
+		switch (xmax_status)
+		{
+			case XID_IS_CURRENT_XID:
+			case XID_IN_PROGRESS:
+
+				/*
+				 * The delete is in progress, so it cannot be visible to our
+				 * snapshot.
+				 */
+				ctx->tuple_could_be_pruned = false;
+				break;
+			case XID_COMMITTED:
+
+				/*
+				 * The delete committed.  Whether the toast can be vacuumed
+				 * away depends on how old the deleting transaction is.
+				 */
+				ctx->tuple_could_be_pruned = TransactionIdPrecedes(xmax,
+																 ctx->safe_xmin);
+				break;
+			case XID_ABORTED:
+				/*
+				 * The delete aborted or crashed.  The tuple is still live.
+				 */
+				ctx->tuple_could_be_pruned = false;
+				break;
+		}
+
+		/* Tuple itself is checkable even if it's dead. */
+		return true;
+	}
+
+	/* xmax is an XID, not a MXID. Sanity check it. */
+	xmax = HeapTupleHeaderGetRawXmax(tuphdr);
+	switch (get_xid_status(xmax, ctx, &xmax_status))
+	{
+		case XID_IN_FUTURE:
+			report_corruption(ctx,
+							  psprintf("xmax %u equals or exceeds next valid transaction ID %u:%u",
+									   xmax,
+									   EpochFromFullTransactionId(ctx->next_fxid),
+									   XidFromFullTransactionId(ctx->next_fxid)));
+			return false;		/* corrupt */
+		case XID_PRECEDES_RELMIN:
+			report_corruption(ctx,
+							  psprintf("xmax %u precedes relation freeze threshold %u:%u",
+									   xmax,
+									   EpochFromFullTransactionId(ctx->relfrozenfxid),
+									   XidFromFullTransactionId(ctx->relfrozenfxid)));
+			return false;		/* corrupt */
+		case XID_PRECEDES_CLUSTERMIN:
+			report_corruption(ctx,
+							  psprintf("xmax %u precedes oldest valid transaction ID %u:%u",
+									   xmax,
+									   EpochFromFullTransactionId(ctx->oldest_fxid),
+									   XidFromFullTransactionId(ctx->oldest_fxid)));
+			return false;		/* corrupt */
+		case XID_BOUNDS_OK:
+		case XID_INVALID:
+			break;
 	}
-	return true;				/* not dead */
+
+	/*
+	 * Whether the toast can be vacuumed away depends on how old the deleting
+	 * transaction is.
+	 */
+	switch (xmax_status)
+	{
+		case XID_IS_CURRENT_XID:
+		case XID_IN_PROGRESS:
+
+			/*
+			 * The delete is in progress, so it cannot be visible to our
+			 * snapshot.
+			 */
+			ctx->tuple_could_be_pruned = false;
+			break;
+
+		case XID_COMMITTED:
+			/*
+			 * The delete committed.  Whether the toast can be vacuumed away
+			 * depends on how old the deleting transaction is.
+			 */
+			ctx->tuple_could_be_pruned = TransactionIdPrecedes(xmax,
+															 ctx->safe_xmin);
+			break;
+
+		case XID_ABORTED:
+			/*
+			 * The delete aborted or crashed.  The tuple is still live.
+			 */
+			ctx->tuple_could_be_pruned = false;
+			break;
+	}
+
+	/* Tuple itself is checkable even if it's dead. */
+	return true;
 }
 
+
 /*
  * Check the current toast tuple against the state tracked in ctx, recording
  * any corruption found in ctx->tupstore.
@@ -1247,7 +1517,10 @@ check_tuple(HeapCheckContext *ctx)
 	 * corrupt to continue checking, or if the tuple is not visible to anyone,
 	 * we cannot continue with other checks.
 	 */
-	if (!check_tuple_header_and_visibilty(ctx->tuphdr, ctx))
+	if (!check_tuple_header(ctx))
+		return;
+
+	if (!check_tuple_visibility(ctx))
 		return;
 
 	/*
@@ -1448,13 +1721,13 @@ get_xid_status(TransactionId xid, HeapCheckContext *ctx,
 	if (FullTransactionIdPrecedesOrEquals(clog_horizon, fxid))
 	{
 		if (TransactionIdIsCurrentTransactionId(xid))
+			*status = XID_IS_CURRENT_XID;
+		else if (TransactionIdIsInProgress(xid))
 			*status = XID_IN_PROGRESS;
 		else if (TransactionIdDidCommit(xid))
 			*status = XID_COMMITTED;
-		else if (TransactionIdDidAbort(xid))
-			*status = XID_ABORTED;
 		else
-			*status = XID_IN_PROGRESS;
+			*status = XID_ABORTED;
 	}
 	LWLockRelease(XactTruncationLock);
 	ctx->cached_xid = xid;
-- 
2.24.3 (Apple Git-128)

#120 Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#119)
Re: pg_amcheck contrib application

On Apr 1, 2021, at 10:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Apr 1, 2021 at 1:06 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

Seems fine other than the typo.

OK, let's try that again.

Looks good!


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#121 Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#120)
Re: pg_amcheck contrib application

On Thu, Apr 1, 2021 at 1:24 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

On Apr 1, 2021, at 10:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
OK, let's try that again.

Looks good!

OK, committed. We still need to deal with what you had as 0003
upthread, so I guess the next step is for me to spend some time
reviewing that one.

--
Robert Haas
EDB: http://www.enterprisedb.com

#122 Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#121)
Re: pg_amcheck contrib application

On Thu, Apr 1, 2021 at 1:41 PM Robert Haas <robertmhaas@gmail.com> wrote:

OK, committed. We still need to deal with what you had as 0003
upthread, so I guess the next step is for me to spend some time
reviewing that one.

I did this, and it was a bit depressing. It appears that we now have
duplicate checks for xmin and xmax being out of the valid range.
Somehow you have the removal of those duplicate checks in 0003, but
why in the world didn't you put them into one of the previous patches
and just make 0003 about fixing the
holding-buffer-lock-while-following-TOAST-pointers problem? (And, gah,
why did I not look carefully enough to notice that you hadn't done
that?)

Other than that, I notice a few other things:

- There are a total of two (2) calls in the current source code to
palloc0fast, and hundreds of calls to palloc0. So I think you should
forget about using the fast variant and just do what almost every
other caller does.

- If you want to make this code faster, a better idea would be to
avoid doing all of this allocate and free work and just allocate an
array that's guaranteed to be big enough, and then keep track of how
many elements of that array are actually in use.

- #ifdef DECOMPRESSION_CORRUPTION_CHECKING is not a useful way of
introducing such a feature. Either we do it for real and expose it via
SQL and pg_amcheck as an optional behavior, or we rip it out and
revisit it later. Given the nearness of feature freeze, my vote is for
the latter.

- I'd remove the USE_LZ4 bit, too. Let's not define the presence of
LZ4 data in a non-LZ4-enabled cluster as corruption. If we do, then
people will expect to be able to use this to find places where they
are dependent on LZ4 if they want to move away from it -- and if we
don't recurse into composite datums, that will not actually work.

- check_toast_tuple() has an odd and slightly unclear calling
convention for which there are no comments. I wonder if it would be
better to reverse things and make bool *error the return value and
what is now the return value into a pointer argument, but whatever we
do I think it needs a few words in the comments. We don't need to
slavishly explain every argument -- I think toasttup and ctx and tctx
are reasonably clear -- but this is not.

- To me it would be more logical to reverse the order of the
toast_pointer.va_toastrelid != ctx->rel->rd_rel->reltoastrelid and
VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer) > toast_pointer.va_rawsize
- VARHDRSZ checks. Whether we're pointing at the correct relation
feels more fundamental.

- If we moved the toplevel foreach loop in check_toasted_attributes()
out to the caller, say renaming the function to just
check_toasted_attribute(), we'd save a level of indentation in that
whole function and probably add a tad of clarity, too. You wouldn't
feel the need to Assert(ctx.toasted_attributes == NIL) in the caller
if the caller had just done list_free(ctx->toasted_attributes);
ctx->toasted_attributes = NIL.

- Is there a reason we need a cross-check on both the number of chunks
and on the total size? It seems to me that we should check that each
individual chunk has the size we expect, and that the total number of
chunks is what we expect. The overall size is then necessarily
correct.

- Why are all the changes to the tests in this patch? What do they
have to do with getting the TOAST checks out from under the buffer
lock? I really need you to structure the patch series so that each
patch is about one topic and, equally, so that each topic is only
covered by one patch. Otherwise it's just way too confusing.

- I think some of these messages need a bit of word-smithing, too, but
we can leave that for when we're closer to being done with this.

--
Robert Haas
EDB: http://www.enterprisedb.com

#123 Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#122)
4 attachment(s)
Re: pg_amcheck contrib application

On Apr 1, 2021, at 1:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:

- There are a total of two (2) calls in the current source code to
palloc0fast, and hundreds of calls to palloc0. So I think you should
forget about using the fast variant and just do what almost every
other caller does.

Done.

- If you want to make this code faster, a better idea would be to
avoid doing all of this allocate and free work and just allocate an
array that's guaranteed to be big enough, and then keep track of how
many elements of that array are actually in use.

Sounds like premature optimization to me. I only used palloc0fast because the argument is known at compile time. I wasn't specifically attempting to speed anything up.

- #ifdef DECOMPRESSION_CORRUPTION_CHECKING is not a useful way of
introducing such a feature. Either we do it for real and expose it via
SQL and pg_amcheck as an optional behavior, or we rip it out and
revisit it later. Given the nearness of feature freeze, my vote is for
the latter.

- I'd remove the USE_LZ4 bit, too. Let's not define the presence of
LZ4 data in a non-LZ4-enabled cluster as corruption. If we do, then
people will expect to be able to use this to find places where they
are dependent on LZ4 if they want to move away from it -- and if we
don't recurse into composite datums, that will not actually work.

Ok, I have removed this bit. I also removed the part of the patch that introduced a new corruption check, decompressing the data to see if it decompresses without error.

- check_toast_tuple() has an odd and slightly unclear calling
convention for which there are no comments. I wonder if it would be
better to reverse things and make bool *error the return value and
what is now the return value into a pointer argument, but whatever we
do I think it needs a few words in the comments. We don't need to
slavishly explain every argument -- I think toasttup and ctx and tctx
are reasonably clear -- but this is not.

...

- Is there a reason we need a cross-check on both the number of chunks
and on the total size? It seems to me that we should check that each
individual chunk has the size we expect, and that the total number of
chunks is what we expect. The overall size is then necessarily
correct.

Good point. I've removed the extra check on the total size, since it cannot be wrong if the checks on the individual chunk sizes were all correct. This eliminates the need for the odd calling convention for check_toast_tuple(), so I've changed that to return void and not take any return-by-reference arguments.

- To me it would be more logical to reverse the order of the
toast_pointer.va_toastrelid != ctx->rel->rd_rel->reltoastrelid and
VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer) > toast_pointer.va_rawsize
- VARHDRSZ checks. Whether we're pointing at the correct relation
feels more fundamental.

Done.

- If we moved the toplevel foreach loop in check_toasted_attributes()
out to the caller, say renaming the function to just
check_toasted_attribute(), we'd save a level of indentation in that
whole function and probably add a tad of clarity, too. You wouldn't
feel the need to Assert(ctx.toasted_attributes == NIL) in the caller
if the caller had just done list_free(ctx->toasted_attributes);
ctx->toasted_attributes = NIL.

You're right. It looks nicer that way. Changed.

- Why are all the changes to the tests in this patch? What do they
have to do with getting the TOAST checks out from under the buffer
lock? I really need you to structure the patch series so that each
patch is about one topic and, equally, so that each topic is only
covered by one patch. Otherwise it's just way too confusing.

v18-0001 - Finishes work started in commit 3b6c1259f9 that was overlooked owing to how I had split the changes between v17-0002 and v17-0003.

v18-0002 - Postpones the toast checks for a page until after the main table page lock is released.

v18-0003 - Improves the corruption messages in ways already discussed earlier in this thread. Changes the tests to expect the new messages, but adds no new checks.

v18-0004 - Adds corruption checks of toast pointers. Extends the regression tests to cover the new checks.

- I think some of these messages need a bit of word-smithing, too, but
we can leave that for when we're closer to being done with this.

Ok.

Attachments:

v18-0001-amcheck-remove-duplicate-xid-bounds-checks.patch (application/octet-stream)
From 5ff174db4de8c35f73c3bb9ffc51d854686cb1c5 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Fri, 2 Apr 2021 15:03:12 -0700
Subject: [PATCH v18 1/4] amcheck: remove duplicate xid bounds checks

Commit 3b6c1259f9 resulted in the same xmin and xmax bounds checking
being performed in both check_tuple() and check_tuple_visibility().
Leaving the ones in check_tuple_visibility() and removing the
others.

While at it, adjusting some code comments that should have been
changed in 3b6c1259f9 but were not.
---
 contrib/amcheck/verify_heapam.c | 130 ++------------------------------
 1 file changed, 6 insertions(+), 124 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 11ace483d0..1d769035f1 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -1390,136 +1390,18 @@ check_tuple_attribute(HeapCheckContext *ctx)
 static void
 check_tuple(HeapCheckContext *ctx)
 {
-	TransactionId xmin;
-	TransactionId xmax;
-	bool		fatal = false;
-	uint16		infomask = ctx->tuphdr->t_infomask;
-
-	/* If xmin is normal, it should be within valid range */
-	xmin = HeapTupleHeaderGetXmin(ctx->tuphdr);
-	switch (get_xid_status(xmin, ctx, NULL))
-	{
-		case XID_INVALID:
-		case XID_BOUNDS_OK:
-			break;
-		case XID_IN_FUTURE:
-			report_corruption(ctx,
-							  psprintf("xmin %u equals or exceeds next valid transaction ID %u:%u",
-									   xmin,
-									   EpochFromFullTransactionId(ctx->next_fxid),
-									   XidFromFullTransactionId(ctx->next_fxid)));
-			fatal = true;
-			break;
-		case XID_PRECEDES_CLUSTERMIN:
-			report_corruption(ctx,
-							  psprintf("xmin %u precedes oldest valid transaction ID %u:%u",
-									   xmin,
-									   EpochFromFullTransactionId(ctx->oldest_fxid),
-									   XidFromFullTransactionId(ctx->oldest_fxid)));
-			fatal = true;
-			break;
-		case XID_PRECEDES_RELMIN:
-			report_corruption(ctx,
-							  psprintf("xmin %u precedes relation freeze threshold %u:%u",
-									   xmin,
-									   EpochFromFullTransactionId(ctx->relfrozenfxid),
-									   XidFromFullTransactionId(ctx->relfrozenfxid)));
-			fatal = true;
-			break;
-	}
-
-	xmax = HeapTupleHeaderGetRawXmax(ctx->tuphdr);
-
-	if (infomask & HEAP_XMAX_IS_MULTI)
-	{
-		/* xmax is a multixact, so it should be within valid MXID range */
-		switch (check_mxid_valid_in_rel(xmax, ctx))
-		{
-			case XID_INVALID:
-				report_corruption(ctx,
-								  pstrdup("multitransaction ID is invalid"));
-				fatal = true;
-				break;
-			case XID_PRECEDES_RELMIN:
-				report_corruption(ctx,
-								  psprintf("multitransaction ID %u precedes relation minimum multitransaction ID threshold %u",
-										   xmax, ctx->relminmxid));
-				fatal = true;
-				break;
-			case XID_PRECEDES_CLUSTERMIN:
-				report_corruption(ctx,
-								  psprintf("multitransaction ID %u precedes oldest valid multitransaction ID threshold %u",
-										   xmax, ctx->oldest_mxact));
-				fatal = true;
-				break;
-			case XID_IN_FUTURE:
-				report_corruption(ctx,
-								  psprintf("multitransaction ID %u equals or exceeds next valid multitransaction ID %u",
-										   xmax,
-										   ctx->next_mxact));
-				fatal = true;
-				break;
-			case XID_BOUNDS_OK:
-				break;
-		}
-	}
-	else
-	{
-		/*
-		 * xmax is not a multixact and is normal, so it should be within the
-		 * valid XID range.
-		 */
-		switch (get_xid_status(xmax, ctx, NULL))
-		{
-			case XID_INVALID:
-			case XID_BOUNDS_OK:
-				break;
-			case XID_IN_FUTURE:
-				report_corruption(ctx,
-								  psprintf("xmax %u equals or exceeds next valid transaction ID %u:%u",
-										   xmax,
-										   EpochFromFullTransactionId(ctx->next_fxid),
-										   XidFromFullTransactionId(ctx->next_fxid)));
-				fatal = true;
-				break;
-			case XID_PRECEDES_CLUSTERMIN:
-				report_corruption(ctx,
-								  psprintf("xmax %u precedes oldest valid transaction ID %u:%u",
-										   xmax,
-										   EpochFromFullTransactionId(ctx->oldest_fxid),
-										   XidFromFullTransactionId(ctx->oldest_fxid)));
-				fatal = true;
-				break;
-			case XID_PRECEDES_RELMIN:
-				report_corruption(ctx,
-								  psprintf("xmax %u precedes relation freeze threshold %u:%u",
-										   xmax,
-										   EpochFromFullTransactionId(ctx->relfrozenfxid),
-										   XidFromFullTransactionId(ctx->relfrozenfxid)));
-				fatal = true;
-		}
-	}
-
 	/*
-	 * Cannot process tuple data if tuple header was corrupt, as the offsets
-	 * within the page cannot be trusted, leaving too much risk of reading
-	 * garbage if we continue.
-	 *
-	 * We also cannot process the tuple if the xmin or xmax were invalid
-	 * relative to relfrozenxid or relminmxid, as clog entries for the xids
-	 * may already be gone.
+	 * Check various forms of tuple header corruption, and if the header is too
+	 * corrupt, do not continue with other checks.
 	 */
-	if (fatal)
+	if (!check_tuple_header(ctx))
 		return;
 
 	/*
-	 * Check various forms of tuple header corruption.  If the header is too
-	 * corrupt to continue checking, or if the tuple is not visible to anyone,
-	 * we cannot continue with other checks.
+	 * Check tuple visibility.  If the inserting transaction aborted, we
+	 * cannot assume our relation description matches the tuple structure, and
+	 * therefore cannot check it.
 	 */
-	if (!check_tuple_header(ctx))
-		return;
-
 	if (!check_tuple_visibility(ctx))
 		return;
 
-- 
2.21.1 (Apple Git-122.3)

v18-0002-amcheck-avoid-extra-work-while-holding-buffer-lo.patch (application/octet-stream)
From e33e49f1e1325a4439991550789c47fdf54b7220 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Fri, 2 Apr 2021 15:05:07 -0700
Subject: [PATCH v18 2/4] amcheck: avoid extra work while holding buffer locks

Refactoring verify_heapam so it does not look up attributes in the
toast table while holding a buffer lock on the main table page.
Instead, check all toasted attributes found in tuples on a main
table page after the lock has been released.

This changes the user visible behavior slightly, in that when both
toasted attributes and non-toasted attributes from the same page are
corrupt, the corruption reports about non-toasted attributes will
all be emitted prior to any corruption reports about the toasted
attributes.  This shouldn't matter, though, as this function was
first committed in this development cycle, and no users should be
familiar with the old behavior.
---
 contrib/amcheck/verify_heapam.c  | 236 +++++++++++++++++++++----------
 src/tools/pgindent/typedefs.list |   1 +
 2 files changed, 165 insertions(+), 72 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 1d769035f1..d8a3e966c4 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -58,6 +58,19 @@ typedef enum SkipPages
 	SKIP_PAGES_NONE
 } SkipPages;
 
+/*
+ * Struct holding information about a toasted attribute sufficient to both
+ * check the toasted attribute and, if found to be corrupt, to report where it
+ * was encountered in the main table.
+ */
+typedef struct ToastedAttribute
+{
+	struct varatt_external toast_pointer;
+	BlockNumber blkno;			/* block in main table */
+	OffsetNumber offnum;		/* offset in main table */
+	AttrNumber	attnum;			/* attribute in main table */
+} ToastedAttribute;
+
 /*
  * Struct holding the running context information during
  * a lifetime of a verify_heapam execution.
@@ -119,11 +132,11 @@ typedef struct HeapCheckContext
 	/* True if tuple's xmax makes it eligible for pruning */
 	bool		tuple_could_be_pruned;
 
-	/* Values for iterating over toast for the attribute */
-	int32		chunkno;
-	int32		attrsize;
-	int32		endchunk;
-	int32		totalchunks;
+	/*
+	 * List of ToastedAttribute structs for toasted attributes which are not
+	 * eligible for pruning and should be checked
+	 */
+	List	   *toasted_attributes;
 
 	/* Whether verify_heapam has yet encountered any corrupt tuples */
 	bool		is_corrupt;
@@ -136,13 +149,20 @@ typedef struct HeapCheckContext
 /* Internal implementation */
 static void sanity_check_relation(Relation rel);
 static void check_tuple(HeapCheckContext *ctx);
-static void check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx);
+static void check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
+							  ToastedAttribute *ta, int32 chunkno,
+							  int32 endchunk);
 
 static bool check_tuple_attribute(HeapCheckContext *ctx);
+static void check_toasted_attribute(HeapCheckContext *ctx,
+									ToastedAttribute *ta);
+
 static bool check_tuple_header(HeapCheckContext *ctx);
 static bool check_tuple_visibility(HeapCheckContext *ctx);
 
 static void report_corruption(HeapCheckContext *ctx, char *msg);
+static void report_toast_corruption(HeapCheckContext *ctx,
+									ToastedAttribute *ta, char *msg);
 static TupleDesc verify_heapam_tupdesc(void);
 static FullTransactionId FullTransactionIdFromXidAndCtx(TransactionId xid,
 														const HeapCheckContext *ctx);
@@ -253,6 +273,7 @@ verify_heapam(PG_FUNCTION_ARGS)
 
 	memset(&ctx, 0, sizeof(HeapCheckContext));
 	ctx.cached_xid = InvalidTransactionId;
+	ctx.toasted_attributes = NIL;
 
 	/*
 	 * Any xmin newer than the xmin of our snapshot can't become all-visible
@@ -469,6 +490,19 @@ verify_heapam(PG_FUNCTION_ARGS)
 		/* clean up */
 		UnlockReleaseBuffer(ctx.buffer);
 
+		/*
+		 * Check any toast pointers from the page whose lock we just released
+		 */
+		if (ctx.toasted_attributes != NIL)
+		{
+			ListCell   *cell;
+
+			foreach(cell, ctx.toasted_attributes)
+				check_toasted_attribute(&ctx, lfirst(cell));
+			list_free(ctx.toasted_attributes);
+			ctx.toasted_attributes = NIL;
+		}
+
 		if (on_error_stop && ctx.is_corrupt)
 			break;
 	}
@@ -510,14 +544,13 @@ sanity_check_relation(Relation rel)
 }
 
 /*
- * Record a single corruption found in the table.  The values in ctx should
- * reflect the location of the corruption, and the msg argument should contain
- * a human-readable description of the corruption.
- *
- * The msg argument is pfree'd by this function.
+ * Shared internal implementation for report_corruption and
+ * report_toast_corruption.
  */
 static void
-report_corruption(HeapCheckContext *ctx, char *msg)
+report_corruption_internal(Tuplestorestate *tupstore, TupleDesc tupdesc,
+						   BlockNumber blkno, OffsetNumber offnum,
+						   AttrNumber attnum, char *msg)
 {
 	Datum		values[HEAPCHECK_RELATION_COLS];
 	bool		nulls[HEAPCHECK_RELATION_COLS];
@@ -525,10 +558,10 @@ report_corruption(HeapCheckContext *ctx, char *msg)
 
 	MemSet(values, 0, sizeof(values));
 	MemSet(nulls, 0, sizeof(nulls));
-	values[0] = Int64GetDatum(ctx->blkno);
-	values[1] = Int32GetDatum(ctx->offnum);
-	values[2] = Int32GetDatum(ctx->attnum);
-	nulls[2] = (ctx->attnum < 0);
+	values[0] = Int64GetDatum(blkno);
+	values[1] = Int32GetDatum(offnum);
+	values[2] = Int32GetDatum(attnum);
+	nulls[2] = (attnum < 0);
 	values[3] = CStringGetTextDatum(msg);
 
 	/*
@@ -541,8 +574,39 @@ report_corruption(HeapCheckContext *ctx, char *msg)
 	 */
 	pfree(msg);
 
-	tuple = heap_form_tuple(ctx->tupdesc, values, nulls);
-	tuplestore_puttuple(ctx->tupstore, tuple);
+	tuple = heap_form_tuple(tupdesc, values, nulls);
+	tuplestore_puttuple(tupstore, tuple);
+}
+
+/*
+ * Record a single corruption found in the main table.  The values in ctx should
+ * indicate the location of the corruption, and the msg argument should contain
+ * a human-readable description of the corruption.
+ *
+ * The msg argument is pfree'd by this function.
+ */
+static void
+report_corruption(HeapCheckContext *ctx, char *msg)
+{
+	report_corruption_internal(ctx->tupstore, ctx->tupdesc, ctx->blkno,
+							   ctx->offnum, ctx->attnum, msg);
+	ctx->is_corrupt = true;
+}
+
+/*
+ * Record corruption found in the toast table.  The values in ta should
+ * indicate the location in the main table where the toast pointer was
+ * encountered, and the msg argument should contain a human-readable
+ * description of the toast table corruption.
+ *
+ * As above, the msg argument is pfree'd by this function.
+ */
+static void
+report_toast_corruption(HeapCheckContext *ctx, ToastedAttribute *ta,
+						char *msg)
+{
+	report_corruption_internal(ctx->tupstore, ctx->tupdesc, ta->blkno,
+							   ta->offnum, ta->attnum, msg);
 	ctx->is_corrupt = true;
 }
 
@@ -1094,9 +1158,12 @@ check_tuple_visibility(HeapCheckContext *ctx)
  * tuples that store the toasted value are retrieved and checked in order, with
  * each toast tuple being checked against where we are in the sequence, as well
  * as each toast tuple having its varlena structure sanity checked.
+ *
+ * Reports any corruption found via report_toast_corruption().
  */
 static void
-check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
+check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
+				  ToastedAttribute *ta, int32 chunkno, int32 endchunk)
 {
 	int32		curchunk;
 	Pointer		chunk;
@@ -1111,7 +1178,7 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 										 ctx->toast_rel->rd_att, &isnull));
 	if (isnull)
 	{
-		report_corruption(ctx,
+		report_toast_corruption(ctx, ta,
 						  pstrdup("toast chunk sequence number is null"));
 		return;
 	}
@@ -1119,7 +1186,7 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 										ctx->toast_rel->rd_att, &isnull));
 	if (isnull)
 	{
-		report_corruption(ctx,
+		report_toast_corruption(ctx, ta,
 						  pstrdup("toast chunk data is null"));
 		return;
 	}
@@ -1137,7 +1204,7 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 		/* should never happen */
 		uint32		header = ((varattrib_4b *) chunk)->va_4byte.va_header;
 
-		report_corruption(ctx,
+		report_toast_corruption(ctx, ta,
 						  psprintf("corrupt extended toast chunk has invalid varlena header: %0x (sequence number %d)",
 								   header, curchunk));
 		return;
@@ -1146,30 +1213,28 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 	/*
 	 * Some checks on the data we've found
 	 */
-	if (curchunk != ctx->chunkno)
+	if (curchunk != chunkno)
 	{
-		report_corruption(ctx,
+		report_toast_corruption(ctx, ta,
 						  psprintf("toast chunk sequence number %u does not match the expected sequence number %u",
-								   curchunk, ctx->chunkno));
+								   curchunk, chunkno));
 		return;
 	}
-	if (curchunk > ctx->endchunk)
+	if (curchunk > endchunk)
 	{
-		report_corruption(ctx,
+		report_toast_corruption(ctx, ta,
 						  psprintf("toast chunk sequence number %u exceeds the end chunk sequence number %u",
-								   curchunk, ctx->endchunk));
+								   curchunk, endchunk));
 		return;
 	}
 
-	expected_size = curchunk < ctx->totalchunks - 1 ? TOAST_MAX_CHUNK_SIZE
-		: ctx->attrsize - ((ctx->totalchunks - 1) * TOAST_MAX_CHUNK_SIZE);
+	expected_size = curchunk < endchunk ? TOAST_MAX_CHUNK_SIZE
+		: VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer) - (endchunk * TOAST_MAX_CHUNK_SIZE);
+
 	if (chunksize != expected_size)
-	{
-		report_corruption(ctx,
+		report_toast_corruption(ctx, ta,
 						  psprintf("toast chunk size %u differs from the expected size %u",
 								   chunksize, expected_size));
-		return;
-	}
 }
 
 /*
@@ -1177,17 +1242,17 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
  * found in ctx->tupstore.
  *
  * This function follows the logic performed by heap_deform_tuple(), and in the
- * case of a toasted value, optionally continues along the logic of
- * detoast_external_attr(), checking for any conditions that would result in
- * either of those functions Asserting or crashing the backend.  The checks
- * performed by Asserts present in those two functions are also performed here.
- * In cases where those two functions are a bit cavalier in their assumptions
- * about data being correct, we perform additional checks not present in either
- * of those two functions.  Where some condition is checked in both of those
- * functions, we perform it here twice, as we parallel the logical flow of
- * those two functions.  The presence of duplicate checks seems a reasonable
- * price to pay for keeping this code tightly coupled with the code it
- * protects.
+ * case of a toasted value, optionally stores the toast pointer so later it can
+ * be checked following the logic of detoast_external_attr(), checking for any
+ * conditions that would result in either of those functions Asserting or
+ * crashing the backend.  The checks performed by Asserts present in those two
+ * functions are also performed here and in check_toasted_attribute.  In cases
+ * where those two functions are a bit cavalier in their assumptions about data
+ * being correct, we perform additional checks not present in either of those
+ * two functions.  Where some condition is checked in both of those functions,
+ * we perform it here twice, as we parallel the logical flow of those two
+ * functions.  The presence of duplicate checks seems a reasonable price to pay
+ * for keeping this code tightly coupled with the code it protects.
  *
  * Returns true if the tuple attribute is sane enough for processing to
  * continue on to the next attribute, false otherwise.
@@ -1195,12 +1260,6 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx)
 static bool
 check_tuple_attribute(HeapCheckContext *ctx)
 {
-	struct varatt_external toast_pointer;
-	ScanKeyData toastkey;
-	SysScanDesc toastscan;
-	SnapshotData SnapshotToast;
-	HeapTuple	toasttup;
-	bool		found_toasttup;
 	Datum		attdatum;
 	struct varlena *attr;
 	char	   *tp;				/* pointer to the tuple data */
@@ -1335,13 +1394,44 @@ check_tuple_attribute(HeapCheckContext *ctx)
 		return true;
 
 	/*
-	 * Must copy attr into toast_pointer for alignment considerations
+	 * If this tuple is eligible to be pruned, we cannot check the toast.
+	 * Otherwise, we push a copy of the toast tuple so we can check it after
+	 * releasing the main table buffer lock.
 	 */
-	VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+	if (!ctx->tuple_could_be_pruned)
+	{
+		ToastedAttribute *ta;
+
+		ta = (ToastedAttribute *) palloc0(sizeof(ToastedAttribute));
+
+		VARATT_EXTERNAL_GET_POINTER(ta->toast_pointer, attr);
+		ta->blkno = ctx->blkno;
+		ta->offnum = ctx->offnum;
+		ta->attnum = ctx->attnum;
+		ctx->toasted_attributes = lappend(ctx->toasted_attributes, ta);
+	}
+
+	return true;
+}
+
+/*
+ * For each attribute collected in ctx->toasted_attributes, look up the value
+ * in the toast table and perform checks on it.  This function should only be
+ * called on toast pointers which cannot be vacuumed away during our
+ * processing.
+ */
+static void
+check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
+{
+	SnapshotData SnapshotToast;
+	ScanKeyData toastkey;
+	SysScanDesc toastscan;
+	bool		found_toasttup;
+	HeapTuple	toasttup;
+	int32		chunkno;
+	int32		endchunk;
 
-	ctx->attrsize = VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer);
-	ctx->endchunk = (ctx->attrsize - 1) / TOAST_MAX_CHUNK_SIZE;
-	ctx->totalchunks = ctx->endchunk + 1;
+	endchunk = (VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer) - 1) / TOAST_MAX_CHUNK_SIZE;
 
 	/*
 	 * Setup a scan key to find chunks in toast table with matching va_valueid
@@ -1349,7 +1439,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	ScanKeyInit(&toastkey,
 				(AttrNumber) 1,
 				BTEqualStrategyNumber, F_OIDEQ,
-				ObjectIdGetDatum(toast_pointer.va_valueid));
+				ObjectIdGetDatum(ta->toast_pointer.va_valueid));
 
 	/*
 	 * Check if any chunks for this toasted object exist in the toast table,
@@ -1360,27 +1450,26 @@ check_tuple_attribute(HeapCheckContext *ctx)
 										   ctx->valid_toast_index,
 										   &SnapshotToast, 1,
 										   &toastkey);
-	ctx->chunkno = 0;
+	chunkno = 0;
 	found_toasttup = false;
 	while ((toasttup =
 			systable_getnext_ordered(toastscan,
 									 ForwardScanDirection)) != NULL)
 	{
 		found_toasttup = true;
-		check_toast_tuple(toasttup, ctx);
-		ctx->chunkno++;
+		check_toast_tuple(toasttup, ctx, ta, chunkno, endchunk);
+		chunkno++;
 	}
-	if (!found_toasttup)
-		report_corruption(ctx,
-						  psprintf("toasted value for attribute %u missing from toast table",
-								   ctx->attnum));
-	else if (ctx->chunkno != (ctx->endchunk + 1))
-		report_corruption(ctx,
-						  psprintf("final toast chunk number %u differs from expected value %u",
-								   ctx->chunkno, (ctx->endchunk + 1)));
 	systable_endscan_ordered(toastscan);
 
-	return true;
+	if (!found_toasttup)
+		report_toast_corruption(ctx, ta,
+								psprintf("toasted value for attribute %u missing from toast table",
+										 ta->attnum));
+	else if (chunkno != (endchunk + 1))
+		report_toast_corruption(ctx, ta,
+								psprintf("final toast chunk number %u differs from expected value %u",
+										 chunkno, (endchunk + 1)));
 }
 
 /*
@@ -1391,8 +1480,8 @@ static void
 check_tuple(HeapCheckContext *ctx)
 {
 	/*
-	 * Check various forms of tuple header corruption, and if the header is too
-	 * corrupt, do not continue with other checks.
+	 * Check various forms of tuple header corruption, and if the header is
+	 * too corrupt, do not continue with other checks.
 	 */
 	if (!check_tuple_header(ctx))
 		return;
@@ -1423,7 +1512,10 @@ check_tuple(HeapCheckContext *ctx)
 	 * Check each attribute unless we hit corruption that confuses what to do
 	 * next, at which point we abort further attribute checks for this tuple.
 	 * Note that we don't abort for all types of corruption, only for those
-	 * types where we don't know how to continue.
+	 * types where we don't know how to continue.  We also don't abort the
+	 * checking of toasted attributes collected from the tuple prior to
+	 * aborting.  Those will still be checked later along with other toasted
+	 * attributes collected from the page.
 	 */
 	ctx->offset = 0;
 	for (ctx->attnum = 0; ctx->attnum < ctx->natts; ctx->attnum++)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9e6777e9d0..862a6df949 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2558,6 +2558,7 @@ TmFromChar
 TmToChar
 ToastAttrInfo
 ToastTupleContext
+ToastedAttribute
 TocEntry
 TokenAuxData
 TokenizedLine
-- 
2.21.1 (Apple Git-122.3)

v18-0003-amcheck-improving-corruption-messages.patch (application/octet-stream)
From 4195bb5d6e39ec0cc2b323b0ad5b24c4c843df88 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Fri, 2 Apr 2021 15:06:50 -0700
Subject: [PATCH v18 3/4] amcheck: improving corruption messages.

Removing redundant mention of attnum in the corruption message text,
as the attnum is already its own separate column.

When reporting toast corruption, mentioning the toast value in the
message since that information is not otherwise reported.
---
 contrib/amcheck/verify_heapam.c           | 61 +++++++++++++----------
 src/bin/pg_amcheck/t/004_verify_heapam.pl |  4 +-
 2 files changed, 37 insertions(+), 28 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index d8a3e966c4..cd1f2c4113 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -1179,7 +1179,8 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 	if (isnull)
 	{
 		report_toast_corruption(ctx, ta,
-						  pstrdup("toast chunk sequence number is null"));
+								psprintf("toast value %u has toast chunk with null sequence number",
+										 ta->toast_pointer.va_valueid));
 		return;
 	}
 	chunk = DatumGetPointer(fastgetattr(toasttup, 3,
@@ -1187,7 +1188,8 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 	if (isnull)
 	{
 		report_toast_corruption(ctx, ta,
-						  pstrdup("toast chunk data is null"));
+								psprintf("toast value %u chunk data is null",
+										 ta->toast_pointer.va_valueid));
 		return;
 	}
 	if (!VARATT_IS_EXTENDED(chunk))
@@ -1205,8 +1207,9 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 		uint32		header = ((varattrib_4b *) chunk)->va_4byte.va_header;
 
 		report_toast_corruption(ctx, ta,
-						  psprintf("corrupt extended toast chunk has invalid varlena header: %0x (sequence number %d)",
-								   header, curchunk));
+								psprintf("toast value %u corrupt extended chunk has invalid varlena header: %0x (sequence number %d)",
+										 ta->toast_pointer.va_valueid,
+										 header, curchunk));
 		return;
 	}
 
@@ -1216,15 +1219,17 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 	if (curchunk != chunkno)
 	{
 		report_toast_corruption(ctx, ta,
-						  psprintf("toast chunk sequence number %u does not match the expected sequence number %u",
-								   curchunk, chunkno));
+								psprintf("toast value %u chunk sequence number %u does not match the expected sequence number %u",
+										 ta->toast_pointer.va_valueid,
+										 curchunk, chunkno));
 		return;
 	}
 	if (curchunk > endchunk)
 	{
 		report_toast_corruption(ctx, ta,
-						  psprintf("toast chunk sequence number %u exceeds the end chunk sequence number %u",
-								   curchunk, endchunk));
+								psprintf("toast value %u chunk sequence number %u exceeds the end chunk sequence number %u",
+										 ta->toast_pointer.va_valueid,
+										 curchunk, endchunk));
 		return;
 	}
 
@@ -1233,8 +1238,9 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 
 	if (chunksize != expected_size)
 		report_toast_corruption(ctx, ta,
-						  psprintf("toast chunk size %u differs from the expected size %u",
-								   chunksize, expected_size));
+								psprintf("toast value %u chunk size %u differs from the expected size %u",
+										 ta->toast_pointer.va_valueid,
+										 chunksize, expected_size));
 }
 
 /*
@@ -1265,6 +1271,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	char	   *tp;				/* pointer to the tuple data */
 	uint16		infomask;
 	Form_pg_attribute thisatt;
+	struct varatt_external toast_pointer;
 
 	infomask = ctx->tuphdr->t_infomask;
 	thisatt = TupleDescAttr(RelationGetDescr(ctx->rel), ctx->attnum);
@@ -1274,8 +1281,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
 	{
 		report_corruption(ctx,
-						  psprintf("attribute %u with length %u starts at offset %u beyond total tuple length %u",
-								   ctx->attnum,
+						  psprintf("attribute with length %u starts at offset %u beyond total tuple length %u",
 								   thisatt->attlen,
 								   ctx->tuphdr->t_hoff + ctx->offset,
 								   ctx->lp_len));
@@ -1295,8 +1301,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 		if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
 		{
 			report_corruption(ctx,
-							  psprintf("attribute %u with length %u ends at offset %u beyond total tuple length %u",
-									   ctx->attnum,
+							  psprintf("attribute with length %u ends at offset %u beyond total tuple length %u",
 									   thisatt->attlen,
 									   ctx->tuphdr->t_hoff + ctx->offset,
 									   ctx->lp_len));
@@ -1328,8 +1333,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 		if (va_tag != VARTAG_ONDISK)
 		{
 			report_corruption(ctx,
-							  psprintf("toasted attribute %u has unexpected TOAST tag %u",
-									   ctx->attnum,
+							  psprintf("toasted attribute has unexpected TOAST tag %u",
 									   va_tag));
 			/* We can't know where the next attribute begins */
 			return false;
@@ -1343,8 +1347,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
 	{
 		report_corruption(ctx,
-						  psprintf("attribute %u with length %u ends at offset %u beyond total tuple length %u",
-								   ctx->attnum,
+						  psprintf("attribute with length %u ends at offset %u beyond total tuple length %u",
 								   thisatt->attlen,
 								   ctx->tuphdr->t_hoff + ctx->offset,
 								   ctx->lp_len));
@@ -1371,12 +1374,17 @@ check_tuple_attribute(HeapCheckContext *ctx)
 
 	/* It is external, and we're looking at a page on disk */
 
+	/*
+	 * Must copy attr into toast_pointer for alignment considerations
+	 */
+	VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+
 	/* The tuple header better claim to contain toasted values */
 	if (!(infomask & HEAP_HASEXTERNAL))
 	{
 		report_corruption(ctx,
-						  psprintf("attribute %u is external but tuple header flag HEAP_HASEXTERNAL not set",
-								   ctx->attnum));
+						  psprintf("toast value %u is external but tuple header flag HEAP_HASEXTERNAL not set",
+								   toast_pointer.va_valueid));
 		return true;
 	}
 
@@ -1384,8 +1392,8 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	if (!ctx->rel->rd_rel->reltoastrelid)
 	{
 		report_corruption(ctx,
-						  psprintf("attribute %u is external but relation has no toast relation",
-								   ctx->attnum));
+						  psprintf("toast value %u is external but relation has no toast relation",
+								   toast_pointer.va_valueid));
 		return true;
 	}
 
@@ -1464,12 +1472,13 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 
 	if (!found_toasttup)
 		report_toast_corruption(ctx, ta,
-								psprintf("toasted value for attribute %u missing from toast table",
-										 ta->attnum));
+								psprintf("toast value %u not found in toast table",
+										 ta->toast_pointer.va_valueid));
 	else if (chunkno != (endchunk + 1))
 		report_toast_corruption(ctx, ta,
-								psprintf("final toast chunk number %u differs from expected value %u",
-										 chunkno, (endchunk + 1)));
+								psprintf("toast value %u was expected to end at chunk %u, but ended at chunk %u",
+										 ta->toast_pointer.va_valueid,
+										 (endchunk + 1), chunkno));
 }
 
 /*
diff --git a/src/bin/pg_amcheck/t/004_verify_heapam.pl b/src/bin/pg_amcheck/t/004_verify_heapam.pl
index 36607596b1..307f14611c 100644
--- a/src/bin/pg_amcheck/t/004_verify_heapam.pl
+++ b/src/bin/pg_amcheck/t/004_verify_heapam.pl
@@ -480,7 +480,7 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 
 		$header = header(0, $offnum, 1);
 		push @expected,
-			qr/${header}attribute \d+ with length \d+ ends at offset \d+ beyond total tuple length \d+/;
+			qr/${header}attribute with length \d+ ends at offset \d+ beyond total tuple length \d+/;
 	}
 	elsif ($offnum == 13)
 	{
@@ -489,7 +489,7 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 
 		$header = header(0, $offnum, 2);
 		push @expected,
-			qr/${header}toasted value for attribute 2 missing from toast table/;
+			qr/${header}toast value \d+ not found in toast table/;
 	}
 	elsif ($offnum == 14)
 	{
-- 
2.21.1 (Apple Git-122.3)

v18-0004-amcheck-adding-toast-pointer-corruption-checks.patch (application/octet-stream)
From c3645b0af895767c40b1ead2f2e076c7c33e482b Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Fri, 2 Apr 2021 15:07:52 -0700
Subject: [PATCH v18 4/4] amcheck: adding toast pointer corruption checks

Verifying that toast pointer va_toastrelid fields match their heap
table's reltoastrelid.

Checking the extsize for a toast pointer against the raw size.  This
check could fail if buggy compression logic fails to notice that
compressing the attribute makes it bigger.  But assuming the logic
for that is correct, overlarge extsize indicates a corrupted toast
pointer.

Checking the toast is not too large to be allocated.  No such
toasted value should ever be stored, but a corrupted toast pointer
could record an unreasonably large size, so check that.
---
 contrib/amcheck/verify_heapam.c           | 34 +++++++++++++++++++-
 src/bin/pg_amcheck/t/004_verify_heapam.pl | 39 +++++++++++++++++++++--
 2 files changed, 69 insertions(+), 4 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index cd1f2c4113..b2e121ed38 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -1397,6 +1397,28 @@ check_tuple_attribute(HeapCheckContext *ctx)
 		return true;
 	}
 
+	/* The toast pointer had better point at the relation's toast table */
+	if (toast_pointer.va_toastrelid != ctx->rel->rd_rel->reltoastrelid)
+	{
+		report_corruption(ctx,
+						  psprintf("toast value %u toast relation oid %u differs from expected oid %u",
+								   toast_pointer.va_valueid,
+								   toast_pointer.va_toastrelid,
+								   ctx->rel->rd_rel->reltoastrelid));
+		return true;
+	}
+
+	/* Compression should never expand the attribute */
+	if (VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer) > toast_pointer.va_rawsize - VARHDRSZ)
+	{
+		report_corruption(ctx,
+						  psprintf("toast value %u external size %u exceeds maximum expected for rawsize %u",
+								   toast_pointer.va_valueid,
+								   VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer),
+								   toast_pointer.va_rawsize));
+		return true;
+	}
+
 	/* If we were told to skip toast checking, then we're done. */
 	if (ctx->toast_rel == NULL)
 		return true;
@@ -1471,14 +1493,24 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 	systable_endscan_ordered(toastscan);
 
 	if (!found_toasttup)
+	{
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u not found in toast table",
 										 ta->toast_pointer.va_valueid));
-	else if (chunkno != (endchunk + 1))
+		return;
+	}
+
+	if (chunkno != (endchunk + 1))
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u was expected to end at chunk %u, but ended at chunk %u",
 										 ta->toast_pointer.va_valueid,
 										 (endchunk + 1), chunkno));
+
+	if (!AllocSizeIsValid(ta->toast_pointer.va_rawsize))
+		report_toast_corruption(ctx, ta,
+								psprintf("toast value %u rawsize %u too large to be allocated",
+										 ta->toast_pointer.va_valueid,
+										 ta->toast_pointer.va_rawsize));
 }
 
 /*
diff --git a/src/bin/pg_amcheck/t/004_verify_heapam.pl b/src/bin/pg_amcheck/t/004_verify_heapam.pl
index 307f14611c..de525fbdd8 100644
--- a/src/bin/pg_amcheck/t/004_verify_heapam.pl
+++ b/src/bin/pg_amcheck/t/004_verify_heapam.pl
@@ -224,7 +224,7 @@ my $rel = $node->safe_psql('postgres', qq(SELECT pg_relation_filepath('public.te
 my $relpath = "$pgdata/$rel";
 
 # Insert data and freeze public.test
-use constant ROWCOUNT => 16;
+use constant ROWCOUNT => 19;
 $node->safe_psql('postgres', qq(
 	INSERT INTO public.test (a, b, c)
 		VALUES (
@@ -240,6 +240,13 @@ my $relfrozenxid = $node->safe_psql('postgres',
 my $datfrozenxid = $node->safe_psql('postgres',
 	q(select datfrozenxid from pg_database where datname = 'postgres'));
 
+# Find our toast relation id
+my $toastrelid = $node->safe_psql('postgres', qq(
+	SELECT c.reltoastrelid
+		FROM pg_catalog.pg_class c
+		WHERE c.oid = 'public.test'::regclass
+		));
+
 # Sanity check that our 'test' table has a relfrozenxid newer than the
 # datfrozenxid for the database, and that the datfrozenxid is greater than the
 # first normal xid.  We rely on these invariants in some of our tests.
@@ -296,7 +303,7 @@ close($file)
 $node->start;
 
 # Ok, Xids and page layout look ok.  We can run corruption tests.
-plan tests => 19;
+plan tests => 22;
 
 # Check that pg_amcheck runs against the uncorrupted table without error.
 $node->command_ok(['pg_amcheck', '-p', $port, 'postgres'],
@@ -501,7 +508,7 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 		push @expected,
 			qr/${header}multitransaction ID 4 equals or exceeds next valid multitransaction ID 1/;
 	}
-	elsif ($offnum == 15)	# Last offnum must equal ROWCOUNT
+	elsif ($offnum == 15)
 	{
 		# Set both HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI
 		$tup->{t_infomask} |= HEAP_XMAX_COMMITTED;
@@ -511,6 +518,32 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 		push @expected,
 			qr/${header}multitransaction ID 4000000000 precedes relation minimum multitransaction ID threshold 1/;
 	}
+	elsif ($offnum == 16)
+	{
+		# Corrupt column c's toast pointer va_toastrelid field
+		my $otherid = $toastrelid + 1;
+		$tup->{c_va_toastrelid} = $otherid;
+		$header = header(0, $offnum, 2);
+		push @expected,
+			qr/${header}toast value \d+ toast relation oid $otherid differs from expected oid $toastrelid/;
+	}
+	elsif ($offnum == 17)
+	{
+		# Corrupt column c's toast pointer va_extinfo field
+		$tup->{c_va_extinfo} = 7654321;
+		$header = header(0, $offnum, 2);
+		push @expected,
+			qr/${header}toast value \d+ external size 7654321 exceeds maximum expected for rawsize 10004/;
+	}
+	elsif ($offnum == 18)	# Last offnum should equal ROWCOUNT-1
+	{
+		# Corrupt column c's toast pointer va_rawsize field with a value
+		# exceeding maximum allowable allocation size
+		$tup->{c_va_rawsize} = 0x40000000;
+		$header = header(0, $offnum, 2);
+		push @expected,
+			qr/${header}toast value \d+ rawsize 1073741824 too large to be allocated/;
+	}
 	write_tuple($file, $offset, $tup);
 }
 close($file)
-- 
2.21.1 (Apple Git-122.3)

#124Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#123)
Re: pg_amcheck contrib application

On Sun, Apr 4, 2021 at 8:02 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

v18-0001 - Finishes work started in commit 3b6c1259f9 that was overlooked owing to how I had separated the changes in v17-0002 vs. v17-0003

Committed.

v18-0002 - Postpones the toast checks for a page until after the main table page lock is released

Committed, but I changed list_free() to list_free_deep() to avoid a
memory leak, and I revised the commit message to mention the important
point that we need to avoid following TOAST pointers from
potentially-prunable tuples.

v18-0003 - Improves the corruption messages in ways already discussed earlier in this thread. Changes the tests to expect the new messages, but adds no new checks

Kibitzing your message wording:

"toast value %u chunk data is null" -> "toast value %u chunk %d has
null data". We can mention the chunk number this way.

"toast value %u corrupt extended chunk has invalid varlena header: %0x
(sequence number %d)" -> "toast value %u chunk %d has invalid varlena
header %0x". We can be more consistent about how we incorporate the
chunk number into the text, and we don't really need to include the
word corrupt, because all of these are corruption complaints, and I
think it looks better without the colon.

"toast value %u chunk sequence number %u does not match the expected
sequence number %u" -> "toast value %u contains chunk %d where chunk
%d was expected". Shorter. Uses %d for a sequence number instead of
%u, which I think is correct -- anyway we should have them all one way
or all the other. I think I'd rather ditch the "sequence number"
technology and just talk about "chunk %d" or whatever.

"toast value %u chunk sequence number %u exceeds the end chunk
sequence number %u" -> "toast value %u chunk %d follows last expected
chunk %d"

"toast value %u chunk size %u differs from the expected size %u" ->
"toast value %u chunk %d has size %u, but expected size %u"

Other complaints:

Your commit message fails to mention the addition of
VARATT_EXTERNAL_GET_POINTER, which is a significant change/bug fix
unrelated to message wording.

It feels like we have a non-minimal number of checks/messages for the
series of toast chunks. I think you get a message if we find a chunk
after the last chunk we were expecting to find (curchunk > endchunk),
and you also get a message if we have the wrong number of chunks in
total (chunkno != (endchunk + 1)). Now maybe I'm wrong, but if the
first message triggers, it seems like the second message must also
trigger. Is that wrong? If not, maybe we can get rid of the first one
entirely? That's such a small change I think we could include it in
this same patch, if it's a correct idea.

On a related note, as I think I said before, I still think we should
be rejiggering this so that we're not testing both the size of each
individual chunk and the total size, because that ought to be
redundant. That might be better done as a separate patch but I think
we should try to clean it up.

v18-0004 - Adding corruption checks of toast pointers. Extends the regression tests to cover the new checks.

I think we could check that the result of
VARATT_EXTERNAL_GET_COMPRESS_METHOD is one of the values we expect to
see.

Using AllocSizeIsValid() seems pretty vile. I know that MaxAllocSize
is 0x3FFFFFFF in no small part because that's the maximum length that
can be represented by a varlena, but I'm not sure it's a good idea to
couple the concepts so closely like this. Maybe we can just #define
VARLENA_SIZE_LIMIT in this file and use that, and a message that says
size %u exceeds limit %u.

I'm a little worried about whether the additional test cases are
Endian-dependent at all. I don't immediately know what might be wrong
with them, but I'm going to think about that some more later. Any
chance you have access to a Big-endian box where you can test this?

--
Robert Haas
EDB: http://www.enterprisedb.com

#125Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#124)
3 attachment(s)
Re: pg_amcheck contrib application

On Apr 7, 2021, at 1:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sun, Apr 4, 2021 at 8:02 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

v18-0001 - Finishes work started in commit 3b6c1259f9 that was overlooked owing to how I had separated the changes in v17-0002 vs. v17-0003

Committed.

Thank you.

v18-0002 - Postpones the toast checks for a page until after the main table page lock is released

Committed, but I changed list_free() to list_free_deep() to avoid a
memory leak, and I revised the commit message to mention the important
point that we need to avoid following TOAST pointers from
potentially-prunable tuples.

Thank you, and yes, I agree with that change.

v18-0003 - Improves the corruption messages in ways already discussed earlier in this thread. Changes the tests to expect the new messages, but adds no new checks

Kibitizing your message wording:

"toast value %u chunk data is null" -> "toast value %u chunk %d has
null data". We can mention the chunk number this way.

Changed.

"toast value %u corrupt extended chunk has invalid varlena header: %0x
(sequence number %d)" -> "toast value %u chunk %d has invalid varlena
header %0x". We can be more consistent about how we incorporate the
chunk number into the text, and we don't really need to include the
word corrupt, because all of these are corruption complaints, and I
think it looks better without the colon.

Changed.

"toast value %u chunk sequence number %u does not match the expected
sequence number %u" -> "toast value %u contains chunk %d where chunk
%d was expected". Shorter. Uses %d for a sequence number instead of
%u, which I think is correct -- anyway we should have them all one way
or all the other. I think I'd rather ditch the "sequence number"
technology and just talk about "chunk %d" or whatever.

I don't agree with this one. I do agree with changing the message, but not to the message you suggest.

Imagine a toasted attribute with 18 chunks numbered [0..17]. Then we update the toast to have only 6 chunks numbered [0..5] except we corruptly keep chunks numbered [12..17] in the toast table. We'd rather see a report like this:

# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 chunk 6 has sequence number 12, but expected sequence number 6
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 chunk 7 has sequence number 13, but expected sequence number 7
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 chunk 8 has sequence number 14, but expected sequence number 8
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 chunk 9 has sequence number 15, but expected sequence number 9
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 chunk 10 has sequence number 16, but expected sequence number 10
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 chunk 11 has sequence number 17, but expected sequence number 11
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 was expected to end at chunk 6, but ended at chunk 12

than one like this:

# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 contains chunk 12 where chunk 6 was expected
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 contains chunk 13 where chunk 7 was expected
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 contains chunk 14 where chunk 8 was expected
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 contains chunk 15 where chunk 9 was expected
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 contains chunk 16 where chunk 10 was expected
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 contains chunk 17 where chunk 11 was expected
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 was expected to end at chunk 6, but ended at chunk 12

because saying the toast value ended at "chunk 12" after saying that it contains "chunk 17" is contradictory. You need the distinction between the chunk number and the chunk sequence number, since in corrupt circumstances they may not be the same.

"toast value %u chunk sequence number %u exceeds the end chunk
sequence number %u" -> "toast value %u chunk %d follows last expected
chunk %d"

Changed.

"toast value %u chunk size %u differs from the expected size %u" ->
"toast value %u chunk %d has size %u, but expected size %u"

Changed.

Other complaints:

Your commit message fails to mention the addition of
VARATT_EXTERNAL_GET_POINTER, which is a significant change/bug fix
unrelated to message wording.

Right you are.

It feels like we have a non-minimal number of checks/messages for the
series of toast chunks. I think that if we find a chunk after the last
chunk we were expecting to find (curchunk > endchunk) and you also get
a message if we have the wrong number of chunks in total (chunkno !=
(endchunk + 1)). Now maybe I'm wrong, but if the first message
triggers, it seems like the second message must also trigger. Is that
wrong? If not, maybe we can get rid of the first one entirely? That's
such a small change I think we could include it in this same patch, if
it's a correct idea.

Motivated by discussions we had off-list, I dug into this one.

Purely as manual testing, and not part of the patch, I hacked the backend a bit to allow direct modification of the toast table. After corrupting the toast with the following bit of SQL:

WITH chunk_limit AS (
SELECT chunk_id, MAX(chunk_seq) AS maxseq
FROM $toastname
GROUP BY chunk_id)
INSERT INTO $toastname (chunk_id, chunk_seq, chunk_data)
(SELECT t.chunk_id,
t.chunk_seq + cl.maxseq + CASE WHEN t.chunk_seq < 3 THEN 1 ELSE 7 END,
t.chunk_data
FROM $toastname t
INNER JOIN chunk_limit cl
ON t.chunk_id = cl.chunk_id)

pg_amcheck reports the following corruption messages:

# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 chunk 6 follows last expected chunk 5
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 chunk 7 follows last expected chunk 5
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 chunk 8 follows last expected chunk 5
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 chunk 9 has sequence number 15, but expected sequence number 9
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 chunk 10 has sequence number 16, but expected sequence number 10
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 chunk 11 has sequence number 17, but expected sequence number 11
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 was expected to end at chunk 6, but ended at chunk 12

I think if we'd left out the first three messages, it would read strangely. We would be complaining about three chunks with the wrong sequence number, then conclude that there were six extra chunks. A sufficiently savvy user might deduce the presence of chunks 6, 7, and 8, but the problem is more obvious (to my eyes, at least) if we keep the first three messages. This seems like a judgement call and not a clear argument either way, so if you still want me to change it, I guess I don't mind doing so.

On a related note, as I think I said before, I still think we should
be rejiggering this so that we're not testing both the size of each
individual chunk and the total size, because that ought to be
redundant. That might be better done as a separate patch but I think
we should try to clean it up.

Can you point me to the exact check you are mentioning, and with which patch applied? I don't see any examples of this after applying v18-0003.

v18-0004 - Adding corruption checks of toast pointers. Extends the regression tests to cover the new checks.

I think we could check that the result of
VARATT_EXTERNAL_GET_COMPRESS_METHOD is one of the values we expect to
see.

Yes. I had that before, pulled it out along with other toast compression checks, but have put it back in for v19.

Using AllocSizeIsValid() seems pretty vile. I know that MaxAllocSize
is 0x3FFFFFFF in no small part because that's the maximum length that
can be represented by a varlena, but I'm not sure it's a good idea to
couple the concepts so closely like this. Maybe we can just #define
VARLENA_SIZE_LIMIT in this file and use that, and a message that says
size %u exceeds limit %u.

Changed.

I'm a little worried about whether the additional test cases are
Endian-dependent at all. I don't immediately know what might be wrong
with them, but I'm going to think about that some more later. Any
chance you have access to a Big-endian box where you can test this?

I don't have a Big-endian box, but I think one of them may be wrong now that you mention the issue:

# Corrupt column c's toast pointer va_extinfo field

The problem is that the 30-bit extsize and 2-bit cmid split is not being handled in the perl test, and I don't see an easy way to have perl's pack/unpack do that for us. There isn't any requirement that each possible corruption we check actually be manifested in the regression tests. The simplest solution is to remove this problematic test, so that's what I did. The other two new tests corrupt c_va_toastrelid and c_va_rawsize, both of which are read/written using unpack/pack, so perl should handle the endianness for us (I hope).

If you'd rather not commit these two extra tests, you don't have to, as I've split them out into v19-0003. But if you do commit them, it makes more sense to me to be one commit with 0002+0003 together, rather than separately. Not committing the new tests just means that verify_heapam() is able to detect additional forms of corruption that we're not covering in the regression tests. But that's already true for some other corruption types, such as detecting toast chunks with null sequence numbers.

Attachments:

v19-0001-amcheck-rewording-messages-and-fixing-alignment.patch (application/octet-stream)
From b6ea48417cc99258bf2f9106feed4bfebfcf8212 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Thu, 8 Apr 2021 09:48:37 -0700
Subject: [PATCH v19 1/3] amcheck: rewording messages and fixing alignment

Removing redundant mention of attnum in the corruption message text,
as the attnum is already its own separate column.

When reporting toast corruption, mentioning the toast value in the
message since that information is not otherwise reported.

Being more careful about alignment when accessing a toast pointer.
---
 contrib/amcheck/verify_heapam.c           | 63 +++++++++++++----------
 src/bin/pg_amcheck/t/004_verify_heapam.pl |  4 +-
 2 files changed, 38 insertions(+), 29 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index e8aa0d68d4..13f420d9ad 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -1179,7 +1179,8 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 	if (isnull)
 	{
 		report_toast_corruption(ctx, ta,
-						  pstrdup("toast chunk sequence number is null"));
+								psprintf("toast value %u has toast chunk with null sequence number",
+										 ta->toast_pointer.va_valueid));
 		return;
 	}
 	chunk = DatumGetPointer(fastgetattr(toasttup, 3,
@@ -1187,7 +1188,8 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 	if (isnull)
 	{
 		report_toast_corruption(ctx, ta,
-						  pstrdup("toast chunk data is null"));
+								psprintf("toast value %u chunk %d has null data",
+										 ta->toast_pointer.va_valueid, chunkno));
 		return;
 	}
 	if (!VARATT_IS_EXTENDED(chunk))
@@ -1205,8 +1207,9 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 		uint32		header = ((varattrib_4b *) chunk)->va_4byte.va_header;
 
 		report_toast_corruption(ctx, ta,
-						  psprintf("corrupt extended toast chunk has invalid varlena header: %0x (sequence number %d)",
-								   header, curchunk));
+								psprintf("toast value %u chunk %d has invalid varlena header %0x",
+										 ta->toast_pointer.va_valueid,
+										 chunkno, header));
 		return;
 	}
 
@@ -1216,15 +1219,17 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 	if (curchunk != chunkno)
 	{
 		report_toast_corruption(ctx, ta,
-						  psprintf("toast chunk sequence number %u does not match the expected sequence number %u",
-								   curchunk, chunkno));
+								psprintf("toast value %u chunk %d has sequence number %d, but expected sequence number %d",
+										 ta->toast_pointer.va_valueid,
+										 chunkno, curchunk, chunkno));
 		return;
 	}
-	if (curchunk > endchunk)
+	if (chunkno > endchunk)
 	{
 		report_toast_corruption(ctx, ta,
-						  psprintf("toast chunk sequence number %u exceeds the end chunk sequence number %u",
-								   curchunk, endchunk));
+								psprintf("toast value %u chunk %d follows last expected chunk %d",
+										 ta->toast_pointer.va_valueid,
+										 chunkno, endchunk));
 		return;
 	}
 
@@ -1233,8 +1238,9 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 
 	if (chunksize != expected_size)
 		report_toast_corruption(ctx, ta,
-						  psprintf("toast chunk size %u differs from the expected size %u",
-								   chunksize, expected_size));
+								psprintf("toast value %u chunk %d has size %u, but expected size %u",
+										 ta->toast_pointer.va_valueid,
+										 chunkno, chunksize, expected_size));
 }
 
 /*
@@ -1265,6 +1271,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	char	   *tp;				/* pointer to the tuple data */
 	uint16		infomask;
 	Form_pg_attribute thisatt;
+	struct varatt_external toast_pointer;
 
 	infomask = ctx->tuphdr->t_infomask;
 	thisatt = TupleDescAttr(RelationGetDescr(ctx->rel), ctx->attnum);
@@ -1274,8 +1281,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
 	{
 		report_corruption(ctx,
-						  psprintf("attribute %u with length %u starts at offset %u beyond total tuple length %u",
-								   ctx->attnum,
+						  psprintf("attribute with length %u starts at offset %u beyond total tuple length %u",
 								   thisatt->attlen,
 								   ctx->tuphdr->t_hoff + ctx->offset,
 								   ctx->lp_len));
@@ -1295,8 +1301,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 		if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
 		{
 			report_corruption(ctx,
-							  psprintf("attribute %u with length %u ends at offset %u beyond total tuple length %u",
-									   ctx->attnum,
+							  psprintf("attribute with length %u ends at offset %u beyond total tuple length %u",
 									   thisatt->attlen,
 									   ctx->tuphdr->t_hoff + ctx->offset,
 									   ctx->lp_len));
@@ -1328,8 +1333,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 		if (va_tag != VARTAG_ONDISK)
 		{
 			report_corruption(ctx,
-							  psprintf("toasted attribute %u has unexpected TOAST tag %u",
-									   ctx->attnum,
+							  psprintf("toasted attribute has unexpected TOAST tag %u",
 									   va_tag));
 			/* We can't know where the next attribute begins */
 			return false;
@@ -1343,8 +1347,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
 	{
 		report_corruption(ctx,
-						  psprintf("attribute %u with length %u ends at offset %u beyond total tuple length %u",
-								   ctx->attnum,
+						  psprintf("attribute with length %u ends at offset %u beyond total tuple length %u",
 								   thisatt->attlen,
 								   ctx->tuphdr->t_hoff + ctx->offset,
 								   ctx->lp_len));
@@ -1371,12 +1374,17 @@ check_tuple_attribute(HeapCheckContext *ctx)
 
 	/* It is external, and we're looking at a page on disk */
 
+	/*
+	 * Must copy attr into toast_pointer for alignment considerations
+	 */
+	VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+
 	/* The tuple header better claim to contain toasted values */
 	if (!(infomask & HEAP_HASEXTERNAL))
 	{
 		report_corruption(ctx,
-						  psprintf("attribute %u is external but tuple header flag HEAP_HASEXTERNAL not set",
-								   ctx->attnum));
+						  psprintf("toast value %u is external but tuple header flag HEAP_HASEXTERNAL not set",
+								   toast_pointer.va_valueid));
 		return true;
 	}
 
@@ -1384,8 +1392,8 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	if (!ctx->rel->rd_rel->reltoastrelid)
 	{
 		report_corruption(ctx,
-						  psprintf("attribute %u is external but relation has no toast relation",
-								   ctx->attnum));
+						  psprintf("toast value %u is external but relation has no toast relation",
+								   toast_pointer.va_valueid));
 		return true;
 	}
 
@@ -1464,12 +1472,13 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 
 	if (!found_toasttup)
 		report_toast_corruption(ctx, ta,
-								psprintf("toasted value for attribute %u missing from toast table",
-										 ta->attnum));
+								psprintf("toast value %u not found in toast table",
+										 ta->toast_pointer.va_valueid));
 	else if (chunkno != (endchunk + 1))
 		report_toast_corruption(ctx, ta,
-								psprintf("final toast chunk number %u differs from expected value %u",
-										 chunkno, (endchunk + 1)));
+								psprintf("toast value %u was expected to end at chunk %u, but ended at chunk %u",
+										 ta->toast_pointer.va_valueid,
+										 (endchunk + 1), chunkno));
 }
 
 /*
diff --git a/src/bin/pg_amcheck/t/004_verify_heapam.pl b/src/bin/pg_amcheck/t/004_verify_heapam.pl
index 36607596b1..307f14611c 100644
--- a/src/bin/pg_amcheck/t/004_verify_heapam.pl
+++ b/src/bin/pg_amcheck/t/004_verify_heapam.pl
@@ -480,7 +480,7 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 
 		$header = header(0, $offnum, 1);
 		push @expected,
-			qr/${header}attribute \d+ with length \d+ ends at offset \d+ beyond total tuple length \d+/;
+			qr/${header}attribute with length \d+ ends at offset \d+ beyond total tuple length \d+/;
 	}
 	elsif ($offnum == 13)
 	{
@@ -489,7 +489,7 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 
 		$header = header(0, $offnum, 2);
 		push @expected,
-			qr/${header}toasted value for attribute 2 missing from toast table/;
+			qr/${header}toast value \d+ not found in toast table/;
 	}
 	elsif ($offnum == 14)
 	{
-- 
2.21.1 (Apple Git-122.3)

Attachment: v19-0002-amcheck-adding-toast-pointer-corruption-checks.patch (application/octet-stream)
From 86addc39596cf980a5850d9b1dba23fdaa927485 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Thu, 8 Apr 2021 10:09:59 -0700
Subject: [PATCH v19 2/3] amcheck: adding toast pointer corruption checks

Verifying that toast pointer va_toastrelid fields match their heap
table's reltoastrelid.

Checking the extsize for a toast pointer against the raw size.  This
check could fail if buggy compression logic fails to notice that
compressing the attribute makes it bigger.  But assuming the logic
for that is correct, overlarge extsize indicates a corrupted toast
pointer.

Checking if a toast pointer indicates the data is compressed, that
the toast pointer records a valid compression method.

Checking that the toast value is not too large to be allocated.  No
such toasted value should ever be stored, but a corrupted toast
pointer could record an unreasonably large size, so check that.

Changing the logic to continue checking toast even after reporting
that HEAP_HASEXTERNAL is false.  Previously, the toast checking
stopped here, but that wasn't necessary, and subsequent checks may
provide additional useful diagnostic information.
---
 contrib/amcheck/verify_heapam.c | 56 +++++++++++++++++++++++++++++++--
 1 file changed, 53 insertions(+), 3 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 13f420d9ad..7b7b40f415 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -30,6 +30,8 @@ PG_FUNCTION_INFO_V1(verify_heapam);
 /* The number of columns in tuples returned by verify_heapam */
 #define HEAPCHECK_RELATION_COLS 4
 
+#define VARLENA_SIZE_LIMIT 0x3FFFFFFF
+
 /*
  * Despite the name, we use this for reporting problems with both XIDs and
  * MXIDs.
@@ -1379,14 +1381,54 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	 */
 	VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
 
+	if (toast_pointer.va_rawsize > VARLENA_SIZE_LIMIT)
+		report_corruption(ctx,
+						  psprintf("toast value %u rawsize %u exceeds limit %u",
+								   toast_pointer.va_valueid,
+								   toast_pointer.va_rawsize,
+								   VARLENA_SIZE_LIMIT));
+
+	/* Compression should never expand the attribute */
+	if (VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer) > toast_pointer.va_rawsize - VARHDRSZ)
+		report_corruption(ctx,
+						  psprintf("toast value %u external size %u exceeds maximum expected for rawsize %u",
+								   toast_pointer.va_valueid,
+								   VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer),
+								   toast_pointer.va_rawsize));
+
+	/* Compressed attributes should have a valid compression method */
+	if (VARATT_IS_COMPRESSED(&toast_pointer))
+	{
+		ToastCompressionId cmid;
+		bool		invalid = true;
+
+		cmid = TOAST_COMPRESS_METHOD(&toast_pointer);
+		switch (cmid)
+		{
+			/* List of all valid compression method IDs */
+			case TOAST_PGLZ_COMPRESSION_ID:
+			case TOAST_LZ4_COMPRESSION_ID:
+				invalid = false;
+				break;
+
+			/* Recognized but invalid compression method ID */
+			case TOAST_INVALID_COMPRESSION_ID:
+				break;
+
+			/* Intentionally no default here */
+		}
+
+		if (invalid)
+			report_corruption(ctx,
+							  psprintf("toast value %u has invalid compression method id %d",
+									   toast_pointer.va_valueid, cmid));
+	}
+
 	/* The tuple header better claim to contain toasted values */
 	if (!(infomask & HEAP_HASEXTERNAL))
-	{
 		report_corruption(ctx,
 						  psprintf("toast value %u is external but tuple header flag HEAP_HASEXTERNAL not set",
 								   toast_pointer.va_valueid));
-		return true;
-	}
 
 	/* The relation better have a toast table */
 	if (!ctx->rel->rd_rel->reltoastrelid)
@@ -1397,6 +1439,14 @@ check_tuple_attribute(HeapCheckContext *ctx)
 		return true;
 	}
 
+	/* The toast pointer had better point at the relation's toast table */
+	if (toast_pointer.va_toastrelid != ctx->rel->rd_rel->reltoastrelid)
+		report_corruption(ctx,
+						  psprintf("toast value %u toast relation oid %u differs from expected oid %u",
+								   toast_pointer.va_valueid,
+								   toast_pointer.va_toastrelid,
+								   ctx->rel->rd_rel->reltoastrelid));
+
 	/* If we were told to skip toast checking, then we're done. */
 	if (ctx->toast_rel == NULL)
 		return true;
-- 
2.21.1 (Apple Git-122.3)

Attachment: v19-0003-amcheck-additional-regression-test-coverage.patch (application/octet-stream)
From cd3534caf75aeb8397c7754a26a8e3722508bbd2 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Thu, 8 Apr 2021 10:24:40 -0700
Subject: [PATCH v19 3/3] amcheck: additional regression test coverage

Adding regression checks for two corruption checks added in prior
commit.  We now verify that corruption reports are generated when we
set a toast pointer's va_toastrelid to the wrong Oid or set a toast
pointer's va_rawsize to an overlarge value.
---
 src/bin/pg_amcheck/t/004_verify_heapam.pl | 31 ++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/src/bin/pg_amcheck/t/004_verify_heapam.pl b/src/bin/pg_amcheck/t/004_verify_heapam.pl
index 307f14611c..e64801c7e1 100644
--- a/src/bin/pg_amcheck/t/004_verify_heapam.pl
+++ b/src/bin/pg_amcheck/t/004_verify_heapam.pl
@@ -224,7 +224,7 @@ my $rel = $node->safe_psql('postgres', qq(SELECT pg_relation_filepath('public.te
 my $relpath = "$pgdata/$rel";
 
 # Insert data and freeze public.test
-use constant ROWCOUNT => 16;
+use constant ROWCOUNT => 18;
 $node->safe_psql('postgres', qq(
 	INSERT INTO public.test (a, b, c)
 		VALUES (
@@ -240,6 +240,13 @@ my $relfrozenxid = $node->safe_psql('postgres',
 my $datfrozenxid = $node->safe_psql('postgres',
 	q(select datfrozenxid from pg_database where datname = 'postgres'));
 
+# Find our toast relation id
+my $toastrelid = $node->safe_psql('postgres', qq(
+	SELECT c.reltoastrelid
+		FROM pg_catalog.pg_class c
+		WHERE c.oid = 'public.test'::regclass
+		));
+
 # Sanity check that our 'test' table has a relfrozenxid newer than the
 # datfrozenxid for the database, and that the datfrozenxid is greater than the
 # first normal xid.  We rely on these invariants in some of our tests.
@@ -296,7 +303,7 @@ close($file)
 $node->start;
 
 # Ok, Xids and page layout look ok.  We can run corruption tests.
-plan tests => 19;
+plan tests => 22;
 
 # Check that pg_amcheck runs against the uncorrupted table without error.
 $node->command_ok(['pg_amcheck', '-p', $port, 'postgres'],
@@ -501,7 +508,7 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 		push @expected,
 			qr/${header}multitransaction ID 4 equals or exceeds next valid multitransaction ID 1/;
 	}
-	elsif ($offnum == 15)	# Last offnum must equal ROWCOUNT
+	elsif ($offnum == 15)
 	{
 		# Set both HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI
 		$tup->{t_infomask} |= HEAP_XMAX_COMMITTED;
@@ -511,6 +518,24 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 		push @expected,
 			qr/${header}multitransaction ID 4000000000 precedes relation minimum multitransaction ID threshold 1/;
 	}
+	elsif ($offnum == 16)
+	{
+		# Corrupt column c's toast pointer va_toastrelid field
+		my $otherid = $toastrelid + 1;
+		$tup->{c_va_toastrelid} = $otherid;
+		$header = header(0, $offnum, 2);
+		push @expected,
+			qr/${header}toast value \d+ toast relation oid $otherid differs from expected oid $toastrelid/;
+	}
+	elsif ($offnum == 17)	# Last offnum should equal ROWCOUNT-1
+	{
+		# Corrupt column c's toast pointer va_rawsize field with a value
+		# exceeding maximum allowable allocation size
+		$tup->{c_va_rawsize} = 0x40000000;
+		$header = header(0, $offnum, 2);
+		push @expected,
+			qr/${header}toast value \d+ rawsize 1073741824 exceeds limit 1073741823/;
+	}
 	write_tuple($file, $offset, $tup);
 }
 close($file)
-- 
2.21.1 (Apple Git-122.3)

#126Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#125)
Re: pg_amcheck contrib application

On Thu, Apr 8, 2021 at 3:02 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

Imagine a toasted attribute with 18 chunks numbered [0..17]. Then we update the toast to have only 6 chunks numbered [0..5] except we corruptly keep chunks numbered [12..17] in the toast table. We'd rather see a report like this:

[ toast value NNN chunk NNN has sequence number NNN, but expected
sequence number NNN ]

than one like this:

[ toast value NNN contains chunk NNN where chunk NNN was expected ]

because saying the toast value ended at "chunk 12" after saying that it contains "chunk 17" is contradictory. You need the distinction between the chunk number and the chunk sequence number, since in corrupt circumstances they may not be the same.

Hmm, I see your point, and that's a good example to illustrate it.
But, with that example in front of me, I am rather doubtful that
either of these is what users actually want. Consider the case where I
should have chunks 0..17 and chunk 1 is just plain gone. This, by the
way, seems like a pretty likely case to arise in practice, since all
we need is for a block to get truncated away or zeroed erroneously, or
for a tuple to get pruned that shouldn't. With either of the above
schemes, I guess we're going to get a message about every chunk from 2
to 17, complaining that they're all misnumbered. We might also get a
complaint that the last chunk is the wrong size, and that the total
number of chunks isn't right. What we really want is a single
complaint saying chunk 1 is missing.

Likewise, in your example, I sort of feel like what I really want,
rather than either of the above outputs, is to get some messages like
this:

toast value NNN contains unexpected extra chunks [12-17]

Both your phrasing for those messages and what I suggested make it
sound like the problem is that the chunk number is wrong. But that
doesn't seem like it's taking the right view of the situation. Chunks
12-17 shouldn't exist at all, and if they do, we should say that, e.g.
by complaining about something like "toast value 16444 chunk 12
follows last expected chunk 5"

In other words, I don't buy the idea that the user will accept the
idea that there's a chunk number and a chunk sequence number, and that
they should know the difference between those things and what each of
them are. They're entitled to imagine that there's just one thing, and
that we're going to tell them about values that are extra or missing.
The fact that we're not doing that seems like it's just a matter of
missing code. If we start the index scan and get chunk 4, we can
easily emit messages for chunks 0..3 right on the spot, declaring them
missing. Things do get a bit hairy if the index scan returns values
out of order: what if it gives us chunk_seq = 2 and then chunk_seq =
1? But I think we could handle that by just issuing a complaint in any
such case that "toast index returns chunks out of order for toast
value NNN" and stopping further checking of that toast value.

Purely as manual testing, and not part of the patch, I hacked the backend a bit to allow direct modification of the toast table. After corrupting the toast with the following bit of SQL:

WITH chunk_limit AS (
SELECT chunk_id, MAX(chunk_seq) AS maxseq
FROM $toastname
GROUP BY chunk_id)
INSERT INTO $toastname (chunk_id, chunk_seq, chunk_data)
(SELECT t.chunk_id,
t.chunk_seq + cl.maxseq + CASE WHEN t.chunk_seq < 3 THEN 1 ELSE 7 END,
t.chunk_data
FROM $toastname t
INNER JOIN chunk_limit cl
ON t.chunk_id = cl.chunk_id)

pg_amcheck reports the following corruption messages:

# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 chunk 6 follows last expected chunk 5
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 chunk 7 follows last expected chunk 5
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 chunk 8 follows last expected chunk 5
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 chunk 9 has sequence number 15, but expected sequence number 9
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 chunk 10 has sequence number 16, but expected sequence number 10
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 chunk 11 has sequence number 17, but expected sequence number 11
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 was expected to end at chunk 6, but ended at chunk 12

I think if we'd left out the first three messages, it would read strangely. We would be complaining about three chunks with the wrong sequence number, then conclude that there were six extra chunks. A sufficiently savvy user might deduce the presence of chunks 6, 7, and 8, but the problem is more obvious (to my eyes, at least) if we keep the first three messages. This seems like a judgement call and not a clear argument either way, so if you still want me to change it, I guess I don't mind doing so.

I mean, looking at it, the question here is why it's not just using
the same message for all of them. The fact that the chunk numbers are
higher than 5 is the problem. The sequence numbers seem like just a
distraction.

On a related note, as I think I said before, I still think we should
be rejiggering this so that we're not testing both the size of each
individual chunk and the total size, because that ought to be
redundant. That might be better done as a separate patch but I think
we should try to clean it up.

Can you point me to the exact check you are mentioning, and with which patch applied? I don't see any examples of this after applying the v18-0003.

Hmm, my mistake, I think.

I'm a little worried about whether the additional test cases are
Endian-dependent at all. I don't immediately know what might be wrong
with them, but I'm going to think about that some more later. Any
chance you have access to a Big-endian box where you can test this?

I don't have a Big-endian box, but I think one of them may be wrong now that you mention the issue:

# Corrupt column c's toast pointer va_extinfo field

The problem is that the 30-bit extsize and 2-bit cmid split is not being handled in the perl test, and I don't see an easy way to have perl's pack/unpack do that for us. There isn't any requirement that each possible corruption we check actually be manifested in the regression tests. The simplest solution is to remove this problematic test, so that's what I did. The other two new tests corrupt c_va_toastrelid and c_va_rawsize, both of which are read/written using unpack/pack, so perl should handle the endianness for us (I hope).

I don't immediately see why this particular thing should be an issue.
The format of the varlena header itself is different on big-endian and
little-endian machines, which is why postgres.h has all this stuff
conditioned on WORDS_BIGENDIAN. But va_extinfo doesn't have any
similar treatment, so I'm not sure what could go wrong there, as long
as the 4-byte value as a whole is being packed and unpacked according
to the machine's endian-ness.

--
Robert Haas
EDB: http://www.enterprisedb.com

#127Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#126)
Re: pg_amcheck contrib application

On Apr 8, 2021, at 1:05 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Apr 8, 2021 at 3:02 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

Imagine a toasted attribute with 18 chunks numbered [0..17]. Then we update the toast to have only 6 chunks numbered [0..5] except we corruptly keep chunks numbered [12..17] in the toast table. We'd rather see a report like this:

[ toast value NNN chunk NNN has sequence number NNN, but expected
sequence number NNN ]

than one like this:

[ toast value NNN contains chunk NNN where chunk NNN was expected ]

because saying the toast value ended at "chunk 12" after saying that it contains "chunk 17" is contradictory. You need the distinction between the chunk number and the chunk sequence number, since in corrupt circumstances they may not be the same.

Hmm, I see your point, and that's a good example to illustrate it.
But, with that example in front of me, I am rather doubtful that
either of these is what users actually want. Consider the case where I
should have chunks 0..17 and chunk 1 is just plain gone. This, by the
way, seems like a pretty likely case to arise in practice, since all
we need is for a block to get truncated away or zeroed erroneously, or
for a tuple to get pruned that shouldn't. With either of the above
schemes, I guess we're going to get a message about every chunk from 2
to 17, complaining that they're all misnumbered. We might also get a
complaint that the last chunk is the wrong size, and that the total
number of chunks isn't right. What we really want is a single
complaint saying chunk 1 is missing.

Likewise, in your example, I sort of feel like what I really want,
rather than either of the above outputs, is to get some messages like
this:

toast value NNN contains unexpected extra chunks [12-17]

Both your phrasing for those messages and what I suggested make it
sound like the problem is that the chunk number is wrong. But that
doesn't seem like it's taking the right view of the situation. Chunks
12-17 shouldn't exist at all, and if they do, we should say that, e.g.
by complaining about something like "toast value 16444 chunk 12
follows last expected chunk 5"

In other words, I don't buy the idea that the user will accept the
idea that there's a chunk number and a chunk sequence number, and that
they should know the difference between those things and what each of
them are. They're entitled to imagine that there's just one thing, and
that we're going to tell them about values that are extra or missing.
The fact that we're not doing that seems like it's just a matter of
missing code.

Somehow, we have to get enough information about chunk_seq discontinuities into the output so that, whether the user forwards it to -hackers or it comes from a buildfarm critter, we have all the information we need to diagnose what went wrong.

As a specific example, if the va_rawsize suggests 2 chunks, and we find 150 chunks all with contiguous chunk_seq values, that is different from a debugging point of view than if we find 150 chunks with chunk_seq values spread all over the [0..MAXINT] range. We can't just tell the user that there were 148 extra chunks. We also shouldn't phrase the error in terms of "extra chunks", since it might be the va_rawsize that is corrupt.

I agree that the current message output might be overly verbose in how it reports this information. Conceptually, we want to store up information about the chunk issues and report them all at once, but that's hard to do in general, as the number of chunk_seq discontinuities might be quite large, much too large to fit reasonably into any one message. Maybe we could report just the first N discontinuities rather than all of them, but if somebody wants to open a hex editor and walk through the toast table, they won't appreciate having the corruption information truncated like that.

All this leads me to believe that we should report the following:

1) If the total number of chunks retrieved differs from the expected number, report how many we expected vs. how many we got
2) If the chunk_seq numbers are discontiguous, report each discontiguity.
3) If the index scan returned chunks out of chunk_seq order, report that
4) If any chunk is not the expected size, report that

So, for your example of chunk 1 missing from chunks [0..17], we'd report that we got one fewer chunks than we expected, that the second chunk seen was discontiguous from the first chunk seen, that the final chunk seen was smaller than expected by M bytes, and that the total size was smaller than we expected by N bytes. The third of those is somewhat misleading, since the final chunk was presumably the right size; we just weren't expecting to hit a partial chunk quite yet. But I don't see how to make that better in the general case.

If we start the index scan and get chunk 4, we can
easily emit messages for chunks 0..3 right on the spot, declaring them
missing. Things do get a bit hairy if the index scan returns values
out of order: what if it gives us chunk_seq = 2 and then chunk_seq =
1? But I think we could handle that by just issuing a complaint in any
such case that "toast index returns chunks out of order for toast
value NNN" and stopping further checking of that toast value.

Purely as manual testing, and not part of the patch, I hacked the backend a bit to allow direct modification of the toast table. After corrupting the toast with the following bit of SQL:

WITH chunk_limit AS (
SELECT chunk_id, MAX(chunk_seq) AS maxseq
FROM $toastname
GROUP BY chunk_id)
INSERT INTO $toastname (chunk_id, chunk_seq, chunk_data)
(SELECT t.chunk_id,
t.chunk_seq + cl.maxseq + CASE WHEN t.chunk_seq < 3 THEN 1 ELSE 7 END,
t.chunk_data
FROM $toastname t
INNER JOIN chunk_limit cl
ON t.chunk_id = cl.chunk_id)

pg_amcheck reports the following corruption messages:

# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 chunk 6 follows last expected chunk 5
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 chunk 7 follows last expected chunk 5
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 chunk 8 follows last expected chunk 5
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 chunk 9 has sequence number 15, but expected sequence number 9
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 chunk 10 has sequence number 16, but expected sequence number 10
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 chunk 11 has sequence number 17, but expected sequence number 11
# heap table "postgres"."public"."test", block 0, offset 1, attribute 2:
# toast value 16444 was expected to end at chunk 6, but ended at chunk 12

I think if we'd left out the first three messages, it would read strangely. We would be complaining about three chunks with the wrong sequence number, then conclude that there were six extra chunks. A sufficiently savvy user might deduce the presence of chunks 6, 7, and 8, but the problem is more obvious (to my eyes, at least) if we keep the first three messages. This seems like a judgement call and not a clear argument either way, so if you still want me to change it, I guess I don't mind doing so.

I mean, looking at it, the question here is why it's not just using
the same message for all of them. The fact that the chunk numbers are
higher than 5 is the problem. The sequence numbers seem like just a
distraction.

Again, I don't think we can reach that conclusion. You are biasing the corruption reports in favor of believing the va_rawsize rather than believing the toast table.

On a related note, as I think I said before, I still think we should
be rejiggering this so that we're not testing both the size of each
individual chunk and the total size, because that ought to be
redundant. That might be better done as a separate patch but I think
we should try to clean it up.

Can you point me to the exact check you are mentioning, and with which patch applied? I don't see any examples of this after applying the v18-0003.

Hmm, my mistake, I think.

I'm a little worried about whether the additional test cases are
Endian-dependent at all. I don't immediately know what might be wrong
with them, but I'm going to think about that some more later. Any
chance you have access to a Big-endian box where you can test this?

I don't have a Big-endian box, but I think one of them may be wrong now that you mention the issue:

# Corrupt column c's toast pointer va_extinfo field

The problem is that the 30-bit extsize and 2-bit cmid split is not being handled in the perl test, and I don't see an easy way to have perl's pack/unpack do that for us. There isn't any requirement that each possible corruption we check actually be manifested in the regression tests. The simplest solution is to remove this problematic test, so that's what I did. The other two new tests corrupt c_va_toastrelid and c_va_rawsize, both of which are read/written using unpack/pack, so perl should handle the endianness for us (I hope).

I don't immediately see why this particular thing should be an issue.
The format of the varlena header itself is different on big-endian and
little-endian machines, which is why postgres.h has all this stuff
conditioned on WORDS_BIGENDIAN. But va_extinfo doesn't have any
similar treatment, so I'm not sure what could go wrong there, as long
as the 4-byte value as a whole is being packed and unpacked according
to the machine's endian-ness.

Good point. Perhaps the test was ok after all.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#128Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#127)
Re: pg_amcheck contrib application

On Thu, Apr 8, 2021 at 5:21 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

All this leads me to believe that we should report the following:

1) If the total number of chunks retrieved differs from the expected number, report how many we expected vs. how many we got
2) If the chunk_seq numbers are discontiguous, report each discontiguity.
3) If the index scan returned chunks out of chunk_seq order, report that
4) If any chunk is not the expected size, report that

So, for your example of chunk 1 missing from chunks [0..17], we'd report that we got one fewer chunks than we expected, that the second chunk seen was discontiguous from the first chunk seen, that the final chunk seen was smaller than expected by M bytes, and that the total size was smaller than we expected by N bytes. The third of those is somewhat misleading, since the final chunk was presumably the right size; we just weren't expecting to hit a partial chunk quite yet. But I don't see how to make that better in the general case.

Hmm, that might be OK. It seems like it's going to be a bit verbose in
simple cases like 1 missing chunk, but on the plus side, it avoids a
mountain of output if the raw size has been overwritten with a
gigantic bogus value. But, how is #2 different from #3? Those sound
like the same thing to me.

--
Robert Haas
EDB: http://www.enterprisedb.com

#129Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#128)
Re: pg_amcheck contrib application

On Apr 8, 2021, at 3:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Apr 8, 2021 at 5:21 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

All this leads me to believe that we should report the following:

1) If the total number of chunks retrieved differs from the expected number, report how many we expected vs. how many we got
2) If the chunk_seq numbers are discontiguous, report each discontiguity.
3) If the index scan returned chunks out of chunk_seq order, report that
4) If any chunk is not the expected size, report that

So, for your example of chunk 1 missing from chunks [0..17], we'd report that we got one fewer chunks than we expected, that the second chunk seen was discontiguous from the first chunk seen, that the final chunk seen was smaller than expected by M bytes, and that the total size was smaller than we expected by N bytes. The third of those is somewhat misleading, since the final chunk was presumably the right size; we just weren't expecting to hit a partial chunk quite yet. But I don't see how to make that better in the general case.

Hmm, that might be OK. It seems like it's going to be a bit verbose in
simple cases like 1 missing chunk, but on the plus side, it avoids a
mountain of output if the raw size has been overwritten with a
gigantic bogus value. But, how is #2 different from #3? Those sound
like the same thing to me.

#2 is if chunk_seq goes up but skips numbers. #3 is if chunk_seq ever goes down, meaning the index scan did something unexpected.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#130Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#129)
Re: pg_amcheck contrib application

On Thu, Apr 8, 2021 at 6:51 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

#2 is if chunk_seq goes up but skips numbers. #3 is if chunk_seq ever goes down, meaning the index scan did something unexpected.

Yeah, sure. But I think we could probably treat those the same way.

--
Robert Haas
EDB: http://www.enterprisedb.com

#131Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#128)
Re: pg_amcheck contrib application

On Apr 8, 2021, at 3:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Apr 8, 2021 at 5:21 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

All this leads me to believe that we should report the following:

1) If the total number of chunks retrieved differs from the expected number, report how many we expected vs. how many we got
2) If the chunk_seq numbers are discontiguous, report each discontiguity.
3) If the index scan returned chunks out of chunk_seq order, report that
4) If any chunk is not the expected size, report that

So, for your example of chunk 1 missing from chunks [0..17], we'd report that we got one fewer chunks than we expected, that the second chunk seen was discontiguous from the first chunk seen, that the final chunk seen was smaller than expected by M bytes, and that the total size was smaller than we expected by N bytes. The third of those is somewhat misleading, since the final chunk was presumably the right size; we just weren't expecting to hit a partial chunk quite yet. But I don't see how to make that better in the general case.

Hmm, that might be OK. It seems like it's going to be a bit verbose in
simple cases like 1 missing chunk, but on the plus side, it avoids a
mountain of output if the raw size has been overwritten with a
gigantic bogus value. But, how is #2 different from #3? Those sound
like the same thing to me.

I think #4, above, requires some clarification. If there are missing chunks, the very definition of how large we expect subsequent chunks to be is ill-defined. I took a fairly conservative approach to avoid lots of bogus complaints about chunks that are of unexpected size. Not all such complaints are removed, but enough are removed that I needed to add a final complaint at the end about the total size seen not matching the total size expected.

Here are a set of corruptions with the corresponding corruption reports from before and from after the code changes. The corruptions are *not* cumulative.

Honestly, I'm not totally convinced that these changes are improvements in all cases. Let me know if you want further changes, or if you'd like to see other corruptions and their before and after results.

Corruption #1:

UPDATE $toastname SET chunk_seq = chunk_seq + 1000

Before:

# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 0 has sequence number 1000, but expected sequence number 0
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 1 has sequence number 1001, but expected sequence number 1
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 2 has sequence number 1002, but expected sequence number 2
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 3 has sequence number 1003, but expected sequence number 3
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 4 has sequence number 1004, but expected sequence number 4
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 5 has sequence number 1005, but expected sequence number 5

After:

# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 missing chunks 0 through 999

Corruption #2:

UPDATE $toastname SET chunk_seq = chunk_seq * 1000

Before:

# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 1 has sequence number 1000, but expected sequence number 1
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 2 has sequence number 2000, but expected sequence number 2
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 3 has sequence number 3000, but expected sequence number 3
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 4 has sequence number 4000, but expected sequence number 4
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 5 has sequence number 5000, but expected sequence number 5

After:

# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 missing chunks 1 through 999
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 missing chunks 1001 through 1999
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 missing chunks 2001 through 2999
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 missing chunks 3001 through 3999
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 missing chunks 4001 through 4999
# heap table "postgres"."public"."test", block 0, offset 3, attribute 2:

Corruption #3:

UPDATE $toastname SET chunk_id = (chunk_id::integer + 10000000)::oid WHERE chunk_seq = 3

Before:

# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 3 has sequence number 4, but expected sequence number 3
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 4 has sequence number 5, but expected sequence number 4
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 was expected to end at chunk 6, but ended at chunk 5

After:

# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 missing chunk 3
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 4 has size 20, but expected size 1996
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 was expected to end at chunk 6, but ended at chunk 5
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 was expected to have size 10000, but had size 8004
# heap table "postgres"."public"."test", block 0, offset 3, attribute 2:


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#132Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#131)
Re: pg_amcheck contrib application

On Fri, Apr 9, 2021 at 2:50 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

I think #4, above, requires some clarification. If there are missing chunks, the very definition of how large we expect subsequent chunks to be is ill-defined. I took a fairly conservative approach to avoid lots of bogus complaints about chunks that are of unexpected size. Not all such complaints are removed, but enough are removed that I needed to add a final complaint at the end about the total size seen not matching the total size expected.

My instinct is to suppose that the size that we expect for future
chunks is independent of anything being wrong with previous chunks. So
if each chunk is supposed to be 2004 bytes (which probably isn't the
real number) and the value is 7000 bytes long, we expect chunks 0-2 to
be 2004 bytes each, chunk 3 to be 988 bytes, and chunk 4 and higher to
not exist. If chunk 1 happens to be missing or the wrong length or
whatever, our expectations for chunks 2 and 3 are utterly unchanged.
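That principle can be expressed as a pure function of the chunk number, independent of what any other chunk looked like. A minimal sketch, using the illustrative numbers above (the helper name is hypothetical, not from the patch):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration: the expected size of chunk
 * `seq` of a toast value whose total stored size is `total`, with
 * chunks of at most `max_chunk` bytes.  Returns 0 for chunks past the
 * end.  Note the result depends only on `seq` and the toast pointer's
 * size, never on whether earlier chunks were missing or damaged.
 */
static int32_t
expected_chunk_size(int32_t total, int32_t max_chunk, int32_t seq)
{
	int32_t		nchunks = (total + max_chunk - 1) / max_chunk;

	if (seq < 0 || seq >= nchunks)
		return 0;				/* this chunk should not exist at all */
	if (seq < nchunks - 1)
		return max_chunk;		/* every chunk but the last is full */
	return total - seq * max_chunk; /* the final, possibly partial chunk */
}
```

With total = 7000 and max_chunk = 2004 this yields 2004 for chunks 0 through 2, 988 for chunk 3, and 0 from chunk 4 on, matching the expectations described above.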

Corruption #1:

UPDATE $toastname SET chunk_seq = chunk_seq + 1000

Before:

# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 0 has sequence number 1000, but expected sequence number 0
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 1 has sequence number 1001, but expected sequence number 1
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 2 has sequence number 1002, but expected sequence number 2
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 3 has sequence number 1003, but expected sequence number 3
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 4 has sequence number 1004, but expected sequence number 4
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 5 has sequence number 1005, but expected sequence number 5

After:

# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 missing chunks 0 through 999

Applying the above principle would lead to complaints that chunks 0-5
are missing, and 1000-1005 are extra.

Corruption #2:

UPDATE $toastname SET chunk_seq = chunk_seq * 1000

Similarly here, except the extra chunk numbers are different.

Corruption #3:

UPDATE $toastname SET chunk_id = (chunk_id::integer + 10000000)::oid WHERE chunk_seq = 3

And here we'd just get a complaint that chunk 3 is missing.

--
Robert Haas
EDB: http://www.enterprisedb.com

#133Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#132)
2 attachment(s)
Re: pg_amcheck contrib application

On Apr 9, 2021, at 1:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Apr 9, 2021 at 2:50 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

I think #4, above, requires some clarification. If there are missing chunks, the very definition of how large we expect subsequent chunks to be is ill-defined. I took a fairly conservative approach to avoid lots of bogus complaints about chunks that are of unexpected size. Not all such complaints are removed, but enough are removed that I needed to add a final complaint at the end about the total size seen not matching the total size expected.

My instinct is to suppose that the size that we expect for future
chunks is independent of anything being wrong with previous chunks. So
if each chunk is supposed to be 2004 bytes (which probably isn't the
real number) and the value is 7000 bytes long, we expect chunks 0-2 to
be 2004 bytes each, chunk 3 to be 988 bytes, and chunk 4 and higher to
not exist. If chunk 1 happens to be missing or the wrong length or
whatever, our expectations for chunks 2 and 3 are utterly unchanged.

Fair enough.

Corruption #1:

UPDATE $toastname SET chunk_seq = chunk_seq + 1000

Before:

# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 0 has sequence number 1000, but expected sequence number 0
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 1 has sequence number 1001, but expected sequence number 1
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 2 has sequence number 1002, but expected sequence number 2
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 3 has sequence number 1003, but expected sequence number 3
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 4 has sequence number 1004, but expected sequence number 4
# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 chunk 5 has sequence number 1005, but expected sequence number 5

After:

# heap table "postgres"."public"."test", block 0, offset 2, attribute 2:
# toast value 16445 missing chunks 0 through 999

Applying the above principle would lead to complaints that chunks 0-5
are missing, and 1000-1005 are extra.

That sounds right. It now reports:

# heap table "postgres"."public"."test", block 0, offset 16, attribute 2:
# toast value 16459 missing chunks 0 through 4 with expected size 1996 and chunk 5 with expected size 20
# heap table "postgres"."public"."test", block 0, offset 16, attribute 2:
# toast value 16459 unexpected chunks 1000 through 1004 each with size 1996 followed by chunk 1005 with size 20

Corruption #2:

UPDATE $toastname SET chunk_seq = chunk_seq * 1000

Similarly here, except the extra chunk numbers are different.

It now reports:

# heap table "postgres"."public"."test", block 0, offset 17, attribute 2:
# toast value 16460 missing chunks 1 through 4 with expected size 1996 and chunk 5 with expected size 20
# heap table "postgres"."public"."test", block 0, offset 17, attribute 2:
# toast value 16460 unexpected chunk 1000 with size 1996
# heap table "postgres"."public"."test", block 0, offset 17, attribute 2:
# toast value 16460 unexpected chunk 2000 with size 1996
# heap table "postgres"."public"."test", block 0, offset 17, attribute 2:
# toast value 16460 unexpected chunk 3000 with size 1996
# heap table "postgres"."public"."test", block 0, offset 17, attribute 2:
# toast value 16460 unexpected chunk 4000 with size 1996
# heap table "postgres"."public"."test", block 0, offset 17, attribute 2:
# toast value 16460 unexpected chunk 5000 with size 20

I don't see any good way in this case to report the extra chunks in one row, as in the general case there could be arbitrarily many of them, with the message text getting arbitrarily large if it reported the chunks as "chunks 1000, 2000, 3000, 4000, 5000, ...". I don't expect this sort of corruption to be particularly common, though, so I'm not too bothered about it.

Corruption #3:

UPDATE $toastname SET chunk_id = (chunk_id::integer + 10000000)::oid WHERE chunk_seq = 3

And here we'd just get a complaint that chunk 3 is missing.

It now reports:

# heap table "postgres"."public"."test", block 0, offset 18, attribute 2:
# toast value 16461 missing chunk 3 with expected size 1996
# heap table "postgres"."public"."test", block 0, offset 18, attribute 2:
# toast value 16461 was expected to end at chunk 6 with total size 10000, but ended at chunk 5 with total size 8004

It sounds like you weren't expecting the second of these reports. I think it is valuable, especially when there are multiple missing chunks and multiple extraneous chunks, as it makes it easier for the user to reconcile the missing chunks against the extraneous chunks.

Attachments:

v20-0001-amcheck-rewording-messages-and-fixing-alignment.patch
From 953142b5deb16f2133e23bcc71f1bbc55ff630b4 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Thu, 8 Apr 2021 09:48:37 -0700
Subject: [PATCH v20 1/2] amcheck: rewording messages and fixing alignment

Removing redundant mention of attnum in the corruption message text,
as the attnum is already its own separate column.

When reporting toast corruption, mentioning the toast value in the
message since that information is not otherwise reported.

Being more careful about alignment when accessing a toast pointer.
---
 contrib/amcheck/verify_heapam.c           | 63 +++++++++++++----------
 src/bin/pg_amcheck/t/004_verify_heapam.pl |  4 +-
 2 files changed, 38 insertions(+), 29 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index e8aa0d68d4..13f420d9ad 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -1179,7 +1179,8 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 	if (isnull)
 	{
 		report_toast_corruption(ctx, ta,
-						  pstrdup("toast chunk sequence number is null"));
+								psprintf("toast value %u has toast chunk with null sequence number",
+										 ta->toast_pointer.va_valueid));
 		return;
 	}
 	chunk = DatumGetPointer(fastgetattr(toasttup, 3,
@@ -1187,7 +1188,8 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 	if (isnull)
 	{
 		report_toast_corruption(ctx, ta,
-						  pstrdup("toast chunk data is null"));
+								psprintf("toast value %u chunk %d has null data",
+										 ta->toast_pointer.va_valueid, chunkno));
 		return;
 	}
 	if (!VARATT_IS_EXTENDED(chunk))
@@ -1205,8 +1207,9 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 		uint32		header = ((varattrib_4b *) chunk)->va_4byte.va_header;
 
 		report_toast_corruption(ctx, ta,
-						  psprintf("corrupt extended toast chunk has invalid varlena header: %0x (sequence number %d)",
-								   header, curchunk));
+								psprintf("toast value %u chunk %d has invalid varlena header %0x",
+										 ta->toast_pointer.va_valueid,
+										 chunkno, header));
 		return;
 	}
 
@@ -1216,15 +1219,17 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 	if (curchunk != chunkno)
 	{
 		report_toast_corruption(ctx, ta,
-						  psprintf("toast chunk sequence number %u does not match the expected sequence number %u",
-								   curchunk, chunkno));
+								psprintf("toast value %u chunk %d has sequence number %d, but expected sequence number %d",
+										 ta->toast_pointer.va_valueid,
+										 chunkno, curchunk, chunkno));
 		return;
 	}
-	if (curchunk > endchunk)
+	if (chunkno > endchunk)
 	{
 		report_toast_corruption(ctx, ta,
-						  psprintf("toast chunk sequence number %u exceeds the end chunk sequence number %u",
-								   curchunk, endchunk));
+								psprintf("toast value %u chunk %d follows last expected chunk %d",
+										 ta->toast_pointer.va_valueid,
+										 chunkno, endchunk));
 		return;
 	}
 
@@ -1233,8 +1238,9 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 
 	if (chunksize != expected_size)
 		report_toast_corruption(ctx, ta,
-						  psprintf("toast chunk size %u differs from the expected size %u",
-								   chunksize, expected_size));
+								psprintf("toast value %u chunk %d has size %u, but expected size %u",
+										 ta->toast_pointer.va_valueid,
+										 chunkno, chunksize, expected_size));
 }
 
 /*
@@ -1265,6 +1271,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	char	   *tp;				/* pointer to the tuple data */
 	uint16		infomask;
 	Form_pg_attribute thisatt;
+	struct varatt_external toast_pointer;
 
 	infomask = ctx->tuphdr->t_infomask;
 	thisatt = TupleDescAttr(RelationGetDescr(ctx->rel), ctx->attnum);
@@ -1274,8 +1281,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
 	{
 		report_corruption(ctx,
-						  psprintf("attribute %u with length %u starts at offset %u beyond total tuple length %u",
-								   ctx->attnum,
+						  psprintf("attribute with length %u starts at offset %u beyond total tuple length %u",
 								   thisatt->attlen,
 								   ctx->tuphdr->t_hoff + ctx->offset,
 								   ctx->lp_len));
@@ -1295,8 +1301,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 		if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
 		{
 			report_corruption(ctx,
-							  psprintf("attribute %u with length %u ends at offset %u beyond total tuple length %u",
-									   ctx->attnum,
+							  psprintf("attribute with length %u ends at offset %u beyond total tuple length %u",
 									   thisatt->attlen,
 									   ctx->tuphdr->t_hoff + ctx->offset,
 									   ctx->lp_len));
@@ -1328,8 +1333,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 		if (va_tag != VARTAG_ONDISK)
 		{
 			report_corruption(ctx,
-							  psprintf("toasted attribute %u has unexpected TOAST tag %u",
-									   ctx->attnum,
+							  psprintf("toasted attribute has unexpected TOAST tag %u",
 									   va_tag));
 			/* We can't know where the next attribute begins */
 			return false;
@@ -1343,8 +1347,7 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
 	{
 		report_corruption(ctx,
-						  psprintf("attribute %u with length %u ends at offset %u beyond total tuple length %u",
-								   ctx->attnum,
+						  psprintf("attribute with length %u ends at offset %u beyond total tuple length %u",
 								   thisatt->attlen,
 								   ctx->tuphdr->t_hoff + ctx->offset,
 								   ctx->lp_len));
@@ -1371,12 +1374,17 @@ check_tuple_attribute(HeapCheckContext *ctx)
 
 	/* It is external, and we're looking at a page on disk */
 
+	/*
+	 * Must copy attr into toast_pointer for alignment considerations
+	 */
+	VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+
 	/* The tuple header better claim to contain toasted values */
 	if (!(infomask & HEAP_HASEXTERNAL))
 	{
 		report_corruption(ctx,
-						  psprintf("attribute %u is external but tuple header flag HEAP_HASEXTERNAL not set",
-								   ctx->attnum));
+						  psprintf("toast value %u is external but tuple header flag HEAP_HASEXTERNAL not set",
+								   toast_pointer.va_valueid));
 		return true;
 	}
 
@@ -1384,8 +1392,8 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	if (!ctx->rel->rd_rel->reltoastrelid)
 	{
 		report_corruption(ctx,
-						  psprintf("attribute %u is external but relation has no toast relation",
-								   ctx->attnum));
+						  psprintf("toast value %u is external but relation has no toast relation",
+								   toast_pointer.va_valueid));
 		return true;
 	}
 
@@ -1464,12 +1472,13 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 
 	if (!found_toasttup)
 		report_toast_corruption(ctx, ta,
-								psprintf("toasted value for attribute %u missing from toast table",
-										 ta->attnum));
+								psprintf("toast value %u not found in toast table",
+										 ta->toast_pointer.va_valueid));
 	else if (chunkno != (endchunk + 1))
 		report_toast_corruption(ctx, ta,
-								psprintf("final toast chunk number %u differs from expected value %u",
-										 chunkno, (endchunk + 1)));
+								psprintf("toast value %u was expected to end at chunk %u, but ended at chunk %u",
+										 ta->toast_pointer.va_valueid,
+										 (endchunk + 1), chunkno));
 }
 
 /*
diff --git a/src/bin/pg_amcheck/t/004_verify_heapam.pl b/src/bin/pg_amcheck/t/004_verify_heapam.pl
index 2171d236a7..3c1277adf3 100644
--- a/src/bin/pg_amcheck/t/004_verify_heapam.pl
+++ b/src/bin/pg_amcheck/t/004_verify_heapam.pl
@@ -480,7 +480,7 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 
 		$header = header(0, $offnum, 1);
 		push @expected,
-			qr/${header}attribute \d+ with length \d+ ends at offset \d+ beyond total tuple length \d+/;
+			qr/${header}attribute with length \d+ ends at offset \d+ beyond total tuple length \d+/;
 	}
 	elsif ($offnum == 13)
 	{
@@ -489,7 +489,7 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 
 		$header = header(0, $offnum, 2);
 		push @expected,
-			qr/${header}toasted value for attribute 2 missing from toast table/;
+			qr/${header}toast value \d+ not found in toast table/;
 	}
 	elsif ($offnum == 14)
 	{
-- 
2.21.1 (Apple Git-122.3)

v20-0002-amcheck-adding-toast-pointer-corruption-checks.patch
From 7a78901dbebfc099208f041faf30e4707d4a0647 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Thu, 8 Apr 2021 10:09:59 -0700
Subject: [PATCH v20 2/2] amcheck: adding toast pointer corruption checks

Verifying that toast pointer va_toastrelid fields match their heap
table's reltoastrelid.

Checking the extsize for a toast pointer against the raw size.  This
check could fail if buggy compression logic fails to notice that
compressing the attribute makes it bigger.  But assuming the logic
for that is correct, overlarge extsize indicates a corrupted toast
pointer.

Checking if a toast pointer indicates the data is compressed, that
the toast pointer records a valid compression method.

Checking that the toast is not too large to be allocated.  No such
toasted value should ever be stored, but a corrupted toast pointer
could record an unreasonably large size, so check that.

Changing the logic to continue checking toast even after reporting
that HEAP_HASEXTERNAL is false.  Previously, the toast checking
stopped here, but that wasn't necessary, and subsequent checks may
provide additional useful diagnostic information.

Checking for missing or extraneous toast chunks for a toasted
attribute.  While at it, checking that the toast index scan returns
chunks in order.
---
 contrib/amcheck/verify_heapam.c  | 402 +++++++++++++++++++++++++++----
 src/tools/pgindent/typedefs.list |   1 +
 2 files changed, 351 insertions(+), 52 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 13f420d9ad..93315c1f24 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -30,6 +30,9 @@ PG_FUNCTION_INFO_V1(verify_heapam);
 /* The number of columns in tuples returned by verify_heapam */
 #define HEAPCHECK_RELATION_COLS 4
 
+/* The largest valid toast va_rawsize */
+#define VARLENA_SIZE_LIMIT 0x3FFFFFFF
+
 /*
  * Despite the name, we use this for reporting problems with both XIDs and
  * MXIDs.
@@ -146,12 +149,40 @@ typedef struct HeapCheckContext
 	Tuplestorestate *tupstore;
 } HeapCheckContext;
 
+/*
+ * Struct holding the running context information during the check of a
+ * a single toasted attribute.
+ */
+typedef struct ToastCheckContext
+{
+	/*
+	 * Cache tracking a sequence of contiguous toast chunks, each of size
+	 * TOAST_MAX_CHUNK_SIZE, and having sequence numbers outside the expected
+	 * range.  The sequence numbers of such chunks are cached until the
+	 * sequence ends so that a single toast corruption report can be emitted
+	 * for the group, rather than one report per chunk.
+	 */
+	bool		have_extraneous_chunks;
+	int32		first_extraneous;
+	int32		last_extraneous;
+
+	/* Most recent previously seen chunk sequence number */
+	int32		last_chunk_seen;
+
+	/*
+	 * Expected sequence number and size of the final chunk expected for this
+	 * toasted attribute
+	 */
+	int32		final_expected_chunk;
+	int32		final_expected_size;
+} ToastCheckContext;
+
 /* Internal implementation */
 static void sanity_check_relation(Relation rel);
 static void check_tuple(HeapCheckContext *ctx);
-static void check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
-							  ToastedAttribute *ta, int32 chunkno,
-							  int32 endchunk);
+static int32 check_toast_tuple(HeapTuple toasttup, ToastedAttribute *ta,
+							   ToastCheckContext *tctx, HeapCheckContext *hctx,
+							   int32 chunkno);
 
 static bool check_tuple_attribute(HeapCheckContext *ctx);
 static void check_toasted_attribute(HeapCheckContext *ctx,
@@ -1147,10 +1178,184 @@ check_tuple_visibility(HeapCheckContext *ctx)
 	return true;
 }
 
+/*
+ * Issues toast corruption reports for the given extraneous partial chunk, if
+ * not null, along with any extraneous full chunks in the tctx cache, which is
+ * then cleared.  A "partial chunk" is any chunk with size less than
+ * TOAST_MAX_CHUNK_SIZE.  An "extraneous chunk" is one with a sequence number
+ * outside the expected range for the toasted attribute.
+ *
+ * To report extraneous full chunks and clear the cache, call with partialchunk
+ * and partialsize NULL.  If the cache is already empty, the call is harmless.
+ *
+ * Extraneous partial chunks are never cached.  When reporting one, any cached
+ * extraneous full chunks will also be reported and the cache cleared.
+ */
+static void
+report_extraneous_chunks(HeapCheckContext *hctx, ToastedAttribute *ta,
+						 ToastCheckContext *tctx, int32 *partialchunk,
+						 int32 *partialsize)
+{
+	if (tctx->have_extraneous_chunks && partialchunk != NULL &&
+		tctx->last_extraneous == *partialchunk - 1)
+	{
+		/*
+		 * For brevity, combine one or more contiguous full chunks ending
+		 * with a partial chunk into just one corruption report.
+		 */
+		if (tctx->first_extraneous < tctx->last_extraneous)
+			report_toast_corruption(hctx, ta,
+									psprintf("toast value %u unexpected chunks %d through %d each with size %d followed by chunk %d with size %d",
+										 ta->toast_pointer.va_valueid,
+										 tctx->first_extraneous,
+										 tctx->last_extraneous,
+										 (int)TOAST_MAX_CHUNK_SIZE,
+										 *partialchunk, *partialsize));
+		else
+			report_toast_corruption(hctx, ta,
+									psprintf("toast value %u unexpected chunk %d with size %d followed by chunk %d with size %d",
+										 ta->toast_pointer.va_valueid,
+										 tctx->first_extraneous,
+										 (int)TOAST_MAX_CHUNK_SIZE,
+										 *partialchunk, *partialsize));
+		tctx->have_extraneous_chunks = false;
+		return;
+	}
+
+	if (tctx->have_extraneous_chunks)
+	{
+		/*
+		 * Either the previously seen extraneous chunks are not contiguous with
+		 * the partial chunk being reported, or there is no partial chunk being
+		 * reported.
+		 */
+		if (tctx->first_extraneous < tctx->last_extraneous)
+			report_toast_corruption(hctx, ta,
+									psprintf("toast value %u unexpected chunks %d through %d each with size %d",
+										 ta->toast_pointer.va_valueid,
+										 tctx->first_extraneous,
+										 tctx->last_extraneous,
+										 (int)TOAST_MAX_CHUNK_SIZE));
+		else
+			report_toast_corruption(hctx, ta,
+									psprintf("toast value %u unexpected chunk %d with size %d",
+										 ta->toast_pointer.va_valueid,
+										 tctx->first_extraneous,
+										 (int)TOAST_MAX_CHUNK_SIZE));
+	}
+
+	if (partialchunk != NULL)
+		/* Report the partial chunk */
+		report_toast_corruption(hctx, ta,
+								psprintf("toast value %u unexpected chunk %d with size %d",
+										 ta->toast_pointer.va_valueid,
+										 *partialchunk, *partialsize));
+	tctx->have_extraneous_chunks = false;
+}
+
+/*
+ * Records that a toast chunk should be reported as extraneous.  After
+ * finishing all calls to this function for a given toasted attribute, a call
+ * to report_extraneous_chunks() should be issued to flush the cache.
+ */
+static void
+handle_extraneous_chunk(HeapCheckContext *hctx, ToastedAttribute *ta,
+						ToastCheckContext *tctx, int32 curchunk,
+						int32 chunksize)
+{
+	if (chunksize == TOAST_MAX_CHUNK_SIZE)
+	{
+		if (tctx->have_extraneous_chunks)
+		{
+			if (tctx->last_extraneous == curchunk - 1)
+			{
+				/*
+				 * This is the next chunk in an ongoing sequence.  Extend it,
+				 * but do not report it yet.
+				 */
+				tctx->last_extraneous = curchunk;
+				return;
+			}
+
+			/*
+			 * There is an ongoing sequence, but this chunk is discontiguous
+			 * with it.  Report the sequence and clear the cache so we can
+			 * start over with this chunk.
+			 */
+			report_extraneous_chunks(hctx, ta, tctx, NULL, NULL);
+		}
+
+		/* Start a new sequence, but do not report it yet. */
+		tctx->first_extraneous = curchunk;
+		tctx->last_extraneous = curchunk;
+		tctx->have_extraneous_chunks = true;
+		return;
+	}
+
+	/*
+	 * This is a partial chunk.  Report it.  If there is an ongoing full chunk
+	 * sequence, this will report and flush that, too, but we don't care.
+	 */
+	report_extraneous_chunks(hctx, ta, tctx, &curchunk, &chunksize);
+}
+
+/*
+ * Issues toast corruption reports for one or more missing toast chunks
+ * in the [first_missing..last_missing] range, inclusive.
+ */
+static void
+report_missing_chunks(HeapCheckContext *hctx, ToastedAttribute *ta,
+					  ToastCheckContext *tctx, int32 first_missing,
+					  int32 last_missing)
+{
+	int32	expected_size;
+
+	Assert(first_missing >= 0);
+	Assert(last_missing >= first_missing);
+	Assert(last_missing <= tctx->final_expected_chunk);
+
+	if (last_missing < tctx->final_expected_chunk)
+		expected_size = TOAST_MAX_CHUNK_SIZE;
+	else
+		expected_size = tctx->final_expected_size;
+
+	/*
+	 * Report missing chunks with language matching language used for reporting
+	 * extraneous chunks.  Mention the sizes expected for the missing chunks so
+	 * the user can reconcile that against any extraneous chunk reports.
+	 */
+	if (last_missing > first_missing + 1 &&
+		expected_size < TOAST_MAX_CHUNK_SIZE)
+		report_toast_corruption(hctx, ta,
+								psprintf("toast value %u missing chunks %d through %d with expected size %d and chunk %d with expected size %d",
+										 ta->toast_pointer.va_valueid,
+										 first_missing, last_missing - 1,
+										 (int)TOAST_MAX_CHUNK_SIZE,
+										 last_missing, expected_size));
+	else if (last_missing == first_missing + 1 &&
+			 expected_size < TOAST_MAX_CHUNK_SIZE)
+		report_toast_corruption(hctx, ta,
+								psprintf("toast value %u missing chunk %d with expected size %d and chunk %d with expected size %d",
+										 ta->toast_pointer.va_valueid,
+										 first_missing,
+										 (int)TOAST_MAX_CHUNK_SIZE,
+										 last_missing, expected_size));
+	else if (last_missing > first_missing)
+		report_toast_corruption(hctx, ta,
+								psprintf("toast value %u missing chunks %d through %d with expected size %d",
+										 ta->toast_pointer.va_valueid,
+										 first_missing, last_missing,
+										 (int)TOAST_MAX_CHUNK_SIZE));
+	else
+		report_toast_corruption(hctx, ta,
+								psprintf("toast value %u missing chunk %d with expected size %d",
+										 ta->toast_pointer.va_valueid,
+										 last_missing, expected_size));
+}
 
 /*
- * Check the current toast tuple against the state tracked in ctx, recording
- * any corruption found in ctx->tupstore.
+ * Check the current toast tuple, recording any corruption found in
+ * ctx->tupstore.
  *
  * This is not equivalent to running verify_heapam on the toast table itself,
  * and is not hardened against corruption of the toast table.  Rather, when
@@ -1159,38 +1364,73 @@ check_tuple_visibility(HeapCheckContext *ctx)
  * each toast tuple being checked against where we are in the sequence, as well
  * as each toast tuple having its varlena structure sanity checked.
  *
- * Returns whether the toast tuple passed the corruption checks.
+ * Returns the size of the current toast tuple chunk, or zero if the
+ * chunk is not sufficiently sensible for the chunk size to be determined.
  */
-static void
-check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
-				  ToastedAttribute *ta, int32 chunkno, int32 endchunk)
+static int32
+check_toast_tuple(HeapTuple toasttup, ToastedAttribute *ta,
+				  ToastCheckContext *tctx, HeapCheckContext *hctx,
+				  int32 chunkno)
 {
 	int32		curchunk;
 	Pointer		chunk;
 	bool		isnull;
 	int32		chunksize;
-	int32		expected_size;
+	int32		max_valid_prior_chunk;
 
 	/*
 	 * Have a chunk, extract the sequence number and the data
 	 */
 	curchunk = DatumGetInt32(fastgetattr(toasttup, 2,
-										 ctx->toast_rel->rd_att, &isnull));
+										 hctx->toast_rel->rd_att, &isnull));
 	if (isnull)
 	{
-		report_toast_corruption(ctx, ta,
+		report_toast_corruption(hctx, ta,
 								psprintf("toast value %u has toast chunk with null sequence number",
 										 ta->toast_pointer.va_valueid));
-		return;
+		return 0;
 	}
+
+	/*
+	 * Maximum chunk sequence number in the expected range which is less than
+	 * curchunk.  Note that curchunk itself may be outside the valid range.
+	 */
+	max_valid_prior_chunk = Min(curchunk-1, tctx->final_expected_chunk);
+
+	/* Report any missing chunks at the beginning of the expected sequence */
+	if (chunkno == 0 && max_valid_prior_chunk >= 0)
+		report_missing_chunks(hctx, ta, tctx, 0, max_valid_prior_chunk);
+
+	/* Complain if the chunk sequence number retreats */
+	if (chunkno > 0 && curchunk < tctx->last_chunk_seen)
+		report_toast_corruption(hctx, ta,
+								psprintf("toast value %u index scan returned chunk number %d after chunk number %d",
+										 ta->toast_pointer.va_valueid,
+										 curchunk, tctx->last_chunk_seen));
+
+	/* Complain if the same chunk sequence number is returned multiple times */
+	else if (chunkno > 0 && curchunk == tctx->last_chunk_seen)
+		report_toast_corruption(hctx, ta,
+								psprintf("toast value %u index scan returned chunk number %d more than once",
+										 ta->toast_pointer.va_valueid,
+										 curchunk));
+
+	/* Report any missing chunks in the middle of the expected sequence */
+	else if (chunkno > 0 && max_valid_prior_chunk > tctx->last_chunk_seen)
+		report_missing_chunks(hctx, ta, tctx, tctx->last_chunk_seen + 1,
+							  max_valid_prior_chunk);
+
+	/* Remember this chunk sequence number as the last one seen */
+	tctx->last_chunk_seen = curchunk;
+
 	chunk = DatumGetPointer(fastgetattr(toasttup, 3,
-										ctx->toast_rel->rd_att, &isnull));
+										hctx->toast_rel->rd_att, &isnull));
 	if (isnull)
 	{
-		report_toast_corruption(ctx, ta,
+		report_toast_corruption(hctx, ta,
 								psprintf("toast value %u chunk %d has null data",
 										 ta->toast_pointer.va_valueid, chunkno));
-		return;
+		return 0;
 	}
 	if (!VARATT_IS_EXTENDED(chunk))
 		chunksize = VARSIZE(chunk) - VARHDRSZ;
@@ -1206,41 +1446,34 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 		/* should never happen */
 		uint32		header = ((varattrib_4b *) chunk)->va_4byte.va_header;
 
-		report_toast_corruption(ctx, ta,
+		report_toast_corruption(hctx, ta,
 								psprintf("toast value %u chunk %d has invalid varlena header %0x",
 										 ta->toast_pointer.va_valueid,
 										 chunkno, header));
-		return;
+		return 0;
 	}
 
-	/*
-	 * Some checks on the data we've found
-	 */
-	if (curchunk != chunkno)
-	{
-		report_toast_corruption(ctx, ta,
-								psprintf("toast value %u chunk %d has sequence number %d, but expected sequence number %d",
-										 ta->toast_pointer.va_valueid,
-										 chunkno, curchunk, chunkno));
-		return;
-	}
-	if (chunkno > endchunk)
-	{
-		report_toast_corruption(ctx, ta,
-								psprintf("toast value %u chunk %d follows last expected chunk %d",
-										 ta->toast_pointer.va_valueid,
-										 chunkno, endchunk));
-		return;
-	}
+	/* Report an extraneous chunk outside the expected sequence */
+	if (curchunk < 0 || curchunk > tctx->final_expected_chunk)
+		handle_extraneous_chunk(hctx, ta, tctx, curchunk, chunksize);
 
-	expected_size = curchunk < endchunk ? TOAST_MAX_CHUNK_SIZE
-		: VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer) - (endchunk * TOAST_MAX_CHUNK_SIZE);
+	/* Report a partial chunk before the final expected chunk */
+	else if (curchunk < tctx->final_expected_chunk && chunksize != TOAST_MAX_CHUNK_SIZE)
+		report_toast_corruption(hctx, ta,
+								psprintf("toast value %u chunk %d has size %d, but expected chunk with size %d",
+										 ta->toast_pointer.va_valueid,
+										 curchunk, chunksize,
+										 (int)TOAST_MAX_CHUNK_SIZE));
 
-	if (chunksize != expected_size)
-		report_toast_corruption(ctx, ta,
-								psprintf("toast value %u chunk %d has size %u, but expected size %u",
+	/* Report a final chunk of the wrong size */
+	else if (curchunk == tctx->final_expected_chunk && chunksize != tctx->final_expected_size)
+		report_toast_corruption(hctx, ta,
+								psprintf("toast value %u chunk %d has size %d, but expected chunk with size %d",
 										 ta->toast_pointer.va_valueid,
-										 chunkno, chunksize, expected_size));
+										 curchunk, chunksize,
+										 tctx->final_expected_size));
+
+	return chunksize;
 }
 
 /*
@@ -1379,14 +1612,55 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	 */
 	VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
 
+	/* Oversized toasted attributes should never be stored */
+	if (toast_pointer.va_rawsize > VARLENA_SIZE_LIMIT)
+		report_corruption(ctx,
+						  psprintf("toast value %u rawsize %u exceeds limit %u",
+								   toast_pointer.va_valueid,
+								   toast_pointer.va_rawsize,
+								   VARLENA_SIZE_LIMIT));
+
+	/* Compression should never expand the attribute */
+	if (VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer) > toast_pointer.va_rawsize - VARHDRSZ)
+		report_corruption(ctx,
+						  psprintf("toast value %u external size %u exceeds maximum expected for rawsize %u",
+								   toast_pointer.va_valueid,
+								   VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer),
+								   toast_pointer.va_rawsize));
+
+	/* Compressed attributes should have a valid compression method */
+	if (VARATT_IS_COMPRESSED(&toast_pointer))
+	{
+		ToastCompressionId cmid;
+		bool		invalid = true;
+
+		cmid = TOAST_COMPRESS_METHOD(&toast_pointer);
+		switch (cmid)
+		{
+			/* List of all valid compression method IDs */
+			case TOAST_PGLZ_COMPRESSION_ID:
+			case TOAST_LZ4_COMPRESSION_ID:
+				invalid = false;
+				break;
+
+			/* Recognized but invalid compression method ID */
+			case TOAST_INVALID_COMPRESSION_ID:
+				break;
+
+			/* Intentionally no default here */
+		}
+
+		if (invalid)
+			report_corruption(ctx,
+							  psprintf("toast value %u has invalid compression method id %d",
+									   toast_pointer.va_valueid, cmid));
+	}
+
 	/* The tuple header better claim to contain toasted values */
 	if (!(infomask & HEAP_HASEXTERNAL))
-	{
 		report_corruption(ctx,
 						  psprintf("toast value %u is external but tuple header flag HEAP_HASEXTERNAL not set",
 								   toast_pointer.va_valueid));
-		return true;
-	}
 
 	/* The relation better have a toast table */
 	if (!ctx->rel->rd_rel->reltoastrelid)
@@ -1397,6 +1671,14 @@ check_tuple_attribute(HeapCheckContext *ctx)
 		return true;
 	}
 
+	/* The toast pointer had better point at the relation's toast table */
+	if (toast_pointer.va_toastrelid != ctx->rel->rd_rel->reltoastrelid)
+		report_corruption(ctx,
+						  psprintf("toast value %u toast relation oid %u differs from expected oid %u",
+								   toast_pointer.va_valueid,
+								   toast_pointer.va_toastrelid,
+								   ctx->rel->rd_rel->reltoastrelid));
+
 	/* If we were told to skip toast checking, then we're done. */
 	if (ctx->toast_rel == NULL)
 		return true;
@@ -1437,9 +1719,20 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 	bool		found_toasttup;
 	HeapTuple	toasttup;
 	int32		chunkno;
-	int32		endchunk;
+	int64		totalsize;		/* corrupt toast could overflow 32 bits */
+	int32		extsize;
+	ToastCheckContext tctx;
 
-	endchunk = (VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer) - 1) / TOAST_MAX_CHUNK_SIZE;
+	/* Calculate expected number of chunks and size of final chunk */
+	extsize = VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer);
+	tctx.final_expected_chunk = (extsize - 1) / TOAST_MAX_CHUNK_SIZE;
+	tctx.final_expected_size = extsize - tctx.final_expected_chunk * TOAST_MAX_CHUNK_SIZE;
+
+	/* Have not yet seen any chunks for this toast tuple */
+	tctx.have_extraneous_chunks = false;
+	tctx.first_extraneous = -1;
+	tctx.last_extraneous = -1;
+	tctx.last_chunk_seen = -1;
 
 	/*
 	 * Setup a scan key to find chunks in toast table with matching va_valueid
@@ -1459,13 +1752,14 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 										   &SnapshotToast, 1,
 										   &toastkey);
 	chunkno = 0;
+	totalsize = 0;
 	found_toasttup = false;
 	while ((toasttup =
 			systable_getnext_ordered(toastscan,
 									 ForwardScanDirection)) != NULL)
 	{
 		found_toasttup = true;
-		check_toast_tuple(toasttup, ctx, ta, chunkno, endchunk);
+		totalsize += check_toast_tuple(toasttup, ta, &tctx, ctx, chunkno);
 		chunkno++;
 	}
 	systable_endscan_ordered(toastscan);
@@ -1474,11 +1768,15 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u not found in toast table",
 										 ta->toast_pointer.va_valueid));
-	else if (chunkno != (endchunk + 1))
+	else if (chunkno != tctx.final_expected_chunk + 1 || extsize != totalsize)
 		report_toast_corruption(ctx, ta,
-								psprintf("toast value %u was expected to end at chunk %u, but ended at chunk %u",
+								psprintf("toast value %u was expected to end at chunk %u with total size %d, but ended at chunk %u with total size " INT64_FORMAT,
 										 ta->toast_pointer.va_valueid,
-										 (endchunk + 1), chunkno));
+										 (tctx.final_expected_chunk + 1),
+										 extsize, chunkno, totalsize));
+
+	/* Report any remaining cached extraneous chunks */
+	report_extraneous_chunks(ctx, ta, &tctx, NULL, NULL);
 }
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c7aff677d4..996ef03180 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2558,6 +2558,7 @@ TimestampTz
 TmFromChar
 TmToChar
 ToastAttrInfo
+ToastCheckContext
 ToastTupleContext
 ToastedAttribute
 TocEntry
-- 
2.21.1 (Apple Git-122.3)

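As an aside for readers tracing the patch above: report_missing_chunks() selects among four message variants depending on whether the missing range covers one chunk or several, and whether it includes the final (possibly short) chunk. A rough Python sketch of that selection logic (a hypothetical illustration, not the patch's C code, assuming a chunk size of 1996 bytes as is typical on 8 kB pages):

```python
# Hypothetical sketch of the message-variant selection in
# report_missing_chunks().  Only the final expected chunk of a toast
# value may be smaller than TOAST_MAX_CHUNK_SIZE.
TOAST_MAX_CHUNK_SIZE = 1996  # assumed value for 8 kB pages

def missing_chunks_message(first, last, final_chunk, final_size):
    """Return the corruption message for missing chunks [first..last]."""
    assert 0 <= first <= last <= final_chunk
    # Size expected for the last chunk in the missing range.
    expected = final_size if last == final_chunk else TOAST_MAX_CHUNK_SIZE
    if last > first + 1 and expected < TOAST_MAX_CHUNK_SIZE:
        # Several full-size chunks followed by a short final chunk.
        return (f"missing chunks {first} through {last - 1} with expected "
                f"size {TOAST_MAX_CHUNK_SIZE} and chunk {last} with "
                f"expected size {expected}")
    if last == first + 1 and expected < TOAST_MAX_CHUNK_SIZE:
        # One full-size chunk followed by a short final chunk.
        return (f"missing chunk {first} with expected size "
                f"{TOAST_MAX_CHUNK_SIZE} and chunk {last} with expected "
                f"size {expected}")
    if last > first:
        # Several chunks, all of the maximum size.
        return (f"missing chunks {first} through {last} with expected "
                f"size {TOAST_MAX_CHUNK_SIZE}"
                if False else
                f"missing chunks {first} through {last} with expected "
                f"size {TOAST_MAX_CHUNK_SIZE}")
    # A single missing chunk.
    return f"missing chunk {last} with expected size {expected}"
```

The helper name and the assumed chunk size are for illustration only; the patch derives the real chunk size from the server's block size.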
#134Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#133)
Re: pg_amcheck contrib application

On Mon, Apr 12, 2021 at 11:06 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

It now reports:

# heap table "postgres"."public"."test", block 0, offset 18, attribute 2:
# toast value 16461 missing chunk 3 with expected size 1996
# heap table "postgres"."public"."test", block 0, offset 18, attribute 2:
# toast value 16461 was expected to end at chunk 6 with total size 10000, but ended at chunk 5 with total size 8004

It sounds like you weren't expecting the second of these reports. I think it is valuable, especially when there are multiple missing chunks and multiple extraneous chunks, as it makes it easier for the user to reconcile the missing chunks against the extraneous chunks.

I wasn't, but I'm not overwhelmingly opposed to it, either. I do think
I would be in favor of splitting this kind of thing up into two
messages:

# toast value 16459 unexpected chunks 1000 through 1004 each with
size 1996 followed by chunk 1005 with size 20

We'll have fewer message variants, and I don't think there will be any
real regression in usability if we say:

# toast value 16459 has unexpected chunks 1000 through 1004 each
with size 1996
# toast value 16459 has unexpected chunk 1005 with size 20

(Notice that I also inserted "has" so that the sentence has a verb. Or we
could use "contains.")

I committed 0001.

--
Robert Haas
EDB: http://www.enterprisedb.com
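The two-message scheme proposed here amounts to coalescing runs of consecutive full-size unexpected chunks into one report while reporting any partial chunk on its own. A minimal sketch, as hypothetical Python rather than the patch's C (the helper name and 1996-byte chunk size are assumptions for illustration):

```python
# Sketch of the proposed reporting scheme: one report per run of
# consecutive full-size unexpected chunks, plus a separate report for
# each partial (smaller-than-maximum) chunk.
TOAST_MAX_CHUNK_SIZE = 1996  # assumed value for 8 kB pages

def unexpected_chunk_reports(chunks):
    """chunks: list of (chunkno, size) pairs outside the expected range,
    in the order the index scan returned them."""
    reports = []
    run = None                      # [first, last] of the current run

    def flush():
        nonlocal run
        if run is None:
            return
        first, last = run
        if first < last:
            reports.append(f"has unexpected chunks {first} through {last} "
                           f"each with size {TOAST_MAX_CHUNK_SIZE}")
        else:
            reports.append(f"has unexpected chunk {first} with size "
                           f"{TOAST_MAX_CHUNK_SIZE}")
        run = None

    for chunkno, size in chunks:
        if size == TOAST_MAX_CHUNK_SIZE:
            if run is not None and run[1] == chunkno - 1:
                run[1] = chunkno    # extend the current contiguous run
            else:
                flush()             # report any prior run, start a new one
                run = [chunkno, chunkno]
        else:
            flush()                 # a partial chunk ends any run
            reports.append(f"has unexpected chunk {chunkno} with size {size}")
    flush()
    return reports
```

With the chunks from the example above (1000 through 1004 at full size, then 1005 at size 20), this yields exactly the two messages proposed.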

#135Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#134)
1 attachment(s)
Re: pg_amcheck contrib application

On Apr 14, 2021, at 10:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Apr 12, 2021 at 11:06 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

It now reports:

# heap table "postgres"."public"."test", block 0, offset 18, attribute 2:
# toast value 16461 missing chunk 3 with expected size 1996
# heap table "postgres"."public"."test", block 0, offset 18, attribute 2:
# toast value 16461 was expected to end at chunk 6 with total size 10000, but ended at chunk 5 with total size 8004

It sounds like you weren't expecting the second of these reports. I think it is valuable, especially when there are multiple missing chunks and multiple extraneous chunks, as it makes it easier for the user to reconcile the missing chunks against the extraneous chunks.

I wasn't, but I'm not overwhelmingly opposed to it, either. I do think
I would be in favor of splitting this kind of thing up into two
messages:

# toast value 16459 unexpected chunks 1000 through 1004 each with
size 1996 followed by chunk 1005 with size 20

We'll have fewer message variants and I don't think any real
regression in usability if we say:

# toast value 16459 has unexpected chunks 1000 through 1004 each
with size 1996
# toast value 16459 has unexpected chunk 1005 with size 20

Changed.

(Notice that I also inserted "has" so that the sentence has a verb. Or we
could use "contains.")

I have added the verb "has" rather than "contains" because "has" is more consistent with the phrasing of other similar corruption reports.

Attachments:

v21-0001-amcheck-adding-toast-pointer-corruption-checks.patchapplication/octet-stream; name=v21-0001-amcheck-adding-toast-pointer-corruption-checks.patch; x-unix-mode=0644Download
From 2f41e74d211d0bb939bedd688933aa891c511b2d Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Thu, 15 Apr 2021 08:56:59 -0700
Subject: [PATCH v21] amcheck: adding toast pointer corruption checks

Adding additional checks of toast pointers: checking the extsize
against the rawsize, the uncompressed size against the size limit
for varlena datums, the va_toastrelid field against the heap table's
reltoastrelid, and if compressed, the validity of the compression
method ID.

Adding checks that the toasted attribute chunks are returned by the
toast index scan in order and without duplicates, and improving the
reports of missing or extra chunks to be more clear to the user.

Changing the logic to continue checking toast even after reporting
that HEAP_HASEXTERNAL is false.  Previously, the toast checking
stopped here, but that wasn't necessary, and subsequent checks may
provide additional useful diagnostic information.
---
 contrib/amcheck/verify_heapam.c  | 370 ++++++++++++++++++++++++++-----
 src/tools/pgindent/typedefs.list |   1 +
 2 files changed, 319 insertions(+), 52 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 9366f45d74..b5500c53a0 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -30,6 +30,9 @@ PG_FUNCTION_INFO_V1(verify_heapam);
 /* The number of columns in tuples returned by verify_heapam */
 #define HEAPCHECK_RELATION_COLS 4
 
+/* The largest valid toast va_rawsize */
+#define VARLENA_SIZE_LIMIT 0x3FFFFFFF
+
 /*
  * Despite the name, we use this for reporting problems with both XIDs and
  * MXIDs.
@@ -146,12 +149,40 @@ typedef struct HeapCheckContext
 	Tuplestorestate *tupstore;
 } HeapCheckContext;
 
+/*
+ * Struct holding the running context information during the check of
+ * a single toasted attribute.
+ */
+typedef struct ToastCheckContext
+{
+	/*
+	 * Cache tracking a sequence of contiguous toast chunks, each of size
+	 * TOAST_MAX_CHUNK_SIZE, and having sequence numbers outside the expected
+	 * range.  The sequence numbers of such chunks are cached until the
+	 * sequence ends so that a single toast corruption report can be emitted
+	 * for the group, rather than one report per chunk.
+	 */
+	bool		have_extraneous_chunks;
+	int32		first_extraneous;
+	int32		last_extraneous;
+
+	/* Most recent previously seen chunk sequence number */
+	int32		last_chunk_seen;
+
+	/*
+	 * Expected sequence number and size of the final chunk expected for this
+	 * toasted attribute
+	 */
+	int32		final_expected_chunk;
+	int32		final_expected_size;
+} ToastCheckContext;
+
 /* Internal implementation */
 static void sanity_check_relation(Relation rel);
 static void check_tuple(HeapCheckContext *ctx);
-static void check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
-							  ToastedAttribute *ta, int32 chunkno,
-							  int32 endchunk);
+static int32 check_toast_tuple(HeapTuple toasttup, ToastedAttribute *ta,
+							   ToastCheckContext *tctx, HeapCheckContext *hctx,
+							   int32 chunkno);
 
 static bool check_tuple_attribute(HeapCheckContext *ctx);
 static void check_toasted_attribute(HeapCheckContext *ctx,
@@ -1147,10 +1178,152 @@ check_tuple_visibility(HeapCheckContext *ctx)
 	return true;
 }
 
+/*
+ * Issues toast corruption reports for the given extraneous partial chunk, if
+ * not null, along with any extraneous full chunks in the tctx cache, which is
+ * then cleared.  A "partial chunk" is any chunk with size different from
+ * TOAST_MAX_CHUNK_SIZE.  An "extraneous chunk" is one with a sequence number
+ * outside the expected range for the toasted attribute.
+ *
+ * To report extraneous full chunks and clear the cache, call with partialchunk
+ * and partialsize NULL.  If the cache is already empty, the call is harmless.
+ *
+ * Extraneous partial chunks are never cached.  When reporting one, any cached
+ * extraneous full chunks will also be reported and the cache cleared.
+ */
+static void
+report_extraneous_chunks(HeapCheckContext *hctx, ToastedAttribute *ta,
+						 ToastCheckContext *tctx, int32 *partialchunk,
+						 int32 *partialsize)
+{
+	if (tctx->have_extraneous_chunks)
+	{
+		if (tctx->first_extraneous < tctx->last_extraneous)
+			report_toast_corruption(hctx, ta,
+									psprintf("toast value %u has unexpected chunks %d through %d each with size %d",
+										 ta->toast_pointer.va_valueid,
+										 tctx->first_extraneous,
+										 tctx->last_extraneous,
+										 (int)TOAST_MAX_CHUNK_SIZE));
+		else
+			report_toast_corruption(hctx, ta,
+									psprintf("toast value %u has unexpected chunk %d with size %d",
+										 ta->toast_pointer.va_valueid,
+										 tctx->first_extraneous,
+										 (int)TOAST_MAX_CHUNK_SIZE));
+		tctx->have_extraneous_chunks = false;
+	}
+
+	if (partialchunk != NULL)
+		report_toast_corruption(hctx, ta,
+								psprintf("toast value %u has unexpected chunk %d with size %d",
+										 ta->toast_pointer.va_valueid,
+										 *partialchunk, *partialsize));
+}
 
 /*
- * Check the current toast tuple against the state tracked in ctx, recording
- * any corruption found in ctx->tupstore.
+ * Records that a toast chunk should be reported as extraneous.  After
+ * finishing all calls to this function for a given toasted attribute, a call
+ * to report_extraneous_chunks() should be issued to flush the cache.
+ */
+static void
+handle_extraneous_chunk(HeapCheckContext *hctx, ToastedAttribute *ta,
+						ToastCheckContext *tctx, int32 curchunk,
+						int32 chunksize)
+{
+	if (chunksize == TOAST_MAX_CHUNK_SIZE)
+	{
+		if (tctx->have_extraneous_chunks)
+		{
+			if (tctx->last_extraneous == curchunk - 1)
+			{
+				/*
+				 * This is the next chunk in an ongoing sequence.  Extend it,
+				 * but do not report it yet.
+				 */
+				tctx->last_extraneous = curchunk;
+				return;
+			}
+
+			/*
+			 * There is an ongoing sequence, but this chunk is discontiguous
+			 * with it.  Report the sequence and clear the cache so we can
+			 * start over with this chunk.
+			 */
+			report_extraneous_chunks(hctx, ta, tctx, NULL, NULL);
+		}
+
+		/* Start a new sequence, but do not report it yet. */
+		tctx->first_extraneous = curchunk;
+		tctx->last_extraneous = curchunk;
+		tctx->have_extraneous_chunks = true;
+		return;
+	}
+
+	/*
+	 * This is a partial chunk.  Report it.  If there is an ongoing full chunk
+	 * sequence, this will report and flush that, too, but we don't care.
+	 */
+	report_extraneous_chunks(hctx, ta, tctx, &curchunk, &chunksize);
+}
+
+/*
+ * Issues toast corruption reports for one or more missing toast chunks
+ * in the [first_missing..last_missing] range, inclusive.
+ */
+static void
+report_missing_chunks(HeapCheckContext *hctx, ToastedAttribute *ta,
+					  ToastCheckContext *tctx, int32 first_missing,
+					  int32 last_missing)
+{
+	int32	expected_size;
+
+	Assert(first_missing >= 0);
+	Assert(last_missing >= first_missing);
+	Assert(last_missing <= tctx->final_expected_chunk);
+
+	if (last_missing < tctx->final_expected_chunk)
+		expected_size = TOAST_MAX_CHUNK_SIZE;
+	else
+		expected_size = tctx->final_expected_size;
+
+	/*
+	 * Report missing chunks using language matching that used for reporting
+	 * extraneous chunks.  Mention the sizes expected for the missing chunks
+	 * so the user can reconcile them against any extraneous chunk reports.
+	 */
+	if (last_missing > first_missing + 1 &&
+		expected_size < TOAST_MAX_CHUNK_SIZE)
+		report_toast_corruption(hctx, ta,
+								psprintf("toast value %u missing chunks %d through %d with expected size %d and chunk %d with expected size %d",
+										 ta->toast_pointer.va_valueid,
+										 first_missing, last_missing - 1,
+										 (int)TOAST_MAX_CHUNK_SIZE,
+										 last_missing, expected_size));
+	else if (last_missing == first_missing + 1 &&
+			 expected_size < TOAST_MAX_CHUNK_SIZE)
+		report_toast_corruption(hctx, ta,
+								psprintf("toast value %u missing chunk %d with expected size %d and chunk %d with expected size %d",
+										 ta->toast_pointer.va_valueid,
+										 first_missing,
+										 (int)TOAST_MAX_CHUNK_SIZE,
+										 last_missing, expected_size));
+	else if (last_missing > first_missing)
+		report_toast_corruption(hctx, ta,
+								psprintf("toast value %u missing chunks %d through %d with expected size %d",
+										 ta->toast_pointer.va_valueid,
+										 first_missing, last_missing,
+										 (int)TOAST_MAX_CHUNK_SIZE));
+	else
+		report_toast_corruption(hctx, ta,
+								psprintf("toast value %u missing chunk %d with expected size %d",
+										 ta->toast_pointer.va_valueid,
+										 last_missing, expected_size));
+}
+
+/*
+ * Check the current toast tuple, recording any corruption found in
+ * ctx->tupstore.
  *
  * This is not equivalent to running verify_heapam on the toast table itself,
  * and is not hardened against corruption of the toast table.  Rather, when
@@ -1159,38 +1332,73 @@ check_tuple_visibility(HeapCheckContext *ctx)
  * each toast tuple being checked against where we are in the sequence, as well
  * as each toast tuple having its varlena structure sanity checked.
  *
- * Returns whether the toast tuple passed the corruption checks.
+ * Returns the size of the current toast tuple chunk, or zero if the
+ * chunk is not sufficiently sensible for the chunk size to be determined.
  */
-static void
-check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
-				  ToastedAttribute *ta, int32 chunkno, int32 endchunk)
+static int32
+check_toast_tuple(HeapTuple toasttup, ToastedAttribute *ta,
+				  ToastCheckContext *tctx, HeapCheckContext *hctx,
+				  int32 chunkno)
 {
 	int32		curchunk;
 	Pointer		chunk;
 	bool		isnull;
 	int32		chunksize;
-	int32		expected_size;
+	int32		max_valid_prior_chunk;
 
 	/*
 	 * Have a chunk, extract the sequence number and the data
 	 */
 	curchunk = DatumGetInt32(fastgetattr(toasttup, 2,
-										 ctx->toast_rel->rd_att, &isnull));
+										 hctx->toast_rel->rd_att, &isnull));
 	if (isnull)
 	{
-		report_toast_corruption(ctx, ta,
+		report_toast_corruption(hctx, ta,
 								psprintf("toast value %u has toast chunk with null sequence number",
 										 ta->toast_pointer.va_valueid));
-		return;
+		return 0;
 	}
+
+	/*
+	 * Maximum chunk sequence number in the expected range which is less than
+	 * curchunk.  Note that curchunk itself may be outside the valid range.
+	 */
+	max_valid_prior_chunk = Min(curchunk-1, tctx->final_expected_chunk);
+
+	/* Report any missing chunks at the beginning of the expected sequence */
+	if (chunkno == 0 && max_valid_prior_chunk >= 0)
+		report_missing_chunks(hctx, ta, tctx, 0, max_valid_prior_chunk);
+
+	/* Complain if the chunk sequence number retreats */
+	if (chunkno > 0 && curchunk < tctx->last_chunk_seen)
+		report_toast_corruption(hctx, ta,
+								psprintf("toast value %u index scan returned chunk number %d after chunk number %d",
+										 ta->toast_pointer.va_valueid,
+										 curchunk, tctx->last_chunk_seen));
+
+	/* Complain if the same chunk sequence number is returned multiple times */
+	else if (chunkno > 0 && curchunk == tctx->last_chunk_seen)
+		report_toast_corruption(hctx, ta,
+								psprintf("toast value %u index scan returned chunk number %d more than once",
+										 ta->toast_pointer.va_valueid,
+										 curchunk));
+
+	/* Report any missing chunks in the middle of the expected sequence */
+	else if (chunkno > 0 && max_valid_prior_chunk > tctx->last_chunk_seen)
+		report_missing_chunks(hctx, ta, tctx, tctx->last_chunk_seen + 1,
+							  max_valid_prior_chunk);
+
+	/* Remember this chunk sequence number as the last one seen */
+	tctx->last_chunk_seen = curchunk;
+
 	chunk = DatumGetPointer(fastgetattr(toasttup, 3,
-										ctx->toast_rel->rd_att, &isnull));
+										hctx->toast_rel->rd_att, &isnull));
 	if (isnull)
 	{
-		report_toast_corruption(ctx, ta,
+		report_toast_corruption(hctx, ta,
 								psprintf("toast value %u chunk %d has null data",
 										 ta->toast_pointer.va_valueid, chunkno));
-		return;
+		return 0;
 	}
 	if (!VARATT_IS_EXTENDED(chunk))
 		chunksize = VARSIZE(chunk) - VARHDRSZ;
@@ -1206,41 +1414,34 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 		/* should never happen */
 		uint32		header = ((varattrib_4b *) chunk)->va_4byte.va_header;
 
-		report_toast_corruption(ctx, ta,
+		report_toast_corruption(hctx, ta,
 								psprintf("toast value %u chunk %d has invalid varlena header %0x",
 										 ta->toast_pointer.va_valueid,
 										 chunkno, header));
-		return;
+		return 0;
 	}
 
-	/*
-	 * Some checks on the data we've found
-	 */
-	if (curchunk != chunkno)
-	{
-		report_toast_corruption(ctx, ta,
-								psprintf("toast value %u chunk %d has sequence number %d, but expected sequence number %d",
-										 ta->toast_pointer.va_valueid,
-										 chunkno, curchunk, chunkno));
-		return;
-	}
-	if (chunkno > endchunk)
-	{
-		report_toast_corruption(ctx, ta,
-								psprintf("toast value %u chunk %d follows last expected chunk %d",
-										 ta->toast_pointer.va_valueid,
-										 chunkno, endchunk));
-		return;
-	}
+	/* Report an extraneous chunk outside the expected sequence */
+	if (curchunk < 0 || curchunk > tctx->final_expected_chunk)
+		handle_extraneous_chunk(hctx, ta, tctx, curchunk, chunksize);
 
-	expected_size = curchunk < endchunk ? TOAST_MAX_CHUNK_SIZE
-		: VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer) - (endchunk * TOAST_MAX_CHUNK_SIZE);
+	/* Report a partial chunk before the final expected chunk */
+	else if (curchunk < tctx->final_expected_chunk && chunksize != TOAST_MAX_CHUNK_SIZE)
+		report_toast_corruption(hctx, ta,
+								psprintf("toast value %u chunk %d has size %d, but expected chunk with size %d",
+										 ta->toast_pointer.va_valueid,
+										 curchunk, chunksize,
+										 (int)TOAST_MAX_CHUNK_SIZE));
 
-	if (chunksize != expected_size)
-		report_toast_corruption(ctx, ta,
-								psprintf("toast value %u chunk %d has size %u, but expected size %u",
+	/* Report a final chunk of the wrong size */
+	else if (curchunk == tctx->final_expected_chunk && chunksize != tctx->final_expected_size)
+		report_toast_corruption(hctx, ta,
+								psprintf("toast value %u chunk %d has size %d, but expected chunk with size %d",
 										 ta->toast_pointer.va_valueid,
-										 chunkno, chunksize, expected_size));
+										 curchunk, chunksize,
+										 tctx->final_expected_size));
+
+	return chunksize;
 }
 
 /*
@@ -1379,14 +1580,55 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	 */
 	VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
 
+	/* Oversized toasted attributes should never be stored */
+	if (toast_pointer.va_rawsize > VARLENA_SIZE_LIMIT)
+		report_corruption(ctx,
+						  psprintf("toast value %u rawsize %u exceeds limit %u",
+								   toast_pointer.va_valueid,
+								   toast_pointer.va_rawsize,
+								   VARLENA_SIZE_LIMIT));
+
+	/* Compression should never expand the attribute */
+	if (VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer) > toast_pointer.va_rawsize - VARHDRSZ)
+		report_corruption(ctx,
+						  psprintf("toast value %u external size %u exceeds maximum expected for rawsize %u",
+								   toast_pointer.va_valueid,
+								   VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer),
+								   toast_pointer.va_rawsize));
+
+	/* Compressed attributes should have a valid compression method */
+	if (VARATT_IS_COMPRESSED(&toast_pointer))
+	{
+		ToastCompressionId cmid;
+		bool		invalid = true;
+
+		cmid = TOAST_COMPRESS_METHOD(&toast_pointer);
+		switch (cmid)
+		{
+			/* List of all valid compression method IDs */
+			case TOAST_PGLZ_COMPRESSION_ID:
+			case TOAST_LZ4_COMPRESSION_ID:
+				invalid = false;
+				break;
+
+			/* Recognized but invalid compression method ID */
+			case TOAST_INVALID_COMPRESSION_ID:
+				break;
+
+			/* Intentionally no default here */
+		}
+
+		if (invalid)
+			report_corruption(ctx,
+							  psprintf("toast value %u has invalid compression method id %d",
+									   toast_pointer.va_valueid, cmid));
+	}
+
 	/* The tuple header better claim to contain toasted values */
 	if (!(infomask & HEAP_HASEXTERNAL))
-	{
 		report_corruption(ctx,
 						  psprintf("toast value %u is external but tuple header flag HEAP_HASEXTERNAL not set",
 								   toast_pointer.va_valueid));
-		return true;
-	}
 
 	/* The relation better have a toast table */
 	if (!ctx->rel->rd_rel->reltoastrelid)
@@ -1397,6 +1639,14 @@ check_tuple_attribute(HeapCheckContext *ctx)
 		return true;
 	}
 
+	/* The toast pointer had better point at the relation's toast table */
+	if (toast_pointer.va_toastrelid != ctx->rel->rd_rel->reltoastrelid)
+		report_corruption(ctx,
+						  psprintf("toast value %u toast relation oid %u differs from expected oid %u",
+								   toast_pointer.va_valueid,
+								   toast_pointer.va_toastrelid,
+								   ctx->rel->rd_rel->reltoastrelid));
+
 	/* If we were told to skip toast checking, then we're done. */
 	if (ctx->toast_rel == NULL)
 		return true;
@@ -1437,9 +1687,20 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 	bool		found_toasttup;
 	HeapTuple	toasttup;
 	int32		chunkno;
-	int32		endchunk;
+	int64		totalsize;		/* corrupt toast could overflow 32 bits */
+	int32		extsize;
+	ToastCheckContext tctx;
+
+	/* Calculate expected number of chunks and size of final chunk */
+	extsize = VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer);
+	tctx.final_expected_chunk = (extsize - 1) / TOAST_MAX_CHUNK_SIZE;
+	tctx.final_expected_size = extsize - tctx.final_expected_chunk * TOAST_MAX_CHUNK_SIZE;
 
-	endchunk = (VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer) - 1) / TOAST_MAX_CHUNK_SIZE;
+	/* Have not yet seen any chunks for this toast tuple */
+	tctx.have_extraneous_chunks = false;
+	tctx.first_extraneous = -1;
+	tctx.last_extraneous = -1;
+	tctx.last_chunk_seen = -1;
 
 	/*
 	 * Setup a scan key to find chunks in toast table with matching va_valueid
@@ -1459,13 +1720,14 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 										   &SnapshotToast, 1,
 										   &toastkey);
 	chunkno = 0;
+	totalsize = 0;
 	found_toasttup = false;
 	while ((toasttup =
 			systable_getnext_ordered(toastscan,
 									 ForwardScanDirection)) != NULL)
 	{
 		found_toasttup = true;
-		check_toast_tuple(toasttup, ctx, ta, chunkno, endchunk);
+		totalsize += check_toast_tuple(toasttup, ta, &tctx, ctx, chunkno);
 		chunkno++;
 	}
 	systable_endscan_ordered(toastscan);
@@ -1474,11 +1736,15 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u not found in toast table",
 										 ta->toast_pointer.va_valueid));
-	else if (chunkno != (endchunk + 1))
+	else if (chunkno != tctx.final_expected_chunk + 1 || extsize != totalsize)
 		report_toast_corruption(ctx, ta,
-								psprintf("toast value %u was expected to end at chunk %d, but ended at chunk %d",
+								psprintf("toast value %u was expected to end at chunk %u with total size %d, but ended at chunk %u with total size " INT64_FORMAT,
 										 ta->toast_pointer.va_valueid,
-										 (endchunk + 1), chunkno));
+										 (tctx.final_expected_chunk + 1),
+										 extsize, chunkno, totalsize));
+
+	/* Report any remaining cached extraneous chunks */
+	report_extraneous_chunks(ctx, ta, &tctx, NULL, NULL);
 }
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c7aff677d4..996ef03180 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2558,6 +2558,7 @@ TimestampTz
 TmFromChar
 TmToChar
 ToastAttrInfo
+ToastCheckContext
 ToastTupleContext
 ToastedAttribute
 TocEntry
-- 
2.21.1 (Apple Git-122.3)

#136Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#135)
Re: pg_amcheck contrib application

On Thu, Apr 15, 2021 at 1:07 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

I have added the verb "has" rather than "contains" because "has" is more consistent with the phrasing of other similar corruption reports.

That makes sense.

I think it's odd that a range of extraneous chunks is collapsed into a
single report if the size of each chunk happens to be
TOAST_MAX_CHUNK_SIZE and not otherwise. Why not just remember the
first and last extraneous chunk and the size of each? If the next
chunk you see is the next one in sequence and the same size as all the
others, extend your notion of the sequence end by 1. Otherwise, report
the range accumulated so far. It seems to me that this wouldn't be any
more code than you have now, and might actually be less.
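A minimal sketch of the run-tracking approach suggested here; the names (ExtraneousRun, note_extraneous, flush_run) are hypothetical stand-ins for the patch's ToastCheckContext machinery, and the printf stands in for report_toast_corruption():

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Track the current run of extraneous chunks: first, last, common size. */
typedef struct ExtraneousRun
{
    bool     active;
    int32_t  first;
    int32_t  last;
    uint32_t size;
    int      reports;   /* number of ranges emitted so far */
} ExtraneousRun;

static void
flush_run(ExtraneousRun *run)
{
    if (!run->active)
        return;
    /* A real implementation would call report_toast_corruption() here. */
    printf("unexpected chunks %d through %d each with size %u\n",
           run->first, run->last, run->size);
    run->reports++;
    run->active = false;
}

static void
note_extraneous(ExtraneousRun *run, int32_t chunk_seq, uint32_t chunksize)
{
    if (run->active && chunk_seq == run->last + 1 && chunksize == run->size)
    {
        run->last = chunk_seq;  /* extend the current run */
        return;
    }
    flush_run(run);             /* discontiguous or different size: report */
    run->active = true;
    run->first = run->last = chunk_seq;
    run->size = chunksize;
}
```

With this shape, a run ends on any gap or size change, regardless of whether the common size happens to be TOAST_MAX_CHUNK_SIZE.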

I think that report_missing_chunks() could probably just report the
range of missing chunks and not bother reporting how big they were
expected to be. But, if it is going to report how big they were
expected to be, I think it should have only 2 cases rather than 4:
either a range of missing chunks of equal size, or a single missing
chunk of some size. If, as I propose, it doesn't report the expected
size, then you still have just 2 cases: a range of missing chunks, or
a single missing chunk.

Somehow I have a hard time feeling confident that check_toast_tuple()
is going to do the right thing. The logic is really complex and hard
to understand. 'chunkno' is a counter that advances every time we move
to the next chunk, and 'curchunk' is the value we actually find in the
TOAST tuple. This terminology is not easy to understand. Most messages
now report 'curchunk', but some still report 'chunkno'. Why does
'chunkno' need to exist at all? AFAICS the combination of 'curchunk'
and 'tctx->last_chunk_seen' ought to be sufficient. I can see no
particular reason why what you're calling 'chunkno' needs to exist
even as a local variable, let alone be printed out. Either we haven't
yet validated that the chunk_id extracted from the tuple is non-null
and greater than the last chunk number we saw, in which case we can
just complain about it if we find it to be otherwise, or we have
already done that validation, in which case we should complain about
that value and not 'chunkno' in any subsequent messages.

The conditionals between where you set max_valid_prior_chunk and where
you set last_chunk_seen seem hard to understand, particularly the
bifurcated way that missing chunks are reported. Initial missing
chunks are detected by (chunkno == 0 && max_valid_prior_chunk >= 0)
and later missing chunks are detected by (chunkno > 0 &&
max_valid_prior_chunk > tctx->last_chunk_seen). I'm not sure if this
is correct; I find it hard to get my head around what
max_valid_prior_chunk is supposed to represent. But in any case I
think it can be written more simply. Just keep track of what chunk_id
we expect to extract from the next TOAST tuple. Initially it's 0.
Then:

if (chunkno < tctx->expected_chunkno)
{
    // toast value %u index scan returned chunk number %d when chunk %d was expected
    // don't modify tctx->expected_chunkno here, just hope the next thing
    // matches our previous expectation
}
else
{
    if (chunkno > tctx->expected_chunkno)
        // chunks are missing from tctx->expected_chunkno through
        // Min(chunkno - 1, tctx->final_expected_chunk), provided that the
        // latter value is greater than or equal to the former
    tctx->expected_chunkno = chunkno + 1;
}

If you do this, you only need to report extraneous chunks when chunkno >
tctx->final_expected_chunk, since chunkno < 0 is guaranteed to trigger the
first of the two complaints shown above.
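As a runnable illustration of the control flow proposed above, assuming hypothetical names (SketchContext and observe_chunk stand in for the patch's ToastCheckContext and check_toast_tuple(); printf stands in for report_toast_corruption()):

```c
#include <stdint.h>
#include <stdio.h>

typedef struct
{
    int32_t expected_chunkno;     /* next chunk_seq we expect to see */
    int32_t final_expected_chunk; /* last valid chunk_seq */
    int     complaints;
} SketchContext;

static int32_t
min32(int32_t a, int32_t b)
{
    return a < b ? a : b;
}

static void
observe_chunk(SketchContext *tctx, int32_t chunkno)
{
    if (chunkno < tctx->expected_chunkno)
    {
        /* index scan went backwards, or repeated a chunk */
        printf("chunk %d returned when chunk %d was expected\n",
               chunkno, tctx->expected_chunkno);
        tctx->complaints++;
        /* leave expected_chunkno alone; hope the next chunk matches */
    }
    else
    {
        if (chunkno > tctx->expected_chunkno)
        {
            int32_t last_missing = min32(chunkno - 1,
                                         tctx->final_expected_chunk);

            if (last_missing >= tctx->expected_chunkno)
            {
                printf("chunks %d through %d missing\n",
                       tctx->expected_chunkno, last_missing);
                tctx->complaints++;
            }
        }
        tctx->expected_chunkno = chunkno + 1;
    }
}
```

Note how negative chunk numbers always fall into the first branch, so only chunks past final_expected_chunk need separate extraneous-chunk reporting.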

In check_tuple_attribute I suggest "bool valid = false" rather than
"bool invalid = true". I think it's easier to understand.

I object to check_toasted_attribute() using 'chunkno' in a message for
the same reasons as above in regards to check_toast_tuple() i.e. I
think it's a concept which should not exist.

I think this patch could possibly be split up into multiple patches.
There's some question in my mind whether it's getting too late to
commit any of this, since some of it looks suspiciously like new
features after feature freeze. However, I kind of hate to ship this
release without at least doing something about the chunkno vs.
curchunk stuff, which is even worse in the committed code than in your
patch, and which I think will confuse the heck out of users if those
messages actually fire for anyone.

--
Robert Haas
EDB: http://www.enterprisedb.com

#137Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#136)
Re: pg_amcheck contrib application

On Apr 19, 2021, at 12:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Apr 15, 2021 at 1:07 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

I have added the verb "has" rather than "contains" because "has" is more consistent with the phrasing of other similar corruption reports.

That makes sense.

I think it's odd that a range of extraneous chunks is collapsed into a
single report if the size of each chunk happens to be
TOAST_MAX_CHUNK_SIZE and not otherwise. Why not just remember the
first and last extraneous chunk and the size of each? If the next
chunk you see is the next one in sequence and the same size as all the
others, extend your notion of the sequence end by 1. Otherwise, report
the range accumulated so far. It seems to me that this wouldn't be any
more code than you have now, and might actually be less.

In all cases of uncorrupted toasted attributes, the sequence of N chunks that make up the attribute should be N-1 chunks of TOAST_MAX_CHUNK_SIZE ending with a single chunk of up to TOAST_MAX_CHUNK_SIZE. I'd like to refer to such sequences as "reasonably sized" sequences to make conversation easier.

If the toast pointer's va_extsize field leads us to believe that we should find 10 reasonably sized chunks, but instead we find 30 reasonably sized chunks, we know something is corrupt. We shouldn't automatically prejudice the user against the additional 20 chunks. We didn't expect them, but maybe that's because va_extsize was corrupt and gave us a false expectation. We're not pointing fingers one way or the other.

On the other hand, if we expect 10 chunks and find an additional 20 unreasonably sized chunks, we can and should point fingers at the extra 20 chunks. Even if we somehow knew that va_extsize was also corrupt, we'd still be justified in saying the 20 unreasonably sized chunks are each, individually corrupt.

I tried to write the code to report one corruption message per corruption found. There are some edge cases where this is a definitional challenge, so it's not easy to say that I've always achieved this goal, but I think I've done so where the definitions are clear. As such, the only time I'd want to combine toast chunks into a single corruption message is when they are not in themselves necessarily *individually* corrupt. That is why I wrote the code to use TOAST_MAX_CHUNK_SIZE rather than just storing up any series of equally sized chunks.

On a related note, when complaining about a sequence of toast chunks, often the sequence is something like [maximal, maximal, ..., maximal, partial], but sometimes it's just [maximal...maximal], sometimes just [maximal], and sometimes just [partial]. If I'm complaining about that entire sequence, I'd really like to do so in just one message, otherwise it looks like separate complaints.

I can certainly change the code to be how you are asking, but I'd first like to know that you really understood what I was doing here and why the reports read the way they do.
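For reference, the "reasonably sized" layout described above can be computed the same way the patch does; SKETCH_MAX_CHUNK_SIZE is an illustrative stand-in, since the real TOAST_MAX_CHUNK_SIZE depends on the build:

```c
#include <stdint.h>

/* Illustrative value only; the real TOAST_MAX_CHUNK_SIZE is derived from
 * the block size at build time. */
#define SKETCH_MAX_CHUNK_SIZE 1996

/* Sequence number of the last chunk expected for an external size. */
static int32_t
final_expected_chunk(uint32_t extsize)
{
    return (int32_t) ((extsize - 1) / SKETCH_MAX_CHUNK_SIZE);
}

/* Size of that final chunk: the remainder after N-1 maximal chunks. */
static uint32_t
final_expected_size(uint32_t extsize)
{
    return extsize -
        (uint32_t) final_expected_chunk(extsize) * SKETCH_MAX_CHUNK_SIZE;
}
```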

I think that report_missing_chunks() could probably just report the
range of missing chunks and not bother reporting how big they were
expected to be. But, if it is going to report how big they were
expected to be, I think it should have only 2 cases rather than 4:
either a range of missing chunks of equal size, or a single missing
chunk of some size. If, as I propose, it doesn't report the expected
size, then you still have just 2 cases: a range of missing chunks, or
a single missing chunk.

Right, this is the same as above. I'm trying not to split a single corruption complaint into separate reports.

Somehow I have a hard time feeling confident that check_toast_tuple()
is going to do the right thing. The logic is really complex and hard
to understand. 'chunkno' is a counter that advances every time we move
to the next chunk, and 'curchunk' is the value we actually find in the
TOAST tuple. This terminology is not easy to understand. Most messages
now report 'curchunk', but some still report 'chunkno'. Why does
'chunkno' need to exist at all? AFAICS the combination of 'curchunk'
and 'tctx->last_chunk_seen' ought to be sufficient. I can see no
particular reason why what you're calling 'chunkno' needs to exist
even as a local variable, let alone be printed out. Either we haven't
yet validated that the chunk_id extracted from the tuple is non-null
and greater than the last chunk number we saw, in which case we can
just complain about it if we find it to be otherwise, or we have
already done that validation, in which case we should complain about
that value and not 'chunkno' in any subsequent messages.

If we use tctx->last_chunk_seen as you propose, I imagine we'd set that to -1 prior to the first call to check_toast_tuple(). In the first call, we'd extract the toast chunk_seq and store it in curchunk and verify that it's one greater than tctx->last_chunk_seen. That all seems fine.

But under corrupt conditions, curchunk = DatumGetInt32(fastgetattr(toasttup, 2, hctx->toast_rel->rd_att, &isnull)) could return -1. That's invalid, of course, but now we don't know what to do. We're supposed to complain when we get the same chunk_seq from the index scan more than once in a row, but we don't know if the value in last_chunk_seen is a real value or just the dummy initial value. Worse still, when we get the next toast tuple back and it has a chunk_seq of -2, we want to complain that the index is returning tuples in reverse order, but we can't, because we still don't know if the -1 in last_chunk_seen is legitimate or a dummy value because that state information isn't carried over from the previous call.

Using chunkno solves this problem. If chunkno == 0, it means this is our first call, and tctx->last_chunk_seen is uninitialized. Otherwise, this is not the first call, and tctx->last_chunk_seen really is the chunk_seq seen in the prior call. There is no ambiguity.

I could probably change "int chunkno" to "bool is_first_call" or similar. I had previously used chunkno in the corruption report about chunks whose chunk_seq is null. The idea was that if you have 100 chunks and the 30th chunk is corruptly nulled out, you could say something like "toast value 178337 has toast chunk 30 with null sequence number", but you had me change that to "toast value 178337 has toast chunk with null sequence number", so generation of that message no longer needs the chunkno. I had kept chunkno around for the other purpose of knowing whether tctx->last_chunk_seen has been initialized yet, but a bool for that would now be sufficient. In any event, though you disagree with me about this below, I think the caller of this code still needs to track chunkno.

The conditionals between where you set max_valid_prior_chunk and where
you set last_chunk_seen seem hard to understand, particularly the
bifurcated way that missing chunks are reported. Initial missing
chunks are detected by (chunkno == 0 && max_valid_prior_chunk >= 0)
and later missing chunks are detected by (chunkno > 0 &&
max_valid_prior_chunk > tctx->last_chunk_seen). I'm not sure if this
is correct;

When we read a chunk_seq from a toast tuple, we need to determine if it indicates a gap in the chunk sequence, but we need to be careful.

The (chunkno == 0) and (chunkno > 0) stuff is just distinguishing between the first call and all subsequent calls.

For illustrative purposes, imagine that we expect chunks [0..4].

On the first call, we expect chunk_seq = 0, but that's not what we actually complain about if we get chunk_seq = 15. We complain about all missing expected chunks, namely [0..4], not [0..14]. We also don't complain yet about seeing extraneous chunk 15, because it might be the first in a series of contiguous extraneous chunks, and we want to wait and report those all at once when the sequence finishes. Simply complaining at this point that we didn't expect to see chunk_seq 15 is the kind of behavior that we already have committed and are trying to fix because the corruption reports are not on point.

On subsequent calls, we expect chunk_seq = last_chunk_seen+1, but that's also not what we actually complain about if we get some other value for chunk_seq. What we complain about are the missing and extraneous sequences, not the individual chunk that had an unexpected value.

I find it hard to get my head around what
max_valid_prior_chunk is supposed to represent. But in any case I
think it can be written more simply. Just keep track of what chunk_id
we expect to extract from the next TOAST tuple. Initially it's 0.
Then:

if (chunkno < tctx->expected_chunkno)
{
    // toast value %u index scan returned chunk number %d when chunk %d was expected
    // don't modify tctx->expected_chunkno here, just hope the next thing
    // matches our previous expectation
}
else
{
    if (chunkno > tctx->expected_chunkno)
        // chunks are missing from tctx->expected_chunkno through
        // Min(chunkno - 1, tctx->final_expected_chunk), provided that the
        // latter value is greater than or equal to the former
    tctx->expected_chunkno = chunkno + 1;
}

If you do this, you only need to report extraneous chunks when chunkno >
tctx->final_expected_chunk, since chunkno < 0 is guaranteed to trigger the
first of the two complaints shown above.

In the example above, if we're expecting chunks [0..4] and get chunk_seq = 5, the max_valid_prior_chunk is 4. If we instead get chunk_seq = 6, the max_valid_prior_chunk is still 4, because chunk 5 is out of bounds.
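Restated as a tiny sketch (the helper name is hypothetical; the patch computes this inline), max_valid_prior_chunk is the highest chunk number that could legitimately precede the chunk just read, clamped to the expected range:

```c
#include <stdint.h>

/* Highest chunk number that could validly precede chunk_seq, given that
 * valid chunks are [0..final_expected_chunk]. */
static int32_t
max_valid_prior_chunk(int32_t chunk_seq, int32_t final_expected_chunk)
{
    int32_t prior = chunk_seq - 1;

    if (prior > final_expected_chunk)
        prior = final_expected_chunk;
    return prior;
}
```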

In check_tuple_attribute I suggest "bool valid = false" rather than
"bool invalid = true". I think it's easier to understand.

Yeah, I had it that way and changed it, because I don't much like having the only use of a boolean be a negation.

bool foo = false; ... if (!foo) { ... }

seems worse to me than

bool foo = true; ... if (foo) { ... }

But you're looking at it more from the perspective of english grammar, where "invalid = false" reads as a double-negative. That's fine. I can change it back.

I object to check_toasted_attribute() using 'chunkno' in a message for
the same reasons as above in regards to check_toast_tuple() i.e. I
think it's a concept which should not exist.

So if we expect 100 chunks, get chunks [0..19, 80..99], you'd have me write the message as "expected 100 chunks but sequence ended at chunk 99"? I think that's odd. It makes infinitely more sense to me to say "expected 100 chunks but sequence ended at chunk 40". Actually, this is an argument against changing "int chunkno" to "bool is_first_call", as I alluded to above, because we have to keep the chunkno around anyway.

I think this patch could possibly be split up into multiple patches.
There's some question in my mind whether it's getting too late to
commit any of this, since some of it looks suspiciously like new
features after feature freeze. However, I kind of hate to ship this
release without at least doing something about the chunkno vs.
curchunk stuff, which is even worse in the committed code than in your
patch, and which I think will confuse the heck out of users if those
messages actually fire for anyone.

I'm in favor of cleaning up the committed code to have easier to understand output. I don't really agree with any of your proposed changes to my patch, though, which is I think a first.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#138Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Mark Dilger (#137)
1 attachment(s)
Re: pg_amcheck contrib application

On Apr 19, 2021, at 5:07 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

On Apr 19, 2021, at 12:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Apr 15, 2021 at 1:07 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

I have added the verb "has" rather than "contains" because "has" is more consistent with the phrasing of other similar corruption reports.

That makes sense.

I have refactored the patch to address your other concerns. Breaking the patch into multiple pieces didn't add any clarity, but refactoring portions of it made things simpler to read, I think, so here it is as one patch file.

Attachments:

v22-0001-amcheck-adding-toast-pointer-corruption-checks.patch (application/octet-stream)
From 8ddfaea4530d0b2c73b2f826558716330167299b Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Thu, 22 Apr 2021 12:06:18 -0700
Subject: [PATCH v22] amcheck: adding toast pointer corruption checks

Adding additional checks of toast pointers: checking the extsize
against the rawsize, the uncompressed size against the size limit
for varlena datums, the va_toastrelid field against the heap table's
reltoastrelid, and if compressed, the validity of the compression
method ID.

Adding checks that the toasted attribute chunks are returned by the
toast index scan in order and without duplicates.  Checking that the
chunks do not contain null entries and that the chunks belong to the
right toasted attribute.  Improving the reports of missing or extra
chunks to be more clear to the user.

Changing the logic to continue checking toast even after reporting
that HEAP_HASEXTERNAL is false.  Previously, the toast checking
stopped here, but that wasn't necessary, and subsequent checks may
provide additional useful diagnostic information.
---
 contrib/amcheck/verify_heapam.c  | 647 ++++++++++++++++++++++++++-----
 src/tools/pgindent/typedefs.list |   1 +
 2 files changed, 558 insertions(+), 90 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 9f159eb3db..00cd43353b 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -30,6 +30,9 @@ PG_FUNCTION_INFO_V1(verify_heapam);
 /* The number of columns in tuples returned by verify_heapam */
 #define HEAPCHECK_RELATION_COLS 4
 
+/* The largest valid toast va_rawsize */
+#define VARLENA_SIZE_LIMIT 0x3FFFFFFF
+
 /*
  * Despite the name, we use this for reporting problems with both XIDs and
  * MXIDs.
@@ -146,12 +149,64 @@ typedef struct HeapCheckContext
 	Tuplestorestate *tupstore;
 } HeapCheckContext;
 
+/*
+ * Struct holding the running context information during the check of a single
+ * toasted attribute.
+ */
+typedef struct ToastCheckContext
+{
+	/*
+	 * Cache tracking a sequence of contiguous toast chunks, each of size
+	 * 'extraneous_size', and having sequence numbers outside the expected
+	 * range.  The sequence numbers of such chunks are cached until the
+	 * sequence ends, or a chunk of a different size is encountered, at which
+	 * point a single toast corruption report can be emitted for the group.
+	 *
+	 * Note that if another type of corruption occurs mid sequence, the
+	 * extraneous sequence of chunks thus far encountered will be reported
+	 * first, then the other corruption.  This means that a single, contiguous
+	 * sequence of extraneous chunks will not always be reported as such, but
+	 * instead be reported as multiple subsequences interrupted by other
+	 * corruption reports.
+	 *
+	 * Note that we do not need a cache tracking missing chunks, because we
+	 * immediately know that contiguous chunks are missing when we see the
+	 * first chunk whose sequence number implies they were skipped.  The
+	 * reporting of sequences of extraneous chunks and that of sequences of
+	 * missing chunks is nearly identical, but the manner in which we calculate them differs.
+	 */
+	bool		have_extraneous_chunks;
+	bool		have_extraneous_size;
+	int32		first_extraneous;
+	int32		last_extraneous;
+	uint32		extraneous_size;
+
+	/*
+	 * How many chunks have been seen so far, including expected, extraneous,
+	 * and corrupt chunks.
+	 */
+	int32		total_chunks;
+
+	/*
+	 * Whether we have seen a chunk with a non-NULL chunk_seq value, and that
+	 * value if so.  Neither value gets updated for chunks with NULL chunk_seq.
+	 */
+	bool		chunk_seq_seen;
+	int32		last_chunk_seq;
+
+	/*
+	 * Expected sequence number and size of the final chunk expected for this
+	 * toasted attribute.
+	 */
+	int32		final_expected_chunk;
+	uint32		final_expected_size;
+} ToastCheckContext;
+
 /* Internal implementation */
 static void sanity_check_relation(Relation rel);
 static void check_tuple(HeapCheckContext *ctx);
-static void check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
-							  ToastedAttribute *ta, int32 chunkno,
-							  int32 endchunk);
+static int32 check_toast_tuple(HeapTuple toasttup, ToastedAttribute *ta,
+							   ToastCheckContext *tctx, HeapCheckContext *hctx);
 
 static bool check_tuple_attribute(HeapCheckContext *ctx);
 static void check_toasted_attribute(HeapCheckContext *ctx,
@@ -160,9 +215,14 @@ static void check_toasted_attribute(HeapCheckContext *ctx,
 static bool check_tuple_header(HeapCheckContext *ctx);
 static bool check_tuple_visibility(HeapCheckContext *ctx);
 
+static void report_extraneous_chunks(HeapCheckContext *hctx,
+									 ToastedAttribute *ta,
+									 ToastCheckContext *tctx);
 static void report_corruption(HeapCheckContext *ctx, char *msg);
 static void report_toast_corruption(HeapCheckContext *ctx,
-									ToastedAttribute *ta, char *msg);
+									ToastedAttribute *ta,
+									ToastCheckContext *tctx,
+									char *msg);
 static TupleDesc verify_heapam_tupdesc(void);
 static FullTransactionId FullTransactionIdFromXidAndCtx(TransactionId xid,
 														const HeapCheckContext *ctx);
@@ -603,8 +663,16 @@ report_corruption(HeapCheckContext *ctx, char *msg)
  */
 static void
 report_toast_corruption(HeapCheckContext *ctx, ToastedAttribute *ta,
-						char *msg)
+						ToastCheckContext *tctx, char *msg)
 {
+	/*
+	 * If there are any cached extraneous chunks, report those before this
+	 * next message, otherwise the corruptions will appear out of order.
+	 */
+	if (tctx->have_extraneous_chunks)
+		report_extraneous_chunks(ctx, ta, tctx);
+
+	/* Ok, now report the message we were called to report. */
 	report_corruption_internal(ctx->tupstore, ctx->tupdesc, ta->blkno,
 							   ta->offnum, ta->attnum, msg);
 	ctx->is_corrupt = true;
@@ -1147,100 +1215,406 @@ check_tuple_visibility(HeapCheckContext *ctx)
 	return true;
 }
 
-
 /*
- * Check the current toast tuple against the state tracked in ctx, recording
- * any corruption found in ctx->tupstore.
+ * Issues toast corruption reports for any extraneous chunks accumulated in
+ * the tctx cache, which is then cleared.
+ * An "extraneous chunk" is one with a sequence number outside the expected
+ * range for the toasted attribute.
  *
- * This is not equivalent to running verify_heapam on the toast table itself,
- * and is not hardened against corruption of the toast table.  Rather, when
- * validating a toasted attribute in the main table, the sequence of toast
- * tuples that store the toasted value are retrieved and checked in order, with
- * each toast tuple being checked against where we are in the sequence, as well
- * as each toast tuple having its varlena structure sanity checked.
+ * Call this function after the last chunk of a toasted attribute has been
+ * checked, to flush the cache.  If the cache is already empty, the call is harmless.
  *
- * Returns whether the toast tuple passed the corruption checks.
+ * Chunks are accumulated into the cache by handle_extraneous_chunk(), which
+ * also flushes the cache whenever an ongoing sequence is broken.
  */
 static void
-check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
-				  ToastedAttribute *ta, int32 chunkno, int32 endchunk)
+report_extraneous_chunks(HeapCheckContext *hctx, ToastedAttribute *ta,
+						 ToastCheckContext *tctx)
 {
-	int32		curchunk;
-	Pointer		chunk;
-	bool		isnull;
-	int32		chunksize;
-	int32		expected_size;
+	if (!tctx->have_extraneous_chunks)
+		return;
 
 	/*
-	 * Have a chunk, extract the sequence number and the data
+	 * Clear the flag before calling report_toast_corruption to avoid
+	 * infinite recursion.
 	 */
-	curchunk = DatumGetInt32(fastgetattr(toasttup, 2,
-										 ctx->toast_rel->rd_att, &isnull));
-	if (isnull)
-	{
-		report_toast_corruption(ctx, ta,
-								psprintf("toast value %u has toast chunk with null sequence number",
-										 ta->toast_pointer.va_valueid));
-		return;
-	}
-	chunk = DatumGetPointer(fastgetattr(toasttup, 3,
-										ctx->toast_rel->rd_att, &isnull));
-	if (isnull)
-	{
-		report_toast_corruption(ctx, ta,
-								psprintf("toast value %u chunk %d has null data",
-										 ta->toast_pointer.va_valueid, chunkno));
-		return;
-	}
-	if (!VARATT_IS_EXTENDED(chunk))
-		chunksize = VARSIZE(chunk) - VARHDRSZ;
-	else if (VARATT_IS_SHORT(chunk))
+	tctx->have_extraneous_chunks = false;
+
+	if (tctx->first_extraneous < tctx->last_extraneous &&
+		tctx->have_extraneous_size)
+		report_toast_corruption(hctx, ta, tctx,
+								psprintf("toast value %u has unexpected chunks %d through %d each with size %u",
+									 ta->toast_pointer.va_valueid,
+									 tctx->first_extraneous,
+									 tctx->last_extraneous,
+									 tctx->extraneous_size));
+	else if (tctx->first_extraneous < tctx->last_extraneous)
+		report_toast_corruption(hctx, ta, tctx,
+								psprintf("toast value %u has unexpected chunks %d through %d each with corrupt chunk data",
+									 ta->toast_pointer.va_valueid,
+									 tctx->first_extraneous,
+									 tctx->last_extraneous));
+	else if (tctx->have_extraneous_size)
+		report_toast_corruption(hctx, ta, tctx,
+								psprintf("toast value %u has unexpected chunk %d with size %u",
+									 ta->toast_pointer.va_valueid,
+									 tctx->first_extraneous,
+									 tctx->extraneous_size));
+	else
+		report_toast_corruption(hctx, ta, tctx,
+								psprintf("toast value %u has unexpected chunk %d with corrupt chunk data",
+									 ta->toast_pointer.va_valueid,
+									 tctx->first_extraneous));
+}
+
+/*
+ * Records that a toast chunk should be reported as extraneous.  After
+ * finishing all calls to this function for a given toasted attribute, a call
+ * to report_extraneous_chunks() should be issued to flush the cache.
+ */
+static void
+handle_extraneous_chunk(HeapCheckContext *hctx, ToastedAttribute *ta,
+						ToastCheckContext *tctx, int32 chunk_seq,
+						uint32 chunksize, bool have_chunksize)
+{
+	if (tctx->have_extraneous_chunks)
 	{
+		if (tctx->last_extraneous == chunk_seq - 1 &&
+			tctx->have_extraneous_size == have_chunksize &&
+			(tctx->extraneous_size == chunksize || !have_chunksize))
+		{
+			/*
+			 * This is the next chunk in an ongoing sequence of equally sized
+			 * or corrupted chunks.  Extend it, but do not report it yet.
+			 */
+			tctx->last_extraneous = chunk_seq;
+			return;
+		}
+
 		/*
-		 * could happen due to heap_form_tuple doing its thing
+		 * There is an ongoing sequence, but this chunk is discontiguous with
+		 * it or of a different size or corruption status.  Report the sequence
+		 * and clear the cache so we can start over with this chunk.
 		 */
-		chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;
+		report_extraneous_chunks(hctx, ta, tctx);
 	}
-	else
-	{
-		/* should never happen */
-		uint32		header = ((varattrib_4b *) chunk)->va_4byte.va_header;
 
-		report_toast_corruption(ctx, ta,
-								psprintf("toast value %u chunk %d has invalid varlena header %0x",
-										 ta->toast_pointer.va_valueid,
-										 chunkno, header));
+	/* Start a new sequence, but do not report it yet. */
+	tctx->first_extraneous = chunk_seq;
+	tctx->last_extraneous = chunk_seq;
+	tctx->extraneous_size = chunksize;
+	tctx->have_extraneous_size = have_chunksize;
+	tctx->have_extraneous_chunks = true;
+	return;
+}
+
+/*
+ * Helper function for report_missing_chunks()
+ */
+static void
+report_missing_sequence(HeapCheckContext *hctx, ToastedAttribute *ta,
+						ToastCheckContext *tctx, int32 first_missing,
+						int32 last_missing)
+{
+	report_toast_corruption(hctx, ta, tctx,
+							psprintf("toast value %u missing chunks %d through %d with expected size %u",
+									 ta->toast_pointer.va_valueid,
+									 first_missing, last_missing,
+									 (unsigned) TOAST_MAX_CHUNK_SIZE));
+}
+
+/*
+ * Helper function for report_missing_chunks()
+ */
+static void
+report_missing_chunk(HeapCheckContext *hctx, ToastedAttribute *ta,
+					 ToastCheckContext *tctx, int32 missing_chunk,
+					 uint32 missing_size)
+{
+	report_toast_corruption(hctx, ta, tctx,
+							psprintf("toast value %u missing chunk %d with expected size %u",
+									 ta->toast_pointer.va_valueid,
+									 missing_chunk, missing_size));
+}
+
+/*
+ * Issues toast corruption reports for one or more missing toast chunks
+ * in the [first_missing..last_missing] range intersected with the
+ * [0..final_expected_chunk] range.
+ */
+static void
+report_missing_chunks(HeapCheckContext *hctx, ToastedAttribute *ta,
+					  ToastCheckContext *tctx, int32 first_missing,
+					  int32 last_missing)
+{
+	uint32	expected_size;
+
+	/*
+	 * Adjust the range of missing values to not extend beyond
+	 * [0..final_expected_chunk] on either end of the range.
+	 */
+	if (first_missing < 0)
+		first_missing = 0;
+	if (last_missing > tctx->final_expected_chunk)
+		last_missing = tctx->final_expected_chunk;
+
+	/* Check whether any missing chunks remain to complain about. */
+	if (first_missing > last_missing)
 		return;
-	}
+
+	if (last_missing < tctx->final_expected_chunk)
+		expected_size = TOAST_MAX_CHUNK_SIZE;
+	else
+		expected_size = tctx->final_expected_size;
 
 	/*
-	 * Some checks on the data we've found
+	 * Report missing chunks with language matching language used for reporting
+	 * extraneous chunks.  Mention the sizes expected for the missing chunks so
+	 * the user can reconcile that against any extraneous chunk reports.
 	 */
-	if (curchunk != chunkno)
+	if (last_missing > first_missing + 1 &&
+		expected_size < TOAST_MAX_CHUNK_SIZE)
 	{
-		report_toast_corruption(ctx, ta,
-								psprintf("toast value %u chunk %d has sequence number %d, but expected sequence number %d",
-										 ta->toast_pointer.va_valueid,
-										 chunkno, curchunk, chunkno));
-		return;
+		report_missing_sequence(hctx, ta, tctx, first_missing, last_missing - 1);
+		report_missing_chunk(hctx, ta, tctx, last_missing, expected_size);
 	}
-	if (chunkno > endchunk)
+	else if (last_missing == first_missing + 1 &&
+			 expected_size < TOAST_MAX_CHUNK_SIZE)
 	{
-		report_toast_corruption(ctx, ta,
-								psprintf("toast value %u chunk %d follows last expected chunk %d",
-										 ta->toast_pointer.va_valueid,
-										 chunkno, endchunk));
-		return;
+		report_missing_chunk(hctx, ta, tctx, first_missing, TOAST_MAX_CHUNK_SIZE);
+		report_missing_chunk(hctx, ta, tctx, last_missing, expected_size);
 	}
+	else if (last_missing > first_missing)
+		report_missing_sequence(hctx, ta, tctx, first_missing, last_missing);
+	else
+		report_missing_chunk(hctx, ta, tctx, last_missing, expected_size);
+}
 
-	expected_size = curchunk < endchunk ? TOAST_MAX_CHUNK_SIZE
-		: VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer) - (endchunk * TOAST_MAX_CHUNK_SIZE);
+/*
+ * Check the current toast tuple, recording any corruption found in
+ * ctx->tupstore.
+ *
+ * This is not equivalent to running verify_heapam on the toast table itself,
+ * and is not hardened against corruption of the toast table.  Rather, when
+ * validating a toasted attribute in the main table, the sequence of toast
+ * tuples that store the toasted value are retrieved and checked in order, with
+ * each toast tuple being checked against where we are in the sequence, as well
+ * as each toast tuple having its varlena structure sanity checked.
+ *
+ * Returns the size of the current toast tuple chunk, or zero if the chunk is
+ * not sufficiently sensible for the chunk size to be determined.
+ */
+static int32
+check_toast_tuple(HeapTuple toasttup, ToastedAttribute *ta,
+				  ToastCheckContext *tctx, HeapCheckContext *hctx)
+{
+	int32		chunk_id;
+	int32		chunk_seq;
+	Pointer		chunk_data;
+	bool		id_isnull;
+	bool		seq_isnull;
+	bool		data_isnull;
+	uint32		chunksize = 0;	/* stays 0 if the chunk data is unusable */
+	int32		va_valueid;
+	uint32		va_header = 0;
+	bool		header_invalid = false;
+	bool		id_mismatch = false;
+
+	/* Extract the valueid from our toast pointer. */
+	va_valueid = ta->toast_pointer.va_valueid;
+
+	/* Have a chunk, extract the chunk id, sequence number, and data. */
+	chunk_id = DatumGetObjectId(fastgetattr(toasttup, 1,
+											hctx->toast_rel->rd_att,
+											&id_isnull));
+	chunk_seq = DatumGetInt32(fastgetattr(toasttup, 2, hctx->toast_rel->rd_att,
+										  &seq_isnull));
+	chunk_data = DatumGetPointer(fastgetattr(toasttup, 3,
+											 hctx->toast_rel->rd_att,
+											 &data_isnull));
+
+	/* Sanity check the chunk data and get the size. */
+	if (!data_isnull)
+	{
+		if (!VARATT_IS_EXTENDED(chunk_data))
+			chunksize = VARSIZE(chunk_data) - VARHDRSZ;
+		else if (VARATT_IS_SHORT(chunk_data))
+			chunksize = VARSIZE_SHORT(chunk_data) - VARHDRSZ_SHORT;
+		else
+		{
+			header_invalid = true;
+			va_header = ((varattrib_4b *) chunk_data)->va_4byte.va_header;
+		}
+	}
 
-	if (chunksize != expected_size)
-		report_toast_corruption(ctx, ta,
-								psprintf("toast value %u chunk %d has size %u, but expected size %u",
-										 ta->toast_pointer.va_valueid,
-										 chunkno, chunksize, expected_size));
+	/* The chunk_id should match this attribute's va_valueid. */
+	if (!id_isnull && chunk_id != va_valueid)
+		id_mismatch = true;
+
+	/*
+	 * The toast table should never contain null values, and the toast index
+	 * scan should never return chunks for values other than the one we
+	 * requested.  The data's varlena header should also be valid.
+	 *
+	 * If these expectations are violated in multiple ways, we cannot reliably
+	 * identify the chunk we are complaining about across multiple messages, so
+	 * we have to report all the problems in a single combined message.  (There
+	 * are specific examples below that we could break apart, but it hardly
+	 * seems worth doing so.)  Reporting each problem separately would create
+	 * ambiguity between corruptions occurring across successive chunks and
+	 * those same corruptions all in the same chunk.
+	 */
+	if (id_isnull && seq_isnull && data_isnull)
+		report_toast_corruption(hctx, ta, tctx,
+								psprintf("toast index scan for value %u returned toast chunk with null value, sequence number and data",
+										 va_valueid));
+	else if (id_mismatch && seq_isnull && data_isnull)
+		report_toast_corruption(hctx, ta, tctx,
+								psprintf("toast index scan for value %u returned toast chunk for value %d, sequence number and data",
+										 va_valueid,
+										 chunk_id));
+	else if (id_isnull && seq_isnull && header_invalid)
+		report_toast_corruption(hctx, ta, tctx,
+								psprintf("toast index scan for value %u returned toast chunk with null value, null sequence number and invalid varlena header %0x",
+										 va_valueid,
+										 va_header));
+	else if (id_mismatch && seq_isnull && header_invalid)
+		report_toast_corruption(hctx, ta, tctx,
+								psprintf("toast index scan for value %u returned toast chunk for value %d, null sequence number and invalid varlena header %0x",
+										 va_valueid,
+										 chunk_id,
+										 va_header));
+	else if (id_isnull && seq_isnull)
+		report_toast_corruption(hctx, ta, tctx,
+								psprintf("toast index scan for value %u returned toast chunk with null value and sequence number",
+										 va_valueid));
+	else if (id_mismatch && seq_isnull)
+		report_toast_corruption(hctx, ta, tctx,
+								psprintf("toast index scan for value %u returned toast chunk for value %d with null sequence number",
+										 va_valueid,
+										 chunk_id));
+	else if (id_isnull && data_isnull)
+		report_toast_corruption(hctx, ta, tctx,
+								psprintf("toast index scan for value %u returned toast chunk %d with null value and data",
+										 va_valueid,
+										 chunk_seq));
+	else if (id_mismatch && data_isnull)
+		report_toast_corruption(hctx, ta, tctx,
+								psprintf("toast index scan for value %u returned toast chunk %d for value %d with null data",
+										 va_valueid,
+										 chunk_seq,
+										 chunk_id));
+	else if (id_isnull && header_invalid)
+		report_toast_corruption(hctx, ta, tctx,
+								psprintf("toast index scan for value %u returned toast chunk %d with null value and invalid varlena header %0x",
+										 va_valueid,
+										 chunk_seq,
+										 va_header));
+	else if (id_mismatch && header_invalid)
+		report_toast_corruption(hctx, ta, tctx,
+								psprintf("toast index scan for value %u returned toast chunk %d for value %d with invalid varlena header %0x",
+										 va_valueid,
+										 chunk_seq,
+										 chunk_id,
+										 va_header));
+	else if (id_isnull)
+		report_toast_corruption(hctx, ta, tctx,
+								psprintf("toast index scan for value %u returned toast chunk %d with null value",
+										 va_valueid,
+										 chunk_seq));
+	else if (id_mismatch)
+		report_toast_corruption(hctx, ta, tctx,
+								psprintf("toast index scan for value %u returned toast chunk %d for value %d",
+										 va_valueid,
+										 chunk_seq,
+										 chunk_id));
+	else if (seq_isnull && data_isnull)
+		report_toast_corruption(hctx, ta, tctx,
+								psprintf("toast index scan for value %u returned toast chunk with null sequence number and data",
+										 va_valueid));
+	else if (seq_isnull && header_invalid)
+		report_toast_corruption(hctx, ta, tctx,
+								psprintf("toast index scan for value %u returned toast chunk with null sequence number and invalid varlena header %0x",
+										 va_valueid,
+										 va_header));
+	else if (seq_isnull)
+		report_toast_corruption(hctx, ta, tctx,
+								psprintf("toast index scan for value %u returned toast chunk with null sequence number",
+										 va_valueid));
+	else if (data_isnull)
+		report_toast_corruption(hctx, ta, tctx,
+								psprintf("toast index scan for value %u returned toast chunk %d with null data",
+										 va_valueid,
+										 chunk_seq));
+	else if (header_invalid)
+		report_toast_corruption(hctx, ta, tctx,
+								psprintf("toast index scan for value %u returned toast chunk %d with invalid varlena header %0x",
+										 va_valueid,
+										 chunk_seq,
+										 va_header));
+
+	/*
+	 * Remaining checks concern where this chunk falls into the sequence
+	 * relative to other chunks for this attribute.  If this chunk does not
+	 * properly belong to the attribute or has a null chunk_seq value, we
+	 * cannot perform such checks, so we're done.
+	 */
+	if (id_isnull || id_mismatch || seq_isnull)
+		return 0;
+
+	/*
+	 * Assuming the chunk_seq values are being returned to us in the correct
+	 * order, complain if this chunk_seq indicates that any expected chunks
+	 * have been skipped.  Note that if the skipped chunks are later returned,
+	 * an additional report about the misordering will be issued.
+	 */
+	if (!tctx->chunk_seq_seen && chunk_seq > 0)
+		report_missing_chunks(hctx, ta, tctx, 0, chunk_seq - 1);
+	else if (tctx->chunk_seq_seen && chunk_seq > tctx->last_chunk_seq + 1)
+		report_missing_chunks(hctx, ta, tctx, tctx->last_chunk_seq + 1,
+							  chunk_seq - 1);
+
+	/* Complain if the chunk sequence number retreats. */
+	else if (tctx->chunk_seq_seen && chunk_seq < tctx->last_chunk_seq)
+		report_toast_corruption(hctx, ta, tctx,
+								psprintf("toast value %u index scan returned chunk %d after chunk %d",
+										 va_valueid,
+										 chunk_seq, tctx->last_chunk_seq));
+
+	/* Complain if the same chunk sequence number is returned multiple times. */
+	else if (tctx->chunk_seq_seen && chunk_seq == tctx->last_chunk_seq)
+		report_toast_corruption(hctx, ta, tctx,
+								psprintf("toast value %u index scan returned duplicate chunk %d",
+										 va_valueid,
+										 chunk_seq));
+
+	/* Report an extraneous chunk outside the expected sequence. */
+	if (chunk_seq < 0 || chunk_seq > tctx->final_expected_chunk)
+		handle_extraneous_chunk(hctx, ta, tctx, chunk_seq, chunksize,
+								!header_invalid);
+
+	/* Report a partial chunk before the final expected chunk. */
+	else if (chunk_seq < tctx->final_expected_chunk && chunksize != TOAST_MAX_CHUNK_SIZE)
+		report_toast_corruption(hctx, ta, tctx,
+								psprintf("toast value %u chunk %d has size %u, but expected chunk with size %u",
+										 va_valueid,
+										 chunk_seq, chunksize,
+										 (unsigned) TOAST_MAX_CHUNK_SIZE));
+
+	/* Report a final chunk of the wrong size. */
+	else if (chunk_seq == tctx->final_expected_chunk && chunksize != tctx->final_expected_size)
+		report_toast_corruption(hctx, ta, tctx,
+								psprintf("toast value %u chunk %d has size %u, but expected chunk with size %u",
+										 va_valueid,
+										 chunk_seq, chunksize,
+										 tctx->final_expected_size));
+
+	/* Remember that we have seen this chunk for next time. */
+	tctx->chunk_seq_seen = true;
+	tctx->last_chunk_seq = chunk_seq;
+	tctx->total_chunks++;
+
+	return chunksize;
 }
 
 /*
@@ -1379,14 +1753,55 @@ check_tuple_attribute(HeapCheckContext *ctx)
 	 */
 	VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
 
+	/* Oversized toasted attributes should never be stored */
+	if (toast_pointer.va_rawsize > VARLENA_SIZE_LIMIT)
+		report_corruption(ctx,
+						  psprintf("toast value %u rawsize %u exceeds limit %u",
+								   toast_pointer.va_valueid,
+								   toast_pointer.va_rawsize,
+								   VARLENA_SIZE_LIMIT));
+
+	/* Compression should never expand the attribute */
+	if (VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer) > toast_pointer.va_rawsize - VARHDRSZ)
+		report_corruption(ctx,
+						  psprintf("toast value %u external size %u exceeds maximum expected for rawsize %u",
+								   toast_pointer.va_valueid,
+								   VARATT_EXTERNAL_GET_EXTSIZE(toast_pointer),
+								   toast_pointer.va_rawsize));
+
+	/* Compressed attributes should have a valid compression method */
+	if (VARATT_IS_COMPRESSED(&toast_pointer))
+	{
+		ToastCompressionId cmid;
+		bool		valid = false;
+
+		cmid = TOAST_COMPRESS_METHOD(&toast_pointer);
+		switch (cmid)
+		{
+			/* List of all valid compression method IDs */
+			case TOAST_PGLZ_COMPRESSION_ID:
+			case TOAST_LZ4_COMPRESSION_ID:
+				valid = true;
+				break;
+
+			/* Recognized but invalid compression method ID */
+			case TOAST_INVALID_COMPRESSION_ID:
+				break;
+
+			/* Intentionally no default here */
+		}
+
+		if (!valid)
+			report_corruption(ctx,
+							  psprintf("toast value %u has invalid compression method id %d",
+									   toast_pointer.va_valueid, cmid));
+	}
+
 	/* The tuple header better claim to contain toasted values */
 	if (!(infomask & HEAP_HASEXTERNAL))
-	{
 		report_corruption(ctx,
 						  psprintf("toast value %u is external but tuple header flag HEAP_HASEXTERNAL not set",
 								   toast_pointer.va_valueid));
-		return true;
-	}
 
 	/* The relation better have a toast table */
 	if (!ctx->rel->rd_rel->reltoastrelid)
@@ -1397,6 +1812,14 @@ check_tuple_attribute(HeapCheckContext *ctx)
 		return true;
 	}
 
+	/* The toast pointer had better point at the relation's toast table */
+	if (toast_pointer.va_toastrelid != ctx->rel->rd_rel->reltoastrelid)
+		report_corruption(ctx,
+						  psprintf("toast value %u toast relation oid %u differs from expected oid %u",
+								   toast_pointer.va_valueid,
+								   toast_pointer.va_toastrelid,
+								   ctx->rel->rd_rel->reltoastrelid));
+
 	/* If we were told to skip toast checking, then we're done. */
 	if (ctx->toast_rel == NULL)
 		return true;
@@ -1436,10 +1859,22 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 	SysScanDesc toastscan;
 	bool		found_toasttup;
 	HeapTuple	toasttup;
-	int32		chunkno;
-	int32		endchunk;
-
-	endchunk = (VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer) - 1) / TOAST_MAX_CHUNK_SIZE;
+	int64		totalsize;		/* corrupt toast could overflow 32 bits */
+	int32		extsize;
+	ToastCheckContext tctx;
+
+	/* Calculate expected number of chunks and size of final chunk */
+	extsize = VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer);
+	tctx.final_expected_chunk = (extsize - 1) / TOAST_MAX_CHUNK_SIZE;
+	tctx.final_expected_size = extsize - tctx.final_expected_chunk * TOAST_MAX_CHUNK_SIZE;
+
+	/* Have not yet seen any chunks for this toast tuple */
+	tctx.have_extraneous_chunks = false;
+	tctx.first_extraneous = -1;
+	tctx.last_extraneous = -1;
+	tctx.chunk_seq_seen = false;
+	tctx.last_chunk_seq = -1;
+	tctx.total_chunks = 0;
 
 	/*
 	 * Setup a scan key to find chunks in toast table with matching va_valueid
@@ -1458,27 +1893,59 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 										   ctx->valid_toast_index,
 										   &SnapshotToast, 1,
 										   &toastkey);
-	chunkno = 0;
+	totalsize = 0;
 	found_toasttup = false;
 	while ((toasttup =
 			systable_getnext_ordered(toastscan,
 									 ForwardScanDirection)) != NULL)
 	{
+		totalsize += check_toast_tuple(toasttup, ta, &tctx, ctx);
 		found_toasttup = true;
-		check_toast_tuple(toasttup, ctx, ta, chunkno, endchunk);
-		chunkno++;
 	}
 	systable_endscan_ordered(toastscan);
 
 	if (!found_toasttup)
-		report_toast_corruption(ctx, ta,
+	{
+		report_toast_corruption(ctx, ta, &tctx,
 								psprintf("toast value %u not found in toast table",
 										 ta->toast_pointer.va_valueid));
-	else if (chunkno != (endchunk + 1))
-		report_toast_corruption(ctx, ta,
-								psprintf("toast value %u was expected to end at chunk %d, but ended at chunk %d",
+		return;
+	}
+
+	/* Flush any cached extraneous chunks seen in the loop above. */
+	report_extraneous_chunks(ctx, ta, &tctx);
+
+	/*
+	 * Any chunks missing from the beginning or middle of the sequence were
+	 * already reported within check_toast_tuple(), but we need to report
+	 * any chunks missing from the end of the sequence.
+	 */
+	if (tctx.last_chunk_seq < tctx.final_expected_chunk)
+		report_missing_chunks(ctx, ta, &tctx, tctx.last_chunk_seq + 1,
+							  tctx.final_expected_chunk);
+
+	/*
+	 * Report a summary message for this toasted attribute if the size and
+	 * structure of the attribute in its totality differs from our
+	 * expectations.
+	 */
+	if (!tctx.chunk_seq_seen)
+		report_toast_corruption(ctx, ta, &tctx,
+								psprintf(ngettext("toast value %u was expected to end with chunk %d and total size %d but ends after %d chunk with null sequence number",
+												  "toast value %u was expected to end with chunk %d and total size %d but ends after %d chunks with null sequence number",
+												  tctx.total_chunks),
+										 ta->toast_pointer.va_valueid,
+										 tctx.final_expected_chunk, extsize,
+										 tctx.total_chunks));
+	else if (extsize != totalsize || tctx.final_expected_chunk != tctx.last_chunk_seq)
+		report_toast_corruption(ctx, ta, &tctx,
+								psprintf(ngettext("toast value %u was expected to end with chunk %d and total size %d but ends after %d chunk with chunk %d and total size " INT64_FORMAT,
+												  "toast value %u was expected to end with chunk %d and total size %d but ends after %d chunks with chunk %d and total size " INT64_FORMAT,
+												  tctx.total_chunks),
 										 ta->toast_pointer.va_valueid,
-										 (endchunk + 1), chunkno));
+										 tctx.final_expected_chunk, extsize,
+										 tctx.total_chunks,
+										 tctx.last_chunk_seq, totalsize));
 }
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c7aff677d4..996ef03180 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2558,6 +2558,7 @@ TimestampTz
 TmFromChar
 TmToChar
 ToastAttrInfo
+ToastCheckContext
 ToastTupleContext
 ToastedAttribute
 TocEntry
-- 
2.21.1 (Apple Git-122.3)

#139Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#138)
1 attachment(s)
Re: pg_amcheck contrib application

On Thu, Apr 22, 2021 at 7:28 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

I have refactored the patch to address your other concerns. Breaking the patch into multiple pieces didn't add any clarity, but refactoring portions of it made things simpler to read, I think, so here it is as one patch file.

I was hoping that this version was going to be smaller than the last
version, but instead it went from 300+ lines to 500+ lines.

The main thing I'm unhappy about in the status quo is the use of
chunkno in error messages. I have suggested several times making that
concept go away, because I think users will be confused. Here's a
minimal patch that does just that. It's 32 lines and results in a net
removal of 4 lines. It differs somewhat from my earlier suggestions,
because my priority here is to get reasonably understandable output
without needing a ton of code, and as I was working on this I found
that some of my earlier suggestions would have needed more code to
implement and I didn't think it bought enough to be worth it. It's
possible this is too simple, or that it's buggy, so let me know what
you think. But basically, I think what got committed before is
actually mostly fine and doesn't need major revision. It just needs
tidying up to avoid the confusing chunkno concept.

Now, the other thing we've talked about is adding a few more checks,
to verify for example that the toastrelid is what we expect, and I see
in your v22 you thought of a few other things. I think we can consider
those, possibly as things where we consider it tidying up loose ends
for v14, or else as improvements for v15. But I don't think that the
fairly large size of your patch comes primarily from additional
checks. I think it mostly comes from the code to produce error reports
getting a lot more complicated. I apologize if my comments have driven
that complexity, but they weren't intended to.

One tiny problem with the attached patch is that it does not make any
regression tests fail, which also makes it hard for me to tell if it
breaks anything, or if the existing code works. I don't know how
practical it is to do anything about that. Do you have a patch handy
that allows manual updates and deletes on TOAST tables, for manual
testing purposes?

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

simply-remove-chunkno-concept.patch (application/octet-stream)
diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 9366f45d74..094a44d993 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -150,8 +150,8 @@ typedef struct HeapCheckContext
 static void sanity_check_relation(Relation rel);
 static void check_tuple(HeapCheckContext *ctx);
 static void check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
-							  ToastedAttribute *ta, int32 chunkno,
-							  int32 endchunk);
+							  ToastedAttribute *ta, int32 *expected_chunk_seq,
+							  uint32 extsize);
 
 static bool check_tuple_attribute(HeapCheckContext *ctx);
 static void check_toasted_attribute(HeapCheckContext *ctx,
@@ -1163,19 +1163,19 @@ check_tuple_visibility(HeapCheckContext *ctx)
  */
 static void
 check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
-				  ToastedAttribute *ta, int32 chunkno, int32 endchunk)
+				  ToastedAttribute *ta, int32 *expected_chunk_seq,
+				  uint32 extsize)
 {
-	int32		curchunk;
+	int32		chunk_seq;
+	int32		last_chunk_seq = (extsize + 1) / TOAST_MAX_CHUNK_SIZE;
 	Pointer		chunk;
 	bool		isnull;
 	int32		chunksize;
 	int32		expected_size;
 
-	/*
-	 * Have a chunk, extract the sequence number and the data
-	 */
-	curchunk = DatumGetInt32(fastgetattr(toasttup, 2,
-										 ctx->toast_rel->rd_att, &isnull));
+	/* Sanity-check the sequence number. */
+	chunk_seq = DatumGetInt32(fastgetattr(toasttup, 2,
+										  ctx->toast_rel->rd_att, &isnull));
 	if (isnull)
 	{
 		report_toast_corruption(ctx, ta,
@@ -1183,13 +1183,25 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 										 ta->toast_pointer.va_valueid));
 		return;
 	}
+	if (chunk_seq != *expected_chunk_seq)
+	{
+		/* Either the TOAST index is corrupt, or we don't have all chunks. */
+		report_toast_corruption(ctx, ta,
+								psprintf("toast value %u index scan returned chunk %d when expecting chunk %d",
+										 ta->toast_pointer.va_valueid,
+										 chunk_seq, *expected_chunk_seq));
+	}
+	*expected_chunk_seq = chunk_seq + 1;
+
+	/* Sanity-check the chunk data. */
 	chunk = DatumGetPointer(fastgetattr(toasttup, 3,
 										ctx->toast_rel->rd_att, &isnull));
 	if (isnull)
 	{
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u chunk %d has null data",
-										 ta->toast_pointer.va_valueid, chunkno));
+										 ta->toast_pointer.va_valueid,
+										 chunk_seq));
 		return;
 	}
 	if (!VARATT_IS_EXTENDED(chunk))
@@ -1209,40 +1221,31 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u chunk %d has invalid varlena header %0x",
 										 ta->toast_pointer.va_valueid,
-										 chunkno, header));
+										 chunk_seq, header));
 		return;
 	}
 
 	/*
 	 * Some checks on the data we've found
 	 */
-	if (curchunk != chunkno)
-	{
-		report_toast_corruption(ctx, ta,
-								psprintf("toast value %u chunk %d has sequence number %d, but expected sequence number %d",
-										 ta->toast_pointer.va_valueid,
-										 chunkno, curchunk, chunkno));
-		return;
-	}
-	if (chunkno > endchunk)
+	if (chunk_seq > last_chunk_seq)
 	{
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u chunk %d follows last expected chunk %d",
 										 ta->toast_pointer.va_valueid,
-										 chunkno, endchunk));
+										 chunk_seq, last_chunk_seq));
 		return;
 	}
 
-	expected_size = curchunk < endchunk ? TOAST_MAX_CHUNK_SIZE
-		: VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer) - (endchunk * TOAST_MAX_CHUNK_SIZE);
+	expected_size = chunk_seq < last_chunk_seq ? TOAST_MAX_CHUNK_SIZE
+		: extsize % TOAST_MAX_CHUNK_SIZE;
 
 	if (chunksize != expected_size)
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u chunk %d has size %u, but expected size %u",
 										 ta->toast_pointer.va_valueid,
-										 chunkno, chunksize, expected_size));
+										 chunk_seq, chunksize, expected_size));
 }
-
 /*
  * Check the current attribute as tracked in ctx, recording any corruption
  * found in ctx->tupstore.
@@ -1436,10 +1439,10 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 	SysScanDesc toastscan;
 	bool		found_toasttup;
 	HeapTuple	toasttup;
-	int32		chunkno;
-	int32		endchunk;
+	uint32		extsize;
+	int32		expected_chunkno = 0;
 
-	endchunk = (VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer) - 1) / TOAST_MAX_CHUNK_SIZE;
+	extsize = VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer);
 
 	/*
 	 * Setup a scan key to find chunks in toast table with matching va_valueid
@@ -1458,15 +1461,13 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 										   ctx->valid_toast_index,
 										   &SnapshotToast, 1,
 										   &toastkey);
-	chunkno = 0;
 	found_toasttup = false;
 	while ((toasttup =
 			systable_getnext_ordered(toastscan,
 									 ForwardScanDirection)) != NULL)
 	{
 		found_toasttup = true;
-		check_toast_tuple(toasttup, ctx, ta, chunkno, endchunk);
-		chunkno++;
+		check_toast_tuple(toasttup, ctx, ta, &expected_chunkno, extsize);
 	}
 	systable_endscan_ordered(toastscan);
 
@@ -1474,11 +1475,6 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u not found in toast table",
 										 ta->toast_pointer.va_valueid));
-	else if (chunkno != (endchunk + 1))
-		report_toast_corruption(ctx, ta,
-								psprintf("toast value %u was expected to end at chunk %d, but ended at chunk %d",
-										 ta->toast_pointer.va_valueid,
-										 (endchunk + 1), chunkno));
 }
 
 /*
#140Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#139)
Re: pg_amcheck contrib application

On Apr 23, 2021, at 10:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Apr 22, 2021 at 7:28 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

I have refactored the patch to address your other concerns. Breaking the patch into multiple pieces didn't add any clarity, but refactoring portions of it made things simpler to read, I think, so here it is as one patch file.

I was hoping that this version was going to be smaller than the last
version, but instead it went from 300+ lines to 500+ lines.

The main thing I'm unhappy about in the status quo is the use of
chunkno in error messages. I have suggested several times making that
concept go away, because I think users will be confused. Here's a
minimal patch that does just that. It's 32 lines and results in a net
removal of 4 lines. It differs somewhat from my earlier suggestions,
because my priority here is to get reasonably understandable output
without needing a ton of code, and as I was working on this I found
that some of my earlier suggestions would have needed more code to
implement and I didn't think it bought enough to be worth it. It's
possible this is too simple, or that it's buggy, so let me know what
you think. But basically, I think what got committed before is
actually mostly fine and doesn't need major revision. It just needs
tidying up to avoid the confusing chunkno concept.

Now, the other thing we've talked about is adding a few more checks,
to verify for example that the toastrelid is what we expect, and I see
in your v22 you thought of a few other things. I think we can consider
those, possibly as things where we consider it tidying up loose ends
for v14, or else as improvements for v15. But I don't think that the
fairly large size of your patch comes primarily from additional
checks. I think it mostly comes from the code to produce error reports
getting a lot more complicated. I apologize if my comments have driven
that complexity, but they weren't intended to.

One tiny problem with the attached patch is that it does not make any
regression tests fail, which also makes it hard for me to tell if it
breaks anything, or if the existing code works. I don't know how
practical it is to do anything about that. Do you have a patch handy
that allows manual updates and deletes on TOAST tables, for manual
testing purposes?

Yes. I haven't been posting that with the patch, but I will test your patch and see what differs.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#141Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Mark Dilger (#140)
Re: pg_amcheck contrib application

On Apr 23, 2021, at 10:31 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

I will test your patch and see what differs.

Here are the differences between master and your patch:

UPDATE $toastname SET chunk_seq = chunk_seq + 1000 WHERE chunk_id = $value_id_to_corrupt

-                       qr/${header}toast value 16459 chunk 0 has sequence number 1000, but expected sequence number 0/,
-                       qr/${header}toast value 16459 chunk 1 has sequence number 1001, but expected sequence number 1/,
-                       qr/${header}toast value 16459 chunk 2 has sequence number 1002, but expected sequence number 2/,
-                       qr/${header}toast value 16459 chunk 3 has sequence number 1003, but expected sequence number 3/,
-                       qr/${header}toast value 16459 chunk 4 has sequence number 1004, but expected sequence number 4/,
-                       qr/${header}toast value 16459 chunk 5 has sequence number 1005, but expected sequence number 5/;
+               qr/${header}toast value 16459 index scan returned chunk 1000 when expecting chunk 0/,
+               qr/${header}toast value 16459 chunk 1000 follows last expected chunk 5/,
+               qr/${header}toast value 16459 chunk 1001 follows last expected chunk 5/,
+               qr/${header}toast value 16459 chunk 1002 follows last expected chunk 5/,
+               qr/${header}toast value 16459 chunk 1003 follows last expected chunk 5/,
+               qr/${header}toast value 16459 chunk 1004 follows last expected chunk 5/,
+               qr/${header}toast value 16459 chunk 1005 follows last expected chunk 5/;

UPDATE $toastname SET chunk_seq = chunk_seq * 1000 WHERE chunk_id = $value_id_to_corrupt

-                       qr/${header}toast value $value_id_to_corrupt chunk 1 has sequence number 1000, but expected sequence number 1/,
-                       qr/${header}toast value $value_id_to_corrupt chunk 2 has sequence number 2000, but expected sequence number 2/,
-                       qr/${header}toast value $value_id_to_corrupt chunk 3 has sequence number 3000, but expected sequence number 3/,
-                       qr/${header}toast value $value_id_to_corrupt chunk 4 has sequence number 4000, but expected sequence number 4/,
-                       qr/${header}toast value $value_id_to_corrupt chunk 5 has sequence number 5000, but expected sequence number 5/;
-
+               qr/${header}toast value 16460 index scan returned chunk 1000 when expecting chunk 1/,
+               qr/${header}toast value 16460 chunk 1000 follows last expected chunk 5/,
+               qr/${header}toast value 16460 index scan returned chunk 2000 when expecting chunk 1001/,
+               qr/${header}toast value 16460 chunk 2000 follows last expected chunk 5/,
+               qr/${header}toast value 16460 index scan returned chunk 3000 when expecting chunk 2001/,
+               qr/${header}toast value 16460 chunk 3000 follows last expected chunk 5/,
+               qr/${header}toast value 16460 index scan returned chunk 4000 when expecting chunk 3001/,
+               qr/${header}toast value 16460 chunk 4000 follows last expected chunk 5/,
+               qr/${header}toast value 16460 index scan returned chunk 5000 when expecting chunk 4001/,
+               qr/${header}toast value 16460 chunk 5000 follows last expected chunk 5/;

INSERT INTO $toastname (chunk_id, chunk_seq, chunk_data)
(SELECT chunk_id,
10*chunk_seq + 1000,
chunk_data
FROM $toastname
WHERE chunk_id = $value_id_to_corrupt)

-                       qr/${header}toast value $value_id_to_corrupt chunk 6 has sequence number 1000, but expected sequence number 6/,
-                       qr/${header}toast value $value_id_to_corrupt chunk 7 has sequence number 1010, but expected sequence number 7/,
-                       qr/${header}toast value $value_id_to_corrupt chunk 8 has sequence number 1020, but expected sequence number 8/,
-                       qr/${header}toast value $value_id_to_corrupt chunk 9 has sequence number 1030, but expected sequence number 9/,
-                       qr/${header}toast value $value_id_to_corrupt chunk 10 has sequence number 1040, but expected sequence number 10/,
-                       qr/${header}toast value $value_id_to_corrupt chunk 11 has sequence number 1050, but expected sequence number 11/,
-                       qr/${header}toast value $value_id_to_corrupt was expected to end at chunk 6, but ended at chunk 12/;
+              qr/${header}toast value $value_id_to_corrupt index scan returned chunk 1000 when expecting chunk 6/,
+              qr/${header}toast value $value_id_to_corrupt chunk 1000 follows last expected chunk 5/,
+              qr/${header}toast value $value_id_to_corrupt index scan returned chunk 1010 when expecting chunk 1001/,
+              qr/${header}toast value $value_id_to_corrupt chunk 1010 follows last expected chunk 5/,
+              qr/${header}toast value $value_id_to_corrupt index scan returned chunk 1020 when expecting chunk 1011/,
+              qr/${header}toast value $value_id_to_corrupt chunk 1020 follows last expected chunk 5/,
+              qr/${header}toast value $value_id_to_corrupt index scan returned chunk 1030 when expecting chunk 1021/,
+              qr/${header}toast value $value_id_to_corrupt chunk 1030 follows last expected chunk 5/,
+              qr/${header}toast value $value_id_to_corrupt index scan returned chunk 1040 when expecting chunk 1031/,
+              qr/${header}toast value $value_id_to_corrupt chunk 1040 follows last expected chunk 5/,
+              qr/${header}toast value $value_id_to_corrupt index scan returned chunk 1050 when expecting chunk 1041/,
+              qr/${header}toast value $value_id_to_corrupt chunk 1050 follows last expected chunk 5/;


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#142Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#141)
Re: pg_amcheck contrib application

On Fri, Apr 23, 2021 at 2:05 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

Here are the differences between master and your patch:

Thanks. Those messages look reasonable to me.

--
Robert Haas
EDB: http://www.enterprisedb.com

#143Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Mark Dilger (#141)
Re: pg_amcheck contrib application

On Apr 23, 2021, at 11:05 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

Here are the differences between master and your patch:

Another difference I should probably mention is that a bunch of unrelated tests are failing with errors like:

toast value 13465 chunk 0 has size 1995, but expected size 1996

which leads me to suspect your changes to how the size is calculated.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#144Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#143)
Re: pg_amcheck contrib application

On Fri, Apr 23, 2021 at 2:15 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

Another difference I should probably mention is that a bunch of unrelated tests are failing with errors like:

toast value 13465 chunk 0 has size 1995, but expected size 1996

which leads me to suspect your changes to how the size is calculated.

That seems like a pretty reasonable suspicion, but I can't see the problem:

-       expected_size = curchunk < endchunk ? TOAST_MAX_CHUNK_SIZE
-               : VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer) -
(endchunk * TOAST_MAX_CHUNK_SIZE);
+       expected_size = chunk_seq < last_chunk_seq ? TOAST_MAX_CHUNK_SIZE
+               : extsize % TOAST_MAX_CHUNK_SIZE;

What's different?

1. The variables are renamed.

2. It uses a new variable extsize instead of recomputing
VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer), but I think that
should have the same value.

3. I used modulo arithmetic (%) instead of subtracting endchunk *
TOAST_MAX_CHUNK_SIZE.

Is TOAST_MAX_CHUNK_SIZE 1996? How long a value did you insert?

--
Robert Haas
EDB: http://www.enterprisedb.com

#145Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#144)
Re: pg_amcheck contrib application

On Apr 23, 2021, at 11:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Apr 23, 2021 at 2:15 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

Another difference I should probably mention is that a bunch of unrelated tests are failing with errors like:

toast value 13465 chunk 0 has size 1995, but expected size 1996

which leads me to suspect your changes to how the size is calculated.

That seems like a pretty reasonable suspicion, but I can't see the problem:

-       expected_size = curchunk < endchunk ? TOAST_MAX_CHUNK_SIZE
-               : VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer) -
(endchunk * TOAST_MAX_CHUNK_SIZE);
+       expected_size = chunk_seq < last_chunk_seq ? TOAST_MAX_CHUNK_SIZE
+               : extsize % TOAST_MAX_CHUNK_SIZE;

What's different?

1. The variables are renamed.

2. It uses a new variable extsize instead of recomputing
VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer), but I think that
should have the same value.

3. I used modulo arithmetic (%) instead of subtracting endchunk *
TOAST_MAX_CHUNK_SIZE.

Is TOAST_MAX_CHUNK_SIZE 1996? How long a value did you insert?

On my laptop, yes, 1996 is TOAST_MAX_CHUNK_SIZE.

I'm not inserting anything. These failures come from regular tests that I have not changed. I just applied your patch and ran `make check-world`, and these fail in src/bin/pg_amcheck.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#146Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#144)
Re: pg_amcheck contrib application

On Apr 23, 2021, at 11:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:

+       expected_size = chunk_seq < last_chunk_seq ? TOAST_MAX_CHUNK_SIZE
+               : extsize % TOAST_MAX_CHUNK_SIZE;

What's different?

for one thing, if a sequence of chunks happens to fit perfectly, the final chunk will have size TOAST_MAX_CHUNK_SIZE, but you're expecting no larger than one less than that, given how modulo arithmetic works.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#147Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#146)
Re: pg_amcheck contrib application

On Fri, Apr 23, 2021 at 2:36 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

What's different?

for one thing, if a sequence of chunks happens to fit perfectly, the final chunk will have size TOAST_MAX_CHUNK_SIZE, but you're expecting no larger than one less than that, given how modulo arithmetic works.

Good point.

Perhaps something like this, closer to the way you had it?

expected_size = chunk_seq < last_chunk_seq ? TOAST_MAX_CHUNK_SIZE
: extsize - (last_chunk_seq * TOAST_MAX_CHUNK_SIZE);

--
Robert Haas
EDB: http://www.enterprisedb.com

#148Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#147)
Re: pg_amcheck contrib application

On Apr 23, 2021, at 1:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Perhaps something like this, closer to the way you had it?

expected_size = chunk_seq < last_chunk_seq ? TOAST_MAX_CHUNK_SIZE
: extsize - (last_chunk_seq * TOAST_MAX_CHUNK_SIZE);

It still suffers the same failures. I'll try to post something that accomplishes the changes to the reports that you are looking for.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#149Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Mark Dilger (#148)
1 attachment(s)
Re: pg_amcheck contrib application

On Apr 23, 2021, at 3:01 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

I'll try to post something that accomplishes the changes to the reports that you are looking for.

The attached patch changes amcheck corruption reports as discussed upthread. This patch is submitted for the v14 development cycle as a bug fix, per your complaint that the committed code generates reports sufficiently confusing to a user as to constitute a bug.

All other code refactoring and additional checks discussed upthread are reserved for the v15 development cycle and are not included here.

The minimal patch (not attached) that does not rename any variables is 135 lines. Your patch was 159 lines. The patch (attached) which includes your variable renaming is 174 lines.

Attachments:

v23-0001-amcheck-adjusting-corruption-report-output.patchapplication/octet-stream; name=v23-0001-amcheck-adjusting-corruption-report-output.patch; x-unix-mode=0644Download
From ebab23b6e341af1d738683af587fb1906ded29e8 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Mon, 26 Apr 2021 09:00:10 -0700
Subject: [PATCH v23] amcheck: adjusting corruption report output

Changing how toast corruption reports refer to toast chunks and how
chunk_seq numbering is anticipated.  No longer referring to the Nth
chunk returned from a toast index scan as "chunk N" since this
language could be misinterpreted by a reasonable user to mean the
chunk with chunk_seq=N.  No longer expecting each chunk to have
chunk_seq=N, but rather expecting each chunk to have chunk_seq equal
to one greater than the chunk_seq from the prior chunk from the
toast index scan.

Per complaint from Robert Haas that the old wording could confuse
users, and that missing chunks would precipitate a cascade of
corruption reports for all subsequent chunks rather than just for
the first subsequent chunk.
---
 contrib/amcheck/verify_heapam.c | 57 ++++++++++++++++-----------------
 1 file changed, 28 insertions(+), 29 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 9f159eb3db..fe4a32dd0f 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -150,8 +150,8 @@ typedef struct HeapCheckContext
 static void sanity_check_relation(Relation rel);
 static void check_tuple(HeapCheckContext *ctx);
 static void check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
-							  ToastedAttribute *ta, int32 chunkno,
-							  int32 endchunk);
+							  ToastedAttribute *ta, int32 *expected_chunk_seq,
+							  int32 last_chunk_seq);
 
 static bool check_tuple_attribute(HeapCheckContext *ctx);
 static void check_toasted_attribute(HeapCheckContext *ctx,
@@ -1159,23 +1159,27 @@ check_tuple_visibility(HeapCheckContext *ctx)
  * each toast tuple being checked against where we are in the sequence, as well
  * as each toast tuple having its varlena structure sanity checked.
  *
- * Returns whether the toast tuple passed the corruption checks.
+ * expected_chunk_seq: on entry, the chunk_seq expected for this chunk; on
+ * return, the chunk_seq expected for the next chunk
+ * last_chunk_seq: the final chunk_seq expected for this toasted attribute
  */
 static void
 check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
-				  ToastedAttribute *ta, int32 chunkno, int32 endchunk)
+				  ToastedAttribute *ta, int32 *expected_chunk_seq,
+				  int32 last_chunk_seq)
 {
-	int32		curchunk;
+	int32		chunk_seq;
 	Pointer		chunk;
 	bool		isnull;
 	int32		chunksize;
 	int32		expected_size;
+	int32		expected_seq = *expected_chunk_seq;
 
 	/*
 	 * Have a chunk, extract the sequence number and the data
 	 */
-	curchunk = DatumGetInt32(fastgetattr(toasttup, 2,
-										 ctx->toast_rel->rd_att, &isnull));
+	chunk_seq = DatumGetInt32(fastgetattr(toasttup, 2,
+										  ctx->toast_rel->rd_att, &isnull));
 	if (isnull)
 	{
 		report_toast_corruption(ctx, ta,
@@ -1183,13 +1187,15 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 										 ta->toast_pointer.va_valueid));
 		return;
 	}
+	*expected_chunk_seq = chunk_seq + 1;
+
 	chunk = DatumGetPointer(fastgetattr(toasttup, 3,
 										ctx->toast_rel->rd_att, &isnull));
 	if (isnull)
 	{
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u chunk %d has null data",
-										 ta->toast_pointer.va_valueid, chunkno));
+										 ta->toast_pointer.va_valueid, chunk_seq));
 		return;
 	}
 	if (!VARATT_IS_EXTENDED(chunk))
@@ -1209,38 +1215,32 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u chunk %d has invalid varlena header %0x",
 										 ta->toast_pointer.va_valueid,
-										 chunkno, header));
+										 chunk_seq, header));
 		return;
 	}
 
 	/*
 	 * Some checks on the data we've found
 	 */
-	if (curchunk != chunkno)
-	{
+	if (chunk_seq != expected_seq)
 		report_toast_corruption(ctx, ta,
-								psprintf("toast value %u chunk %d has sequence number %d, but expected sequence number %d",
+								psprintf("toast value %u index scan returned chunk %d when expecting chunk %d",
 										 ta->toast_pointer.va_valueid,
-										 chunkno, curchunk, chunkno));
-		return;
-	}
-	if (chunkno > endchunk)
-	{
+										 chunk_seq, expected_seq));
+	if (chunk_seq > last_chunk_seq)
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u chunk %d follows last expected chunk %d",
 										 ta->toast_pointer.va_valueid,
-										 chunkno, endchunk));
-		return;
-	}
+										 chunk_seq, last_chunk_seq));
 
-	expected_size = curchunk < endchunk ? TOAST_MAX_CHUNK_SIZE
-		: VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer) - (endchunk * TOAST_MAX_CHUNK_SIZE);
+	expected_size = chunk_seq < last_chunk_seq ? TOAST_MAX_CHUNK_SIZE
+		: VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer) - (last_chunk_seq * TOAST_MAX_CHUNK_SIZE);
 
 	if (chunksize != expected_size)
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u chunk %d has size %u, but expected size %u",
 										 ta->toast_pointer.va_valueid,
-										 chunkno, chunksize, expected_size));
+										 chunk_seq, chunksize, expected_size));
 }
 
 /*
@@ -1437,9 +1437,9 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 	bool		found_toasttup;
 	HeapTuple	toasttup;
 	int32		chunkno;
-	int32		endchunk;
+	int32		last_chunk_seq;
 
-	endchunk = (VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer) - 1) / TOAST_MAX_CHUNK_SIZE;
+	last_chunk_seq = (VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer) - 1) / TOAST_MAX_CHUNK_SIZE;
 
 	/*
 	 * Setup a scan key to find chunks in toast table with matching va_valueid
@@ -1465,8 +1465,7 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 									 ForwardScanDirection)) != NULL)
 	{
 		found_toasttup = true;
-		check_toast_tuple(toasttup, ctx, ta, chunkno, endchunk);
-		chunkno++;
+		check_toast_tuple(toasttup, ctx, ta, &chunkno, last_chunk_seq);
 	}
 	systable_endscan_ordered(toastscan);
 
@@ -1474,11 +1473,11 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u not found in toast table",
 										 ta->toast_pointer.va_valueid));
-	else if (chunkno != (endchunk + 1))
+	else if (chunkno != (last_chunk_seq + 1))
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u was expected to end at chunk %d, but ended at chunk %d",
 										 ta->toast_pointer.va_valueid,
-										 (endchunk + 1), chunkno));
+										 (last_chunk_seq + 1), chunkno));
 }
 
 /*
-- 
2.21.1 (Apple Git-122.3)

#150Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#149)
1 attachment(s)
Re: pg_amcheck contrib application

On Mon, Apr 26, 2021 at 1:52 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

The attached patch changes amcheck corruption reports as discussed upthread. This patch is submitted for the v14 development cycle as a bug fix, per your complaint that the committed code generates reports sufficiently confusing to a user as to constitute a bug.

All other code refactoring and additional checks discussed upthread are reserved for the v15 development cycle and are not included here.

The minimal patch (not attached) that does not rename any variables is 135 lines. Your patch was 159 lines. The patch (attached) which includes your variable renaming is 174 lines.

Hi,

I have compared this against my version. I found the following differences:

1. This version passes last_chunk_seq rather than extsize to
check_toast_tuple(). But this results in having to call
VARATT_EXTERNAL_GET_EXTSIZE() inside that function. I thought it was
nicer to do that in the caller, so that we don't do it twice.

2. You fixed some out-of-date comments.

3. You move the test for an unexpected chunk sequence further down in
the function. I don't see the point; I had put it by the related null
check, and still think that's better. You also deleted my comment /*
Either the TOAST index is corrupt, or we don't have all chunks. */
which I would have preferred to keep.

4. You don't return if chunk_seq > last_chunk_seq. That seems wrong,
because we cannot compute a sensible expected size in that case. I
think your code will subtract a larger value from a smaller one and,
this being unsigned arithmetic, say that the expected chunk size is
something gigantic. Returning and not issuing that complaint at all
seems better.

5. You fixed the incorrect formula I had introduced for the expected
size of the last chunk.

6. You changed the variable name in check_toasted_attribute() from
expected_chunkno to chunkno, and initialized it later in the function
instead of at declaration time. I don't find this to be an
improvement; including the word "expected" seems to me to be
substantially clearer. But I think I should have gone with
expected_chunk_seq for better consistency.

7. You restored the message "toast value %u was expected to end at
chunk %d, but ended at chunk %d" which my version deleted. I deleted
that message because I thought it was redundant, but I guess it's not:
there's nothing else to complain if the sequence of chunks ends early.
I think we should change the test from != to < though, because if it's
>, then we must have already complained about unexpected chunks. Also,
I think the message is actually wrong, because even though you renamed
the variable, it still ends up being the expected next chunkno rather
than the last chunkno we actually saw.

PFA my counter-proposal based on the above analysis.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

simply-remove-chunkno-concept-v2.patchapplication/octet-stream; name=simply-remove-chunkno-concept-v2.patchDownload
From 9ecd8c33a5b561aba2187ad84d8fd8d7c507b6fb Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 30 Apr 2021 12:36:36 -0400
Subject: [PATCH v2] amcheck: Improve some confusing reports about TOAST
 problems.

Don't phrase reports in terms of the number of tuples thus-far
returned by the index scan, but rather in terms of the chunk_seq
values found inside the tuples.

Patch by me, reviewed by Mark Dilger.
---
 contrib/amcheck/verify_heapam.c | 75 ++++++++++++++++++---------------
 1 file changed, 40 insertions(+), 35 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 9f159eb3db..39aec1a1f7 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -150,8 +150,8 @@ typedef struct HeapCheckContext
 static void sanity_check_relation(Relation rel);
 static void check_tuple(HeapCheckContext *ctx);
 static void check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
-							  ToastedAttribute *ta, int32 chunkno,
-							  int32 endchunk);
+							  ToastedAttribute *ta, int32 *expected_chunk_seq,
+							  uint32 extsize);
 
 static bool check_tuple_attribute(HeapCheckContext *ctx);
 static void check_toasted_attribute(HeapCheckContext *ctx,
@@ -1159,23 +1159,25 @@ check_tuple_visibility(HeapCheckContext *ctx)
  * each toast tuple being checked against where we are in the sequence, as well
  * as each toast tuple having its varlena structure sanity checked.
  *
- * Returns whether the toast tuple passed the corruption checks.
+ * On entry, *expected_chunk_seq should be the chunk_seq value that we expect
+ * to find in toasttup. On exit, it will be updated to the value the next call
+ * to this function should expect to see.
  */
 static void
 check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
-				  ToastedAttribute *ta, int32 chunkno, int32 endchunk)
+				  ToastedAttribute *ta, int32 *expected_chunk_seq,
+				  uint32 extsize)
 {
-	int32		curchunk;
+	int32		chunk_seq;
+	int32		last_chunk_seq = (extsize + 1) / TOAST_MAX_CHUNK_SIZE;
 	Pointer		chunk;
 	bool		isnull;
 	int32		chunksize;
 	int32		expected_size;
 
-	/*
-	 * Have a chunk, extract the sequence number and the data
-	 */
-	curchunk = DatumGetInt32(fastgetattr(toasttup, 2,
-										 ctx->toast_rel->rd_att, &isnull));
+	/* Sanity-check the sequence number. */
+	chunk_seq = DatumGetInt32(fastgetattr(toasttup, 2,
+										  ctx->toast_rel->rd_att, &isnull));
 	if (isnull)
 	{
 		report_toast_corruption(ctx, ta,
@@ -1183,13 +1185,25 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 										 ta->toast_pointer.va_valueid));
 		return;
 	}
+	if (chunk_seq != *expected_chunk_seq)
+	{
+		/* Either the TOAST index is corrupt, or we don't have all chunks. */
+		report_toast_corruption(ctx, ta,
+								psprintf("toast value %u index scan returned chunk %d when expecting chunk %d",
+										 ta->toast_pointer.va_valueid,
+										 chunk_seq, *expected_chunk_seq));
+	}
+	*expected_chunk_seq = chunk_seq + 1;
+
+	/* Sanity-check the chunk data. */
 	chunk = DatumGetPointer(fastgetattr(toasttup, 3,
 										ctx->toast_rel->rd_att, &isnull));
 	if (isnull)
 	{
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u chunk %d has null data",
-										 ta->toast_pointer.va_valueid, chunkno));
+										 ta->toast_pointer.va_valueid,
+										 chunk_seq));
 		return;
 	}
 	if (!VARATT_IS_EXTENDED(chunk))
@@ -1209,40 +1223,31 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u chunk %d has invalid varlena header %0x",
 										 ta->toast_pointer.va_valueid,
-										 chunkno, header));
+										 chunk_seq, header));
 		return;
 	}
 
 	/*
 	 * Some checks on the data we've found
 	 */
-	if (curchunk != chunkno)
-	{
-		report_toast_corruption(ctx, ta,
-								psprintf("toast value %u chunk %d has sequence number %d, but expected sequence number %d",
-										 ta->toast_pointer.va_valueid,
-										 chunkno, curchunk, chunkno));
-		return;
-	}
-	if (chunkno > endchunk)
+	if (chunk_seq > last_chunk_seq)
 	{
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u chunk %d follows last expected chunk %d",
 										 ta->toast_pointer.va_valueid,
-										 chunkno, endchunk));
+										 chunk_seq, last_chunk_seq));
 		return;
 	}
 
-	expected_size = curchunk < endchunk ? TOAST_MAX_CHUNK_SIZE
-		: VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer) - (endchunk * TOAST_MAX_CHUNK_SIZE);
+	expected_size = chunk_seq < last_chunk_seq ? TOAST_MAX_CHUNK_SIZE
+		: extsize - (last_chunk_seq * TOAST_MAX_CHUNK_SIZE);
 
 	if (chunksize != expected_size)
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u chunk %d has size %u, but expected size %u",
 										 ta->toast_pointer.va_valueid,
-										 chunkno, chunksize, expected_size));
+										 chunk_seq, chunksize, expected_size));
 }
-
 /*
  * Check the current attribute as tracked in ctx, recording any corruption
  * found in ctx->tupstore.
@@ -1436,10 +1441,12 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 	SysScanDesc toastscan;
 	bool		found_toasttup;
 	HeapTuple	toasttup;
-	int32		chunkno;
-	int32		endchunk;
+	uint32		extsize;
+	int32		expected_chunk_seq = 0;
+	int32		last_chunk_seq;
 
-	endchunk = (VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer) - 1) / TOAST_MAX_CHUNK_SIZE;
+	extsize = VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer);
+	last_chunk_seq = (extsize + 1) / TOAST_MAX_CHUNK_SIZE;
 
 	/*
 	 * Setup a scan key to find chunks in toast table with matching va_valueid
@@ -1458,15 +1465,13 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 										   ctx->valid_toast_index,
 										   &SnapshotToast, 1,
 										   &toastkey);
-	chunkno = 0;
 	found_toasttup = false;
 	while ((toasttup =
 			systable_getnext_ordered(toastscan,
 									 ForwardScanDirection)) != NULL)
 	{
 		found_toasttup = true;
-		check_toast_tuple(toasttup, ctx, ta, chunkno, endchunk);
-		chunkno++;
+		check_toast_tuple(toasttup, ctx, ta, &expected_chunk_seq, extsize);
 	}
 	systable_endscan_ordered(toastscan);
 
@@ -1474,11 +1479,11 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u not found in toast table",
 										 ta->toast_pointer.va_valueid));
-	else if (chunkno != (endchunk + 1))
+	else if (expected_chunk_seq < last_chunk_seq)
 		report_toast_corruption(ctx, ta,
-								psprintf("toast value %u was expected to end at chunk %d, but ended at chunk %d",
+								psprintf("toast value %u index scan ended early while expecting chunk %d of %d",
 										 ta->toast_pointer.va_valueid,
-										 (endchunk + 1), chunkno));
+										 expected_chunk_seq, last_chunk_seq));
 }
 
 /*
-- 
2.24.3 (Apple Git-128)

#151Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#150)
Re: pg_amcheck contrib application

On Apr 30, 2021, at 9:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Apr 26, 2021 at 1:52 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

The attached patch changes amcheck corruption reports as discussed upthread. This patch is submitted for the v14 development cycle as a bug fix, per your complaint that the committed code generates reports sufficiently confusing to a user as to constitute a bug.

All other code refactoring and additional checks discussed upthread are reserved for the v15 development cycle and are not included here.

The minimal patch (not attached) that does not rename any variables is 135 lines. Your patch was 159 lines. The patch (attached) which includes your variable renaming is 174 lines.

Hi,

I have compared this against my version. I found the following differences:

Just to be clear, I did not use your patch v1 as the starting point. I took the code as committed to master as the starting point, used your corruption report verbiage changes and at least some of your variable naming choices, but did not use the rest, in large part because it didn't work. It caused corruption messages to be reported against tables that have no corruption. For that matter, your v2 patch doesn't work either, and in the same way. To wit:

heap table "postgres"."pg_catalog"."pg_rewrite", block 6, offset 4, attribute 7:
toast value 13461 chunk 0 has size 1995, but expected size 1996

I think there is something wrong with the way you are trying to calculate and use extsize, because I'm not corrupting pg_catalog.pg_rewrite. You can get these same results by applying your patch to master, building, and running 'make check' from src/bin/pg_amcheck/

1. This version passes last_chunk_seq rather than extsize to
check_toast_tuple(). But this results in having to call
VARATT_EXTERNAL_GET_EXTSIZE() inside that function. I thought it was
nicer to do that in the caller, so that we don't do it twice.

I don't see that VARATT_EXTERNAL_GET_EXTSIZE() is worth too much concern, given that it is just a struct access and a bit mask. You are avoiding calculating that twice, but at the expense of calculating last_chunk_seq twice, which involves division. I don't think the division can be optimized as a mere bit shift, since TOAST_MAX_CHUNK_SIZE is not in general a power of two. (For example, on my laptop it is 1996.)

I don't say this to nitpick at the performance one way vs. the other. I doubt it makes any real difference. I'm just confused why you want to change this particular thing right now, given that it is not a bug.

2. You fixed some out-of-date comments.

Yes, because they were wrong. That's on me. I failed to update them in a prior patch.

3. You move the test for an unexpected chunk sequence further down in
the function. I don't see the point;

Relative to your patch, perhaps. Relative to master, no tests have been moved.

I had put it by the related null
check, and still think that's better. You also deleted my comment /*
Either the TOAST index is corrupt, or we don't have all chunks. */
which I would have preferred to keep.

That's fine. I didn't mean to remove it. I was just taking a minimalist approach to constructing the patch.

4. You don't return if chunk_seq > last_chunk_seq. That seems wrong,
because we cannot compute a sensible expected size in that case. I
think your code will subtract a larger value from a smaller one and,
this being unsigned arithmetic, say that the expected chunk size is
something gigantic.

Your conclusion is probably right, but I think your analysis is based on a misreading of what "last_chunk_seq" means. It's not the last one seen, but the last one expected. (Should we rename the variable to avoid confusion?) It won't compute a gigantic size. Rather, it will expect *every* chunk with chunk_seq >= last_chunk_seq to have whatever size is appropriate for the last chunk.

Returning and not issuing that complaint at all
seems better.

That might be best. I had been resisting that because I don't want the extraneous chunks to be reported without chunk size information. When debugging corrupted toast, it may be interesting to know the size of the extraneous chunks. If there are 1000 extra chunks, somebody might want to see the sizes of them.

5. You fixed the incorrect formula I had introduced for the expected
size of the last chunk.

Not really. I just didn't introduce any change in that area.

6. You changed the variable name in check_toasted_attribute() from
expected_chunkno to chunkno, and initialized it later in the function
instead of at declaration time. I don't find this to be an
improvement;

I think I just left the variable name and its initialization unchanged.

including the word "expected" seems to me to be
substantially clearer. But I think I should have gone with
expected_chunk_seq for better consistency.

I agree that is a better name.

7. You restored the message "toast value %u was expected to end at
chunk %d, but ended at chunk %d" which my version deleted. I deleted
that message because I thought it was redundant, but I guess it's not:
there's nothing else to complain about if the sequence of chunks ends early.
I think we should change the test from != to < though, because if it's >, then we must have already complained about unexpected chunks.

We can do it that way if you like. I considered that and had trouble deciding whether that made things less clear to users who might be less familiar with the structure of toasted attributes. If some of the attributes have that message and others don't, they might conclude that only some of the attributes ended at the wrong chunk and fail to make the inference that to you or me is obvious.

Also,

I think the message is actually wrong, because even though you renamed
the variable, it still ends up being the expected next chunkno rather
than the last chunkno we actually saw.

If we have seen any chunks, the variable is holding the expected next chunk seq, which is one greater than the last chunk seq we saw.

If we expect chunks 0..3 and see chunk 0 but not chunk 1, it will complain ..."expected to end at chunk 4, but ended at chunk 1". This is clearly by design and not merely a bug, though I tend to agree with you that this is a strange wording choice. I can't remember exactly when and how we decided to word the message this way, but it has annoyed me for a while, and I assumed it was something you suggested a while back, because I don't recall doing it. Either way, since you seem to also be bothered by this, I agree we should change it.
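
The odd wording follows from the counting scheme: the old code counted tuples returned by the index scan, not chunk_seq values. A sketch of the scenario just described, with chunks 0..3 expected but only chunk 0 returned:

```python
# chunkno counts tuples returned by the scan, so when chunks are missing
# the "ended at" number is really a tuple count, not a chunk_seq.
endchunk = 3                # chunks 0..3 expected
returned = [0]              # chunks 1..3 are missing from the scan

chunkno = 0
for _ in returned:
    chunkno += 1            # one increment per tuple returned

msg = (f"expected to end at chunk {endchunk + 1}, "
       f"but ended at chunk {chunkno}")
assert msg == "expected to end at chunk 4, but ended at chunk 1"
```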

PFA my counter-proposal based on the above analysis.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#152Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#151)
1 attachment(s)
Re: pg_amcheck contrib application

On Fri, Apr 30, 2021 at 2:31 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

Just to be clear, I did not use your patch v1 as the starting point.

I thought that might be the case, but I was trying to understand what
you didn't like about my version, and comparing them seemed like a way
to figure that out.

I took the code as committed to master as the starting point, used your corruption report verbiage changes and at least some of your variable naming choices, but did not use the rest, in large part because it didn't work. It caused corruption messages to be reported against tables that have no corruption. For that matter, your v2 patch doesn't work either, and in the same way. To wit:

heap table "postgres"."pg_catalog"."pg_rewrite", block 6, offset 4, attribute 7:
toast value 13461 chunk 0 has size 1995, but expected size 1996

I think there is something wrong with the way you are trying to calculate and use extsize, because I'm not corrupting pg_catalog.pg_rewrite. You can get these same results by applying your patch to master, building, and running 'make check' from src/bin/pg_amcheck/

Argh, OK, I didn't realize. Should be fixed in this version.

4. You don't return if chunk_seq > last_chunk_seq. That seems wrong,
because we cannot compute a sensible expected size in that case. I
think your code will subtract a larger value from a smaller one and,
this being unsigned arithmetic, say that the expected chunk size is
something gigantic.

Your conclusion is probably right, but I think your analysis is based on a misreading of what "last_chunk_seq" means. It's not the last one seen, but the last one expected. (Should we rename the variable to avoid confusion?) It won't compute a gigantic size. Rather, it will expect *every* chunk with chunk_seq >= last_chunk_seq to have whatever size is appropriate for the last chunk.

I realize it's the last one expected. That's the point: we don't have
any expectation for the sizes of chunks higher than the last one we
expected to see. If the value is 2000 bytes and the chunk size is 1996
bytes, we expect chunk 0 to be 1996 bytes and chunk 1 to be 4 bytes.
If not, we can complain. But it makes no sense to complain about chunk
2 being of a size we don't expect. We don't expect it to exist in the
first place, so we have no notion of what size it ought to be.

If we have seen any chunks, the variable is holding the expected next chunk seq, which is one greater than the last chunk seq we saw.

If we expect chunks 0..3 and see chunk 0 but not chunk 1, it will complain ..."expected to end at chunk 4, but ended at chunk 1". This is clearly by design and not merely a bug, though I tend to agree with you that this is a strange wording choice. I can't remember exactly when and how we decided to word the message this way, but it has annoyed me for a while, and I assumed it was something you suggested a while back, because I don't recall doing it. Either way, since you seem to also be bothered by this, I agree we should change it.

Can you review this version?

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

simply-remove-chunkno-concept-v3.patch (application/octet-stream)
From 3d63cdc88f2780eb00b7cbc62fcaa9dac29ff3ce Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 30 Apr 2021 14:50:35 -0400
Subject: [PATCH v3] amcheck: Improve some confusing reports about TOAST
 problems.

Don't phrase reports in terms of the number of tuples thus-far
returned by the index scan, but rather in terms of the chunk_seq
values found inside the tuples.

Patch by me, reviewed by Mark Dilger.
---
 contrib/amcheck/verify_heapam.c | 75 ++++++++++++++++++---------------
 1 file changed, 40 insertions(+), 35 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 9f159eb3db..a47135e5b1 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -150,8 +150,8 @@ typedef struct HeapCheckContext
 static void sanity_check_relation(Relation rel);
 static void check_tuple(HeapCheckContext *ctx);
 static void check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
-							  ToastedAttribute *ta, int32 chunkno,
-							  int32 endchunk);
+							  ToastedAttribute *ta, int32 *expected_chunk_seq,
+							  uint32 extsize);
 
 static bool check_tuple_attribute(HeapCheckContext *ctx);
 static void check_toasted_attribute(HeapCheckContext *ctx,
@@ -1159,23 +1159,25 @@ check_tuple_visibility(HeapCheckContext *ctx)
  * each toast tuple being checked against where we are in the sequence, as well
  * as each toast tuple having its varlena structure sanity checked.
  *
- * Returns whether the toast tuple passed the corruption checks.
+ * On entry, *expected_chunk_seq should be the chunk_seq value that we expect
+ * to find in toasttup. On exit, it will be updated to the value the next call
+ * to this function should expect to see.
  */
 static void
 check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
-				  ToastedAttribute *ta, int32 chunkno, int32 endchunk)
+				  ToastedAttribute *ta, int32 *expected_chunk_seq,
+				  uint32 extsize)
 {
-	int32		curchunk;
+	int32		chunk_seq;
+	int32		last_chunk_seq = (extsize - 1) / TOAST_MAX_CHUNK_SIZE;
 	Pointer		chunk;
 	bool		isnull;
 	int32		chunksize;
 	int32		expected_size;
 
-	/*
-	 * Have a chunk, extract the sequence number and the data
-	 */
-	curchunk = DatumGetInt32(fastgetattr(toasttup, 2,
-										 ctx->toast_rel->rd_att, &isnull));
+	/* Sanity-check the sequence number. */
+	chunk_seq = DatumGetInt32(fastgetattr(toasttup, 2,
+										  ctx->toast_rel->rd_att, &isnull));
 	if (isnull)
 	{
 		report_toast_corruption(ctx, ta,
@@ -1183,13 +1185,25 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 										 ta->toast_pointer.va_valueid));
 		return;
 	}
+	if (chunk_seq != *expected_chunk_seq)
+	{
+		/* Either the TOAST index is corrupt, or we don't have all chunks. */
+		report_toast_corruption(ctx, ta,
+								psprintf("toast value %u index scan returned chunk %d when expecting chunk %d",
+										 ta->toast_pointer.va_valueid,
+										 chunk_seq, *expected_chunk_seq));
+	}
+	*expected_chunk_seq = chunk_seq + 1;
+
+	/* Sanity-check the chunk data. */
 	chunk = DatumGetPointer(fastgetattr(toasttup, 3,
 										ctx->toast_rel->rd_att, &isnull));
 	if (isnull)
 	{
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u chunk %d has null data",
-										 ta->toast_pointer.va_valueid, chunkno));
+										 ta->toast_pointer.va_valueid,
+										 chunk_seq));
 		return;
 	}
 	if (!VARATT_IS_EXTENDED(chunk))
@@ -1209,40 +1223,31 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u chunk %d has invalid varlena header %0x",
 										 ta->toast_pointer.va_valueid,
-										 chunkno, header));
+										 chunk_seq, header));
 		return;
 	}
 
 	/*
 	 * Some checks on the data we've found
 	 */
-	if (curchunk != chunkno)
-	{
-		report_toast_corruption(ctx, ta,
-								psprintf("toast value %u chunk %d has sequence number %d, but expected sequence number %d",
-										 ta->toast_pointer.va_valueid,
-										 chunkno, curchunk, chunkno));
-		return;
-	}
-	if (chunkno > endchunk)
+	if (chunk_seq > last_chunk_seq)
 	{
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u chunk %d follows last expected chunk %d",
 										 ta->toast_pointer.va_valueid,
-										 chunkno, endchunk));
+										 chunk_seq, last_chunk_seq));
 		return;
 	}
 
-	expected_size = curchunk < endchunk ? TOAST_MAX_CHUNK_SIZE
-		: VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer) - (endchunk * TOAST_MAX_CHUNK_SIZE);
+	expected_size = chunk_seq < last_chunk_seq ? TOAST_MAX_CHUNK_SIZE
+		: extsize - (last_chunk_seq * TOAST_MAX_CHUNK_SIZE);
 
 	if (chunksize != expected_size)
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u chunk %d has size %u, but expected size %u",
 										 ta->toast_pointer.va_valueid,
-										 chunkno, chunksize, expected_size));
+										 chunk_seq, chunksize, expected_size));
 }
-
 /*
  * Check the current attribute as tracked in ctx, recording any corruption
  * found in ctx->tupstore.
@@ -1436,10 +1441,12 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 	SysScanDesc toastscan;
 	bool		found_toasttup;
 	HeapTuple	toasttup;
-	int32		chunkno;
-	int32		endchunk;
+	uint32		extsize;
+	int32		expected_chunk_seq = 0;
+	int32		last_chunk_seq;
 
-	endchunk = (VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer) - 1) / TOAST_MAX_CHUNK_SIZE;
+	extsize = VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer);
+	last_chunk_seq = (extsize - 1) / TOAST_MAX_CHUNK_SIZE;
 
 	/*
 	 * Setup a scan key to find chunks in toast table with matching va_valueid
@@ -1458,15 +1465,13 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 										   ctx->valid_toast_index,
 										   &SnapshotToast, 1,
 										   &toastkey);
-	chunkno = 0;
 	found_toasttup = false;
 	while ((toasttup =
 			systable_getnext_ordered(toastscan,
 									 ForwardScanDirection)) != NULL)
 	{
 		found_toasttup = true;
-		check_toast_tuple(toasttup, ctx, ta, chunkno, endchunk);
-		chunkno++;
+		check_toast_tuple(toasttup, ctx, ta, &expected_chunk_seq, extsize);
 	}
 	systable_endscan_ordered(toastscan);
 
@@ -1474,11 +1479,11 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u not found in toast table",
 										 ta->toast_pointer.va_valueid));
-	else if (chunkno != (endchunk + 1))
+	else if (expected_chunk_seq < last_chunk_seq)
 		report_toast_corruption(ctx, ta,
-								psprintf("toast value %u was expected to end at chunk %d, but ended at chunk %d",
+								psprintf("toast value %u index scan ended early while expecting chunk %d of %d",
 										 ta->toast_pointer.va_valueid,
-										 (endchunk + 1), chunkno));
+										 expected_chunk_seq, last_chunk_seq));
 }
 
 /*
-- 
2.24.3 (Apple Git-128)

#153Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#152)
2 attachment(s)
Re: pg_amcheck contrib application

On Apr 30, 2021, at 11:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Apr 30, 2021 at 2:31 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

Just to be clear, I did not use your patch v1 as the starting point.

I thought that might be the case, but I was trying to understand what
you didn't like about my version, and comparing them seemed like a way
to figure that out.

I took the code as committed to master as the starting point, used your corruption report verbiage changes and at least some of your variable naming choices, but did not use the rest, in large part because it didn't work. It caused corruption messages to be reported against tables that have no corruption. For that matter, your v2 patch doesn't work either, and in the same way. To wit:

heap table "postgres"."pg_catalog"."pg_rewrite", block 6, offset 4, attribute 7:
toast value 13461 chunk 0 has size 1995, but expected size 1996

I think there is something wrong with the way you are trying to calculate and use extsize, because I'm not corrupting pg_catalog.pg_rewrite. You can get these same results by applying your patch to master, building, and running 'make check' from src/bin/pg_amcheck/

Argh, OK, I didn't realize. Should be fixed in this version.

4. You don't return if chunk_seq > last_chunk_seq. That seems wrong,
because we cannot compute a sensible expected size in that case. I
think your code will subtract a larger value from a smaller one and,
this being unsigned arithmetic, say that the expected chunk size is
something gigantic.

Your conclusion is probably right, but I think your analysis is based on a misreading of what "last_chunk_seq" means. It's not the last one seen, but the last one expected. (Should we rename the variable to avoid confusion?) It won't compute a gigantic size. Rather, it will expect *every* chunk with chunk_seq >= last_chunk_seq to have whatever size is appropriate for the last chunk.

I realize it's the last one expected. That's the point: we don't have
any expectation for the sizes of chunks higher than the last one we
expected to see. If the value is 2000 bytes and the chunk size is 1996
bytes, we expect chunk 0 to be 1996 bytes and chunk 1 to be 4 bytes.
If not, we can complain. But it makes no sense to complain about chunk
2 being of a size we don't expect. We don't expect it to exist in the
first place, so we have no notion of what size it ought to be.

If we have seen any chunks, the variable is holding the expected next chunk seq, which is one greater than the last chunk seq we saw.

If we expect chunks 0..3 and see chunk 0 but not chunk 1, it will complain ..."expected to end at chunk 4, but ended at chunk 1". This is clearly by design and not merely a bug, though I tend to agree with you that this is a strange wording choice. I can't remember exactly when and how we decided to word the message this way, but it has annoyed me for a while, and I assumed it was something you suggested a while back, because I don't recall doing it. Either way, since you seem to also be bothered by this, I agree we should change it.

Can you review this version?

--
Robert Haas
EDB: http://www.enterprisedb.com
<simply-remove-chunkno-concept-v3.patch>

As requested off-list, here are NOT FOR COMMIT, WIP patches for testing only.

The first patch allows toast tables to be updated and adds regression tests of corrupted toasted attributes. I never quite got deletes from toast tables to work, and there are probably other gotchas still lurking even with inserts and updates, but it limps along well enough for testing pg_amcheck.

The second patch updates the expected output of pg_amcheck to match the verbiage that you suggested upthread.

Attachments:

v1-0002-Modifying-toast-corruption-test-expected-output.patch.WIP (application/octet-stream)
From 66a52460c1aea17fe83064ae9110a36b260a3cb3 Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Mon, 26 Apr 2021 08:52:38 -0700
Subject: [PATCH v1 2/2] Modifying toast corruption test expected output

Making the test expected output match the output generated by
Robert's patch.
---
 src/bin/pg_amcheck/t/004_verify_heapam.pl | 50 ++++++++++++++---------
 1 file changed, 30 insertions(+), 20 deletions(-)

diff --git a/src/bin/pg_amcheck/t/004_verify_heapam.pl b/src/bin/pg_amcheck/t/004_verify_heapam.pl
index 4d66eb64fe..a1772f6b0a 100644
--- a/src/bin/pg_amcheck/t/004_verify_heapam.pl
+++ b/src/bin/pg_amcheck/t/004_verify_heapam.pl
@@ -304,7 +304,7 @@ close($file)
 $node->start;
 
 # Ok, Xids and page layout look ok.  We can run corruption tests.
-plan tests => 38;
+plan tests => 49;
 
 # Check that pg_amcheck runs against the uncorrupted table without error.
 $node->command_ok(['pg_amcheck', '-p', $port, 'postgres'],
@@ -526,12 +526,13 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 		push (@corruptions, "UPDATE $toastname SET chunk_seq = chunk_seq + 1000 WHERE chunk_id = $value_id_to_corrupt");
  		$header = header(0, $offnum, 2);
  		push @expected,
-			qr/${header}toast value 16459 chunk 0 has sequence number 1000, but expected sequence number 0/,
-			qr/${header}toast value 16459 chunk 1 has sequence number 1001, but expected sequence number 1/,
-			qr/${header}toast value 16459 chunk 2 has sequence number 1002, but expected sequence number 2/,
-			qr/${header}toast value 16459 chunk 3 has sequence number 1003, but expected sequence number 3/,
-			qr/${header}toast value 16459 chunk 4 has sequence number 1004, but expected sequence number 4/,
-			qr/${header}toast value 16459 chunk 5 has sequence number 1005, but expected sequence number 5/;
+			qr/${header}toast value 16459 index scan returned chunk 1000 when expecting chunk 0/,
+			qr/${header}toast value 16459 chunk 1000 follows last expected chunk 5/,
+			qr/${header}toast value 16459 chunk 1001 follows last expected chunk 5/,
+			qr/${header}toast value 16459 chunk 1002 follows last expected chunk 5/,
+			qr/${header}toast value 16459 chunk 1003 follows last expected chunk 5/,
+			qr/${header}toast value 16459 chunk 1004 follows last expected chunk 5/,
+			qr/${header}toast value 16459 chunk 1005 follows last expected chunk 5/;
  	}
 	elsif ($offnum == 17)
 	{
@@ -539,12 +540,16 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 		push (@corruptions, "UPDATE $toastname SET chunk_seq = chunk_seq * 1000 WHERE chunk_id = $value_id_to_corrupt");
  		$header = header(0, $offnum, 2);
  		push @expected,
-			qr/${header}toast value $value_id_to_corrupt chunk 1 has sequence number 1000, but expected sequence number 1/,
-			qr/${header}toast value $value_id_to_corrupt chunk 2 has sequence number 2000, but expected sequence number 2/,
-			qr/${header}toast value $value_id_to_corrupt chunk 3 has sequence number 3000, but expected sequence number 3/,
-			qr/${header}toast value $value_id_to_corrupt chunk 4 has sequence number 4000, but expected sequence number 4/,
-			qr/${header}toast value $value_id_to_corrupt chunk 5 has sequence number 5000, but expected sequence number 5/;
-
+			qr/${header}toast value 16460 index scan returned chunk 1000 when expecting chunk 1/,
+			qr/${header}toast value 16460 chunk 1000 follows last expected chunk 5/,
+			qr/${header}toast value 16460 index scan returned chunk 2000 when expecting chunk 1001/,
+			qr/${header}toast value 16460 chunk 2000 follows last expected chunk 5/,
+			qr/${header}toast value 16460 index scan returned chunk 3000 when expecting chunk 2001/,
+			qr/${header}toast value 16460 chunk 3000 follows last expected chunk 5/,
+			qr/${header}toast value 16460 index scan returned chunk 4000 when expecting chunk 3001/,
+			qr/${header}toast value 16460 chunk 4000 follows last expected chunk 5/,
+			qr/${header}toast value 16460 index scan returned chunk 5000 when expecting chunk 4001/,
+			qr/${header}toast value 16460 chunk 5000 follows last expected chunk 5/;
 	}
 	elsif ($offnum == 18)
 	{
@@ -567,13 +572,18 @@ INSERT INTO $toastname (chunk_id, chunk_seq, chunk_data)
 ));
  		$header = header(0, $offnum, 2);
  		push @expected,
-			qr/${header}toast value $value_id_to_corrupt chunk 6 has sequence number 1000, but expected sequence number 6/,
-			qr/${header}toast value $value_id_to_corrupt chunk 7 has sequence number 1010, but expected sequence number 7/,
-			qr/${header}toast value $value_id_to_corrupt chunk 8 has sequence number 1020, but expected sequence number 8/,
-			qr/${header}toast value $value_id_to_corrupt chunk 9 has sequence number 1030, but expected sequence number 9/,
-			qr/${header}toast value $value_id_to_corrupt chunk 10 has sequence number 1040, but expected sequence number 10/,
-			qr/${header}toast value $value_id_to_corrupt chunk 11 has sequence number 1050, but expected sequence number 11/,
-			qr/${header}toast value $value_id_to_corrupt was expected to end at chunk 6, but ended at chunk 12/;
+			qr/${header}toast value $value_id_to_corrupt index scan returned chunk 1000 when expecting chunk 6/,
+			qr/${header}toast value $value_id_to_corrupt chunk 1000 follows last expected chunk 5/,
+			qr/${header}toast value $value_id_to_corrupt index scan returned chunk 1010 when expecting chunk 1001/,
+			qr/${header}toast value $value_id_to_corrupt chunk 1010 follows last expected chunk 5/,
+			qr/${header}toast value $value_id_to_corrupt index scan returned chunk 1020 when expecting chunk 1011/,
+			qr/${header}toast value $value_id_to_corrupt chunk 1020 follows last expected chunk 5/,
+			qr/${header}toast value $value_id_to_corrupt index scan returned chunk 1030 when expecting chunk 1021/,
+			qr/${header}toast value $value_id_to_corrupt chunk 1030 follows last expected chunk 5/,
+			qr/${header}toast value $value_id_to_corrupt index scan returned chunk 1040 when expecting chunk 1031/,
+			qr/${header}toast value $value_id_to_corrupt chunk 1040 follows last expected chunk 5/,
+			qr/${header}toast value $value_id_to_corrupt index scan returned chunk 1050 when expecting chunk 1041/,
+			qr/${header}toast value $value_id_to_corrupt chunk 1050 follows last expected chunk 5/;
 	}
 	write_tuple($file, $offset, $tup);
 }
-- 
2.21.1 (Apple Git-122.3)

v1-0001-Adding-modify-toast-and-test-pg_amcheck.patch.WIP (application/octet-stream)
From 246e5ea1ec8dad8607ff154cf962fa73ea5356ae Mon Sep 17 00:00:00 2001
From: Mark Dilger <mark.dilger@enterprisedb.com>
Date: Mon, 26 Apr 2021 08:12:05 -0700
Subject: [PATCH v1 1/2] Adding modify toast and test pg_amcheck

This commit allows toast data to be modified, and adds tests of
pg_amcheck under corrupted toast conditions.
---
 src/backend/executor/execMain.c           |  7 +--
 src/backend/executor/nodeModifyTable.c    |  6 +-
 src/backend/optimizer/util/appendinfo.c   |  4 +-
 src/bin/pg_amcheck/t/004_verify_heapam.pl | 71 ++++++++++++++++++++++-
 4 files changed, 76 insertions(+), 12 deletions(-)

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index df3d7f9a8b..6cda1bfdb6 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -999,6 +999,7 @@ CheckValidResultRel(ResultRelInfo *resultRelInfo, CmdType operation)
 	{
 		case RELKIND_RELATION:
 		case RELKIND_PARTITIONED_TABLE:
+		case RELKIND_TOASTVALUE:
 			CheckCmdReplicaIdentity(resultRel, operation);
 			break;
 		case RELKIND_SEQUENCE:
@@ -1007,12 +1008,6 @@ CheckValidResultRel(ResultRelInfo *resultRelInfo, CmdType operation)
 					 errmsg("cannot change sequence \"%s\"",
 							RelationGetRelationName(resultRel))));
 			break;
-		case RELKIND_TOASTVALUE:
-			ereport(ERROR,
-					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("cannot change TOAST relation \"%s\"",
-							RelationGetRelationName(resultRel))));
-			break;
 		case RELKIND_VIEW:
 
 			/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index c5a2a9a054..de84364548 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -2466,7 +2466,8 @@ ExecModifyTable(PlanState *pstate)
 			relkind = resultRelInfo->ri_RelationDesc->rd_rel->relkind;
 			if (relkind == RELKIND_RELATION ||
 				relkind == RELKIND_MATVIEW ||
-				relkind == RELKIND_PARTITIONED_TABLE)
+				relkind == RELKIND_PARTITIONED_TABLE ||
+				relkind == RELKIND_TOASTVALUE)
 			{
 				/* ri_RowIdAttNo refers to a ctid attribute */
 				Assert(AttributeNumberIsValid(resultRelInfo->ri_RowIdAttNo));
@@ -2825,7 +2826,8 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			relkind = resultRelInfo->ri_RelationDesc->rd_rel->relkind;
 			if (relkind == RELKIND_RELATION ||
 				relkind == RELKIND_MATVIEW ||
-				relkind == RELKIND_PARTITIONED_TABLE)
+				relkind == RELKIND_PARTITIONED_TABLE ||
+				relkind == RELKIND_TOASTVALUE)
 			{
 				resultRelInfo->ri_RowIdAttNo =
 					ExecFindJunkAttributeInTlist(subplan->targetlist, "ctid");
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index af46f581ac..43c81d9ed7 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -866,7 +866,9 @@ add_row_identity_columns(PlannerInfo *root, Index rtindex,
 
 	if (relkind == RELKIND_RELATION ||
 		relkind == RELKIND_MATVIEW ||
-		relkind == RELKIND_PARTITIONED_TABLE)
+		relkind == RELKIND_PARTITIONED_TABLE ||
+		relkind == RELKIND_TOASTVALUE
+		)
 	{
 		/*
 		 * Emit CTID so that executor can find the row to update or delete.
diff --git a/src/bin/pg_amcheck/t/004_verify_heapam.pl b/src/bin/pg_amcheck/t/004_verify_heapam.pl
index b842f7bc6d..4d66eb64fe 100644
--- a/src/bin/pg_amcheck/t/004_verify_heapam.pl
+++ b/src/bin/pg_amcheck/t/004_verify_heapam.pl
@@ -225,7 +225,7 @@ my $rel = $node->safe_psql('postgres', qq(SELECT pg_relation_filepath('public.te
 my $relpath = "$pgdata/$rel";
 
 # Insert data and freeze public.test
-use constant ROWCOUNT => 16;
+use constant ROWCOUNT => 100;
 $node->safe_psql('postgres', qq(
 	INSERT INTO public.test (a, b, c)
 		VALUES (
@@ -241,6 +241,13 @@ my $relfrozenxid = $node->safe_psql('postgres',
 my $datfrozenxid = $node->safe_psql('postgres',
 	q(select datfrozenxid from pg_database where datname = 'postgres'));
 
+# Find our toast relation name
+my $toastname = $node->safe_psql('postgres', qq(
+	SELECT c.reltoastrelid::regclass
+		FROM pg_catalog.pg_class c
+		WHERE c.oid = 'public.test'::regclass
+		));
+
 # Sanity check that our 'test' table has a relfrozenxid newer than the
 # datfrozenxid for the database, and that the datfrozenxid is greater than the
 # first normal xid.  We rely on these invariants in some of our tests.
@@ -297,7 +304,7 @@ close($file)
 $node->start;
 
 # Ok, Xids and page layout look ok.  We can run corruption tests.
-plan tests => 19;
+plan tests => 38;
 
 # Check that pg_amcheck runs against the uncorrupted table without error.
 $node->command_ok(['pg_amcheck', '-p', $port, 'postgres'],
@@ -339,11 +346,12 @@ sub header
 # performing any remaining checks, so we can't exercise the system properly if
 # we focus all our corruption on a single tuple.
 #
-my @expected;
+my (@expected, @corruptions);
 open($file, '+<', $relpath)
 	or BAIL_OUT("open failed: $!");
 binmode $file;
 
+my $value_id_to_corrupt;
 for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 {
 	my $offnum = $tupidx + 1;  # offnum is 1-based, not zero-based
@@ -512,12 +520,69 @@ for (my $tupidx = 0; $tupidx < ROWCOUNT; $tupidx++)
 		push @expected,
 			qr/${header}multitransaction ID 4000000000 precedes relation minimum multitransaction ID threshold 1/;
 	}
+	elsif ($offnum == 16)
+	{
+		$value_id_to_corrupt = $tup->{c_va_valueid};
+		push (@corruptions, "UPDATE $toastname SET chunk_seq = chunk_seq + 1000 WHERE chunk_id = $value_id_to_corrupt");
+ 		$header = header(0, $offnum, 2);
+ 		push @expected,
+			qr/${header}toast value 16459 chunk 0 has sequence number 1000, but expected sequence number 0/,
+			qr/${header}toast value 16459 chunk 1 has sequence number 1001, but expected sequence number 1/,
+			qr/${header}toast value 16459 chunk 2 has sequence number 1002, but expected sequence number 2/,
+			qr/${header}toast value 16459 chunk 3 has sequence number 1003, but expected sequence number 3/,
+			qr/${header}toast value 16459 chunk 4 has sequence number 1004, but expected sequence number 4/,
+			qr/${header}toast value 16459 chunk 5 has sequence number 1005, but expected sequence number 5/;
+ 	}
+	elsif ($offnum == 17)
+	{
+		$value_id_to_corrupt = $tup->{c_va_valueid};
+		push (@corruptions, "UPDATE $toastname SET chunk_seq = chunk_seq * 1000 WHERE chunk_id = $value_id_to_corrupt");
+ 		$header = header(0, $offnum, 2);
+ 		push @expected,
+			qr/${header}toast value $value_id_to_corrupt chunk 1 has sequence number 1000, but expected sequence number 1/,
+			qr/${header}toast value $value_id_to_corrupt chunk 2 has sequence number 2000, but expected sequence number 2/,
+			qr/${header}toast value $value_id_to_corrupt chunk 3 has sequence number 3000, but expected sequence number 3/,
+			qr/${header}toast value $value_id_to_corrupt chunk 4 has sequence number 4000, but expected sequence number 4/,
+			qr/${header}toast value $value_id_to_corrupt chunk 5 has sequence number 5000, but expected sequence number 5/;
+
+	}
+	elsif ($offnum == 18)
+	{
+		$value_id_to_corrupt = $tup->{c_va_valueid};
+		push (@corruptions, "UPDATE $toastname SET chunk_id = (chunk_id::integer + 10000000)::oid WHERE chunk_id = $value_id_to_corrupt");
+ 		$header = header(0, $offnum, 2);
+ 		push @expected,
+			qr/${header}toast value $value_id_to_corrupt not found in toast table/;
+	}
+	elsif ($offnum == 19)
+	{
+		$value_id_to_corrupt = $tup->{c_va_valueid};
+		push (@corruptions, qq(
+INSERT INTO $toastname (chunk_id, chunk_seq, chunk_data)
+	(SELECT chunk_id,
+			10*chunk_seq + 1000,
+			chunk_data
+		FROM $toastname
+		WHERE chunk_id = $value_id_to_corrupt)
+));
+ 		$header = header(0, $offnum, 2);
+ 		push @expected,
+			qr/${header}toast value $value_id_to_corrupt chunk 6 has sequence number 1000, but expected sequence number 6/,
+			qr/${header}toast value $value_id_to_corrupt chunk 7 has sequence number 1010, but expected sequence number 7/,
+			qr/${header}toast value $value_id_to_corrupt chunk 8 has sequence number 1020, but expected sequence number 8/,
+			qr/${header}toast value $value_id_to_corrupt chunk 9 has sequence number 1030, but expected sequence number 9/,
+			qr/${header}toast value $value_id_to_corrupt chunk 10 has sequence number 1040, but expected sequence number 10/,
+			qr/${header}toast value $value_id_to_corrupt chunk 11 has sequence number 1050, but expected sequence number 11/,
+			qr/${header}toast value $value_id_to_corrupt was expected to end at chunk 6, but ended at chunk 12/;
+	}
 	write_tuple($file, $offset, $tup);
 }
 close($file)
 	or BAIL_OUT("close failed: $!");
 $node->start;
 
+$node->safe_psql('postgres', $_) for (@corruptions);
+
 # Run pg_amcheck against the corrupt table with epoch=0, comparing actual
 # corruption messages against the expected messages
 $node->command_checks_all(
-- 
2.21.1 (Apple Git-122.3)

#154 Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#152)
Re: pg_amcheck contrib application

On Apr 30, 2021, at 11:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Can you review this version?

It looks mostly good to me. There is an off-by-one error introduced with:

-   else if (chunkno != (endchunk + 1))
+   else if (expected_chunk_seq < last_chunk_seq)

I think that needs to be

+ else if (expected_chunk_seq <= last_chunk_seq)

because otherwise it won't complain if the only missing chunk is the very last one.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#155 Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#154)
1 attachment(s)
Re: pg_amcheck contrib application

On Fri, Apr 30, 2021 at 3:26 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

It looks mostly good to me. There is an off-by-one error introduced with:

-   else if (chunkno != (endchunk + 1))
+   else if (expected_chunk_seq < last_chunk_seq)

I think that needs to be

+ else if (expected_chunk_seq <= last_chunk_seq)

because otherwise it won't complain if the only missing chunk is the very last one.

OK, how about this version?

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

simply-remove-chunkno-concept-v4.patch (application/octet-stream)
From bfedd8f880d577c8395b2be8ceafbca7544c61a3 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 30 Apr 2021 15:28:46 -0400
Subject: [PATCH v4] amcheck: Improve some confusing reports about TOAST
 problems.

Don't phrase reports in terms of the number of tuples thus-far
returned by the index scan, but rather in terms of the chunk_seq
values found inside the tuples.

Patch by me, reviewed by Mark Dilger.
---
 contrib/amcheck/verify_heapam.c | 75 ++++++++++++++++++---------------
 1 file changed, 40 insertions(+), 35 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 9f159eb3db..c4d0cf164a 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -150,8 +150,8 @@ typedef struct HeapCheckContext
 static void sanity_check_relation(Relation rel);
 static void check_tuple(HeapCheckContext *ctx);
 static void check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
-							  ToastedAttribute *ta, int32 chunkno,
-							  int32 endchunk);
+							  ToastedAttribute *ta, int32 *expected_chunk_seq,
+							  uint32 extsize);
 
 static bool check_tuple_attribute(HeapCheckContext *ctx);
 static void check_toasted_attribute(HeapCheckContext *ctx,
@@ -1159,23 +1159,25 @@ check_tuple_visibility(HeapCheckContext *ctx)
  * each toast tuple being checked against where we are in the sequence, as well
  * as each toast tuple having its varlena structure sanity checked.
  *
- * Returns whether the toast tuple passed the corruption checks.
+ * On entry, *expected_chunk_seq should be the chunk_seq value that we expect
+ * to find in toasttup. On exit, it will be updated to the value the next call
+ * to this function should expect to see.
  */
 static void
 check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
-				  ToastedAttribute *ta, int32 chunkno, int32 endchunk)
+				  ToastedAttribute *ta, int32 *expected_chunk_seq,
+				  uint32 extsize)
 {
-	int32		curchunk;
+	int32		chunk_seq;
+	int32		last_chunk_seq = (extsize - 1) / TOAST_MAX_CHUNK_SIZE;
 	Pointer		chunk;
 	bool		isnull;
 	int32		chunksize;
 	int32		expected_size;
 
-	/*
-	 * Have a chunk, extract the sequence number and the data
-	 */
-	curchunk = DatumGetInt32(fastgetattr(toasttup, 2,
-										 ctx->toast_rel->rd_att, &isnull));
+	/* Sanity-check the sequence number. */
+	chunk_seq = DatumGetInt32(fastgetattr(toasttup, 2,
+										  ctx->toast_rel->rd_att, &isnull));
 	if (isnull)
 	{
 		report_toast_corruption(ctx, ta,
@@ -1183,13 +1185,25 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 										 ta->toast_pointer.va_valueid));
 		return;
 	}
+	if (chunk_seq != *expected_chunk_seq)
+	{
+		/* Either the TOAST index is corrupt, or we don't have all chunks. */
+		report_toast_corruption(ctx, ta,
+								psprintf("toast value %u index scan returned chunk %d when expecting chunk %d",
+										 ta->toast_pointer.va_valueid,
+										 chunk_seq, *expected_chunk_seq));
+	}
+	*expected_chunk_seq = chunk_seq + 1;
+
+	/* Sanity-check the chunk data. */
 	chunk = DatumGetPointer(fastgetattr(toasttup, 3,
 										ctx->toast_rel->rd_att, &isnull));
 	if (isnull)
 	{
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u chunk %d has null data",
-										 ta->toast_pointer.va_valueid, chunkno));
+										 ta->toast_pointer.va_valueid,
+										 chunk_seq));
 		return;
 	}
 	if (!VARATT_IS_EXTENDED(chunk))
@@ -1209,40 +1223,31 @@ check_toast_tuple(HeapTuple toasttup, HeapCheckContext *ctx,
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u chunk %d has invalid varlena header %0x",
 										 ta->toast_pointer.va_valueid,
-										 chunkno, header));
+										 chunk_seq, header));
 		return;
 	}
 
 	/*
 	 * Some checks on the data we've found
 	 */
-	if (curchunk != chunkno)
-	{
-		report_toast_corruption(ctx, ta,
-								psprintf("toast value %u chunk %d has sequence number %d, but expected sequence number %d",
-										 ta->toast_pointer.va_valueid,
-										 chunkno, curchunk, chunkno));
-		return;
-	}
-	if (chunkno > endchunk)
+	if (chunk_seq > last_chunk_seq)
 	{
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u chunk %d follows last expected chunk %d",
 										 ta->toast_pointer.va_valueid,
-										 chunkno, endchunk));
+										 chunk_seq, last_chunk_seq));
 		return;
 	}
 
-	expected_size = curchunk < endchunk ? TOAST_MAX_CHUNK_SIZE
-		: VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer) - (endchunk * TOAST_MAX_CHUNK_SIZE);
+	expected_size = chunk_seq < last_chunk_seq ? TOAST_MAX_CHUNK_SIZE
+		: extsize - (last_chunk_seq * TOAST_MAX_CHUNK_SIZE);
 
 	if (chunksize != expected_size)
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u chunk %d has size %u, but expected size %u",
 										 ta->toast_pointer.va_valueid,
-										 chunkno, chunksize, expected_size));
+										 chunk_seq, chunksize, expected_size));
 }
-
 /*
  * Check the current attribute as tracked in ctx, recording any corruption
  * found in ctx->tupstore.
@@ -1436,10 +1441,12 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 	SysScanDesc toastscan;
 	bool		found_toasttup;
 	HeapTuple	toasttup;
-	int32		chunkno;
-	int32		endchunk;
+	uint32		extsize;
+	int32		expected_chunk_seq = 0;
+	int32		last_chunk_seq;
 
-	endchunk = (VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer) - 1) / TOAST_MAX_CHUNK_SIZE;
+	extsize = VARATT_EXTERNAL_GET_EXTSIZE(ta->toast_pointer);
+	last_chunk_seq = (extsize - 1) / TOAST_MAX_CHUNK_SIZE;
 
 	/*
 	 * Setup a scan key to find chunks in toast table with matching va_valueid
@@ -1458,15 +1465,13 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 										   ctx->valid_toast_index,
 										   &SnapshotToast, 1,
 										   &toastkey);
-	chunkno = 0;
 	found_toasttup = false;
 	while ((toasttup =
 			systable_getnext_ordered(toastscan,
 									 ForwardScanDirection)) != NULL)
 	{
 		found_toasttup = true;
-		check_toast_tuple(toasttup, ctx, ta, chunkno, endchunk);
-		chunkno++;
+		check_toast_tuple(toasttup, ctx, ta, &expected_chunk_seq, extsize);
 	}
 	systable_endscan_ordered(toastscan);
 
@@ -1474,11 +1479,11 @@ check_toasted_attribute(HeapCheckContext *ctx, ToastedAttribute *ta)
 		report_toast_corruption(ctx, ta,
 								psprintf("toast value %u not found in toast table",
 										 ta->toast_pointer.va_valueid));
-	else if (chunkno != (endchunk + 1))
+	else if (expected_chunk_seq <= last_chunk_seq)
 		report_toast_corruption(ctx, ta,
-								psprintf("toast value %u was expected to end at chunk %d, but ended at chunk %d",
+								psprintf("toast value %u index scan ended early while expecting chunk %d of %d",
 										 ta->toast_pointer.va_valueid,
-										 (endchunk + 1), chunkno));
+										 expected_chunk_seq, last_chunk_seq));
 }
 
 /*
-- 
2.24.3 (Apple Git-128)

#156 Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#155)
Re: pg_amcheck contrib application

On Apr 30, 2021, at 12:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:

OK, how about this version?

I think that's committable.

The only nitpick might be

-                               psprintf("toast value %u was expected to end at chunk %d, but ended at chunk %d",
+                               psprintf("toast value %u index scan ended early while expecting chunk %d of %d",

When reporting to users about positions within a zero-based indexing scheme, what does "while expecting chunk 3 of 4" mean? Is it talking about the last chunk from the set [0..3] which has cardinality 4, or does it mean the next-to-last chunk from [0..4] which ends with chunk 4, or what? The prior language isn't any more clear than what you have here, so I have no objection to committing this, but the prior language was probably as goofy as it was because it was trying to deal with this issue.

Thoughts?


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#157 Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#156)
Re: pg_amcheck contrib application

On Fri, Apr 30, 2021 at 3:41 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

I think that's committable.

The only nitpick might be

-                               psprintf("toast value %u was expected to end at chunk %d, but ended at chunk %d",
+                               psprintf("toast value %u index scan ended early while expecting chunk %d of %d",

When reporting to users about positions within a zero-based indexing scheme, what does "while expecting chunk 3 of 4" mean? Is it talking about the last chunk from the set [0..3] which has cardinality 4, or does it mean the next-to-last chunk from [0..4] which ends with chunk 4, or what? The prior language isn't any more clear than what you have here, so I have no objection to committing this, but the prior language was probably as goofy as it was because it was trying to deal with this issue.

Hmm, I think that might need adjustment, actually. What I was trying
to do is compensate for the fact that what we now have is the next
chunk_seq value we expect, not the last one we saw, nor the total
number of chunks we've seen regardless of what chunk_seq they had. But
I thought it would be too confusing to just give the chunk number we
were expecting and not say anything about how many chunks we thought
there would be in total. So maybe what I should do is change it to
something like this:

toast value %u was expected to end at chunk %d, but ended while
expecting chunk %d

i.e. same as the currently-committed code, except for changing "ended
at" to "ended while expecting."

--
Robert Haas
EDB: http://www.enterprisedb.com

#158 Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Robert Haas (#157)
Re: pg_amcheck contrib application

On Apr 30, 2021, at 12:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Hmm, I think that might need adjustment, actually. What I was trying
to do is compensate for the fact that what we now have is the next
chunk_seq value we expect, not the last one we saw, nor the total
number of chunks we've seen regardless of what chunk_seq they had. But
I thought it would be too confusing to just give the chunk number we
were expecting and not say anything about how many chunks we thought
there would be in total. So maybe what I should do is change it to
something like this:

toast value %u was expected to end at chunk %d, but ended while
expecting chunk %d

i.e. same as the currently-committed code, except for changing "ended
at" to "ended while expecting."

I find the grammar of this new formulation anomalous, for hard-to-articulate reasons not quite the same as but akin to mismatched verb aspect.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#159 Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Mark Dilger (#158)
Re: pg_amcheck contrib application

On Apr 30, 2021, at 1:04 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

toast value %u was expected to end at chunk %d, but ended while
expecting chunk %d

i.e. same as the currently-committed code, except for changing "ended
at" to "ended while expecting."

I find the grammar of this new formulation anomalous, for hard-to-articulate reasons not quite the same as but akin to mismatched verb aspect.

After further reflection, no other verbiage seems any better. I'd say go ahead and commit it this way.


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#160 Robert Haas
robertmhaas@gmail.com
In reply to: Mark Dilger (#159)
Re: pg_amcheck contrib application

On Fri, Apr 30, 2021 at 4:26 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

After further reflection, no other verbiage seems any better. I'd say go ahead and commit it this way.

OK. I'll plan to do that on Monday, barring objections.

--
Robert Haas
EDB: http://www.enterprisedb.com

#161 Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#160)
Re: pg_amcheck contrib application

On Fri, Apr 30, 2021 at 5:07 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Apr 30, 2021 at 4:26 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:

After further reflection, no other verbiage seems any better. I'd say go ahead and commit it this way.

OK. I'll plan to do that on Monday, barring objections.

Done now.

--
Robert Haas
EDB: http://www.enterprisedb.com