Online checksums verification in the backend

Started by Julien Rouhaud about 6 years ago (84 messages)

#1 Julien Rouhaud <rjuju123@gmail.com>
1 attachment

Hi,

This topic was discussed several times, with the most recent
discussions found at [1] and [2]. Based on those discussions, my
understanding is that the current approach in BASE_BACKUP has too many
drawbacks and we should instead do this check in the backend. I've
been working on such an approach at VMware, and I'm submitting it here
to discuss the approach and rationale, and hopefully get such a
feature integrated.

First, this was originally developed as an extension. It means that
the check is performed using an SRF. That's maybe not the best
approach, as a transaction has to be kept open for the total processing
time. This can be mitigated by checking each relation independently,
but that's still not ideal. Maybe using some utility command (as part of
VACUUM or a new CHECK command for instance) would be a better
approach.

This brings the second consideration: how to report the list of
corrupted blocks to end users. As I said, this is for now returned via
the SRF,
but this is clearly not ideal and should rather be made available more
globally. One usage of this information could be block level
recovery. I'm Cc-ing Sawada-san, as I know he's working on this and
mentioned to me that he has ideas on passing the list of corrupted
blocks through the stats collector.

Finally, the read and locking considerations. I tried to cover these
extensively in the comments, but here are some details on how I tried
to make the check safe while keeping the overhead as low as
possible. First, only buffered reads are performed,
without any attempt to discard OS cache. Therefore, any discrepancy
between the OS cache and the disk cannot be detected unless you do
other actions, such as sync / drop_caches on GNU/Linux.
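For instance, on GNU/Linux the page cache can be dropped like this (as
root; this assumes the usual procfs interface and is only a sketch, not
part of the patch):

```
# flush dirty pages to disk, then ask the kernel to drop clean
# page cache entries (requires root)
sync
echo 1 > /proc/sys/vm/drop_caches
```

After that, a subsequent check will actually read the blocks from
storage rather than from the OS cache.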

An access share lock on the currently checked relation is held,
meaning that it can't get deleted/truncated. The total number of
blocks for the given fork is retrieved first, so any new block will be
ignored. Such new blocks are considered out of scope as being written
after the start of the check.

Each time a buffer is being checked, the target buffer mapping
partition lock is acquired in shared mode, to prevent concurrent
eviction. If the buffer is found in shared buffers, it's pinned and
released immediately, just to get the state. If the buffer is found
dirty, no check is performed as it'll be written to disk by the
checkpointer, or during recovery in case of unclean shutdown.
Otherwise, an IO lock is held while the buffer is being read into a
private buffer. The IO lock and buffer mapping lock are then released
and the check is performed.

If the buffer is not found in shared buffers, the buffer mapping
partition lock is released immediately and the block is read from
disk. It's therefore possible to get a false positive here, as the
block could be concurrently read, modified and partially written to
disk. So, if an error is detected in this case, the check is
restarted from scratch and if the buffer is still not found in shared
buffers, the read will be done while still holding the buffer mapping
partition lock to make sure that it can't get concurrently loaded and
modified. This is an optimistic approach to avoid performance
overhead, assuming that there shouldn't be many positives, and that
the window for false positives is very narrow.

The check consists of simply comparing the stored and computed
checksum, with an additional check that the page is really new (using
PageIsVerified) if it's found as PageIsNew(). Since this is done
after releasing all locks, we could definitely add more checks without
causing much overhead, like pd_lower/pd_upper sanity checks. I prefer to
keep the check simple for now and rather focus on the general
approach.

Finally, I also reused the vacuum costing GUCs (for simplicity) and
approach to add some throttling.
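As a sketch, throttling could then be driven from the session using the
existing vacuum cost settings (GUC names are the current vacuum ones;
the function name is the one proposed below):

```sql
-- Reuse the vacuum cost-based delay GUCs to throttle the check
SET vacuum_cost_delay = '10ms';  -- nap once the cost limit is reached
SET vacuum_cost_limit = 200;     -- accumulated cost that triggers a nap
SELECT * FROM pg_check_relation('some_table');
```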

I'm attaching a patch that adds a new pg_check_relation() SQL function
to perform a check of one or all relations, and some simple regression
tests.
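For illustration, a session using the attached function could look like
this ('t1' is a hypothetical table name; the output columns are the ones
defined by the patch):

```sql
-- Check a single relation (all forks), or pass a fork name as a
-- second argument
SELECT * FROM pg_check_relation('t1');

-- Check every relation having storage in the current database
SELECT relid::regclass, forknum, failed_blocknum,
       expected_checksum, found_checksum
  FROM pg_check_relation();
```

An empty result set means no checksum failure was detected.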

[1]: /messages/by-id/1532606373.3422.5.camel@credativ.de
[2]: /messages/by-id/20190326170820.6sylklg7eh6uhabd@alap3.anarazel.de

Attachments:

online_checksum_verification-v1.diff (application/octet-stream)
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f7800f01a6..6855380057 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1332,6 +1332,13 @@ LANGUAGE INTERNAL
 STRICT STABLE PARALLEL SAFE
 AS 'jsonb_path_query_first_tz';
 
+CREATE OR REPLACE FUNCTION pg_check_relation(
+  IN relation regclass DEFAULT NULL::regclass, IN fork text DEFAULT NULL::text,
+  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+  OUT expected_checksum integer, OUT found_checksum integer)
+  RETURNS SETOF record VOLATILE LANGUAGE internal AS 'pg_check_relation'
+  PARALLEL RESTRICTED;
+
 --
 -- The default permissions for functions mean that anyone can execute them.
 -- A number of functions shouldn't be executable by just anyone, but rather
diff --git a/src/backend/storage/page/checksum.c b/src/backend/storage/page/checksum.c
index 8f7a110e81..8bfaaae740 100644
--- a/src/backend/storage/page/checksum.c
+++ b/src/backend/storage/page/checksum.c
@@ -15,8 +15,530 @@
 
 #include "storage/checksum.h"
 /*
- * The actual code is in storage/checksum_impl.h.  This is done so that
- * external programs can incorporate the checksum code by #include'ing
- * that file from the exported Postgres headers.  (Compare our CRC code.)
+ * The actual checksum computation code is in storage/checksum_impl.h.  This
+ * is done so that external programs can incorporate the checksum code by
+ * #include'ing that file from the exported Postgres headers.  (Compare our
+ * CRC code.)
  */
 #include "storage/checksum_impl.h"
+
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "catalog/pg_authid_d.h"
+#include "commands/dbcommands.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/buf_internals.h"
+#include "storage/checksum.h"
+#include "storage/lockdefs.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/lsyscache.h"
+#include "utils/rel.h"
+#include "utils/syscache.h"
+
+/* A zero checksum can never be computed; see pg_checksum_page(). */
+#define NoComputedChecksum	0
+
+/*
+ * The rest of this module provides a set of functions that can be used to
+ * safely check all checksums on a running cluster.
+ *
+ * Please note that those only perform standard buffered reads, and don't try
+ * to bypass or discard the operating system cache.  If you want to check the
+ * actual storage, you have to discard the operating system cache before
+ * running those functions.
+ *
+ * To avoid torn pages and possible false positives when reading data, while
+ * keeping overhead as low as possible, the following heuristics are used:
+ *
+ * - a shared LWLock is taken on the target buffer pool partition mapping, and
+ *   we detect if a block is in shared_buffers or not.  See get_buffer()
+ *   comments for more details about the locking strategy.
+ *
+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
+ *
+ * - if a block is otherwise found in shared_buffers, an IO lock is taken on
+ *   the block and the block is then read from storage, ignoring the block in
+ *   shared_buffers
+ *
+ * - if a block is not found in shared_buffers, the LWLock is released and the
+ *   block is read from disk without taking any lock.  If an error is detected,
+ *   the read block will be discarded and retrieved again while holding the
+ *   LWLock.  This is because an error due to a concurrent write is possible
+ *   but very unlikely, so it's better to have an optimistic approach to limit
+ *   locking overhead
+ *
+ * The check can be performed using an SQL function, returning the list of
+ * problematic blocks.
+ *
+ * Vacuum's GUCs are used to avoid consuming too many resources while running
+ * this tool.
+ */
+
+static void check_all_relations(TupleDesc tupdesc, Tuplestorestate *tupstore,
+									  ForkNumber forknum);
+static bool check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+							   uint16 *chk_found);
+static void check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+								 Oid relid, ForkNumber single_forknum);
+static void check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+									  Relation relation, ForkNumber forknum);
+static void check_delay_point(void);
+static void check_get_buffer(Relation relation, ForkNumber forknum,
+							 BlockNumber blkno, char *buffer, bool needlock,
+							 bool *checkit, bool *found_in_sb);
+
+/*
+ * Iterate over all relations having physical storage in the current database
+ * and perform a check on each of them.  Note that this function can run for
+ * a very long time, and as such can cause bloat or other issues.  Clients
+ * should iterate over a local list of OIDs using the per-table mode if
+ * keeping a transaction open for very long is a concern.
+ */
+static void
+check_all_relations(TupleDesc tupdesc, Tuplestorestate *tupstore,
+						  ForkNumber forknum)
+{
+	List	   *relids = NIL;
+	ListCell   *lc;
+	Relation	relation;
+	TableScanDesc scan;
+	HeapTuple	tuple;
+
+	relation = table_open(RelationRelationId, AccessShareLock);
+	scan = table_beginscan_catalog(relation, 0, NULL);
+
+	while (HeapTupleIsValid(tuple = heap_getnext(scan, ForwardScanDirection)))
+	{
+		Form_pg_class class = (Form_pg_class) GETSTRUCT(tuple);
+
+		if (!RELKIND_HAS_STORAGE(class->relkind))
+			continue;
+
+		relids = lappend_oid(relids, class->oid);
+	}
+
+	table_endscan(scan);
+	table_close(relation, AccessShareLock);	/* release lock */
+
+	foreach(lc, relids)
+	{
+		Oid			relid = lfirst_oid(lc);
+
+		check_one_relation(tupdesc, tupstore, relid, forknum);
+	}
+}
+
+/*
+ * Perform a checksum check on the passed page.  Returns whether the page is
+ * valid or not, and assigns the expected and found checksums to chk_expected
+ * and chk_found, respectively.  Note that a page can look new but could be
+ * the result of corruption.  We still check for this case, but we can't
+ * compute its checksum, as pg_checksum_page() explicitly refuses new pages;
+ * NoComputedChecksum will be set in chk_expected instead.
+ */
+static bool
+check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+				   uint16 *chk_found)
+{
+	PageHeader	hdr = (PageHeader) buffer;
+
+	Assert(chk_expected && chk_found);
+
+	if (PageIsNew(hdr))
+	{
+		if (PageIsVerified(buffer, blkno))
+		{
+			/*
+			 * If the page is really new, there won't be any checksum to
+			 * compute or expect.
+			 */
+			*chk_expected = *chk_found = NoComputedChecksum;
+			return true;
+		}
+		else
+		{
+			/*
+			 * There's corruption, but since it affects PageIsNew(), we
+			 * can't compute a checksum, so set NoComputedChecksum for the
+			 * expected checksum.
+			 */
+			*chk_expected = NoComputedChecksum;
+			*chk_found = hdr->pd_checksum;
+		}
+		return false;
+	}
+
+	*chk_expected = pg_checksum_page(buffer, blkno);
+	*chk_found = hdr->pd_checksum;
+
+	return (*chk_expected == *chk_found);
+}
+
+/*
+ * Perform the check on a single relation, possibly filtered with a single
+ * fork.  This function will check if the given relation exists or not, as
+ * a relation could be dropped after checking for the list of relations and
+ * before getting here, and we don't want to error out in this case.
+ */
+static void
+check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+					 Oid relid, ForkNumber single_forknum)
+{
+	Relation	relation = NULL;
+	ForkNumber	forknum;
+
+	/* Check if the relation (still) exists */
+	if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
+	{
+		/*
+		 * We use a standard relation_open() to acquire the initial lock.  It
+		 * means that this will block until the lock is acquired, or will
+		 * raise an ERROR if lock_timeout has been set.  If caller wants to
+		 * check multiple tables while relying on a maximum wait time, it
+		 * should process tables one by one instead of relying on a global
+		 * processing with the main SRF.
+		 */
+		relation = relation_open(relid, AccessShareLock);
+	}
+
+	if (!RelationIsValid(relation))
+	{
+		elog(WARNING, "relation %u does not exist in database \"%s\"",
+			 relid, get_database_name(MyDatabaseId));
+		return;
+	}
+
+	if (!RELKIND_HAS_STORAGE(relation->rd_rel->relkind))
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("relation \"%s\" does not have storage to be checked",
+				 quote_qualified_identifier(
+					 get_namespace_name(get_rel_namespace(relid)),
+					 get_rel_name(relid)))));
+
+	RelationOpenSmgr(relation);
+
+	for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+	{
+		if (single_forknum != InvalidForkNumber && single_forknum != forknum)
+			continue;
+
+		if (smgrexists(relation->rd_smgr, forknum))
+			check_relation_fork(tupdesc, tupstore, relation, forknum);
+	}
+
+	relation_close(relation, AccessShareLock);	/* release the lock */
+}
+
+/*
+ * For a given relation and fork, do the real work of iterating over all pages
+ * and doing the check.  Caller must hold an AccessShareLock on the given
+ * relation.
+ */
+static void
+check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+						  Relation relation, ForkNumber forknum)
+{
+	BlockNumber blkno,
+				nblocks;
+	char		buffer[BLCKSZ];
+	bool		checkit,
+				found_in_sb;
+
+	/*
+	 * We remember the number of blocks here.  Since caller must hold a lock on
+	 * the relation, we know that it won't be truncated while we're iterating
+	 * over the blocks.  Any block added after this function started won't be
+	 * checked, but this is out of scope as such pages will be flushed before
+	 * the next checkpoint's completion.
+	 */
+	nblocks = RelationGetNumberOfBlocksInFork(relation, forknum);
+
+#define CRF_COLS			5	/* Number of output arguments in the SRF */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		bool		force_lock = false;
+		uint16		chk_expected,
+					chk_found;
+		Datum		values[CRF_COLS];
+		bool		nulls[CRF_COLS];
+		int			i = 0;
+
+		memset(values, 0, sizeof(values));
+		memset(nulls, 0, sizeof(nulls));
+
+		/*
+		 * To avoid too much overhead, the buffer will be first read without
+		 * the locks that would guarantee the lack of false positive, as such
+		 * events should be quite rare.
+		 */
+Retry:
+		check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+						 &checkit, &found_in_sb);
+
+		if (!checkit)
+			continue;
+
+		if (check_buffer(buffer, blkno, &chk_expected, &chk_found))
+			continue;
+
+		/*
+		 * If we get a failure and the buffer wasn't found in shared buffers,
+		 * reread the buffer with suitable lock to avoid false positive.  See
+		 * check_get_buffer for more details.
+		 */
+		if (!found_in_sb && !force_lock)
+		{
+			force_lock = true;
+			goto Retry;
+		}
+
+		values[i++] = ObjectIdGetDatum(relation->rd_id);
+		values[i++] = Int32GetDatum(forknum);
+		values[i++] = UInt32GetDatum(blkno);
+		/*
+		 * This can happen if corruption makes the block appear to be
+		 * PageIsNew() when it isn't really a new page.
+		 */
+		if (chk_expected == NoComputedChecksum)
+			nulls[i++] = true;
+		else
+			values[i++] = UInt16GetDatum(chk_expected);
+		values[i++] = UInt16GetDatum(chk_found);
+
+		Assert(i == CRF_COLS);
+
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+}
+
+/*
+ * Check for interrupts and cost-based delay.
+ */
+static void
+check_delay_point(void)
+{
+	/* Always check for interrupts */
+	CHECK_FOR_INTERRUPTS();
+
+	/* Nap if appropriate */
+	if (!InterruptPending && VacuumCostBalance >= VacuumCostLimit)
+	{
+		int			msec;
+
+		msec = VacuumCostDelay * VacuumCostBalance / VacuumCostLimit;
+		if (msec > VacuumCostDelay * 4)
+			msec = VacuumCostDelay * 4;
+
+		pg_usleep(msec * 1000L);
+
+		VacuumCostBalance = 0;
+
+		/* Might have gotten an interrupt while sleeping */
+		CHECK_FOR_INTERRUPTS();
+	}
+}
+
+/*
+ * Safely read the wanted buffer from disk, dealing with possible concurrency
+ * issues.  Note that if a buffer is found dirty in shared_buffers, no read
+ * will be performed and the caller will be informed that no check should be
+ * done.  We can safely ignore such buffers as they'll be written before the
+ * next checkpoint's completion.
+ *
+ * The following locks can be used in this function:
+ *
+ *   - shared LWLock on the target buffer pool partition mapping.
+ *   - IOLock on the buffer
+ *
+ * The IOLock is taken when reading the buffer from disk if it exists in
+ * shared_buffers, to avoid torn pages.
+ *
+ * If the buffer isn't in shared_buffers, it'll be read from disk without any
+ * lock unless the caller asked otherwise by setting needlock.  In that case,
+ * the read will be done while the buffer mapping partition LWLock is still
+ * held.  Reading with this lock avoids the unlikely but possible case where
+ * a buffer wasn't present in shared buffers when we checked but was then
+ * allocated in shared_buffers, modified and flushed concurrently before we
+ * later try to read it, leading to a false positive due to a torn page.
+ * The caller can first read the buffer without holding the target buffer
+ * mapping partition LWLock as an optimistic approach, and reread the
+ * buffer from disk in case of error.
+ *
+ * Caller should hold an AccessShareLock on the Relation
+ */
+static void
+check_get_buffer(Relation relation, ForkNumber forknum,
+				 BlockNumber blkno, char *buffer, bool needlock, bool *checkit,
+				 bool *found_in_sb)
+{
+	BufferTag	buf_tag;		/* identity of requested block */
+	uint32		buf_hash;		/* hash value for buf_tag */
+	LWLock	   *partLock;		/* buffer partition lock for the buffer */
+	BufferDesc *bufdesc;
+	int			buf_id;
+
+	*checkit = true;
+	*found_in_sb = false;
+
+	/* Check for interrupts and take throttling into account. */
+	check_delay_point();
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(buf_tag, relation->rd_smgr->smgr_rnode.node, forknum, blkno);
+
+	/* determine its hash code and partition lock ID */
+	buf_hash = BufTableHashCode(&buf_tag);
+	partLock = BufMappingPartitionLock(buf_hash);
+
+	/* see if the block is in the buffer pool already */
+	LWLockAcquire(partLock, LW_SHARED);
+	buf_id = BufTableLookup(&buf_tag, buf_hash);
+	if (buf_id >= 0)
+	{
+		uint32		buf_state;
+
+		*found_in_sb = true;
+
+		/*
+		 * Found it.  Now, retrieve its state to know what to do with it,
+		 * releasing the buffer header spinlock immediately to limit overhead
+		 * as much as possible.  We'll keep the shared lightweight lock on the
+		 * buffer mapping partition, so this buffer can't be evicted, and
+		 * we'll acquire an IOLock on the buffer if we need to read the
+		 * content on disk.
+		 */
+		bufdesc = GetBufferDescriptor(buf_id);
+
+		buf_state = LockBufHdr(bufdesc);
+		UnlockBufHdr(bufdesc, buf_state);
+
+		/* Dirty pages are ignored as they'll be flushed soon. */
+		if (buf_state & BM_DIRTY)
+			*checkit = false;
+
+		/*
+		 * Read the buffer from disk, taking an IO lock to prevent torn-page
+		 * reads, in the unlikely event that it was concurrently dirtied and
+		 * flushed.
+		 */
+		if (*checkit)
+		{
+			LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
+			smgrread(relation->rd_smgr, forknum, blkno, buffer);
+			LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
+
+			/*
+			 * Add a page miss cost, as we're always reading outside the
+			 * shared buffers.
+			 */
+			VacuumCostBalance += VacuumCostPageMiss;
+		}
+	}
+	else if (needlock)
+	{
+		/*
+		 * Caller asked to read the buffer while we have a lock on the target
+		 * partition.
+		 */
+		smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+		/*
+		 * Add a page miss cost, as we're always reading outside the shared
+		 * buffers.
+		 */
+		VacuumCostBalance += VacuumCostPageMiss;
+	}
+
+	LWLockRelease(partLock);
+
+	if (*found_in_sb || needlock)
+		return;
+
+	/*
+	 * Didn't find it in the buffer pool and didn't read it while holding the
+	 * buffer mapping partition lock.  We'll have to try to read it from
+	 * disk, after releasing the target partition lock to avoid too much
+	 * overhead.  It means that it's possible to get a torn page later, so
+	 * we'll have to retry with a suitable lock in case of error to avoid
+	 * false positive.
+	 */
+	smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+	/*
+	 * Add a page miss cost, as we're always reading outside the shared
+	 * buffers.
+	 */
+	VacuumCostBalance += VacuumCostPageMiss;
+}
+
+Datum
+pg_check_relation(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	ForkNumber	forknum = InvalidForkNumber;
+	Oid			relid = InvalidOid;
+	const char *forkname;
+
+	if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("only superuser or a member of the pg_read_server_files role may use this function")));
+
+	if (!DataChecksumsEnabled())
+		elog(ERROR, "data checksums are not enabled");
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not " \
+						"allowed in this context")));
+
+	/* Switch into long-lived context to construct returned data structures */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (PG_NARGS() >= 1 && !PG_ARGISNULL(0))
+		relid = PG_GETARG_OID(0);
+
+	if (PG_NARGS() == 2 && !PG_ARGISNULL(1))
+	{
+		forkname = TextDatumGetCString(PG_GETARG_TEXT_PP(1));
+		forknum = forkname_to_number(forkname);
+	}
+
+	VacuumCostBalance = 0;
+
+	if (OidIsValid(relid))
+		check_one_relation(tupdesc, tupstore, relid, forknum);
+	else
+		check_all_relations(tupdesc, tupstore, forknum);
+
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index ac8f64b219..6a1c760b15 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10703,6 +10703,14 @@
   proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
   proargnames => '{tablespace,name,size,modification}',
   prosrc => 'pg_ls_tmpdir_1arg' },
+{ oid => '9147', descr => 'check data integrity for one or all relations',
+  proname => 'pg_check_relation', proisstrict => 'f', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,i,o,o,o,o,o}',
+  proargnames => '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation' },
 
 # hash partitioning constraint function
 { oid => '5028', descr => 'hash partition CHECK constraint',
diff --git a/src/test/Makefile b/src/test/Makefile
index efb206aa75..832868099f 100644
--- a/src/test/Makefile
+++ b/src/test/Makefile
@@ -12,7 +12,8 @@ subdir = src/test
 top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS = perl regress isolation modules authentication recovery subscription
+SUBDIRS = perl regress isolation modules authentication check_relation \
+	  recovery subscription
 
 # Test suites that are not safe by default but can be run if selected
 # by the user via the whitespace-separated list in variable
diff --git a/src/test/check_relation/.gitignore b/src/test/check_relation/.gitignore
new file mode 100644
index 0000000000..871e943d50
--- /dev/null
+++ b/src/test/check_relation/.gitignore
@@ -0,0 +1,2 @@
+# Generated by test suite
+/tmp_check/
diff --git a/src/test/check_relation/Makefile b/src/test/check_relation/Makefile
new file mode 100644
index 0000000000..792098fa65
--- /dev/null
+++ b/src/test/check_relation/Makefile
@@ -0,0 +1,23 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/test/check_relation
+#
+# Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/test/check_relation/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/test/check_relation
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
+clean distclean maintainer-clean:
+	rm -rf tmp_check
diff --git a/src/test/check_relation/README b/src/test/check_relation/README
new file mode 100644
index 0000000000..415c4b21ad
--- /dev/null
+++ b/src/test/check_relation/README
@@ -0,0 +1,23 @@
+src/test/check_relation/README
+
+Regression tests for online checksums verification
+==================================================
+
+This directory contains a test suite for online checksums verification.
+
+Running the tests
+=================
+
+NOTE: You must have given the --enable-tap-tests argument to configure.
+
+Run
+    make check
+or
+    make installcheck
+You can use "make installcheck" if you previously did "make install".
+In that case, the code in the installation tree is tested.  With
+"make check", a temporary installation tree is built from the current
+sources and then tested.
+
+Either way, this test initializes, starts, and stops a test Postgres
+cluster.
diff --git a/src/test/check_relation/t/01_checksums_check.pl b/src/test/check_relation/t/01_checksums_check.pl
new file mode 100644
index 0000000000..e98385a677
--- /dev/null
+++ b/src/test/check_relation/t/01_checksums_check.pl
@@ -0,0 +1,250 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 41;
+
+our $CHECKSUM_UINT16_OFFSET = 4;
+our $PD_UPPER_UINT16_OFFSET = 7;
+our $BLOCKSIZE;
+
+sub get_block
+{
+	my ($filename, $blkno) = @_;
+	my $block;
+
+	open(my $infile, '<', $filename) or die;
+	binmode($infile);
+
+	# read()'s optional 4th argument is an offset into the target scalar,
+	# not a file position, so seek to the wanted block explicitly.
+	seek($infile, $blkno * $BLOCKSIZE, 0) or die($!);
+	my $success = read($infile, $block, $BLOCKSIZE);
+	die($!) if not defined $success;
+
+	close($infile);
+
+	return($block);
+}
+
+sub overwrite_block
+{
+	my ($filename, $block, $blkno) = @_;
+
+	# Open read-write rather than '>', which would truncate the file, and
+	# seek to the wanted block: syswrite()'s 4th argument is an offset into
+	# the scalar, not a file position.
+	open(my $outfile, '+<', $filename) or die;
+	binmode ($outfile);
+
+	sysseek($outfile, $blkno * $BLOCKSIZE, 0) // die($!);
+	my $nb = syswrite($outfile, $block, $BLOCKSIZE);
+
+	die($!) if not defined $nb;
+	die("Write error") if ($nb != $BLOCKSIZE);
+
+	$outfile->flush();
+
+	close($outfile);
+}
+
+sub get_uint16_from_page
+{
+	my ($block, $offset) = @_;
+
+	return (unpack("S*", $block))[$offset];
+}
+
+sub set_uint16_to_page
+{
+	my ($block, $data, $offset) = @_;
+
+	my $pack = pack("S", $data);
+
+	# vec with 16B or more won't preserve endianness
+	vec($block, 2*$offset, 8) = (unpack('C*', $pack))[0];
+	vec($block, (2*$offset) + 1, 8) = (unpack('C*', $pack))[1];
+
+	return $block;
+}
+
+sub check_checksums_call
+{
+	my ($node, $relname) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM pg_catalog.pg_check_relation('$relname')"
+	);
+
+	 return ($stderr eq '');
+}
+
+sub check_checksums_nb_error
+{
+	my ($node, $nb, $msg) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM pg_catalog.pg_check_relation()"
+	);
+
+	 is($stderr, '', 'Function should run successfully');
+	 is($stdout, $nb, $msg // "Should have $nb error(s)");
+}
+
+sub get_checksums_errors
+{
+	my ($node) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+		. " relid::regclass::text, forknum, failed_blocknum,"
+		. " expected_checksum, found_checksum"
+		. " FROM pg_catalog.pg_check_relation()"
+	);
+
+	 is($stderr, '', 'Function should run successfully');
+
+	 return $stdout;
+}
+
+# This function will perform various tests by modifying the specified block at
+# the specified uint16 offset, checking that the corruption is correctly
+# detected, and finally restoring the specified block to its original content.
+sub corrupt_and_test_block
+{
+	my ($node, $filename, $blkno, $offset, $fake_data) = @_;
+
+	check_checksums_nb_error($node, 0,
+		"There shouldn't be checksums errors before modifications");
+
+	$node->stop();
+
+	my $original_block = get_block($filename, $blkno);
+	my $original_data = get_uint16_from_page($original_block, $offset);
+
+	isnt($original_data, $fake_data,
+		"The fake data at offset $offset should be different"
+		. " from the existing one");
+
+	my $new_block = set_uint16_to_page($original_block, $fake_data, $offset);
+	isnt($original_data, get_uint16_from_page($new_block, $offset),
+		"The fake data at offset $offset should have been changed in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+
+	my $written_data = get_uint16_from_page(get_block($filename, $blkno), $offset);
+	isnt($original_data, $written_data,
+		"The data written at offset $offset should be different"
+		. " from the original one");
+	is(get_uint16_from_page($new_block, $offset), $written_data,
+		"The data written at offset $offset should be the same"
+		. " as the one in memory");
+	is($written_data, $fake_data,
+		"The data written at offset $offset should be the one"
+		. " we wanted to write");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 1, "There should be one checksum error");
+
+	my $expected_checksum;
+	my $found_checksum = get_uint16_from_page($new_block,
+		$CHECKSUM_UINT16_OFFSET);
+	if ($offset == $PD_UPPER_UINT16_OFFSET)
+	{
+		# A checksum can't be computed if it's detected as PageIsNew(), so the
+		# function returns NULL for the computed checksum
+		$expected_checksum = '';
+	}
+	else
+	{
+		$expected_checksum = get_uint16_from_page($original_block,
+			$CHECKSUM_UINT16_OFFSET);
+	}
+
+	my $det = get_checksums_errors($node);
+	is($det, "t1|0|0|$expected_checksum|$found_checksum",
+		"The checksums error for modification at offset $offset"
+		. " should be detected");
+
+	$node->stop();
+
+	$new_block = set_uint16_to_page($original_block, $original_data, $offset);
+	is($original_data, get_uint16_from_page($new_block, $offset),
+		"The data at offset $offset should have been restored in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+	is($original_data, get_uint16_from_page(get_block($filename, $blkno),
+			$offset),
+		"The data at offset $offset should have been restored on disk");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 0, "There shouldn't be checksums errors");
+}
+
+if (exists $ENV{MY_PG_REGRESS})
+{
+	$ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS};
+}
+
+my $node = get_new_node('main');
+
+my %params;
+$params{'extra'} = ['--data-checksums'];
+$node->init(%params);
+
+$node->start();
+
+$ENV{PGOPTIONS} = '--client-min-messages=WARNING';
+
+my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_checksums')");
+
+is($stdout, 'on', 'Data checksums should be enabled');
+
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('block_size')");
+
+$BLOCKSIZE = $stdout;
+
+$node->safe_psql(
+	'postgres', q|
+	CREATE TABLE public.t1(id integer);
+	CREATE INDEX t1_id_idx ON public.t1 (id);
+	INSERT INTO public.t1 SELECT generate_series(1, 100);
+	CREATE VIEW public.v1 AS SELECT * FROM t1;
+	CREATE MATERIALIZED VIEW public.mv1 AS SELECT * FROM t1;
+	CREATE SEQUENCE public.s1;
+	CREATE UNLOGGED TABLE public.u_t1(id integer);
+	CREATE INDEX u_t1_id_idx ON public.u_t1 (id);
+	INSERT INTO public.u_t1 SELECT generate_series(1, 100);
+	CHECKPOINT;
+|);
+
+# Check sane behavior on various object types, including those that don't have
+# storage.
+is(check_checksums_call($node, 't1'), '1', 'Can check a table');
+is(check_checksums_call($node, 't1_id_idx'), '1', 'Can check an index');
+is(check_checksums_call($node, 'v1'), '', 'Cannot check a view');
+is(check_checksums_call($node, 'mv1'), '1', 'Can check a materialized view');
+is(check_checksums_call($node, 's1'), '1', 'Can check a sequence');
+is(check_checksums_call($node, 'u_t1'), '1', 'Can check an unlogged table');
+is(check_checksums_call($node, 'u_t1_id_idx'), '1', 'Can check an unlogged index');
+
+# get the underlying heap absolute path
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_directory') || '/' || pg_relation_filepath('t1')"
+);
+
+isnt($stdout, '', 'A relfilenode path should be returned');
+
+my $filename = $stdout;
+
+check_checksums_nb_error($node, 0, "There shouldn't be any checksum errors");
+
+my $fake_uint16 = hex '0x0000';
+
+# Test with a modified checksum.  We use a zero checksum here as it's the only
+# one that cannot exist on a checksummed page.  We also don't have an easy way
+# to compute what the checksum would be after a modification in a random place
+# in the block.
+corrupt_and_test_block($node, $filename, 0, $CHECKSUM_UINT16_OFFSET,
+	$fake_uint16);
+
+# Test corruption making the block look like it's PageIsNew().
+corrupt_and_test_block($node, $filename, 0, $PD_UPPER_UINT16_OFFSET,
+	$fake_uint16);
#2Robert Haas
robertmhaas@gmail.com
In reply to: Julien Rouhaud (#1)
Re: Online checksums verification in the backend

On Fri, Dec 6, 2019 at 9:51 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

This topic was discussed several times, with the most recent
discussions found at [1] and [2]. Based on those discussions, my
understanding is that the current approach in BASE_BACKUP has too many
drawbacks and we should instead do this check in the backend.

Good idea.

This brings the second consideration: how to report the list of corrupted
blocks to end users. As I said this is for now returned via the SRF,
but this is clearly not ideal and should rather be made available more
globally.

Some people might prefer notices, because you can get those while the
thing is still running, rather than a result set, which you will only
see when the query finishes. Other people might prefer an SRF, because
they want to have the data in structured form so that they can
postprocess it. Not sure what you mean by "more globally." I guess one
idea would be to provide a way to kick this off in the background via
a background worker or similar and then have it put the results in a
table. But that might fail if there are checksum errors in the
catalogs themselves.

I don't really know what's best.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3Julien Rouhaud
rjuju123@gmail.com
In reply to: Robert Haas (#2)
Re: Online checksums verification in the backend

On Mon, Dec 9, 2019 at 5:21 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Dec 6, 2019 at 9:51 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

This brings the second consideration: how to report the list of corrupted
blocks to end users. As I said this is for now returned via the SRF,
but this is clearly not ideal and should rather be made available more
globally.

Some people might prefer notices, because you can get those while the
thing is still running, rather than a result set, which you will only
see when the query finishes. Other people might prefer an SRF, because
they want to have the data in structured form so that they can
postprocess it. Not sure what you mean by "more globally."

I meant having the results available system-wide, not only to the
caller. I think that emitting a log/notice entry should always be
done on top of whatever other communication facility we're using.

I guess one
idea would be to provide a way to kick this off in the background via
a background worker or similar and then have it put the results in a
table. But that might fail if there are checksum errors in the
catalogs themselves.

Yes that's a concern. We could maintain a list in (dynamic) shared
memory with a simple SQL wrapper to read the data, but that would be
lost with a crash/restart. Or use
pgstat_report_checksum_failures_in_db(), modifying it to take a
relfilenode, blocknum and forknum and append those to some flat file,
hoping that it won't get corrupted either.

#4Michael Paquier
michael@paquier.xyz
In reply to: Julien Rouhaud (#3)
Re: Online checksums verification in the backend

On Mon, Dec 09, 2019 at 07:02:43PM +0100, Julien Rouhaud wrote:

On Mon, Dec 9, 2019 at 5:21 PM Robert Haas <robertmhaas@gmail.com> wrote:

Some people might prefer notices, because you can get those while the
thing is still running, rather than a result set, which you will only
see when the query finishes. Other people might prefer an SRF, because
they want to have the data in structured form so that they can
postprocess it. Not sure what you mean by "more globally."

I meant having the results available system-wide, not only to the
caller. I think that emitting a log/notice level should always be
done on top on whatever other communication facility we're using.

The problem with notices and logs is that they tend to be ignored. Now I
don't see any problem either in adding something into the logs which
can be found later on for parsing, on top of an SRF returned by the
caller which includes all the corruption details, say with pgbadger
or your friendly neighborhood grep. I think that any backend function
should also make sure to call pgstat_report_checksum_failure() to
make a report visible at the database level in the catalogs, so that
it is possible to use that as a cheap high-level warning. The details
of the failures could always be dug out of the logs or the result of
the function itself after finding out that something is wrong in
pg_stat_database.
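
Concretely, the cheap high-level warning could be a periodic query against
pg_stat_database (the checksum_failures and checksum_last_failure columns
exist as of PostgreSQL 12), along these lines:

```sql
-- Spot databases that have reported checksum failures since the last
-- stats reset, then dig the details out of the logs or the SRF output.
SELECT datname, checksum_failures, checksum_last_failure
FROM pg_stat_database
WHERE checksum_failures > 0;
```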

I guess one
idea would be to provide a way to kick this off in the background via
a background worker or similar and then have it put the results in a
table. But that might fail if there are checksum errors in the
catalogs themselves.

Yes that's a concern. We could maintain a list in (dynamic) shared
memory with a simple SQL wrapper to read the data, but that would be
lost with a crash/restart. Or use
pgstat_report_checksum_failures_in_db(), modifying it to take a
relfilenode, blocknum and forknum and append those to some flat file,
hoping that it won't get corrupted either.

If a lot of blocks are corrupted, that could bloat things. Hence some
retention policies would be necessary, and that's tricky to define and
configure properly. I'd tend to be in the school of just logging the
information and being done with it, because that's simple and because
you won't need to worry about any more configuration. Doing the work in
the background is still separate from a SQL-callable function though,
no? In this case you need a connection to a database to allow the
checksum verification to happen on a relfilenode based on the relation
to check, also because you want the thing to be safe concurrently
(background work here would be a combo of a bgworker triggering dynamic
children working on one database, not necessarily something that needs
to be in core).
--
Michael

#5Julien Rouhaud
rjuju123@gmail.com
In reply to: Michael Paquier (#4)
Re: Online checksums verification in the backend

On Tue, Dec 10, 2019 at 3:26 AM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Dec 09, 2019 at 07:02:43PM +0100, Julien Rouhaud wrote:

On Mon, Dec 9, 2019 at 5:21 PM Robert Haas <robertmhaas@gmail.com> wrote:

Some people might prefer notices, because you can get those while the
thing is still running, rather than a result set, which you will only
see when the query finishes. Other people might prefer an SRF, because
they want to have the data in structured form so that they can
postprocess it. Not sure what you mean by "more globally."

I meant having the results available system-wide, not only to the
caller. I think that emitting a log/notice entry should always be
done on top of whatever other communication facility we're using.

The problem with notices and logs is that they tend to be ignored. Now I
don't see any problem either in adding something into the logs which
can be found later on for parsing, on top of an SRF returned by the
caller which includes all the corruption details, say with pgbadger
or your friendly neighborhood grep. I think that any backend function
should also make sure to call pgstat_report_checksum_failure() to
make a report visible at the database level in the catalogs, so that
it is possible to use that as a cheap high-level warning. The details
of the failures could always be dug out of the logs or the result of
the function itself after finding out that something is wrong in
pg_stat_database.

I agree that adding extra information in the logs and calling
pgstat_report_checksum_failure is a must-do, and I changed that
locally. However, I doubt that the logs are the right place to find
the details of corrupted blocks. There's no guarantee that the file
will be accessible to the DBA, nor that the content won't get
truncated by the time it's needed. I really think that corruption is
important enough to justify a more specific location.

I guess one
idea would be to provide a way to kick this off in the background via
a background worker or similar and then have it put the results in a
table. But that might fail if there are checksum errors in the
catalogs themselves.

Yes that's a concern. We could maintain a list in (dynamic) shared
memory with a simple SQL wrapper to read the data, but that would be
lost with a crash/restart. Or use
pgstat_report_checksum_failures_in_db(), modifying it to get an
relfilenode, bocknum and forknum and append that to some flat files,
hoping that it won't get corrupted either.

If a lot of blocks are corrupted, that could bloat things. Hence some
retention policies would be necessary, and that's tricky to define and
configure properly. I'd tend to be in the school of just logging the
information and being done with it, because that's simple and because
you won't need to worry about any more configuration.

If the number of corrupted blocks becomes high enough to excessively
bloat things, it's likely that the instance is doomed anyway, so I'm
not especially concerned about it.

#6Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Julien Rouhaud (#1)
Re: Online checksums verification in the backend

On Fri, Dec 6, 2019 at 11:51 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

Hi,

This topic was discussed several times, with the most recent
discussions found at [1] and [2]. Based on those discussions, my
understanding is that the current approach in BASE_BACKUP has too many
drawbacks and we should instead do this check in the backend. I've
been working on such an approach at VMware, and I'm submitting it here
to discuss the approach and rationale, and hopefully have such a
feature integrated.

Thank you for working on this!

First, this was originally developed as an extension. It means that
the check is performed using an SRF. That's maybe not the best
approach, as a transaction has to be kept open for the total processing
time. It can be alleviated by checking each relation independently, but
that's still not ideal. Maybe using some utility commands (as part of
VACUUM or a new CHECK command for instance) would be a better
approach.

This brings the second consideration: how to report the list of corrupted
blocks to end users. As I said this is for now returned via the SRF,
but this is clearly not ideal and should rather be made available more
globally. One usage of this information could be block level
recovery. I'm Cc-ing Sawada-san, as I know he's working on this and
mentioned to me that he had ideas on passing the list of corrupted
blocks using the stat collector.

Yes, the list of corrupted pages is necessary for single page
recovery. Apart from single page recovery, I think it's helpful for
DBAs if they can find the corrupted blocks in the server logs and in a
system view.

I also tried to report corrupted pages to the stats collector while
researching single page recovery in PostgreSQL, but one problem is
that the statistics in the stats collector are cleared during crash
recovery. I want the information about block corruption to survive
even when the server goes down. And we might want to add checksums to
the permanent file holding the information about database corruption.
The correctness of this information would be important because we can
fix a database by restoring some tables from a logical backup or by
doing a reindex etc. as long as we have non-broken information about
the database corruption.

Finally, the read and locking considerations. I tried to cover that
extensively in the comments, but here are some details on how I tried
to make the check safe while trying to keep the overhead as low as
possible. First thing is that this is only doing buffered reads,
without any attempt to discard OS cache. Therefore, any discrepancy
between the OS cache and the disk cannot be detected unless you do
other actions, such as sync / drop_caches on GNU/Linux.

An access share lock on the currently checked relation is held,
meaning that it can't get deleted/truncated. The total number of
blocks for the given fork is retrieved first, so any new block will be
ignored. Such new blocks are considered out of scope as being written
after the start of the check.

Each time a buffer is being checked, the target buffer mapping
partition lock is acquired in shared mode, to prevent concurrent
eviction. If the buffer is found in shared buffers, it's pinned and
released immediately, just to get the state.

I wonder if there is a possibility that blocks on disk can be corrupted
even if they are loaded in shared buffers. ISTM the above method
cannot detect such corruption. Reading and checking blocks fast is
attractive, but I think it's also important to check blocks precisely
without overlooking anything.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#7Julien Rouhaud
rjuju123@gmail.com
In reply to: Masahiko Sawada (#6)
Re: Online checksums verification in the backend

On Tue, Dec 24, 2019 at 4:23 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Dec 6, 2019 at 11:51 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

This brings the second consideration: how to report the list of corrupted
blocks to end users. As I said this is for now returned via the SRF,
but this is clearly not ideal and should rather be made available more
globally. One usage of this information could be block level
recovery. I'm Cc-ing Sawada-san, as I know he's working on this and
mentioned to me that he had ideas on passing the list of corrupted
blocks using the stat collector.

Yes, the list of corrupted pages is necessary for single page
recovery. Apart from single page recovery, I think it's helpful for
DBAs if they can find the corrupted blocks in the server logs and in a
system view.

I also tried to report corrupted pages to the stats collector while
researching single page recovery in PostgreSQL, but one problem is
that the statistics in the stats collector are cleared during crash
recovery. I want the information about block corruption to survive
even when the server goes down.

Yes, having the list of corrupted blocks surviving a crash-and-restart
cycle, and also available after a clean shutdown is definitely
important.

And we might want to add checksums to the
permanent file holding the information about database corruption. The
correctness of this information would be important because we can fix
a database by restoring some tables from a logical backup or by doing
a reindex etc. as long as we have non-broken information about the
database corruption.

Agreed

Finally, the read and locking considerations. I tried to cover that
extensively in the comments, but here are some details on how I tried
to make the check safe while trying to keep the overhead as low as
possible. First thing is that this is only doing buffered reads,
without any attempt to discard OS cache. Therefore, any discrepancy
between the OS cache and the disk cannot be detected unless you do
other actions, such as sync / drop_caches on GNU/Linux.

An access share lock on the currently checked relation is held,
meaning that it can't get deleted/truncated. The total number of
blocks for the given fork is retrieved first, so any new block will be
ignored. Such new blocks are considered out of scope as being written
after the start of the check.

Each time a buffer is being checked, the target buffer mapping
partition lock is acquired in shared mode, to prevent concurrent
eviction. If the buffer is found in shared buffers, it's pinned and
released immediately, just to get the state.

I wonder if there is a possibility that blocks on disk can be corrupted
even if they are loaded in shared buffers. ISTM the above method
cannot detect such corruption. Reading and checking blocks fast is
attractive, but I think it's also important to check blocks precisely
without overlooking anything.

It can definitely happen, and it's the usual doomsday scenario: the
database works fine for months, then postgres is restarted, say for a
minor version upgrade, and then boom: the most popular blocks that are
constantly used read-only were corrupted on disk but never evicted
from shared buffers, and you have a major outage. I have unfortunately
witnessed that too many times. This is especially bad as in this kind
of scenario, you typically discover the corruption once all backups
contain only the corrupted blocks.

Note that in the approach I'm suggesting, I do verify blocks that are
loaded in shared buffers; I only ignore the dirty blocks, as they'll
be written by the checkpointer, or by the recovery process in case of
an unclean shutdown. A buffer pin isn't necessary to avoid a torn page
read; an IO lock also guarantees that and causes less overhead. The
included TAP test should also detect the corruption of a
present-in-shared-buffers-non-dirty block. It could however be
improved, e.g. by calling pg_prewarm to make sure that the block is
indeed in shared_buffers, and also by running the same test after a
clean restart to make sure that it hits the not-in-shared-buffers case.

#8Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: Julien Rouhaud (#7)
Re: Online checksums verification in the backend

On Tue, 24 Dec 2019 at 16:09, Julien Rouhaud <rjuju123@gmail.com> wrote:

On Tue, Dec 24, 2019 at 4:23 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Dec 6, 2019 at 11:51 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

This brings the second consideration: how to report the list of corrupted
blocks to end users. As I said this is for now returned via the SRF,
but this is clearly not ideal and should rather be made available more
globally. One usage of this information could be block level
recovery. I'm Cc-ing Sawada-san, as I know he's working on this and
mentioned to me that he had ideas on passing the list of corrupted
blocks using the stat collector.

Yes, the list of corrupted pages is necessary for single page
recovery. Apart from single page recovery, I think it's helpful for
DBAs if they can find the corrupted blocks in the server logs and in a
system view.

I also tried to report corrupted pages to the stats collector while
researching single page recovery in PostgreSQL, but one problem is
that the statistics in the stats collector are cleared during crash
recovery. I want the information about block corruption to survive
even when the server goes down.

Yes, having the list of corrupted blocks surviving a crash-and-restart
cycle, and also available after a clean shutdown is definitely
important.

And we might want to add checksums to the
permanent file holding the information about database corruption. The
correctness of this information would be important because we can fix
a database by restoring some tables from a logical backup or by doing
a reindex etc. as long as we have non-broken information about the
database corruption.

Agreed

Finally, the read and locking considerations. I tried to cover that
extensively in the comments, but here are some details on how I tried
to make the check safe while trying to keep the overhead as low as
possible. First thing is that this is only doing buffered reads,
without any attempt to discard OS cache. Therefore, any discrepancy
between the OS cache and the disk cannot be detected unless you do
other actions, such as sync / drop_caches on GNU/Linux.

An access share lock on the currently checked relation is held,
meaning that it can't get deleted/truncated. The total number of
blocks for the given fork is retrieved first, so any new block will be
ignored. Such new blocks are considered out of scope as being written
after the start of the check.

Each time a buffer is being checked, the target buffer mapping
partition lock is acquired in shared mode, to prevent concurrent
eviction. If the buffer is found in shared buffers, it's pinned and
released immediately, just to get the state.

I wonder if there is a possibility that blocks on disk can be corrupted
even if they are loaded in shared buffers. ISTM the above method
cannot detect such corruption. Reading and checking blocks fast is
attractive, but I think it's also important to check blocks precisely
without overlooking anything.

It can definitely happen, and it's the usual doomsday scenario: the
database works fine for months, then postgres is restarted, say for a
minor version upgrade, and then boom: the most popular blocks that are
constantly used read-only were corrupted on disk but never evicted
from shared buffers, and you have a major outage. I have unfortunately
witnessed that too many times. This is especially bad as in this kind
of scenario, you typically discover the corruption once all backups
contain only the corrupted blocks.

Note that in the approach I'm suggesting, I do verify blocks that are
loaded in shared buffers; I only ignore the dirty blocks, as they'll
be written by the checkpointer, or by the recovery process in case of
an unclean shutdown. A buffer pin isn't necessary to avoid a torn page
read; an IO lock also guarantees that and causes less overhead. The
included TAP test should also detect the corruption of a
present-in-shared-buffers-non-dirty block. It could however be
improved, e.g. by calling pg_prewarm to make sure that the block is
indeed in shared_buffers, and also by running the same test after a
clean restart to make sure that it hits the not-in-shared-buffers case.

It reads blocks from disk even if they are loaded in shared buffers.
Now I understand. Thanks!

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#9Julien Rouhaud
rjuju123@gmail.com
In reply to: Julien Rouhaud (#5)
1 attachment(s)
Re: Online checksums verification in the backend

On Tue, Dec 10, 2019 at 11:12:34AM +0100, Julien Rouhaud wrote:

On Tue, Dec 10, 2019 at 3:26 AM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Dec 09, 2019 at 07:02:43PM +0100, Julien Rouhaud wrote:

On Mon, Dec 9, 2019 at 5:21 PM Robert Haas <robertmhaas@gmail.com> wrote:

Some people might prefer notices, because you can get those while the
thing is still running, rather than a result set, which you will only
see when the query finishes. Other people might prefer an SRF, because
they want to have the data in structured form so that they can
postprocess it. Not sure what you mean by "more globally."

I meant having the results available system-wide, not only to the
caller. I think that emitting a log/notice entry should always be
done on top of whatever other communication facility we're using.

The problem with notices and logs is that they tend to be ignored. Now I
don't see any problem either in adding something into the logs which
can be found later on for parsing, on top of an SRF returned by the
caller which includes all the corruption details, say with pgbadger
or your friendly neighborhood grep. I think that any backend function
should also make sure to call pgstat_report_checksum_failure() to
make a report visible at the database level in the catalogs, so that
it is possible to use that as a cheap high-level warning. The details
of the failures could always be dug out of the logs or the result of
the function itself after finding out that something is wrong in
pg_stat_database.

I agree that adding extra information in the logs and calling
pgstat_report_checksum_failure is a must-do, and I changed that
locally. However, I doubt that the logs are the right place to find
the details of corrupted blocks. There's no guarantee that the file
will be accessible to the DBA, nor that the content won't get
truncated by the time it's needed. I really think that corruption is
important enough to justify a more specific location.

The cfbot reported a build failure, so here's a rebased v2 which also contains
the pg_stat_report_failure() call and extra log info.
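
For reference, with the function as declared in the attached patch's
system_views.sql, a check would be invoked along these lines (the 'main'
fork-name spelling is an assumption on my part):

```sql
-- Check every relation with storage in the current database:
SELECT * FROM pg_check_relation();

-- Check only t1's main fork:
SELECT relid, forknum, failed_blocknum, expected_checksum, found_checksum
FROM pg_check_relation('public.t1', 'main');
```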

Attachments:

online_checksum_verification-v2.diff (text/plain; charset=us-ascii)
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b8a3f46912..e266292b03 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1386,6 +1386,13 @@ LANGUAGE INTERNAL
 STRICT STABLE PARALLEL SAFE
 AS 'jsonb_path_query_first_tz';
 
+CREATE OR REPLACE FUNCTION pg_check_relation(
+  IN relation regclass DEFAULT NULL::regclass, IN fork text DEFAULT NULL::text,
+  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+  OUT expected_checksum integer, OUT found_checksum integer)
+  RETURNS SETOF record VOLATILE LANGUAGE internal AS 'pg_check_relation'
+  PARALLEL RESTRICTED;
+
 --
 -- The default permissions for functions mean that anyone can execute them.
 -- A number of functions shouldn't be executable by just anyone, but rather
diff --git a/src/backend/storage/page/checksum.c b/src/backend/storage/page/checksum.c
index e010691c9f..ed1a0c9b30 100644
--- a/src/backend/storage/page/checksum.c
+++ b/src/backend/storage/page/checksum.c
@@ -15,8 +15,541 @@
 
 #include "storage/checksum.h"
 /*
- * The actual code is in storage/checksum_impl.h.  This is done so that
- * external programs can incorporate the checksum code by #include'ing
- * that file from the exported Postgres headers.  (Compare our CRC code.)
+ * The actual checksum computation code is in storage/checksum_impl.h.  This
+ * is done so that external programs can incorporate the checksum code by
+ * #include'ing that file from the exported Postgres headers.  (Compare our
+ * CRC code.)
  */
 #include "storage/checksum_impl.h"
+
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "catalog/pg_authid_d.h"
+#include "commands/dbcommands.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/buf_internals.h"
+#include "storage/checksum.h"
+#include "storage/lockdefs.h"
+#include "utils/acl.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/lsyscache.h"
+#include "utils/rel.h"
+#include "utils/syscache.h"
+
+/*
+ * A zero checksum can never be computed, see pg_checksum_page().
+ */
+#define NoComputedChecksum	0
+
+/*
+ * The rest of this module provides a set of functions that can be used to
+ * safely check all checksums on a running cluster.
+ *
+ * Please note that those only perform standard buffered reads, and don't try
+ * to bypass or discard the operating system cache.  If you want to check the
+ * actual storage, you have to discard the operating system cache before
+ * running those functions.
+ *
+ * To avoid torn pages and possible false positives when reading data, while
+ * keeping overhead as low as possible, the following heuristics are used:
+ *
+ * - a shared LWLock is taken on the target buffer pool partition mapping, and
+ *   we detect if a block is in shared_buffers or not.  See get_buffer()
+ *   comments for more details about the locking strategy.
+ *
+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
+ *
+ * - if a block is otherwise found in shared_buffers, an IO lock is taken on
+ *   the block and the block is then read from storage, ignoring the block in
+ *   shared_buffers
+ *
+ * - if a block is not found in shared_buffers, the LWLock is released and the
+ *   block is read from disk without taking any lock.  If an error is detected,
+ *   the read block will be discarded and retrieved again while holding the
+ *   LWLock.  This is because an error due to concurrent write is possible but
+ *   very unlikely, so it's better to have an optimistic approach to limit
+ *   locking overhead
+ *
+ * The check can be performed using an SQL function, returning the list of
+ * problematic blocks.
+ *
+ * Vacuum's GUCs are used to avoid consuming too many resources while running
+ * this tool.
+ */
+
+static void check_all_relations(TupleDesc tupdesc, Tuplestorestate *tupstore,
+									  ForkNumber forknum);
+static bool check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+							   uint16 *chk_found);
+static void check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+								 Oid relid, ForkNumber single_forknum);
+static void check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+									  Relation relation, ForkNumber forknum);
+static void check_delay_point(void);
+static void check_get_buffer(Relation relation, ForkNumber forknum,
+							 BlockNumber blkno, char *buffer, bool needlock,
+							 bool *checkit, bool *found_in_sb);
+
+/*
+ * Iterate over all relations having physical storage in the current database
+ * and perform a check on each of them.  Note that this function can run for
+ * a very long time, and as such can cause bloat or other issues.  Clients
+ * should iterate over a local list of OIDs using the per-table mode if
+ * keeping a transaction open for a very long time is a concern.
+ */
+static void
+check_all_relations(TupleDesc tupdesc, Tuplestorestate *tupstore,
+						  ForkNumber forknum)
+{
+	List	   *relids = NIL;
+	ListCell   *lc;
+	Relation	relation;
+	TableScanDesc scan;
+	HeapTuple	tuple;
+
+	relation = table_open(RelationRelationId, AccessShareLock);
+	scan = table_beginscan_catalog(relation, 0, NULL);
+
+	while (HeapTupleIsValid(tuple = heap_getnext(scan, ForwardScanDirection)))
+	{
+		Form_pg_class class = (Form_pg_class) GETSTRUCT(tuple);
+
+		if (!RELKIND_HAS_STORAGE(class->relkind))
+			continue;
+
+		relids = lappend_oid(relids, class->oid);
+	}
+
+	table_endscan(scan);
+	table_close(relation, AccessShareLock);	/* release lock */
+
+	foreach(lc, relids)
+	{
+		Oid			relid = lfirst_oid(lc);
+
+		check_one_relation(tupdesc, tupstore, relid, forknum);
+	}
+}
+
+/*
+ * Perform a checksum check on the passed page.  Returns whether the page is
+ * valid or not, and assigns the expected and found checksums to chk_expected
+ * and chk_found, respectively.  Note that a page can look new but could be
+ * the result of corruption.  We still check for this case, but we can't
+ * compute its checksum as pg_checksum_page() explicitly checks for non-new
+ * pages, so NoComputedChecksum will be set in chk_expected.
+ */
+static bool
+check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+				   uint16 *chk_found)
+{
+	PageHeader	hdr = (PageHeader) buffer;
+
+	Assert(chk_expected && chk_found);
+
+	if (PageIsNew(hdr))
+	{
+		if (PageIsVerified(buffer, blkno))
+		{
+			/*
+			 * If the page is really new, there won't be any checksum to be
+			 * computed or expected.
+			 */
+			*chk_expected = *chk_found = NoComputedChecksum;
+			return true;
+		}
+		else
+		{
+			/*
+			 * There's a corruption, but since the page appears as
+			 * PageIsNew(), we can't compute a checksum, so set
+			 * NoComputedChecksum for the expected checksum.
+			 */
+			*chk_expected = NoComputedChecksum;
+			*chk_found = hdr->pd_checksum;
+		}
+		return false;
+	}
+
+	*chk_expected = pg_checksum_page(buffer, blkno);
+	*chk_found = hdr->pd_checksum;
+
+	return (*chk_expected == *chk_found);
+}
+
+/*
+ * Perform the check on a single relation, possibly filtered with a single
+ * fork.  This function will check if the given relation exists or not, as
+ * a relation could be dropped after checking for the list of relations and
+ * before getting here, and we don't want to error out in this case.
+ */
+static void
+check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+					 Oid relid, ForkNumber single_forknum)
+{
+	Relation	relation = NULL;
+	ForkNumber	forknum;
+
+	/* Check if the relation (still) exists */
+	if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
+	{
+		/*
+		 * We use a standard relation_open() to acquire the initial lock.  It
+		 * means that this will block until the lock is acquired, or will
+		 * raise an ERROR if lock_timeout has been set.  If caller wants to
+		 * check multiple tables while relying on a maximum wait time, it
+		 * should process tables one by one instead of relying on a global
+		 * processing with the main SRF.
+		 */
+		relation = relation_open(relid, AccessShareLock);
+	}
+
+	if (!RelationIsValid(relation))
+	{
+		elog(WARNING, "relation %u does not exist in database \"%s\"",
+			 relid, get_database_name(MyDatabaseId));
+		return;
+	}
+
+	if (!RELKIND_HAS_STORAGE(relation->rd_rel->relkind))
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("relation \"%s\" does not have storage to be checked",
+				 quote_qualified_identifier(
+					 get_namespace_name(get_rel_namespace(relid)),
+					 get_rel_name(relid)))));
+
+	RelationOpenSmgr(relation);
+
+	for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+	{
+		if (single_forknum != InvalidForkNumber && single_forknum != forknum)
+			continue;
+
+		if (smgrexists(relation->rd_smgr, forknum))
+			check_relation_fork(tupdesc, tupstore, relation, forknum);
+	}
+
+	relation_close(relation, AccessShareLock);	/* release the lock */
+}
+
+/*
+ * For a given relation and fork, do the real work of iterating over all pages
+ * and checking each one.  Caller must hold an AccessShareLock on the given
+ * relation.
+ */
+ */
+static void
+check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+						  Relation relation, ForkNumber forknum)
+{
+	BlockNumber blkno,
+				nblocks;
+	char		buffer[BLCKSZ];
+	bool		checkit,
+				found_in_sb;
+
+	/*
+	 * We remember the number of blocks here.  Since caller must hold a lock on
+	 * the relation, we know that it won't be truncated while we're iterating
+	 * over the blocks.  Any block added after this function started won't be
+	 * checked, but this is out of scope as such pages will be flushed before
+	 * the next checkpoint's completion.
+	 */
+	nblocks = RelationGetNumberOfBlocksInFork(relation, forknum);
+
+#define CRF_COLS			5	/* Number of output arguments in the SRF */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		bool		force_lock = false;
+		uint16		chk_expected,
+					chk_found;
+		Datum		values[CRF_COLS];
+		bool		nulls[CRF_COLS];
+		int			i = 0;
+
+		memset(values, 0, sizeof(values));
+		memset(nulls, 0, sizeof(nulls));
+
+		/*
+		 * To avoid too much overhead, the buffer will be first read without
+		 * the locks that would guarantee the lack of false positives, as
+		 * such events should be quite rare.
+		 */
+Retry:
+		check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+						 &checkit, &found_in_sb);
+
+		if (!checkit)
+			continue;
+
+		if (check_buffer(buffer, blkno, &chk_expected, &chk_found))
+			continue;
+
+		/*
+		 * If we get a failure and the buffer wasn't found in shared buffers,
+		 * reread the buffer with a suitable lock to avoid false positives.
+		 * See check_get_buffer() for more details.
+		 */
+		if (!found_in_sb && !force_lock)
+		{
+			force_lock = true;
+			goto Retry;
+		}
+
+		values[i++] = ObjectIdGetDatum(relation->rd_id);
+		values[i++] = Int32GetDatum(forknum);
+		values[i++] = UInt32GetDatum(blkno);
+		/*
+		 * This can happen if a corruption makes the block appear as
+		 * PageIsNew() when it isn't actually a new page.
+		 */
+		if (chk_expected == NoComputedChecksum)
+			nulls[i++] = true;
+		else
+			values[i++] = UInt16GetDatum(chk_expected);
+		values[i++] = UInt16GetDatum(chk_found);
+
+		Assert(i == CRF_COLS);
+
+		/* Report the failure to the stat collector and the logs. */
+		pgstat_report_checksum_failure();
+		ereport(WARNING,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("invalid page in block %u of relation %s",
+						blkno,
+						relpath(relation->rd_smgr->smgr_rnode, forknum))));
+
+		/* Save the corrupted blocks in the tuplestore. */
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+}
+
+/*
+ * Check for interrupts and cost-based delay.
+ */
+static void
+check_delay_point(void)
+{
+	/* Always check for interrupts */
+	CHECK_FOR_INTERRUPTS();
+
+	/* Nap if appropriate */
+	if (!InterruptPending && VacuumCostBalance >= VacuumCostLimit)
+	{
+		int			msec;
+
+		msec = VacuumCostDelay * VacuumCostBalance / VacuumCostLimit;
+		if (msec > VacuumCostDelay * 4)
+			msec = VacuumCostDelay * 4;
+
+		pg_usleep(msec * 1000L);
+
+		VacuumCostBalance = 0;
+
+		/* Might have gotten an interrupt while sleeping */
+		CHECK_FOR_INTERRUPTS();
+	}
+}
+
+/*
+ * Safely read the wanted buffer from disk, dealing with possible concurrency
+ * issues.  Note that if a buffer is found dirty in shared_buffers, no read
+ * will be performed and the caller will be informed that no check should be
+ * done.  We can safely ignore such buffers as they'll be written before the
+ * next checkpoint's completion.
+ *
+ * The following locks can be used in this function:
+ *
+ *   - shared LWLock on the target buffer pool partition mapping.
+ *   - IOLock on the buffer
+ *
+ * The IOLock is taken when reading the buffer from disk if it exists in
+ * shared_buffers, to avoid torn pages.
+ *
+ * If the buffer isn't in shared_buffers, it'll be read from disk without any
+ * lock unless the caller asked otherwise by setting needlock.  In that case,
+ * the read will be done while the buffer mapping partition LWLock is still
+ * being held.  Reading with this lock avoids the unlikely but possible case
+ * that a buffer wasn't present in shared buffers when we checked but was
+ * then allocated in shared_buffers, modified and flushed concurrently by the
+ * time we read it, leading to a false positive due to a torn page.  Callers
+ * can first read the buffer without holding the target buffer mapping
+ * partition LWLock as an optimistic approach, and reread the buffer from
+ * disk in case of error.
+ *
+ * Caller should hold an AccessShareLock on the relation.
+ */
+static void
+check_get_buffer(Relation relation, ForkNumber forknum,
+				 BlockNumber blkno, char *buffer, bool needlock, bool *checkit,
+				 bool *found_in_sb)
+{
+	BufferTag	buf_tag;		/* identity of requested block */
+	uint32		buf_hash;		/* hash value for buf_tag */
+	LWLock	   *partLock;		/* buffer partition lock for the buffer */
+	BufferDesc *bufdesc;
+	int			buf_id;
+
+	*checkit = true;
+	*found_in_sb = false;
+
+	/* Check for interrupts and take throttling into account. */
+	check_delay_point();
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(buf_tag, relation->rd_smgr->smgr_rnode.node, forknum, blkno);
+
+	/* determine its hash code and partition lock ID */
+	buf_hash = BufTableHashCode(&buf_tag);
+	partLock = BufMappingPartitionLock(buf_hash);
+
+	/* see if the block is in the buffer pool already */
+	LWLockAcquire(partLock, LW_SHARED);
+	buf_id = BufTableLookup(&buf_tag, buf_hash);
+	if (buf_id >= 0)
+	{
+		uint32		buf_state;
+
+		*found_in_sb = true;
+
+		/*
+		 * Found it.  Take the buffer header spinlock just long enough to
+		 * retrieve the buffer's state, keeping the overhead as low as
+		 * possible.  We'll keep the shared lightweight lock on the target
+		 * buffer mapping partition, so this buffer can't be evicted, and
+		 * we'll acquire an IOLock on the buffer if we need to read the
+		 * content from disk.
+		 */
+		bufdesc = GetBufferDescriptor(buf_id);
+
+		buf_state = LockBufHdr(bufdesc);
+		UnlockBufHdr(bufdesc, buf_state);
+
+		/* Dirty pages are ignored as they'll be flushed soon. */
+		if (buf_state & BM_DIRTY)
+			*checkit = false;
+
+		/*
+		 * Read the buffer from disk, taking on IO lock to prevent torn-page
+		 * reads, in the unlikely event that it was concurrently dirted and
+		 * flushed.
+		 */
+		if (*checkit)
+		{
+			LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
+			smgrread(relation->rd_smgr, forknum, blkno, buffer);
+			LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
+
+			/*
+			 * Add a page miss cost, as we're always reading outside the
+			 * shared buffers.
+			 */
+			VacuumCostBalance += VacuumCostPageMiss;
+		}
+	}
+	else if (needlock)
+	{
+		/*
+		 * Caller asked to read the buffer while we have a lock on the target
+		 * partition.
+		 */
+		smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+		/*
+		 * Add a page miss cost, as we're always reading outside the shared
+		 * buffers.
+		 */
+		VacuumCostBalance += VacuumCostPageMiss;
+	}
+
+	LWLockRelease(partLock);
+
+	if (*found_in_sb || needlock)
+		return;
+
+	/*
+	 * Didn't find it in the buffer pool and didn't read it while holding the
+	 * buffer mapping partition lock.  We'll have to try to read it from
+	 * disk, after releasing the target partition lock to avoid too much
+	 * overhead.  It means that it's possible to get a torn page later, so
+	 * we'll have to retry with a suitable lock in case of error to avoid
+	 * false positives.
+	 */
+	smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+	/*
+	 * Add a page miss cost, as we're always reading outside the shared
+	 * buffers.
+	 */
+	VacuumCostBalance += VacuumCostPageMiss;
+}
+
+Datum
+pg_check_relation(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	ForkNumber	forknum = InvalidForkNumber;
+	Oid			relid = InvalidOid;
+	const char *forkname;
+
+	if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("only superuser or a member of the pg_read_server_files role may use this function")));
+
+	if (!DataChecksumsEnabled())
+		elog(ERROR, "Data checksums are not enabled");
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not " \
+						"allowed in this context")));
+
+	/* Switch into long-lived context to construct returned data structures */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (PG_NARGS() >= 1 && !PG_ARGISNULL(0))
+		relid = PG_GETARG_OID(0);
+
+	if (PG_NARGS() == 2 && !PG_ARGISNULL(1))
+	{
+		forkname = TextDatumGetCString(PG_GETARG_TEXT_PP(1));
+		forknum = forkname_to_number(forkname);
+	}
+
+	VacuumCostBalance = 0;
+
+	if (OidIsValid(relid))
+		check_one_relation(tupdesc, tupstore, relid, forknum);
+	else
+		check_all_relations(tupdesc, tupstore, forknum);
+
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
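For readers following the logic, the decision tree in check_buffer() above can be sketched in a few lines of Python. This is only an illustration: the helper names are made up, and the stand-in checksum below is a CRC-based placeholder rather than PostgreSQL's real FNV-derived page checksum; only the control flow (truly-new page, page that merely looks new, normal checksum comparison) mirrors the patch.

```python
import struct
import zlib

BLCKSZ = 8192
NO_COMPUTED_CHECKSUM = 0

def page_is_new(page: bytes) -> bool:
    # A "new" page has pd_upper == 0 (uint16 at byte offset 14 of the header).
    return struct.unpack_from("<H", page, 14)[0] == 0

def compute_checksum(page: bytes, blkno: int) -> int:
    # Stand-in only: PostgreSQL really uses an FNV-1a-derived algorithm.
    # Like the real one, we mask out the pd_checksum field (bytes 8-9)
    # before hashing, mix in the block number, and never return 0.
    masked = page[:8] + b"\x00\x00" + page[10:]
    return (zlib.crc32(masked) ^ blkno) & 0xFFFF or 1

def check_buffer(page: bytes, blkno: int):
    """Return (valid, expected, found), mirroring the patch's check_buffer()."""
    stored = struct.unpack_from("<H", page, 8)[0]   # pd_checksum
    if page_is_new(page):
        if page == b"\x00" * BLCKSZ:
            # Really new: nothing to compute or expect.
            return True, NO_COMPUTED_CHECKSUM, NO_COMPUTED_CHECKSUM
        # Corruption that merely makes the page look new: we can't compute
        # an expected checksum, so report only the stored one.
        return False, NO_COMPUTED_CHECKSUM, stored
    expected = compute_checksum(page, blkno)
    return expected == stored, expected, stored
```

This is also why the SRF returns NULL for the expected checksum in the looks-new-but-corrupt case: there is simply nothing meaningful to compute.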
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7fb574f9dc..73fa7cdf51 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10741,6 +10741,14 @@
   proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
   proargnames => '{tablespace,name,size,modification}',
   prosrc => 'pg_ls_tmpdir_1arg' },
+{ oid => '9147', descr => 'check data integrity for one or all relations',
+  proname => 'pg_check_relation', proisstrict => 'f', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,i,o,o,o,o,o}',
+  proargnames => '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation' },
 
 # hash partitioning constraint function
 { oid => '5028', descr => 'hash partition CHECK constraint',
diff --git a/src/test/Makefile b/src/test/Makefile
index efb206aa75..832868099f 100644
--- a/src/test/Makefile
+++ b/src/test/Makefile
@@ -12,7 +12,8 @@ subdir = src/test
 top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS = perl regress isolation modules authentication recovery subscription
+SUBDIRS = perl regress isolation modules authentication check_relation \
+	  recovery subscription
 
 # Test suites that are not safe by default but can be run if selected
 # by the user via the whitespace-separated list in variable
diff --git a/src/test/check_relation/.gitignore b/src/test/check_relation/.gitignore
new file mode 100644
index 0000000000..871e943d50
--- /dev/null
+++ b/src/test/check_relation/.gitignore
@@ -0,0 +1,2 @@
+# Generated by test suite
+/tmp_check/
diff --git a/src/test/check_relation/Makefile b/src/test/check_relation/Makefile
new file mode 100644
index 0000000000..792098fa65
--- /dev/null
+++ b/src/test/check_relation/Makefile
@@ -0,0 +1,23 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/test/check_relation
+#
+# Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/test/check_relation/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/test/check_relation
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
+clean distclean maintainer-clean:
+	rm -rf tmp_check
diff --git a/src/test/check_relation/README b/src/test/check_relation/README
new file mode 100644
index 0000000000..415c4b21ad
--- /dev/null
+++ b/src/test/check_relation/README
@@ -0,0 +1,23 @@
+src/test/check_relation/README
+
+Regression tests for online checksums verification
+==================================================
+
+This directory contains a test suite for online checksums verification.
+
+Running the tests
+=================
+
+NOTE: You must have given the --enable-tap-tests argument to configure.
+
+Run
+    make check
+or
+    make installcheck
+You can use "make installcheck" if you previously did "make install".
+In that case, the code in the installation tree is tested.  With
+"make check", a temporary installation tree is built from the current
+sources and then tested.
+
+Either way, this test initializes, starts, and stops a test Postgres
+cluster.
diff --git a/src/test/check_relation/t/01_checksums_check.pl b/src/test/check_relation/t/01_checksums_check.pl
new file mode 100644
index 0000000000..7944195489
--- /dev/null
+++ b/src/test/check_relation/t/01_checksums_check.pl
@@ -0,0 +1,274 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 59;
+
+our $CHECKSUM_UINT16_OFFSET = 4;
+our $PD_UPPER_UINT16_OFFSET = 7;
+our $BLOCKSIZE;
+our $TOTAL_NB_ERR = 0;
+
+sub get_block
+{
+	my ($filename, $blkno) = @_;
+	my $block;
+
+	open(my $infile, '<', $filename) or die;
+	binmode($infile);
+	seek($infile, $blkno * $BLOCKSIZE, 0) or die($!);
+
+	my $success = read($infile, $block, $BLOCKSIZE);
+	die($!) if not defined $success;
+
+	close($infile);
+	return($block);
+}
+
+sub overwrite_block
+{
+	my ($filename, $block, $blkno) = @_;
+
+	# open read-write without truncating the existing file
+	open(my $outfile, '+<', $filename) or die;
+	binmode($outfile);
+	sysseek($outfile, $blkno * $BLOCKSIZE, 0) or die($!);
+
+	my $nb = syswrite($outfile, $block, $BLOCKSIZE);
+
+	die($!) if not defined $nb;
+	die("Write error") if ($nb != $BLOCKSIZE);
+
+	close($outfile);
+}
+
+sub get_uint16_from_page
+{
+	my ($block, $offset) = @_;
+
+	return (unpack("S*", $block))[$offset];
+}
+
+sub set_uint16_to_page
+{
+	my ($block, $data, $offset) = @_;
+
+	my $pack = pack("S", $data);
+
+	# vec() with elements of 16 bits or more always stores in big-endian
+	# order, so write the two bytes individually to preserve endianness
+	vec($block, 2*$offset, 8) = (unpack('C*', $pack))[0];
+	vec($block, (2*$offset) + 1, 8) = (unpack('C*', $pack))[1];
+
+	return $block;
+}
+
+sub check_checksums_call
+{
+	my ($node, $relname) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM pg_catalog.pg_check_relation('$relname')"
+	);
+
+	return ($stderr eq '');
+}
+
+sub check_checksums_nb_error
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM pg_catalog.pg_check_relation()"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+	is($stdout, $nb, "Should have $nb error");
+
+	$TOTAL_NB_ERR += $nb;
+}
+
+sub check_pg_stat_database_nb_error
+{
+	my ($node) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT "
+		. " sum(checksum_failures)"
+		. " FROM pg_catalog.pg_stat_database"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	is($stderr, '', 'Function should run successfully');
+	is($stdout, $TOTAL_NB_ERR, "Should have $TOTAL_NB_ERR error");
+}
+
+sub get_checksums_errors
+{
+	my ($node, $pattern, $nb) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+		. " relid::regclass::text, forknum, failed_blocknum,"
+		. " expected_checksum, found_checksum"
+		. " FROM pg_catalog.pg_check_relation()"
+	);
+
+	is($cmdret, '0', 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+
+	$TOTAL_NB_ERR += $nb;
+
+	return $stdout;
+}
+
+# This function performs various tests by modifying the specified block at
+# the specified uint16 offset, checking that the corruption is correctly
+# detected, and finally restores the block to its original content.
+sub corrupt_and_test_block
+{
+	my ($node, $filename, $blkno, $offset, $fake_data) = @_;
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+
+	check_pg_stat_database_nb_error($node);
+
+	$node->stop();
+
+	my $original_block = get_block($filename, $blkno);
+	my $original_data = get_uint16_from_page($original_block, $offset);
+
+	isnt($original_data, $fake_data,
+		"The fake data at offset $offset should be different"
+		. " from the existing one");
+
+	my $new_block = set_uint16_to_page($original_block, $fake_data, $offset);
+	isnt($original_data, get_uint16_from_page($new_block, $offset),
+		"The fake data at offset $offset should have been changed in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+
+	my $written_data = get_uint16_from_page(get_block($filename, $blkno), $offset);
+	isnt($original_data, $written_data,
+		"The data written at offset $offset should be different"
+		. " from the original one");
+	is(get_uint16_from_page($new_block, $offset), $written_data,
+		"The data written at offset $offset should be the same"
+		. " as the one in memory");
+	is($written_data, $fake_data,
+		"The data written at offset $offset should be the one"
+		. " we wanted to write");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 1, qr/invalid page in block $blkno/);
+
+	my $expected_checksum;
+	my $found_checksum = get_uint16_from_page($new_block,
+		$CHECKSUM_UINT16_OFFSET);
+	if ($offset == $PD_UPPER_UINT16_OFFSET)
+	{
+		# A checksum can't be computed if it's detected as PageIsNew(), so the
+		# function returns NULL for the computed checksum
+		$expected_checksum = '';
+	}
+	else
+	{
+		$expected_checksum = get_uint16_from_page($original_block,
+			$CHECKSUM_UINT16_OFFSET);
+	}
+
+	my $det = get_checksums_errors($node, qr/invalid page in block $blkno/, 1);
+	is($det, "t1|0|0|$expected_checksum|$found_checksum",
+		"The checksums error for modification at offset $offset"
+		. " should be detected");
+
+	$node->stop();
+
+	$new_block = set_uint16_to_page($original_block, $original_data, $offset);
+	is($original_data, get_uint16_from_page($new_block, $offset),
+		"The data at offset $offset should have been restored in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+	is($original_data, get_uint16_from_page(get_block($filename, $blkno),
+			$offset),
+		"The data at offset $offset should have been restored on disk");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+}
+
+if (exists $ENV{MY_PG_REGRESS})
+{
+	$ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS};
+}
+
+my $node = get_new_node('main');
+
+my %params;
+$params{'extra'} = ['--data-checksums'];
+$node->init(%params);
+
+$node->start();
+
+$ENV{PGOPTIONS} = '--client-min-messages=WARNING';
+
+my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_checksums')");
+
+is($stdout, 'on', 'Data checksums should be enabled');
+
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('block_size')");
+
+$BLOCKSIZE = $stdout;
+
+$node->safe_psql(
+	'postgres', q|
+	CREATE TABLE public.t1(id integer);
+	CREATE INDEX t1_id_idx ON public.t1 (id);
+	INSERT INTO public.t1 SELECT generate_series(1, 100);
+	CREATE VIEW public.v1 AS SELECT * FROM t1;
+	CREATE MATERIALIZED VIEW public.mv1 AS SELECT * FROM t1;
+	CREATE SEQUENCE public.s1;
+	CREATE UNLOGGED TABLE public.u_t1(id integer);
+	CREATE INDEX u_t1_id_idx ON public.u_t1 (id);
+	INSERT INTO public.u_t1 SELECT generate_series(1, 100);
+	CHECKPOINT;
+|);
+
+# Check sane behavior on various object types, including those that don't
+# have storage.
+is(check_checksums_call($node, 't1'), '1', 'Can check a table');
+is(check_checksums_call($node, 't1_id_idx'), '1', 'Can check an index');
+is(check_checksums_call($node, 'v1'), '', 'Cannot check a view');
+is(check_checksums_call($node, 'mv1'), '1', 'Can check a materialized view');
+is(check_checksums_call($node, 's1'), '1', 'Can check a sequence');
+is(check_checksums_call($node, 'u_t1'), '1', 'Can check an unlogged table');
+is(check_checksums_call($node, 'u_t1_id_idx'), '1', 'Can check an unlogged index');
+
+# get the underlying heap absolute path
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_directory') || '/' || pg_relation_filepath('t1')"
+);
+
+isnt($stdout, '', 'A relation file path should be returned');
+
+my $filename = $stdout;
+
+check_checksums_nb_error($node, 0, qr/^$/);
+
+check_pg_stat_database_nb_error($node);
+
+my $fake_uint16 = hex '0x0000';
+
+# Test with a modified checksum.  We use a zero checksum here as it's the only
+# one that cannot exist on a checksummed page.  We also don't have an easy way
+# to compute what the checksum would be after a modification in a random place
+# in the block.
+corrupt_and_test_block($node, $filename, 0, $CHECKSUM_UINT16_OFFSET,
+	$fake_uint16);
+
+# Test corruption making the block look like it's PageIsNew().
+corrupt_and_test_block($node, $filename, 0, $PD_UPPER_UINT16_OFFSET,
+	$fake_uint16);
#10Michael Paquier
michael@paquier.xyz
In reply to: Julien Rouhaud (#9)
Re: Online checksums verification in the backend

On Wed, Mar 11, 2020 at 08:18:23AM +0100, Julien Rouhaud wrote:

The cfbot reported a build failure, so here's a rebased v2 which also contains
the pg_stat_report_failure() call and extra log info.

+ * - if a block is not found in shared_buffers, the LWLock is relased and the
+ *   block is read from disk without taking any lock.  If an error is detected,
+ *   the read block will be discarded and retrieved again while holding the
+ *   LWLock.  This is because an error due to concurrent write is possible but
+ *   very unlikely, so it's better to have an optimistic approach to limit
+ *   locking overhead
This can be risky with false positives, no?  With a large amount of
shared buffer eviction you actually increase the risk of torn page
reads.  Instead of a logic relying on partition mapping locks, which
could be unwise on performance grounds, did you consider different
approaches?  For example a kind of pre-emptive lock on the page in
storage to prevent any shared buffer operation to happen while the
block is read from storage, that would act like a barrier.
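To make the trade-off concrete, here is a toy Python model of the optimistic strategy the patch takes: read once without the lock, and only reread under the mapping lock when verification fails. All names are illustrative, this is not PostgreSQL code, and the "checksum" is a trivial byte sum.

```python
import threading

class BlockStore:
    """Toy model: verify once lock-free, reread under the lock on failure."""

    def __init__(self):
        self.partition_lock = threading.Lock()  # stands in for the LWLock
        self.block = b"A" * 16
        self.checksum = sum(self.block) & 0xFFFF  # trivial stand-in checksum

    def read_from_disk(self) -> bytes:
        return bytes(self.block)

    def verify(self, data: bytes) -> bool:
        return (sum(data) & 0xFFFF) == self.checksum

    def check_block(self) -> bool:
        # Optimistic pass: cheap, lock-free, but may observe a torn page.
        if self.verify(self.read_from_disk()):
            return True
        # Pessimistic retry: reread while holding the partition lock so a
        # concurrent eviction/flush can't tear the page under us.  Only a
        # failure here is reported as real corruption.
        with self.partition_lock:
            return self.verify(self.read_from_disk())

class TornOnce(BlockStore):
    """Simulate a torn first read: the locked retry sees the good page."""

    def __init__(self):
        super().__init__()
        self.calls = 0

    def read_from_disk(self) -> bytes:
        self.calls += 1
        if self.calls == 1:
            return b"A" * 8 + b"\x00" * 8  # half-written (torn) page
        return super().read_from_disk()
```

The model shows why the scheme never yields a false positive by itself (a torn first read is always retried under the lock), while the concern above is about the extra lock-held reads this retry path causes under heavy eviction.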
+ * Vacuum's GUCs are used to avoid consuming too much resources while running
+ * this tool.
Shouldn't this involve separate GUCs instead of the VACUUM ones?  I
guess that this leads to the fact that this function may be better as
a contrib module, with the addition of some better-suited APIs in core
(see paragraph above).
+Run
+    make check
+or
+    make installcheck
Why is installcheck mentioned here?

I don't think that it is appropriate to place the SQL-callable part in
the existing checksum.c. I would suggest instead a new file, say
checksumfuncs.c in src/backend/utils/adt/, holding any SQL functions
for checksums.

-SUBDIRS = perl regress isolation modules authentication recovery
 subscription
+SUBDIRS = perl regress isolation modules authentication check_relation \
+     recovery subscription
It seems to me that this test would be a good fit for
src/test/modules/test_misc/.
+static void
+check_all_relations(TupleDesc tupdesc, Tuplestorestate *tupstore,
+                         ForkNumber forknum)
Per the argument of bloat, I think that I would remove
check_all_relation() as this function could take a very long time to
run, and just make the SQL function strict.
+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
Wouldn't it actually be a good thing to check that the page on storage
is fine in this case?  This depends on the system settings and the
checkpoint frequency, but checkpoint_timeout can be extended up to 1
day.  And plenty of things could happen to the storage in one day,
including a base backup that includes a corrupted page on storage,
that this function would not be able to detect.
+ * - if a block is otherwise found in shared_buffers, an IO lock is taken on
+ *   the block and the block is then read from storage, ignoring the block in
+ *   shared_buffers
Yeah, I think that you are right here to check the page on storage
anyway.
+ *   we detect if a block is in shared_buffers or not.  See get_buffer()
+ *   comments for more details about the locking strategy.
get_buffer() does not exist in your patch, check_get_buffer() does.
+ * - if a block is not found in shared_buffers, the LWLock is relased and the
[...]
+ * To avoid torn page and possible false postives when reading data, and
Typos.
+   if (!DataChecksumsEnabled())
+       elog(ERROR, "Data checksums are not enabled");
Note that elog() is for the class of errors which are never expected,
and here a caller of pg_check_relation() with checksums disabled can
trigger that.  So you need to call ereport() with
ERRCODE_FEATURE_NOT_SUPPORTED.
+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
Not sure that the indentation is going to react well on that part of
the patch, perhaps it would be better to add some "/*-------" at the
beginning and end of the comment block to tell pgindent to ignore this
part?

Based on the feedback gathered on this thread, I guess that you should
have a SRF returning the list of broken blocks, as well as NOTICE
messages. Another thing to consider is the addition of a range
argument to only check a certain portion of the blocks, say one
segment file at a time, etc. Fine by me to not include in the first
flavor of the patch.

The patch needs documentation.
--
Michael

#11Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: Julien Rouhaud (#9)
Re: Online checksums verification in the backend

On Wed, 11 Mar 2020 at 16:18, Julien Rouhaud <rjuju123@gmail.com> wrote:

On Tue, Dec 10, 2019 at 11:12:34AM +0100, Julien Rouhaud wrote:

On Tue, Dec 10, 2019 at 3:26 AM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Dec 09, 2019 at 07:02:43PM +0100, Julien Rouhaud wrote:

On Mon, Dec 9, 2019 at 5:21 PM Robert Haas <robertmhaas@gmail.com> wrote:

Some people might prefer notices, because you can get those while the
thing is still running, rather than a result set, which you will only
see when the query finishes. Other people might prefer an SRF, because
they want to have the data in structured form so that they can
postprocess it. Not sure what you mean by "more globally."

I meant having the results available system-wide, not only to the
caller. I think that emitting a log/notice level should always be
done on top on whatever other communication facility we're using.

The problem of notice and logs is that they tend to be ignored. Now I
don't see no problems either in adding something into the logs which
can be found later on for parsing on top of a SRF returned by the
caller which includes all the corruption details, say with pgbadger
or your friendly neighborhood grep. I think that any backend function
should also make sure to call pgstat_report_checksum_failure() to
report a report visible at database-level in the catalogs, so as it is
possible to use that as a cheap high-level warning. The details of
the failures could always be dug from the logs or the result of the
function itself after finding out that something is wrong in
pg_stat_database.

I agree that adding extra information in the logs and calling
pgstat_report_checksum_failure is a must do, and I changed that
locally. However, I doubt that the logs is the right place to find
the details of corrupted blocks. There's no guarantee that the file
will be accessible to the DBA, nor that the content won't get
truncated by the time it's needed. I really think that corruption is
important enough to justify more specific location.

The cfbot reported a build failure, so here's a rebased v2 which also contains
the pg_stat_report_failure() call and extra log info.

In addition to comments from Michael-san, here are my comments:

1.
+   if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
+       ereport(ERROR,
+               (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+                errmsg("only superuser or a member of the
pg_read_server_files role may use this function")));
+
+   if (!DataChecksumsEnabled())
+       elog(ERROR, "Data checksums are not enabled");

I think it's better to reverse the order of the above checks.

2.
+#define CRF_COLS 5 /* Number of output arguments in the SRF */

Should it be SRF_COLS?

3.
+static void
+check_delay_point(void)
+{
+   /* Always check for interrupts */
+   CHECK_FOR_INTERRUPTS();
+
+   /* Nap if appropriate */
+   if (!InterruptPending && VacuumCostBalance >= VacuumCostLimit)
+   {
+       int         msec;
+
+       msec = VacuumCostDelay * VacuumCostBalance / VacuumCostLimit;
+       if (msec > VacuumCostDelay * 4)
+           msec = VacuumCostDelay * 4;
+
+       pg_usleep(msec * 1000L);
+
+       VacuumCostBalance = 0;
+
+       /* Might have gotten an interrupt while sleeping */
+       CHECK_FOR_INTERRUPTS();
+   }
+}

Even if we use vacuum delay for this function, I think we need to set
VacuumDelayActive and return if it's false, or it's better to just
return if VacuumCostDelay == 0.

4.
+static void
+check_all_relations(TupleDesc tupdesc, Tuplestorestate *tupstore,
+                         ForkNumber forknum)

I also agree with Michael-san to remove this function. Instead we can
check all relations by:

select pg_check_relation(oid) from pg_class;

6.
Other typos

s/dirted/dirtied/
s/explictly/explicitly/

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#12Julien Rouhaud
rjuju123@gmail.com
In reply to: Michael Paquier (#10)
Re: Online checksums verification in the backend

Thanks for the review Michael!

On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote:

On Wed, Mar 11, 2020 at 08:18:23AM +0100, Julien Rouhaud wrote:

The cfbot reported a build failure, so here's a rebased v2 which also contains
the pgstat_report_checksum_failure() call and extra log info.

+ * - if a block is not found in shared_buffers, the LWLock is relased and the
+ *   block is read from disk without taking any lock.  If an error is detected,
+ *   the read block will be discarded and retrieved again while holding the
+ *   LWLock.  This is because an error due to concurrent write is possible but
+ *   very unlikely, so it's better to have an optimistic approach to limit
+ *   locking overhead
This can be risky with false positives, no?

Do you mean a high probability of false positives in the first pass,
causing frequent runs of the recheck (which can't produce false
positives), rather than the second check itself leading to false
positives?

With a large amount of
shared buffer eviction you actually increase the risk of torn page
reads. Instead of a logic relying on partition mapping locks, which
could be unwise on performance grounds, did you consider different
approaches? For example a kind of pre-emptive lock on the page in
storage to prevent any shared buffer operation to happen while the
block is read from storage, that would act like a barrier.

Even with a workload having a large shared_buffers eviction pattern, I don't
think that there's a high probability of hitting a torn page. Unless I'm
mistaken, it can only happen if all of these steps happen concurrently with
the block read performed just after releasing the LWLock:

- postgres reads the same block into shared_buffers (including all the locking)
- dirties it
- writes part of the page

It's certainly possible, but it seems so unlikely that the optimistic lock-less
approach seems like a very good tradeoff.
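
For illustration, here is a toy, self-contained model of that optimistic
strategy: read without a lock first, and only reread under the lock when the
first pass reports a mismatch. Everything here (`read_block`, the trivial
checksum, the simulated torn page) is a stand-in, not PostgreSQL code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Pretend storage: the first unlocked read can observe a torn page. */
static int reads_done = 0;

static uint16_t stored_payload = 0xBEEF;
static uint16_t stored_checksum = 0xBEEF;	/* trivial checksum: equals payload */

static void
read_block(uint16_t *payload, uint16_t *checksum, bool locked)
{
	reads_done++;
	if (!locked && reads_done == 1)
	{
		/* simulate a concurrent partial write: halves disagree */
		*payload = 0xBEEF;
		*checksum = 0x0000;
		return;
	}
	*payload = stored_payload;
	*checksum = stored_checksum;
}

static bool
check_block_optimistic(void)
{
	uint16_t	payload,
				checksum;

	/* First pass: no lock, cheap, but may see a torn page. */
	read_block(&payload, &checksum, false);
	if (checksum == payload)
		return true;

	/* Apparent failure: reread under the (simulated) lock before reporting. */
	read_block(&payload, &checksum, true);
	return checksum == payload;
}
```

The point of the design is that the expensive locked reread only happens on
the rare apparent failures, so the common path stays lock-free.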

+ * Vacuum's GUCs are used to avoid consuming too much resources while running
+ * this tool.
Shouldn't this involve separate GUCs instead of the VACUUM ones?

We could, but the access pattern looked so similar that reusing them seemed
like a good idea to avoid adding two new GUCs and keep configuration simple.
Unless there are objections, I'll add dedicated GUCs in the next version.

I guess that this leads to the fact that this function may be better as
a contrib module, with the addition of some better-suited APIs in core
(see paragraph above).

Below?

+Run
+    make check
+or
+    make installcheck
Why is installcheck mentioned here?

Oops, copy/pasto error from the original contrib module this stuff was
initially implemented as, will fix.

I don't think that it is appropriate to place the SQL-callable part in
the existing checksum.c. I would suggest instead a new file, say
checksumfuncs.c in src/backend/utils/adt/, holding any SQL functions
for checksums.

Agreed.

-SUBDIRS = perl regress isolation modules authentication recovery
subscription
+SUBDIRS = perl regress isolation modules authentication check_relation \
+     recovery subscription
It seems to me that this test would be a good fit for
src/test/modules/test_misc/.

WFM.

+static void
+check_all_relations(TupleDesc tupdesc, Tuplestorestate *tupstore,
+                         ForkNumber forknum)
Per the argument of bloat, I think that I would remove
check_all_relation() as this function could take a very long time to
run, and just make the SQL function strict.

No objection.

+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
Wouldn't it actually be a good thing to check that the page on storage
is fine in this case?  This depends on the system settings and the
checkpoint frequency, but checkpoint_timeout can be extended up to 1
day.  And plenty of things could happen to the storage in one day,
including a base backup that includes a corrupted page on storage,
that this function would not be able to detect.

How could that lead to data corruption? If postgres crashes before the
checkpoint completes, the block will be overwritten during recovery, and if a
base backup is taken the block will also be overwritten while replaying all
the required WAL. Detecting corrupted blocks in those cases would have the
merit of possibly warning about broken hardware sooner, but it would also
make the check more expensive, as the odds of having to prevent postgres from
evicting a dirty block are much higher. Maybe an additional GUC for that?

For the record, when I first tested that feature I did try to check dirty
blocks, and it seemed that dirty blocks of shared relations were sometimes
wrongly reported as corrupted. I didn't try to investigate more though.

+ *   we detect if a block is in shared_buffers or not.  See get_buffer()
+ *   comments for more details about the locking strategy.
get_buffer() does not exist in your patch, check_get_buffer() does.

Oops, will fix.

+ * - if a block is not found in shared_buffers, the LWLock is relased and the
[...]
+ * To avoid torn page and possible false postives when reading data, and
Typos.
+   if (!DataChecksumsEnabled())
+       elog(ERROR, "Data checksums are not enabled");
Note that elog() is for the class of errors which are never expected,
and here a caller of pg_check_relation() with checksums disabled can
trigger that.  So you need to call ereport() with
ERRCODE_FEATURE_NOT_SUPPORTED.

Indeed, will fix.

+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
Not sure that the indentation is going to react well on that part of
the patch, perhaps it would be better to add some "/*-------" at the
beginning and end of the comment block to tell pgindent to ignore this
part?

Ok. Although I think only the opening marker is needed?

Based on the feedback gathered on this thread, I guess that you should
have a SRF returning the list of broken blocks, as well as NOTICE
messages.

The current patch has an SRF and a WARNING message, do you want an additional
NOTICE message or downgrade the existing one?

Another thing to consider is the addition of a range
argument to only check a certain portion of the blocks, say one
segment file at a time, etc. Fine by me to not include in the first
flavor of the patch.

Ok!
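
For what it's worth, a segment-based range argument could map to block
numbers along these lines. This is a sketch assuming the default build
options (8kB blocks and 1GB segment files, i.e. 131072 blocks per segment);
`segment_block_range` is a hypothetical helper, not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

/* 1GB segment / 8kB block, the default configure options */
#define BLOCKS_PER_SEGMENT 131072

/*
 * Compute the inclusive first block and the number of blocks of segment
 * `segno` for a relation fork containing `nblocks` blocks in total.  A
 * segment entirely past the end of the relation yields zero blocks.
 */
static void
segment_block_range(uint32_t segno, uint32_t nblocks,
					uint32_t *first, uint32_t *nblocks_in_seg)
{
	uint64_t	start = (uint64_t) segno * BLOCKS_PER_SEGMENT;

	if (start >= nblocks)
	{
		*first = 0;
		*nblocks_in_seg = 0;	/* segment past the end of the relation */
		return;
	}
	*first = (uint32_t) start;
	*nblocks_in_seg = (uint32_t)
		(nblocks - start < BLOCKS_PER_SEGMENT ? nblocks - start
											  : BLOCKS_PER_SEGMENT);
}
```

For a 200000-block fork, segment 0 covers blocks 0..131071, segment 1 covers
the remaining 68928 blocks, and segment 2 is empty.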

The patch needs documentation.

I'll try to add some.

#13Julien Rouhaud
rjuju123@gmail.com
In reply to: Masahiko Sawada (#11)
Re: Online checksums verification in the backend

On Mon, Mar 16, 2020 at 01:53:35PM +0900, Masahiko Sawada wrote:

In addition to comments from Michael-san, here are my comments:

1.
+   if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
+       ereport(ERROR,
+               (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+                errmsg("only superuser or a member of the
pg_read_server_files role may use this function")));

Good point! I'll fix it.

+
+   if (!DataChecksumsEnabled())
+       elog(ERROR, "Data checksums are not enabled");

I think it's better to reverse the order of the above checks.

Indeed.

2.
+#define CRF_COLS 5 /* Number of output arguments in the SRF */

Should it be SRF_COLS?

Oops, will fix.

3.
+static void
+check_delay_point(void)
+{
+   /* Always check for interrupts */
+   CHECK_FOR_INTERRUPTS();
+
+   /* Nap if appropriate */
+   if (!InterruptPending && VacuumCostBalance >= VacuumCostLimit)
+   {
+       int         msec;
+
+       msec = VacuumCostDelay * VacuumCostBalance / VacuumCostLimit;
+       if (msec > VacuumCostDelay * 4)
+           msec = VacuumCostDelay * 4;
+
+       pg_usleep(msec * 1000L);
+
+       VacuumCostBalance = 0;
+
+       /* Might have gotten an interrupt while sleeping */
+       CHECK_FOR_INTERRUPTS();
+   }
+}

Even if we use vacuum delay for this function, I think we need to set
VacuumDelayActive and return if it's false, or it's better to just
return if VacuumCostDelay == 0.

Good point, I'll fix that.

4.
+static void
+check_all_relations(TupleDesc tupdesc, Tuplestorestate *tupstore,
+                         ForkNumber forknum)

I also agree with Michael-san to remove this function. Instead we can
check all relations by:

select pg_check_relation(oid) from pg_class;

Sure, but ideally we should do that in a client program (e.g. pg_checksums)
that wouldn't keep a transaction open for the whole execution.

6.
Other typos

s/dirted/dirtied/
s/explictly/explicitly/

Will fix, thanks!

#14Julien Rouhaud
rjuju123@gmail.com
In reply to: Julien Rouhaud (#13)
1 attachment(s)
Re: Online checksums verification in the backend

On Mon, Mar 16, 2020 at 09:42:39AM +0100, Julien Rouhaud wrote:

On Mon, Mar 16, 2020 at 01:53:35PM +0900, Masahiko Sawada wrote:

In addition to comments from Michael-san, here are my comments:

Thanks both for the reviews. I'm attaching a v3 with all comments addressed,
except:

It seems to me that this test would be a good fit for
src/test/modules/test_misc/.

AFAICT this is explicitly documented as tests for various extensions, and for
now it's a core function, so I didn't move it.

+Run
+    make check
+or
+    make installcheck
Why is installcheck mentioned here?

This is actually already used in multiple other test READMEs.

Attachments:

v3-0001-Add-a-pg_check_relation-function.patchtext/plain; charset=us-asciiDownload
From d96c67e2e591cecca63fdf8c9d808bb6a1b72866 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <jrouhaud@vmware.com>
Date: Mon, 4 Nov 2019 08:40:23 +0100
Subject: [PATCH v3] Add a pg_check_relation() function.

This function checks the validity of the checksums for all non-dirty blocks of
a given relation, and optionally a given fork, and returns the list of all
blocks that don't match, along with the expected and found checksums.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada
Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 doc/src/sgml/func.sgml                        |  44 ++
 src/backend/catalog/system_views.sql          |   7 +
 src/backend/storage/page/checksum.c           | 427 +++++++++++++++++-
 src/backend/utils/adt/Makefile                |   1 +
 src/backend/utils/adt/checksumfuncs.c         |  96 ++++
 src/backend/utils/init/globals.c              |   8 +
 src/backend/utils/misc/guc.c                  |  43 ++
 src/include/catalog/pg_proc.dat               |   8 +
 src/include/miscadmin.h                       |   8 +
 src/include/storage/checksum.h                |   7 +
 src/include/utils/guc_tables.h                |   1 +
 src/test/Makefile                             |   3 +-
 src/test/check_relation/.gitignore            |   2 +
 src/test/check_relation/Makefile              |  23 +
 src/test/check_relation/README                |  23 +
 .../check_relation/t/01_checksums_check.pl    | 274 +++++++++++
 16 files changed, 971 insertions(+), 4 deletions(-)
 create mode 100644 src/backend/utils/adt/checksumfuncs.c
 create mode 100644 src/test/check_relation/.gitignore
 create mode 100644 src/test/check_relation/Makefile
 create mode 100644 src/test/check_relation/README
 create mode 100644 src/test/check_relation/t/01_checksums_check.pl

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 323366feb6..5847d7f655 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -21815,6 +21815,50 @@ SELECT (pg_stat_file('filename')).modification;
 
   </sect2>
 
+  <sect2 id="functions-data-sanity">
+   <title>Data Sanity Functions</title>
+
+   <para>
+    The functions shown in <xref linkend="functions-data-sanity-table"/>
+    provide means to check the health of data files in a cluster.
+   </para>
+
+   <table id="functions-data-sanity-table">
+    <title>Data Sanity Functions</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry>
+        <literal><function>pg_check_relation(<parameter>relation</parameter> <type>oid</type>, <parameter>fork</parameter> <type>text</type>)</function></literal>
+       </entry>
+       <entry><type>setof record</type></entry>
+       <entry>Validate the checksums for all blocks of a given relation, and
+       optionally the given fork.</entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <indexterm>
+    <primary>pg_check_relation</primary>
+   </indexterm>
+   <para>
+    <function>pg_check_relation</function> iterates over all the blocks of all
+    forks, or only the specified fork, of a given relation and verifies their
+    checksums.  It returns the list of blocks for which the found checksum
+    doesn't match the expected one.  You must be a member of the
+    <literal>pg_read_server_files</literal> role to use this function.  It can
+    only be used if data checksums are enabled.  See <xref
+    linkend="app-initdb-data-checksums"/> for more information.
+   </para>
+
+  </sect2>
+
   </sect1>
 
   <sect1 id="functions-trigger">
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b8a3f46912..e266292b03 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1386,6 +1386,13 @@ LANGUAGE INTERNAL
 STRICT STABLE PARALLEL SAFE
 AS 'jsonb_path_query_first_tz';
 
+CREATE OR REPLACE FUNCTION pg_check_relation(
+  IN relation regclass DEFAULT NULL::regclass, IN fork text DEFAULT NULL::text,
+  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+  OUT expected_checksum integer, OUT found_checksum integer)
+  RETURNS SETOF record VOLATILE LANGUAGE internal AS 'pg_check_relation'
+  PARALLEL RESTRICTED;
+
 --
 -- The default permissions for functions mean that anyone can execute them.
 -- A number of functions shouldn't be executable by just anyone, but rather
diff --git a/src/backend/storage/page/checksum.c b/src/backend/storage/page/checksum.c
index e010691c9f..aec5e2ad2d 100644
--- a/src/backend/storage/page/checksum.c
+++ b/src/backend/storage/page/checksum.c
@@ -15,8 +15,429 @@
 
 #include "storage/checksum.h"
 /*
- * The actual code is in storage/checksum_impl.h.  This is done so that
- * external programs can incorporate the checksum code by #include'ing
- * that file from the exported Postgres headers.  (Compare our CRC code.)
+ * The actual checksum computation code is in storage/checksum_impl.h.  This
+ * is done so that external programs can incorporate the checksum code by
+ * #include'ing that file from the exported Postgres headers.  (Compare our
+ * CRC code.)
  */
 #include "storage/checksum_impl.h"
+
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "catalog/pg_authid_d.h"
+#include "commands/dbcommands.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/buf_internals.h"
+#include "storage/checksum.h"
+#include "storage/lockdefs.h"
+#include "utils/acl.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/lsyscache.h"
+#include "utils/rel.h"
+#include "utils/syscache.h"
+
+/*
+ * A zero checksum can never be computed, see pg_checksum_page() */
+#define NoComputedChecksum	0
+
+/* ----------------
+ * The rest of this module provides a set of functions that can be used to
+ * safely check all checksums on a running cluster.
+ *
+ * Please note that those only perform standard buffered reads, and don't try
+ * to bypass or discard the operating system cache.  If you want to check the
+ * actual storage, you have to discard the operating system cache before
+ * running those functions.
+ *
+ * To avoid torn pages and possible false positives when reading data, and
+ * keeping overhead as low as possible, the following heuristics are used:
+ *
+ * - a shared LWLock is taken on the target buffer pool partition mapping, and
+ *   we detect if a block is in shared_buffers or not.  See check_get_buffer()
+ *   comments for more details about the locking strategy.
+ *
+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
+ *
+ * - if a block is otherwise found in shared_buffers, an IO lock is taken on
+ *   the block and the block is then read from storage, ignoring the block in
+ *   shared_buffers
+ *
+ * - if a block is not found in shared_buffers, the LWLock is released and the
+ *   block is read from disk without taking any lock.  If an error is detected,
+ *   the read block will be discarded and retrieved again while holding the
+ *   LWLock.  This is because an error due to concurrent write is possible but
+ *   very unlikely, so it's better to have an optimistic approach to limit
+ *   locking overhead
+ *
+ * The check can be performed using an SQL function, returning the list of
+ * problematic blocks.
+ * ----------------
+ */
+
+static bool check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+							   uint16 *chk_found);
+static void check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+									  Relation relation, ForkNumber forknum);
+static void check_delay_point(void);
+static void check_get_buffer(Relation relation, ForkNumber forknum,
+							 BlockNumber blkno, char *buffer, bool needlock,
+							 bool *checkit, bool *found_in_sb);
+
+/*
+ * Perform a checksum check on the passed page.  Returns whether the page is
+ * valid or not, and assigns the expected and found checksums in chk_expected and
+ * chk_found, respectively.  Note that a page can look like new but could be
+ * the result of a corruption.  We still check for this case, but we can't
+ * compute its checksum as pg_checksum_page() is explicitly checking for non-new
+ * pages, so NoComputedChecksum will be set in chk_found.
+ */
+static bool
+check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+				   uint16 *chk_found)
+{
+	PageHeader	hdr = (PageHeader) buffer;
+
+	Assert(chk_expected && chk_found);
+
+	if (PageIsNew(hdr))
+	{
+		if (PageIsVerified(buffer, blkno))
+		{
+			/*
+			 * If the page is really new, there won't be any checksum to be
+			 * computed or expected.
+			 */
+			*chk_expected = *chk_found = NoComputedChecksum;
+			return true;
+		}
+		else
+		{
+			/*
+			 * There's a corruption, but since this affects PageIsNew(), we
+			 * can't compute a checksum, so set NoComputedChecksum for the
+			 * expected checksum.
+			 */
+			*chk_expected = NoComputedChecksum;
+			*chk_found = hdr->pd_checksum;
+		}
+		return false;
+	}
+
+	*chk_expected = pg_checksum_page(buffer, blkno);
+	*chk_found = hdr->pd_checksum;
+
+	return (*chk_expected == *chk_found);
+}
+
+/*
+ * Perform the check on a single relation, possibly filtered with a single
+ * fork.  This function will check if the given relation exists or not, as
+ * a relation could be dropped after checking for the list of relations and
+ * before getting here, and we don't want to error out in this case.
+ */
+void
+check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+					 Oid relid, ForkNumber single_forknum)
+{
+	Relation	relation = NULL;
+	ForkNumber	forknum;
+
+	/* Check if the relation (still) exists */
+	if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
+	{
+		/*
+		 * We use a standard relation_open() to acquire the initial lock.  It
+		 * means that this will block until the lock is acquired, or will
+		 * raise an ERROR if lock_timeout has been set.  If caller wants to
+		 * check multiple tables while relying on a maximum wait time, it
+		 * should process tables one by one instead of relying on a global
+		 * processing with the main SRF.
+		 */
+		relation = relation_open(relid, AccessShareLock);
+	}
+
+	if (!RelationIsValid(relation))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_TABLE),
+				 errmsg("relation with OID %u does not exist", relid)));
+
+	if (!RELKIND_HAS_STORAGE(relation->rd_rel->relkind))
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("relation \"%s\" does not have storage to be checked",
+				 quote_qualified_identifier(
+					 get_namespace_name(get_rel_namespace(relid)),
+					 get_rel_name(relid)))));
+
+	RelationOpenSmgr(relation);
+
+	for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+	{
+		if (single_forknum != InvalidForkNumber && single_forknum != forknum)
+			continue;
+
+		if (smgrexists(relation->rd_smgr, forknum))
+			check_relation_fork(tupdesc, tupstore, relation, forknum);
+	}
+
+	relation_close(relation, AccessShareLock);	/* release the lock */
+}
+
+/*
+ * For a given relation and fork, do the real work of iterating over all pages
+ * and doing the check.  Caller must hold an AccessShareLock lock on the given
+ * relation.
+ */
+static void
+check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+						  Relation relation, ForkNumber forknum)
+{
+	BlockNumber blkno,
+				nblocks;
+	char		buffer[BLCKSZ];
+	bool		checkit,
+				found_in_sb;
+
+	/*
+	 * We remember the number of blocks here.  Since caller must hold a lock on
+	 * the relation, we know that it won't be truncated while we're iterating
+	 * over the blocks.  Any block added after this function started won't be
+	 * checked, but this is out of scope as such pages will be flushed before
+	 * the next checkpoint's completion.
+	 */
+	nblocks = RelationGetNumberOfBlocksInFork(relation, forknum);
+
+#define SRF_COLS			5	/* Number of output arguments in the SRF */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		bool		force_lock = false;
+		uint16		chk_expected,
+					chk_found;
+		Datum		values[SRF_COLS];
+		bool		nulls[SRF_COLS];
+		int			i = 0;
+
+		memset(values, 0, sizeof(values));
+		memset(nulls, 0, sizeof(nulls));
+
+		/*
+		 * To avoid too much overhead, the buffer will be first read without
+		 * the locks that would guarantee the lack of false positives, as such
+		 * events should be quite rare.
+		 */
+Retry:
+		check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+						 &checkit, &found_in_sb);
+
+		if (!checkit)
+			continue;
+
+		if (check_buffer(buffer, blkno, &chk_expected, &chk_found))
+			continue;
+
+		/*
+		 * If we get a failure and the buffer wasn't found in shared buffers,
+		 * reread the buffer with a suitable lock to avoid false positives.  See
+		 * check_get_buffer for more details.
+		 */
+		if (!found_in_sb && !force_lock)
+		{
+			force_lock = true;
+			goto Retry;
+		}
+
+		values[i++] = ObjectIdGetDatum(relation->rd_id);
+		values[i++] = Int32GetDatum(forknum);
+		values[i++] = UInt32GetDatum(blkno);
+		/*
+		 * This can happen if a corruption makes the block appear as
+		 * PageIsNew() but isn't a new page.
+		 */
+		if (chk_expected == NoComputedChecksum)
+			nulls[i++] = true;
+		else
+			values[i++] = UInt16GetDatum(chk_expected);
+		values[i++] = UInt16GetDatum(chk_found);
+
+		Assert(i == SRF_COLS);
+
+		/* Report the failure to the stat collector and the logs. */
+		pgstat_report_checksum_failure();
+		ereport(WARNING,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("invalid page in block %u of relation %s",
+						blkno,
+						relpath(relation->rd_smgr->smgr_rnode, forknum))));
+
+		/* Save the corrupted blocks in the tuplestore. */
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+}
+
+/*
+ * Check for interrupts and cost-based delay.
+ */
+static void
+check_delay_point(void)
+{
+	/* Always check for interrupts */
+	CHECK_FOR_INTERRUPTS();
+
+	if (!ChecksumCostActive || InterruptPending)
+		return;
+
+	/* Nap if appropriate */
+	if (ChecksumCostBalance >= ChecksumCostLimit)
+	{
+		int			msec;
+
+		msec = ChecksumCostDelay * ChecksumCostBalance / ChecksumCostLimit;
+		if (msec > ChecksumCostDelay * 4)
+			msec = ChecksumCostDelay * 4;
+
+		pg_usleep(msec * 1000L);
+
+		ChecksumCostBalance = 0;
+
+		/* Might have gotten an interrupt while sleeping */
+		CHECK_FOR_INTERRUPTS();
+	}
+}
+
+/*
+ * Safely read the wanted buffer from disk, dealing with possible concurrency
+ * issues.  Note that if a buffer is found dirty in shared_buffers, no read will
+ * be performed and the caller will be informed that no check should be done.
+ * We can safely ignore such buffers as they'll be written before the next
+ * checkpoint's completion.
+ *
+ * The following locks can be used in this function:
+ *
+ *   - shared LWLock on the target buffer pool partition mapping.
+ *   - IOLock on the buffer
+ *
+ * The IOLock is taken when reading the buffer from disk if it exists in
+ * shared_buffers, to avoid torn pages.
+ *
+ * If the buffer isn't in shared_buffers, it'll be read from disk without any
+ * lock unless caller asked otherwise, setting needlock.  In this case, the
+ * read will be done while the buffer mapping partition LWLock is still being
+ * held.  Reading with this lock is to avoid the unlikely but possible case
+ * that a buffer wasn't present in shared buffers when we checked, but was
+ * then allocated in shared_buffers, modified and flushed concurrently before
+ * we later try to read it, leading to a false positive due to a torn page.
+ * The caller can first read the buffer without holding the target buffer
+ * mapping partition LWLock as an optimistic approach, and reread the buffer
+ * from disk in case of error.
+ *
+ * Caller should hold an AccessShareLock on the Relation
+ */
+static void
+check_get_buffer(Relation relation, ForkNumber forknum,
+				 BlockNumber blkno, char *buffer, bool needlock, bool *checkit,
+				 bool *found_in_sb)
+{
+	BufferTag	buf_tag;		/* identity of requested block */
+	uint32		buf_hash;		/* hash value for buf_tag */
+	LWLock	   *partLock;		/* buffer partition lock for the buffer */
+	BufferDesc *bufdesc;
+	int			buf_id;
+
+	*checkit = true;
+	*found_in_sb = false;
+
+	/* Check for interrupts and take throttling into account. */
+	check_delay_point();
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(buf_tag, relation->rd_smgr->smgr_rnode.node, forknum, blkno);
+
+	/* determine its hash code and partition lock ID */
+	buf_hash = BufTableHashCode(&buf_tag);
+	partLock = BufMappingPartitionLock(buf_hash);
+
+	/* see if the block is in the buffer pool already */
+	LWLockAcquire(partLock, LW_SHARED);
+	buf_id = BufTableLookup(&buf_tag, buf_hash);
+	if (buf_id >= 0)
+	{
+		uint32		buf_state;
+
+		*found_in_sb = true;
+
+		/*
+		 * Found it.  Now, retrieve its state to know what to do with it, and
+		 * release the pin immediately.  We do so to limit overhead as much
+		 * as possible.  We'll keep the shared lightweight lock on the target
+		 * buffer mapping partition, so this buffer can't be evicted, and
+		 * we'll acquire an IOLock on the buffer if we need to read the
+		 * content on disk.
+		 */
+		bufdesc = GetBufferDescriptor(buf_id);
+
+		buf_state = LockBufHdr(bufdesc);
+		UnlockBufHdr(bufdesc, buf_state);
+
+		/* Dirty pages are ignored as they'll be flushed soon. */
+		if (buf_state & BM_DIRTY)
+			*checkit = false;
+
+		/*
+		 * Read the buffer from disk, taking an IO lock to prevent torn-page
+		 * reads, in the unlikely event that it was concurrently dirtied and
+		 * flushed.
+		 */
+		if (*checkit)
+		{
+			LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
+			smgrread(relation->rd_smgr, forknum, blkno, buffer);
+			LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
+
+			/*
+			 * Add a page miss cost, as we're always reading outside the
+			 * shared buffers.
+			 */
+			ChecksumCostBalance += ChecksumCostPageMiss;
+		}
+	}
+	else if (needlock)
+	{
+		/*
+		 * Caller asked to read the buffer while we have a lock on the target
+		 * partition.
+		 */
+		smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+		/*
+		 * Add a page miss cost, as we're always reading outside the shared
+		 * buffers.
+		 */
+		ChecksumCostBalance += ChecksumCostPageMiss;
+	}
+
+	LWLockRelease(partLock);
+
+	if (*found_in_sb || needlock)
+		return;
+
+	/*
+	 * Didn't find it in the buffer pool and didn't read it while holding the
+	 * buffer mapping partition lock.  We'll have to try to read it from
+	 * disk, after releasing the target partition lock to avoid too much
+	 * overhead.  It means that it's possible to get a torn page later, so
+	 * we'll have to retry with a suitable lock in case of error to avoid
+	 * false positives.
+	 */
+	smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+	/*
+	 * Add a page miss cost, as we're always reading outside the shared
+	 * buffers.
+	 */
+	ChecksumCostBalance += ChecksumCostPageMiss;
+}
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 13efa9338c..27dc524e00 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -22,6 +22,7 @@ OBJS = \
 	bool.o \
 	cash.o \
 	char.o \
+	checksumfuncs.o \
 	cryptohashes.o \
 	date.o \
 	datetime.o \
diff --git a/src/backend/utils/adt/checksumfuncs.c b/src/backend/utils/adt/checksumfuncs.c
new file mode 100644
index 0000000000..b53b4c0bbf
--- /dev/null
+++ b/src/backend/utils/adt/checksumfuncs.c
@@ -0,0 +1,96 @@
+/*-------------------------------------------------------------------------
+ *
+ * checksumfuncs.c
+ *	  Functions for checksum-related features such as online verification
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/checksumfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "catalog/pg_authid_d.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/checksum.h"
+#include "utils/acl.h"
+#include "utils/builtins.h"
+
+Datum
+pg_check_relation(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	ForkNumber	forknum = InvalidForkNumber;
+	Oid			relid = InvalidOid;
+	const char *forkname;
+
+	if (!DataChecksumsEnabled())
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("data checksums are not enabled in cluster")));
+
+	if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("only superuser or a member of the pg_read_server_files role may use this function")));
+
+	if (PG_ARGISNULL(0))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("relation cannot be null")));
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not " \
+						"allowed in this context")));
+
+	/* Switch into long-lived context to construct returned data structures */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	relid = PG_GETARG_OID(0);
+
+	if (PG_NARGS() == 2 && !PG_ARGISNULL(1))
+	{
+		forkname = TextDatumGetCString(PG_GETARG_TEXT_PP(1));
+		forknum = forkname_to_number(forkname);
+	}
+
+	/* Set cost-based checksum verification delay */
+	ChecksumCostActive = (ChecksumCostDelay > 0);
+	ChecksumCostBalance = 0;
+
+	check_one_relation(tupdesc, tupstore, relid, forknum);
+
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..37f63e747c 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -134,6 +134,14 @@ int			max_worker_processes = 8;
 int			max_parallel_workers = 8;
 int			MaxBackends = 0;
 
+int			ChecksumCostPageHit = 1;	/* GUC parameters for checksum check */
+int			ChecksumCostPageMiss = 10;
+int			ChecksumCostLimit = 200;
+double		ChecksumCostDelay = 0;
+
+int			ChecksumCostBalance = 0;	/* working state for checksums check */
+bool		ChecksumCostActive = false;
+
 int			VacuumCostPageHit = 1;	/* GUC parameters for vacuum */
 int			VacuumCostPageMiss = 10;
 int			VacuumCostPageDirty = 20;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 68082315ac..1dc10c5462 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -693,6 +693,8 @@ const char *const config_group_names[] =
 	gettext_noop("Resource Usage / Disk"),
 	/* RESOURCES_KERNEL */
 	gettext_noop("Resource Usage / Kernel Resources"),
+	/* RESOURCES_CHECKSUM_DELAY */
+	gettext_noop("Resource Usage / Cost-Based Checksum Verification Delay"),
 	/* RESOURCES_VACUUM_DELAY */
 	gettext_noop("Resource Usage / Cost-Based Vacuum Delay"),
 	/* RESOURCES_BGWRITER */
@@ -2376,6 +2378,36 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"checksum_cost_page_hit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost for a page found in the buffer cache."),
+			NULL
+		},
+		&ChecksumCostPageHit,
+		1, 0, 10000,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"checksum_cost_page_miss", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost for a page not found in the buffer cache."),
+			NULL
+		},
+		&ChecksumCostPageMiss,
+		10, 0, 10000,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"checksum_cost_limit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost amount available before napping."),
+			NULL
+		},
+		&ChecksumCostLimit,
+		200, 1, 10000,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_cost_page_hit", PGC_USERSET, RESOURCES_VACUUM_DELAY,
 			gettext_noop("Vacuum cost for a page found in the buffer cache."),
@@ -3498,6 +3530,17 @@ static struct config_real ConfigureNamesReal[] =
 		check_random_seed, assign_random_seed, show_random_seed
 	},
 
+	{
+		{"checksum_cost_delay", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost delay in milliseconds."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&ChecksumCostDelay,
+		0, 0, 100,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_cost_delay", PGC_USERSET, RESOURCES_VACUUM_DELAY,
 			gettext_noop("Vacuum cost delay in milliseconds."),
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7fb574f9dc..73fa7cdf51 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10741,6 +10741,14 @@
   proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
   proargnames => '{tablespace,name,size,modification}',
   prosrc => 'pg_ls_tmpdir_1arg' },
+{ oid => '9147', descr => 'check data integrity for one or all relations',
+  proname => 'pg_check_relation', proisstrict => 'f', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,i,o,o,o,o,o}',
+  proargnames => '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation' },
 
 # hash partitioning constraint function
 { oid => '5028', descr => 'hash partition CHECK constraint',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14fa127ab1..f2fc143872 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -246,6 +246,14 @@ extern PGDLLIMPORT int work_mem;
 extern PGDLLIMPORT int maintenance_work_mem;
 extern PGDLLIMPORT int max_parallel_maintenance_workers;
 
+extern int	ChecksumCostPageHit;
+extern int	ChecksumCostPageMiss;
+extern int	ChecksumCostLimit;
+extern double ChecksumCostDelay;
+
+extern int	ChecksumCostBalance;
+extern bool ChecksumCostActive;
+
 extern int	VacuumCostPageHit;
 extern int	VacuumCostPageMiss;
 extern int	VacuumCostPageDirty;
diff --git a/src/include/storage/checksum.h b/src/include/storage/checksum.h
index 6e77744cbc..29ca52d8c9 100644
--- a/src/include/storage/checksum.h
+++ b/src/include/storage/checksum.h
@@ -13,8 +13,15 @@
 #ifndef CHECKSUM_H
 #define CHECKSUM_H
 
+#include "postgres.h"
+
+#include "access/tupdesc.h"
+#include "common/relpath.h"
 #include "storage/block.h"
+#include "utils/tuplestore.h"
 
+extern void check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+							   Oid relid, ForkNumber single_forknum);
 /*
  * Compute the checksum for a Postgres page.  The page must be aligned on a
  * 4-byte boundary.
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 454c2df487..84965d3f09 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -62,6 +62,7 @@ enum config_group
 	RESOURCES_MEM,
 	RESOURCES_DISK,
 	RESOURCES_KERNEL,
+	RESOURCES_CHECKSUM_DELAY,
 	RESOURCES_VACUUM_DELAY,
 	RESOURCES_BGWRITER,
 	RESOURCES_ASYNCHRONOUS,
diff --git a/src/test/Makefile b/src/test/Makefile
index efb206aa75..832868099f 100644
--- a/src/test/Makefile
+++ b/src/test/Makefile
@@ -12,7 +12,8 @@ subdir = src/test
 top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS = perl regress isolation modules authentication recovery subscription
+SUBDIRS = perl regress isolation modules authentication check_relation \
+	  recovery subscription
 
 # Test suites that are not safe by default but can be run if selected
 # by the user via the whitespace-separated list in variable
diff --git a/src/test/check_relation/.gitignore b/src/test/check_relation/.gitignore
new file mode 100644
index 0000000000..871e943d50
--- /dev/null
+++ b/src/test/check_relation/.gitignore
@@ -0,0 +1,2 @@
+# Generated by test suite
+/tmp_check/
diff --git a/src/test/check_relation/Makefile b/src/test/check_relation/Makefile
new file mode 100644
index 0000000000..792098fa65
--- /dev/null
+++ b/src/test/check_relation/Makefile
@@ -0,0 +1,23 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/test/check_relation
+#
+# Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/test/check_relation/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/test/check_relation
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
+clean distclean maintainer-clean:
+	rm -rf tmp_check
diff --git a/src/test/check_relation/README b/src/test/check_relation/README
new file mode 100644
index 0000000000..415c4b21ad
--- /dev/null
+++ b/src/test/check_relation/README
@@ -0,0 +1,23 @@
+src/test/check_relation/README
+
+Regression tests for online checksums verification
+==================================================
+
+This directory contains a test suite for online checksums verification.
+
+Running the tests
+=================
+
+NOTE: You must have given the --enable-tap-tests argument to configure.
+
+Run
+    make check
+or
+    make installcheck
+You can use "make installcheck" if you previously did "make install".
+In that case, the code in the installation tree is tested.  With
+"make check", a temporary installation tree is built from the current
+sources and then tested.
+
+Either way, this test initializes, starts, and stops a test Postgres
+cluster.
diff --git a/src/test/check_relation/t/01_checksums_check.pl b/src/test/check_relation/t/01_checksums_check.pl
new file mode 100644
index 0000000000..7944195489
--- /dev/null
+++ b/src/test/check_relation/t/01_checksums_check.pl
@@ -0,0 +1,274 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 59;
+
+our $CHECKSUM_UINT16_OFFSET = 4;
+our $PD_UPPER_UINT16_OFFSET = 7;
+our $BLOCKSIZE;
+our $TOTAL_NB_ERR = 0;
+
+sub get_block
+{
+	my ($filename, $blkno) = @_;
+	my $block;
+
+	open(my $infile, '<', $filename) or die;
+	binmode($infile);
+
+	# The 4-argument form of read() treats the last argument as an offset
+	# into the buffer, not the file, so seek to the wanted block first.
+	seek($infile, $blkno * $BLOCKSIZE, 0) or die($!);
+	my $success = read($infile, $block, $BLOCKSIZE);
+	die($!) if not defined $success;
+
+	close($infile);
+
+	return($block);
+}
+
+sub overwrite_block
+{
+	my ($filename, $block, $blkno) = @_;
+
+	# Open read-write without truncating, and seek to the wanted block:
+	# opening with '>' would truncate the file, and syswrite()'s last
+	# argument is an offset into the buffer, not the file.
+	open(my $outfile, '+<', $filename) or die;
+	binmode ($outfile);
+
+	sysseek($outfile, $blkno * $BLOCKSIZE, 0) or die($!);
+	my $nb = syswrite($outfile, $block, $BLOCKSIZE);
+
+	die($!) if not defined $nb;
+	die("Write error") if ($nb != $BLOCKSIZE);
+
+	$outfile->flush();
+
+	close($outfile);
+}
+
+sub get_uint16_from_page
+{
+	my ($block, $offset) = @_;
+
+	return (unpack("S*", $block))[$offset];
+}
+
+sub set_uint16_to_page
+{
+	my ($block, $data, $offset) = @_;
+
+	my $pack = pack("S", $data);
+
+	# vec() with 16-bit or wider elements won't preserve native endianness
+	vec($block, 2*$offset, 8) = (unpack('C*', $pack))[0];
+	vec($block, (2*$offset) + 1, 8) = (unpack('C*', $pack))[1];
+
+	return $block;
+}
+
+sub check_checksums_call
+{
+	my ($node, $relname) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM pg_catalog.pg_check_relation('$relname')"
+	);
+
+	 return ($stderr eq '');
+}
+
+sub check_checksums_nb_error
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM pg_catalog.pg_check_relation()"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+	is($stdout, $nb, "Should have $nb error");
+
+	$TOTAL_NB_ERR += $nb;
+}
+
+sub check_pg_stat_database_nb_error
+{
+	my ($node) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT "
+		. " sum(checksum_failures)"
+		. " FROM pg_catalog.pg_stat_database"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	is($stderr, '', 'Function should run successfully');
+	is($stdout, $TOTAL_NB_ERR, "Should have $TOTAL_NB_ERR error");
+}
+
+sub get_checksums_errors
+{
+	my ($node, $pattern, $nb) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+		. " relid::regclass::text, forknum, failed_blocknum,"
+		. " expected_checksum, found_checksum"
+		. " FROM pg_catalog.pg_check_relation()"
+	);
+
+	 is($cmdret, '0', 'Function should run successfully');
+	 like($stderr, $pattern, 'Error output should match expectations');
+
+	 $TOTAL_NB_ERR += $nb;
+
+	 return $stdout;
+}
+
+# This function will perform various test by modifying the specified block at
+# the specified uint16 offset, checking that the corruption is correctly
+# detected, and finally restore the specified block to its original content.
+sub corrupt_and_test_block
+{
+	my ($node, $filename, $blkno, $offset, $fake_data) = @_;
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+
+	check_pg_stat_database_nb_error($node);
+
+	$node->stop();
+
+	my $original_block = get_block($filename, 0);
+	my $original_data = get_uint16_from_page($original_block, $offset);
+
+	isnt($original_data, $fake_data,
+		"The fake data at offset $offset should be different"
+		. " from the existing one");
+
+	my $new_block = set_uint16_to_page($original_block, $fake_data, $offset);
+	isnt($original_data, get_uint16_from_page($new_block, $offset),
+		"The fake data at offset $offset should have been changed in memory");
+
+	overwrite_block($filename, $new_block, 0);
+
+	my $written_data = get_uint16_from_page(get_block($filename, 0), $offset);
+	isnt($original_data, $written_data,
+		"The data written at offset $offset should be different"
+		. " from the original one");
+	is(get_uint16_from_page($new_block, $offset), $written_data,
+		"The data written at offset $offset should be the same"
+		. " as the one in memory");
+	is($written_data, $fake_data,
+		"The data written at offset $offset should be the one"
+		. " we wanted to write");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 1, qr/invalid page in block $blkno/);
+
+	my $expected_checksum;
+	my $found_checksum = get_uint16_from_page($new_block,
+		$CHECKSUM_UINT16_OFFSET);
+	if ($offset == $PD_UPPER_UINT16_OFFSET)
+	{
+		# A checksum can't be computed if it's detected as PageIsNew(), so the
+		# function returns NULL for the computed checksum
+		$expected_checksum = '';
+	}
+	else
+	{
+		$expected_checksum = get_uint16_from_page($original_block,
+			$CHECKSUM_UINT16_OFFSET);
+	}
+
+	my $det = get_checksums_errors($node, qr/invalid page in block $blkno/, 1);
+	is($det, "t1|0|0|$expected_checksum|$found_checksum",
+		"The checksums error for modification at offset $offset"
+		. " should be detected");
+
+	$node->stop();
+
+	$new_block = set_uint16_to_page($original_block, $original_data, $offset);
+	is($original_data, get_uint16_from_page($new_block, $offset),
+		"The data at offset $offset should have been restored in memory");
+
+	overwrite_block($filename, $new_block, 0);
+	is($original_data, get_uint16_from_page(get_block($filename, $blkno),
+			$offset),
+		"The data at offset $offset should have been restored on disk");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+}
+
+if (exists $ENV{MY_PG_REGRESS})
+{
+	$ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS};
+}
+
+my $node = get_new_node('main');
+
+my %params;
+$params{'extra'} = ['--data-checksums'];
+$node->init(%params);
+
+$node->start();
+
+$ENV{PGOPTIONS} = '--client-min-messages=WARNING';
+
+my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_checksums')");
+
+is($stdout, 'on', 'Data checksums should be enabled');
+
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('block_size')");
+
+$BLOCKSIZE = $stdout;
+
+$node->safe_psql(
+	'postgres', q|
+	CREATE TABLE public.t1(id integer);
+	CREATE INDEX t1_id_idx ON public.t1 (id);
+	INSERT INTO public.t1 SELECT generate_series(1, 100);
+	CREATE VIEW public.v1 AS SELECT * FROM t1;
+	CREATE MATERIALIZED VIEW public.mv1 AS SELECT * FROM t1;
+	CREATE SEQUENCE public.s1;
+	CREATE UNLOGGED TABLE public.u_t1(id integer);
+	CREATE INDEX u_t1_id_idx ON public.u_t1 (id);
+	INSERT INTO public.u_t1 SELECT generate_series(1, 100);
+	CHECKPOINT;
+|);
+
+# Check sane behavior on various objects type, including those that don't have
+# a storage.
+is(check_checksums_call($node, 't1'), '1', 'Can check a table');
+is(check_checksums_call($node, 't1_id_idx'), '1', 'Can check an index');
+is(check_checksums_call($node, 'v1'), '', 'Cannot check a view');
+is(check_checksums_call($node, 'mv1'), '1', 'Can check a materialized view');
+is(check_checksums_call($node, 's1'), '1', 'Can check a sequence');
+is(check_checksums_call($node, 'u_t1'), '1', 'Can check an unlogged table');
+is(check_checksums_call($node, 'u_t1_id_idx'), '1', 'Can check an unlogged index');
+
+# get the underlying heap absolute path
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_directory') || '/' || pg_relation_filepath('t1')"
+);
+
+isnt($stdout, '', 'A relation file path should be returned');
+
+my $filename = $stdout;
+
+check_checksums_nb_error($node, 0, qr/^$/);
+
+check_pg_stat_database_nb_error($node);
+
+my $fake_uint16 = hex '0x0000';
+
+# Test with a modified checksum.  We use a zero checksum here as it's the only
+# one that cannot exist on a checksummed page.  We also don't have an easy way
+# to compute what the checksum would be after a modification in a random place
+# in the block.
+corrupt_and_test_block($node, $filename, 0, $CHECKSUM_UINT16_OFFSET,
+	$fake_uint16);
+
+# Test corruption making the block look like it's PageIsNew().
+corrupt_and_test_block($node, $filename, 0, $PD_UPPER_UINT16_OFFSET,
+	$fake_uint16);
-- 
2.20.1

#15 Julien Rouhaud
rjuju123@gmail.com
In reply to: Julien Rouhaud (#14)
1 attachment(s)
Re: Online checksums verification in the backend

On Mon, Mar 16, 2020 at 02:15:27PM +0100, Julien Rouhaud wrote:

On Mon, Mar 16, 2020 at 09:42:39AM +0100, Julien Rouhaud wrote:

On Mon, Mar 16, 2020 at 01:53:35PM +0900, Masahiko Sawada wrote:

In addition to comments from Michael-san, here are my comments:

Thanks both for the reviews. I'm attaching a v3 with all comments addressed,
except:

It seems to me that this test would be a good fit for
src/test/modules/test_misc/.

AFAICT this is explicitly documented as tests for various extensions, and for
now it's a core function, so I didn't move it.

+Run
+    make check
+or
+    make installcheck
Why is installcheck mentioned here?

This is actually already used in multiple other test readme.

Sorry I forgot to update the regression tests. v4 attached.

Attachments:

v4-0001-Add-a-pg_check_relation-function.patch (text/plain, charset=us-ascii)
From e71af7998600a491efa1435d3c218de6e55bbc64 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <jrouhaud@vmware.com>
Date: Mon, 4 Nov 2019 08:40:23 +0100
Subject: [PATCH v4] Add a pg_check_relation() function.

This function checks the validity of the checksums for all non-dirty blocks of
a given relation, and optionally a given fork, and returns the list of all
blocks that don't match, along with the expected and found checksums.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada
Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 doc/src/sgml/func.sgml                        |  44 ++
 src/backend/catalog/system_views.sql          |   7 +
 src/backend/storage/page/checksum.c           | 427 +++++++++++++++++-
 src/backend/utils/adt/Makefile                |   1 +
 src/backend/utils/adt/checksumfuncs.c         |  96 ++++
 src/backend/utils/init/globals.c              |   8 +
 src/backend/utils/misc/guc.c                  |  43 ++
 src/include/catalog/pg_proc.dat               |   8 +
 src/include/miscadmin.h                       |   8 +
 src/include/storage/checksum.h                |   7 +
 src/include/utils/guc_tables.h                |   1 +
 src/test/Makefile                             |   3 +-
 src/test/check_relation/.gitignore            |   2 +
 src/test/check_relation/Makefile              |  23 +
 src/test/check_relation/README                |  23 +
 .../check_relation/t/01_checksums_check.pl    | 276 +++++++++++
 16 files changed, 973 insertions(+), 4 deletions(-)
 create mode 100644 src/backend/utils/adt/checksumfuncs.c
 create mode 100644 src/test/check_relation/.gitignore
 create mode 100644 src/test/check_relation/Makefile
 create mode 100644 src/test/check_relation/README
 create mode 100644 src/test/check_relation/t/01_checksums_check.pl

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 323366feb6..5847d7f655 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -21815,6 +21815,50 @@ SELECT (pg_stat_file('filename')).modification;
 
   </sect2>
 
+  <sect2 id="functions-data-sanity">
+   <title>Data Sanity Functions</title>
+
+   <para>
+    The functions shown in <xref linkend="functions-data-sanity-table"/>
+    provide a means to check the health of data files in a cluster.
+   </para>
+
+   <table id="functions-data-sanity-table">
+    <title>Data Sanity Functions</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry>
+        <literal><function>pg_check_relation(<parameter>relation</parameter> <type>regclass</type>, <parameter>fork</parameter> <type>text</type>)</function></literal>
+       </entry>
+       <entry><type>setof record</type></entry>
+       <entry>Validate the checksums for all blocks of a given relation,
+       optionally restricted to the given fork.</entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <indexterm>
+    <primary>pg_check_relation</primary>
+   </indexterm>
+   <para>
+    <function>pg_check_relation</function> iterates over all the blocks of all
+    or only the specified fork of a given relation and verifies their
+    checksums.  It returns the list of blocks for which the found checksum
+    doesn't match the expected one.  You must be a member of the
+    <literal>pg_read_server_files</literal> role to use this function.  It can
+    only be used if data checksums are enabled.  See <xref
+    linkend="app-initdb-data-checksums"/> for more information.
+   </para>
+
+  </sect2>
+
   </sect1>
 
   <sect1 id="functions-trigger">
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b8a3f46912..e266292b03 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1386,6 +1386,13 @@ LANGUAGE INTERNAL
 STRICT STABLE PARALLEL SAFE
 AS 'jsonb_path_query_first_tz';
 
+CREATE OR REPLACE FUNCTION pg_check_relation(
+  IN relation regclass DEFAULT NULL::regclass, IN fork text DEFAULT NULL::text,
+  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+  OUT expected_checksum integer, OUT found_checksum integer)
+  RETURNS SETOF record VOLATILE LANGUAGE internal AS 'pg_check_relation'
+  PARALLEL RESTRICTED;
+
 --
 -- The default permissions for functions mean that anyone can execute them.
 -- A number of functions shouldn't be executable by just anyone, but rather
diff --git a/src/backend/storage/page/checksum.c b/src/backend/storage/page/checksum.c
index e010691c9f..aec5e2ad2d 100644
--- a/src/backend/storage/page/checksum.c
+++ b/src/backend/storage/page/checksum.c
@@ -15,8 +15,429 @@
 
 #include "storage/checksum.h"
 /*
- * The actual code is in storage/checksum_impl.h.  This is done so that
- * external programs can incorporate the checksum code by #include'ing
- * that file from the exported Postgres headers.  (Compare our CRC code.)
+ * The actual checksum computation code is in storage/checksum_impl.h.  This
+ * is done so that external programs can incorporate the checksum code by
+ * #include'ing that file from the exported Postgres headers.  (Compare our
+ * CRC code.)
  */
 #include "storage/checksum_impl.h"
+
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "catalog/pg_authid_d.h"
+#include "commands/dbcommands.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/buf_internals.h"
+#include "storage/checksum.h"
+#include "storage/lockdefs.h"
+#include "utils/acl.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/lsyscache.h"
+#include "utils/rel.h"
+#include "utils/syscache.h"
+
+/*
+ * A zero checksum can never be computed, see pg_checksum_page().
+ */
+#define NoComputedChecksum	0
+
+/* ----------------
+ * The rest of this module provides a set of functions that can be used to
+ * safely check all checksums on a running cluster.
+ *
+ * Please note that those only perform standard buffered reads, and don't try
+ * to bypass or discard the operating system cache.  If you want to check the
+ * actual storage, you have to discard the operating system cache before
+ * running those functions.
+ *
+ * To avoid torn pages and possible false positives when reading data, while
+ * keeping overhead as low as possible, the following heuristics are used:
+ *
+ * - a shared LWLock is taken on the target buffer pool partition mapping, and
+ *   we detect if a block is in shared_buffers or not.  See check_get_buffer()
+ *   comments for more details about the locking strategy.
+ *
+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
+ *
+ * - if a block is otherwise found in shared_buffers, an IO lock is taken on
+ *   the block and the block is then read from storage, ignoring the block in
+ *   shared_buffers
+ *
+ * - if a block is not found in shared_buffers, the LWLock is released and the
+ *   block is read from disk without taking any lock.  If an error is detected,
+ *   the read block will be discarded and retrieved again while holding the
+ *   LWLock.  This is because an error due to concurrent write is possible but
+ *   very unlikely, so it's better to have an optimistic approach to limit
+ *   locking overhead
+ *
+ * The check can be performed using an SQL function, returning the list of
+ * problematic blocks.
+ * ----------------
+ */
+
+static bool check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+							   uint16 *chk_found);
+static void check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+									  Relation relation, ForkNumber forknum);
+static void check_delay_point(void);
+static void check_get_buffer(Relation relation, ForkNumber forknum,
+							 BlockNumber blkno, char *buffer, bool needlock,
+							 bool *checkit, bool *found_in_sb);
+
+/*
+ * Perform a checksum check on the passed page.  Returns whether the page is
+ * valid or not, and assigns the expected and found checksums in chk_expected and
+ * chk_found, respectively.  Note that a page can look like new but could be
+ * the result of a corruption.  We still check for this case, but we can't
+ * compute its checksum as pg_checksum_page() is explicitly checking for non-new
+ * pages, so NoComputedChecksum will be set in chk_found.
+ */
+static bool
+check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+				   uint16 *chk_found)
+{
+	PageHeader	hdr = (PageHeader) buffer;
+
+	Assert(chk_expected && chk_found);
+
+	if (PageIsNew(hdr))
+	{
+		if (PageIsVerified(buffer, blkno))
+		{
+			/*
+			 * If the page is really new, there won't be any checksum to be
+			 * computed or expected.
+			 */
+			*chk_expected = *chk_found = NoComputedChecksum;
+			return true;
+		}
+		else
+		{
+			/*
+			 * There's a corruption, but since this affects PageIsNew(), we
+			 * can't compute a checksum, so set NoComputedChecksum for the
+			 * expected checksum.
+			 */
+			*chk_expected = NoComputedChecksum;
+			*chk_found = hdr->pd_checksum;
+		}
+		return false;
+	}
+
+	*chk_expected = pg_checksum_page(buffer, blkno);
+	*chk_found = hdr->pd_checksum;
+
+	return (*chk_expected == *chk_found);
+}
+
+/*
+ * Perform the check on a single relation, possibly filtered with a single
+ * fork.  This function checks whether the given relation still exists, as
+ * a relation could be dropped after building the list of relations and
+ * before getting here; an error is raised in that case.
+ */
+void
+check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+					 Oid relid, ForkNumber single_forknum)
+{
+	Relation	relation = NULL;
+	ForkNumber	forknum;
+
+	/* Check if the relation (still) exists */
+	if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
+	{
+		/*
+		 * We use a standard relation_open() to acquire the initial lock.  It
+		 * means that this will block until the lock is acquired, or will
+		 * raise an ERROR if lock_timeout has been set.  If caller wants to
+		 * check multiple tables while relying on a maximum wait time, it
+		 * should process tables one by one instead of relying on a global
+		 * processing with the main SRF.
+		 */
+		relation = relation_open(relid, AccessShareLock);
+	}
+
+	if (!RelationIsValid(relation))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_TABLE),
+				 errmsg("relation with OID %u does not exist", relid)));
+
+	if (!RELKIND_HAS_STORAGE(relation->rd_rel->relkind))
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("relation \"%s\" does not have storage to be checked",
+				 quote_qualified_identifier(
+					 get_namespace_name(get_rel_namespace(relid)),
+					 get_rel_name(relid)))));
+
+	RelationOpenSmgr(relation);
+
+	for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+	{
+		if (single_forknum != InvalidForkNumber && single_forknum != forknum)
+			continue;
+
+		if (smgrexists(relation->rd_smgr, forknum))
+			check_relation_fork(tupdesc, tupstore, relation, forknum);
+	}
+
+	relation_close(relation, AccessShareLock);	/* release the lock */
+}
+
+/*
+ * For a given relation and fork, do the real work of iterating over all pages
+ * and doing the check.  Caller must hold an AccessShareLock on the given
+ * relation.
+ */
+static void
+check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+						  Relation relation, ForkNumber forknum)
+{
+	BlockNumber blkno,
+				nblocks;
+	char		buffer[BLCKSZ];
+	bool		checkit,
+				found_in_sb;
+
+	/*
+	 * We remember the number of blocks here.  Since caller must hold a lock on
+	 * the relation, we know that it won't be truncated while we're iterating
+	 * over the blocks.  Any block added after this function started won't be
+	 * checked, but this is out of scope as such pages will be flushed before
+	 * the next checkpoint's completion.
+	 */
+	nblocks = RelationGetNumberOfBlocksInFork(relation, forknum);
+
+#define SRF_COLS			5	/* Number of output arguments in the SRF */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		bool		force_lock = false;
+		uint16		chk_expected,
+					chk_found;
+		Datum		values[SRF_COLS];
+		bool		nulls[SRF_COLS];
+		int			i = 0;
+
+		memset(values, 0, sizeof(values));
+		memset(nulls, 0, sizeof(nulls));
+
+		/*
+		 * To avoid too much overhead, the buffer will be first read without
+		 * the locks that would guarantee the lack of false positives, as such
+		 * events should be quite rare.
+		 */
+Retry:
+		check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+						 &checkit, &found_in_sb);
+
+		if (!checkit)
+			continue;
+
+		if (check_buffer(buffer, blkno, &chk_expected, &chk_found))
+			continue;
+
+		/*
+		 * If we get a failure and the buffer wasn't found in shared buffers,
+		 * reread the buffer with suitable lock to avoid false positive.  See
+		 * check_get_buffer for more details.
+		 */
+		if (!found_in_sb && !force_lock)
+		{
+			force_lock = true;
+			goto Retry;
+		}
+
+		values[i++] = ObjectIdGetDatum(relation->rd_id);
+		values[i++] = Int32GetDatum(forknum);
+		values[i++] = UInt32GetDatum(blkno);
+		/*
+		 * This can happen if a corruption makes the block appear to be
+		 * PageIsNew() when it isn't actually a new page.
+		 */
+		if (chk_expected == NoComputedChecksum)
+			nulls[i++] = true;
+		else
+			values[i++] = UInt16GetDatum(chk_expected);
+		values[i++] = UInt16GetDatum(chk_found);
+
+		Assert(i == SRF_COLS);
+
+		/* Report the failure to the stat collector and the logs. */
+		pgstat_report_checksum_failure();
+		ereport(WARNING,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("invalid page in block %u of relation %s",
+						blkno,
+						relpath(relation->rd_smgr->smgr_rnode, forknum))));
+
+		/* Save the corrupted blocks in the tuplestore. */
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+}
+
+/*
+ * Check for interrupts and cost-based delay.
+ */
+static void
+check_delay_point(void)
+{
+	/* Always check for interrupts */
+	CHECK_FOR_INTERRUPTS();
+
+	if (!ChecksumCostActive || InterruptPending)
+		return;
+
+	/* Nap if appropriate */
+	if (ChecksumCostBalance >= ChecksumCostLimit)
+	{
+		int			msec;
+
+		msec = ChecksumCostDelay * ChecksumCostBalance / ChecksumCostLimit;
+		if (msec > ChecksumCostDelay * 4)
+			msec = ChecksumCostDelay * 4;
+
+		pg_usleep(msec * 1000L);
+
+		ChecksumCostBalance = 0;
+
+		/* Might have gotten an interrupt while sleeping */
+		CHECK_FOR_INTERRUPTS();
+	}
+}
+
+/*
+ * Safely read the wanted buffer from disk, dealing with possible concurrency
+ * issues.  Note that if a buffer is found dirty in shared_buffers, no read will
+ * be performed and the caller will be informed that no check should be done.
+ * We can safely ignore such buffers as they'll be written before the next
+ * checkpoint's completion.
+ *
+ * The following locks can be used in this function:
+ *
+ *   - shared LWLock on the target buffer pool partition mapping.
+ *   - IOLock on the buffer
+ *
+ * The IOLock is taken when reading the buffer from disk if it exists in
+ * shared_buffers, to avoid torn pages.
+ *
+ * If the buffer isn't in shared_buffers, it'll be read from disk without any
+ * lock unless the caller asked otherwise by setting needlock.  In that case,
+ * the read will be done while the buffer mapping partition LWLock is still
+ * being held.  Reading with this lock held avoids the unlikely but possible
+ * case where a buffer wasn't present in shared buffers when we checked but
+ * was then allocated in shared_buffers, modified and flushed concurrently
+ * before we later try to read it, leading to a false positive due to a torn
+ * page.  The caller can first read the buffer without holding the target
+ * buffer mapping partition LWLock as an optimistic approach, and reread the
+ * buffer from disk in case of error.
+ *
+ * Caller should hold an AccessShareLock on the relation.
+ */
+static void
+check_get_buffer(Relation relation, ForkNumber forknum,
+				 BlockNumber blkno, char *buffer, bool needlock, bool *checkit,
+				 bool *found_in_sb)
+{
+	BufferTag	buf_tag;		/* identity of requested block */
+	uint32		buf_hash;		/* hash value for buf_tag */
+	LWLock	   *partLock;		/* buffer partition lock for the buffer */
+	BufferDesc *bufdesc;
+	int			buf_id;
+
+	*checkit = true;
+	*found_in_sb = false;
+
+	/* Check for interrupts and take throttling into account. */
+	check_delay_point();
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(buf_tag, relation->rd_smgr->smgr_rnode.node, forknum, blkno);
+
+	/* determine its hash code and partition lock ID */
+	buf_hash = BufTableHashCode(&buf_tag);
+	partLock = BufMappingPartitionLock(buf_hash);
+
+	/* see if the block is in the buffer pool already */
+	LWLockAcquire(partLock, LW_SHARED);
+	buf_id = BufTableLookup(&buf_tag, buf_hash);
+	if (buf_id >= 0)
+	{
+		uint32		buf_state;
+
+		*found_in_sb = true;
+
+		/*
+		 * Found it.  Now, retrieve its state to know what to do with it,
+		 * releasing the buffer header spinlock immediately.  We do so to
+		 * limit overhead as much as possible.  We'll keep the shared
+		 * lightweight lock on the target buffer mapping partition, so this
+		 * buffer can't be evicted, and we'll acquire an IOLock on the buffer
+		 * if we need to read the content from disk.
+		 */
+		bufdesc = GetBufferDescriptor(buf_id);
+
+		buf_state = LockBufHdr(bufdesc);
+		UnlockBufHdr(bufdesc, buf_state);
+
+		/* Dirty pages are ignored as they'll be flushed soon. */
+		if (buf_state & BM_DIRTY)
+			*checkit = false;
+
+		/*
+		 * Read the buffer from disk, taking an IO lock to prevent torn-page
+		 * reads, in the unlikely event that it was concurrently dirtied and
+		 * flushed.
+		 */
+		if (*checkit)
+		{
+			LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
+			smgrread(relation->rd_smgr, forknum, blkno, buffer);
+			LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
+
+			/*
+			 * Add a page miss cost, as we're always reading outside the
+			 * shared buffers.
+			 */
+			ChecksumCostBalance += ChecksumCostPageMiss;
+		}
+	}
+	else if (needlock)
+	{
+		/*
+		 * Caller asked to read the buffer while we have a lock on the target
+		 * partition.
+		 */
+		smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+		/*
+		 * Add a page miss cost, as we're always reading outside the shared
+		 * buffers.
+		 */
+		ChecksumCostBalance += ChecksumCostPageMiss;
+	}
+
+	LWLockRelease(partLock);
+
+	if (*found_in_sb || needlock)
+		return;
+
+	/*
+	 * Didn't find it in the buffer pool and didn't read it while holding the
+	 * buffer mapping partition lock.  We'll have to try to read it from
+	 * disk, after releasing the target partition lock to avoid too much
+	 * overhead.  It means that it's possible to get a torn page later, so
+	 * we'll have to retry with a suitable lock in case of error to avoid a
+	 * false positive.
+	 */
+	smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+	/*
+	 * Add a page miss cost, as we're always reading outside the shared
+	 * buffers.
+	 */
+	ChecksumCostBalance += ChecksumCostPageMiss;
+}
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 13efa9338c..27dc524e00 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -22,6 +22,7 @@ OBJS = \
 	bool.o \
 	cash.o \
 	char.o \
+	checksumfuncs.o \
 	cryptohashes.o \
 	date.o \
 	datetime.o \
diff --git a/src/backend/utils/adt/checksumfuncs.c b/src/backend/utils/adt/checksumfuncs.c
new file mode 100644
index 0000000000..b53b4c0bbf
--- /dev/null
+++ b/src/backend/utils/adt/checksumfuncs.c
@@ -0,0 +1,96 @@
+/*-------------------------------------------------------------------------
+ *
+ * checksumfuncs.c
+ *	  Functions for checksum-related features such as online verification
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/checksumfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "catalog/pg_authid_d.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/checksum.h"
+#include "utils/acl.h"
+#include "utils/builtins.h"
+
+Datum
+pg_check_relation(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	ForkNumber	forknum = InvalidForkNumber;
+	Oid			relid = InvalidOid;
+	const char *forkname;
+
+	if (!DataChecksumsEnabled())
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("data checksums are not enabled in this cluster")));
+
+	if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("only superuser or a member of the pg_read_server_files role may use this function")));
+
+	if (PG_ARGISNULL(0))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("relation cannot be null")));
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not " \
+						"allowed in this context")));
+
+	/* Switch into long-lived context to construct returned data structures */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	relid = PG_GETARG_OID(0);
+
+	if (PG_NARGS() == 2 && !PG_ARGISNULL(1))
+	{
+		forkname = TextDatumGetCString(PG_GETARG_TEXT_PP(1));
+		forknum = forkname_to_number(forkname);
+	}
+
+	/* Set cost-based vacuum delay */
+	ChecksumCostActive = (ChecksumCostDelay > 0);
+	ChecksumCostBalance = 0;
+
+	check_one_relation(tupdesc, tupstore, relid, forknum);
+
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..37f63e747c 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -134,6 +134,14 @@ int			max_worker_processes = 8;
 int			max_parallel_workers = 8;
 int			MaxBackends = 0;
 
+int			ChecksumCostPageHit = 1;	/* GUC parameters for checksum check */
+int			ChecksumCostPageMiss = 10;
+int			ChecksumCostLimit = 200;
+double		ChecksumCostDelay = 0;
+
+int			ChecksumCostBalance = 0;	/* working state for checksums check */
+bool		ChecksumCostActive = false;
+
 int			VacuumCostPageHit = 1;	/* GUC parameters for vacuum */
 int			VacuumCostPageMiss = 10;
 int			VacuumCostPageDirty = 20;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 68082315ac..1dc10c5462 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -693,6 +693,8 @@ const char *const config_group_names[] =
 	gettext_noop("Resource Usage / Disk"),
 	/* RESOURCES_KERNEL */
 	gettext_noop("Resource Usage / Kernel Resources"),
+	/* RESOURCES_CHECKSUM_DELAY */
+	gettext_noop("Resource Usage / Cost-Based Checksum Verification Delay"),
 	/* RESOURCES_VACUUM_DELAY */
 	gettext_noop("Resource Usage / Cost-Based Vacuum Delay"),
 	/* RESOURCES_BGWRITER */
@@ -2376,6 +2378,36 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"checksum_cost_page_hit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost for a page found in the buffer cache."),
+			NULL
+		},
+		&ChecksumCostPageHit,
+		1, 0, 10000,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"checksum_cost_page_miss", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost for a page not found in the buffer cache."),
+			NULL
+		},
+		&ChecksumCostPageMiss,
+		10, 0, 10000,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"checksum_cost_limit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost amount available before napping."),
+			NULL
+		},
+		&ChecksumCostLimit,
+		200, 1, 10000,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_cost_page_hit", PGC_USERSET, RESOURCES_VACUUM_DELAY,
 			gettext_noop("Vacuum cost for a page found in the buffer cache."),
@@ -3498,6 +3530,17 @@ static struct config_real ConfigureNamesReal[] =
 		check_random_seed, assign_random_seed, show_random_seed
 	},
 
+	{
+		{"checksum_cost_delay", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost delay in milliseconds."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&ChecksumCostDelay,
+		0, 0, 100,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_cost_delay", PGC_USERSET, RESOURCES_VACUUM_DELAY,
 			gettext_noop("Vacuum cost delay in milliseconds."),
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7fb574f9dc..73fa7cdf51 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10741,6 +10741,14 @@
   proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
   proargnames => '{tablespace,name,size,modification}',
   prosrc => 'pg_ls_tmpdir_1arg' },
+{ oid => '9147', descr => 'check data integrity of a relation',
+  proname => 'pg_check_relation', proisstrict => 'f', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,i,o,o,o,o,o}',
+  proargnames => '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation' },
 
 # hash partitioning constraint function
 { oid => '5028', descr => 'hash partition CHECK constraint',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14fa127ab1..f2fc143872 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -246,6 +246,14 @@ extern PGDLLIMPORT int work_mem;
 extern PGDLLIMPORT int maintenance_work_mem;
 extern PGDLLIMPORT int max_parallel_maintenance_workers;
 
+extern int	ChecksumCostPageHit;
+extern int	ChecksumCostPageMiss;
+extern int	ChecksumCostLimit;
+extern double ChecksumCostDelay;
+
+extern int	ChecksumCostBalance;
+extern bool ChecksumCostActive;
+
 extern int	VacuumCostPageHit;
 extern int	VacuumCostPageMiss;
 extern int	VacuumCostPageDirty;
diff --git a/src/include/storage/checksum.h b/src/include/storage/checksum.h
index 6e77744cbc..29ca52d8c9 100644
--- a/src/include/storage/checksum.h
+++ b/src/include/storage/checksum.h
@@ -13,8 +13,15 @@
 #ifndef CHECKSUM_H
 #define CHECKSUM_H
 
+#include "postgres.h"
+
+#include "access/tupdesc.h"
+#include "common/relpath.h"
 #include "storage/block.h"
+#include "utils/tuplestore.h"
 
+extern void check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+							   Oid relid, ForkNumber single_forknum);
 /*
  * Compute the checksum for a Postgres page.  The page must be aligned on a
  * 4-byte boundary.
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 454c2df487..84965d3f09 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -62,6 +62,7 @@ enum config_group
 	RESOURCES_MEM,
 	RESOURCES_DISK,
 	RESOURCES_KERNEL,
+	RESOURCES_CHECKSUM_DELAY,
 	RESOURCES_VACUUM_DELAY,
 	RESOURCES_BGWRITER,
 	RESOURCES_ASYNCHRONOUS,
diff --git a/src/test/Makefile b/src/test/Makefile
index efb206aa75..832868099f 100644
--- a/src/test/Makefile
+++ b/src/test/Makefile
@@ -12,7 +12,8 @@ subdir = src/test
 top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS = perl regress isolation modules authentication recovery subscription
+SUBDIRS = perl regress isolation modules authentication check_relation \
+	  recovery subscription
 
 # Test suites that are not safe by default but can be run if selected
 # by the user via the whitespace-separated list in variable
diff --git a/src/test/check_relation/.gitignore b/src/test/check_relation/.gitignore
new file mode 100644
index 0000000000..871e943d50
--- /dev/null
+++ b/src/test/check_relation/.gitignore
@@ -0,0 +1,2 @@
+# Generated by test suite
+/tmp_check/
diff --git a/src/test/check_relation/Makefile b/src/test/check_relation/Makefile
new file mode 100644
index 0000000000..792098fa65
--- /dev/null
+++ b/src/test/check_relation/Makefile
@@ -0,0 +1,23 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/test/check_relation
+#
+# Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/test/check_relation/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/test/check_relation
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
+clean distclean maintainer-clean:
+	rm -rf tmp_check
diff --git a/src/test/check_relation/README b/src/test/check_relation/README
new file mode 100644
index 0000000000..415c4b21ad
--- /dev/null
+++ b/src/test/check_relation/README
@@ -0,0 +1,23 @@
+src/test/check_relation/README
+
+Regression tests for online checksums verification
+==================================================
+
+This directory contains a test suite for online checksums verification.
+
+Running the tests
+=================
+
+NOTE: You must have given the --enable-tap-tests argument to configure.
+
+Run
+    make check
+or
+    make installcheck
+You can use "make installcheck" if you previously did "make install".
+In that case, the code in the installation tree is tested.  With
+"make check", a temporary installation tree is built from the current
+sources and then tested.
+
+Either way, this test initializes, starts, and stops a test Postgres
+cluster.
diff --git a/src/test/check_relation/t/01_checksums_check.pl b/src/test/check_relation/t/01_checksums_check.pl
new file mode 100644
index 0000000000..af67252ea1
--- /dev/null
+++ b/src/test/check_relation/t/01_checksums_check.pl
@@ -0,0 +1,276 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 59;
+
+our $CHECKSUM_UINT16_OFFSET = 4;
+our $PD_UPPER_UINT16_OFFSET = 7;
+our $BLOCKSIZE;
+our $TOTAL_NB_ERR = 0;
+
+sub get_block
+{
+	my ($filename, $blkno) = @_;
+	my $block;
+
+	open(my $infile, '<', $filename) or die;
+	binmode($infile);
+	sysseek($infile, $blkno * $BLOCKSIZE, 0) or die($!);
+	my $success = sysread($infile, $block, $BLOCKSIZE);
+	die($!) if not defined $success;
+
+	close($infile);
+
+	return($block);
+}
+
+sub overwrite_block
+{
+	my ($filename, $block, $blkno) = @_;
+
+	open(my $outfile, '+<', $filename) or die;
+	binmode($outfile);
+
+	sysseek($outfile, $blkno * $BLOCKSIZE, 0) or die($!);
+
+	my $nb = syswrite($outfile, $block, $BLOCKSIZE);
+
+	die($!) if not defined $nb;
+	die("Write error") if ($nb != $BLOCKSIZE);
+
+	close($outfile);
+}
+
+sub get_uint16_from_page
+{
+	my ($block, $offset) = @_;
+
+	return (unpack("S*", $block))[$offset];
+}
+
+sub set_uint16_to_page
+{
+	my ($block, $data, $offset) = @_;
+
+	my $pack = pack("S", $data);
+
+	# vec() with 16-bit or wider elements won't preserve endianness
+	vec($block, 2*$offset, 8) = (unpack('C*', $pack))[0];
+	vec($block, (2*$offset) + 1, 8) = (unpack('C*', $pack))[1];
+
+	return $block;
+}
+
+sub check_checksums_call
+{
+	my ($node, $relname) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM pg_catalog.pg_check_relation('$relname')"
+	);
+
+	return ($stderr eq '');
+}
+
+sub check_checksums_nb_error
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+        . " FROM (SELECT pg_catalog.pg_check_relation(oid)"
+        . "   FROM pg_class WHERE relkind in ('r', 'i', 'm')) AS s"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+	is($stdout, $nb, "Should have $nb error");
+
+	$TOTAL_NB_ERR += $nb;
+}
+
+sub check_pg_stat_database_nb_error
+{
+	my ($node) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT "
+		. " sum(checksum_failures)"
+		. " FROM pg_catalog.pg_stat_database"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	is($stderr, '', 'Function should run successfully');
+	is($stdout, $TOTAL_NB_ERR, "Should have $TOTAL_NB_ERR error");
+}
+
+sub get_checksums_errors
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+		. " relid::regclass::text, forknum, failed_blocknum,"
+		. " expected_checksum, found_checksum"
+		. " FROM (SELECT (pg_catalog.pg_check_relation(oid)).*"
+        . "   FROM pg_class WHERE relkind in ('r','i', 'm')) AS s"
+	);
+
+	is($cmdret, '0', 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+
+	$TOTAL_NB_ERR += $nb;
+
+	return $stdout;
+}
+
+# This function performs various tests by modifying the specified block at
+# the specified uint16 offset, checking that the corruption is correctly
+# detected, and finally restoring the specified block to its original content.
+sub corrupt_and_test_block
+{
+	my ($node, $filename, $blkno, $offset, $fake_data) = @_;
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+
+	check_pg_stat_database_nb_error($node);
+
+	$node->stop();
+
+	my $original_block = get_block($filename, $blkno);
+	my $original_data = get_uint16_from_page($original_block, $offset);
+
+	isnt($original_data, $fake_data,
+		"The fake data at offset $offset should be different"
+		. " from the existing one");
+
+	my $new_block = set_uint16_to_page($original_block, $fake_data, $offset);
+	isnt($original_data, get_uint16_from_page($new_block, $offset),
+		"The fake data at offset $offset should have been changed in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+
+	my $written_data = get_uint16_from_page(get_block($filename, $blkno), $offset);
+	isnt($original_data, $written_data,
+		"The data written at offset $offset should be different"
+		. " from the original one");
+	is(get_uint16_from_page($new_block, $offset), $written_data,
+		"The data written at offset $offset should be the same"
+		. " as the one in memory");
+	is($written_data, $fake_data,
+		"The data written at offset $offset should be the one"
+		. " we wanted to write");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 1, qr/invalid page in block $blkno/);
+
+	my $expected_checksum;
+	my $found_checksum = get_uint16_from_page($new_block,
+		$CHECKSUM_UINT16_OFFSET);
+	if ($offset == $PD_UPPER_UINT16_OFFSET)
+	{
+		# A checksum can't be computed if it's detected as PageIsNew(), so the
+		# function returns NULL for the computed checksum
+		$expected_checksum = '';
+	}
+	else
+	{
+		$expected_checksum = get_uint16_from_page($original_block,
+			$CHECKSUM_UINT16_OFFSET);
+	}
+
+	my $det = get_checksums_errors($node, 1, qr/invalid page in block $blkno/);
+	is($det, "t1|0|0|$expected_checksum|$found_checksum",
+		"The checksums error for modification at offset $offset"
+		. " should be detected");
+
+	$node->stop();
+
+	$new_block = set_uint16_to_page($original_block, $original_data, $offset);
+	is($original_data, get_uint16_from_page($new_block, $offset),
+		"The data at offset $offset should have been restored in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+	is($original_data, get_uint16_from_page(get_block($filename, $blkno),
+			$offset),
+		"The data at offset $offset should have been restored on disk");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+}
+
+if (exists $ENV{MY_PG_REGRESS})
+{
+	$ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS};
+}
+
+my $node = get_new_node('main');
+
+my %params;
+$params{'extra'} = ['--data-checksums'];
+$node->init(%params);
+
+$node->start();
+
+$ENV{PGOPTIONS} = '--client-min-messages=WARNING';
+
+my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_checksums')");
+
is($stdout, 'on', 'Data checksums should be enabled');
+
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('block_size')");
+
+$BLOCKSIZE = $stdout;
+
+$node->safe_psql(
+	'postgres', q|
+	CREATE TABLE public.t1(id integer);
+	CREATE INDEX t1_id_idx ON public.t1 (id);
+	INSERT INTO public.t1 SELECT generate_series(1, 100);
+	CREATE VIEW public.v1 AS SELECT * FROM t1;
+	CREATE MATERIALIZED VIEW public.mv1 AS SELECT * FROM t1;
+	CREATE SEQUENCE public.s1;
+	CREATE UNLOGGED TABLE public.u_t1(id integer);
+	CREATE INDEX u_t1_id_idx ON public.u_t1 (id);
+	INSERT INTO public.u_t1 SELECT generate_series(1, 100);
+	CHECKPOINT;
+|);
+
+# Check sane behavior on various objects type, including those that don't have
+# a storage.
+is(check_checksums_call($node, 't1'), '1', 'Can check a table');
+is(check_checksums_call($node, 't1_id_idx'), '1', 'Can check an index');
+is(check_checksums_call($node, 'v1'), '', 'Cannot check a view');
+is(check_checksums_call($node, 'mv1'), '1', 'Can check a materialized view');
+is(check_checksums_call($node, 's1'), '1', 'Can check a sequence');
+is(check_checksums_call($node, 'u_t1'), '1', 'Can check an unlogged table');
+is(check_checksums_call($node, 'u_t1_id_idx'), '1', 'Can check an unlogged index');
+
+# get the underlying heap absolute path
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_directory') || '/' || pg_relation_filepath('t1')"
+);
+
isnt($stdout, '', 'A relation file path should be returned');
+
+my $filename = $stdout;
+
+check_checksums_nb_error($node, 0, qr/^$/);
+
+check_pg_stat_database_nb_error($node);
+
+my $fake_uint16 = hex '0x0000';
+
+# Test with a modified checksum.  We use a zero checksum here as it's the only
+# one that cannot exist on a checksummed page.  We also don't have an easy way
+# to compute what the checksum would be after a modification in a random place
+# in the block.
+corrupt_and_test_block($node, $filename, 0, $CHECKSUM_UINT16_OFFSET,
+	$fake_uint16);
+
# Test corruption making the block look like it's PageIsNew().
+corrupt_and_test_block($node, $filename, 0, $PD_UPPER_UINT16_OFFSET,
+	$fake_uint16);
-- 
2.20.1
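
As a side note for reviewers: the napping computation in check_delay_point()
mirrors the vacuum cost-based delay.  Modeled in isolation as a hypothetical
standalone helper (checksum_delay_ms is an invented name, not part of the
patch), the logic behaves like this:

```c
#include <assert.h>

/*
 * Hypothetical model of check_delay_point()'s napping logic, for
 * illustration only: once the accumulated cost reaches the limit, nap
 * proportionally to the overshoot, capped at four times the configured
 * delay, exactly as vacuum's cost-based delay does.
 */
static int
checksum_delay_ms(int cost_balance, int cost_limit, int cost_delay)
{
	int		msec;

	if (cost_delay <= 0 || cost_balance < cost_limit)
		return 0;				/* throttling inactive, or budget left */

	msec = cost_delay * cost_balance / cost_limit;
	if (msec > cost_delay * 4)
		msec = cost_delay * 4;	/* cap the nap */

	return msec;
}
```

With the default limit of 200 and a delay of 10ms, a balance of exactly 200
naps for 10ms, while a large overshoot is capped at 40ms.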

#16Michael Paquier
michael@paquier.xyz
In reply to: Julien Rouhaud (#12)
Re: Online checksums verification in the backend

On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote:

On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote:

With a large amount of
shared buffer eviction you actually increase the risk of torn page
reads. Instead of a logic relying on partition mapping locks, which
could be unwise on performance grounds, did you consider different
approaches? For example a kind of pre-emptive lock on the page in
storage to prevent any shared buffer operation to happen while the
block is read from storage, that would act like a barrier.

Even with a workload having a large shared_buffers eviction pattern, I don't
think that there's a high probability of hitting a torn page. Unless I'm
mistaken it can only happen if all those steps happen concurrently to doing the
block read just after releasing the LWLock:

- postgres read the same block in shared_buffers (including all the locking)
- dirties it
- writes part of the page

It's certainly possible, but it seems so unlikely that the optimistic lock-less
approach seems like a very good tradeoff.

Having false reports in this area could be very confusing for the
user. That's for example possible now with checksum verification and
base backups.

I guess that this leads to the fact that this function may be better as
a contrib module, with the addition of some better-suited APIs in core
(see paragraph above).

Below?

Above. This thought more precisely:

For example a kind of pre-emptive lock on the page in
storage to prevent any shared buffer operation to happen while the
block is read from storage, that would act like a barrier.

For the record when I first tested that feature I did try to check dirty
blocks, and it seemed that dirty blocks of shared relation were sometimes
wrongly reported as corrupted. I didn't try to investigate more though.

Hmm. It would be good to look at that, correct verification of shared
relations matter.

+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
Not sure that the indentation is going to react well on that part of
the patch, perhaps it would be better to add some "/*-------" at the
beginning and end of the comment block to tell pgindent to ignore this
part?

Ok. Although I think only the beginning comment is needed?

From src/tools/pgindent/README:
"pgindent will reflow any comment block that's not at the left margin.
If this messes up manual formatting that ought to be preserved,
protect the comment block with some dashes:"

/*----------
* Text here will not be touched by pgindent.
*----------
*/

Based on the feedback gathered on this thread, I guess that you should
have a SRF returning the list of broken blocks, as well as NOTICE
messages.

The current patch has an SRF and a WARNING message, do you want an additional
NOTICE message or downgrade the existing one?

Right, not sure which one is better, for zero_damaged_pages a WARNING
is used.
--
Michael

#17Julien Rouhaud
rjuju123@gmail.com
In reply to: Michael Paquier (#16)
Re: Online checksums verification in the backend

On Wed, Mar 18, 2020 at 01:20:47PM +0900, Michael Paquier wrote:

On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote:

On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote:

With a large amount of
shared buffer eviction you actually increase the risk of torn page
reads. Instead of a logic relying on partition mapping locks, which
could be unwise on performance grounds, did you consider different
approaches? For example a kind of pre-emptive lock on the page in
storage to prevent any shared buffer operation to happen while the
block is read from storage, that would act like a barrier.

Even with a workload having a large shared_buffers eviction pattern, I don't
think that there's a high probability of hitting a torn page. Unless I'm
mistaken it can only happen if all those steps happen concurrently to doing the
block read just after releasing the LWLock:

- postgres read the same block in shared_buffers (including all the locking)
- dirties it
- writes part of the page

It's certainly possible, but it seems so unlikely that the optimistic lock-less
approach seems like a very good tradeoff.

Having false reports in this area could be very confusing for the
user. That's for example possible now with checksum verification and
base backups.

I agree, however this shouldn't be the case here, as the block will be
rechecked a second time while holding the proper lock in case of a possible
false positive, before being reported as corrupted. So the only downside is
checking twice a corrupted block that's not found in shared buffers (or that
is concurrently loaded/modified/half flushed). As the number of corrupted or
concurrently loaded/modified/half-flushed blocks should usually be close to
zero, it seems worthwhile to have a lockless check first for performance
reasons.
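
To illustrate the tradeoff, here is a self-contained sketch of the optimistic
recheck pattern (all names are invented for illustration, this is not the
patch's code, and the toy checksum stands in for pg_checksum_page()): an
unlocked read is verified first, and only a verification failure triggers a
re-read "under the lock", so a torn page seen by the lock-free read is never
reported as corrupted.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 64			/* toy page, not BLCKSZ */

/*
 * Toy checksum standing in for pg_checksum_page(); the first two bytes of
 * the page hold the stored checksum and are skipped when computing it.
 */
static uint16_t
toy_checksum(const uint8_t *page)
{
	uint32_t	sum = 17;

	for (int i = 2; i < PAGE_SIZE; i++)
		sum = (sum * 31 + page[i]) & 0xffff;
	return (uint16_t) sum;
}

/*
 * Simulated storage: an unlocked read may observe a torn (half-written)
 * page once, while a read done "under the lock" always sees consistent
 * data, mimicking the race this thread discusses.
 */
struct storage
{
	uint8_t		stable[PAGE_SIZE];	/* consistent on-disk contents */
	uint8_t		torn[PAGE_SIZE];	/* half-written contents */
	bool		serve_torn_once;	/* next unlocked read sees torn page */
};

static void
read_block(struct storage *s, uint8_t *buf, bool locked)
{
	if (!locked && s->serve_torn_once)
	{
		memcpy(buf, s->torn, PAGE_SIZE);
		s->serve_torn_once = false;
		return;
	}
	memcpy(buf, s->stable, PAGE_SIZE);
}

static bool
page_ok(const uint8_t *buf)
{
	uint16_t	stored;

	memcpy(&stored, buf, 2);
	return stored == toy_checksum(buf);
}

/*
 * Optimistic verification: lock-free read first; only on failure re-read
 * under the lock, and report corruption only if that also fails.
 */
static bool
block_is_corrupted(struct storage *s)
{
	uint8_t		buf[PAGE_SIZE];

	read_block(s, buf, false);	/* cheap, lock-free attempt */
	if (page_ok(buf))
		return false;
	read_block(s, buf, true);	/* re-read while "holding the lock" */
	return !page_ok(buf);
}
```

In this model a transiently torn page costs one extra locked read, while a
genuinely corrupted block fails both checks and is reported.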

For the record when I first tested that feature I did try to check dirty
blocks, and it seemed that dirty blocks of shared relation were sometimes
wrongly reported as corrupted. I didn't try to investigate more though.

Hmm. It would be good to look at that, correct verification of shared
relations matter.

I'll try to investigate, but checking non-dirty blocks of shared relations
does work as intended.

+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
Not sure that the indentation is going to react well on that part of
the patch, perhaps it would be better to add some "/*-------" at the
beginning and end of the comment block to tell pgindent to ignore this
part?

Ok. Although I think only the beginning comment is needed?

From src/tools/pgindent/README:
"pgindent will reflow any comment block that's not at the left margin.
If this messes up manual formatting that ought to be preserved,
protect the comment block with some dashes:"

/*----------
* Text here will not be touched by pgindent.
*----------
*/

For instance the block comment in gen_partprune_steps_internal() disagrees.
Anyway, I added both, as all the nearby code does that for overall function
comments.

#18Julien Rouhaud
rjuju123@gmail.com
In reply to: Michael Paquier (#16)
Re: Online checksums verification in the backend

On Wed, Mar 18, 2020 at 01:20:47PM +0900, Michael Paquier wrote:

On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote:

On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote:

Based on the feedback gathered on this thread, I guess that you should
have a SRF returning the list of broken blocks, as well as NOTICE
messages.

The current patch has an SRF and a WARNING message; do you want an additional
NOTICE message, or to downgrade the existing one?

Right, not sure which one is better, for zero_damaged_pages a WARNING
is used.

Sorry, I forgot to answer that. IMHO a WARNING is better here, as we're talking
about data corruption. Also, a WARNING will be reported to both the client and
the server logs, which sounds like a good thing.

#19Julien Rouhaud
rjuju123@gmail.com
In reply to: Julien Rouhaud (#17)
1 attachment(s)
Re: Online checksums verification in the backend

On Wed, Mar 18, 2020 at 07:06:19AM +0100, Julien Rouhaud wrote:

On Wed, Mar 18, 2020 at 01:20:47PM +0900, Michael Paquier wrote:

On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote:

On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote:

With a large amount of
shared buffer eviction you actually increase the risk of torn page
reads. Instead of a logic relying on partition mapping locks, which
could be unwise on performance grounds, did you consider different
approaches? For example a kind of pre-emptive lock on the page in
storage to prevent any shared buffer operation to happen while the
block is read from storage, that would act like a barrier.

Even with a workload having a large shared_buffers eviction pattern, I don't
think that there's a high probability of hitting a torn page. Unless I'm
mistaken, it can only happen if all of these steps happen concurrently with
the block read, just after releasing the LWLock:

- postgres reads the same block into shared_buffers (including all the locking)
- dirties it
- writes part of the page

It's certainly possible, but it seems so unlikely that the optimistic
lock-less approach looks like a very good tradeoff.

Having false reports in this area could be very confusing for the
user. That's for example possible now with checksum verification and
base backups.

I agree, however this shouldn't be the case here, as the block will be
rechecked a second time while holding the proper lock before being reported as
corrupted, in case of a possible false positive. So the only downside is to
check twice a corrupted block that's not found in shared buffers (or
concurrently loaded/modified/half flushed). As the number of corrupted or
concurrently loaded/modified/half flushed blocks should usually be close to
zero, it seems worthwhile to have a lockless check first for performance
reasons.

I just noticed some dumb mistakes while adding the new GUCs. v5 attached to
fix that, no other changes.

Attachments:

v5-0001-Add-a-pg_check_relation-function.patchtext/plain; charset=us-asciiDownload
From 6bea507c5dbd7bff862c75b93fc023f0f33aba99 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <jrouhaud@vmware.com>
Date: Mon, 4 Nov 2019 08:40:23 +0100
Subject: [PATCH v5] Add a pg_check_relation() function.

This function checks the validity of the checksums for all non-dirty blocks of
a given relation, and optionally of a given fork, and returns the list of all
blocks that don't match, along with the expected and found checksums.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada
Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 doc/src/sgml/func.sgml                        |  44 ++
 src/backend/catalog/system_views.sql          |   7 +
 src/backend/storage/page/checksum.c           | 427 +++++++++++++++++-
 src/backend/utils/adt/Makefile                |   1 +
 src/backend/utils/adt/checksumfuncs.c         |  96 ++++
 src/backend/utils/init/globals.c              |   8 +
 src/backend/utils/misc/guc.c                  |  43 ++
 src/include/catalog/pg_proc.dat               |   8 +
 src/include/miscadmin.h                       |   8 +
 src/include/storage/checksum.h                |   7 +
 src/include/utils/guc_tables.h                |   1 +
 src/test/Makefile                             |   3 +-
 src/test/check_relation/.gitignore            |   2 +
 src/test/check_relation/Makefile              |  23 +
 src/test/check_relation/README                |  23 +
 .../check_relation/t/01_checksums_check.pl    | 276 +++++++++++
 16 files changed, 973 insertions(+), 4 deletions(-)
 create mode 100644 src/backend/utils/adt/checksumfuncs.c
 create mode 100644 src/test/check_relation/.gitignore
 create mode 100644 src/test/check_relation/Makefile
 create mode 100644 src/test/check_relation/README
 create mode 100644 src/test/check_relation/t/01_checksums_check.pl

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index fc4d7f0f78..4baa2bb5e9 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -21819,6 +21819,50 @@ SELECT (pg_stat_file('filename')).modification;
 
   </sect2>
 
+  <sect2 id="functions-data-sanity">
+   <title>Data Sanity Functions</title>
+
+   <para>
+    The functions shown in <xref linkend="functions-data-sanity-table"/>
+    provide a means to check the health of data files in a cluster.
+   </para>
+
+   <table id="functions-data-sanity-table">
+    <title>Data Sanity Functions</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry>
+        <literal><function>pg_check_relation(<parameter>relation</parameter> <type>oid</type>, <parameter>fork</parameter> <type>text</type>)</function></literal>
+       </entry>
+       <entry><type>setof record</type></entry>
+       <entry>Validate the checksums for all blocks of a given relation, and
+       optionally the given fork.</entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <indexterm>
+    <primary>pg_check_relation</primary>
+   </indexterm>
+   <para>
+    <function>pg_check_relation</function> iterates over all the blocks of a
+    given relation, or only those of the specified fork, and verifies their
+    checksums.  It returns the list of blocks for which the found checksum
+    doesn't match the expected one.  You must be a member of the
+    <literal>pg_read_server_files</literal> role to use this function.  It can
+    only be used if data checksums are enabled.  See <xref
+    linkend="app-initdb-data-checksums"/> for more information.
+   </para>
+
+  </sect2>
+
   </sect1>
 
   <sect1 id="functions-trigger">
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b8a3f46912..e266292b03 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1386,6 +1386,13 @@ LANGUAGE INTERNAL
 STRICT STABLE PARALLEL SAFE
 AS 'jsonb_path_query_first_tz';
 
+CREATE OR REPLACE FUNCTION pg_check_relation(
+  IN relation regclass DEFAULT NULL::regclass, IN fork text DEFAULT NULL::text,
+  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+  OUT expected_checksum integer, OUT found_checksum integer)
+  RETURNS SETOF record VOLATILE LANGUAGE internal AS 'pg_check_relation'
+  PARALLEL RESTRICTED;
+
 --
 -- The default permissions for functions mean that anyone can execute them.
 -- A number of functions shouldn't be executable by just anyone, but rather
diff --git a/src/backend/storage/page/checksum.c b/src/backend/storage/page/checksum.c
index e010691c9f..aec5e2ad2d 100644
--- a/src/backend/storage/page/checksum.c
+++ b/src/backend/storage/page/checksum.c
@@ -15,8 +15,429 @@
 
 #include "storage/checksum.h"
 /*
- * The actual code is in storage/checksum_impl.h.  This is done so that
- * external programs can incorporate the checksum code by #include'ing
- * that file from the exported Postgres headers.  (Compare our CRC code.)
+ * The actual checksum computation code is in storage/checksum_impl.h.  This
+ * is done so that external programs can incorporate the checksum code by
+ * #include'ing that file from the exported Postgres headers.  (Compare our
+ * CRC code.)
  */
 #include "storage/checksum_impl.h"
+
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "catalog/pg_authid_d.h"
+#include "commands/dbcommands.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/buf_internals.h"
+#include "storage/checksum.h"
+#include "storage/lockdefs.h"
+#include "utils/acl.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/lsyscache.h"
+#include "utils/rel.h"
+#include "utils/syscache.h"
+
+/*
+ * A zero checksum can never be computed, see pg_checksum_page().
+ */
+#define NoComputedChecksum	0
+
+/* ----------------
+ * The rest of this module provides a set of functions that can be used to
+ * safely check all checksums on a running cluster.
+ *
+ * Please note that those only perform standard buffered reads, and don't try
+ * to bypass or discard the operating system cache.  If you want to check the
+ * actual storage, you have to discard the operating system cache before
+ * running those functions.
+ *
+ * To avoid torn pages and possible false positives when reading data, while
+ * keeping overhead as low as possible, the following heuristics are used:
+ *
+ * - a shared LWLock is taken on the target buffer pool partition mapping, and
+ *   we detect if a block is in shared_buffers or not.  See check_get_buffer()
+ *   comments for more details about the locking strategy.
+ *
+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
+ *
+ * - if a block is otherwise found in shared_buffers, an IO lock is taken on
+ *   the block and the block is then read from storage, ignoring the block in
+ *   shared_buffers
+ *
+ * - if a block is not found in shared_buffers, the LWLock is released and the
+ *   block is read from disk without taking any lock.  If an error is detected,
+ *   the read block will be discarded and retrieved again while holding the
+ *   LWLock.  This is because an error due to concurrent write is possible but
+ *   very unlikely, so it's better to have an optimistic approach to limit
+ *   locking overhead
+ *
+ * The check can be performed using an SQL function, returning the list of
+ * problematic blocks.
+ * ----------------
+ */
+
+static bool check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+							   uint16 *chk_found);
+static void check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+									  Relation relation, ForkNumber forknum);
+static void check_delay_point(void);
+static void check_get_buffer(Relation relation, ForkNumber forknum,
+							 BlockNumber blkno, char *buffer, bool needlock,
+							 bool *checkit, bool *found_in_sb);
+
+/*
+ * Perform a checksum check on the passed page.  Returns whether the page is
+ * valid or not, and assigns the expected and found checksums to chk_expected
+ * and chk_found, respectively.  Note that a page can look new but could be
+ * the result of a corruption.  We still check for this case, but we can't
+ * compute its checksum as pg_checksum_page() explicitly checks for non-new
+ * pages, so NoComputedChecksum will be set in chk_expected.
+ */
+static bool
+check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+				   uint16 *chk_found)
+{
+	PageHeader	hdr = (PageHeader) buffer;
+
+	Assert(chk_expected && chk_found);
+
+	if (PageIsNew(hdr))
+	{
+		if (PageIsVerified(buffer, blkno))
+		{
+			/*
+			 * If the page is really new, there won't be any checksum to be
+			 * computed or expected.
+			 */
+			*chk_expected = *chk_found = NoComputedChecksum;
+			return true;
+		}
+		else
+		{
+			/*
+			 * There is corruption, but since it affects PageIsNew, we
+			 * can't compute a checksum, so set NoComputedChecksum for the
+			 * expected checksum.
+			 */
+			*chk_expected = NoComputedChecksum;
+			*chk_found = hdr->pd_checksum;
+		}
+		return false;
+	}
+
+	*chk_expected = pg_checksum_page(buffer, blkno);
+	*chk_found = hdr->pd_checksum;
+
+	return (*chk_expected == *chk_found);
+}
+
+/*
+ * Perform the check on a single relation, possibly filtered with a single
+ * fork.  This function will check if the given relation exists or not, as
+ * a relation could be dropped after checking for the list of relations and
+ * before getting here, and we don't want to error out in this case.
+ */
+void
+check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+					 Oid relid, ForkNumber single_forknum)
+{
+	Relation	relation = NULL;
+	ForkNumber	forknum;
+
+	/* Check if the relation (still) exists */
+	if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
+	{
+		/*
+		 * We use a standard relation_open() to acquire the initial lock.  It
+		 * means that this will block until the lock is acquired, or will
+		 * raise an ERROR if lock_timeout has been set.  If the caller wants
+		 * to check multiple tables while relying on a maximum wait time, it
+		 * should process tables one by one instead of relying on a single
+		 * global pass with the main SRF.
+		 */
+		relation = relation_open(relid, AccessShareLock);
+	}
+
+	if (!RelationIsValid(relation))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_TABLE),
+				 errmsg("relation with OID %u does not exist", relid)));
+
+	if (!RELKIND_HAS_STORAGE(relation->rd_rel->relkind))
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("Relation \"%s\" does not have storage to be checked.",
+				 quote_qualified_identifier(
+					 get_namespace_name(get_rel_namespace(relid)),
+					 get_rel_name(relid)))));
+
+	RelationOpenSmgr(relation);
+
+	for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+	{
+		if (single_forknum != InvalidForkNumber && single_forknum != forknum)
+			continue;
+
+		if (smgrexists(relation->rd_smgr, forknum))
+			check_relation_fork(tupdesc, tupstore, relation, forknum);
+	}
+
+	relation_close(relation, AccessShareLock);	/* release the lock */
+}
+
+/*
+ * For a given relation and fork, do the real work of iterating over all pages
+ * and doing the check.  Caller must hold an AccessShareLock on the given
+ * relation.
+ */
+static void
+check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+						  Relation relation, ForkNumber forknum)
+{
+	BlockNumber blkno,
+				nblocks;
+	char		buffer[BLCKSZ];
+	bool		checkit,
+				found_in_sb;
+
+	/*
+	 * We remember the number of blocks here.  Since caller must hold a lock on
+	 * the relation, we know that it won't be truncated while we're iterating
+	 * over the blocks.  Any block added after this function started won't be
+	 * checked, but this is out of scope as such pages will be flushed before
+	 * the next checkpoint's completion.
+	 */
+	nblocks = RelationGetNumberOfBlocksInFork(relation, forknum);
+
+#define SRF_COLS			5	/* Number of output arguments in the SRF */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		bool		force_lock = false;
+		uint16		chk_expected,
+					chk_found;
+		Datum		values[SRF_COLS];
+		bool		nulls[SRF_COLS];
+		int			i = 0;
+
+		memset(values, 0, sizeof(values));
+		memset(nulls, 0, sizeof(nulls));
+
+		/*
+		 * To avoid too much overhead, the buffer will be first read without
+		 * the locks that would guarantee the lack of false positives, as such
+		 * events should be quite rare.
+		 */
+Retry:
+		check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+						 &checkit, &found_in_sb);
+
+		if (!checkit)
+			continue;
+
+		if (check_buffer(buffer, blkno, &chk_expected, &chk_found))
+			continue;
+
+		/*
+		 * If we get a failure and the buffer wasn't found in shared buffers,
+		 * reread the buffer with suitable lock to avoid false positive.  See
+		 * check_get_buffer for more details.
+		 */
+		if (!found_in_sb && !force_lock)
+		{
+			force_lock = true;
+			goto Retry;
+		}
+
+		values[i++] = ObjectIdGetDatum(relation->rd_id);
+		values[i++] = Int32GetDatum(forknum);
+		values[i++] = UInt32GetDatum(blkno);
+		/*
+		 * This can happen if a corruption makes the block appear to be
+		 * PageIsNew() even though it isn't a new page.
+		 */
+		if (chk_expected == NoComputedChecksum)
+			nulls[i++] = true;
+		else
+			values[i++] = UInt16GetDatum(chk_expected);
+		values[i++] = UInt16GetDatum(chk_found);
+
+		Assert(i == SRF_COLS);
+
+		/* Report the failure to the stat collector and the logs. */
+		pgstat_report_checksum_failure();
+		ereport(WARNING,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("invalid page in block %u of relation %s",
+						blkno,
+						relpath(relation->rd_smgr->smgr_rnode, forknum))));
+
+		/* Save the corrupted blocks in the tuplestore. */
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+}
+
+/*
+ * Check for interrupts and cost-based delay.
+ */
+static void
+check_delay_point(void)
+{
+	/* Always check for interrupts */
+	CHECK_FOR_INTERRUPTS();
+
+	if (!ChecksumCostActive || InterruptPending)
+		return;
+
+	/* Nap if appropriate */
+	if (ChecksumCostBalance >= ChecksumCostLimit)
+	{
+		int			msec;
+
+		msec = ChecksumCostDelay * ChecksumCostBalance / ChecksumCostLimit;
+		if (msec > ChecksumCostDelay * 4)
+			msec = ChecksumCostDelay * 4;
+
+		pg_usleep(msec * 1000L);
+
+		ChecksumCostBalance = 0;
+
+		/* Might have gotten an interrupt while sleeping */
+		CHECK_FOR_INTERRUPTS();
+	}
+}
+
+/*
+ * Safely read the wanted buffer from disk, dealing with possible concurrency
+ * issues.  Note that if a buffer is found dirty in shared_buffers, no read
+ * will be performed and the caller will be informed that no check should be
+ * done.  We can safely ignore such buffers as they'll be written before the
+ * next checkpoint's completion.
+ *
+ * The following locks can be used in this function:
+ *
+ *   - shared LWLock on the target buffer pool partition mapping.
+ *   - IOLock on the buffer
+ *
+ * The IOLock is taken when reading the buffer from disk if it exists in
+ * shared_buffers, to avoid torn pages.
+ *
+ * If the buffer isn't in shared_buffers, it'll be read from disk without any
+ * lock unless the caller asked otherwise by setting needlock.  In this case,
+ * the read will be done while the buffer mapping partition LWLock is still
+ * being held.  Reading with this lock avoids the unlikely but possible case
+ * that a buffer wasn't present in shared buffers when we checked but was then
+ * allocated in shared_buffers, modified and flushed concurrently before we
+ * later try to read it, leading to a false positive due to a torn page.  The
+ * caller can first read the buffer without holding the target buffer mapping
+ * partition LWLock, as an optimistic approach, and reread the buffer from
+ * disk in case of error.
+ *
+ * Caller should hold an AccessShareLock on the Relation
+ */
+static void
+check_get_buffer(Relation relation, ForkNumber forknum,
+				 BlockNumber blkno, char *buffer, bool needlock, bool *checkit,
+				 bool *found_in_sb)
+{
+	BufferTag	buf_tag;		/* identity of requested block */
+	uint32		buf_hash;		/* hash value for buf_tag */
+	LWLock	   *partLock;		/* buffer partition lock for the buffer */
+	BufferDesc *bufdesc;
+	int			buf_id;
+
+	*checkit = true;
+	*found_in_sb = false;
+
+	/* Check for interrupts and take throttling into account. */
+	check_delay_point();
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(buf_tag, relation->rd_smgr->smgr_rnode.node, forknum, blkno);
+
+	/* determine its hash code and partition lock ID */
+	buf_hash = BufTableHashCode(&buf_tag);
+	partLock = BufMappingPartitionLock(buf_hash);
+
+	/* see if the block is in the buffer pool already */
+	LWLockAcquire(partLock, LW_SHARED);
+	buf_id = BufTableLookup(&buf_tag, buf_hash);
+	if (buf_id >= 0)
+	{
+		uint32		buf_state;
+
+		*found_in_sb = true;
+
+		/*
+		 * Found it.  Now, retrieve its state to know what to do with it,
+		 * releasing the buffer header lock immediately.  We do so to limit
+		 * overhead as much as possible.  We'll keep the shared lightweight
+		 * lock on the target buffer mapping partition, so this buffer can't
+		 * be evicted, and we'll acquire an IOLock on the buffer if we need
+		 * to read the content from disk.
+		 */
+		bufdesc = GetBufferDescriptor(buf_id);
+
+		buf_state = LockBufHdr(bufdesc);
+		UnlockBufHdr(bufdesc, buf_state);
+
+		/* Dirty pages are ignored as they'll be flushed soon. */
+		if (buf_state & BM_DIRTY)
+			*checkit = false;
+
+		/*
+		 * Read the buffer from disk, taking an IO lock to prevent torn-page
+		 * reads, in the unlikely event that it was concurrently dirtied and
+		 * flushed.
+		 */
+		if (*checkit)
+		{
+			LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
+			smgrread(relation->rd_smgr, forknum, blkno, buffer);
+			LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
+
+			/*
+			 * Add a page miss cost, as we're always reading outside the
+			 * shared buffers.
+			 */
+			ChecksumCostBalance += ChecksumCostPageMiss;
+		}
+	}
+	else if (needlock)
+	{
+		/*
+		 * Caller asked to read the buffer while we have a lock on the target
+		 * partition.
+		 */
+		smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+		/*
+		 * Add a page miss cost, as we're always reading outside the shared
+		 * buffers.
+		 */
+		ChecksumCostBalance += ChecksumCostPageMiss;
+	}
+
+	LWLockRelease(partLock);
+
+	if (*found_in_sb || needlock)
+		return;
+
+	/*
+	 * Didn't find it in the buffer pool and didn't read it while holding the
+	 * buffer mapping partition lock.  We'll have to try to read it from
+	 * disk, after releasing the target partition lock to avoid too much
+	 * overhead.  It means that it's possible to get a torn page later, so
+	 * we'll have to retry with a suitable lock in case of error to avoid
+	 * false positive.
+	 */
+	smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+	/*
+	 * Add a page miss cost, as we're always reading outside the shared
+	 * buffers.
+	 */
+	ChecksumCostBalance += ChecksumCostPageMiss;
+}
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 13efa9338c..27dc524e00 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -22,6 +22,7 @@ OBJS = \
 	bool.o \
 	cash.o \
 	char.o \
+	checksumfuncs.o \
 	cryptohashes.o \
 	date.o \
 	datetime.o \
diff --git a/src/backend/utils/adt/checksumfuncs.c b/src/backend/utils/adt/checksumfuncs.c
new file mode 100644
index 0000000000..b53b4c0bbf
--- /dev/null
+++ b/src/backend/utils/adt/checksumfuncs.c
@@ -0,0 +1,96 @@
+/*-------------------------------------------------------------------------
+ *
+ * checksumfuncs.c
+ *	  Functions for checksum-related features such as online verification
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/checksumfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "catalog/pg_authid_d.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/checksum.h"
+#include "utils/acl.h"
+#include "utils/builtins.h"
+
+Datum
+pg_check_relation(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	ForkNumber	forknum = InvalidForkNumber;
+	Oid			relid = InvalidOid;
+	const char *forkname;
+
+	if (!DataChecksumsEnabled())
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("data checksums are not enabled in cluster")));
+
+	if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("only superuser or a member of the pg_read_server_files role may use this function")));
+
+	if (PG_ARGISNULL(0))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("relation cannot be null")));
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not " \
+						"allowed in this context")));
+
+	/* Switch into long-lived context to construct returned data structures */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	relid = PG_GETARG_OID(0);
+
+	if (PG_NARGS() == 2 && !PG_ARGISNULL(1))
+	{
+		forkname = TextDatumGetCString(PG_GETARG_TEXT_PP(1));
+		forknum = forkname_to_number(forkname);
+	}
+
+	/* Set cost-based checksum verification delay */
+	ChecksumCostActive = (ChecksumCostDelay > 0);
+	ChecksumCostBalance = 0;
+
+	check_one_relation(tupdesc, tupstore, relid, forknum);
+
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..37f63e747c 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -134,6 +134,14 @@ int			max_worker_processes = 8;
 int			max_parallel_workers = 8;
 int			MaxBackends = 0;
 
+int			ChecksumCostPageHit = 1;	/* GUC parameters for checksum check */
+int			ChecksumCostPageMiss = 10;
+int			ChecksumCostLimit = 200;
+double		ChecksumCostDelay = 0;
+
+int			ChecksumCostBalance = 0;	/* working state for checksums check */
+bool		ChecksumCostActive = false;
+
 int			VacuumCostPageHit = 1;	/* GUC parameters for vacuum */
 int			VacuumCostPageMiss = 10;
 int			VacuumCostPageDirty = 20;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 68082315ac..9749e3048b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -693,6 +693,8 @@ const char *const config_group_names[] =
 	gettext_noop("Resource Usage / Disk"),
 	/* RESOURCES_KERNEL */
 	gettext_noop("Resource Usage / Kernel Resources"),
+	/* RESOURCES_CHECKSUM_DELAY */
+	gettext_noop("Resource Usage / Cost-Based Checksum Verification Delay"),
 	/* RESOURCES_VACUUM_DELAY */
 	gettext_noop("Resource Usage / Cost-Based Vacuum Delay"),
 	/* RESOURCES_BGWRITER */
@@ -2376,6 +2378,36 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"checksum_cost_page_hit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost for a page found in the buffer cache."),
+			NULL
+		},
+		&ChecksumCostPageHit,
+		1, 0, 10000,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"checksum_cost_page_miss", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost for a page not found in the buffer cache."),
+			NULL
+		},
+		&ChecksumCostPageMiss,
+		10, 0, 10000,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"checksum_cost_limit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost amount available before napping."),
+			NULL
+		},
+		&ChecksumCostLimit,
+		200, 1, 10000,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_cost_page_hit", PGC_USERSET, RESOURCES_VACUUM_DELAY,
 			gettext_noop("Vacuum cost for a page found in the buffer cache."),
@@ -3498,6 +3530,17 @@ static struct config_real ConfigureNamesReal[] =
 		check_random_seed, assign_random_seed, show_random_seed
 	},
 
+	{
+		{"checksum_cost_delay", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost delay in milliseconds."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&ChecksumCostDelay,
+		0, 0, 100,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_cost_delay", PGC_USERSET, RESOURCES_VACUUM_DELAY,
 			gettext_noop("Vacuum cost delay in milliseconds."),
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7fb574f9dc..73fa7cdf51 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10741,6 +10741,14 @@
   proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
   proargnames => '{tablespace,name,size,modification}',
   prosrc => 'pg_ls_tmpdir_1arg' },
+{ oid => '9147', descr => 'check data integrity for one or all relations',
+  proname => 'pg_check_relation', proisstrict => 'f', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,i,o,o,o,o,o}',
+  proargnames => '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation' },
 
 # hash partitioning constraint function
 { oid => '5028', descr => 'hash partition CHECK constraint',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14fa127ab1..f2fc143872 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -246,6 +246,14 @@ extern PGDLLIMPORT int work_mem;
 extern PGDLLIMPORT int maintenance_work_mem;
 extern PGDLLIMPORT int max_parallel_maintenance_workers;
 
+extern int	ChecksumCostPageHit;
+extern int	ChecksumCostPageMiss;
+extern int	ChecksumCostLimit;
+extern double ChecksumCostDelay;
+
+extern int	ChecksumCostBalance;
+extern bool ChecksumCostActive;
+
 extern int	VacuumCostPageHit;
 extern int	VacuumCostPageMiss;
 extern int	VacuumCostPageDirty;
diff --git a/src/include/storage/checksum.h b/src/include/storage/checksum.h
index 6e77744cbc..29ca52d8c9 100644
--- a/src/include/storage/checksum.h
+++ b/src/include/storage/checksum.h
@@ -13,8 +13,15 @@
 #ifndef CHECKSUM_H
 #define CHECKSUM_H
 
+#include "postgres.h"
+
+#include "access/tupdesc.h"
+#include "common/relpath.h"
 #include "storage/block.h"
+#include "utils/tuplestore.h"
 
+extern void check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+							   Oid relid, ForkNumber single_forknum);
 /*
  * Compute the checksum for a Postgres page.  The page must be aligned on a
  * 4-byte boundary.
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 454c2df487..84965d3f09 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -62,6 +62,7 @@ enum config_group
 	RESOURCES_MEM,
 	RESOURCES_DISK,
 	RESOURCES_KERNEL,
+	RESOURCES_CHECKSUM_DELAY,
 	RESOURCES_VACUUM_DELAY,
 	RESOURCES_BGWRITER,
 	RESOURCES_ASYNCHRONOUS,
diff --git a/src/test/Makefile b/src/test/Makefile
index efb206aa75..832868099f 100644
--- a/src/test/Makefile
+++ b/src/test/Makefile
@@ -12,7 +12,8 @@ subdir = src/test
 top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS = perl regress isolation modules authentication recovery subscription
+SUBDIRS = perl regress isolation modules authentication check_relation \
+	  recovery subscription
 
 # Test suites that are not safe by default but can be run if selected
 # by the user via the whitespace-separated list in variable
diff --git a/src/test/check_relation/.gitignore b/src/test/check_relation/.gitignore
new file mode 100644
index 0000000000..871e943d50
--- /dev/null
+++ b/src/test/check_relation/.gitignore
@@ -0,0 +1,2 @@
+# Generated by test suite
+/tmp_check/
diff --git a/src/test/check_relation/Makefile b/src/test/check_relation/Makefile
new file mode 100644
index 0000000000..792098fa65
--- /dev/null
+++ b/src/test/check_relation/Makefile
@@ -0,0 +1,23 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/test/check_relation
+#
+# Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/test/check_relation/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/test/check_relation
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
+clean distclean maintainer-clean:
+	rm -rf tmp_check
diff --git a/src/test/check_relation/README b/src/test/check_relation/README
new file mode 100644
index 0000000000..415c4b21ad
--- /dev/null
+++ b/src/test/check_relation/README
@@ -0,0 +1,23 @@
+src/test/check_relation/README
+
+Regression tests for online checksums verification
+==================================================
+
+This directory contains a test suite for online checksums verification.
+
+Running the tests
+=================
+
+NOTE: You must have given the --enable-tap-tests argument to configure.
+
+Run
+    make check
+or
+    make installcheck
+You can use "make installcheck" if you previously did "make install".
+In that case, the code in the installation tree is tested.  With
+"make check", a temporary installation tree is built from the current
+sources and then tested.
+
+Either way, this test initializes, starts, and stops a test Postgres
+cluster.
diff --git a/src/test/check_relation/t/01_checksums_check.pl b/src/test/check_relation/t/01_checksums_check.pl
new file mode 100644
index 0000000000..af67252ea1
--- /dev/null
+++ b/src/test/check_relation/t/01_checksums_check.pl
@@ -0,0 +1,276 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 59;
+
+our $CHECKSUM_UINT16_OFFSET = 4;
+our $PD_UPPER_UINT16_OFFSET = 7;
+our $BLOCKSIZE;
+our $TOTAL_NB_ERR = 0;
+
+sub get_block
+{
+	my ($filename, $blkno) = @_;
+	my $block;
+
+	open(my $infile, '<', $filename) or die;
+	binmode($infile);
+
+	my $success = read($infile, $block, $BLOCKSIZE, ($blkno * $BLOCKSIZE));
+	die($!) if not defined $success;
+
+	close($infile);
+
+	return($block);
+}
+
+sub overwrite_block
+{
+	my ($filename, $block, $blkno) = @_;
+
+	open(my $outfile, '>', $filename) or die;
+	binmode ($outfile);
+
+	my $nb = syswrite($outfile, $block, $BLOCKSIZE, ($blkno * $BLOCKSIZE));
+
+	die($!) if not defined $nb;
+	die("Write error") if ($nb != $BLOCKSIZE);
+
+	$outfile->flush();
+
+	close($outfile);
+}
+
+sub get_uint16_from_page
+{
+	my ($block, $offset) = @_;
+
+	return (unpack("S*", $block))[$offset];
+}
+
+sub set_uint16_to_page
+{
+	my ($block, $data, $offset) = @_;
+
+	my $pack = pack("S", $data);
+
+	# vec with 16B or more won't preserve endianness
+	vec($block, 2*$offset, 8) = (unpack('C*', $pack))[0];
+	vec($block, (2*$offset) + 1, 8) = (unpack('C*', $pack))[1];
+
+	return $block;
+}
+
+sub check_checksums_call
+{
+	my ($node, $relname) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM pg_catalog.pg_check_relation('$relname')"
+	);
+
+	 return ($stderr eq '');
+}
+
+sub check_checksums_nb_error
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+        . " FROM (SELECT pg_catalog.pg_check_relation(oid)"
+        . "   FROM pg_class WHERE relkind in ('r', 'i', 'm')) AS s"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+	is($stdout, $nb, "Should have $nb error");
+
+	$TOTAL_NB_ERR += $nb;
+}
+
+sub check_pg_stat_database_nb_error
+{
+	my ($node) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT "
+		. " sum(checksum_failures)"
+		. " FROM pg_catalog.pg_stat_database"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	is($stderr, '', 'Function should run successfully');
+	is($stdout, $TOTAL_NB_ERR, "Should have $TOTAL_NB_ERR error");
+}
+
+sub get_checksums_errors
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+		. " relid::regclass::text, forknum, failed_blocknum,"
+		. " expected_checksum, found_checksum"
+		. " FROM (SELECT (pg_catalog.pg_check_relation(oid)).*"
+        . "   FROM pg_class WHERE relkind in ('r','i', 'm')) AS s"
+	);
+
+	 is($cmdret, '0', 'Function should run successfully');
+	 like($stderr, $pattern, 'Error output should match expectations');
+
+	 $TOTAL_NB_ERR += $nb;
+
+	 return $stdout;
+}
+
+# This function performs various tests by modifying the specified block at
+# the specified uint16 offset, checking that the corruption is correctly
+# detected, and finally restoring the specified block to its original content.
+sub corrupt_and_test_block
+{
+	my ($node, $filename, $blkno, $offset, $fake_data) = @_;
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+
+	check_pg_stat_database_nb_error($node);
+
+	$node->stop();
+
+	my $original_block = get_block($filename, 0);
+	my $original_data = get_uint16_from_page($original_block, $offset);
+
+	isnt($original_data, $fake_data,
+		"The fake data at offset $offset should be different"
+		. " from the existing one");
+
+	my $new_block = set_uint16_to_page($original_block, $fake_data, $offset);
+	isnt($original_data, get_uint16_from_page($new_block, $offset),
+		"The fake data at offset $offset should have been changed in memory");
+
+	overwrite_block($filename, $new_block, 0);
+
+	my $written_data = get_uint16_from_page(get_block($filename, 0), $offset);
+	isnt($original_data, $written_data,
+		"The data written at offset $offset should be different"
+		. " from the original one");
+	is(get_uint16_from_page($new_block, $offset), $written_data,
+		"The data written at offset $offset should be the same"
+		. " as the one in memory");
+	is($written_data, $fake_data,
+		"The data written at offset $offset should be the one"
+		. " we wanted to write");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 1, qr/invalid page in block $blkno/);
+
+	my $expected_checksum;
+	my $found_checksum = get_uint16_from_page($new_block,
+		$CHECKSUM_UINT16_OFFSET);
+	if ($offset == $PD_UPPER_UINT16_OFFSET)
+	{
+		# A checksum can't be computed if it's detected as PageIsNew(), so the
+		# function returns NULL for the computed checksum
+		$expected_checksum = '';
+	}
+	else
+	{
+		$expected_checksum = get_uint16_from_page($original_block,
+			$CHECKSUM_UINT16_OFFSET);
+	}
+
+	my $det = get_checksums_errors($node, 1, qr/invalid page in block $blkno/);
+	is($det, "t1|0|0|$expected_checksum|$found_checksum",
+		"The checksums error for modification at offset $offset"
+		. " should be detected");
+
+	$node->stop();
+
+	$new_block = set_uint16_to_page($original_block, $original_data, $offset);
+	is($original_data, get_uint16_from_page($new_block, $offset),
+		"The data at offset $offset should have been restored in memory");
+
+	overwrite_block($filename, $new_block, 0);
+	is($original_data, get_uint16_from_page(get_block($filename, $blkno),
+			$offset),
+		"The data at offset $offset should have been restored on disk");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+}
+
+if (exists $ENV{MY_PG_REGRESS})
+{
+	$ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS};
+}
+
+my $node = get_new_node('main');
+
+my %params;
+$params{'extra'} = ['--data-checksums'];
+$node->init(%params);
+
+$node->start();
+
+$ENV{PGOPTIONS} = '--client-min-messages=WARNING';
+
+my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_checksums')");
+
+is($stdout, 'on', 'Data checksums should be enabled');
+
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('block_size')");
+
+$BLOCKSIZE = $stdout;
+
+$node->safe_psql(
+	'postgres', q|
+	CREATE TABLE public.t1(id integer);
+	CREATE INDEX t1_id_idx ON public.t1 (id);
+	INSERT INTO public.t1 SELECT generate_series(1, 100);
+	CREATE VIEW public.v1 AS SELECT * FROM t1;
+	CREATE MATERIALIZED VIEW public.mv1 AS SELECT * FROM t1;
+	CREATE SEQUENCE public.s1;
+	CREATE UNLOGGED TABLE public.u_t1(id integer);
+	CREATE INDEX u_t1_id_idx ON public.u_t1 (id);
+	INSERT INTO public.u_t1 SELECT generate_series(1, 100);
+	CHECKPOINT;
+|);
+
+# Check sane behavior on various object types, including those that don't
+# have storage.
+is(check_checksums_call($node, 't1'), '1', 'Can check a table');
+is(check_checksums_call($node, 't1_id_idx'), '1', 'Can check an index');
+is(check_checksums_call($node, 'v1'), '', 'Cannot check a view');
+is(check_checksums_call($node, 'mv1'), '1', 'Can check a materialized view');
+is(check_checksums_call($node, 's1'), '1', 'Can check a sequence');
+is(check_checksums_call($node, 'u_t1'), '1', 'Can check an unlogged table');
+is(check_checksums_call($node, 'u_t1_id_idx'), '1', 'Can check an unlogged index');
+
+# get the underlying heap absolute path
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_directory') || '/' || pg_relation_filepath('t1')"
+);
+
+isnt($stdout, '', 'A relation file path should be returned');
+
+my $filename = $stdout;
+
+check_checksums_nb_error($node, 0, qr/^$/);
+
+check_pg_stat_database_nb_error($node);
+
+my $fake_uint16 = hex '0x0000';
+
+# Test with a modified checksum.  We use a zero checksum here as it's the only
+# one that cannot exist on a checksummed page.  We also don't have an easy way
+# to compute what the checksum would be after a modification in a random place
+# in the block.
+corrupt_and_test_block($node, $filename, 0, $CHECKSUM_UINT16_OFFSET,
+	$fake_uint16);
+
+# Test corruption making the block look like it's PageIsNew().
+corrupt_and_test_block($node, $filename, 0, $PD_UPPER_UINT16_OFFSET,
+	$fake_uint16);
-- 
2.20.1

#20Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: Julien Rouhaud (#19)
Re: Online checksums verification in the backend

On Wed, 18 Mar 2020 at 19:11, Julien Rouhaud <rjuju123@gmail.com> wrote:

On Wed, Mar 18, 2020 at 07:06:19AM +0100, Julien Rouhaud wrote:

On Wed, Mar 18, 2020 at 01:20:47PM +0900, Michael Paquier wrote:

On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote:

On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote:

With a large amount of
shared buffer eviction you actually increase the risk of torn page
reads. Instead of a logic relying on partition mapping locks, which
could be unwise on performance grounds, did you consider different
approaches? For example a kind of pre-emptive lock on the page in
storage to prevent any shared buffer operation to happen while the
block is read from storage, that would act like a barrier.

Even with a workload having a large shared_buffers eviction pattern, I don't
think that there's a high probability of hitting a torn page. Unless I'm
mistaken, it can only happen if all these steps happen concurrently with the
block read, just after releasing the LWLock:

- postgres reads the same block in shared_buffers (including all the locking)
- dirties it
- writes part of the page

It's certainly possible, but it seems so unlikely that the optimistic lock-less
approach seems like a very good tradeoff.

Having false reports in this area could be very confusing for the
user. That's for example possible now with checksum verification and
base backups.

I agree, however this shouldn't be the case here, as the block will be
rechecked a second time while holding the proper lock in case of a possible
false positive, before being reported as corrupted. So the only downside is
checking twice a corrupted block that's not found in shared buffers (or
concurrently loaded/modified/half flushed). As the number of corrupted or
concurrently loaded/modified/half-flushed blocks should usually be close to
zero, it seems worthwhile to have a lockless check first for performance
reasons.

I just noticed some dumb mistakes while adding the new GUCs. v5 attached to
fix that, no other changes.

Thank you for updating the patch. I have some comments:

1.
+       <entry>
+        <literal><function>pg_check_relation(<parameter>relation</parameter>
<type>oid</type>, <parameter>fork</parameter>
<type>text</type>)</function></literal>
+       </entry>

Looking at the declaration of pg_check_relation, 'relation' and 'fork'
are optional arguments. So I think the above is not correct. But as I
commented below, 'relation' should not be optional, so maybe the above
line could be:

+ <literal><function>pg_check_relation(<parameter>relation</parameter>
<type>oid</type>[, <parameter>fork</parameter>
<type>text</type>])</function></literal>

2.
+   <indexterm>
+    <primary>pg_check_relation</primary>
+   </indexterm>
+   <para>
+    <function>pg_check_relation</function> iterates over all the blocks of all
+    or the specified fork of a given relation and verifies their checksums.  It
+    returns the list of blocks for which the found checksum doesn't match the
+    expected one.  You must be a member of the
+    <literal>pg_read_all_stats</literal> role to use this function.  It can
+    only be used if data checksums are enabled.  See <xref
+    linkend="app-initdb-data-checksums"/> for more information.
+   </para>

* I think we need a description about possible values for 'fork'
(i.g., 'main', 'vm', 'fsm' and 'init'), and the behavior when 'fork'
is omitted.

* Do we need to explain about checksum cost-based delay here?

3.
+CREATE OR REPLACE FUNCTION pg_check_relation(
+  IN relation regclass DEFAULT NULL::regclass, IN fork text DEFAULT NULL::text,
+  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+  OUT expected_checksum integer, OUT found_checksum integer)
+  RETURNS SETOF record VOLATILE LANGUAGE internal AS 'pg_check_relation'
+  PARALLEL RESTRICTED;

Now that pg_check_relation doesn't accept NULL as 'relation', I think
we need to make 'relation' a mandatory argument.

4.
+   /* Check if the relation (still) exists */
+   if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
+   {
+       /*
+        * We use a standard relation_open() to acquire the initial lock.  It
+        * means that this will block until the lock is acquired, or will
+        * raise an ERROR if lock_timeout has been set.  If caller wants to
+        * check multiple tables while relying on a maximum wait time, it
+        * should process tables one by one instead of relying on a global
+        * processing with the main SRF.
+        */
+       relation = relation_open(relid, AccessShareLock);
+   }

IIUC the above was necessary because we used to have
check_all_relations() which iterates all relations on the database to
do checksum checks. But now we don't have that function and
pg_check_relation processes one relation at a time. Can we just do
relation_open() here?

5.
I think we need to check if the relation is a temp relation. I'm not
sure it's worth to check checksums of temp relations but at least we
need not to check other's temp relations.

6.
+/*
+ * Safely read the wanted buffer from disk, dealing with possible concurrency
+ * issues.  Note that if a buffer is found dirty in shared_buffers, no read will
+ * be performed and the caller will be informed that no check should be done.
+ * We can safely ignore such buffers as they'll be written before next
+ * checkpoint's completion.
+ *
+ * The following locks can be used in this function:
+ *
+ *   - shared LWLock on the target buffer pool partition mapping.
+ *   - IOLock on the buffer
+ *
+ * The IOLock is taken when reading the buffer from disk if it exists in
+ * shared_buffers, to avoid torn pages.
+ *
+ * If the buffer isn't in shared_buffers, it'll be read from disk without any
+ * lock unless caller asked otherwise, setting needlock.  In this case, the
+ * read will be done while the buffer mapping partition LWLock is still being
+ * held.  Reading with this lock is to avoid the unlikely but possible case
+ * that a buffer wasn't present in shared buffers when we checked but it then
+ * alloc'ed in shared_buffers, modified and flushed concurrently when we
+ * later try to read it, leading to false positive due to torn page.  Caller
+ * can read first the buffer without holding the target buffer mapping
+ * partition LWLock to have an optimistic approach, and reread the buffer
+ * from disk in case of error.
+ *
+ * Caller should hold an AccessShareLock on the Relation
+ */

I think the above comment also needs some "/*-------" at the beginning and end.

7.
+static void
+check_get_buffer(Relation relation, ForkNumber forknum,
+                BlockNumber blkno, char *buffer, bool needlock, bool *checkit,
+                bool *found_in_sb)
+{

Maybe we can make check_get_buffer() return a bool indicating we found
a buffer to check, instead of having '*checkit'. That way, we can
simplify the following code:

+       check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+                        &checkit, &found_in_sb);
+
+       if (!checkit)
+           continue;

to something like:

+ if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+                          &found_in_sb))
+     continue;
8.
+       if (PageIsVerified(buffer, blkno))
+       {
+           /*
+            * If the page is really new, there won't be any checksum to be
+            * computed or expected.
+            */
+           *chk_expected = *chk_found = NoComputedChecksum;
+           return true;
+       }
+       else
+       {
+           /*
+            * There's a corruption, but since this affects PageIsNew, we
+            * can't compute a checksum, so set NoComputedChecksum for the
+            * expected checksum.
+            */
+           *chk_expected = NoComputedChecksum;
+           *chk_found = hdr->pd_checksum;
+       }
+       return false;

* I think the 'else' is not necessary here.

* Setting *chk_expected and *chk_found seems useless when we return
true. The caller doesn't use them.

* Should we forcibly overwrite ignore_checksum_failure to off in
pg_check_relation()? Otherwise, this logic seems not work fine.

* I don't understand why we cannot compute a checksum in case where a
page looks like a new page but is actually corrupted. Could you please
elaborate on that?

8.
+   {
+       {"checksum_cost_page_hit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+           gettext_noop("Checksum cost for a page found in the buffer cache."),
+           NULL
+       },
+       &ChecksumCostPageHit,
+       1, 0, 10000,
+       NULL, NULL, NULL
+   },

* There is no description about the newly added four GUC parameters in the doc.

* We need to put new GUC parameters into postgresql.conf.sample as well.

* The patch doesn't use checksum_cost_page_hit at all.

9.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..37f63e747c 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -134,6 +134,14 @@ int            max_worker_processes = 8;
 int            max_parallel_workers = 8;
 int            MaxBackends = 0;
+int            ChecksumCostPageHit = 1;    /* GUC parameters for
checksum check */
+int            ChecksumCostPageMiss = 10;
+int            ChecksumCostLimit = 200;
+double     ChecksumCostDelay = 0;
+
+int            ChecksumCostBalance = 0;    /* working state for
checksums check */
+bool       ChecksumCostActive = false;

Can we declare them in checksum.c since these parameters are used only
in checksum.c and it does the I/O by itself.

10.
+       /* Report the failure to the stat collector and the logs. */
+       pgstat_report_checksum_failure();
+       ereport(WARNING,
+               (errcode(ERRCODE_DATA_CORRUPTED),
+                errmsg("invalid page in block %u of relation %s",
+                       blkno,
+                       relpath(relation->rd_smgr->smgr_rnode, forknum))));

I think we could do pgstat_report_checksum_failure() and emit WARNING
twice for the same page since PageIsVerified() also does the same.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#21Julien Rouhaud
rjuju123@gmail.com
In reply to: Masahiko Sawada (#20)
1 attachment(s)
Re: Online checksums verification in the backend

On Sat, Mar 28, 2020 at 12:28:27PM +0900, Masahiko Sawada wrote:

On Wed, 18 Mar 2020 at 19:11, Julien Rouhaud <rjuju123@gmail.com> wrote:

v5 attached

Thank you for updating the patch. I have some comments:

Thanks a lot for the review!

1.
+       <entry>
+        <literal><function>pg_check_relation(<parameter>relation</parameter>
<type>oid</type>, <parameter>fork</parameter>
<type>text</type>)</function></literal>
+       </entry>

Looking at the declaration of pg_check_relation, 'relation' and 'fork'
are optional arguments. So I think the above is not correct. But as I
commented below, 'relation' should not be optional, so maybe the above
line could be:

+ <literal><function>pg_check_relation(<parameter>relation</parameter>
<type>oid</type>[, <parameter>fork</parameter>
<type>text</type>])</function></literal>

Yes I missed that when making relation mandatory. Fixed.

2.
+   <indexterm>
+    <primary>pg_check_relation</primary>
+   </indexterm>
+   <para>
+    <function>pg_check_relation</function> iterates over all the blocks of all
+    or the specified fork of a given relation and verifies their checksums.  It
+    returns the list of blocks for which the found checksum doesn't match the
+    expected one.  You must be a member of the
+    <literal>pg_read_all_stats</literal> role to use this function.  It can
+    only be used if data checksums are enabled.  See <xref
+    linkend="app-initdb-data-checksums"/> for more information.
+   </para>

* I think we need a description about possible values for 'fork'
(i.g., 'main', 'vm', 'fsm' and 'init'), and the behavior when 'fork'
is omitted.

Done.

* Do we need to explain about checksum cost-based delay here?

It's probably better in config.sgml, near the vacuum cost-based delay
documentation, so done this way with a link referencing that part.

3.
+CREATE OR REPLACE FUNCTION pg_check_relation(
+  IN relation regclass DEFAULT NULL::regclass, IN fork text DEFAULT NULL::text,
+  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+  OUT expected_checksum integer, OUT found_checksum integer)
+  RETURNS SETOF record VOLATILE LANGUAGE internal AS 'pg_check_relation'
+  PARALLEL RESTRICTED;

Now that pg_check_relation doesn't accept NULL as 'relation', I think
we need to make 'relation' a mandatory argument.

Correct, fixed.

4.
+   /* Check if the relation (still) exists */
+   if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
+   {
+       /*
+        * We use a standard relation_open() to acquire the initial lock.  It
+        * means that this will block until the lock is acquired, or will
+        * raise an ERROR if lock_timeout has been set.  If caller wants to
+        * check multiple tables while relying on a maximum wait time, it
+        * should process tables one by one instead of relying on a global
+        * processing with the main SRF.
+        */
+       relation = relation_open(relid, AccessShareLock);
+   }

IIUC the above was necessary because we used to have
check_all_relations() which iterates all relations on the database to
do checksum checks. But now we don't have that function and
pg_check_relation processes one relation at a time. Can we just do
relation_open() here?

Ah yes I missed that comment. I think only the comment needed to be updated to
remove any part related to a NULL-relation call.  I ended up removing the whole
comment since locking and lock_timeout behavior is inherent to relation_open
and there's no need to document that any further now that we always only check
one relation at a time.

5.
I think we need to check if the relation is a temp relation. I'm not
sure it's worth to check checksums of temp relations but at least we
need not to check other's temp relations.

Good point.  I think it's still worthwhile to check the backend's own temp
relations, although typical usage should be a bgworker/cron job doing that
check, so there shouldn't be any temp relations to check in practice.

6.
+/*
+ * Safely read the wanted buffer from disk, dealing with possible concurrency
+ * issues.  Note that if a buffer is found dirty in shared_buffers, no read will
+ * be performed and the caller will be informed that no check should be done.
+ * We can safely ignore such buffers as they'll be written before next
+ * checkpoint's completion.
[...] 
+ */

I think the above comment also needs some "/*-------" at the beginning and end.

Fixed.

7.
+static void
+check_get_buffer(Relation relation, ForkNumber forknum,
+                BlockNumber blkno, char *buffer, bool needlock, bool *checkit,
+                bool *found_in_sb)
+{

Maybe we can make check_get_buffer() return a bool indicating we found
a buffer to check, instead of having '*checkit'. That way, we can
simplify the following code:

+       check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+                        &checkit, &found_in_sb);
+
+       if (!checkit)
+           continue;

to something like:

+ if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+                          &found_in_sb))
+     continue;

Changed.

8.
+       if (PageIsVerified(buffer, blkno))
+       {
+           /*
+            * If the page is really new, there won't be any checksum to be
+            * computed or expected.
+            */
+           *chk_expected = *chk_found = NoComputedChecksum;
+           return true;
+       }
+       else
+       {
+           /*
+            * There's a corruption, but since this affects PageIsNew, we
+            * can't compute a checksum, so set NoComputedChecksum for the
+            * expected checksum.
+            */
+           *chk_expected = NoComputedChecksum;
+           *chk_found = hdr->pd_checksum;
+       }
+       return false;

* I think the 'else' is not necessary here.

AFAICT it is; see below.

* Setting *chk_expected and *chk_found seems useless when we return
true. The caller doesn't use them.

Indeed, fixed.

* Should we forcibly overwrite ignore_checksum_failure to off in
pg_check_relation()? Otherwise, this logic seems not work fine.

* I don't understand why we cannot compute a checksum in case where a
page looks like a new page but is actually corrupted. Could you please
elaborate on that?

PageIsVerified has a different behavior depending on whether the page looks new
or not.  If the page looks new, it only checks that it's indeed a new page,
and otherwise it tries to verify the checksum.

Also, pg_check_page() has an assert to make sure that the page isn't (or
doesn't look) new.

So it seems to me that the 'else' is required to properly detect a real or fake
PageIsNew, and try to compute checksums only when required.

8.
+   {
+       {"checksum_cost_page_hit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+           gettext_noop("Checksum cost for a page found in the buffer cache."),
+           NULL
+       },
+       &ChecksumCostPageHit,
+       1, 0, 10000,
+       NULL, NULL, NULL
+   },

* There is no description about the newly added four GUC parameters in the doc.

* We need to put new GUC parameters into postgresql.conf.sample as well.

Fixed both.

* The patch doesn't use checksum_cost_page_hit at all.

Indeed, I also realized that while working on previous issues. I removed it
and renamed checksum_cost_page_miss to checksum_cost_page.

9.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..37f63e747c 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -134,6 +134,14 @@ int            max_worker_processes = 8;
int            max_parallel_workers = 8;
int            MaxBackends = 0;
+int            ChecksumCostPageHit = 1;    /* GUC parameters for
checksum check */
+int            ChecksumCostPageMiss = 10;
+int            ChecksumCostLimit = 200;
+double     ChecksumCostDelay = 0;
+
+int            ChecksumCostBalance = 0;    /* working state for
checksums check */
+bool       ChecksumCostActive = false;

Can we declare them in checksum.c since these parameters are used only
in checksum.c and it does I/O my itself.

The GUC parameters would still need to be global, so for consistency I kept all
the variables in globals.c.

10.
+       /* Report the failure to the stat collector and the logs. */
+       pgstat_report_checksum_failure();
+       ereport(WARNING,
+               (errcode(ERRCODE_DATA_CORRUPTED),
+                errmsg("invalid page in block %u of relation %s",
+                       blkno,
+                       relpath(relation->rd_smgr->smgr_rnode, forknum))));

I think we could do pgstat_report_checksum_failure() and emit WARNING
twice for the same page since PageIsVerified() also does the same.

As mentioned before, in this patch I only call PageIsVerified() if the buffer
looks new, and in this case PageIsVerified() only verifies that it's a true
all-zero page and won't try to verify the checksum, so there's no possibility
of a duplicated report.  I modified the comments to document all the interactions
and expectations.

v6 attached.

Attachments:

v6-0001-Add-a-pg_check_relation-function.patch (text/plain; charset=us-ascii)
From b7787a7d1e6ec6f8211efccdefc7de0340b4a472 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <jrouhaud@vmware.com>
Date: Mon, 4 Nov 2019 08:40:23 +0100
Subject: [PATCH v6] Add a pg_check_relation() function.

This function checks the validity of the checksums for all non-dirty blocks of
a given relation, and optionally a given fork, and returns the list of all
blocks that don't match, along with the expected and found checksums.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada
Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  85 ++++
 doc/src/sgml/func.sgml                        |  51 +++
 src/backend/catalog/system_views.sql          |   7 +
 src/backend/storage/page/checksum.c           | 424 +++++++++++++++++-
 src/backend/utils/adt/Makefile                |   1 +
 src/backend/utils/adt/checksumfuncs.c         |  96 ++++
 src/backend/utils/init/globals.c              |   7 +
 src/backend/utils/misc/guc.c                  |  33 ++
 src/backend/utils/misc/postgresql.conf.sample |   6 +
 src/include/catalog/pg_proc.dat               |   8 +
 src/include/miscadmin.h                       |   7 +
 src/include/storage/checksum.h                |   7 +
 src/include/utils/guc_tables.h                |   1 +
 src/test/Makefile                             |   3 +-
 src/test/check_relation/.gitignore            |   2 +
 src/test/check_relation/Makefile              |  23 +
 src/test/check_relation/README                |  23 +
 .../check_relation/t/01_checksums_check.pl    | 276 ++++++++++++
 18 files changed, 1056 insertions(+), 4 deletions(-)
 create mode 100644 src/backend/utils/adt/checksumfuncs.c
 create mode 100644 src/test/check_relation/.gitignore
 create mode 100644 src/test/check_relation/Makefile
 create mode 100644 src/test/check_relation/README
 create mode 100644 src/test/check_relation/t/01_checksums_check.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 2de21903a1..377ca700b0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2043,6 +2043,91 @@ include_dir 'conf.d'
      </note>
     </sect2>
 
+    <sect2 id="runtime-config-resource-checksum-verification-cost">
+     <title>Cost-based Checksum Verification Delay</title>
+
+     <para>
+      During the execution of the <xref linkend="functions-check-relation-note"/>
+      function, the system maintains an internal counter that keeps track of
+      the estimated cost of the various I/O operations that are performed.
+      When the accumulated cost reaches a limit (specified by
+      <varname>checksum_cost_limit</varname>), the process performing the
+      operation will sleep for a short period of time, as specified by
+      <varname>checksum_cost_delay</varname>. Then it will reset the counter
+      and continue execution.
+     </para>
+
+     <para>
+      This feature is disabled by default. To enable it, set the
+      <varname>checksum_cost_delay</varname> variable to a nonzero
+      value.
+     </para>
+
+     <variablelist>
+      <varlistentry id="guc-checksum-cost-delay" xreflabel="checksum_cost_delay">
+       <term><varname>checksum_cost_delay</varname> (<type>floating point</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_delay</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The amount of time that the process will sleep
+         when the cost limit has been exceeded.
+         If this value is specified without units, it is taken as milliseconds.
+         The default value is zero, which disables the cost-based checksum
+         verification delay feature.  Positive values enable cost-based
+         checksum verification.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-checksum-cost-page" xreflabel="checksum_cost_page">
+       <term><varname>checksum_cost_page</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_page</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The estimated cost for verifying a buffer, whether it is found in the
+         shared buffer cache or not.  It represents the cost to lock the buffer
+         pool, look up the shared hash table, read the content of the page from
+         disk and compute its checksum.  The default value is 10.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-checksum-cost-limit" xreflabel="checksum_cost_limit">
+       <term><varname>checksum_cost_limit</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_limit</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The accumulated cost that will cause the verification process to sleep.
+         The default value is 200.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
+
+     <note>
+      <para>
+       There are certain operations that hold critical locks and should
+       therefore complete as quickly as possible.  Cost-based checksum
+       verification delays do not occur during such operations.  Therefore it
+       is possible that the cost accumulates far higher than the specified
+       limit.  To avoid uselessly long delays in such cases, the actual delay
+       is calculated as <varname>checksum_cost_delay</varname> *
+       <varname>accumulated_balance</varname> /
+       <varname>checksum_cost_limit</varname> with a maximum of
+       <varname>checksum_cost_delay</varname> * 4.
+      </para>
+     </note>
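+
+     <para>
+      For example, with the default settings
+      (<varname>checksum_cost_page</varname> = 10,
+      <varname>checksum_cost_limit</varname> = 200), setting
+      <varname>checksum_cost_delay</varname> to 10 makes the verification
+      process sleep for roughly 10 milliseconds after every 20 pages read.
+     </para>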
+    </sect2>
+
     <sect2 id="runtime-config-resource-background-writer">
      <title>Background Writer</title>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7a0bb0c70a..1187ea6b98 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -21849,6 +21849,57 @@ SELECT (pg_stat_file('filename')).modification;
 
   </sect2>
 
+  <sect2 id="functions-data-sanity">
+   <title>Data Sanity Functions</title>
+
+   <para>
+    The functions shown in <xref linkend="functions-data-sanity-table"/>
+    provide a means to check the health of data files in a cluster.
+   </para>
+
+   <table id="functions-data-sanity-table">
+    <title>Data Sanity Functions</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry>
+        <literal><function>pg_check_relation(<parameter>relation</parameter> <type>oid</type>[, <parameter>fork</parameter> <type>text</type>])</function></literal>
+       </entry>
+       <entry><type>setof record</type></entry>
+       <entry>Validate the checksums for all blocks of a given relation,
+       optionally restricted to a given fork.</entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <indexterm>
+    <primary>pg_check_relation</primary>
+   </indexterm>
+   <para id="functions-check-relation-note" xreflabel="pg_check_relation">
+    <function>pg_check_relation</function> iterates over all the blocks of a
+    given relation and verifies their checksums.  If provided,
+    <replaceable>fork</replaceable> should be <literal>'main'</literal> for the
+    main data fork, <literal>'fsm'</literal> for the free space map,
+    <literal>'vm'</literal> for the visibility map, or
+    <literal>'init'</literal> for the initialization fork; only this
+    specific fork will then be verified, otherwise all forks will be.  The
+    function returns the list of blocks for which the checksum found on disk
+    doesn't match the expected one.  See <xref
+    linkend="runtime-config-resource-checksum-verification-cost"/> for
+    information on how to configure the cost-based verification delay.  You
+    must be a member of the <literal>pg_read_server_files</literal> role to
+    use this function.  It can only be used if data checksums are enabled.
+    See <xref linkend="app-initdb-data-checksums"/> for more information.
+   </para>
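+
+   <para>
+    For example, to verify the checksums of all forks of a given relation
+    (here a hypothetical table name):
+<programlisting>
+SELECT * FROM pg_check_relation('mytable');
+</programlisting>
+    Each corrupted block is returned as a row; an empty result means that no
+    corruption was detected.
+   </para>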
+
+  </sect2>
+
   </sect1>
 
   <sect1 id="functions-trigger">
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 83d00c6cde..071aa5aa2b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1387,6 +1387,13 @@ LANGUAGE INTERNAL
 STRICT STABLE PARALLEL SAFE
 AS 'jsonb_path_query_first_tz';
 
+CREATE OR REPLACE FUNCTION pg_check_relation(
+  IN relation regclass, IN fork text DEFAULT NULL::text,
+  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+  OUT expected_checksum integer, OUT found_checksum integer)
+  RETURNS SETOF record VOLATILE LANGUAGE internal AS 'pg_check_relation'
+  PARALLEL RESTRICTED;
+
 --
 -- The default permissions for functions mean that anyone can execute them.
 -- A number of functions shouldn't be executable by just anyone, but rather
diff --git a/src/backend/storage/page/checksum.c b/src/backend/storage/page/checksum.c
index e010691c9f..09d453afa6 100644
--- a/src/backend/storage/page/checksum.c
+++ b/src/backend/storage/page/checksum.c
@@ -15,8 +15,426 @@
 
 #include "storage/checksum.h"
 /*
- * The actual code is in storage/checksum_impl.h.  This is done so that
- * external programs can incorporate the checksum code by #include'ing
- * that file from the exported Postgres headers.  (Compare our CRC code.)
+ * The actual checksum computation code is in storage/checksum_impl.h.  This
+ * is done so that external programs can incorporate the checksum code by
+ * #include'ing that file from the exported Postgres headers.  (Compare our
+ * CRC code.)
  */
 #include "storage/checksum_impl.h"
+
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "catalog/pg_authid_d.h"
+#include "commands/dbcommands.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/buf_internals.h"
+#include "storage/checksum.h"
+#include "storage/lockdefs.h"
+#include "utils/acl.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/lsyscache.h"
+#include "utils/rel.h"
+#include "utils/syscache.h"
+
+/*
+ * A zero checksum can never be computed; see pg_checksum_page().
+ */
+#define NoComputedChecksum	0
+
+/* ----------------
+ * The rest of this module provides a set of functions that can be used to
+ * safely check all checksums on a running cluster.
+ *
+ * Please note that those only perform standard buffered reads, and don't try
+ * to bypass or discard the operating system cache.  If you want to check the
+ * actual storage, you have to discard the operating system cache before
+ * running those functions.
+ *
+ * To avoid torn pages and possible false positives when reading data, while
+ * keeping the overhead as low as possible, the following heuristics are used:
+ *
+ * - a shared LWLock is taken on the target buffer pool partition mapping, and
+ *   we detect if a block is in shared_buffers or not.  See check_get_buffer()
+ *   comments for more details about the locking strategy.
+ *
+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
+ *
+ * - if a block is otherwise found in shared_buffers, an IO lock is taken on
+ *   the block and the block is then read from storage, ignoring the block in
+ *   shared_buffers
+ *
+ * - if a block is not found in shared_buffers, the LWLock is released and the
+ *   block is read from disk without taking any lock.  If an error is detected,
+ *   the read block will be discarded and retrieved again while holding the
+ *   LWLock.  This is because an error due to concurrent write is possible but
+ *   very unlikely, so it's better to have an optimistic approach to limit
+ *   locking overhead
+ *
+ * The check can be performed using an SQL function, returning the list of
+ * problematic blocks.
+ * ----------------
+ */
+
+static bool check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+							   uint16 *chk_found);
+static void check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+									  Relation relation, ForkNumber forknum);
+static void check_delay_point(void);
+static bool check_get_buffer(Relation relation, ForkNumber forknum,
+							 BlockNumber blkno, char *buffer, bool needlock,
+							 bool *found_in_sb);
+
+/*
+ * Perform a checksum check on the passed page.  Returns whether the page is
+ * valid or not, and assigns the expected and found checksums to chk_expected
+ * and chk_found, respectively.  Note that a page can look new but still be
+ * the result of a corruption.  We check for this case too, but we can't
+ * compute its checksum as pg_checksum_page() explicitly refuses new pages,
+ * so NoComputedChecksum will be set in chk_expected.
+ */
+static bool
+check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+				   uint16 *chk_found)
+{
+	PageHeader	hdr = (PageHeader) buffer;
+
+	Assert(chk_expected && chk_found);
+
+	if (PageIsNew(hdr))
+	{
+		/*
+		 * Check if the page is really new or if there's a corruption that
+		 * affected PageIsNew detection.  Note that PageIsVerified won't try to
+		 * detect checksum corruption in this case, so there's no risk of
+		 * duplicated corruption report.
+		 */
+		if (PageIsVerified(buffer, blkno))
+		{
+			/* No corruption. */
+			*chk_expected = *chk_found = NoComputedChecksum;
+			return true;
+		}
+		else
+		{
+			/*
+			 * There's a corruption, but since it affects PageIsNew, we
+			 * can't compute a checksum, so set NoComputedChecksum for the
+			 * expected checksum.
+			 */
+			*chk_expected = NoComputedChecksum;
+			*chk_found = hdr->pd_checksum;
+		}
+		return false;
+	}
+
+	*chk_expected = pg_checksum_page(buffer, blkno);
+	*chk_found = hdr->pd_checksum;
+
+	return (*chk_expected == *chk_found);
+}
+
+/*
+ * Perform the check on a single relation, possibly filtered with a single
+ * fork.  This function will check if the given relation exists or not, as
+ * a relation could be dropped after checking for the list of relations and
+ * before getting here, and we don't want to error out in this case.
+ */
+void
+check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+					 Oid relid, ForkNumber single_forknum)
+{
+	Relation	relation = NULL;
+	ForkNumber	forknum;
+
+	/* Open the relation if it exists.  */
+	if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
+	{
+		relation = relation_open(relid, AccessShareLock);
+	}
+
+	/* sanity checks */
+	if (!RelationIsValid(relation))
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_TABLE),
+				 errmsg("could not open relation with OID %u", relid)));
+
+	if (!RELKIND_HAS_STORAGE(relation->rd_rel->relkind))
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("relation \"%s\" does not have storage to be checked",
+				 quote_qualified_identifier(
+					 get_namespace_name(get_rel_namespace(relid)),
+					 get_rel_name(relid)))));
+
+	if (RELATION_IS_OTHER_TEMP(relation))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot verify temporary tables of other sessions")));
+
+	RelationOpenSmgr(relation);
+
+	for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+	{
+		if (single_forknum != InvalidForkNumber && single_forknum != forknum)
+			continue;
+
+		if (smgrexists(relation->rd_smgr, forknum))
+			check_relation_fork(tupdesc, tupstore, relation, forknum);
+	}
+
+	relation_close(relation, AccessShareLock);	/* release the lock */
+}
+
+/*
+ * For a given relation and fork, do the real work of iterating over all pages
+ * and checking them.  Caller must hold an AccessShareLock on the given
+ * relation.
+ */
+static void
+check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+						  Relation relation, ForkNumber forknum)
+{
+	BlockNumber blkno,
+				nblocks;
+	char		buffer[BLCKSZ];
+	bool		found_in_sb;
+
+	/*
+	 * We remember the number of blocks here.  Since caller must hold a lock on
+	 * the relation, we know that it won't be truncated while we're iterating
+	 * over the blocks.  Any block added after this function started won't be
+	 * checked, but this is out of scope as such pages will be flushed before
+	 * the next checkpoint's completion.
+	 */
+	nblocks = RelationGetNumberOfBlocksInFork(relation, forknum);
+
+#define SRF_COLS			5	/* Number of output arguments in the SRF */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		bool		force_lock = false;
+		uint16		chk_expected,
+					chk_found;
+		Datum		values[SRF_COLS];
+		bool		nulls[SRF_COLS];
+		int			i = 0;
+
+		memset(values, 0, sizeof(values));
+		memset(nulls, 0, sizeof(nulls));
+
+		/*
+		 * To avoid too much overhead, the buffer is first read without the
+		 * locks that would guarantee the absence of false positives, as such
+		 * events should be quite rare.
+		 */
+Retry:
+		if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+							  &found_in_sb))
+			continue;
+
+		if (check_buffer(buffer, blkno, &chk_expected, &chk_found))
+			continue;
+
+		/*
+		 * If we get a failure and the buffer wasn't found in shared buffers,
+		 * reread the buffer with suitable lock to avoid false positive.  See
+		 * check_get_buffer for more details.
+		 */
+		if (!found_in_sb && !force_lock)
+		{
+			force_lock = true;
+			goto Retry;
+		}
+
+		values[i++] = ObjectIdGetDatum(relation->rd_id);
+		values[i++] = Int32GetDatum(forknum);
+		values[i++] = UInt32GetDatum(blkno);
+		/*
+		 * This can happen if a corruption makes the block appear to be
+		 * PageIsNew() when it isn't actually a new page.
+		 */
+		if (chk_expected == NoComputedChecksum)
+			nulls[i++] = true;
+		else
+			values[i++] = UInt16GetDatum(chk_expected);
+		values[i++] = UInt16GetDatum(chk_found);
+
+		Assert(i == SRF_COLS);
+
+		/* Report the failure to the stat collector and the logs. */
+		pgstat_report_checksum_failure();
+		ereport(WARNING,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("invalid page in block %u of relation %s",
+						blkno,
+						relpath(relation->rd_smgr->smgr_rnode, forknum))));
+
+		/* Save the corrupted blocks in the tuplestore. */
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+}
+
+/*
+ * Check for interrupts and cost-based delay.
+ */
+static void
+check_delay_point(void)
+{
+	/* Always check for interrupts */
+	CHECK_FOR_INTERRUPTS();
+
+	if (!ChecksumCostActive || InterruptPending)
+		return;
+
+	/* Nap if appropriate */
+	if (ChecksumCostBalance >= ChecksumCostLimit)
+	{
+		int			msec;
+
+		msec = ChecksumCostDelay * ChecksumCostBalance / ChecksumCostLimit;
+		if (msec > ChecksumCostDelay * 4)
+			msec = ChecksumCostDelay * 4;
+
+		pg_usleep(msec * 1000L);
+
+		ChecksumCostBalance = 0;
+
+		/* Might have gotten an interrupt while sleeping */
+		CHECK_FOR_INTERRUPTS();
+	}
+}
+
+/*
+ *-------------------------
+ * Safely read the wanted buffer from disk, dealing with possible concurrency
+ * issues.  Note that if a buffer is found dirty in shared_buffers, no read
+ * will be performed and the caller will be informed that no check should be
+ * done.  We can safely ignore such buffers as they'll be written before the
+ * next checkpoint's completion.
+ *
+ * The following locks can be used in this function:
+ *
+ *   - shared LWLock on the target buffer pool partition mapping.
+ *   - IOLock on the buffer
+ *
+ * The IOLock is taken when reading the buffer from disk if it exists in
+ * shared_buffers, to avoid torn pages.
+ *
+ * If the buffer isn't in shared_buffers, it'll be read from disk without any
+ * lock unless the caller asked otherwise by setting needlock.  In that case,
+ * the read will be done while the buffer mapping partition LWLock is still
+ * being held.  Reading under this lock avoids the unlikely but possible case
+ * that a buffer wasn't present in shared buffers when we checked, but is
+ * then allocated in shared_buffers, modified and flushed concurrently by the
+ * time we read it, leading to a false positive due to a torn page.  The
+ * caller can first read the buffer optimistically, without holding the
+ * target buffer mapping partition LWLock, and reread the buffer from disk
+ * in case of error.
+ *
+ * Caller should hold an AccessShareLock on the Relation
+ *-------------------------
+ */
+static bool
+check_get_buffer(Relation relation, ForkNumber forknum,
+				 BlockNumber blkno, char *buffer, bool needlock,
+				 bool *found_in_sb)
+{
+	bool		checkit = true;
+	BufferTag	buf_tag;		/* identity of requested block */
+	uint32		buf_hash;		/* hash value for buf_tag */
+	LWLock	   *partLock;		/* buffer partition lock for the buffer */
+	BufferDesc *bufdesc;
+	int			buf_id;
+
+	*found_in_sb = false;
+
+	/* Check for interrupts and take throttling into account. */
+	check_delay_point();
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(buf_tag, relation->rd_smgr->smgr_rnode.node, forknum, blkno);
+
+	/* determine its hash code and partition lock ID */
+	buf_hash = BufTableHashCode(&buf_tag);
+	partLock = BufMappingPartitionLock(buf_hash);
+
+	/* see if the block is in the buffer pool already */
+	LWLockAcquire(partLock, LW_SHARED);
+	buf_id = BufTableLookup(&buf_tag, buf_hash);
+	if (buf_id >= 0)
+	{
+		uint32		buf_state;
+
+		*found_in_sb = true;
+
+		/*
+		 * Found it.  Now, retrieve its state to know what to do with it, and
+		 * release the buffer header spinlock immediately.  We do so to limit
+		 * overhead as much as possible.  We'll keep the shared lightweight
+		 * lock on the target buffer mapping partition, so this buffer can't
+		 * be evicted, and we'll acquire an IOLock on the buffer if we need
+		 * to read its content from disk.
+		 */
+		bufdesc = GetBufferDescriptor(buf_id);
+
+		buf_state = LockBufHdr(bufdesc);
+		UnlockBufHdr(bufdesc, buf_state);
+
+		/* Dirty pages are ignored as they'll be flushed soon. */
+		if (buf_state & BM_DIRTY)
+			checkit = false;
+
+		/*
+		 * Read the buffer from disk, taking an IO lock to prevent torn-page
+		 * reads, in the unlikely event that it was concurrently dirtied and
+		 * flushed.
+		 */
+		if (checkit)
+		{
+			LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
+			smgrread(relation->rd_smgr, forknum, blkno, buffer);
+			LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
+
+			/* Add a page cost. */
+			ChecksumCostBalance += ChecksumCostPage;
+		}
+	}
+	else if (needlock)
+	{
+		/*
+		 * Caller asked to read the buffer while we have a lock on the target
+		 * partition.
+		 */
+		smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+		/*
+		 * Add a page miss cost, as we're always reading outside the shared
+		 * buffers.
+		 */
+		ChecksumCostBalance += ChecksumCostPage;
+	}
+
+	LWLockRelease(partLock);
+
+	if (*found_in_sb || needlock)
+		return checkit;
+
+	/*
+	 * Didn't find it in the buffer pool and didn't read it while holding the
+	 * buffer mapping partition lock.  We'll read it from disk after
+	 * releasing the target partition lock, to avoid too much overhead.
+	 * This means a torn page is possible, so in case of error we'll have
+	 * to retry with a suitable lock to avoid a false positive.
+	 */
+	smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+	/* Add a page cost. */
+	ChecksumCostBalance += ChecksumCostPage;
+
+	return checkit;
+}
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 13efa9338c..27dc524e00 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -22,6 +22,7 @@ OBJS = \
 	bool.o \
 	cash.o \
 	char.o \
+	checksumfuncs.o \
 	cryptohashes.o \
 	date.o \
 	datetime.o \
diff --git a/src/backend/utils/adt/checksumfuncs.c b/src/backend/utils/adt/checksumfuncs.c
new file mode 100644
index 0000000000..b53b4c0bbf
--- /dev/null
+++ b/src/backend/utils/adt/checksumfuncs.c
@@ -0,0 +1,96 @@
+/*-------------------------------------------------------------------------
+ *
+ * checksumfuncs.c
+ *	  Functions for checksum-related features such as online verification
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/checksumfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "catalog/pg_authid_d.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/checksum.h"
+#include "utils/acl.h"
+#include "utils/builtins.h"
+
+Datum
+pg_check_relation(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	ForkNumber	forknum = InvalidForkNumber;
+	Oid			relid = InvalidOid;
+	const char *forkname;
+
+	if (!DataChecksumsEnabled())
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("data checksums are not enabled in this cluster")));
+
+	if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("only superusers or members of the pg_read_server_files role may use this function")));
+
+	if (PG_ARGISNULL(0))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("relation cannot be null")));
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not " \
+						"allowed in this context")));
+
+	/* Switch into long-lived context to construct returned data structures */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	relid = PG_GETARG_OID(0);
+
+	if (PG_NARGS() == 2 && !PG_ARGISNULL(1))
+	{
+		forkname = TextDatumGetCString(PG_GETARG_TEXT_PP(1));
+		forknum = forkname_to_number(forkname);
+	}
+
+	/* Initialize cost-based checksum verification delay */
+	ChecksumCostActive = (ChecksumCostDelay > 0);
+	ChecksumCostBalance = 0;
+
+	check_one_relation(tupdesc, tupstore, relid, forknum);
+
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..8683343619 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -134,6 +134,13 @@ int			max_worker_processes = 8;
 int			max_parallel_workers = 8;
 int			MaxBackends = 0;
 
+int			ChecksumCostPage = 10;	/* GUC parameters for checksum check */
+int			ChecksumCostLimit = 200;
+double		ChecksumCostDelay = 0;
+
+int			ChecksumCostBalance = 0;	/* working state for checksums check */
+bool		ChecksumCostActive = false;
+
 int			VacuumCostPageHit = 1;	/* GUC parameters for vacuum */
 int			VacuumCostPageMiss = 10;
 int			VacuumCostPageDirty = 20;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 79bc7ac8ca..a4e1fb0f77 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -693,6 +693,8 @@ const char *const config_group_names[] =
 	gettext_noop("Resource Usage / Disk"),
 	/* RESOURCES_KERNEL */
 	gettext_noop("Resource Usage / Kernel Resources"),
+	/* RESOURCES_CHECKSUM_DELAY */
+	gettext_noop("Resource Usage / Cost-Based Checksum Verification Delay"),
 	/* RESOURCES_VACUUM_DELAY */
 	gettext_noop("Resource Usage / Cost-Based Vacuum Delay"),
 	/* RESOURCES_BGWRITER */
@@ -2396,6 +2398,26 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"checksum_cost_page", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost for verifying a page."),
+			NULL
+		},
+		&ChecksumCostPage,
+		10, 0, 10000,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"checksum_cost_limit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost amount available before napping."),
+			NULL
+		},
+		&ChecksumCostLimit,
+		200, 1, 10000,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_cost_page_hit", PGC_USERSET, RESOURCES_VACUUM_DELAY,
 			gettext_noop("Vacuum cost for a page found in the buffer cache."),
@@ -3527,6 +3549,17 @@ static struct config_real ConfigureNamesReal[] =
 		check_random_seed, assign_random_seed, show_random_seed
 	},
 
+	{
+		{"checksum_cost_delay", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost delay in milliseconds."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&ChecksumCostDelay,
+		0, 0, 100,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_cost_delay", PGC_USERSET, RESOURCES_VACUUM_DELAY,
 			gettext_noop("Vacuum cost delay in milliseconds."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e9f8ca775d..36120787fd 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -156,6 +156,12 @@
 #max_files_per_process = 1000		# min 64
 					# (change requires restart)
 
+# - Cost-Based Checksum Verification Delay -
+
+#checksum_cost_delay = 0			# 0-100 milliseconds (0 disables)
+#checksum_cost_page = 10		# 0-10000 credits
+#checksum_cost_limit = 200		# 1-10000 credits
+
 # - Cost-Based Vacuum Delay -
 
 #vacuum_cost_delay = 0			# 0-100 milliseconds (0 disables)
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index ac8ad8dbf0..7cd005b891 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10798,6 +10798,14 @@
   proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
   proargnames => '{tablespace,name,size,modification}',
   prosrc => 'pg_ls_tmpdir_1arg' },
{ oid => '9147', descr => 'check data checksums for a given relation',
+  proname => 'pg_check_relation', proisstrict => 'f', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,i,o,o,o,o,o}',
+  proargnames => '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation' },
 
 # hash partitioning constraint function
 { oid => '5028', descr => 'hash partition CHECK constraint',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14fa127ab1..8d4605efc8 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -246,6 +246,13 @@ extern PGDLLIMPORT int work_mem;
 extern PGDLLIMPORT int maintenance_work_mem;
 extern PGDLLIMPORT int max_parallel_maintenance_workers;
 
+extern int	ChecksumCostPage;
+extern int	ChecksumCostLimit;
+extern double ChecksumCostDelay;
+
+extern int	ChecksumCostBalance;
+extern bool ChecksumCostActive;
+
 extern int	VacuumCostPageHit;
 extern int	VacuumCostPageMiss;
 extern int	VacuumCostPageDirty;
diff --git a/src/include/storage/checksum.h b/src/include/storage/checksum.h
index 6e77744cbc..29ca52d8c9 100644
--- a/src/include/storage/checksum.h
+++ b/src/include/storage/checksum.h
@@ -13,8 +13,15 @@
 #ifndef CHECKSUM_H
 #define CHECKSUM_H
 
+#include "postgres.h"
+
+#include "access/tupdesc.h"
+#include "common/relpath.h"
 #include "storage/block.h"
+#include "utils/tuplestore.h"
 
+extern void check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+							   Oid relid, ForkNumber single_forknum);
 /*
  * Compute the checksum for a Postgres page.  The page must be aligned on a
  * 4-byte boundary.
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 454c2df487..84965d3f09 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -62,6 +62,7 @@ enum config_group
 	RESOURCES_MEM,
 	RESOURCES_DISK,
 	RESOURCES_KERNEL,
+	RESOURCES_CHECKSUM_DELAY,
 	RESOURCES_VACUUM_DELAY,
 	RESOURCES_BGWRITER,
 	RESOURCES_ASYNCHRONOUS,
diff --git a/src/test/Makefile b/src/test/Makefile
index efb206aa75..832868099f 100644
--- a/src/test/Makefile
+++ b/src/test/Makefile
@@ -12,7 +12,8 @@ subdir = src/test
 top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS = perl regress isolation modules authentication recovery subscription
+SUBDIRS = perl regress isolation modules authentication check_relation \
+	  recovery subscription
 
 # Test suites that are not safe by default but can be run if selected
 # by the user via the whitespace-separated list in variable
diff --git a/src/test/check_relation/.gitignore b/src/test/check_relation/.gitignore
new file mode 100644
index 0000000000..871e943d50
--- /dev/null
+++ b/src/test/check_relation/.gitignore
@@ -0,0 +1,2 @@
+# Generated by test suite
+/tmp_check/
diff --git a/src/test/check_relation/Makefile b/src/test/check_relation/Makefile
new file mode 100644
index 0000000000..792098fa65
--- /dev/null
+++ b/src/test/check_relation/Makefile
@@ -0,0 +1,23 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/test/check_relation
+#
+# Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/test/check_relation/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/test/check_relation
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
+clean distclean maintainer-clean:
+	rm -rf tmp_check
diff --git a/src/test/check_relation/README b/src/test/check_relation/README
new file mode 100644
index 0000000000..415c4b21ad
--- /dev/null
+++ b/src/test/check_relation/README
@@ -0,0 +1,23 @@
+src/test/check_relation/README
+
+Regression tests for online checksums verification
+==================================================
+
+This directory contains a test suite for online checksums verification.
+
+Running the tests
+=================
+
+NOTE: You must have given the --enable-tap-tests argument to configure.
+
+Run
+    make check
+or
+    make installcheck
+You can use "make installcheck" if you previously did "make install".
+In that case, the code in the installation tree is tested.  With
+"make check", a temporary installation tree is built from the current
+sources and then tested.
+
+Either way, this test initializes, starts, and stops a test Postgres
+cluster.
diff --git a/src/test/check_relation/t/01_checksums_check.pl b/src/test/check_relation/t/01_checksums_check.pl
new file mode 100644
index 0000000000..af67252ea1
--- /dev/null
+++ b/src/test/check_relation/t/01_checksums_check.pl
@@ -0,0 +1,276 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 59;
+
+our $CHECKSUM_UINT16_OFFSET = 4;
+our $PD_UPPER_UINT16_OFFSET = 7;
+our $BLOCKSIZE;
+our $TOTAL_NB_ERR = 0;
+
+sub get_block
+{
+	my ($filename, $blkno) = @_;
+	my $block;
+
+	open(my $infile, '<', $filename) or die;
+	binmode($infile);
+
+	# Seek to the requested block; the OFFSET argument of read() is an
+	# offset into the target scalar, not a file position.
+	seek($infile, ($blkno * $BLOCKSIZE), 0) or die($!);
+	my $success = read($infile, $block, $BLOCKSIZE);
+	die($!) if not defined $success;
+
+	close($infile);
+
+	return($block);
+}
+
+sub overwrite_block
+{
+	my ($filename, $block, $blkno) = @_;
+
+	# Open read-write: opening with '>' would truncate the file.
+	open(my $outfile, '+<', $filename) or die;
+	binmode ($outfile);
+
+	# sysseek pairs with syswrite (both bypass stdio buffering); the
+	# OFFSET argument of syswrite is an offset into the scalar, not a
+	# file position.
+	sysseek($outfile, ($blkno * $BLOCKSIZE), 0) or die($!);
+	my $nb = syswrite($outfile, $block, $BLOCKSIZE);
+
+	die($!) if not defined $nb;
+	die("Write error") if ($nb != $BLOCKSIZE);
+
+	close($outfile);
+}
+
+sub get_uint16_from_page
+{
+	my ($block, $offset) = @_;
+
+	return (unpack("S*", $block))[$offset];
+}
+
+sub set_uint16_to_page
+{
+	my ($block, $data, $offset) = @_;
+
+	my $pack = pack("S", $data);
+
+	# vec() with 16-bit or wider elements won't preserve native endianness
+	vec($block, 2*$offset, 8) = (unpack('C*', $pack))[0];
+	vec($block, (2*$offset) + 1, 8) = (unpack('C*', $pack))[1];
+
+	return $block;
+}
+
+sub check_checksums_call
+{
+	my ($node, $relname) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM pg_catalog.pg_check_relation('$relname')"
+	);
+
+	 return ($stderr eq '');
+}
+
+sub check_checksums_nb_error
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+        . " FROM (SELECT pg_catalog.pg_check_relation(oid)"
+        . "   FROM pg_class WHERE relkind in ('r', 'i', 'm')) AS s"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+	is($stdout, $nb, "Should have $nb error");
+
+	$TOTAL_NB_ERR += $nb;
+}
+
+sub check_pg_stat_database_nb_error
+{
+	my ($node) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT "
+		. " sum(checksum_failures)"
+		. " FROM pg_catalog.pg_stat_database"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	is($stderr, '', 'Function should run successfully');
+	is($stdout, $TOTAL_NB_ERR, "Should have $TOTAL_NB_ERR error");
+}
+
+sub get_checksums_errors
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+		. " relid::regclass::text, forknum, failed_blocknum,"
+		. " expected_checksum, found_checksum"
+		. " FROM (SELECT (pg_catalog.pg_check_relation(oid)).*"
+        . "   FROM pg_class WHERE relkind in ('r','i', 'm')) AS s"
+	);
+
+	 is($cmdret, '0', 'Function should run successfully');
+	 like($stderr, $pattern, 'Error output should match expectations');
+
+	 $TOTAL_NB_ERR += $nb;
+
+	 return $stdout;
+}
+
+# This function will perform various tests by modifying the specified block at
+# the specified uint16 offset, checking that the corruption is correctly
+# detected, and finally restore the specified block to its original content.
+sub corrupt_and_test_block
+{
+	my ($node, $filename, $blkno, $offset, $fake_data) = @_;
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+
+	check_pg_stat_database_nb_error($node);
+
+	$node->stop();
+
+	my $original_block = get_block($filename, 0);
+	my $original_data = get_uint16_from_page($original_block, $offset);
+
+	isnt($original_data, $fake_data,
+		"The fake data at offset $offset should be different"
+		. " from the existing one");
+
+	my $new_block = set_uint16_to_page($original_block, $fake_data, $offset);
+	isnt($original_data, get_uint16_from_page($new_block, $offset),
+		"The fake data at offset $offset should have been changed in memory");
+
+	overwrite_block($filename, $new_block, 0);
+
+	my $written_data = get_uint16_from_page(get_block($filename, 0), $offset);
+	isnt($original_data, $written_data,
+		"The data written at offset $offset should be different"
+		. " from the original one");
+	is(get_uint16_from_page($new_block, $offset), $written_data,
+		"The data written at offset $offset should be the same"
+		. " as the one in memory");
+	is($written_data, $fake_data,
+		"The data written at offset $offset should be the one"
+		. "	we wanted to write");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 1, qr/invalid page in block $blkno/);
+
+	my $expected_checksum;
+	my $found_checksum = get_uint16_from_page($new_block,
+		$CHECKSUM_UINT16_OFFSET);
+	if ($offset == $PD_UPPER_UINT16_OFFSET)
+	{
+		# A checksum can't be computed if it's detected as PageIsNew(), so the
+		# function returns NULL for the computed checksum
+		$expected_checksum = '';
+	}
+	else
+	{
+		$expected_checksum = get_uint16_from_page($original_block,
+			$CHECKSUM_UINT16_OFFSET);
+	}
+
+	my $det = get_checksums_errors($node, 1, qr/invalid page in block $blkno/);
+	is($det, "t1|0|0|$expected_checksum|$found_checksum",
+		"The checksums error for modification at offset $offset"
+		. " should be detected");
+
+	$node->stop();
+
+	$new_block = set_uint16_to_page($original_block, $original_data, $offset);
+	is($original_data, get_uint16_from_page($new_block, $offset),
+		"The data at offset $offset should have been restored in memory");
+
+	overwrite_block($filename, $new_block, 0);
+	is($original_data, get_uint16_from_page(get_block($filename, $blkno),
+			$offset),
+		"The data at offset $offset should have been restored on disk");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+}
+
+if (exists $ENV{MY_PG_REGRESS})
+{
+	$ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS};
+}
+
+my $node = get_new_node('main');
+
+my %params;
+$params{'extra'} = ['--data-checksums'];
+$node->init(%params);
+
+$node->start();
+
+$ENV{PGOPTIONS} = '--client-min-messages=WARNING';
+
+my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_checksums')");
+
+is($stdout, 'on', 'Data checksums should be enabled');
+
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('block_size')");
+
+$BLOCKSIZE = $stdout;
+
+$node->safe_psql(
+	'postgres', q|
+	CREATE TABLE public.t1(id integer);
+	CREATE INDEX t1_id_idx ON public.t1 (id);
+	INSERT INTO public.t1 SELECT generate_series(1, 100);
+	CREATE VIEW public.v1 AS SELECT * FROM t1;
+	CREATE MATERIALIZED VIEW public.mv1 AS SELECT * FROM t1;
+	CREATE SEQUENCE public.s1;
+	CREATE UNLOGGED TABLE public.u_t1(id integer);
+	CREATE INDEX u_t1_id_idx ON public.u_t1 (id);
+	INSERT INTO public.u_t1 SELECT generate_series(1, 100);
+	CHECKPOINT;
+|);
+
+# Check sane behavior on various object types, including those that don't
+# have storage.
+is(check_checksums_call($node, 't1'), '1', 'Can check a table');
+is(check_checksums_call($node, 't1_id_idx'), '1', 'Can check an index');
+is(check_checksums_call($node, 'v1'), '', 'Cannot check a view');
+is(check_checksums_call($node, 'mv1'), '1', 'Can check a materialized view');
+is(check_checksums_call($node, 's1'), '1', 'Can check a sequence');
+is(check_checksums_call($node, 'u_t1'), '1', 'Can check an unlogged table');
+is(check_checksums_call($node, 'u_t1_id_idx'), '1', 'Can check an unlogged index');
+
+# get the underlying heap absolute path
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_directory') || '/' || pg_relation_filepath('t1')"
+);
+
+isnt($stdout, '', 'A relation file path should be returned');
+
+my $filename = $stdout;
+
+check_checksums_nb_error($node, 0, qr/^$/);
+
+check_pg_stat_database_nb_error($node);
+
+my $fake_uint16 = hex '0x0000';
+
+# Test with a modified checksum.  We use a zero checksum here as it's the only
+# one that cannot exist on a checksummed page.  We also don't have an easy way
+# to compute what the checksum would be after a modification in a random place
+# in the block.
+corrupt_and_test_block($node, $filename, 0, $CHECKSUM_UINT16_OFFSET,
+	$fake_uint16);
+
+# Test corruption making the block looks like it's PageIsNew().
+corrupt_and_test_block($node, $filename, 0, $PD_UPPER_UINT16_OFFSET,
+	$fake_uint16);
-- 
2.20.1
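The uint16 offsets hardcoded in the test above ($CHECKSUM_UINT16_OFFSET = 4 and $PD_UPPER_UINT16_OFFSET = 7) follow from the layout of the page header. A simplified sketch; the struct below is an illustration only, with pd_lsn split into two uint32 halves so the byte offsets line up without padding, while the authoritative definition of PageHeaderData lives in src/include/storage/bufpage.h:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified stand-in for PostgreSQL's PageHeaderData.  Only the leading
 * fields matter for the offsets used by the Perl test.
 */
typedef struct SketchPageHeader
{
	uint32_t	pd_lsn_hi;		/* bytes 0-3 */
	uint32_t	pd_lsn_lo;		/* bytes 4-7 */
	uint16_t	pd_checksum;	/* bytes 8-9:   uint16 index 4 */
	uint16_t	pd_flags;		/* bytes 10-11 */
	uint16_t	pd_lower;		/* bytes 12-13 */
	uint16_t	pd_upper;		/* bytes 14-15: uint16 index 7 */
} SketchPageHeader;
```

Reading the page as an array of uint16 values, as the Perl helpers do, therefore makes pd_checksum element 4 and pd_upper element 7, matching the two constants.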

#22Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: Julien Rouhaud (#21)
Re: Online checksums verification in the backend

On Sat, 28 Mar 2020 at 21:19, Julien Rouhaud <rjuju123@gmail.com> wrote:

On Sat, Mar 28, 2020 at 12:28:27PM +0900, Masahiko Sawada wrote:

On Wed, 18 Mar 2020 at 19:11, Julien Rouhaud <rjuju123@gmail.com> wrote:

v5 attached

Thank you for updating the patch. I have some comments:

Thanks a lot for the review!

Thank you for updating the patch!

4.
+   /* Check if the relation (still) exists */
+   if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
+   {
+       /*
+        * We use a standard relation_open() to acquire the initial lock.  It
+        * means that this will block until the lock is acquired, or will
+        * raise an ERROR if lock_timeout has been set.  If caller wants to
+        * check multiple tables while relying on a maximum wait time, it
+        * should process tables one by one instead of relying on a global
+        * processing with the main SRF.
+        */
+       relation = relation_open(relid, AccessShareLock);
+   }

IIUC the above was necessary because we used to have
check_all_relations() which iterates all relations on the database to
do checksum checks. But now that we don't have that function and
pg_check_relation processes one relation. Can we just do
relation_open() here?

Ah yes I missed that comment. I think only the comment needed to be updated to
remove any part related to the NULL-relation call. I ended up removing the whole
comment since locking and lock_timeout behavior is inherent to relation_open
and there's no need to document that any further now that we always only check
one relation at a time.

The current patch still checks SearchSysCacheExists1() before
relation_open. Why do we need to call SearchSysCacheExists1() here? I
think if the given relation doesn't exist, relation_open() will raise
an error "could not open relation with OID %u".

+   /* Open the relation if it exists.  */
+   if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
+   {
+       relation = relation_open(relid, AccessShareLock);
+   }
8.
+       if (PageIsVerified(buffer, blkno))
+       {
+           /*
+            * If the page is really new, there won't be any checksum to be
+            * computed or expected.
+            */
+           *chk_expected = *chk_found = NoComputedChecksum;
+           return true;
+       }
+       else
+       {
+           /*
+            * There's a corruption, but since this affects PageIsNew, we
+            * can't compute a checksum, so set NoComputedChecksum for the
+            * expected checksum.
+            */
+           *chk_expected = NoComputedChecksum;
+           *chk_found = hdr->pd_checksum;
+       }
+       return false;

* I think the 'else' is not necessary here.

AFAICT it's, see below.

* Setting *chk_expected and *chk_found seems useless when we return
true. The caller doesn't use them.

Indeed, fixed.

The patch still sets values to both?

+       if (PageIsVerified(buffer, blkno))
+       {
+           /* No corruption. */
+           *chk_expected = *chk_found = NoComputedChecksum;
+           return true;
+       }

* Should we forcibly overwrite ignore_checksum_failure to off in
pg_check_relation()? Otherwise, this logic seems not work fine.

* I don't understand why we cannot compute a checksum in case where a
page looks like a new page but is actually corrupted. Could you please
elaborate on that?

PageIsVerified has a different behavior depending on whether the page looks new
or not. If the page looks new, it only checks that it's indeed a new
page, and otherwise tries to verify the checksum.

Also, pg_check_page() has an assert to make sure that the page isn't (or doesn't
look like) new.

So it seems to me that the 'else' is required to properly detect a real or fake
PageIsNew, and try to compute checksums only when required.

Thank you for your explanation! I understand.

I thought we can arrange the code to something like:

if (PageIsNew(hdr))
{
	if (PageIsVerified(hdr))
	{
		*chk_expected = *chk_found = NoComputedChecksum;
		return true;
	}

	*chk_expected = NoComputedChecksum;
	*chk_found = hdr->pd_checksum;
	return false;
}

But since it's not a critical problem you can ignore it.

8.
+   {
+       {"checksum_cost_page_hit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+           gettext_noop("Checksum cost for a page found in the buffer cache."),
+           NULL
+       },
+       &ChecksumCostPageHit,
+       1, 0, 10000,
+       NULL, NULL, NULL
+   },

* There is no description about the newly added four GUC parameters in the doc.

* We need to put new GUC parameters into postgresql.conf.sample as well.

Fixed both.

* The patch doesn't use checksum_cost_page_hit at all.

Indeed, I also realized that while working on previous issues. I removed it
and renamed checksum_cost_page_miss to checksum_cost_page.

Perhaps we can use checksum_cost_page_hit when we found the page in
the shared buffer but it's marked as dirty?

9.
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..37f63e747c 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -134,6 +134,14 @@ int            max_worker_processes = 8;
int            max_parallel_workers = 8;
int            MaxBackends = 0;
+int            ChecksumCostPageHit = 1;    /* GUC parameters for checksum check */
+int            ChecksumCostPageMiss = 10;
+int            ChecksumCostLimit = 200;
+double     ChecksumCostDelay = 0;
+
+int            ChecksumCostBalance = 0;    /* working state for checksums check */
+bool       ChecksumCostActive = false;

Can we declare them in checksum.c since these parameters are used only
in checksum.c and it does I/O by itself.

The GUC parameters would still need to be global, so for consistency I kept all
the variables in globals.c.

Okay.

10.
+       /* Report the failure to the stat collector and the logs. */
+       pgstat_report_checksum_failure();
+       ereport(WARNING,
+               (errcode(ERRCODE_DATA_CORRUPTED),
+                errmsg("invalid page in block %u of relation %s",
+                       blkno,
+                       relpath(relation->rd_smgr->smgr_rnode, forknum))));

I think we could do pgstat_report_checksum_failure() and emit WARNING
twice for the same page since PageIsVerified() also does the same.

As mentioned before, in this patch I only call PageIsVerified() if the buffer
looks new, and in this case PageIsVerified() only verifies that it's a true
all-zero page, and won't try to verify the checksum, so there's no possibility
of a duplicated report. I modified the comments to document all the interactions
and expectations.

You're right. Thank you for the explanation!

I've read the latest patch and here is random comments:

1.
+       /*
+        * Add a page miss cost, as we're always reading outside the shared
+        * buffers.
+        */
+       /* Add a page cost. */
+       ChecksumCostBalance += ChecksumCostPage;

There are duplicate comments.

2.
+       /* Dirty pages are ignored as they'll be flushed soon. */
+       if (buf_state & BM_DIRTY)
+           checkit = false;

Should we check the buffer if it has BM_TAG_VALID as well here? I
thought there might be a possibility that BufTableLookup() returns a
buf_Id but its buffer tag is not valid, for example when the previous
read failed after inserting the buffer tag to the buffer table.
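The suggested test can be sketched as follows; the flag values here are made up for illustration, while the real bit assignments and the buffer mapping table live in src/include/storage/buf_internals.h:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits standing in for the real buffer-state flags. */
#define BM_DIRTY		(1U << 0)
#define BM_TAG_VALID	(1U << 1)

/*
 * Decide whether a buffer found via the buffer mapping table is worth
 * checking: skip buffers whose tag isn't valid yet (a previous read may
 * have failed after the tag was inserted) and dirty buffers (they will
 * be flushed with a freshly computed checksum anyway).
 */
static bool
buffer_is_checkable(uint32_t buf_state)
{
	if (!(buf_state & BM_TAG_VALID))
		return false;
	if (buf_state & BM_DIRTY)
		return false;
	return true;
}
```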

3.
+   /* Add a page cost. */
+   ChecksumCostBalance += ChecksumCostPage;
+
+   return checkit;
+}

The check_get_buffer() seems to be slightly complex to me but when we
reached the end of this function we always return true. Similarly, in
the case where we read the block while holding a partition lock we
always return true as well. Is my understanding right? If so, it might
be better to put some assertions.

4.
@@ -10825,6 +10825,14 @@
   proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
   proargnames => '{tablespace,name,size,modification}',
   prosrc => 'pg_ls_tmpdir_1arg' },
+{ oid => '9147', descr => 'check data integrity for one or all relations',
+  proname => 'pg_check_relation', proisstrict => 'f', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,i,o,o,o,o,o}',
+  proargnames => '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation' },

Why is the pg_check_relation() is not a strict function? I think
prostrict can be 'true' for this function and we can drop checking if
the first argument is NULL.

5.
+       memset(values, 0, sizeof(values));
+       memset(nulls, 0, sizeof(nulls));

I think we can do memset right before setting values to them, that is,
after checking (!found_in_sb && !force_lock).

6.
+static bool
+check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+                  uint16 *chk_found)
+{
+   PageHeader  hdr = (PageHeader) buffer;
+
+   Assert(chk_expected && chk_found);
+
+   if (PageIsNew(hdr))
+   {
+       /*
+        * Check if the page is really new or if there's a corruption that
+        * affected PageIsNew detection.  Note that PageIsVerified won't try to
+        * detect checksum corruption in this case, so there's no risk of
+        * duplicated corruption report.
+        */
+       if (PageIsVerified(buffer, blkno))

How about using Page instead of PageHeader? Looking at other codes,
ISTM we usually pass Page to both PageIsNew() and PageIsVerified().

7.
+       <entry>
+        <literal><function>pg_check_relation(<parameter>relation</parameter> <type>oid</type>[, <parameter>fork</parameter> <type>text</type>])</function></literal>.
+       </entry>
+{ oid => '9147', descr => 'check data integrity for one or all relations',
+  proname => 'pg_check_relation', proisstrict => 'f', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,i,o,o,o,o,o}',
+  proargnames => '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation' },

The function argument data types don't match in the doc and function
declaretion. relation is 'oid' in the doc but is 'regclass' in the
function declaretion.

8.
+#define SRF_COLS 5 /* Number of output arguments in the SRF */

Looking at similar built-in functions that return set of records they
use a more specific name for the number of returned columns such as
PG_STAT_GET_WAL_SENDERS_COLS and PG_GET_SHMEM_SIZES_COLS. How about
PG_CHECK_RELATION_COLS?

check_relation_fork() seems to quite depend on pg_check_relation()
because the returned tuplestore is specified by pg_check_relation().
It's just an idea but to improve reusability, how about moving
check_relation_fork() to checksumfunc.c? That is, in checksumfuncs.c
while iterating all blocks we call a new function in checksum.c, say
check_one_block() function, which has the following part and is
responsible for getting, checking the specified block and returning a
boolean indicating whether the block has corruption or not, along with
chk_found and chk_expected:

	/*
	 * To avoid too much overhead, the buffer will be first read without
	 * the locks that would guarantee the lack of false positive, as such
	 * events should be quite rare.
	 */
Retry:
	if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock,
						  &found_in_sb))
		continue;

	if (check_buffer(buffer, blkno, &chk_expected, &chk_found))
		continue;

	/*
	 * If we get a failure and the buffer wasn't found in shared buffers,
	 * reread the buffer with suitable lock to avoid false positive. See
	 * check_get_buffer for more details.
	 */
	if (!found_in_sb && !force_lock)
	{
		force_lock = true;
		goto Retry;
	}

A new function in checksumfuncs.c or pg_check_relation will be
responsible for storing the result to the tuplestore. That way,
check_one_block() will be useful for other uses when we want to check
whether a particular block is corrupted, with low overhead.
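The retry shape quoted above can be modeled standalone. Everything below is a toy stand-in: the two helpers are fakes (the unlocked read is made to always hit the concurrent-write race so the escalation path is exercised), and only the control flow — cheap unlocked read first, locked re-read on failure — matches the pseudocode:

```c
#include <assert.h>
#include <stdbool.h>

/* Outcome of checking one block in this toy model. */
typedef enum { BLOCK_OK, BLOCK_CORRUPT } BlockStatus;

/* Fake check_get_buffer(): pretend the block is read from disk. */
static bool
toy_get_buffer(bool force_lock, bool *found_in_sb)
{
	(void) force_lock;
	*found_in_sb = false;		/* never found in shared buffers */
	return true;				/* block successfully read */
}

/* Fake check_buffer(): without the lock, a concurrent write always
 * makes a healthy block look corrupted in this model. */
static bool
toy_check_buffer(bool force_lock, bool truly_corrupt)
{
	if (!force_lock && !truly_corrupt)
		return false;			/* false positive on the unlocked read */
	return !truly_corrupt;
}

/* Control flow of the quoted Retry loop: only escalate to the locked
 * (more expensive) read when the cheap unlocked read reports a failure. */
static BlockStatus
check_one_block(bool truly_corrupt)
{
	bool		force_lock = false;
	bool		found_in_sb;

retry:
	if (!toy_get_buffer(force_lock, &found_in_sb))
		return BLOCK_OK;		/* nothing to check */

	if (toy_check_buffer(force_lock, truly_corrupt))
		return BLOCK_OK;

	if (!found_in_sb && !force_lock)
	{
		force_lock = true;
		goto retry;
	}
	return BLOCK_CORRUPT;
}
```

A healthy block survives the false positive (the locked re-read clears it), while a truly corrupted block fails both reads and is reported.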

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#23Julien Rouhaud
rjuju123@gmail.com
In reply to: Masahiko Sawada (#22)
1 attachment(s)
Re: Online checksums verification in the backend

On Fri, Apr 03, 2020 at 12:24:50PM +0900, Masahiko Sawada wrote:

On Sat, 28 Mar 2020 at 21:19, Julien Rouhaud <rjuju123@gmail.com> wrote:

The current patch still checks SearchSysCacheExists1() before
relation_open. Why do we need to call SearchSysCacheExists1() here? I
think if the given relation doesn't exist, relation_open() will raise
an error "could not open relation with OID %u".

+   /* Open the relation if it exists.  */
+   if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
+   {
+       relation = relation_open(relid, AccessShareLock);
+   }

Oops yes sorry about that. Fixed.

8.
+       if (PageIsVerified(buffer, blkno))
+       {
+           /*
+            * If the page is really new, there won't be any checksum to be
+            * computed or expected.
+            */
+           *chk_expected = *chk_found = NoComputedChecksum;
+           return true;
+       }
+       else
+       {
+           /*
+            * There's a corruption, but since this affects PageIsNew, we
+            * can't compute a checksum, so set NoComputedChecksum for the
+            * expected checksum.
+            */
+           *chk_expected = NoComputedChecksum;
+           *chk_found = hdr->pd_checksum;
+       }
+       return false;

* I think the 'else' is not necessary here.

AFAICT it's, see below.

* Setting *chk_expected and *chk_found seems useless when we return
true. The caller doesn't use them.

Indeed, fixed.

The patch still sets values to both?

+       if (PageIsVerified(buffer, blkno))
+       {
+           /* No corruption. */
+           *chk_expected = *chk_found = NoComputedChecksum;
+           return true;
+       }

Sorry again, fixed.

* Should we forcibly overwrite ignore_checksum_failure to off in
pg_check_relation()? Otherwise, this logic seems not work fine.

* I don't understand why we cannot compute a checksum in case where a
page looks like a new page but is actually corrupted. Could you please
elaborate on that?

PageIsVerified has a different behavior depending on whether the page looks new
or not. If the page looks new, it only checks that it's indeed a new
page, and otherwise tries to verify the checksum.

Also, pg_check_page() has an assert to make sure that the page isn't (or doesn't
look like) new.

So it seems to me that the 'else' is required to properly detect a real or fake
PageIsNew, and try to compute checksums only when required.

Thank you for your explanation! I understand.

I thought we can arrange the code to something like:

if (PageIsNew(hdr))
{
	if (PageIsVerified(hdr))
	{
		*chk_expected = *chk_found = NoComputedChecksum;
		return true;
	}

	*chk_expected = NoComputedChecksum;
	*chk_found = hdr->pd_checksum;
	return false;
}

But since it's not a critical problem you can ignore it.

I like it, so done!

8.
+   {
+       {"checksum_cost_page_hit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+           gettext_noop("Checksum cost for a page found in the buffer cache."),
+           NULL
+       },
+       &ChecksumCostPageHit,
+       1, 0, 10000,
+       NULL, NULL, NULL
+   },

* There is no description about the newly added four GUC parameters in the doc.

* We need to put new GUC parameters into postgresql.conf.sample as well.

Fixed both.

* The patch doesn't use checksum_cost_page_hit at all.

Indeed, I also realized that while working on previous issues. I removed it
and renamed checksum_cost_page_miss to checksum_cost_page.

Perhaps we can use checksum_cost_page_hit when we found the page in
the shared buffer but it's marked as dirty?

The thing is that when the buffer is dirty, we won't do any additional check,
thus not adding any overhead. What may be needed here is to account for the
locking overhead (in all cases), so that if all (or almost all) the buffers are
dirty and in shared buffers the execution can be throttled. I don't know how
much an issue it can be, but if that's something to be fixed then page_hit
doesn't look like the right answer for that.
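For illustration, here is a standalone model of the vacuum-style cost accounting being discussed. The names and defaults follow the GUCs quoted from the patch, and the arithmetic mirrors how vacuum_delay_point() computes its sleep, but this is only a sketch, not the patch's actual code:

```c
#include <assert.h>

/* Knobs modeled on the patch's GUCs. */
static int		ChecksumCostPage = 10;
static int		ChecksumCostLimit = 200;
static double	ChecksumCostDelay = 10.0;	/* milliseconds */
static int		ChecksumCostBalance = 0;	/* working state */

/*
 * Charge the cost of checking one page and return the sleep (in ms) now
 * due: once the accumulated balance reaches the limit, sleep
 * proportionally and reset the balance, vacuum_delay_point() style.
 */
static double
charge_one_page(void)
{
	ChecksumCostBalance += ChecksumCostPage;
	if (ChecksumCostDelay > 0 &&
		ChecksumCostBalance >= ChecksumCostLimit)
	{
		double		msec;

		msec = ChecksumCostDelay * ChecksumCostBalance / ChecksumCostLimit;
		ChecksumCostBalance = 0;
		return msec;			/* caller would pg_usleep(msec * 1000) */
	}
	return 0.0;
}
```

With these defaults, checking 50 pages charges 500 cost units and triggers two 10 ms sleeps (one each time the balance reaches 200), which is the throttling behavior the per-page costs are meant to buy.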

I've read the latest patch and here is random comments:

1.
+       /*
+        * Add a page miss cost, as we're always reading outside the shared
+        * buffers.
+        */
+       /* Add a page cost. */
+       ChecksumCostBalance += ChecksumCostPage;

There are duplicate comments.

Fixed.

2.
+       /* Dirty pages are ignored as they'll be flushed soon. */
+       if (buf_state & BM_DIRTY)
+           checkit = false;

Should we check the buffer if it has BM_TAG_VALID as well here? I
thought there might be a possibility that BufTableLookup() returns a
buf_Id but its buffer tag is not valid, for example when the previous
read failed after inserting the buffer tag to the buffer table.

Good point, fixed.

3.
+   /* Add a page cost. */
+   ChecksumCostBalance += ChecksumCostPage;
+
+   return checkit;
+}

The check_get_buffer() seems to be slightly complex to me but when we
reached the end of this function we always return true. Similarly, in
the case where we read the block while holding a partition lock we
always return true as well. Is my understanding right? If so, it might
be better to put some assertions.

Yes it's a little bit complex. I used this approach to avoid the need to
release the locks all over the place, but maybe this doesn't really improve
things. I added asserts and comments anyway as suggested, thanks.

4.
@@ -10825,6 +10825,14 @@
proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
proargnames => '{tablespace,name,size,modification}',
prosrc => 'pg_ls_tmpdir_1arg' },
+{ oid => '9147', descr => 'check data integrity for one or all relations',
+  proname => 'pg_check_relation', proisstrict => 'f', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,i,o,o,o,o,o}',
+  proargnames => '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation' },

Why is the pg_check_relation() is not a strict function? I think
prostrict can be 'true' for this function and we can drop checking if
the first argument is NULL.

That's because the fork is still optional. While this could be made mandatory
without much trouble, I think we'll eventually want to add a way to check only
a subset of a fork, so it seemed to me that it wasn't worth changing that now.

5.
+       memset(values, 0, sizeof(values));
+       memset(nulls, 0, sizeof(nulls));

I think we can do memset right before setting values to them, that is,
after checking (!found_in_sb && !force_lock).

Indeed, done!

6.
+static bool
+check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+                  uint16 *chk_found)
+{
+   PageHeader  hdr = (PageHeader) buffer;
+
+   Assert(chk_expected && chk_found);
+
+   if (PageIsNew(hdr))
+   {
+       /*
+        * Check if the page is really new or if there's a corruption that
+        * affected PageIsNew detection.  Note that PageIsVerified won't try to
+        * detect checksum corruption in this case, so there's no risk of
+        * duplicated corruption report.
+        */
+       if (PageIsVerified(buffer, blkno))

How about using Page instead of PageHeader? Looking at other codes,
ISTM we usually pass Page to both PageIsNew() and PageIsVerified().

Agreed, done.

7.
+       <entry>
+        <literal><function>pg_check_relation(<parameter>relation</parameter> <type>oid</type>[, <parameter>fork</parameter> <type>text</type>])</function></literal>.
+       </entry>
+{ oid => '9147', descr => 'check data integrity for one or all relations',
+  proname => 'pg_check_relation', proisstrict => 'f', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,i,o,o,o,o,o}',
+  proargnames => '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation' },

The function argument data types don't match in the doc and function
declaretion. relation is 'oid' in the doc but is 'regclass' in the
function declaretion.

Fixed.

8.
+#define SRF_COLS 5 /* Number of output arguments in the SRF */

Looking at similar built-in functions that return set of records they
use a more specific name for the number of returned columns such as
PG_STAT_GET_WAL_SENDERS_COLS and PG_GET_SHMEM_SIZES_COLS. How about
PG_CHECK_RELATION_COLS?

check_relation_fork() seems to quite depend on pg_check_relation()
because the returned tuplestore is specified by pg_check_relation().
It's just an idea but to improve reusability, how about moving
check_relation_fork() to checksumfunc.c? That is, in checksumfuncs.c
while iterating all blocks we call a new function in checksum.c, say
check_one_block() function, which has the following part and is
responsible for getting, checking the specified block and returning a
boolean indicating whether the block has corruption or not, along with
chk_found and chk_expected:

/*
* To avoid too much overhead, the buffer will be first read without
* the locks that would guarantee the lack of false positive, as such
* events should be quite rare.
*/
Retry:
if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock,
&found_in_sb))
continue;

if (check_buffer(buffer, blkno, &chk_expected, &chk_found))
continue;

/*
* If we get a failure and the buffer wasn't found in shared buffers,
* reread the buffer with suitable lock to avoid false positive. See
* check_get_buffer for more details.
*/
if (!found_in_sb && !force_lock)
{
force_lock = true;
goto Retry;
}

A new function in checksumfuncs.c or pg_check_relation will be
responsible for storing the result to the tuplestore. That way,
check_one_block() will be useful for other uses when we want to check
whether a particular block is corrupted, with low overhead.

Yes, I agree that passing the tuplestore isn't an ideal approach and some
refactoring should probably happen. One thing is that this wouldn't be
"check_one_block()" but "check_one_block_on_disk()" (which could also be from
the OS cache). I'm not sure how useful it is in itself.  It also raises some
concerns about the throttling. I didn't change that for now, but I hope
there'll be some other feedback about it.

Attachments:

v6-0001-Add-a-pg_check_relation-function.patch (text/plain)
From ba7d80691ef6f34b0771fc72f38da7fc8a9c0a72 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <jrouhaud@vmware.com>
Date: Mon, 4 Nov 2019 08:40:23 +0100
Subject: [PATCH v6] Add a pg_check_relation() function.

This function checks the validity of the checksums for all non-dirty blocks of
a given relation (optionally restricted to a given fork), and returns the list
of all blocks whose checksum doesn't match, along with the expected and found
checksums.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada
Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  85 ++++
 doc/src/sgml/func.sgml                        |  51 +++
 src/backend/catalog/system_views.sql          |   7 +
 src/backend/storage/page/checksum.c           | 418 +++++++++++++++++-
 src/backend/utils/adt/Makefile                |   1 +
 src/backend/utils/adt/checksumfuncs.c         |  96 ++++
 src/backend/utils/init/globals.c              |   7 +
 src/backend/utils/misc/guc.c                  |  33 ++
 src/backend/utils/misc/postgresql.conf.sample |   6 +
 src/include/catalog/pg_proc.dat               |   8 +
 src/include/miscadmin.h                       |   7 +
 src/include/storage/checksum.h                |   7 +
 src/include/utils/guc_tables.h                |   1 +
 src/test/Makefile                             |   3 +-
 src/test/check_relation/.gitignore            |   2 +
 src/test/check_relation/Makefile              |  23 +
 src/test/check_relation/README                |  23 +
 .../check_relation/t/01_checksums_check.pl    | 276 ++++++++++++
 18 files changed, 1050 insertions(+), 4 deletions(-)
 create mode 100644 src/backend/utils/adt/checksumfuncs.c
 create mode 100644 src/test/check_relation/.gitignore
 create mode 100644 src/test/check_relation/Makefile
 create mode 100644 src/test/check_relation/README
 create mode 100644 src/test/check_relation/t/01_checksums_check.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c4d6ed4bbc..259793650f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2038,6 +2038,91 @@ include_dir 'conf.d'
      </note>
     </sect2>
 
+    <sect2 id="runtime-config-resource-checksum-verification-cost">
+     <title>Cost-based Checksum Verification Delay</title>
+
+     <para>
+      During the execution of the <xref linkend="functions-check-relation-note"/>
+      function, the system maintains an internal counter that keeps track of
+      the estimated cost of the various I/O operations that are performed.
+      When the accumulated cost reaches a limit (specified by
+      <varname>checksum_cost_limit</varname>), the process performing the
+      operation will sleep for a short period of time, as specified by
+      <varname>checksum_cost_delay</varname>. Then it will reset the counter
+      and continue execution.
+     </para>
+
+     <para>
+      This feature is disabled by default. To enable it, set the
+      <varname>checksum_cost_delay</varname> variable to a nonzero
+      value.
+     </para>
+
+     <variablelist>
+      <varlistentry id="guc-checksum-cost-delay" xreflabel="checksum_cost_delay">
+       <term><varname>checksum_cost_delay</varname> (<type>floating point</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_delay</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The amount of time that the process will sleep
+         when the cost limit has been exceeded.
+         If this value is specified without units, it is taken as milliseconds.
+         The default value is zero, which disables the cost-based checksum
+         verification delay feature.  Positive values enable cost-based
+         checksum verification.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-checksum-cost-page" xreflabel="checksum_cost_page">
+       <term><varname>checksum_cost_page</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_page</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The estimated cost for verifying a buffer, whether it's found in the
+         shared buffer cache or not. It represents the cost to lock the buffer
+         pool, lookup the shared hash table, read the content of the page from
+         disk and compute its checksum.  The default value is 10.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-checksum-cost-limit" xreflabel="checksum_cost_limit">
+       <term><varname>checksum_cost_limit</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_limit</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The accumulated cost that will cause the verification process to sleep.
+         The default value is 200.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
+
+     <note>
+      <para>
+       There are certain operations that hold critical locks and should
+       therefore complete as quickly as possible.  Cost-based checksum
+       verification delays do not occur during such operations.  Therefore it
+       is possible that the cost accumulates far higher than the specified
+       limit.  To avoid uselessly long delays in such cases, the actual delay
+       is calculated as <varname>checksum_cost_delay</varname> *
+       <varname>accumulated_balance</varname> /
+       <varname>checksum_cost_limit</varname> with a maximum of
+       <varname>checksum_cost_delay</varname> * 4.
+      </para>
+     </note>
+    </sect2>
+
     <sect2 id="runtime-config-resource-background-writer">
      <title>Background Writer</title>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 4d88b45e72..75bcf3484f 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -21896,6 +21896,57 @@ SELECT (pg_stat_file('filename')).modification;
 
   </sect2>
 
+  <sect2 id="functions-data-sanity">
+   <title>Data Sanity Functions</title>
+
+   <para>
+    The functions shown in <xref linkend="functions-data-sanity-table"/>
+    provide a means to check the health of data files in a cluster.
+   </para>
+
+   <table id="functions-data-sanity-table">
+    <title>Data Sanity Functions</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry>
+        <literal><function>pg_check_relation(<parameter>relation</parameter> <type>regclass</type>[, <parameter>fork</parameter> <type>text</type>])</function></literal>
+       </entry>
+       <entry><type>setof record</type></entry>
+       <entry>Validate the checksums for all blocks of a given relation, and
+       optionally the given fork.</entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <indexterm>
+    <primary>pg_check_relation</primary>
+   </indexterm>
+   <para id="functions-check-relation-note" xreflabel="pg_check_relation">
+    <function>pg_check_relation</function> iterates over all the blocks of a
+    given relation and verifies their checksums.  If provided,
+    <replaceable>fork</replaceable> should be <literal>'main'</literal> for the
+    main data fork, <literal>'fsm'</literal> for the free space map,
+    <literal>'vm'</literal> for the visibility map, or
+    <literal>'init'</literal> for the initialization fork, and only this
+    specific fork will be verified; otherwise all forks will be.  The function
+    returns the list of blocks for which the found checksum doesn't match the
+    expected one.  See <xref
+    linkend="runtime-config-resource-checksum-verification-cost"/> for
+    information on how to configure cost-based verification delay.  You must be
+    a member of the <literal>pg_read_all_stats</literal> role to use this
+    function.  It can only be used if data checksums are enabled.  See <xref
+    linkend="app-initdb-data-checksums"/> for more information.
+   </para>
+
+  </sect2>
+
   </sect1>
 
   <sect1 id="functions-trigger">
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 813ea8bfc3..f289afda6d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1415,6 +1415,13 @@ LANGUAGE internal
 STRICT IMMUTABLE PARALLEL SAFE
 AS 'unicode_is_normalized';
 
+CREATE OR REPLACE FUNCTION pg_check_relation(
+  IN relation regclass, IN fork text DEFAULT NULL::text,
+  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+  OUT expected_checksum integer, OUT found_checksum integer)
+  RETURNS SETOF record VOLATILE LANGUAGE internal AS 'pg_check_relation'
+  PARALLEL RESTRICTED;
+
 --
 -- The default permissions for functions mean that anyone can execute them.
 -- A number of functions shouldn't be executable by just anyone, but rather
diff --git a/src/backend/storage/page/checksum.c b/src/backend/storage/page/checksum.c
index e010691c9f..0529e8204a 100644
--- a/src/backend/storage/page/checksum.c
+++ b/src/backend/storage/page/checksum.c
@@ -15,8 +15,420 @@
 
 #include "storage/checksum.h"
 /*
- * The actual code is in storage/checksum_impl.h.  This is done so that
- * external programs can incorporate the checksum code by #include'ing
- * that file from the exported Postgres headers.  (Compare our CRC code.)
+ * The actual checksum computation code is in storage/checksum_impl.h.  This
+ * is done so that external programs can incorporate the checksum code by
+ * #include'ing that file from the exported Postgres headers.  (Compare our
+ * CRC code.)
  */
 #include "storage/checksum_impl.h"
+
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "catalog/pg_authid_d.h"
+#include "commands/dbcommands.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/buf_internals.h"
+#include "storage/checksum.h"
+#include "storage/lockdefs.h"
+#include "utils/acl.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/lsyscache.h"
+#include "utils/rel.h"
+#include "utils/syscache.h"
+
+/* A zero checksum can never be computed; see pg_checksum_page(). */
+
+#define NoComputedChecksum	0
+
+/* ----------------
+ * The rest of this module provides a set of functions that can be used to
+ * safely check all checksums on a running cluster.
+ *
+ * Please note that those only perform standard buffered reads, and don't try
+ * to bypass or discard the operating system cache.  If you want to check the
+ * actual storage, you have to discard the operating system cache before
+ * running those functions.
+ *
+ * To avoid torn pages and possible false positives when reading data, while
+ * keeping overhead as low as possible, the following heuristics are used:
+ *
+ * - a shared LWLock is taken on the target buffer pool partition mapping, and
+ *   we detect if a block is in shared_buffers or not.  See check_get_buffer()
+ *   comments for more details about the locking strategy.
+ *
+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
+ *
+ * - if a block is otherwise found in shared_buffers, an IO lock is taken on
+ *   the block and the block is then read from storage, ignoring the block in
+ *   shared_buffers
+ *
+ * - if a block is not found in shared_buffers, the LWLock is released and the
+ *   block is read from disk without taking any lock.  If an error is detected,
+ *   the read block will be discarded and retrieved again while holding the
+ *   LWLock.  This is because an error due to concurrent write is possible but
+ *   very unlikely, so it's better to have an optimistic approach to limit
+ *   locking overhead
+ *
+ * The check can be performed using an SQL function, returning the list of
+ * problematic blocks.
+ * ----------------
+ */
+
+static bool check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+							   uint16 *chk_found);
+static void check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+									  Relation relation, ForkNumber forknum);
+static void check_delay_point(void);
+static bool check_get_buffer(Relation relation, ForkNumber forknum,
+							 BlockNumber blkno, char *buffer, bool needlock,
+							 bool *found_in_sb);
+
+/*
+ * Perform a checksum check on the passed page.  Returns whether the page is
+ * valid or not, and assigns the expected and found checksums to chk_expected
+ * and chk_found, respectively.  Note that a page can look new but could be
+ * the result of a corruption.  We still check for this case, but we can't
+ * compute its checksum as pg_checksum_page() is explicitly checking for non-new
+ * pages, so NoComputedChecksum will be set in chk_expected.
+ */
+static bool
+check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+				   uint16 *chk_found)
+{
+	Page		page = (Page) buffer;
+	PageHeader	hdr = (PageHeader) page;
+
+	Assert(chk_expected && chk_found);
+
+	if (PageIsNew(page))
+	{
+		/*
+		 * Check if the page is really new or if there's a corruption that
+		 * affected PageIsNew detection.  Note that PageIsVerified won't try to
+		 * detect checksum corruption in this case, so there's no risk of
+		 * duplicated corruption report.
+		 */
+		if (PageIsVerified(page, blkno))
+		{
+			/* No corruption. */
+			return true;
+		}
+
+		/*
+		 * There's a corruption, but since this affects PageIsNew, we
+		 * can't compute a checksum, so set NoComputedChecksum for the
+		 * expected checksum.
+		 */
+		*chk_expected = NoComputedChecksum;
+		*chk_found = hdr->pd_checksum;
+		return false;
+	}
+
+	*chk_expected = pg_checksum_page(buffer, blkno);
+	*chk_found = hdr->pd_checksum;
+
+	return (*chk_expected == *chk_found);
+}
+
+/*
+ * Perform the check on a single relation, possibly filtered with a single
+ * fork.  This function will check if the given relation exists or not, as
+ * a relation could be dropped after checking for the list of relations and
+ * before getting here, and we don't want to error out in this case.
+ */
+void
+check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+					 Oid relid, ForkNumber single_forknum)
+{
+	Relation	relation = NULL;
+	ForkNumber	forknum;
+
+	relation = relation_open(relid, AccessShareLock);
+
+	/* sanity checks */
+	if (!RELKIND_HAS_STORAGE(relation->rd_rel->relkind))
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("relation \"%s\" does not have storage to be checked",
+				 quote_qualified_identifier(
+					 get_namespace_name(get_rel_namespace(relid)),
+					 get_rel_name(relid)))));
+
+	if (RELATION_IS_OTHER_TEMP(relation))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot verify temporary tables of other sessions")));
+
+	RelationOpenSmgr(relation);
+
+	for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+	{
+		if (single_forknum != InvalidForkNumber && single_forknum != forknum)
+			continue;
+
+		if (smgrexists(relation->rd_smgr, forknum))
+			check_relation_fork(tupdesc, tupstore, relation, forknum);
+	}
+
+	relation_close(relation, AccessShareLock);	/* release the lock */
+}
+
+/*
+ * For a given relation and fork, do the real work of iterating over all pages
+ * and doing the check.  Caller must hold an AccessShareLock on the given
+ * relation.
+ */
+static void
+check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+						  Relation relation, ForkNumber forknum)
+{
+	BlockNumber blkno,
+				nblocks;
+	char		buffer[BLCKSZ];
+	bool		found_in_sb;
+
+	/*
+	 * We remember the number of blocks here.  Since caller must hold a lock on
+	 * the relation, we know that it won't be truncated while we're iterating
+	 * over the blocks.  Any block added after this function started won't be
+	 * checked, but this is out of scope as such pages will be flushed before
+	 * the next checkpoint's completion.
+	 */
+	nblocks = RelationGetNumberOfBlocksInFork(relation, forknum);
+
+#define PG_CHECK_RELATION_COLS			5	/* Number of output arguments in the SRF */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		bool		force_lock = false;
+		uint16		chk_expected,
+					chk_found;
+		Datum		values[PG_CHECK_RELATION_COLS];
+		bool		nulls[PG_CHECK_RELATION_COLS];
+		int			i = 0;
+
+		/*
+		 * To avoid too much overhead, the buffer will be first read without
+		 * the locks that would guarantee the lack of false positives, as such
+		 * events should be quite rare.
+		 */
+Retry:
+		if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+							  &found_in_sb))
+			continue;
+
+		if (check_buffer(buffer, blkno, &chk_expected, &chk_found))
+			continue;
+
+		/*
+		 * If we get a failure and the buffer wasn't found in shared buffers,
+		 * reread the buffer with a suitable lock to avoid a false positive.  See
+		 * check_get_buffer for more details.
+		 */
+		if (!found_in_sb && !force_lock)
+		{
+			force_lock = true;
+			goto Retry;
+		}
+
+		memset(values, 0, sizeof(values));
+		memset(nulls, 0, sizeof(nulls));
+
+		values[i++] = ObjectIdGetDatum(relation->rd_id);
+		values[i++] = Int32GetDatum(forknum);
+		values[i++] = UInt32GetDatum(blkno);
+		/*
+		 * This can happen if corruption makes the block appear as
+		 * PageIsNew() while it actually isn't a new page.
+		 */
+		if (chk_expected == NoComputedChecksum)
+			nulls[i++] = true;
+		else
+			values[i++] = UInt16GetDatum(chk_expected);
+		values[i++] = UInt16GetDatum(chk_found);
+
+		Assert(i == PG_CHECK_RELATION_COLS);
+
+		/* Report the failure to the stat collector and the logs. */
+		pgstat_report_checksum_failure();
+		ereport(WARNING,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("invalid page in block %u of relation %s",
+						blkno,
+						relpath(relation->rd_smgr->smgr_rnode, forknum))));
+
+		/* Save the corrupted blocks in the tuplestore. */
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+}
+
+/*
+ * Check for interrupts and cost-based delay.
+ */
+static void
+check_delay_point(void)
+{
+	/* Always check for interrupts */
+	CHECK_FOR_INTERRUPTS();
+
+	if (!ChecksumCostActive || InterruptPending)
+		return;
+
+	/* Nap if appropriate */
+	if (ChecksumCostBalance >= ChecksumCostLimit)
+	{
+		int			msec;
+
+		msec = ChecksumCostDelay * ChecksumCostBalance / ChecksumCostLimit;
+		if (msec > ChecksumCostDelay * 4)
+			msec = ChecksumCostDelay * 4;
+
+		pg_usleep(msec * 1000L);
+
+		ChecksumCostBalance = 0;
+
+		/* Might have gotten an interrupt while sleeping */
+		CHECK_FOR_INTERRUPTS();
+	}
+}
+
+/*
+ *-------------------------
+ * Safely read the wanted buffer from disk, dealing with possible concurrency
+ * issue.  Note that if a buffer is found dirty in shared_buffers, no read will
+ * be performed and the caller will be informed that no check should be done.
+ * We can safely ignore such buffers as they'll be written before next
+ * checkpoint's completion.
+ *
+ * The following locks can be used in this function:
+ *
+ *   - shared LWLock on the target buffer pool partition mapping.
+ *   - IOLock on the buffer
+ *
+ * The IOLock is taken when reading the buffer from disk if it exists in
+ * shared_buffers, to avoid torn pages.
+ *
+ * If the buffer isn't in shared_buffers, it'll be read from disk without any
+ * lock unless caller asked otherwise, setting needlock.  In this case, the
+ * read will be done while the buffer mapping partition LWLock is still being
+ * held.  Reading with this lock is to avoid the unlikely but possible case
+ * that a buffer wasn't present in shared buffers when we checked but is then
+ * alloc'ed in shared_buffers, modified and flushed concurrently before we
+ * later try to read it, leading to a false positive due to a torn page.  Caller
+ * can read first the buffer without holding the target buffer mapping
+ * partition LWLock to have an optimistic approach, and reread the buffer
+ * from disk in case of error.
+ *
+ * Caller should hold an AccessShareLock on the Relation
+ *-------------------------
+ */
+static bool
+check_get_buffer(Relation relation, ForkNumber forknum,
+				 BlockNumber blkno, char *buffer, bool needlock,
+				 bool *found_in_sb)
+{
+	bool		checkit = true;
+	BufferTag	buf_tag;		/* identity of requested block */
+	uint32		buf_hash;		/* hash value for buf_tag */
+	LWLock	   *partLock;		/* buffer partition lock for the buffer */
+	BufferDesc *bufdesc;
+	int			buf_id;
+
+	*found_in_sb = false;
+
+	/* Check for interrupts and take throttling into account. */
+	check_delay_point();
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(buf_tag, relation->rd_smgr->smgr_rnode.node, forknum, blkno);
+
+	/* determine its hash code and partition lock ID */
+	buf_hash = BufTableHashCode(&buf_tag);
+	partLock = BufMappingPartitionLock(buf_hash);
+
+	/* see if the block is in the buffer pool already */
+	LWLockAcquire(partLock, LW_SHARED);
+	buf_id = BufTableLookup(&buf_tag, buf_hash);
+	if (buf_id >= 0)
+	{
+		uint32		buf_state;
+
+		*found_in_sb = true;
+
+		/*
+		 * Found it.  Now, retrieve its state to know what to do with it, and
+		 * release the buffer header spinlock immediately, to limit overhead as much
+		 * as possible.  We'll keep the shared lightweight lock on the target
+		 * buffer mapping partition, so this buffer can't be evicted, and
+		 * we'll acquire an IOLock on the buffer if we need to read the
+		 * content on disk.
+		 */
+		bufdesc = GetBufferDescriptor(buf_id);
+
+		buf_state = LockBufHdr(bufdesc);
+		UnlockBufHdr(bufdesc, buf_state);
+
+		/*
+		 * Dirty pages are ignored as they'll be flushed soon. Invalid buffers
+		 * are also skipped.
+		 */
+		if ((buf_state & BM_DIRTY) || !(buf_state & BM_TAG_VALID))
+			checkit = false;
+
+		/*
+		 * Read the buffer from disk, taking an IO lock to prevent torn-page
+		 * reads, in the unlikely event that it was concurrently dirtied and
+		 * flushed.
+		 */
+		if (checkit)
+		{
+			LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
+			smgrread(relation->rd_smgr, forknum, blkno, buffer);
+			LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
+
+			/* Add a page cost. */
+			ChecksumCostBalance += ChecksumCostPage;
+		}
+	}
+	else if (needlock)
+	{
+		/*
+		 * Caller asked to read the buffer while we have a lock on the target
+		 * partition.
+		 */
+		smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+		/* The buffer will still have to be checked. */
+		Assert(checkit);
+
+		/* Add a page cost. */
+		ChecksumCostBalance += ChecksumCostPage;
+	}
+
+	LWLockRelease(partLock);
+
+	if (*found_in_sb || needlock)
+		return checkit;
+
+	/* After this point the buffer will always be checked. */
+	Assert(checkit);
+
+	/*
+	 * Didn't find it in the buffer pool and didn't read it while holding the
+	 * buffer mapping partition lock.  We'll have to try to read it from
+	 * disk, after releasing the target partition lock to avoid too much
+	 * overhead.  It means that it's possible to get a torn page later, so
+	 * we'll have to retry with a suitable lock in case of error to avoid a
+	 * false positive.
+	 */
+	smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+	/* Add a page cost. */
+	ChecksumCostBalance += ChecksumCostPage;
+
+	return checkit;
+}
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 13efa9338c..27dc524e00 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -22,6 +22,7 @@ OBJS = \
 	bool.o \
 	cash.o \
 	char.o \
+	checksumfuncs.o \
 	cryptohashes.o \
 	date.o \
 	datetime.o \
diff --git a/src/backend/utils/adt/checksumfuncs.c b/src/backend/utils/adt/checksumfuncs.c
new file mode 100644
index 0000000000..b53b4c0bbf
--- /dev/null
+++ b/src/backend/utils/adt/checksumfuncs.c
@@ -0,0 +1,96 @@
+/*-------------------------------------------------------------------------
+ *
+ * checksumfuncs.c
+ *	  Functions for checksums related feature such as online verification
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/checksumfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "catalog/pg_authid_d.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/checksum.h"
+#include "utils/acl.h"
+#include "utils/builtins.h"
+
+Datum
+pg_check_relation(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	ForkNumber	forknum = InvalidForkNumber;
+	Oid			relid = InvalidOid;
+	const char *forkname;
+
+	if (!DataChecksumsEnabled())
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("data checksums are not enabled in cluster")));
+
+	if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("only superuser or a member of the pg_read_server_files role may use this function")));
+
+	if (PG_ARGISNULL(0))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("relation cannot be null")));
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not " \
+						"allowed in this context")));
+
+	/* Switch into long-lived context to construct returned data structures */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	relid = PG_GETARG_OID(0);
+
+	if (PG_NARGS() == 2 && !PG_ARGISNULL(1))
+	{
+		forkname = TextDatumGetCString(PG_GETARG_TEXT_PP(1));
+		forknum = forkname_to_number(forkname);
+	}
+
+	/* Set cost-based checksum verification delay */
+	ChecksumCostActive = (ChecksumCostDelay > 0);
+	ChecksumCostBalance = 0;
+
+	check_one_relation(tupdesc, tupstore, relid, forknum);
+
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..8683343619 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -134,6 +134,13 @@ int			max_worker_processes = 8;
 int			max_parallel_workers = 8;
 int			MaxBackends = 0;
 
+int			ChecksumCostPage = 10;	/* GUC parameters for checksum check */
+int			ChecksumCostLimit = 200;
+double		ChecksumCostDelay = 0;
+
+int			ChecksumCostBalance = 0;	/* working state for checksums check */
+bool		ChecksumCostActive = false;
+
 int			VacuumCostPageHit = 1;	/* GUC parameters for vacuum */
 int			VacuumCostPageMiss = 10;
 int			VacuumCostPageDirty = 20;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 64dc9fbd13..b3df5c6518 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -694,6 +694,8 @@ const char *const config_group_names[] =
 	gettext_noop("Resource Usage / Disk"),
 	/* RESOURCES_KERNEL */
 	gettext_noop("Resource Usage / Kernel Resources"),
+	/* RESOURCES_CHECKSUM_DELAY */
+	gettext_noop("Resource Usage / Cost-Based Checksum Verification Delay"),
 	/* RESOURCES_VACUUM_DELAY */
 	gettext_noop("Resource Usage / Cost-Based Vacuum Delay"),
 	/* RESOURCES_BGWRITER */
@@ -2388,6 +2390,26 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"checksum_cost_page", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost for verifying a page."),
+			NULL
+		},
+		&ChecksumCostPage,
+		10, 0, 10000,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"checksum_cost_limit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost amount available before napping."),
+			NULL
+		},
+		&ChecksumCostLimit,
+		200, 1, 10000,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_cost_page_hit", PGC_USERSET, RESOURCES_VACUUM_DELAY,
 			gettext_noop("Vacuum cost for a page found in the buffer cache."),
@@ -3541,6 +3563,17 @@ static struct config_real ConfigureNamesReal[] =
 		check_random_seed, assign_random_seed, show_random_seed
 	},
 
+	{
+		{"checksum_cost_delay", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost delay in milliseconds."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&ChecksumCostDelay,
+		0, 0, 100,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_cost_delay", PGC_USERSET, RESOURCES_VACUUM_DELAY,
 			gettext_noop("Vacuum cost delay in milliseconds."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e904fa7300..a112d010b3 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -156,6 +156,12 @@
 #max_files_per_process = 1000		# min 64
 					# (change requires restart)
 
+# - Cost-Based Checksum Verification Delay -
+
+#checksum_cost_delay = 0			# 0-100 milliseconds (0 disables)
+#checksum_cost_page = 10		# 0-10000 credits
+#checksum_cost_limit = 200		# 1-10000 credits
+
 # - Cost-Based Vacuum Delay -
 
 #vacuum_cost_delay = 0			# 0-100 milliseconds (0 disables)
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index a649e44d08..e706df5f5f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10825,6 +10825,14 @@
   proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
   proargnames => '{tablespace,name,size,modification}',
   prosrc => 'pg_ls_tmpdir_1arg' },
+{ oid => '9147', descr => 'check data integrity for one or all relations',
+  proname => 'pg_check_relation', proisstrict => 'f', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,i,o,o,o,o,o}',
+  proargnames => '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation' },
 
 # hash partitioning constraint function
 { oid => '5028', descr => 'hash partition CHECK constraint',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14fa127ab1..8d4605efc8 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -246,6 +246,13 @@ extern PGDLLIMPORT int work_mem;
 extern PGDLLIMPORT int maintenance_work_mem;
 extern PGDLLIMPORT int max_parallel_maintenance_workers;
 
+extern int	ChecksumCostPage;
+extern int	ChecksumCostLimit;
+extern double ChecksumCostDelay;
+
+extern int	ChecksumCostBalance;
+extern bool ChecksumCostActive;
+
 extern int	VacuumCostPageHit;
 extern int	VacuumCostPageMiss;
 extern int	VacuumCostPageDirty;
diff --git a/src/include/storage/checksum.h b/src/include/storage/checksum.h
index 6e77744cbc..29ca52d8c9 100644
--- a/src/include/storage/checksum.h
+++ b/src/include/storage/checksum.h
@@ -13,8 +13,15 @@
 #ifndef CHECKSUM_H
 #define CHECKSUM_H
 
+#include "postgres.h"
+
+#include "access/tupdesc.h"
+#include "common/relpath.h"
 #include "storage/block.h"
+#include "utils/tuplestore.h"
 
+extern void check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+							   Oid relid, ForkNumber single_forknum);
 /*
  * Compute the checksum for a Postgres page.  The page must be aligned on a
  * 4-byte boundary.
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 454c2df487..84965d3f09 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -62,6 +62,7 @@ enum config_group
 	RESOURCES_MEM,
 	RESOURCES_DISK,
 	RESOURCES_KERNEL,
+	RESOURCES_CHECKSUM_DELAY,
 	RESOURCES_VACUUM_DELAY,
 	RESOURCES_BGWRITER,
 	RESOURCES_ASYNCHRONOUS,
diff --git a/src/test/Makefile b/src/test/Makefile
index efb206aa75..832868099f 100644
--- a/src/test/Makefile
+++ b/src/test/Makefile
@@ -12,7 +12,8 @@ subdir = src/test
 top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS = perl regress isolation modules authentication recovery subscription
+SUBDIRS = perl regress isolation modules authentication check_relation \
+	  recovery subscription
 
 # Test suites that are not safe by default but can be run if selected
 # by the user via the whitespace-separated list in variable
diff --git a/src/test/check_relation/.gitignore b/src/test/check_relation/.gitignore
new file mode 100644
index 0000000000..871e943d50
--- /dev/null
+++ b/src/test/check_relation/.gitignore
@@ -0,0 +1,2 @@
+# Generated by test suite
+/tmp_check/
diff --git a/src/test/check_relation/Makefile b/src/test/check_relation/Makefile
new file mode 100644
index 0000000000..792098fa65
--- /dev/null
+++ b/src/test/check_relation/Makefile
@@ -0,0 +1,23 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/test/check_relation
+#
+# Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/test/check_relation/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/test/check_relation
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
+clean distclean maintainer-clean:
+	rm -rf tmp_check
diff --git a/src/test/check_relation/README b/src/test/check_relation/README
new file mode 100644
index 0000000000..415c4b21ad
--- /dev/null
+++ b/src/test/check_relation/README
@@ -0,0 +1,23 @@
+src/test/check_relation/README
+
+Regression tests for online checksums verification
+==================================================
+
+This directory contains a test suite for online checksums verification.
+
+Running the tests
+=================
+
+NOTE: You must have given the --enable-tap-tests argument to configure.
+
+Run
+    make check
+or
+    make installcheck
+You can use "make installcheck" if you previously did "make install".
+In that case, the code in the installation tree is tested.  With
+"make check", a temporary installation tree is built from the current
+sources and then tested.
+
+Either way, this test initializes, starts, and stops a test Postgres
+cluster.
diff --git a/src/test/check_relation/t/01_checksums_check.pl b/src/test/check_relation/t/01_checksums_check.pl
new file mode 100644
index 0000000000..af67252ea1
--- /dev/null
+++ b/src/test/check_relation/t/01_checksums_check.pl
@@ -0,0 +1,276 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 59;
+
+our $CHECKSUM_UINT16_OFFSET = 4;
+our $PD_UPPER_UINT16_OFFSET = 7;
+our $BLOCKSIZE;
+our $TOTAL_NB_ERR = 0;
+
+sub get_block
+{
+	my ($filename, $blkno) = @_;
+	my $block;
+
+	open(my $infile, '<', $filename) or die;
+	binmode($infile);
+
+	# Seek to the target block: read()'s fourth argument is an offset into
+	# the scalar, not a file position.
+	seek($infile, $blkno * $BLOCKSIZE, 0) or die($!);
+	my $success = read($infile, $block, $BLOCKSIZE);
+	die($!) if not defined $success;
+
+	close($infile);
+
+	return($block);
+}
+
+sub overwrite_block
+{
+	my ($filename, $block, $blkno) = @_;
+
+	open(my $outfile, '+<', $filename) or die;
+	binmode($outfile);
+
+	# Open read-write ('>' would truncate the file) and seek to the target
+	# block: syswrite()'s fourth argument is an offset into the scalar, not
+	# a file position.
+	sysseek($outfile, $blkno * $BLOCKSIZE, 0) or die($!);
+	my $nb = syswrite($outfile, $block, $BLOCKSIZE);
+
+	die($!) if not defined $nb;
+	die("Write error") if ($nb != $BLOCKSIZE);
+
+	$outfile->flush();
+
+	close($outfile);
+}
+
+sub get_uint16_from_page
+{
+	my ($block, $offset) = @_;
+
+	return (unpack("S*", $block))[$offset];
+}
+
+sub set_uint16_to_page
+{
+	my ($block, $data, $offset) = @_;
+
+	my $pack = pack("S", $data);
+
+	# vec() with units of 16 bits or more won't preserve endianness
+	vec($block, 2*$offset, 8) = (unpack('C*', $pack))[0];
+	vec($block, (2*$offset) + 1, 8) = (unpack('C*', $pack))[1];
+
+	return $block;
+}
+
+sub check_checksums_call
+{
+	my ($node, $relname) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM pg_catalog.pg_check_relation('$relname')"
+	);
+
+	 return ($stderr eq '');
+}
+
+sub check_checksums_nb_error
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+        . " FROM (SELECT pg_catalog.pg_check_relation(oid)"
+        . "   FROM pg_class WHERE relkind in ('r', 'i', 'm')) AS s"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+	is($stdout, $nb, "Should have $nb error");
+
+	$TOTAL_NB_ERR += $nb;
+}
+
+sub check_pg_stat_database_nb_error
+{
+	my ($node) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT "
+		. " sum(checksum_failures)"
+		. " FROM pg_catalog.pg_stat_database"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	is($stderr, '', 'Function should run successfully');
+	is($stdout, $TOTAL_NB_ERR, "Should have $TOTAL_NB_ERR error");
+}
+
+sub get_checksums_errors
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+		. " relid::regclass::text, forknum, failed_blocknum,"
+		. " expected_checksum, found_checksum"
+		. " FROM (SELECT (pg_catalog.pg_check_relation(oid)).*"
+        . "   FROM pg_class WHERE relkind in ('r','i', 'm')) AS s"
+	);
+
+	 is($cmdret, '0', 'Function should run successfully');
+	 like($stderr, $pattern, 'Error output should match expectations');
+
+	 $TOTAL_NB_ERR += $nb;
+
+	 return $stdout;
+}
+
+# This function performs various tests by modifying the specified block at
+# the specified uint16 offset, checking that the corruption is correctly
+# detected, and finally restoring the specified block to its original content.
+sub corrupt_and_test_block
+{
+	my ($node, $filename, $blkno, $offset, $fake_data) = @_;
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+
+	check_pg_stat_database_nb_error($node);
+
+	$node->stop();
+
+	my $original_block = get_block($filename, $blkno);
+	my $original_data = get_uint16_from_page($original_block, $offset);
+
+	isnt($original_data, $fake_data,
+		"The fake data at offset $offset should be different"
+		. " from the existing one");
+
+	my $new_block = set_uint16_to_page($original_block, $fake_data, $offset);
+	isnt($original_data, get_uint16_from_page($new_block, $offset),
+		"The fake data at offset $offset should have been changed in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+
+	my $written_data = get_uint16_from_page(get_block($filename, $blkno), $offset);
+	isnt($original_data, $written_data,
+		"The data written at offset $offset should be different"
+		. " from the original one");
+	is(get_uint16_from_page($new_block, $offset), $written_data,
+		"The data written at offset $offset should be the same"
+		. " as the one in memory");
+	is($written_data, $fake_data,
+		"The data written at offset $offset should be the one"
+		. " we wanted to write");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 1, qr/invalid page in block $blkno/);
+
+	my $expected_checksum;
+	my $found_checksum = get_uint16_from_page($new_block,
+		$CHECKSUM_UINT16_OFFSET);
+	if ($offset == $PD_UPPER_UINT16_OFFSET)
+	{
+		# A checksum can't be computed if it's detected as PageIsNew(), so the
+		# function returns NULL for the computed checksum
+		$expected_checksum = '';
+	}
+	else
+	{
+		$expected_checksum = get_uint16_from_page($original_block,
+			$CHECKSUM_UINT16_OFFSET);
+	}
+
+	my $det = get_checksums_errors($node, 1, qr/invalid page in block $blkno/);
+	is($det, "t1|0|0|$expected_checksum|$found_checksum",
+		"The checksums error for modification at offset $offset"
+		. " should be detected");
+
+	$node->stop();
+
+	$new_block = set_uint16_to_page($original_block, $original_data, $offset);
+	is($original_data, get_uint16_from_page($new_block, $offset),
+		"The data at offset $offset should have been restored in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+	is($original_data, get_uint16_from_page(get_block($filename, $blkno),
+			$offset),
+		"The data at offset $offset should have been restored on disk");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+}
+
+if (exists $ENV{MY_PG_REGRESS})
+{
+	$ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS};
+}
+
+my $node = get_new_node('main');
+
+my %params;
+$params{'extra'} = ['--data-checksums'];
+$node->init(%params);
+
+$node->start();
+
+$ENV{PGOPTIONS} = '--client-min-messages=WARNING';
+
+my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_checksums')");
+
+is($stdout, 'on', 'Data checksums should be enabled');
+
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('block_size')");
+
+$BLOCKSIZE = $stdout;
+
+$node->safe_psql(
+	'postgres', q|
+	CREATE TABLE public.t1(id integer);
+	CREATE INDEX t1_id_idx ON public.t1 (id);
+	INSERT INTO public.t1 SELECT generate_series(1, 100);
+	CREATE VIEW public.v1 AS SELECT * FROM t1;
+	CREATE MATERIALIZED VIEW public.mv1 AS SELECT * FROM t1;
+	CREATE SEQUENCE public.s1;
+	CREATE UNLOGGED TABLE public.u_t1(id integer);
+	CREATE INDEX u_t1_id_idx ON public.u_t1 (id);
+	INSERT INTO public.u_t1 SELECT generate_series(1, 100);
+	CHECKPOINT;
+|);
+
+# Check sane behavior on various objects type, including those that don't have
+# a storage.
+is(check_checksums_call($node, 't1'), '1', 'Can check a table');
+is(check_checksums_call($node, 't1_id_idx'), '1', 'Can check an index');
+is(check_checksums_call($node, 'v1'), '', 'Cannot check a view');
+is(check_checksums_call($node, 'mv1'), '1', 'Can check a materialized view');
+is(check_checksums_call($node, 's1'), '1', 'Can check a sequence');
+is(check_checksums_call($node, 'u_t1'), '1', 'Can check an unlogged table');
+is(check_checksums_call($node, 'u_t1_id_idx'), '1', 'Can check an unlogged index');
+
+# get the underlying heap absolute path
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_directory') || '/' || pg_relation_filepath('t1')"
+);
+
+isnt($stdout, '', 'A relation file path should be returned');
+
+my $filename = $stdout;
+
+check_checksums_nb_error($node, 0, qr/^$/);
+
+check_pg_stat_database_nb_error($node);
+
+my $fake_uint16 = hex '0x0000';
+
+# Test with a modified checksum.  We use a zero checksum here as it's the only
+# one that cannot exist on a checksummed page.  We also don't have an easy way
+# to compute what the checksum would be after a modification in a random place
+# in the block.
+corrupt_and_test_block($node, $filename, 0, $CHECKSUM_UINT16_OFFSET,
+	$fake_uint16);
+
+# Test corruption making the block look like it's PageIsNew().
+corrupt_and_test_block($node, $filename, 0, $PD_UPPER_UINT16_OFFSET,
+	$fake_uint16);
-- 
2.20.1

#24Julien Rouhaud
rjuju123@gmail.com
In reply to: Julien Rouhaud (#23)
1 attachment(s)
Re: Online checksums verification in the backend

On Fri, Apr 03, 2020 at 11:39:11AM +0200, Julien Rouhaud wrote:

On Fri, Apr 03, 2020 at 12:24:50PM +0900, Masahiko Sawada wrote:

check_relation_fork() seems to depend quite heavily on pg_check_relation()
because the returned tuplestore is specified by pg_check_relation().
It's just an idea but to improve reusability, how about moving
check_relation_fork() to checksumfunc.c? That is, in checksumfuncs.c
while iterating all blocks we call a new function in checksum.c, say
check_one_block() function, which has the following part and is
responsible for getting, checking the specified block and returning a
boolean indicating whether the block has corruption or not, along with
chk_found and chk_expected:

/*
* To avoid too much overhead, the buffer will be first read without
* the locks that would guarantee the lack of false positive, as such
* events should be quite rare.
*/
Retry:
if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock,
&found_in_sb))
continue;

if (check_buffer(buffer, blkno, &chk_expected, &chk_found))
continue;

/*
* If we get a failure and the buffer wasn't found in shared buffers,
* reread the buffer with suitable lock to avoid false positive. See
* check_get_buffer for more details.
*/
if (!found_in_sb && !force_lock)
{
force_lock = true;
goto Retry;
}

A new function in checksumfuncs.c or pg_check_relation will be
responsible for storing the result to the tuplestore. That way,
check_one_block() will be useful for other uses when we want to check
whether a particular block is corrupted with low overhead.

Yes, I agree that passing the tuplestore isn't an ideal approach and some
refactoring should probably happen. One thing is that this wouldn't be
"check_one_block()" but "check_one_block_on_disk()" (which could also be from
the OS cache).  I'm not sure how useful it is by itself.  It also raises some
concerns about the throttling. I didn't change that for now, but I hope
there'll be some other feedback about it.
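
As a side note on the throttling question, the nap length the patch computes
(in check_delay_point(), driven by the checksum_cost_* GUCs) follows the same
scheme as the cost-based vacuum delay.  A quick Python model of that formula,
purely for illustration (the real code is C in checksum.c):

```python
# Model of the patch's cost-based throttling (see check_delay_point()):
# once the accumulated cost reaches checksum_cost_limit, sleep for
# checksum_cost_delay * balance / limit milliseconds, capped at 4x the delay.
def checksum_delay_ms(cost_balance, cost_limit, cost_delay):
    if cost_delay <= 0 or cost_balance < cost_limit:
        return 0                        # feature disabled, or limit not reached
    msec = cost_delay * cost_balance / cost_limit
    return min(msec, cost_delay * 4)    # never sleep more than 4x the base delay
```

With the default checksum_cost_page = 10 and checksum_cost_limit = 200, a nap
is considered roughly every 20 page reads.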

I had some time this morning, so I did the suggested refactoring as it seems
like a way cleaner interface. I also kept the suggested check_one_block().
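
To make the control flow easier for reviewers to follow, here's a small Python
model of the resulting check_one_block() retry logic.  read_block and
verify_checksum are hypothetical stand-ins for check_get_buffer() and
check_buffer(); the real code is C in checksum.c:

```python
# Model of check_one_block()'s retry logic: first read without the locks that
# would rule out a torn page, and only reread under lock when the unlocked
# read fails and the block wasn't in shared buffers.
def check_one_block(read_block, verify_checksum):
    force_lock = False
    while True:
        buffer, found_in_sb = read_block(force_lock)
        if buffer is None:
            return True                 # dirty or invalid buffer: skip the check
        if verify_checksum(buffer):
            return True                 # block is valid
        if not found_in_sb and not force_lock:
            force_lock = True           # possible torn read: retry while holding
            continue                    # the buffer mapping partition lock
        return False                    # corruption confirmed
```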

Attachments:

v7-0001-Add-a-pg_check_relation-function.patchtext/plain; charset=us-asciiDownload
From fc961d5fc8e220ddaa60306bd24300fe857a6d4b Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <jrouhaud@vmware.com>
Date: Mon, 4 Nov 2019 08:40:23 +0100
Subject: [PATCH v7] Add a pg_check_relation() function.

This function checks the validity of the checksums for all non-dirty blocks of
a given relation, and optionally a given fork, and returns the list of all
blocks that don't match, along with the expected and found checksums.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada
Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  85 +++++
 doc/src/sgml/func.sgml                        |  51 +++
 src/backend/catalog/system_views.sql          |   7 +
 src/backend/storage/page/checksum.c           | 318 +++++++++++++++++-
 src/backend/utils/adt/Makefile                |   1 +
 src/backend/utils/adt/checksumfuncs.c         | 218 ++++++++++++
 src/backend/utils/init/globals.c              |   7 +
 src/backend/utils/misc/guc.c                  |  33 ++
 src/backend/utils/misc/postgresql.conf.sample |   6 +
 src/include/catalog/pg_proc.dat               |   8 +
 src/include/miscadmin.h                       |   7 +
 src/include/storage/checksum.h                |  13 +
 src/include/utils/guc_tables.h                |   1 +
 src/test/Makefile                             |   3 +-
 src/test/check_relation/.gitignore            |   2 +
 src/test/check_relation/Makefile              |  23 ++
 src/test/check_relation/README                |  23 ++
 .../check_relation/t/01_checksums_check.pl    | 276 +++++++++++++++
 18 files changed, 1078 insertions(+), 4 deletions(-)
 create mode 100644 src/backend/utils/adt/checksumfuncs.c
 create mode 100644 src/test/check_relation/.gitignore
 create mode 100644 src/test/check_relation/Makefile
 create mode 100644 src/test/check_relation/README
 create mode 100644 src/test/check_relation/t/01_checksums_check.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c4d6ed4bbc..259793650f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2038,6 +2038,91 @@ include_dir 'conf.d'
      </note>
     </sect2>
 
+    <sect2 id="runtime-config-resource-checksum-verification-cost">
+     <title>Cost-based Checksum Verification Delay</title>
+
+     <para>
+      During the execution of the <xref linkend="functions-check-relation-note"/>
+      function, the system maintains an internal counter that keeps track of
+      the estimated cost of the various I/O operations that are performed.
+      When the accumulated cost reaches a limit (specified by
+      <varname>checksum_cost_limit</varname>), the process performing the
+      operation will sleep for a short period of time, as specified by
+      <varname>checksum_cost_delay</varname>. Then it will reset the counter
+      and continue execution.
+     </para>
+
+     <para>
+      This feature is disabled by default. To enable it, set the
+      <varname>checksum_cost_delay</varname> variable to a nonzero
+      value.
+     </para>
+
+     <variablelist>
+      <varlistentry id="guc-checksum-cost-delay" xreflabel="checksum_cost_delay">
+       <term><varname>checksum_cost_delay</varname> (<type>floating point</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_delay</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The amount of time that the process will sleep
+         when the cost limit has been exceeded.
+         If this value is specified without units, it is taken as milliseconds.
+         The default value is zero, which disables the cost-based checksum
+         verification delay feature.  Positive values enable cost-based
+         checksum verification.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-checksum-cost-page" xreflabel="checksum_cost_page">
+       <term><varname>checksum_cost_page</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_page</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The estimated cost for verifying a buffer, whether it's found in the
+         shared buffer cache or not. It represents the cost to lock the buffer
+         pool, look up the shared hash table, read the content of the page from
+         disk and compute its checksum.  The default value is 10.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-checksum-cost-limit" xreflabel="checksum_cost_limit">
+       <term><varname>checksum_cost_limit</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_limit</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The accumulated cost that will cause the verification process to sleep.
+         The default value is 200.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
+
+     <note>
+      <para>
+       There are certain operations that hold critical locks and should
+       therefore complete as quickly as possible.  Cost-based checksum
+       verification delays do not occur during such operations.  Therefore it
+       is possible that the cost accumulates far higher than the specified
+       limit.  To avoid uselessly long delays in such cases, the actual delay
+       is calculated as <varname>checksum_cost_delay</varname> *
+       <varname>accumulated_balance</varname> /
+       <varname>checksum_cost_limit</varname> with a maximum of
+       <varname>checksum_cost_delay</varname> * 4.
+      </para>
+     </note>
+    </sect2>
+
     <sect2 id="runtime-config-resource-background-writer">
      <title>Background Writer</title>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 4d88b45e72..75bcf3484f 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -21896,6 +21896,57 @@ SELECT (pg_stat_file('filename')).modification;
 
   </sect2>
 
+  <sect2 id="functions-data-sanity">
+   <title>Data Sanity Functions</title>
+
+   <para>
+    The functions shown in <xref linkend="functions-data-sanity-table"/>
+    provide a means to check the health of data files in a cluster.
+   </para>
+
+   <table id="functions-data-sanity-table">
+    <title>Data Sanity Functions</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry>
+        <literal><function>pg_check_relation(<parameter>relation</parameter> <type>regclass</type>[, <parameter>fork</parameter> <type>text</type>])</function></literal>
+       </entry>
+       <entry><type>setof record</type></entry>
+       <entry>Validate the checksums for all blocks of a given relation, and
+       optionally the given fork.</entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <indexterm>
+    <primary>pg_check_relation</primary>
+   </indexterm>
+   <para id="functions-check-relation-note" xreflabel="pg_check_relation">
+    <function>pg_check_relation</function> iterates over all the blocks of a
+    given relation and verifies their checksums.  If provided,
+    <replaceable>fork</replaceable> should be <literal>'main'</literal> for the
+    main data fork, <literal>'fsm'</literal> for the free space map,
+    <literal>'vm'</literal> for the visibility map, or
+    <literal>'init'</literal> for the initialization fork, and only that
+    specific fork will be verified, otherwise all forks will be.  The function
+    returns the list of blocks for which the found checksum doesn't match the
+    expected one.  See <xref
+    linkend="runtime-config-resource-checksum-verification-cost"/> for
+    information on how to configure cost-based verification delay.  You must be
+    a member of the <literal>pg_read_all_stats</literal> role to use this
+    function.  It can only be used if data checksums are enabled.  See <xref
+    linkend="app-initdb-data-checksums"/> for more information.
+   </para>
+
+  </sect2>
+
   </sect1>
 
   <sect1 id="functions-trigger">
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 813ea8bfc3..f289afda6d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1415,6 +1415,13 @@ LANGUAGE internal
 STRICT IMMUTABLE PARALLEL SAFE
 AS 'unicode_is_normalized';
 
+CREATE OR REPLACE FUNCTION pg_check_relation(
+  IN relation regclass, IN fork text DEFAULT NULL::text,
+  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+  OUT expected_checksum integer, OUT found_checksum integer)
+  RETURNS SETOF record VOLATILE LANGUAGE internal AS 'pg_check_relation'
+  PARALLEL RESTRICTED;
+
 --
 -- The default permissions for functions mean that anyone can execute them.
 -- A number of functions shouldn't be executable by just anyone, but rather
diff --git a/src/backend/storage/page/checksum.c b/src/backend/storage/page/checksum.c
index e010691c9f..b2ea43826f 100644
--- a/src/backend/storage/page/checksum.c
+++ b/src/backend/storage/page/checksum.c
@@ -15,8 +15,320 @@
 
 #include "storage/checksum.h"
 /*
- * The actual code is in storage/checksum_impl.h.  This is done so that
- * external programs can incorporate the checksum code by #include'ing
- * that file from the exported Postgres headers.  (Compare our CRC code.)
+ * The actual checksum computation code is in storage/checksum_impl.h.  This
+ * is done so that external programs can incorporate the checksum code by
+ * #include'ing that file from the exported Postgres headers.  (Compare our
+ * CRC code.)
  */
 #include "storage/checksum_impl.h"
+
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/buf_internals.h"
+#include "storage/lmgr.h"
+
+/* ----------------
+ * The rest of this module provides a set of functions that can be used to
+ * safely check all checksums on a running cluster.
+ *
+ * Please note that those only perform standard buffered reads, and don't try
+ * to bypass or discard the operating system cache.  If you want to check the
+ * actual storage, you have to discard the operating system cache before
+ * running those functions.
+ *
+ * To avoid torn page and possible false positive when reading data, and
+ * keeping overhead as low as possible, the following heuristics are used:
+ *
+ * - a shared LWLock is taken on the target buffer pool partition mapping, and
+ *   we detect if a block is in shared_buffers or not.  See check_get_buffer()
+ *   comments for more details about the locking strategy.
+ *
+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
+ *
+ * - if a block is otherwise found in shared_buffers, an IO lock is taken on
+ *   the block and the block is then read from storage, ignoring the block in
+ *   shared_buffers
+ *
+ * - if a block is not found in shared_buffers, the LWLock is released and the
+ *   block is read from disk without taking any lock.  If an error is detected,
+ *   the read block will be discarded and retrieved again while holding the
+ *   LWLock.  This is because an error due to concurrent write is possible but
+ *   very unlikely, so it's better to have an optimistic approach to limit
+ *   locking overhead
+ *
+ * The check can be performed using an SQL function, returning the list of
+ * problematic blocks.
+ * ----------------
+ */
+
+static bool check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+							   uint16 *chk_found);
+static void check_delay_point(void);
+static bool check_get_buffer(Relation relation, ForkNumber forknum,
+							 BlockNumber blkno, char *buffer, bool needlock,
+							 bool *found_in_sb);
+
+/*
+ * Check data sanity for a specific block in the given fork of the given
+ * relation, always read from disk with smgrread() even if a version exists in
+ * shared_buffers.  Returns false if the block appears to be corrupted, true
+ * otherwise.  Note that dirty and invalid blocks won't be checked.  Caller
+ * must hold at least an AccessShareLock on the relation.
+ */
+bool
+check_one_block(Relation relation, ForkNumber forknum, BlockNumber blkno,
+				uint16 *chk_expected, uint16 *chk_found)
+{
+	char		buffer[BLCKSZ];
+	bool		force_lock = false;
+	bool		found_in_sb;
+
+	Assert(CheckRelationLockedByMe(relation, AccessShareLock, true));
+	Assert(blkno < RelationGetNumberOfBlocksInFork(relation, forknum));
+	Assert(smgrexists(relation->rd_smgr, forknum));
+
+	*chk_expected = *chk_found = NoComputedChecksum;
+
+	/*
+	 * To avoid too much overhead, the buffer will be first read without
+	 * the locks that would guarantee the lack of false positives, as such
+	 * events should be quite rare.
+	 */
+Retry:
+	if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+						  &found_in_sb))
+		return true;
+
+	if (check_buffer(buffer, blkno, chk_expected, chk_found))
+		return true;
+
+	/*
+	 * If we get a failure and the buffer wasn't found in shared buffers,
+	 * reread the buffer with a suitable lock to avoid a false positive.  See
+	 * check_get_buffer for more details.
+	 */
+	if (!found_in_sb && !force_lock)
+	{
+		force_lock = true;
+		goto Retry;
+	}
+
+	/* A corruption is detected. */
+	return false;
+}
+
+/*
+ * Perform a checksum check on the passed page.  Returns whether the page is
+ * valid or not, and assign the expected and found checksum in chk_expected and
+ * chk_found, respectively.  Note that a page can look new but could be
+ * the result of a corruption.  We still check for this case, but we can't
+ * compute its checksum as pg_checksum_page() is explicitly checking for
+ * non-new pages, so NoComputedChecksum will be set in chk_found.
+ */
+static bool
+check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+				   uint16 *chk_found)
+{
+	Page		page = (Page) buffer;
+	PageHeader	hdr = (PageHeader) page;
+
+	Assert(chk_expected && chk_found);
+
+	if (PageIsNew(page))
+	{
+		/*
+		 * Check if the page is really new or if there's a corruption that
+		 * affected PageIsNew detection.  Note that PageIsVerified won't try to
+		 * detect checksum corruption in this case, so there's no risk of
+		 * duplicated corruption report.
+		 */
+		if (PageIsVerified(page, blkno))
+		{
+			/* No corruption. */
+			return true;
+		}
+
+		/*
+		 * There's a corruption, but since it affects PageIsNew, we
+		 * can't compute a checksum, so set NoComputedChecksum for the
+		 * expected checksum.
+		 */
+		*chk_expected = NoComputedChecksum;
+		*chk_found = hdr->pd_checksum;
+		return false;
+	}
+
+	*chk_expected = pg_checksum_page(buffer, blkno);
+	*chk_found = hdr->pd_checksum;
+
+	return (*chk_expected == *chk_found);
+}
+
+/*
+ * Check for interrupts and cost-based delay.
+ */
+static void
+check_delay_point(void)
+{
+	/* Always check for interrupts */
+	CHECK_FOR_INTERRUPTS();
+
+	if (!ChecksumCostActive || InterruptPending)
+		return;
+
+	/* Nap if appropriate */
+	if (ChecksumCostBalance >= ChecksumCostLimit)
+	{
+		int			msec;
+
+		msec = ChecksumCostDelay * ChecksumCostBalance / ChecksumCostLimit;
+		if (msec > ChecksumCostDelay * 4)
+			msec = ChecksumCostDelay * 4;
+
+		pg_usleep(msec * 1000L);
+
+		ChecksumCostBalance = 0;
+
+		/* Might have gotten an interrupt while sleeping */
+		CHECK_FOR_INTERRUPTS();
+	}
+}
+
+/*
+ *-------------------------
+ * Safely read the wanted buffer from disk, dealing with possible concurrency
+ * issues.  Note that if a buffer is found dirty in shared_buffers, no read
+ * will be performed and the caller will be informed that no check should be
+ * done.  We can safely ignore such buffers as they'll be written before the
+ * next checkpoint's completion.
+ *
+ * The following locks can be used in this function:
+ *
+ *   - shared LWLock on the target buffer pool partition mapping.
+ *   - IOLock on the buffer
+ *
+ * The IOLock is taken when reading the buffer from disk if it exists in
+ * shared_buffers, to avoid torn pages.
+ *
+ * If the buffer isn't in shared_buffers, it'll be read from disk without any
+ * lock unless the caller asked otherwise by setting needlock.  In that case,
+ * read will be done while the buffer mapping partition LWLock is still being
+ * held.  Reading with this lock is to avoid the unlikely but possible case
+ * that a buffer wasn't present in shared buffers when we checked, but was
+ * then allocated in shared_buffers, modified and flushed concurrently before
+ * we later read it, leading to a false positive due to a torn page.  The
+ * caller can first read the buffer without holding the target buffer mapping
+ * partition LWLock to have an optimistic approach, and reread the buffer
+ * from disk in case of error.
+ *
+ * The caller should hold an AccessShareLock on the relation.
+ *-------------------------
+ */
+static bool
+check_get_buffer(Relation relation, ForkNumber forknum,
+				 BlockNumber blkno, char *buffer, bool needlock,
+				 bool *found_in_sb)
+{
+	bool		checkit = true;
+	BufferTag	buf_tag;		/* identity of requested block */
+	uint32		buf_hash;		/* hash value for buf_tag */
+	LWLock	   *partLock;		/* buffer partition lock for the buffer */
+	BufferDesc *bufdesc;
+	int			buf_id;
+
+	*found_in_sb = false;
+
+	/* Check for interrupts and take throttling into account. */
+	check_delay_point();
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(buf_tag, relation->rd_smgr->smgr_rnode.node, forknum, blkno);
+
+	/* determine its hash code and partition lock ID */
+	buf_hash = BufTableHashCode(&buf_tag);
+	partLock = BufMappingPartitionLock(buf_hash);
+
+	/* see if the block is in the buffer pool already */
+	LWLockAcquire(partLock, LW_SHARED);
+	buf_id = BufTableLookup(&buf_tag, buf_hash);
+	if (buf_id >= 0)
+	{
+		uint32		buf_state;
+
+		*found_in_sb = true;
+
+		/*
+		 * Found it.  Now, retrieve its state to know what to do with it,
+		 * releasing the buffer header spinlock immediately.  We don't pin
+		 * the buffer, to limit overhead as much as possible.  We'll keep
+		 * the shared lightweight lock on the target buffer mapping
+		 * partition, so this buffer can't be evicted, and we'll acquire an
+		 * IOLock on the buffer if we need to read the content from disk.
+		 */
+		bufdesc = GetBufferDescriptor(buf_id);
+
+		buf_state = LockBufHdr(bufdesc);
+		UnlockBufHdr(bufdesc, buf_state);
+
+		/*
+		 * Dirty pages are ignored as they'll be flushed soon. Invalid buffers
+		 * are also skipped.
+		 */
+		if ((buf_state & BM_DIRTY) || !(buf_state & BM_TAG_VALID))
+			checkit = false;
+
+		/*
+		 * Read the buffer from disk, taking an IO lock to prevent torn-page
+		 * reads, in the unlikely event that it was concurrently dirtied and
+		 * flushed.
+		 */
+		if (checkit)
+		{
+			LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
+			smgrread(relation->rd_smgr, forknum, blkno, buffer);
+			LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
+
+			/* Add a page cost. */
+			ChecksumCostBalance += ChecksumCostPage;
+		}
+	}
+	else if (needlock)
+	{
+		/*
+		 * Caller asked to read the buffer while we have a lock on the target
+		 * partition.
+		 */
+		smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+		/* The buffer will have to check checked. */
+		Assert(checkit);
+
+		/* Add a page cost. */
+		ChecksumCostBalance += ChecksumCostPage;
+	}
+
+	LWLockRelease(partLock);
+
+	if (*found_in_sb || needlock)
+		return checkit;
+
+	/* After this point the buffer will always be checked. */
+	Assert(checkit);
+
+	/*
+	 * Didn't find it in the buffer pool and didn't read it while holding the
+	 * buffer mapping partition lock.  We'll have to try to read it from
+	 * disk, after releasing the target partition lock to avoid too much
+	 * overhead.  It means that it's possible to get a torn page later, so
+	 * we'll have to retry with a suitable lock in case of error to avoid a
+	 * false positive.
+	 */
+	smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+	/* Add a page cost. */
+	ChecksumCostBalance += ChecksumCostPage;
+
+	return checkit;
+}
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 13efa9338c..27dc524e00 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -22,6 +22,7 @@ OBJS = \
 	bool.o \
 	cash.o \
 	char.o \
+	checksumfuncs.o \
 	cryptohashes.o \
 	date.o \
 	datetime.o \
diff --git a/src/backend/utils/adt/checksumfuncs.c b/src/backend/utils/adt/checksumfuncs.c
new file mode 100644
index 0000000000..e7f02908de
--- /dev/null
+++ b/src/backend/utils/adt/checksumfuncs.c
@@ -0,0 +1,218 @@
+/*-------------------------------------------------------------------------
+ *
+ * checksumfuncs.c
+ *	  Functions for checksums related feature such as online verification
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/checksumfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/relation.h"
+#include "catalog/pg_authid_d.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/bufmgr.h"
+#include "storage/checksum.h"
+#include "storage/smgr.h"
+#include "utils/acl.h"
+#include "utils/rel.h"
+#include "utils/builtins.h"
+#include "utils/lsyscache.h"
+
+static void check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+					 Oid relid, ForkNumber single_forknum);
+static void check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+									  Relation relation, ForkNumber forknum);
+
+
+Datum
+pg_check_relation(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	ForkNumber	forknum = InvalidForkNumber;
+	Oid			relid = InvalidOid;
+	const char *forkname;
+
+	if (!DataChecksumsEnabled())
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("data checksums are not enabled in cluster")));
+
+	if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("only superuser or a member of the pg_read_server_files role may use this function")));
+
+	if (PG_ARGISNULL(0))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("relation cannot be null")));
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not " \
+						"allowed in this context")));
+
+	/* Switch into long-lived context to construct returned data structures */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	relid = PG_GETARG_OID(0);
+
+	if (PG_NARGS() == 2 && !PG_ARGISNULL(1))
+	{
+		forkname = TextDatumGetCString(PG_GETARG_TEXT_PP(1));
+		forknum = forkname_to_number(forkname);
+	}
+
+	/* Set cost-based vacuum delay */
+	ChecksumCostActive = (ChecksumCostDelay > 0);
+	ChecksumCostBalance = 0;
+
+	check_one_relation(tupdesc, tupstore, relid, forknum);
+
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * Perform the check on a single relation, possibly filtered with a single
+ * fork.  This function will check if the given relation exists or not, as
+ * a relation could be dropped after checking for the list of relations and
+ * before getting here, and we don't want to error out in this case.
+ */
+static void
+check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+					 Oid relid, ForkNumber single_forknum)
+{
+	Relation	relation = NULL;
+	ForkNumber	forknum;
+
+	relation = relation_open(relid, AccessShareLock);
+
+	/* sanity checks */
+	if (!RELKIND_HAS_STORAGE(relation->rd_rel->relkind))
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("relation \"%s\" does not have storage to be checked",
+				 quote_qualified_identifier(
+					 get_namespace_name(get_rel_namespace(relid)),
+					 get_rel_name(relid)))));
+
+	if (RELATION_IS_OTHER_TEMP(relation))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot verify temporary tables of other sessions")));
+
+	RelationOpenSmgr(relation);
+
+	for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+	{
+		if (single_forknum != InvalidForkNumber && single_forknum != forknum)
+			continue;
+
+		if (smgrexists(relation->rd_smgr, forknum))
+			check_relation_fork(tupdesc, tupstore, relation, forknum);
+	}
+
+	relation_close(relation, AccessShareLock);	/* release the lock */
+}
+
+/*
+ * For a given relation and fork, do the real work of iterating over all pages
+ * and doing the check.  Caller must hold an AccessShareLock on the given
+ * relation.
+ */
+static void
+check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+						  Relation relation, ForkNumber forknum)
+{
+	BlockNumber blkno,
+				nblocks;
+
+	/*
+	 * We remember the number of blocks here.  Since caller must hold a lock on
+	 * the relation, we know that it won't be truncated while we're iterating
+	 * over the blocks.  Any block added after this function started won't be
+	 * checked, but this is out of scope as such pages will be flushed before
+	 * the next checkpoint's completion.
+	 */
+	nblocks = RelationGetNumberOfBlocksInFork(relation, forknum);
+
+#define PG_CHECK_RELATION_COLS			5	/* Number of output arguments in the SRF */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		uint16		chk_expected,
+					chk_found;
+		Datum		values[PG_CHECK_RELATION_COLS];
+		bool		nulls[PG_CHECK_RELATION_COLS];
+		int			i = 0;
+
+		/* Check the given buffer */
+		if (check_one_block(relation, forknum, blkno, &chk_expected,
+							&chk_found))
+		{
+			/* Buffer not corrupted or not worth checking, continue */
+			continue;
+		}
+
+		memset(values, 0, sizeof(values));
+		memset(nulls, 0, sizeof(nulls));
+
+		values[i++] = ObjectIdGetDatum(relation->rd_id);
+		values[i++] = Int32GetDatum(forknum);
+		values[i++] = UInt32GetDatum(blkno);
+		/*
+		 * This can happen if a corruption makes the block appear as
+		 * PageIsNew() even though it isn't a new page.
+		 */
+		if (chk_expected == NoComputedChecksum)
+			nulls[i++] = true;
+		else
+			values[i++] = UInt16GetDatum(chk_expected);
+		values[i++] = UInt16GetDatum(chk_found);
+
+		Assert(i == PG_CHECK_RELATION_COLS);
+
+		/* Report the failure to the stat collector and the logs. */
+		pgstat_report_checksum_failure();
+		ereport(WARNING,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("invalid page in block %u of relation %s",
+						blkno,
+						relpath(relation->rd_smgr->smgr_rnode, forknum))));
+
+		/* Save the corrupted blocks in the tuplestore. */
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..8683343619 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -134,6 +134,13 @@ int			max_worker_processes = 8;
 int			max_parallel_workers = 8;
 int			MaxBackends = 0;
 
+int			ChecksumCostPage = 10;	/* GUC parameters for checksum check */
+int			ChecksumCostLimit = 200;
+double		ChecksumCostDelay = 0;
+
+int			ChecksumCostBalance = 0;	/* working state for checksums check */
+bool		ChecksumCostActive = false;
+
 int			VacuumCostPageHit = 1;	/* GUC parameters for vacuum */
 int			VacuumCostPageMiss = 10;
 int			VacuumCostPageDirty = 20;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 64dc9fbd13..b3df5c6518 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -694,6 +694,8 @@ const char *const config_group_names[] =
 	gettext_noop("Resource Usage / Disk"),
 	/* RESOURCES_KERNEL */
 	gettext_noop("Resource Usage / Kernel Resources"),
+	/* RESOURCES_CHECKSUM_DELAY */
+	gettext_noop("Resource Usage / Cost-Based Checksum Verification Delay"),
 	/* RESOURCES_VACUUM_DELAY */
 	gettext_noop("Resource Usage / Cost-Based Vacuum Delay"),
 	/* RESOURCES_BGWRITER */
@@ -2388,6 +2390,26 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"checksum_cost_page", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
			gettext_noop("Checksum cost for verifying a page."),
+			NULL
+		},
+		&ChecksumCostPage,
+		10, 0, 10000,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"checksum_cost_limit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost amount available before napping."),
+			NULL
+		},
+		&ChecksumCostLimit,
+		200, 1, 10000,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_cost_page_hit", PGC_USERSET, RESOURCES_VACUUM_DELAY,
 			gettext_noop("Vacuum cost for a page found in the buffer cache."),
@@ -3541,6 +3563,17 @@ static struct config_real ConfigureNamesReal[] =
 		check_random_seed, assign_random_seed, show_random_seed
 	},
 
+	{
+		{"checksum_cost_delay", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost delay in milliseconds."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&ChecksumCostDelay,
+		0, 0, 100,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_cost_delay", PGC_USERSET, RESOURCES_VACUUM_DELAY,
 			gettext_noop("Vacuum cost delay in milliseconds."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e904fa7300..a112d010b3 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -156,6 +156,12 @@
 #max_files_per_process = 1000		# min 64
 					# (change requires restart)
 
+# - Cost-Based Checksum Verification Delay -
+
+#checksum_cost_delay = 0			# 0-100 milliseconds (0 disables)
+#checksum_cost_page = 10		# 0-10000 credits
+#checksum_cost_limit = 200		# 1-10000 credits
+
 # - Cost-Based Vacuum Delay -
 
 #vacuum_cost_delay = 0			# 0-100 milliseconds (0 disables)
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index a649e44d08..e706df5f5f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10825,6 +10825,14 @@
   proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
   proargnames => '{tablespace,name,size,modification}',
   prosrc => 'pg_ls_tmpdir_1arg' },
{ oid => '9147', descr => 'check data integrity of a relation',
+  proname => 'pg_check_relation', proisstrict => 'f', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,i,o,o,o,o,o}',
+  proargnames => '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation' },
 
 # hash partitioning constraint function
 { oid => '5028', descr => 'hash partition CHECK constraint',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14fa127ab1..8d4605efc8 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -246,6 +246,13 @@ extern PGDLLIMPORT int work_mem;
 extern PGDLLIMPORT int maintenance_work_mem;
 extern PGDLLIMPORT int max_parallel_maintenance_workers;
 
+extern int	ChecksumCostPage;
+extern int	ChecksumCostLimit;
+extern double ChecksumCostDelay;
+
+extern int	ChecksumCostBalance;
+extern bool ChecksumCostActive;
+
 extern int	VacuumCostPageHit;
 extern int	VacuumCostPageMiss;
 extern int	VacuumCostPageDirty;
diff --git a/src/include/storage/checksum.h b/src/include/storage/checksum.h
index 6e77744cbc..08aa0949a8 100644
--- a/src/include/storage/checksum.h
+++ b/src/include/storage/checksum.h
@@ -13,8 +13,21 @@
 #ifndef CHECKSUM_H
 #define CHECKSUM_H
 
+#include "postgres.h"
+
+#include "access/tupdesc.h"
+#include "common/relpath.h"
 #include "storage/block.h"
+#include "utils/relcache.h"
+#include "utils/tuplestore.h"
+
+/* A zero checksum can never be computed, see pg_checksum_page(). */
+
+#define NoComputedChecksum	0
 
+extern bool check_one_block(Relation relation, ForkNumber forknum,
+							BlockNumber blkno, uint16 *chk_expected,
+							uint16 *chk_found);
 /*
  * Compute the checksum for a Postgres page.  The page must be aligned on a
  * 4-byte boundary.
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 454c2df487..84965d3f09 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -62,6 +62,7 @@ enum config_group
 	RESOURCES_MEM,
 	RESOURCES_DISK,
 	RESOURCES_KERNEL,
+	RESOURCES_CHECKSUM_DELAY,
 	RESOURCES_VACUUM_DELAY,
 	RESOURCES_BGWRITER,
 	RESOURCES_ASYNCHRONOUS,
diff --git a/src/test/Makefile b/src/test/Makefile
index efb206aa75..832868099f 100644
--- a/src/test/Makefile
+++ b/src/test/Makefile
@@ -12,7 +12,8 @@ subdir = src/test
 top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS = perl regress isolation modules authentication recovery subscription
+SUBDIRS = perl regress isolation modules authentication check_relation \
+	  recovery subscription
 
 # Test suites that are not safe by default but can be run if selected
 # by the user via the whitespace-separated list in variable
diff --git a/src/test/check_relation/.gitignore b/src/test/check_relation/.gitignore
new file mode 100644
index 0000000000..871e943d50
--- /dev/null
+++ b/src/test/check_relation/.gitignore
@@ -0,0 +1,2 @@
+# Generated by test suite
+/tmp_check/
diff --git a/src/test/check_relation/Makefile b/src/test/check_relation/Makefile
new file mode 100644
index 0000000000..792098fa65
--- /dev/null
+++ b/src/test/check_relation/Makefile
@@ -0,0 +1,23 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/test/check_relation
+#
# Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/test/check_relation/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/test/check_relation
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
+clean distclean maintainer-clean:
+	rm -rf tmp_check
diff --git a/src/test/check_relation/README b/src/test/check_relation/README
new file mode 100644
index 0000000000..415c4b21ad
--- /dev/null
+++ b/src/test/check_relation/README
@@ -0,0 +1,23 @@
+src/test/check_relation/README
+
+Regression tests for online checksums verification
+==================================================
+
+This directory contains a test suite for online checksums verification.
+
+Running the tests
+=================
+
+NOTE: You must have given the --enable-tap-tests argument to configure.
+
+Run
+    make check
+or
+    make installcheck
+You can use "make installcheck" if you previously did "make install".
+In that case, the code in the installation tree is tested.  With
+"make check", a temporary installation tree is built from the current
+sources and then tested.
+
+Either way, this test initializes, starts, and stops a test Postgres
+cluster.
diff --git a/src/test/check_relation/t/01_checksums_check.pl b/src/test/check_relation/t/01_checksums_check.pl
new file mode 100644
index 0000000000..af67252ea1
--- /dev/null
+++ b/src/test/check_relation/t/01_checksums_check.pl
@@ -0,0 +1,276 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 59;
+
+our $CHECKSUM_UINT16_OFFSET = 4;
+our $PD_UPPER_UINT16_OFFSET = 7;
+our $BLOCKSIZE;
+our $TOTAL_NB_ERR = 0;
+
+sub get_block
+{
+	my ($filename, $blkno) = @_;
+	my $block;
+
+	open(my $infile, '<', $filename) or die;
+	binmode($infile);
+
+	seek($infile, ($blkno * $BLOCKSIZE), 0) or die($!);
+	die($!) if not defined read($infile, $block, $BLOCKSIZE);
+
+	close($infile);
+
+	return($block);
+}
+
+sub overwrite_block
+{
+	my ($filename, $block, $blkno) = @_;
+
+	open(my $outfile, '+<', $filename) or die;
+	binmode ($outfile);
+
+	defined(sysseek($outfile, ($blkno * $BLOCKSIZE), 0)) or die($!);
+	my $nb = syswrite($outfile, $block, $BLOCKSIZE);
+	die($!) if not defined $nb;
+	die("Write error") if ($nb != $BLOCKSIZE);
+
+	$outfile->flush();
+
+	close($outfile);
+}
+
+sub get_uint16_from_page
+{
+	my ($block, $offset) = @_;
+
+	return (unpack("S*", $block))[$offset];
+}
+
+sub set_uint16_to_page
+{
+	my ($block, $data, $offset) = @_;
+
+	my $pack = pack("S", $data);
+
+	# vec() with chunks of 16 bits or more won't preserve endianness
+	vec($block, 2*$offset, 8) = (unpack('C*', $pack))[0];
+	vec($block, (2*$offset) + 1, 8) = (unpack('C*', $pack))[1];
+
+	return $block;
+}
+
+sub check_checksums_call
+{
+	my ($node, $relname) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM pg_catalog.pg_check_relation('$relname')"
+	);
+
+	return ($stderr eq '');
+}
+
+sub check_checksums_nb_error
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM (SELECT pg_catalog.pg_check_relation(oid)"
+		. "   FROM pg_class WHERE relkind in ('r', 'i', 'm')) AS s"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+	is($stdout, $nb, "Should have $nb error(s)");
+
+	$TOTAL_NB_ERR += $nb;
+}
+
+sub check_pg_stat_database_nb_error
+{
+	my ($node) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT "
+		. " sum(checksum_failures)"
+		. " FROM pg_catalog.pg_stat_database"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	is($stderr, '', 'Should not have any error output');
+	is($stdout, $TOTAL_NB_ERR, "Should have $TOTAL_NB_ERR error(s)");
+}
+
+sub get_checksums_errors
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+		. " relid::regclass::text, forknum, failed_blocknum,"
+		. " expected_checksum, found_checksum"
+		. " FROM (SELECT (pg_catalog.pg_check_relation(oid)).*"
+		. "   FROM pg_class WHERE relkind in ('r', 'i', 'm')) AS s"
+	);
+
+	is($cmdret, '0', 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+
+	$TOTAL_NB_ERR += $nb;
+
+	return $stdout;
+}
+
+# This function performs various tests by modifying the specified block at
+# the specified uint16 offset, checking that the corruption is correctly
+# detected, and finally restoring the block to its original content.
+sub corrupt_and_test_block
+{
+	my ($node, $filename, $blkno, $offset, $fake_data) = @_;
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+
+	check_pg_stat_database_nb_error($node);
+
+	$node->stop();
+
+	my $original_block = get_block($filename, $blkno);
+	my $original_data = get_uint16_from_page($original_block, $offset);
+
+	isnt($original_data, $fake_data,
+		"The fake data at offset $offset should be different"
+		. " from the existing one");
+
+	my $new_block = set_uint16_to_page($original_block, $fake_data, $offset);
+	isnt($original_data, get_uint16_from_page($new_block, $offset),
+		"The fake data at offset $offset should have been changed in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+
+	my $written_data = get_uint16_from_page(get_block($filename, $blkno), $offset);
+	isnt($original_data, $written_data,
+		"The data written at offset $offset should be different"
+		. " from the original one");
+	is(get_uint16_from_page($new_block, $offset), $written_data,
+		"The data written at offset $offset should be the same"
+		. " as the one in memory");
+	is($written_data, $fake_data,
+		"The data written at offset $offset should be the one"
+		. " we wanted to write");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 1, qr/invalid page in block $blkno/);
+
+	my $expected_checksum;
+	my $found_checksum = get_uint16_from_page($new_block,
+		$CHECKSUM_UINT16_OFFSET);
+	if ($offset == $PD_UPPER_UINT16_OFFSET)
+	{
+		# A checksum can't be computed if it's detected as PageIsNew(), so the
+		# function returns NULL for the computed checksum
+		$expected_checksum = '';
+	}
+	else
+	{
+		$expected_checksum = get_uint16_from_page($original_block,
+			$CHECKSUM_UINT16_OFFSET);
+	}
+
+	my $det = get_checksums_errors($node, 1, qr/invalid page in block $blkno/);
+	is($det, "t1|0|0|$expected_checksum|$found_checksum",
+		"The checksums error for modification at offset $offset"
+		. " should be detected");
+
+	$node->stop();
+
+	$new_block = set_uint16_to_page($original_block, $original_data, $offset);
+	is($original_data, get_uint16_from_page($new_block, $offset),
+		"The data at offset $offset should have been restored in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+	is($original_data, get_uint16_from_page(get_block($filename, $blkno),
+			$offset),
+		"The data at offset $offset should have been restored on disk");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+}
+
+if (exists $ENV{MY_PG_REGRESS})
+{
+	$ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS};
+}
+
+my $node = get_new_node('main');
+
+my %params;
+$params{'extra'} = ['--data-checksums'];
+$node->init(%params);
+
+$node->start();
+
+$ENV{PGOPTIONS} = '--client-min-messages=WARNING';
+
+my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_checksums')");
+
+is($stdout, 'on', 'Data checksums should be enabled');
+
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('block_size')");
+
+$BLOCKSIZE = $stdout;
+
+$node->safe_psql(
+	'postgres', q|
+	CREATE TABLE public.t1(id integer);
+	CREATE INDEX t1_id_idx ON public.t1 (id);
+	INSERT INTO public.t1 SELECT generate_series(1, 100);
+	CREATE VIEW public.v1 AS SELECT * FROM t1;
+	CREATE MATERIALIZED VIEW public.mv1 AS SELECT * FROM t1;
+	CREATE SEQUENCE public.s1;
+	CREATE UNLOGGED TABLE public.u_t1(id integer);
+	CREATE INDEX u_t1_id_idx ON public.u_t1 (id);
+	INSERT INTO public.u_t1 SELECT generate_series(1, 100);
+	CHECKPOINT;
+|);
+
+# Check sane behavior on various objects type, including those that don't have
+# a storage.
+is(check_checksums_call($node, 't1'), '1', 'Can check a table');
+is(check_checksums_call($node, 't1_id_idx'), '1', 'Can check an index');
+is(check_checksums_call($node, 'v1'), '', 'Cannot check a view');
+is(check_checksums_call($node, 'mv1'), '1', 'Can check a materialized view');
+is(check_checksums_call($node, 's1'), '1', 'Can check a sequence');
+is(check_checksums_call($node, 'u_t1'), '1', 'Can check an unlogged table');
+is(check_checksums_call($node, 'u_t1_id_idx'), '1', 'Can check an unlogged index');
+
+# get the underlying heap absolute path
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_directory') || '/' || pg_relation_filepath('t1')"
+);
+
+isnt($stdout, '', 'A relation file path should be returned');
+
+my $filename = $stdout;
+
+check_checksums_nb_error($node, 0, qr/^$/);
+
+check_pg_stat_database_nb_error($node);
+
+my $fake_uint16 = hex '0x0000';
+
+# Test with a modified checksum.  We use a zero checksum here as it's the only
+# one that cannot exist on a checksummed page.  We also don't have an easy way
+# to compute what the checksum would be after a modification in a random place
+# in the block.
+corrupt_and_test_block($node, $filename, 0, $CHECKSUM_UINT16_OFFSET,
+	$fake_uint16);
+
+# Test corruption making the block look like it's PageIsNew().
+corrupt_and_test_block($node, $filename, 0, $PD_UPPER_UINT16_OFFSET,
+	$fake_uint16);
-- 
2.20.1

#25Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: Julien Rouhaud (#24)
Re: Online checksums verification in the backend

On Sat, 4 Apr 2020 at 18:04, Julien Rouhaud <rjuju123@gmail.com> wrote:

On Fri, Apr 03, 2020 at 11:39:11AM +0200, Julien Rouhaud wrote:

On Fri, Apr 03, 2020 at 12:24:50PM +0900, Masahiko Sawada wrote:

check_relation_fork() seems to quite depends on pg_check_relation()
because the returned tuplestore is specified by pg_check_relation().
It's just an idea but to improve reusability, how about moving
check_relation_fork() to checksumfunc.c? That is, in checksumfuncs.c
while iterating all blocks we call a new function in checksum.c, say
check_one_block() function, which has the following part and is
responsible for getting, checking the specified block and returning a
boolean indicating whether the block has corruption or not, along with
chk_found and chk_expected:

/*
* To avoid too much overhead, the buffer will be first read without
* the locks that would guarantee the lack of false positive, as such
* events should be quite rare.
*/
Retry:
if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock,
&found_in_sb))
continue;

if (check_buffer(buffer, blkno, &chk_expected, &chk_found))
continue;

/*
* If we get a failure and the buffer wasn't found in shared buffers,
* reread the buffer with suitable lock to avoid false positive. See
* check_get_buffer for more details.
*/
if (!found_in_sb && !force_lock)
{
force_lock = true;
goto Retry;
}

A new function in checksumfuncs.c or pg_check_relation will be
responsible for storing the result to the tuplestore. That way,
check_one_block() will be useful for other use when we want to check
if the particular block has corruption with low overhead.

Yes, I agree that passing the tuplestore isn't an ideal approach and some
refactoring should probably happen. One thing is that this wouldn't be
"check_one_block()" but "check_one_block_on_disk()" (which could also be from
the OS cache). I'm not sure how useful it's in itself. It also raises some
concerns about the throttling. I didn't change that for now, but I hope
there'll be some other feedback about it.

I had some time this morning, so I did the suggested refactoring as it seems
like a way cleaner interface. I also kept the suggested check_one_block().

Thank you for updating the patch! The patch looks good to me. Here are
some random comments mostly about cosmetic changes.

1.
I think we can have two separate SQL functions:
pg_check_relation(regclass, text) and pg_check_relation(regclass),
instead of setting NULL by default to the second argument.
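
At the SQL level that would look roughly like the following (a sketch only:
the OUT column list is taken from the patch's pg_proc entry, and the second
internal function name, pg_check_relation_fork, is hypothetical):

```sql
-- One-argument variant: check all forks of the relation.
CREATE FUNCTION pg_check_relation(relation regclass,
    OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
    OUT expected_checksum integer, OUT found_checksum integer)
RETURNS SETOF record
LANGUAGE internal VOLATILE PARALLEL RESTRICTED
AS 'pg_check_relation';

-- Two-argument variant: check a single fork ('main', 'fsm', 'vm' or 'init').
CREATE FUNCTION pg_check_relation(relation regclass, fork text,
    OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
    OUT expected_checksum integer, OUT found_checksum integer)
RETURNS SETOF record
LANGUAGE internal VOLATILE PARALLEL RESTRICTED
AS 'pg_check_relation_fork';
```

That also avoids the proisstrict = 'f' marking, since neither variant has to
accept a NULL fork argument.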

2.
+ * Check data sanity for a specific block in the given fork of the given
+ * relation, always retrieved locally with smgrred even if a version exists in

s/smgrred/smgrread/

3.
+       /* The buffer will have to check checked. */
+       Assert(checkit);

Should it be "The buffer will have to be checked"?

4.
+   if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
+       ereport(ERROR,
+               (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+                errmsg("only superuser or a member of the
pg_read_server_files role may use this function")));

Looking at the definition of the pg_read_server_files role, this role
seems to be for operations that could read non-database files such as
csv files. Therefore, currently this role is used by file_fdw and COPY
command. I personally think pg_stat_scan_tables would be more
appropriate for this function but I'm not sure.
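
Whichever role we pick, a non-superuser could then be allowed to run the
checks with a simple grant (checksum_monitor is a hypothetical role name):

```sql
-- Substitute pg_read_server_files here if we keep the patch's current check.
GRANT pg_stat_scan_tables TO checksum_monitor;
```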

5.
+   /* Set cost-based vacuum delay */
+   ChecksumCostActive = (ChecksumCostDelay > 0);
+   ChecksumCostBalance = 0;

s/vacuum/checksum verification/

6.
+       ereport(WARNING,
+               (errcode(ERRCODE_DATA_CORRUPTED),
+                errmsg("invalid page in block %u of relation %s",
+                       blkno,
+                       relpath(relation->rd_smgr->smgr_rnode, forknum))));

I think it's better to show the relation name instead of the relation path here.

7.
+       ereport(ERROR,
+               (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+                errmsg("relation \"%s\" does not have storage to be checked",
+                quote_qualified_identifier(
+                    get_namespace_name(get_rel_namespace(relid)),
+                    get_rel_name(relid)))));

Looking at other similar error messages we don't show qualified
relation name but the relation name gotten by
RelationGetRelationName(relation). Can we do that here as well for
consistency?

8.
+   if (!(rsinfo->allowedModes & SFRM_Materialize))
+       ereport(ERROR,
+               (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                errmsg("materialize mode required, but it is not " \
+                       "allowed in this context")));

I think it's better to have this error message in one line for easy grepping.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#26Julien Rouhaud
rjuju123@gmail.com
In reply to: Masahiko Sawada (#25)
1 attachment(s)
Re: Online checksums verification in the backend

On Sun, Apr 05, 2020 at 01:13:30PM +0900, Masahiko Sawada wrote:

Thank you for updating the patch! The patch looks good to me. Here are
some random comments mostly about cosmetic changes.

Thanks a lot for the review!

1.
I think we can have two separate SQL functions:
pg_check_relation(regclass, text) and pg_check_relation(regclass),
instead of setting NULL by default to the second argument.

I'm fine with it, so implemented this way with the required documentation
changes.

2.
+ * Check data sanity for a specific block in the given fork of the given
+ * relation, always retrieved locally with smgrred even if a version exists in

s/smgrred/smgrread/

Fixed.
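As a quick aside for reviewers, the per-page decision that check_buffer() makes can be sketched in standalone form. This is only an illustration: toy_checksum() is a stand-in for the real pg_checksum_page() (which uses a different, FNV-1a-derived algorithm with per-block mixing), and CHECKSUM_OFFSET is a hypothetical location for pd_checksum rather than the real page header layout:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192
#define CHECKSUM_OFFSET 8		/* hypothetical pd_checksum location */

/* Toy stand-in for pg_checksum_page(): an FNV-1a-style hash folded to
 * 16 bits, mixing in the block number.  The real algorithm differs. */
static uint16_t
toy_checksum(const char *page, uint32_t blkno)
{
	uint32_t	h = 2166136261u ^ blkno;

	for (size_t i = 0; i < BLCKSZ; i++)
	{
		h ^= (uint8_t) page[i];
		h *= 16777619u;
	}
	return (uint16_t) (h ^ (h >> 16));
}

/* Compute the page checksum with the stored checksum field zeroed out,
 * so the field itself doesn't influence the result. */
static uint16_t
page_checksum(char *page, uint32_t blkno)
{
	char		saved[2];
	uint16_t	sum;

	memcpy(saved, page + CHECKSUM_OFFSET, 2);
	memset(page + CHECKSUM_OFFSET, 0, 2);
	sum = toy_checksum(page, blkno);
	memcpy(page + CHECKSUM_OFFSET, saved, 2);
	return sum;
}

/* Sketch of the check_buffer() decision: a page that looks new (all
 * zeroes) is considered valid without a checksum comparison; any other
 * page must have a stored checksum matching the recomputed one. */
static bool
page_is_valid(char *page, uint32_t blkno)
{
	uint16_t	stored;
	bool		all_zeroes = true;

	for (size_t i = 0; i < BLCKSZ; i++)
	{
		if (page[i] != 0)
		{
			all_zeroes = false;
			break;
		}
	}
	if (all_zeroes)
		return true;			/* new page: nothing to verify */

	memcpy(&stored, page + CHECKSUM_OFFSET, sizeof(stored));
	return stored == page_checksum(page, blkno);
}
```

The real code handles one extra case this sketch omits: a page that looks new but fails PageIsVerified() is reported with NoComputedChecksum, since no checksum can be computed for it.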

3.
+       /* The buffer will have to check checked. */
+       Assert(checkit);

Should it be "The buffer will have to be checked"?

Oops indeed, fixed.

4.
+   if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
+       ereport(ERROR,
+               (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+                errmsg("only superuser or a member of the
pg_read_server_files role may use this function")));

Looking at the definition of the pg_read_server_files role, this role
seems to be for operations that could read non-database files such as
csv files. Therefore, currently this role is used by file_fdw and COPY
command. I personally think pg_stat_scan_tables would be more
appropriate for this function but I'm not sure.

That's a very good point, especially since the documentation of this default
role is quite relevant for those functions:

"Execute monitoring functions that may take ACCESS SHARE locks on tables,
potentially for a long time."

So changed!

5.
+   /* Set cost-based vacuum delay */
+   ChecksumCostActive = (ChecksumCostDelay > 0);
+   ChecksumCostBalance = 0;

s/vacuum/checksum verification/

Fixed.
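For reference, the throttling this machinery performs (mirroring vacuum's) naps for checksum_cost_delay * accumulated_balance / checksum_cost_limit milliseconds once the balance reaches the limit, capped at four times the delay. A standalone sketch of that computation follows; checksum_delay_msec() is a hypothetical helper name, as the patch inlines this in check_delay_point():

```c
#include <assert.h>

/* Standalone sketch of the capped nap time used by the cost-based
 * checksum verification delay: the sleep grows proportionally with the
 * accumulated balance, but is clamped to four times checksum_cost_delay
 * so that delays skipped while holding critical locks can't pile up
 * into a uselessly long nap. */
static double
checksum_delay_msec(double cost_delay, int cost_balance, int cost_limit)
{
	double		msec = cost_delay * cost_balance / cost_limit;

	if (msec > cost_delay * 4)
		msec = cost_delay * 4;
	return msec;
}
```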

6.
+       ereport(WARNING,
+               (errcode(ERRCODE_DATA_CORRUPTED),
+                errmsg("invalid page in block %u of relation %s",
+                       blkno,
+                       relpath(relation->rd_smgr->smgr_rnode, forknum))));

I think it's better to show the relation name instead of the relation path here.

I'm using the same pattern here as what ReadBuffer_common() displays when a
corrupted block is read.  I think it's better to keep the same format for both, so
any existing log analyzer will keep working with these new functions.

7.
+       ereport(ERROR,
+               (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+                errmsg("relation \"%s\" does not have storage to be checked",
+                quote_qualified_identifier(
+                    get_namespace_name(get_rel_namespace(relid)),
+                    get_rel_name(relid)))));

Looking at other similar error messages we don't show qualified
relation name but the relation name gotten by
RelationGetRelationName(relation). Can we do that here as well for
consistency?

Indeed, fixed.

8.
+   if (!(rsinfo->allowedModes & SFRM_Materialize))
+       ereport(ERROR,
+               (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+                errmsg("materialize mode required, but it is not " \
+                       "allowed in this context")));

I think it's better to have this error message in one line for easy grepping.

Fixed.

I also fixed a missing leading tab in the perl TAP tests.

Attachments:

v8-0001-Add-a-pg_check_relation-function.patch (text/plain; charset=us-ascii)
From a9634bf02c0e7e1f9b0e8cf4e1ad79a72ea80ec8 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <jrouhaud@vmware.com>
Date: Mon, 4 Nov 2019 08:40:23 +0100
Subject: [PATCH v8] Add a pg_check_relation() function.

This function checks the validity of the checksums for all non-dirty blocks of
a given relation, and optionally a given fork, and returns the list of all
blocks that don't match, along with the expected and found checksums.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada
Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  85 +++++
 doc/src/sgml/func.sgml                        |  60 ++++
 src/backend/catalog/system_views.sql          |  15 +
 src/backend/storage/page/checksum.c           | 318 +++++++++++++++++-
 src/backend/utils/adt/Makefile                |   1 +
 src/backend/utils/adt/checksumfuncs.c         | 217 ++++++++++++
 src/backend/utils/init/globals.c              |   7 +
 src/backend/utils/misc/guc.c                  |  33 ++
 src/backend/utils/misc/postgresql.conf.sample |   6 +
 src/include/catalog/pg_proc.dat               |  16 +
 src/include/miscadmin.h                       |   7 +
 src/include/storage/checksum.h                |  13 +
 src/include/utils/guc_tables.h                |   1 +
 src/test/Makefile                             |   3 +-
 src/test/check_relation/.gitignore            |   2 +
 src/test/check_relation/Makefile              |  23 ++
 src/test/check_relation/README                |  23 ++
 .../check_relation/t/01_checksums_check.pl    | 276 +++++++++++++++
 18 files changed, 1102 insertions(+), 4 deletions(-)
 create mode 100644 src/backend/utils/adt/checksumfuncs.c
 create mode 100644 src/test/check_relation/.gitignore
 create mode 100644 src/test/check_relation/Makefile
 create mode 100644 src/test/check_relation/README
 create mode 100644 src/test/check_relation/t/01_checksums_check.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 114db38116..a20501bb85 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2038,6 +2038,91 @@ include_dir 'conf.d'
      </note>
     </sect2>
 
+    <sect2 id="runtime-config-resource-checksum-verification-cost">
+     <title>Cost-based Checksum Verification Delay</title>
+
+     <para>
+      During the execution of the <xref linkend="functions-check-relation-note"/>
+      function, the system maintains an internal counter that keeps track of
+      the estimated cost of the various I/O operations that are performed.
+      When the accumulated cost reaches a limit (specified by
+      <varname>checksum_cost_limit</varname>), the process performing the
+      operation will sleep for a short period of time, as specified by
+      <varname>checksum_cost_delay</varname>. Then it will reset the counter
+      and continue execution.
+     </para>
+
+     <para>
+      This feature is disabled by default. To enable it, set the
+      <varname>checksum_cost_delay</varname> variable to a nonzero
+      value.
+     </para>
+
+     <variablelist>
+      <varlistentry id="guc-checksum-cost-delay" xreflabel="checksum_cost_delay">
+       <term><varname>checksum_cost_delay</varname> (<type>floating point</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_delay</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The amount of time that the process will sleep
+         when the cost limit has been exceeded.
+         If this value is specified without units, it is taken as milliseconds.
+         The default value is zero, which disables the cost-based checksum
+         verification delay feature.  Positive values enable cost-based
+         checksum verification.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-checksum-cost-page" xreflabel="checksum_cost_page">
+       <term><varname>checksum_cost_page</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_page</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The estimated cost for verifying a buffer, whether it's found in the
+         shared buffer cache or not. It represents the cost to lock the buffer
+         pool, look up the shared hash table, read the content of the page from
+         disk and compute its checksum.  The default value is 10.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-checksum-cost-limit" xreflabel="checksum_cost_limit">
+       <term><varname>checksum_cost_limit</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_limit</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The accumulated cost that will cause the verification process to sleep.
+         The default value is 200.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
+
+     <note>
+      <para>
+       There are certain operations that hold critical locks and should
+       therefore complete as quickly as possible.  Cost-based checksum
+       verification delays do not occur during such operations.  Therefore it
+       is possible that the cost accumulates far higher than the specified
+       limit.  To avoid uselessly long delays in such cases, the actual delay
+       is calculated as <varname>checksum_cost_delay</varname> *
+       <varname>accumulated_balance</varname> /
+       <varname>checksum_cost_limit</varname> with a maximum of
+       <varname>checksum_cost_delay</varname> * 4.
+      </para>
+     </note>
+    </sect2>
+
     <sect2 id="runtime-config-resource-background-writer">
      <title>Background Writer</title>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 4d88b45e72..71c7554ebe 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -21896,6 +21896,66 @@ SELECT (pg_stat_file('filename')).modification;
 
   </sect2>
 
+  <sect2 id="functions-data-sanity">
+   <title>Data Sanity Functions</title>
+
+   <para>
+    The functions shown in <xref linkend="functions-data-sanity-table"/>
+    provide a means to check the health of data files in a cluster.
+   </para>
+
+   <table id="functions-data-sanity-table">
+    <title>Data Sanity Functions</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry>
+        <literal><function>pg_check_relation(<parameter>relation</parameter> <type>regclass</type>)</function></literal>
+       </entry>
+       <entry><type>setof record</type></entry>
+       <entry>Validate the checksums for all blocks of all forks of the given
+       relation.</entry>
+      </row>
+
+      <row>
+       <entry>
+        <literal><function>pg_check_relation(<parameter>relation</parameter> <type>regclass</type>, <parameter>fork</parameter> <type>text</type>)</function></literal>
+       </entry>
+       <entry><type>setof record</type></entry>
+       <entry>Validate the checksums for all blocks of the given fork of the
+       given relation.</entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <indexterm>
+    <primary>pg_check_relation</primary>
+   </indexterm>
+   <para id="functions-check-relation-note" xreflabel="pg_check_relation">
+    <function>pg_check_relation</function> iterates over all the blocks of a
+    given relation and verifies their checksums.  If provided,
+    <replaceable>fork</replaceable> should be <literal>'main'</literal> for the
+    main data fork, <literal>'fsm'</literal> for the free space map,
+    <literal>'vm'</literal> for the visibility map, or
+    <literal>'init'</literal> for the initialization fork, and only this
+    specific fork will be verified; otherwise all forks will be.  The function
+    returns the list of blocks for which the found checksum doesn't match the
+    expected one.  See <xref
+    linkend="runtime-config-resource-checksum-verification-cost"/> for
+    information on how to configure cost-based verification delay.  You must be
+    a member of the <literal>pg_read_all_stats</literal> role to use this
+    function.  It can only be used if data checksums are enabled.  See <xref
+    linkend="app-initdb-data-checksums"/> for more information.
+   </para>
+
+  </sect2>
+
   </sect1>
 
   <sect1 id="functions-trigger">
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 813ea8bfc3..8bba45e156 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1415,6 +1415,21 @@ LANGUAGE internal
 STRICT IMMUTABLE PARALLEL SAFE
 AS 'unicode_is_normalized';
 
+CREATE OR REPLACE FUNCTION pg_check_relation(
+  IN relation regclass,
+  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+  OUT expected_checksum integer, OUT found_checksum integer)
+  RETURNS SETOF record STRICT VOLATILE LANGUAGE internal AS 'pg_check_relation'
+  PARALLEL RESTRICTED;
+
+CREATE OR REPLACE FUNCTION pg_check_relation(
+  IN relation regclass, IN fork text,
+  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+  OUT expected_checksum integer, OUT found_checksum integer)
+  RETURNS SETOF record STRICT VOLATILE LANGUAGE internal
+  AS 'pg_check_relation_fork'
+  PARALLEL RESTRICTED;
+
 --
 -- The default permissions for functions mean that anyone can execute them.
 -- A number of functions shouldn't be executable by just anyone, but rather
diff --git a/src/backend/storage/page/checksum.c b/src/backend/storage/page/checksum.c
index e010691c9f..eb2c919c34 100644
--- a/src/backend/storage/page/checksum.c
+++ b/src/backend/storage/page/checksum.c
@@ -15,8 +15,320 @@
 
 #include "storage/checksum.h"
 /*
- * The actual code is in storage/checksum_impl.h.  This is done so that
- * external programs can incorporate the checksum code by #include'ing
- * that file from the exported Postgres headers.  (Compare our CRC code.)
+ * The actual checksum computation code is in storage/checksum_impl.h.  This
+ * is done so that external programs can incorporate the checksum code by
+ * #include'ing that file from the exported Postgres headers.  (Compare our
+ * CRC code.)
  */
 #include "storage/checksum_impl.h"
+
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/buf_internals.h"
+#include "storage/lmgr.h"
+
+/* ----------------
+ * The rest of this module provides a set of functions that can be used to
+ * safely check all checksums on a running cluster.
+ *
+ * Please note that those only perform standard buffered reads, and don't try
+ * to bypass or discard the operating system cache.  If you want to check the
+ * actual storage, you have to discard the operating system cache before
+ * running those functions.
+ *
+ * To avoid torn pages and possible false positives when reading data, while
+ * keeping overhead as low as possible, the following heuristics are used:
+ *
+ * - a shared LWLock is taken on the target buffer pool partition mapping, and
+ *   we detect if a block is in shared_buffers or not.  See check_get_buffer()
+ *   comments for more details about the locking strategy.
+ *
+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
+ *
+ * - if a block is otherwise found in shared_buffers, an IO lock is taken on
+ *   the block and the block is then read from storage, ignoring the block in
+ *   shared_buffers
+ *
+ * - if a block is not found in shared_buffers, the LWLock is released and the
+ *   block is read from disk without taking any lock.  If an error is detected,
+ *   the read block will be discarded and retrieved again while holding the
+ *   LWLock.  This is because an error due to concurrent write is possible but
+ *   very unlikely, so it's better to have an optimistic approach to limit
+ *   locking overhead
+ *
+ * The check can be performed using an SQL function, returning the list of
+ * problematic blocks.
+ * ----------------
+ */
+
+static bool check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+							   uint16 *chk_found);
+static void check_delay_point(void);
+static bool check_get_buffer(Relation relation, ForkNumber forknum,
+							 BlockNumber blkno, char *buffer, bool needlock,
+							 bool *found_in_sb);
+
+/*
+ * Check data sanity for a specific block in the given fork of the given
+ * relation, always retrieved locally with smgrread even if a version exists in
+ * shared_buffers.  Returns false if the block appears to be corrupted, true
+ * otherwise.  Note that dirty and invalid blocks won't be checked.  Caller
+ * must hold at least an AccessShareLock on the relation.
+ */
+bool
+check_one_block(Relation relation, ForkNumber forknum, BlockNumber blkno,
+				uint16 *chk_expected, uint16 *chk_found)
+{
+	char		buffer[BLCKSZ];
+	bool		force_lock = false;
+	bool		found_in_sb;
+
+	Assert(CheckRelationLockedByMe(relation, AccessShareLock, true));
+	Assert(blkno < RelationGetNumberOfBlocksInFork(relation, forknum));
+	Assert(smgrexists(relation->rd_smgr, forknum));
+
+	*chk_expected = *chk_found = NoComputedChecksum;
+
+	/*
+	 * To avoid too much overhead, the buffer will be first read without
+	 * the locks that would guarantee the lack of false positives, as such
+	 * events should be quite rare.
+	 */
+Retry:
+	if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+						  &found_in_sb))
+		return true;
+
+	if (check_buffer(buffer, blkno, chk_expected, chk_found))
+		return true;
+
+	/*
+	 * If we get a failure and the buffer wasn't found in shared buffers,
+	 * reread the buffer with suitable lock to avoid false positive.  See
+	 * check_get_buffer for more details.
+	 */
+	if (!found_in_sb && !force_lock)
+	{
+		force_lock = true;
+		goto Retry;
+	}
+
+	/* A corruption is detected. */
+	return false;
+}
+
+/*
+ * Perform a checksum check on the passed page.  Returns whether the page is
+ * valid or not, and assigns the expected and found checksums to chk_expected and
+ * chk_found, respectively.  Note that a page can look like new but could be
+ * the result of a corruption.  We still check for this case, but we can't
+ * compute its checksum as pg_checksum_page() is explicitly checking for
+ * non-new pages, so NoComputedChecksum will be set in chk_found.
+ */
+static bool
+check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+				   uint16 *chk_found)
+{
+	Page		page = (Page) buffer;
+	PageHeader	hdr = (PageHeader) page;
+
+	Assert(chk_expected && chk_found);
+
+	if (PageIsNew(page))
+	{
+		/*
+		 * Check if the page is really new or if there's a corruption that
+		 * affected PageIsNew detection.  Note that PageIsVerified won't try to
+		 * detect checksum corruption in this case, so there's no risk of
+		 * duplicated corruption report.
+		 */
+		if (PageIsVerified(page, blkno))
+		{
+			/* No corruption. */
+			return true;
+		}
+
+		/*
+		 * There's a corruption, but since this affects PageIsNew, we
+		 * can't compute a checksum, so set NoComputedChecksum for the
+		 * expected checksum.
+		 */
+		*chk_expected = NoComputedChecksum;
+		*chk_found = hdr->pd_checksum;
+		return false;
+	}
+
+	*chk_expected = pg_checksum_page(buffer, blkno);
+	*chk_found = hdr->pd_checksum;
+
+	return (*chk_expected == *chk_found);
+}
+
+/*
+ * Check for interrupts and cost-based delay.
+ */
+static void
+check_delay_point(void)
+{
+	/* Always check for interrupts */
+	CHECK_FOR_INTERRUPTS();
+
+	if (!ChecksumCostActive || InterruptPending)
+		return;
+
+	/* Nap if appropriate */
+	if (ChecksumCostBalance >= ChecksumCostLimit)
+	{
+		int			msec;
+
+		msec = ChecksumCostDelay * ChecksumCostBalance / ChecksumCostLimit;
+		if (msec > ChecksumCostDelay * 4)
+			msec = ChecksumCostDelay * 4;
+
+		pg_usleep(msec * 1000L);
+
+		ChecksumCostBalance = 0;
+
+		/* Might have gotten an interrupt while sleeping */
+		CHECK_FOR_INTERRUPTS();
+	}
+}
+
+/*
+ *-------------------------
+ * Safely read the wanted buffer from disk, dealing with possible concurrency
+ * issues.  Note that if a buffer is found dirty in shared_buffers, no read will
+ * be performed and the caller will be informed that no check should be done.
+ * We can safely ignore such buffers as they'll be written before the next
+ * checkpoint's completion.
+ *
+ * The following locks can be used in this function:
+ *
+ *   - shared LWLock on the target buffer pool partition mapping.
+ *   - IOLock on the buffer
+ *
+ * The IOLock is taken when reading the buffer from disk if it exists in
+ * shared_buffers, to avoid torn pages.
+ *
+ * If the buffer isn't in shared_buffers, it'll be read from disk without any
+ * lock unless caller asked otherwise, setting needlock.  In this case, the
+ * read will be done while the buffer mapping partition LWLock is still being
+ * held.  Reading with this lock avoids the unlikely but possible case that a
+ * buffer wasn't present in shared buffers when we checked but was then
+ * allocated in shared_buffers, modified and flushed concurrently before we
+ * later tried to read it, leading to a false positive due to a torn page.
+ * As an optimistic approach, the caller can first read the buffer without
+ * holding the target buffer mapping partition LWLock, and reread the buffer
+ * from disk in case of error.
+ *
+ * Caller should hold an AccessShareLock on the Relation
+ *-------------------------
+ */
+static bool
+check_get_buffer(Relation relation, ForkNumber forknum,
+				 BlockNumber blkno, char *buffer, bool needlock,
+				 bool *found_in_sb)
+{
+	bool		checkit = true;
+	BufferTag	buf_tag;		/* identity of requested block */
+	uint32		buf_hash;		/* hash value for buf_tag */
+	LWLock	   *partLock;		/* buffer partition lock for the buffer */
+	BufferDesc *bufdesc;
+	int			buf_id;
+
+	*found_in_sb = false;
+
+	/* Check for interrupts and take throttling into account. */
+	check_delay_point();
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(buf_tag, relation->rd_smgr->smgr_rnode.node, forknum, blkno);
+
+	/* determine its hash code and partition lock ID */
+	buf_hash = BufTableHashCode(&buf_tag);
+	partLock = BufMappingPartitionLock(buf_hash);
+
+	/* see if the block is in the buffer pool already */
+	LWLockAcquire(partLock, LW_SHARED);
+	buf_id = BufTableLookup(&buf_tag, buf_hash);
+	if (buf_id >= 0)
+	{
+		uint32		buf_state;
+
+		*found_in_sb = true;
+
+		/*
+		 * Found it.  Now, retrieve its state to know what to do with it,
+		 * releasing the buffer header spinlock immediately.  We do so to
+		 * limit overhead as much as possible.  We'll keep the shared
+		 * lightweight lock on the target
+		 * buffer mapping partition, so this buffer can't be evicted, and
+		 * we'll acquire an IOLock on the buffer if we need to read the
+		 * content on disk.
+		 */
+		bufdesc = GetBufferDescriptor(buf_id);
+
+		buf_state = LockBufHdr(bufdesc);
+		UnlockBufHdr(bufdesc, buf_state);
+
+		/*
+		 * Dirty pages are ignored as they'll be flushed soon. Invalid buffers
+		 * are also skipped.
+		 */
+		if ((buf_state & BM_DIRTY) || !(buf_state & BM_TAG_VALID))
+			checkit = false;
+
+		/*
+		 * Read the buffer from disk, taking the IO lock to prevent torn-page
+		 * reads, in the unlikely event that it was concurrently dirtied and
+		 * flushed.
+		 */
+		if (checkit)
+		{
+			LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
+			smgrread(relation->rd_smgr, forknum, blkno, buffer);
+			LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
+
+			/* Add a page cost. */
+			ChecksumCostBalance += ChecksumCostPage;
+		}
+	}
+	else if (needlock)
+	{
+		/*
+		 * Caller asked to read the buffer while we have a lock on the target
+		 * partition.
+		 */
+		smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+		/* The buffer will have to be checked. */
+		Assert(checkit);
+
+		/* Add a page cost. */
+		ChecksumCostBalance += ChecksumCostPage;
+	}
+
+	LWLockRelease(partLock);
+
+	if (*found_in_sb || needlock)
+		return checkit;
+
+	/* After this point the buffer will always be checked. */
+	Assert(checkit);
+
+	/*
+	 * Didn't find it in the buffer pool and didn't read it while holding the
+	 * buffer mapping partition lock.  We'll have to try to read it from
+	 * disk, after releasing the target partition lock to avoid too much
+	 * overhead.  It means that it's possible to get a torn page later, so
+	 * we'll have to retry with a suitable lock in case of error to avoid
+	 * false positives.
+	 */
+	smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+	/* Add a page cost. */
+	ChecksumCostBalance += ChecksumCostPage;
+
+	return checkit;
+}
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 13efa9338c..27dc524e00 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -22,6 +22,7 @@ OBJS = \
 	bool.o \
 	cash.o \
 	char.o \
+	checksumfuncs.o \
 	cryptohashes.o \
 	date.o \
 	datetime.o \
diff --git a/src/backend/utils/adt/checksumfuncs.c b/src/backend/utils/adt/checksumfuncs.c
new file mode 100644
index 0000000000..d005b8d01f
--- /dev/null
+++ b/src/backend/utils/adt/checksumfuncs.c
@@ -0,0 +1,217 @@
+/*-------------------------------------------------------------------------
+ *
+ * checksumfuncs.c
+ *	  Functions for checksum-related features such as online verification
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/checksumfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/relation.h"
+#include "catalog/pg_authid_d.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/bufmgr.h"
+#include "storage/checksum.h"
+#include "storage/smgr.h"
+#include "utils/acl.h"
+#include "utils/rel.h"
+#include "utils/builtins.h"
+#include "utils/lsyscache.h"
+
+static void check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+					 Oid relid, ForkNumber single_forknum);
+static void check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+									  Relation relation, ForkNumber forknum);
+static void pg_check_relation_internal(FunctionCallInfo fcinfo, Oid relid,
+									   Oid forknum);
+
+
+Datum
+pg_check_relation(PG_FUNCTION_ARGS)
+{
+	Oid			relid = InvalidOid;
+
+	pg_check_relation_internal(fcinfo, relid, InvalidForkNumber);
+
+	return (Datum) 0;
+}
+
+Datum
+pg_check_relation_fork(PG_FUNCTION_ARGS)
+{
+	Oid			relid = InvalidOid;
+	const char *forkname;
+	ForkNumber	forknum;
+
+	forkname = TextDatumGetCString(PG_GETARG_TEXT_PP(1));
+	forknum = forkname_to_number(forkname);
+
+	pg_check_relation_internal(fcinfo, relid, forknum);
+
+	return (Datum) 0;
+}
+
+/* Common code for all versions of pg_check_relation() */
+static void
+pg_check_relation_internal(FunctionCallInfo fcinfo, Oid relid, Oid forknum)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+
+	if (!DataChecksumsEnabled())
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("data checksums are not enabled in this cluster")));
+
+	if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_STAT_SCAN_TABLES))
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("only superuser or a member of the pg_stat_scan_tables role may use this function")));
+
+	/* Switch into long-lived context to construct returned data structures */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	relid = PG_GETARG_OID(0);
+
+	/* Set cost-based checksum verification delay */
+	ChecksumCostActive = (ChecksumCostDelay > 0);
+	ChecksumCostBalance = 0;
+
+	check_one_relation(tupdesc, tupstore, relid, forknum);
+
+	tuplestore_donestoring(tupstore);
+}
+
+/*
+ * Perform the check on a single relation, possibly filtered with a single
+ * fork.  This function will check if the given relation exists or not, as
+ * a relation could be dropped after checking for the list of relations and
+ * before getting here, and we don't want to error out in this case.
+ */
+static void
+check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+					 Oid relid, ForkNumber single_forknum)
+{
+	Relation	relation = NULL;
+	ForkNumber	forknum;
+
+	relation = relation_open(relid, AccessShareLock);
+
+	/* sanity checks */
+	if (!RELKIND_HAS_STORAGE(relation->rd_rel->relkind))
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("relation \"%s\" does not have storage to be checked",
+				 RelationGetRelationName(relation))));
+
+	if (RELATION_IS_OTHER_TEMP(relation))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot verify temporary tables of other sessions")));
+
+	RelationOpenSmgr(relation);
+
+	for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+	{
+		if (single_forknum != InvalidForkNumber && single_forknum != forknum)
+			continue;
+
+		if (smgrexists(relation->rd_smgr, forknum))
+			check_relation_fork(tupdesc, tupstore, relation, forknum);
+	}
+
+	relation_close(relation, AccessShareLock);	/* release the lock */
+}
+
+/*
+ * For a given relation and fork, do the real work of iterating over all pages
+ * and doing the check.  Caller must hold an AccessShareLock on the given
+ * relation.
+ */
+static void
+check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+						  Relation relation, ForkNumber forknum)
+{
+	BlockNumber blkno,
+				nblocks;
+
+	/*
+	 * We remember the number of blocks here.  Since caller must hold a lock on
+	 * the relation, we know that it won't be truncated while we're iterating
+	 * over the blocks.  Any block added after this function started won't be
+	 * checked, but this is out of scope as such pages will be flushed before
+	 * the next checkpoint's completion.
+	 */
+	nblocks = RelationGetNumberOfBlocksInFork(relation, forknum);
+
+#define PG_CHECK_RELATION_COLS			5	/* Number of output arguments in the SRF */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		uint16		chk_expected,
+					chk_found;
+		Datum		values[PG_CHECK_RELATION_COLS];
+		bool		nulls[PG_CHECK_RELATION_COLS];
+		int			i = 0;
+
+		/* Check the given buffer */
+		if (check_one_block(relation, forknum, blkno, &chk_expected,
+							&chk_found))
+		{
+			/* Buffer not corrupted or not worth checking, continue */
+			continue;
+		}
+
+		memset(values, 0, sizeof(values));
+		memset(nulls, 0, sizeof(nulls));
+
+		values[i++] = ObjectIdGetDatum(relation->rd_id);
+		values[i++] = Int32GetDatum(forknum);
+		values[i++] = UInt32GetDatum(blkno);
+		/*
+		 * This can happen if a corruption makes the block appear as
+		 * PageIsNew() when it isn't actually a new page.
+		 */
+		if (chk_expected == NoComputedChecksum)
+			nulls[i++] = true;
+		else
+			values[i++] = UInt16GetDatum(chk_expected);
+		values[i++] = UInt16GetDatum(chk_found);
+
+		Assert(i == PG_CHECK_RELATION_COLS);
+
+		/* Report the failure to the stat collector and the logs. */
+		pgstat_report_checksum_failure();
+		ereport(WARNING,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("invalid page in block %u of relation %s",
+						blkno,
+						relpath(relation->rd_smgr->smgr_rnode, forknum))));
+
+		/* Save the corrupted blocks in the tuplestore. */
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..8683343619 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -134,6 +134,13 @@ int			max_worker_processes = 8;
 int			max_parallel_workers = 8;
 int			MaxBackends = 0;
 
+int			ChecksumCostPage = 10;	/* GUC parameters for checksum check */
+int			ChecksumCostLimit = 200;
+double		ChecksumCostDelay = 0;
+
+int			ChecksumCostBalance = 0;	/* working state for checksums check */
+bool		ChecksumCostActive = false;
+
 int			VacuumCostPageHit = 1;	/* GUC parameters for vacuum */
 int			VacuumCostPageMiss = 10;
 int			VacuumCostPageDirty = 20;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 477af5d552..b2af8083f6 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -695,6 +695,8 @@ const char *const config_group_names[] =
 	gettext_noop("Resource Usage / Disk"),
 	/* RESOURCES_KERNEL */
 	gettext_noop("Resource Usage / Kernel Resources"),
+	/* RESOURCES_CHECKSUM_DELAY */
+	gettext_noop("Resource Usage / Cost-Based Checksum Verification Delay"),
 	/* RESOURCES_VACUUM_DELAY */
 	gettext_noop("Resource Usage / Cost-Based Vacuum Delay"),
 	/* RESOURCES_BGWRITER */
@@ -2389,6 +2391,26 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"checksum_cost_page", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost for verifying a page."),
+			NULL
+		},
+		&ChecksumCostPage,
+		10, 0, 10000,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"checksum_cost_limit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost amount available before napping."),
+			NULL
+		},
+		&ChecksumCostLimit,
+		200, 1, 10000,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_cost_page_hit", PGC_USERSET, RESOURCES_VACUUM_DELAY,
 			gettext_noop("Vacuum cost for a page found in the buffer cache."),
@@ -3553,6 +3575,17 @@ static struct config_real ConfigureNamesReal[] =
 		check_random_seed, assign_random_seed, show_random_seed
 	},
 
+	{
+		{"checksum_cost_delay", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost delay in milliseconds."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&ChecksumCostDelay,
+		0, 0, 100,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_cost_delay", PGC_USERSET, RESOURCES_VACUUM_DELAY,
 			gettext_noop("Vacuum cost delay in milliseconds."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 91fa185053..acb5baefae 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -156,6 +156,12 @@
 #max_files_per_process = 1000		# min 64
 					# (change requires restart)
 
+# - Cost-Based Checksum Verification Delay -
+
+#checksum_cost_delay = 0			# 0-100 milliseconds (0 disables)
+#checksum_cost_page = 10		# 0-10000 credits
+#checksum_cost_limit = 200		# 1-10000 credits
+
 # - Cost-Based Vacuum Delay -
 
 #vacuum_cost_delay = 0			# 0-100 milliseconds (0 disables)
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index a649e44d08..6adc3b04c9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10825,6 +10825,22 @@
   proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
   proargnames => '{tablespace,name,size,modification}',
   prosrc => 'pg_ls_tmpdir_1arg' },
+{ oid => '9147', descr => 'check data integrity for all forks of a relation',
+  proname => 'pg_check_relation', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass',
+  proallargtypes => '{regclass,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,o,o,o,o,o}',
+  proargnames => '{relation,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation' },
+{ oid => '9148', descr => 'check data integrity for one fork of a relation',
+  proname => 'pg_check_relation', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,i,o,o,o,o,o}',
+  proargnames => '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation_fork' },
 
 # hash partitioning constraint function
 { oid => '5028', descr => 'hash partition CHECK constraint',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14fa127ab1..8d4605efc8 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -246,6 +246,13 @@ extern PGDLLIMPORT int work_mem;
 extern PGDLLIMPORT int maintenance_work_mem;
 extern PGDLLIMPORT int max_parallel_maintenance_workers;
 
+extern int	ChecksumCostPage;
+extern int	ChecksumCostLimit;
+extern double ChecksumCostDelay;
+
+extern int	ChecksumCostBalance;
+extern bool ChecksumCostActive;
+
 extern int	VacuumCostPageHit;
 extern int	VacuumCostPageMiss;
 extern int	VacuumCostPageDirty;
diff --git a/src/include/storage/checksum.h b/src/include/storage/checksum.h
index 6e77744cbc..08aa0949a8 100644
--- a/src/include/storage/checksum.h
+++ b/src/include/storage/checksum.h
@@ -13,8 +13,21 @@
 #ifndef CHECKSUM_H
 #define CHECKSUM_H
 
+#include "postgres.h"
+
+#include "access/tupdesc.h"
+#include "common/relpath.h"
 #include "storage/block.h"
+#include "utils/relcache.h"
+#include "utils/tuplestore.h"
+
+/* A zero checksum can never be computed; see pg_checksum_page(). */
+#define NoComputedChecksum	0
+
 
+extern bool check_one_block(Relation relation, ForkNumber forknum,
+							BlockNumber blkno, uint16 *chk_expected,
+							uint16 *chk_found);
 /*
  * Compute the checksum for a Postgres page.  The page must be aligned on a
  * 4-byte boundary.
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 454c2df487..84965d3f09 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -62,6 +62,7 @@ enum config_group
 	RESOURCES_MEM,
 	RESOURCES_DISK,
 	RESOURCES_KERNEL,
+	RESOURCES_CHECKSUM_DELAY,
 	RESOURCES_VACUUM_DELAY,
 	RESOURCES_BGWRITER,
 	RESOURCES_ASYNCHRONOUS,
diff --git a/src/test/Makefile b/src/test/Makefile
index efb206aa75..832868099f 100644
--- a/src/test/Makefile
+++ b/src/test/Makefile
@@ -12,7 +12,8 @@ subdir = src/test
 top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS = perl regress isolation modules authentication recovery subscription
+SUBDIRS = perl regress isolation modules authentication check_relation \
+	  recovery subscription
 
 # Test suites that are not safe by default but can be run if selected
 # by the user via the whitespace-separated list in variable
diff --git a/src/test/check_relation/.gitignore b/src/test/check_relation/.gitignore
new file mode 100644
index 0000000000..871e943d50
--- /dev/null
+++ b/src/test/check_relation/.gitignore
@@ -0,0 +1,2 @@
+# Generated by test suite
+/tmp_check/
diff --git a/src/test/check_relation/Makefile b/src/test/check_relation/Makefile
new file mode 100644
index 0000000000..792098fa65
--- /dev/null
+++ b/src/test/check_relation/Makefile
@@ -0,0 +1,23 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/test/check_relation
+#
+# Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/test/check_relation/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/test/check_relation
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
+clean distclean maintainer-clean:
+	rm -rf tmp_check
diff --git a/src/test/check_relation/README b/src/test/check_relation/README
new file mode 100644
index 0000000000..415c4b21ad
--- /dev/null
+++ b/src/test/check_relation/README
@@ -0,0 +1,23 @@
+src/test/check_relation/README
+
+Regression tests for online checksums verification
+==================================================
+
+This directory contains a test suite for online checksums verification.
+
+Running the tests
+=================
+
+NOTE: You must have given the --enable-tap-tests argument to configure.
+
+Run
+    make check
+or
+    make installcheck
+You can use "make installcheck" if you previously did "make install".
+In that case, the code in the installation tree is tested.  With
+"make check", a temporary installation tree is built from the current
+sources and then tested.
+
+Either way, this test initializes, starts, and stops a test Postgres
+cluster.
diff --git a/src/test/check_relation/t/01_checksums_check.pl b/src/test/check_relation/t/01_checksums_check.pl
new file mode 100644
index 0000000000..1ad34adcb9
--- /dev/null
+++ b/src/test/check_relation/t/01_checksums_check.pl
@@ -0,0 +1,276 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 59;
+
+our $CHECKSUM_UINT16_OFFSET = 4;
+our $PD_UPPER_UINT16_OFFSET = 7;
+our $BLOCKSIZE;
+our $TOTAL_NB_ERR = 0;
+
+sub get_block
+{
+	my ($filename, $blkno) = @_;
+	my $block;
+
+	open(my $infile, '<', $filename) or die;
+	binmode($infile);
+
+	my $success = read($infile, $block, $BLOCKSIZE, ($blkno * $BLOCKSIZE));
+	die($!) if not defined $success;
+
+	close($infile);
+
+	return($block);
+}
+
+sub overwrite_block
+{
+	my ($filename, $block, $blkno) = @_;
+
+	open(my $outfile, '>', $filename) or die;
+	binmode ($outfile);
+
+	my $nb = syswrite($outfile, $block, $BLOCKSIZE, ($blkno * $BLOCKSIZE));
+
+	die($!) if not defined $nb;
+	die("Write error") if ($nb != $BLOCKSIZE);
+
+	$outfile->flush();
+
+	close($outfile);
+}
+
+sub get_uint16_from_page
+{
+	my ($block, $offset) = @_;
+
+	return (unpack("S*", $block))[$offset];
+}
+
+sub set_uint16_to_page
+{
+	my ($block, $data, $offset) = @_;
+
+	my $pack = pack("S", $data);
+
+	# vec with 16B or more won't preserve endianness
+	vec($block, 2*$offset, 8) = (unpack('C*', $pack))[0];
+	vec($block, (2*$offset) + 1, 8) = (unpack('C*', $pack))[1];
+
+	return $block;
+}
+
+sub check_checksums_call
+{
+	my ($node, $relname) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM pg_catalog.pg_check_relation('$relname')"
+	);
+
+	return ($stderr eq '');
+}
+
+sub check_checksums_nb_error
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM (SELECT pg_catalog.pg_check_relation(oid, 'main')"
+		. "   FROM pg_class WHERE relkind in ('r', 'i', 'm')) AS s"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+	is($stdout, $nb, "Should have $nb error(s)");
+
+	$TOTAL_NB_ERR += $nb;
+}
+
+sub check_pg_stat_database_nb_error
+{
+	my ($node) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT "
+		. " sum(checksum_failures)"
+		. " FROM pg_catalog.pg_stat_database"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	is($stderr, '', 'Function should run successfully');
+	is($stdout, $TOTAL_NB_ERR, "Should have $TOTAL_NB_ERR error(s)");
+}
+
+sub get_checksums_errors
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+		. " relid::regclass::text, forknum, failed_blocknum,"
+		. " expected_checksum, found_checksum"
+		. " FROM (SELECT (pg_catalog.pg_check_relation(oid)).*"
+		. "   FROM pg_class WHERE relkind in ('r', 'i', 'm')) AS s"
+	);
+
+	is($cmdret, '0', 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+
+	$TOTAL_NB_ERR += $nb;
+
+	return $stdout;
+}
+
+# This function performs various tests by modifying the specified block at
+# the specified uint16 offset, checking that the corruption is correctly
+# detected, and finally restoring the specified block to its original content.
+sub corrupt_and_test_block
+{
+	my ($node, $filename, $blkno, $offset, $fake_data) = @_;
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+
+	check_pg_stat_database_nb_error($node);
+
+	$node->stop();
+
+	my $original_block = get_block($filename, $blkno);
+	my $original_data = get_uint16_from_page($original_block, $offset);
+
+	isnt($original_data, $fake_data,
+		"The fake data at offset $offset should be different"
+		. " from the existing one");
+
+	my $new_block = set_uint16_to_page($original_block, $fake_data, $offset);
+	isnt($original_data, get_uint16_from_page($new_block, $offset),
+		"The fake data at offset $offset should have been changed in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+
+	my $written_data = get_uint16_from_page(get_block($filename, $blkno), $offset);
+	isnt($original_data, $written_data,
+		"The data written at offset $offset should be different"
+		. " from the original one");
+	is(get_uint16_from_page($new_block, $offset), $written_data,
+		"The data written at offset $offset should be the same"
+		. " as the one in memory");
+	is($written_data, $fake_data,
+		"The data written at offset $offset should be the one"
+		. " we wanted to write");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 1, qr/invalid page in block $blkno/);
+
+	my $expected_checksum;
+	my $found_checksum = get_uint16_from_page($new_block,
+		$CHECKSUM_UINT16_OFFSET);
+	if ($offset == $PD_UPPER_UINT16_OFFSET)
+	{
+		# A checksum can't be computed if it's detected as PageIsNew(), so the
+		# function returns NULL for the computed checksum
+		$expected_checksum = '';
+	}
+	else
+	{
+		$expected_checksum = get_uint16_from_page($original_block,
+			$CHECKSUM_UINT16_OFFSET);
+	}
+
+	my $det = get_checksums_errors($node, 1, qr/invalid page in block $blkno/);
+	is($det, "t1|0|0|$expected_checksum|$found_checksum",
+		"The checksums error for modification at offset $offset"
+		. " should be detected");
+
+	$node->stop();
+
+	$new_block = set_uint16_to_page($original_block, $original_data, $offset);
+	is($original_data, get_uint16_from_page($new_block, $offset),
+		"The data at offset $offset should have been restored in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+	is($original_data, get_uint16_from_page(get_block($filename, $blkno),
+			$offset),
+		"The data at offset $offset should have been restored on disk");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+}
+
+if (exists $ENV{MY_PG_REGRESS})
+{
+	$ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS};
+}
+
+my $node = get_new_node('main');
+
+my %params;
+$params{'extra'} = ['--data-checksums'];
+$node->init(%params);
+
+$node->start();
+
+$ENV{PGOPTIONS} = '--client-min-messages=WARNING';
+
+my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_checksums')");
+
+is($stdout, 'on', 'Data checksums should be enabled');
+
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('block_size')");
+
+$BLOCKSIZE = $stdout;
+
+$node->safe_psql(
+	'postgres', q|
+	CREATE TABLE public.t1(id integer);
+	CREATE INDEX t1_id_idx ON public.t1 (id);
+	INSERT INTO public.t1 SELECT generate_series(1, 100);
+	CREATE VIEW public.v1 AS SELECT * FROM t1;
+	CREATE MATERIALIZED VIEW public.mv1 AS SELECT * FROM t1;
+	CREATE SEQUENCE public.s1;
+	CREATE UNLOGGED TABLE public.u_t1(id integer);
+	CREATE INDEX u_t1_id_idx ON public.u_t1 (id);
+	INSERT INTO public.u_t1 SELECT generate_series(1, 100);
+	CHECKPOINT;
+|);
+
+# Check sane behavior on various object types, including those that don't
+# have storage.
+is(check_checksums_call($node, 't1'), '1', 'Can check a table');
+is(check_checksums_call($node, 't1_id_idx'), '1', 'Can check an index');
+is(check_checksums_call($node, 'v1'), '', 'Cannot check a view');
+is(check_checksums_call($node, 'mv1'), '1', 'Can check a materialized view');
+is(check_checksums_call($node, 's1'), '1', 'Can check a sequence');
+is(check_checksums_call($node, 'u_t1'), '1', 'Can check an unlogged table');
+is(check_checksums_call($node, 'u_t1_id_idx'), '1', 'Can check an unlogged index');
+
+# get the underlying heap absolute path
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_directory') || '/' || pg_relation_filepath('t1')"
+);
+
+isnt($stdout, '', 'A relation file path should be returned');
+
+my $filename = $stdout;
+
+check_checksums_nb_error($node, 0, qr/^$/);
+
+check_pg_stat_database_nb_error($node);
+
+my $fake_uint16 = hex '0x0000';
+
+# Test with a modified checksum.  We use a zero checksum here as it's the only
+# one that cannot exist on a checksummed page.  We also don't have an easy way
+# to compute what the checksum would be after a modification in a random place
+# in the block.
+corrupt_and_test_block($node, $filename, 0, $CHECKSUM_UINT16_OFFSET,
+	$fake_uint16);
+
+# Test corruption making the block look like it's PageIsNew().
+corrupt_and_test_block($node, $filename, 0, $PD_UPPER_UINT16_OFFSET,
+	$fake_uint16);
-- 
2.20.1
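The Perl helpers in the test above (get_block, set_uint16_to_page, overwrite_block) poke a 16-bit value into a raw page image at a given word offset and verify that it round-trips through disk. For readers who want to experiment with the same byte-level manipulation outside a running cluster, here is a minimal Python sketch; the block size and word-offset convention mirror the test's, but the page here is just a zeroed buffer, not a real PostgreSQL page:

```python
import struct

BLOCKSIZE = 8192            # typical PostgreSQL block size
CHECKSUM_UINT16_OFFSET = 4  # word offset used by the test above

def get_uint16(page: bytes, offset: int) -> int:
    """Read the uint16 at the given 16-bit-word offset (native byte order)."""
    return struct.unpack_from("=H", page, offset * 2)[0]

def set_uint16(page: bytes, value: int, offset: int) -> bytes:
    """Return a copy of the page with the uint16 at `offset` replaced."""
    buf = bytearray(page)
    struct.pack_into("=H", buf, offset * 2, value)
    return bytes(buf)

page = bytes(BLOCKSIZE)                        # an all-zero page image
corrupted = set_uint16(page, 0xDEAD, CHECKSUM_UINT16_OFFSET)
assert get_uint16(corrupted, CHECKSUM_UINT16_OFFSET) == 0xDEAD
assert get_uint16(page, CHECKSUM_UINT16_OFFSET) == 0   # original untouched
```

struct's "=H" format writes the value in native byte order, which is what the Perl code achieves by writing the two bytes of the packed value individually, since vec() with 16-bit units would force big-endian order.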

#27Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: Julien Rouhaud (#26)
Re: Online checksums verification in the backend

On Sun, 5 Apr 2020 at 17:44, Julien Rouhaud <rjuju123@gmail.com> wrote:

On Sun, Apr 05, 2020 at 01:13:30PM +0900, Masahiko Sawada wrote:

Thank you for updating the patch! The patch looks good to me. Here are
some random comments mostly about cosmetic changes.

Thanks a lot for the review!

Thank you for updating the patch.

1.
I think we can have two separate SQL functions:
pg_check_relation(regclass, text) and pg_check_relation(regclass),
instead of setting NULL by default to the second argument.

I'm fine with it, so implemented this way with the required documentation
changes.

Why do we need two rows in the doc? For instance, replication slot
functions have some optional arguments but there is only one row in
the doc. So I think we don't need to change the doc from the previous
version patch.

And I think these are not necessary as we already defined in
include/catalog/pg_proc.dat:

+CREATE OR REPLACE FUNCTION pg_check_relation(
+  IN relation regclass,
+  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+  OUT expected_checksum integer, OUT found_checksum integer)
+  RETURNS SETOF record STRICT VOLATILE LANGUAGE internal AS 'pg_check_relation'
+  PARALLEL RESTRICTED;
+
+CREATE OR REPLACE FUNCTION pg_check_relation(
+  IN relation regclass, IN fork text,
+  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+  OUT expected_checksum integer, OUT found_checksum integer)
+  RETURNS SETOF record STRICT VOLATILE LANGUAGE internal
+  AS 'pg_check_relation_fork'
+  PARALLEL RESTRICTED;
2.
+ * Check data sanity for a specific block in the given fork of the given
+ * relation, always retrieved locally with smgrred even if a version exists in

s/smgrred/smgrread/

Fixed.

3.
+       /* The buffer will have to check checked. */
+       Assert(checkit);

Should it be "The buffer will have to be checked"?

Oops indeed, fixed.

4.
+   if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
+       ereport(ERROR,
+               (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+                errmsg("only superuser or a member of the
pg_read_server_files role may use this function")));

Looking at the definition of the pg_read_server_files role, this role
seems to be for operations that could read non-database files such as
csv files. Therefore, currently this role is used by file_fdw and COPY
command. I personally think pg_stat_scan_tables would be more
appropriate for this function but I'm not sure.

That's a very good point, especially since the documentation of this default
role is quite relevant for those functions:

"Execute monitoring functions that may take ACCESS SHARE locks on tables,
potentially for a long time."

So changed!

5.
+   /* Set cost-based vacuum delay */
+   ChecksumCostActive = (ChecksumCostDelay > 0);
+   ChecksumCostBalance = 0;

s/vacuum/checksum verification/

Fixed.

6.
+       ereport(WARNING,
+               (errcode(ERRCODE_DATA_CORRUPTED),
+                errmsg("invalid page in block %u of relation %s",
+                       blkno,
+                       relpath(relation->rd_smgr->smgr_rnode, forknum))));

I think it's better to show the relation name instead of the relation path here.

I'm here using the same pattern as what ReadBuffer_common() would display if a
corrupted block is read. I think it's better to keep the format for both, so
any existing log analyzer will keep working with those new functions.

Ok, I agree with you.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#28Julien Rouhaud
rjuju123@gmail.com
In reply to: Masahiko Sawada (#27)
1 attachment(s)
Re: Online checksums verification in the backend

On Sun, Apr 05, 2020 at 06:08:06PM +0900, Masahiko Sawada wrote:

Why do we need two rows in the doc? For instance, replication slot
functions have some optional arguments but there is only one row in
the doc. So I think we don't need to change the doc from the previous
version patch.

I thought that if we document the function as pg_check_relation(regclass [,
fork]) users could think that the 2nd argument is optional, so that
pg_check_relation('something', NULL) could be a valid alias for the 1-argument
form, which it isn't. After checking, I see that e.g. current_setting has the
same semantics and is documented the way you suggest, so fixed back to previous
version.

And I think these are not necessary as we already defined in
include/catalog/pg_proc.dat:

+CREATE OR REPLACE FUNCTION pg_check_relation(
+  IN relation regclass,
+  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+  OUT expected_checksum integer, OUT found_checksum integer)
+  RETURNS SETOF record STRICT VOLATILE LANGUAGE internal AS 'pg_check_relation'
+  PARALLEL RESTRICTED;
+
+CREATE OR REPLACE FUNCTION pg_check_relation(
+  IN relation regclass, IN fork text,
+  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+  OUT expected_checksum integer, OUT found_checksum integer)
+  RETURNS SETOF record STRICT VOLATILE LANGUAGE internal
+  AS 'pg_check_relation_fork'
+  PARALLEL RESTRICTED;

Oh right this isn't required since there's no default value anymore, fixed.

v9 attached.
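As background for the attached v9 patch, the per-block strategy its header comment describes — skip blocks that are dirty in shared buffers, read from disk under the buffer's IO lock when the block is cached and clean, and otherwise read optimistically without a lock and only recheck under the lock on an apparent failure — can be sketched roughly as follows. All names and the buffer-pool model here are illustrative, not the patch's actual APIs:

```python
def check_one_block(blkno, buffer_pool, read_from_disk, verify_checksum):
    """Return True if the block verifies clean (or is skippable), else False.

    buffer_pool maps blkno -> {"dirty": bool}; read_from_disk(blkno) returns
    the raw page; verify_checksum(page) is True when the checksum matches.
    """
    entry = buffer_pool.get(blkno)
    if entry is not None:
        if entry["dirty"]:
            # A dirty buffer will be written out at the next checkpoint (or
            # during recovery), so the on-disk version is irrelevant: skip it.
            return True
        # Cached and clean: the patch reads from disk while holding the
        # buffer's IO lock so no concurrent write can tear the read.
        return verify_checksum(read_from_disk(blkno))
    # Not cached: read optimistically without any lock first...
    if verify_checksum(read_from_disk(blkno)):
        return True
    # ...and re-read only on an apparent failure (the patch retakes the
    # partition lock here), since a racing concurrent write is unlikely.
    return verify_checksum(read_from_disk(blkno))

pool = {1: {"dirty": True}, 2: {"dirty": False}}
assert check_one_block(1, pool, lambda b: b"", lambda p: False)  # dirty: skipped

reads = []
def from_disk(blkno):
    reads.append(blkno)
    return b"page"

assert not check_one_block(3, pool, from_disk, lambda p: False)
assert len(reads) == 2  # optimistic read, then one locked re-check
```

The point of the optimistic branch is to keep the common case (block not in shared buffers, checksum fine) completely lock-free, paying the locking cost only when a mismatch is observed.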

Attachments:

v9-0001-Add-a-pg_check_relation-function.patchtext/plain; charset=us-asciiDownload
From 66e0ed1c3b12c42c4d52b2b27e79e16e82730b4b Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <jrouhaud@vmware.com>
Date: Mon, 4 Nov 2019 08:40:23 +0100
Subject: [PATCH v9] Add a pg_check_relation() function.

This function checks the validity of the checksums for all non-dirty blocks of
a given relation, and optionally a given fork, and returns the list of all
blocks that don't match, along with the expected and found checksums.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada
Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  85 +++++
 doc/src/sgml/func.sgml                        |  51 +++
 src/backend/storage/page/checksum.c           | 318 +++++++++++++++++-
 src/backend/utils/adt/Makefile                |   1 +
 src/backend/utils/adt/checksumfuncs.c         | 217 ++++++++++++
 src/backend/utils/init/globals.c              |   7 +
 src/backend/utils/misc/guc.c                  |  33 ++
 src/backend/utils/misc/postgresql.conf.sample |   6 +
 src/include/catalog/pg_proc.dat               |  16 +
 src/include/miscadmin.h                       |   7 +
 src/include/storage/checksum.h                |  13 +
 src/include/utils/guc_tables.h                |   1 +
 src/test/Makefile                             |   3 +-
 src/test/check_relation/.gitignore            |   2 +
 src/test/check_relation/Makefile              |  23 ++
 src/test/check_relation/README                |  23 ++
 .../check_relation/t/01_checksums_check.pl    | 276 +++++++++++++++
 17 files changed, 1078 insertions(+), 4 deletions(-)
 create mode 100644 src/backend/utils/adt/checksumfuncs.c
 create mode 100644 src/test/check_relation/.gitignore
 create mode 100644 src/test/check_relation/Makefile
 create mode 100644 src/test/check_relation/README
 create mode 100644 src/test/check_relation/t/01_checksums_check.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 114db38116..a20501bb85 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2038,6 +2038,91 @@ include_dir 'conf.d'
      </note>
     </sect2>
 
+    <sect2 id="runtime-config-resource-checksum-verification-cost">
+     <title>Cost-based Checksum Verification Delay</title>
+
+     <para>
+      During the execution of the <xref linkend="functions-check-relation-note"/>
+      function, the system maintains an internal counter that keeps track of
+      the estimated cost of the various I/O operations that are performed.
+      When the accumulated cost reaches a limit (specified by
+      <varname>checksum_cost_limit</varname>), the process performing the
+      operation will sleep for a short period of time, as specified by
+      <varname>checksum_cost_delay</varname>. Then it will reset the counter
+      and continue execution.
+     </para>
+
+     <para>
+      This feature is disabled by default. To enable it, set the
+      <varname>checksum_cost_delay</varname> variable to a nonzero
+      value.
+     </para>
+
+     <variablelist>
+      <varlistentry id="guc-checksum-cost-delay" xreflabel="checksum_cost_delay">
+       <term><varname>checksum_cost_delay</varname> (<type>floating point</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_delay</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The amount of time that the process will sleep
+         when the cost limit has been exceeded.
+         If this value is specified without units, it is taken as milliseconds.
+         The default value is zero, which disables the cost-based checksum
+         verification delay feature.  Positive values enable cost-based
+         checksum verification.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-checksum-cost-page" xreflabel="checksum_cost_page">
+       <term><varname>checksum_cost_page</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_page</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The estimated cost for verifying a buffer, whether it's found in the
+         shared buffer cache or not. It represents the cost to lock the buffer
+         pool, look up the shared hash table, read the content of the page from
+         disk and compute its checksum.  The default value is 10.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-checksum-cost-limit" xreflabel="checksum_cost_limit">
+       <term><varname>checksum_cost_limit</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_limit</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The accumulated cost that will cause the verification process to sleep.
+         The default value is 200.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
+
+     <note>
+      <para>
+       There are certain operations that hold critical locks and should
+       therefore complete as quickly as possible.  Cost-based checksum
+       verification delays do not occur during such operations.  Therefore it
+       is possible that the cost accumulates far higher than the specified
+       limit.  To avoid uselessly long delays in such cases, the actual delay
+       is calculated as <varname>checksum_cost_delay</varname> *
+       <varname>accumulated_balance</varname> /
+       <varname>checksum_cost_limit</varname> with a maximum of
+       <varname>checksum_cost_delay</varname> * 4.
+      </para>
+     </note>
+    </sect2>
+
     <sect2 id="runtime-config-resource-background-writer">
      <title>Background Writer</title>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 4d88b45e72..0a958f4a45 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -21896,6 +21896,57 @@ SELECT (pg_stat_file('filename')).modification;
 
   </sect2>
 
+  <sect2 id="functions-data-sanity">
+   <title>Data Sanity Functions</title>
+
+   <para>
+    The functions shown in <xref linkend="functions-data-sanity-table"/>
+    provide means to check for health of data file in a cluster.
+   </para>
+
+   <table id="functions-data-sanity-table">
+    <title>Data Sanity Functions</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry>
+        <literal><function>pg_check_relation(<parameter>relation</parameter> <type>regclass</type> [, <parameter>fork</parameter> <type>text</type>])</function></literal>
+       </entry>
+       <entry><type>setof record</type></entry>
+       <entry>Validate the checksums for all blocks of a given relation,
+       optionally restricted to the given fork.</entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <indexterm>
+    <primary>pg_check_relation</primary>
+   </indexterm>
+   <para id="functions-check-relation-note" xreflabel="pg_check_relation">
+    <function>pg_check_relation</function> iterates over all the blocks of a
+    given relation and verify their checksum.  If provided,
+    <replaceable>fork</replaceable> should be <literal>'main'</literal> for the
+    main data fork, <literal>'fsm'</literal> for the free space map,
+    <literal>'vm'</literal> for the visibility map, or
+    <literal>'init'</literal> for the initialization fork, and only this
+    specific fork will be verifies, otherwise all forks will.  The function
+    returns the list of blocks for which the found checksum doesn't match the
+    expected one.  See <xref
+    linkend="runtime-config-resource-checksum-verification-cost"/> for
+    information on how to configure cost-based verification delay.  You must be
+    a member of the <literal>pg_read_all_stats</literal> role to use this
+    function.  It can only be used if data checksums are enabled.  See <xref
+    linkend="app-initdb-data-checksums"/> for more information.
+   </para>
+
+  </sect2>
+
   </sect1>
 
   <sect1 id="functions-trigger">
diff --git a/src/backend/storage/page/checksum.c b/src/backend/storage/page/checksum.c
index e010691c9f..eb2c919c34 100644
--- a/src/backend/storage/page/checksum.c
+++ b/src/backend/storage/page/checksum.c
@@ -15,8 +15,320 @@
 
 #include "storage/checksum.h"
 /*
- * The actual code is in storage/checksum_impl.h.  This is done so that
- * external programs can incorporate the checksum code by #include'ing
- * that file from the exported Postgres headers.  (Compare our CRC code.)
+ * The actual checksum computation code is in storage/checksum_impl.h.  This
+ * is done so that external programs can incorporate the checksum code by
+ * #include'ing that file from the exported Postgres headers.  (Compare our
+ * CRC code.)
  */
 #include "storage/checksum_impl.h"
+
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/buf_internals.h"
+#include "storage/lmgr.h"
+
+/* ----------------
+ * The rest of this module provides a set of functions that can be used to
+ * safely check all checksums on a running cluster.
+ *
+ * Please note that those only perform standard buffered reads, and don't try
+ * to bypass or discard the operating system cache.  If you want to check the
+ * actual storage, you have to discard the operating system cache before
+ * running those functions.
+ *
+ * To avoid torn pages and possible false positives when reading data, while
+ * keeping overhead as low as possible, the following heuristics are used:
+ *
+ * - a shared LWLock is taken on the target buffer pool partition mapping, and
+ *   we detect if a block is in shared_buffers or not.  See check_get_buffer()
+ *   comments for more details about the locking strategy.
+ *
+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
+ *
+ * - if a block is otherwise found in shared_buffers, an IO lock is taken on
+ *   the block and the block is then read from storage, ignoring the block in
+ *   shared_buffers
+ *
+ * - if a block is not found in shared_buffers, the LWLock is released and the
+ *   block is read from disk without taking any lock.  If an error is detected,
+ *   the read block will be discarded and retrieved again while holding the
+ *   LWLock.  This is because an error due to concurrent write is possible but
+ *   very unlikely, so it's better to have an optimistic approach to limit
+ *   locking overhead
+ *
+ * The check can be performed using an SQL function, returning the list of
+ * problematic blocks.
+ * ----------------
+ */
+
+static bool check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+							   uint16 *chk_found);
+static void check_delay_point(void);
+static bool check_get_buffer(Relation relation, ForkNumber forknum,
+							 BlockNumber blkno, char *buffer, bool needlock,
+							 bool *found_in_sb);
+
+/*
+ * Check data sanity for a specific block in the given fork of the given
+ * relation, always retrieved locally with smgrread even if a version exists in
+ * shared_buffers.  Returns false if the block appears to be corrupted, true
+ * otherwise.  Note that dirty and invalid blocks won't be checked.  Caller
+ * must hold at least an AccessShareLock on the relation.
+ */
+bool
+check_one_block(Relation relation, ForkNumber forknum, BlockNumber blkno,
+				uint16 *chk_expected, uint16 *chk_found)
+{
+	char		buffer[BLCKSZ];
+	bool		force_lock = false;
+	bool		found_in_sb;
+
+	Assert(CheckRelationLockedByMe(relation, AccessShareLock, true));
+	Assert(blkno < RelationGetNumberOfBlocksInFork(relation, forknum));
+	Assert(smgrexists(relation->rd_smgr, forknum));
+
+	*chk_expected = *chk_found = NoComputedChecksum;
+
+	/*
+	 * To avoid too much overhead, the buffer is first read without the
+	 * locks that would prevent false positives, as such events should be
+	 * quite rare.
+	 */
+Retry:
+	if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+						  &found_in_sb))
+		return true;
+
+	if (check_buffer(buffer, blkno, chk_expected, chk_found))
+		return true;
+
+	/*
+	 * If we get a failure and the buffer wasn't found in shared buffers,
+	 * reread the buffer with a suitable lock to avoid a false positive.  See
+	 * check_get_buffer for more details.
+	 */
+	if (!found_in_sb && !force_lock)
+	{
+		force_lock = true;
+		goto Retry;
+	}
+
+	/* A corruption is detected. */
+	return false;
+}
+
+/*
+ * Perform a checksum check on the passed page.  Returns whether the page is
+ * valid or not, and assigns the expected and found checksums to chk_expected
+ * and chk_found, respectively.  Note that a page can look new but still be
+ * the result of a corruption.  We still check for this case, but we can't
+ * compute its checksum as pg_checksum_page() is explicitly checking for
+ * non-new pages, so NoComputedChecksum will be set in chk_found.
+ */
+static bool
+check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+				   uint16 *chk_found)
+{
+	Page		page = (Page) buffer;
+	PageHeader	hdr = (PageHeader) page;
+
+	Assert(chk_expected && chk_found);
+
+	if (PageIsNew(page))
+	{
+		/*
+		 * Check if the page is really new or if there's a corruption that
+		 * affected PageIsNew detection.  Note that PageIsVerified won't try to
+		 * detect checksum corruption in this case, so there's no risk of
+		 * duplicated corruption report.
+		 */
+		if (PageIsVerified(page, blkno))
+		{
+			/* No corruption. */
+			return true;
+		}
+
+		/*
+		 * There's a corruption, but since it affects PageIsNew, we
+		 * can't compute a checksum, so set NoComputedChecksum for the
+		 * expected checksum.
+		 */
+		*chk_expected = NoComputedChecksum;
+		*chk_found = hdr->pd_checksum;
+		return false;
+	}
+
+	*chk_expected = pg_checksum_page(buffer, blkno);
+	*chk_found = hdr->pd_checksum;
+
+	return (*chk_expected == *chk_found);
+}
+
+/*
+ * Check for interrupts and cost-based delay.
+ */
+static void
+check_delay_point(void)
+{
+	/* Always check for interrupts */
+	CHECK_FOR_INTERRUPTS();
+
+	if (!ChecksumCostActive || InterruptPending)
+		return;
+
+	/* Nap if appropriate */
+	if (ChecksumCostBalance >= ChecksumCostLimit)
+	{
+		int			msec;
+
+		msec = ChecksumCostDelay * ChecksumCostBalance / ChecksumCostLimit;
+		if (msec > ChecksumCostDelay * 4)
+			msec = ChecksumCostDelay * 4;
+
+		pg_usleep(msec * 1000L);
+
+		ChecksumCostBalance = 0;
+
+		/* Might have gotten an interrupt while sleeping */
+		CHECK_FOR_INTERRUPTS();
+	}
+}
+
+/*
+ *-------------------------
+ * Safely read the wanted buffer from disk, dealing with possible concurrency
+ * issues.  Note that if a buffer is found dirty in shared_buffers, no read
+ * will be performed and the caller will be informed that no check should be
+ * done.  We can safely ignore such buffers as they'll be written before the
+ * next checkpoint's completion.
+ *
+ * The following locks can be used in this function:
+ *
+ *   - shared LWLock on the target buffer pool partition mapping.
+ *   - IOLock on the buffer
+ *
+ * The IOLock is taken when reading the buffer from disk if it exists in
+ * shared_buffers, to avoid torn pages.
+ *
+ * If the buffer isn't in shared_buffers, it'll be read from disk without any
+ * lock unless the caller asked otherwise by setting needlock.  In that case,
+ * the read is done while the buffer mapping partition LWLock is still being
+ * held.  Reading while holding this lock avoids the unlikely but possible
+ * case that a buffer wasn't present in shared buffers when we checked, but
+ * was then allocated in shared_buffers, modified and flushed concurrently by
+ * the time we later try to read it, leading to a false positive due to a
+ * torn page.  The caller can first read the buffer without holding the
+ * target buffer mapping partition LWLock as an optimistic approach, and
+ * reread the buffer from disk in case of error.
+ *
+ * Caller should hold an AccessShareLock on the Relation.
+ *-------------------------
+ */
+static bool
+check_get_buffer(Relation relation, ForkNumber forknum,
+				 BlockNumber blkno, char *buffer, bool needlock,
+				 bool *found_in_sb)
+{
+	bool		checkit = true;
+	BufferTag	buf_tag;		/* identity of requested block */
+	uint32		buf_hash;		/* hash value for buf_tag */
+	LWLock	   *partLock;		/* buffer partition lock for the buffer */
+	BufferDesc *bufdesc;
+	int			buf_id;
+
+	*found_in_sb = false;
+
+	/* Check for interrupts and take throttling into account. */
+	check_delay_point();
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(buf_tag, relation->rd_smgr->smgr_rnode.node, forknum, blkno);
+
+	/* determine its hash code and partition lock ID */
+	buf_hash = BufTableHashCode(&buf_tag);
+	partLock = BufMappingPartitionLock(buf_hash);
+
+	/* see if the block is in the buffer pool already */
+	LWLockAcquire(partLock, LW_SHARED);
+	buf_id = BufTableLookup(&buf_tag, buf_hash);
+	if (buf_id >= 0)
+	{
+		uint32		buf_state;
+
+		*found_in_sb = true;
+
+		/*
+		 * Found it.  Now, retrieve its state to know what to do with it,
+		 * releasing the buffer header lock immediately.  We do so to limit
+		 * overhead as much as possible.  We'll keep the shared lightweight
+		 * lock on the target buffer mapping partition, so this buffer
+		 * can't be evicted, and we'll acquire an IOLock on the buffer if
+		 * we need to read its content from disk.
+		 */
+		bufdesc = GetBufferDescriptor(buf_id);
+
+		buf_state = LockBufHdr(bufdesc);
+		UnlockBufHdr(bufdesc, buf_state);
+
+		/*
+		 * Dirty pages are ignored as they'll be flushed soon. Invalid buffers
+		 * are also skipped.
+		 */
+		if ((buf_state & BM_DIRTY) || !(buf_state & BM_TAG_VALID))
+			checkit = false;
+
+		/*
+		 * Read the buffer from disk, taking an IO lock to prevent torn-page
+		 * reads, in the unlikely event that it was concurrently dirtied and
+		 * flushed.
+		 */
+		if (checkit)
+		{
+			LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
+			smgrread(relation->rd_smgr, forknum, blkno, buffer);
+			LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
+
+			/* Add a page cost. */
+			ChecksumCostBalance += ChecksumCostPage;
+		}
+	}
+	else if (needlock)
+	{
+		/*
+		 * Caller asked to read the buffer while we have a lock on the target
+		 * partition.
+		 */
+		smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+		/* The buffer will have to be checked. */
+		Assert(checkit);
+
+		/* Add a page cost. */
+		ChecksumCostBalance += ChecksumCostPage;
+	}
+
+	LWLockRelease(partLock);
+
+	if (*found_in_sb || needlock)
+		return checkit;
+
+	/* After this point the buffer will always be checked. */
+	Assert(checkit);
+
+	/*
+	 * Didn't find it in the buffer pool and didn't read it while holding the
+	 * buffer mapping partition lock.  We'll have to try to read it from
+	 * disk, after releasing the target partition lock to avoid too much
+	 * overhead.  It means that it's possible to get a torn page later, so
+	 * we'll have to retry with a suitable lock in case of error to avoid
+	 * false positive.
+	 */
+	smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+	/* Add a page cost. */
+	ChecksumCostBalance += ChecksumCostPage;
+
+	return checkit;
+}
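The locking heuristics above can be condensed into a small executable model.  The sketch below is illustrative Python, not the server code: `storage` stands in for smgrread(), `pool` for shared_buffers, the `locked` flag for reading while holding the IO lock or the buffer mapping partition lock, and `page_checksum()` is an arbitrary stand-in for pg_checksum_page().

```python
def page_checksum(page: bytes) -> int:
    """Stand-in for pg_checksum_page(); any stable function works here."""
    return sum(page) & 0xFFFF

def check_one_block(storage, pool, blkno):
    """Return True if the block is valid or doesn't need checking."""
    entry = pool.get(blkno)
    if entry is not None:
        if entry["dirty"]:
            # Dirty in shared_buffers: it will be flushed soon, skip it.
            return True
        # In shared_buffers but clean: read from disk under the IO lock.
        page, stored = storage.read(blkno, locked=True)
        return page_checksum(page) == stored
    # Not in shared_buffers: optimistic read without any lock first ...
    page, stored = storage.read(blkno, locked=False)
    if page_checksum(page) == stored:
        return True
    # ... and only on failure reread under the mapping partition lock, so
    # that a torn concurrent write can't produce a false positive.
    page, stored = storage.read(blkno, locked=True)
    return page_checksum(page) == stored
```

Only the failing, not-in-cache path pays for a second read, which is the rationale for the optimistic approach described in the comments.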
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 13efa9338c..27dc524e00 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -22,6 +22,7 @@ OBJS = \
 	bool.o \
 	cash.o \
 	char.o \
+	checksumfuncs.o \
 	cryptohashes.o \
 	date.o \
 	datetime.o \
diff --git a/src/backend/utils/adt/checksumfuncs.c b/src/backend/utils/adt/checksumfuncs.c
new file mode 100644
index 0000000000..d005b8d01f
--- /dev/null
+++ b/src/backend/utils/adt/checksumfuncs.c
@@ -0,0 +1,217 @@
+/*-------------------------------------------------------------------------
+ *
+ * checksumfuncs.c
+ *	  Functions for checksum-related features such as online verification
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/checksumfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/relation.h"
+#include "catalog/pg_authid_d.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/bufmgr.h"
+#include "storage/checksum.h"
+#include "storage/smgr.h"
+#include "utils/acl.h"
+#include "utils/rel.h"
+#include "utils/builtins.h"
+#include "utils/lsyscache.h"
+
+static void check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+					 Oid relid, ForkNumber single_forknum);
+static void check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+									  Relation relation, ForkNumber forknum);
+static void pg_check_relation_internal(FunctionCallInfo fcinfo, Oid relid,
+									   Oid forknum);
+
+
+Datum
+pg_check_relation(PG_FUNCTION_ARGS)
+{
+	Oid			relid = InvalidOid;
+
+	pg_check_relation_internal(fcinfo, relid, InvalidForkNumber);
+
+	return (Datum) 0;
+}
+
+Datum
+pg_check_relation_fork(PG_FUNCTION_ARGS)
+{
+	Oid			relid = InvalidOid;
+	const char *forkname;
+	ForkNumber	forknum;
+
+	forkname = TextDatumGetCString(PG_GETARG_TEXT_PP(1));
+	forknum = forkname_to_number(forkname);
+
+	pg_check_relation_internal(fcinfo, relid, forknum);
+
+	return (Datum) 0;
+}
+
+/* Common code for all versions of pg_check_relation() */
+static void
+pg_check_relation_internal(FunctionCallInfo fcinfo, Oid relid, Oid forknum)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+
+	if (!DataChecksumsEnabled())
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("data checksums are not enabled in cluster")));
+
+	if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_STAT_SCAN_TABLES))
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("only superuser or a member of the pg_stat_scan_tables role may use this function")));
+
+	/* Switch into long-lived context to construct returned data structures */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	relid = PG_GETARG_OID(0);
+
+	/* Set cost-based checksum verification delay */
+	ChecksumCostActive = (ChecksumCostDelay > 0);
+	ChecksumCostBalance = 0;
+
+	check_one_relation(tupdesc, tupstore, relid, forknum);
+
+	tuplestore_donestoring(tupstore);
+}
+
+/*
+ * Perform the check on a single relation, possibly restricted to a single
+ * fork.  This function will check if the given relation exists or not, as
+ * a relation could be dropped after checking for the list of relations and
+ * before getting here, and we don't want to error out in this case.
+ */
+static void
+check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+					 Oid relid, ForkNumber single_forknum)
+{
+	Relation	relation = NULL;
+	ForkNumber	forknum;
+
+	relation = relation_open(relid, AccessShareLock);
+
+	/* sanity checks */
+	if (!RELKIND_HAS_STORAGE(relation->rd_rel->relkind))
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("relation \"%s\" does not have storage to be checked",
+				 RelationGetRelationName(relation))));
+
+	if (RELATION_IS_OTHER_TEMP(relation))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot verify temporary tables of other sessions")));
+
+	RelationOpenSmgr(relation);
+
+	for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+	{
+		if (single_forknum != InvalidForkNumber && single_forknum != forknum)
+			continue;
+
+		if (smgrexists(relation->rd_smgr, forknum))
+			check_relation_fork(tupdesc, tupstore, relation, forknum);
+	}
+
+	relation_close(relation, AccessShareLock);	/* release the lock */
+}
+
+/*
+ * For a given relation and fork, do the real work of iterating over all pages
+ * and checking them.  Caller must hold an AccessShareLock on the given
+ * relation.
+ */
+static void
+check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+						  Relation relation, ForkNumber forknum)
+{
+	BlockNumber blkno,
+				nblocks;
+
+	/*
+	 * We remember the number of blocks here.  Since caller must hold a lock on
+	 * the relation, we know that it won't be truncated while we're iterating
+	 * over the blocks.  Any block added after this function started won't be
+	 * checked, but this is out of scope as such pages will be flushed before
+	 * the next checkpoint's completion.
+	 */
+	nblocks = RelationGetNumberOfBlocksInFork(relation, forknum);
+
+#define PG_CHECK_RELATION_COLS			5	/* Number of output arguments in the SRF */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		uint16		chk_expected,
+					chk_found;
+		Datum		values[PG_CHECK_RELATION_COLS];
+		bool		nulls[PG_CHECK_RELATION_COLS];
+		int			i = 0;
+
+		/* Check the given buffer */
+		if (check_one_block(relation, forknum, blkno, &chk_expected,
+							&chk_found))
+		{
+			/* Buffer not corrupted or not worth checking, continue */
+			continue;
+		}
+
+		memset(values, 0, sizeof(values));
+		memset(nulls, 0, sizeof(nulls));
+
+		values[i++] = ObjectIdGetDatum(relation->rd_id);
+		values[i++] = Int32GetDatum(forknum);
+		values[i++] = UInt32GetDatum(blkno);
+		/*
+		 * This can happen if a corruption makes the block appear as
+		 * PageIsNew() while it isn't a new page.
+		 */
+		if (chk_expected == NoComputedChecksum)
+			nulls[i++] = true;
+		else
+			values[i++] = UInt16GetDatum(chk_expected);
+		values[i++] = UInt16GetDatum(chk_found);
+
+		Assert(i == PG_CHECK_RELATION_COLS);
+
+		/* Report the failure to the stat collector and the logs. */
+		pgstat_report_checksum_failure();
+		ereport(WARNING,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("invalid page in block %u of relation %s",
+						blkno,
+						relpath(relation->rd_smgr->smgr_rnode, forknum))));
+
+		/* Save the corrupted blocks in the tuplestore. */
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+}
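For readers following the SRF plumbing above, here is a rough Python model of the rows that check_relation_fork() materializes into the tuplestore: one row per corrupted block, with a NULL expected checksum when the corruption makes the page look new (the PageIsNew() case).  The column names mirror the SQL function's output; the "blocks" input is an invented stand-in for iterating over a relation fork.

```python
NO_COMPUTED_CHECKSUM = None     # surfaces as a SQL NULL in the result set

def check_relation_fork(relid, forknum, blocks):
    """Yield (relid, forknum, failed_blocknum, expected_checksum,
    found_checksum) for each corrupted block."""
    for blkno, (valid, expected, found) in enumerate(blocks):
        if valid:
            # Healthy blocks (and dirty or truly-new ones) produce no row.
            continue
        yield (relid, forknum, blkno, expected, found)
```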
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..8683343619 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -134,6 +134,13 @@ int			max_worker_processes = 8;
 int			max_parallel_workers = 8;
 int			MaxBackends = 0;
 
+int			ChecksumCostPage = 10;	/* GUC parameters for checksum check */
+int			ChecksumCostLimit = 200;
+double		ChecksumCostDelay = 0;
+
+int			ChecksumCostBalance = 0;	/* working state for checksums check */
+bool		ChecksumCostActive = false;
+
 int			VacuumCostPageHit = 1;	/* GUC parameters for vacuum */
 int			VacuumCostPageMiss = 10;
 int			VacuumCostPageDirty = 20;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 477af5d552..b2af8083f6 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -695,6 +695,8 @@ const char *const config_group_names[] =
 	gettext_noop("Resource Usage / Disk"),
 	/* RESOURCES_KERNEL */
 	gettext_noop("Resource Usage / Kernel Resources"),
+	/* RESOURCES_CHECKSUM_DELAY */
+	gettext_noop("Resource Usage / Cost-Based Checksum Verification Delay"),
 	/* RESOURCES_VACUUM_DELAY */
 	gettext_noop("Resource Usage / Cost-Based Vacuum Delay"),
 	/* RESOURCES_BGWRITER */
@@ -2389,6 +2391,26 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"checksum_cost_page", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
			gettext_noop("Checksum cost for verifying a page."),
+			NULL
+		},
+		&ChecksumCostPage,
+		10, 0, 10000,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"checksum_cost_limit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost amount available before napping."),
+			NULL
+		},
+		&ChecksumCostLimit,
+		200, 1, 10000,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_cost_page_hit", PGC_USERSET, RESOURCES_VACUUM_DELAY,
 			gettext_noop("Vacuum cost for a page found in the buffer cache."),
@@ -3553,6 +3575,17 @@ static struct config_real ConfigureNamesReal[] =
 		check_random_seed, assign_random_seed, show_random_seed
 	},
 
+	{
+		{"checksum_cost_delay", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost delay in milliseconds."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&ChecksumCostDelay,
+		0, 0, 100,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_cost_delay", PGC_USERSET, RESOURCES_VACUUM_DELAY,
 			gettext_noop("Vacuum cost delay in milliseconds."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 91fa185053..acb5baefae 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -156,6 +156,12 @@
 #max_files_per_process = 1000		# min 64
 					# (change requires restart)
 
+# - Cost-Based Checksum Verification Delay -
+
+#checksum_cost_delay = 0			# 0-100 milliseconds (0 disables)
+#checksum_cost_page = 10		# 0-10000 credits
+#checksum_cost_limit = 200		# 1-10000 credits
+
 # - Cost-Based Vacuum Delay -
 
 #vacuum_cost_delay = 0			# 0-100 milliseconds (0 disables)
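The three checksum_cost_* settings above interact exactly like the vacuum cost delay machinery: each verified page adds checksum_cost_page credits to a balance, and once the balance reaches checksum_cost_limit the backend naps for up to four times checksum_cost_delay.  The following is a minimal Python sketch of that accounting, an illustration of check_delay_point()'s logic rather than the server code, returning the nap duration instead of sleeping:

```python
class Throttle:
    """Model of the cost-based checksum verification delay."""

    def __init__(self, cost_page=10, cost_limit=200, cost_delay=2.0):
        self.cost_page = cost_page      # credits charged per verified page
        self.cost_limit = cost_limit    # credits accumulated before napping
        self.cost_delay = cost_delay    # base nap in ms, 0 disables throttling
        self.balance = 0

    def charge_page(self):
        self.balance += self.cost_page

    def delay_point(self):
        """Return how many milliseconds to nap, clamped to 4x the base delay."""
        if self.cost_delay <= 0 or self.balance < self.cost_limit:
            return 0
        msec = min(self.cost_delay * self.balance / self.cost_limit,
                   self.cost_delay * 4)
        self.balance = 0
        return msec
```

With the defaults shown above (10 credits per page, limit 200), the checker naps after every 20 pages, so lowering checksum_cost_delay to 0 removes throttling entirely while raising it spreads the I/O out over time.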
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index a649e44d08..6adc3b04c9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10825,6 +10825,22 @@
   proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
   proargnames => '{tablespace,name,size,modification}',
   prosrc => 'pg_ls_tmpdir_1arg' },
+{ oid => '9147', descr => 'check data integrity for all forks of a relation',
+  proname => 'pg_check_relation', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass',
+  proallargtypes => '{regclass,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,o,o,o,o,o}',
+  proargnames => '{relation,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation' },
+{ oid => '9148', descr => 'check data integrity for one fork of a relation',
+  proname => 'pg_check_relation', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,i,o,o,o,o,o}',
+  proargnames => '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation_fork' },
 
 # hash partitioning constraint function
 { oid => '5028', descr => 'hash partition CHECK constraint',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14fa127ab1..8d4605efc8 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -246,6 +246,13 @@ extern PGDLLIMPORT int work_mem;
 extern PGDLLIMPORT int maintenance_work_mem;
 extern PGDLLIMPORT int max_parallel_maintenance_workers;
 
+extern int	ChecksumCostPage;
+extern int	ChecksumCostLimit;
+extern double ChecksumCostDelay;
+
+extern int	ChecksumCostBalance;
+extern bool ChecksumCostActive;
+
 extern int	VacuumCostPageHit;
 extern int	VacuumCostPageMiss;
 extern int	VacuumCostPageDirty;
diff --git a/src/include/storage/checksum.h b/src/include/storage/checksum.h
index 6e77744cbc..08aa0949a8 100644
--- a/src/include/storage/checksum.h
+++ b/src/include/storage/checksum.h
@@ -13,8 +13,21 @@
 #ifndef CHECKSUM_H
 #define CHECKSUM_H
 
+#include "postgres.h"
+
+#include "access/tupdesc.h"
+#include "common/relpath.h"
 #include "storage/block.h"
+#include "utils/relcache.h"
+#include "utils/tuplestore.h"
+
+/* A zero checksum can never be computed, see pg_checksum_page() */
+#define NoComputedChecksum	0
 
+extern bool check_one_block(Relation relation, ForkNumber forknum,
+							BlockNumber blkno, uint16 *chk_expected,
+							uint16 *chk_found);
 /*
  * Compute the checksum for a Postgres page.  The page must be aligned on a
  * 4-byte boundary.
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 454c2df487..84965d3f09 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -62,6 +62,7 @@ enum config_group
 	RESOURCES_MEM,
 	RESOURCES_DISK,
 	RESOURCES_KERNEL,
+	RESOURCES_CHECKSUM_DELAY,
 	RESOURCES_VACUUM_DELAY,
 	RESOURCES_BGWRITER,
 	RESOURCES_ASYNCHRONOUS,
diff --git a/src/test/Makefile b/src/test/Makefile
index efb206aa75..832868099f 100644
--- a/src/test/Makefile
+++ b/src/test/Makefile
@@ -12,7 +12,8 @@ subdir = src/test
 top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS = perl regress isolation modules authentication recovery subscription
+SUBDIRS = perl regress isolation modules authentication check_relation \
+	  recovery subscription
 
 # Test suites that are not safe by default but can be run if selected
 # by the user via the whitespace-separated list in variable
diff --git a/src/test/check_relation/.gitignore b/src/test/check_relation/.gitignore
new file mode 100644
index 0000000000..871e943d50
--- /dev/null
+++ b/src/test/check_relation/.gitignore
@@ -0,0 +1,2 @@
+# Generated by test suite
+/tmp_check/
diff --git a/src/test/check_relation/Makefile b/src/test/check_relation/Makefile
new file mode 100644
index 0000000000..792098fa65
--- /dev/null
+++ b/src/test/check_relation/Makefile
@@ -0,0 +1,23 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/test/check_relation
+#
+# Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/test/check_relation/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/test/check_relation
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
+clean distclean maintainer-clean:
+	rm -rf tmp_check
diff --git a/src/test/check_relation/README b/src/test/check_relation/README
new file mode 100644
index 0000000000..415c4b21ad
--- /dev/null
+++ b/src/test/check_relation/README
@@ -0,0 +1,23 @@
+src/test/check_relation/README
+
+Regression tests for online checksums verification
+==================================================
+
+This directory contains a test suite for online checksums verification.
+
+Running the tests
+=================
+
+NOTE: You must have given the --enable-tap-tests argument to configure.
+
+Run
+    make check
+or
+    make installcheck
+You can use "make installcheck" if you previously did "make install".
+In that case, the code in the installation tree is tested.  With
+"make check", a temporary installation tree is built from the current
+sources and then tested.
+
+Either way, this test initializes, starts, and stops a test Postgres
+cluster.
diff --git a/src/test/check_relation/t/01_checksums_check.pl b/src/test/check_relation/t/01_checksums_check.pl
new file mode 100644
index 0000000000..1ad34adcb9
--- /dev/null
+++ b/src/test/check_relation/t/01_checksums_check.pl
@@ -0,0 +1,276 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 59;
+
+our $CHECKSUM_UINT16_OFFSET = 4;
+our $PD_UPPER_UINT16_OFFSET = 7;
+our $BLOCKSIZE;
+our $TOTAL_NB_ERR = 0;
+
+sub get_block
+{
+	my ($filename, $blkno) = @_;
+	my $block;
+
+	open(my $infile, '<', $filename) or die;
+	binmode($infile);
+
+	# Seek to the wanted block: the fourth argument of read()/sysread() is
+	# an offset into the scalar, not a file position.
+	sysseek($infile, $blkno * $BLOCKSIZE, 0) or die($!);
+	my $success = sysread($infile, $block, $BLOCKSIZE);
+	die($!) if not defined $success;
+
+	close($infile);
+
+	return($block);
+}
+
+sub overwrite_block
+{
+	my ($filename, $block, $blkno) = @_;
+
+	# Open the file for update: opening with '>' would truncate it.
+	open(my $outfile, '+<', $filename) or die;
+	binmode ($outfile);
+
+	sysseek($outfile, $blkno * $BLOCKSIZE, 0) or die($!);
+	my $nb = syswrite($outfile, $block, $BLOCKSIZE);
+
+	die($!) if not defined $nb;
+	die("Write error") if ($nb != $BLOCKSIZE);
+
+	close($outfile);
+}
+
+sub get_uint16_from_page
+{
+	my ($block, $offset) = @_;
+
+	return (unpack("S*", $block))[$offset];
+}
+
+sub set_uint16_to_page
+{
+	my ($block, $data, $offset) = @_;
+
+	my $pack = pack("S", $data);
+
+	# vec() with 16-bit or wider elements won't preserve endianness
+	vec($block, 2*$offset, 8) = (unpack('C*', $pack))[0];
+	vec($block, (2*$offset) + 1, 8) = (unpack('C*', $pack))[1];
+
+	return $block;
+}
+
+sub check_checksums_call
+{
+	my ($node, $relname) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM pg_catalog.pg_check_relation('$relname')"
+	);
+
+	return ($stderr eq '');
+}
+
+sub check_checksums_nb_error
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM (SELECT pg_catalog.pg_check_relation(oid, 'main')"
+		. "   FROM pg_class WHERE relkind IN ('r', 'i', 'm')) AS s"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+	is($stdout, $nb, "Should have $nb error(s)");
+
+	$TOTAL_NB_ERR += $nb;
+}
+
+sub check_pg_stat_database_nb_error
+{
+	my ($node) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT "
+		. " sum(checksum_failures)"
+		. " FROM pg_catalog.pg_stat_database"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	is($stderr, '', 'Function should run successfully');
+	is($stdout, $TOTAL_NB_ERR, "Should have $TOTAL_NB_ERR error(s)");
+}
+
+sub get_checksums_errors
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+		. " relid::regclass::text, forknum, failed_blocknum,"
+		. " expected_checksum, found_checksum"
+		. " FROM (SELECT (pg_catalog.pg_check_relation(oid)).*"
+		. "   FROM pg_class WHERE relkind IN ('r', 'i', 'm')) AS s"
+	);
+
+	is($cmdret, '0', 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+
+	$TOTAL_NB_ERR += $nb;
+
+	return $stdout;
+}
+
+# This function performs various tests by modifying the specified block at
+# the specified uint16 offset, checking that the corruption is correctly
+# detected, and finally restoring the block to its original content.
+sub corrupt_and_test_block
+{
+	my ($node, $filename, $blkno, $offset, $fake_data) = @_;
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+
+	check_pg_stat_database_nb_error($node);
+
+	$node->stop();
+
+	my $original_block = get_block($filename, $blkno);
+	my $original_data = get_uint16_from_page($original_block, $offset);
+
+	isnt($original_data, $fake_data,
+		"The fake data at offset $offset should be different"
+		. " from the existing one");
+
+	my $new_block = set_uint16_to_page($original_block, $fake_data, $offset);
+	isnt($original_data, get_uint16_from_page($new_block, $offset),
+		"The fake data at offset $offset should have been changed in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+
+	my $written_data = get_uint16_from_page(get_block($filename, $blkno), $offset);
+	isnt($original_data, $written_data,
+		"The data written at offset $offset should be different"
+		. " from the original one");
+	is(get_uint16_from_page($new_block, $offset), $written_data,
+		"The data written at offset $offset should be the same"
+		. " as the one in memory");
+	is($written_data, $fake_data,
+		"The data written at offset $offset should be the one"
+		. " we wanted to write");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 1, qr/invalid page in block $blkno/);
+
+	my $expected_checksum;
+	my $found_checksum = get_uint16_from_page($new_block,
+		$CHECKSUM_UINT16_OFFSET);
+	if ($offset == $PD_UPPER_UINT16_OFFSET)
+	{
+		# A checksum can't be computed if it's detected as PageIsNew(), so the
+		# function returns NULL for the computed checksum
+		$expected_checksum = '';
+	}
+	else
+	{
+		$expected_checksum = get_uint16_from_page($original_block,
+			$CHECKSUM_UINT16_OFFSET);
+	}
+
+	my $det = get_checksums_errors($node, 1, qr/invalid page in block $blkno/);
+	is($det, "t1|0|0|$expected_checksum|$found_checksum",
+		"The checksums error for modification at offset $offset"
+		. " should be detected");
+
+	$node->stop();
+
+	$new_block = set_uint16_to_page($original_block, $original_data, $offset);
+	is($original_data, get_uint16_from_page($new_block, $offset),
+		"The data at offset $offset should have been restored in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+	is($original_data, get_uint16_from_page(get_block($filename, $blkno),
+			$offset),
+		"The data at offset $offset should have been restored on disk");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+}
+
+if (exists $ENV{MY_PG_REGRESS})
+{
+	$ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS};
+}
+
+my $node = get_new_node('main');
+
+my %params;
+$params{'extra'} = ['--data-checksums'];
+$node->init(%params);
+
+$node->start();
+
+$ENV{PGOPTIONS} = '--client-min-messages=WARNING';
+
+my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_checksums')");
+
+is($stdout, 'on', 'Data checksums should be enabled');
+
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('block_size')");
+
+$BLOCKSIZE = $stdout;
+
+$node->safe_psql(
+	'postgres', q|
+	CREATE TABLE public.t1(id integer);
+	CREATE INDEX t1_id_idx ON public.t1 (id);
+	INSERT INTO public.t1 SELECT generate_series(1, 100);
+	CREATE VIEW public.v1 AS SELECT * FROM t1;
+	CREATE MATERIALIZED VIEW public.mv1 AS SELECT * FROM t1;
+	CREATE SEQUENCE public.s1;
+	CREATE UNLOGGED TABLE public.u_t1(id integer);
+	CREATE INDEX u_t1_id_idx ON public.u_t1 (id);
+	INSERT INTO public.u_t1 SELECT generate_series(1, 100);
+	CHECKPOINT;
+|);
+
+# Check sane behavior on various object types, including those that don't have
+# storage.
+is(check_checksums_call($node, 't1'), '1', 'Can check a table');
+is(check_checksums_call($node, 't1_id_idx'), '1', 'Can check an index');
+is(check_checksums_call($node, 'v1'), '', 'Cannot check a view');
+is(check_checksums_call($node, 'mv1'), '1', 'Can check a materialized view');
+is(check_checksums_call($node, 's1'), '1', 'Can check a sequence');
+is(check_checksums_call($node, 'u_t1'), '1', 'Can check an unlogged table');
+is(check_checksums_call($node, 'u_t1_id_idx'), '1', 'Can check an unlogged index');
+
+# get the underlying heap absolute path
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_directory') || '/' || pg_relation_filepath('t1')"
+);
+
+isnt($stdout, '', 'A relfinode should be returned');
+
+my $filename = $stdout;
+
+check_checksums_nb_error($node, 0, qr/^$/);
+
+check_pg_stat_database_nb_error($node);
+
+my $fake_uint16 = hex '0x0000';
+
+# Test with a modified checksum.  We use a zero checksum here as it's the only
+# one that cannot exist on a checksummed page.  We also don't have an easy way
+# to compute what the checksum would be after a modification in a random place
+# in the block.
+corrupt_and_test_block($node, $filename, 0, $CHECKSUM_UINT16_OFFSET,
+	$fake_uint16);
+
+# Test corruption making the block look like it's PageIsNew().
+corrupt_and_test_block($node, $filename, 0, $PD_UPPER_UINT16_OFFSET,
+	$fake_uint16);
-- 
2.20.1

#29Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: Julien Rouhaud (#28)
Re: Online checksums verification in the backend

On Sun, 5 Apr 2020 at 18:45, Julien Rouhaud <rjuju123@gmail.com> wrote:

On Sun, Apr 05, 2020 at 06:08:06PM +0900, Masahiko Sawada wrote:

Why do we need two rows in the doc? For instance, replication slot
functions have some optional arguments but there is only one row in
the doc. So I think we don't need to change the doc from the previous
version patch.

I thought that if we document the function as pg_check_relation(regclass [,
fork]) users could think that the 2nd argument is optional, so that
pg_check_relation('something', NULL) could be a valid alias for the 1-argument
form, which it isn't. After checking, I see that e.g. current_setting has the
same semantics and is documented the way you suggest, so fixed back to previous
version.

And I think these are not necessary as we already defined in
include/catalog/pg_proc.dat:

+CREATE OR REPLACE FUNCTION pg_check_relation(
+  IN relation regclass,
+  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+  OUT expected_checksum integer, OUT found_checksum integer)
+  RETURNS SETOF record STRICT VOLATILE LANGUAGE internal AS 'pg_check_relation'
+  PARALLEL RESTRICTED;
+
+CREATE OR REPLACE FUNCTION pg_check_relation(
+  IN relation regclass, IN fork text,
+  OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint,
+  OUT expected_checksum integer, OUT found_checksum integer)
+  RETURNS SETOF record STRICT VOLATILE LANGUAGE internal
+  AS 'pg_check_relation_fork'
+  PARALLEL RESTRICTED;

Oh right this isn't required since there's no default value anymore, fixed.

v9 attached.

Thank you for updating the patch! The patch looks good to me.

I've marked this patch as Ready for Committer. I hope this patch will
get committed to PG13.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#30Julien Rouhaud
rjuju123@gmail.com
In reply to: Masahiko Sawada (#29)
Re: Online checksums verification in the backend

On Sun, Apr 05, 2020 at 08:01:36PM +0900, Masahiko Sawada wrote:

Thank you for updating the patch! The patch looks good to me.

I've marked this patch as Ready for Committer. I hope this patch will
get committed to PG13.

Thanks a lot!

#31Daniel Gustafsson
daniel@yesql.se
In reply to: Julien Rouhaud (#30)
Re: Online checksums verification in the backend

On 5 Apr 2020, at 13:17, Julien Rouhaud <rjuju123@gmail.com> wrote:
On Sun, Apr 05, 2020 at 08:01:36PM +0900, Masahiko Sawada wrote:

Thank you for updating the patch! The patch looks good to me.

I've marked this patch as Ready for Committer. I hope this patch will
get committed to PG13.

Thanks a lot!

This patch has been through quite thorough review, and skimming the thread, all
concerns raised have been addressed. It still applies and tests green in the CF
Patchtester. The feature in itself certainly gets my +1 for inclusion; it
seems a good addition.

Is any committer who has taken part in the thread (or anyone else for that
matter) interested in seeing this through to some form of closure in this CF?

cheers ./daniel

#32Justin Pryzby
pryzby@telsasoft.com
In reply to: Julien Rouhaud (#28)
1 attachment(s)
Re: Online checksums verification in the backend

Small language fixes in comments and user-facing docs.

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 88efb38556..39596db193 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26162,7 +26162,7 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
    <para>
     The functions shown in <xref linkend="functions-data-sanity-table"/>
-    provide means to check for health of data file in a cluster.
+    provide a means to check for health of a data file in a cluster.
    </para>
    <table id="functions-data-sanity-table">
@@ -26179,8 +26179,8 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
         <literal><function>pg_check_relation(<parameter>relation</parameter> <type>regclass</type> [, <parameter>fork</parameter> <type>text</type>])</function></literal>
        </entry>
        <entry><type>setof record</type></entry>
-       <entry>Validate the checksums for all blocks of all or the given fork of
-       a given relation.</entry>
+       <entry>Validate the checksum for all blocks of a relation.
+       </entry>
       </row>
      </tbody>
     </tgroup>
@@ -26190,15 +26190,15 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
     <primary>pg_check_relation</primary>
    </indexterm>
    <para id="functions-check-relation-note" xreflabel="pg_check_relation">
-    <function>pg_check_relation</function> iterates over all the blocks of a
-    given relation and verify their checksum.  If provided,
-    <replaceable>fork</replaceable> should be <literal>'main'</literal> for the
+    <function>pg_check_relation</function> iterates over all blocks of a
+    given relation and verifies their checksums.  If passed,
+    <replaceable>fork</replaceable> specifies that only checksums of the given
+    fork are to be verified.  Fork should be <literal>'main'</literal> for the
     main data fork, <literal>'fsm'</literal> for the free space map,
     <literal>'vm'</literal> for the visibility map, or
-    <literal>'init'</literal> for the initialization fork, and only this
-    specific fork will be verifies, otherwise all forks will.  The function
-    returns the list of blocks for which the found checksum doesn't match the
-    expected one.  See <xref
+    <literal>'init'</literal> for the initialization fork.
+    The function returns a list of blocks for which the computed and stored
+    checksums don't match.  See <xref
     linkend="runtime-config-resource-checksum-verification-cost"/> for
     information on how to configure cost-based verification delay.  You must be
     a member of the <literal>pg_read_all_stats</literal> role to use this
diff --git a/src/backend/storage/page/checksum.c b/src/backend/storage/page/checksum.c
index eb2c919c34..17cd95ec95 100644
--- a/src/backend/storage/page/checksum.c
+++ b/src/backend/storage/page/checksum.c
@@ -36,7 +36,7 @@
  * actual storage, you have to discard the operating system cache before
  * running those functions.
  *
- * To avoid torn page and possible false positive when reading data, and
+ * To avoid torn pages and possible false positives when reading data, and to
  * keeping overhead as low as possible, the following heuristics are used:
  *
  * - a shared LWLock is taken on the target buffer pool partition mapping, and
@@ -92,8 +92,8 @@ check_one_block(Relation relation, ForkNumber forknum, BlockNumber blkno,
 	*chk_expected = *chk_found = NoComputedChecksum;
 	/*
-	 * To avoid too much overhead, the buffer will be first read without
-	 * the locks that would guarantee the lack of false positive, as such
+	 * To avoid excessive overhead, the buffer will be first read without
+	 * the locks that would prevent false positives, as such
 	 * events should be quite rare.
 	 */
 Retry:
@@ -120,10 +120,10 @@ Retry:
 }
 /*
- * Perform a checksum check on the passed page.  Returns whether the page is
+ * Perform a checksum check on the passed page.  Return True iff the page is
  * valid or not, and assign the expected and found checksum in chk_expected and
  * chk_found, respectively.  Note that a page can look like new but could be
- * the result of a corruption.  We still check for this case, but we can't
+ * the result of corruption.  We still check for this case, but we can't
  * compute its checksum as pg_checksum_page() is explicitly checking for
  * non-new pages, so NoComputedChecksum will be set in chk_found.
  */
@@ -139,7 +139,7 @@ check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
 	if (PageIsNew(page))
 	{
 		/*
-		 * Check if the page is really new or if there's a corruption that
+		 * Check if the page is really new or if there's corruption that
 		 * affected PageIsNew detection.  Note that PageIsVerified won't try to
 		 * detect checksum corruption in this case, so there's no risk of
 		 * duplicated corruption report.
@@ -151,7 +151,7 @@ check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
 		}
 		/*
-		 * There's a corruption, but since this affect PageIsNew, we
+		 * There's corruption, but since this affects PageIsNew, we
 		 * can't compute a checksum, so set NoComputedChecksum for the
 		 * expected checksum.
 		 */
@@ -218,8 +218,8 @@ check_delay_point(void)
  * held.  Reading with this lock is to avoid the unlikely but possible case
  * that a buffer wasn't present in shared buffers when we checked but it then
  * alloc'ed in shared_buffers, modified and flushed concurrently when we
- * later try to read it, leading to false positive due to torn page.  Caller
- * can read first the buffer without holding the target buffer mapping
+ * later try to read it, leading to false positives due to a torn page.  Caller
+ * can first read the buffer without holding the target buffer mapping
  * partition LWLock to have an optimistic approach, and reread the buffer
  * from disk in case of error.
  *
@@ -280,7 +280,7 @@ check_get_buffer(Relation relation, ForkNumber forknum,
 			checkit = false;
 		/*
-		 * Read the buffer from disk, taking on IO lock to prevent torn-page
+		 * Read the buffer from disk, taking an IO lock to prevent torn-page
 		 * reads, in the unlikely event that it was concurrently dirtied and
 		 * flushed.
 		 */
@@ -320,7 +320,7 @@ check_get_buffer(Relation relation, ForkNumber forknum,
 	/*
 	 * Didn't find it in the buffer pool and didn't read it while holding the
 	 * buffer mapping partition lock.  We'll have to try to read it from
-	 * disk, after releasing the target partition lock to avoid too much
+	 * disk, after releasing the target partition lock to avoid excessive
 	 * overhead.  It means that it's possible to get a torn page later, so
 	 * we'll have to retry with a suitable lock in case of error to avoid
 	 * false positive.
diff --git a/src/backend/utils/adt/checksumfuncs.c b/src/backend/utils/adt/checksumfuncs.c
index d005b8d01f..fa5823677a 100644
--- a/src/backend/utils/adt/checksumfuncs.c
+++ b/src/backend/utils/adt/checksumfuncs.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * checksumfuncs.c
- *	  Functions for checksums related feature such as online verification
+ *	  Functions for checksum related feature such as online verification
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -181,7 +181,7 @@ check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
 		if (check_one_block(relation, forknum, blkno, &chk_expected,
 							&chk_found))
 		{
-			/* Buffer not corrupted or no worth checking, continue */
+			/* Buffer not corrupted or not worth checking, continue */
 			continue;
 		}
@@ -192,7 +192,7 @@ check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
 		values[i++] = Int32GetDatum(forknum);
 		values[i++] = UInt32GetDatum(blkno);
 		/*
-		 * This can happen if a corruption makes the block appears as
+		 * This can happen if corruption makes the block appears as
 		 * PageIsNew() but isn't a new page.
 		 */
 		if (chk_expected == NoComputedChecksum)
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5a51dccca9..57401580c3 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2383,7 +2383,7 @@ static struct config_int ConfigureNamesInt[] =
 	{
 		{"checksum_cost_page", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
-			gettext_noop("Checksum cost for verifying a page found."),
+			gettext_noop("Checksum cost for verifying a page."),
 			NULL
 		},
 		&ChecksumCostPage,
diff --git a/src/test/check_relation/t/01_checksums_check.pl b/src/test/check_relation/t/01_checksums_check.pl
index 1ad34adcb9..2a3f2880ea 100644
--- a/src/test/check_relation/t/01_checksums_check.pl
+++ b/src/test/check_relation/t/01_checksums_check.pl
@@ -218,7 +218,7 @@ $ENV{PGOPTIONS} = '--client-min-messages=WARNING';
 my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
 	. " current_setting('data_checksums')");
-is($stdout, 'on', 'Data checksums shoud be enabled');
+is($stdout, 'on', 'Data checksums should be enabled');

($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
. " current_setting('block_size')");
@@ -254,7 +254,7 @@ is(check_checksums_call($node, 'u_t1_id_idx'), '1', 'Can check an unlogged index
. " current_setting('data_directory') || '/' || pg_relation_filepath('t1')"
);

-isnt($stdout, '', 'A relfinode should be returned');
+isnt($stdout, '', 'A relfilenode should be returned');

my $filename = $stdout;

--
2.17.0

Attachments:

0001-fixes-for-online-checksum-verification.patchtext/x-diff; charset=us-asciiDownload
From 7b9e61d640b5137dbcd0221dae0c293b705989ba Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Sun, 12 Jul 2020 12:25:34 -0500
Subject: [PATCH] fixes for online checksum verification

---
 doc/src/sgml/func.sgml                        | 20 ++++++++---------
 src/backend/storage/page/checksum.c           | 22 +++++++++----------
 src/backend/utils/adt/checksumfuncs.c         |  6 ++---
 src/backend/utils/misc/guc.c                  |  2 +-
 .../check_relation/t/01_checksums_check.pl    |  4 ++--
 5 files changed, 27 insertions(+), 27 deletions(-)


#33Julien Rouhaud
rjuju123@gmail.com
In reply to: Justin Pryzby (#32)
1 attachment(s)
Re: Online checksums verification in the backend

On Sun, Jul 12, 2020 at 12:34:03PM -0500, Justin Pryzby wrote:

Small language fixes in comments and user-facing docs.

Thanks a lot! I just fixed a small issue (see below), PFA updated v10.

diff --git a/src/backend/storage/page/checksum.c b/src/backend/storage/page/checksum.c
index eb2c919c34..17cd95ec95 100644
--- a/src/backend/storage/page/checksum.c
+++ b/src/backend/storage/page/checksum.c
@@ -36,7 +36,7 @@
* actual storage, you have to discard the operating system cache before
* running those functions.
*
- * To avoid torn page and possible false positive when reading data, and
+ * To avoid torn pages and possible false positives when reading data, and to
* keeping overhead as low as possible, the following heuristics are used:
*

Changed to "to keep".

Attachments:

v10-0001-Add-a-pg_check_relation-function.patchtext/plain; charset=us-asciiDownload
From ff8b384d3acdee2986543742c22af9de0335ff4f Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <jrouhaud@vmware.com>
Date: Mon, 4 Nov 2019 08:40:23 +0100
Subject: [PATCH v10] Add a pg_check_relation() function.

This function checks the validity of the checksums for all non-dirty blocks of
a given relation, and optionally a given fork, and returns the list of all
blocks that don't match, along with the expected and found checksums.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada, Justin Pryzby
Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  85 +++++
 doc/src/sgml/func.sgml                        |  51 +++
 src/backend/storage/page/checksum.c           | 318 +++++++++++++++++-
 src/backend/utils/adt/Makefile                |   1 +
 src/backend/utils/adt/checksumfuncs.c         | 217 ++++++++++++
 src/backend/utils/init/globals.c              |   7 +
 src/backend/utils/misc/guc.c                  |  33 ++
 src/backend/utils/misc/postgresql.conf.sample |   6 +
 src/include/catalog/pg_proc.dat               |  16 +
 src/include/miscadmin.h                       |   7 +
 src/include/storage/checksum.h                |  13 +
 src/include/utils/guc_tables.h                |   1 +
 src/test/Makefile                             |   3 +-
 src/test/check_relation/.gitignore            |   2 +
 src/test/check_relation/Makefile              |  23 ++
 src/test/check_relation/README                |  23 ++
 .../check_relation/t/01_checksums_check.pl    | 276 +++++++++++++++
 17 files changed, 1078 insertions(+), 4 deletions(-)
 create mode 100644 src/backend/utils/adt/checksumfuncs.c
 create mode 100644 src/test/check_relation/.gitignore
 create mode 100644 src/test/check_relation/Makefile
 create mode 100644 src/test/check_relation/README
 create mode 100644 src/test/check_relation/t/01_checksums_check.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b353c61683..f6a03ff931 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2038,6 +2038,91 @@ include_dir 'conf.d'
      </note>
     </sect2>
 
+    <sect2 id="runtime-config-resource-checksum-verification-cost">
+     <title>Cost-based Checksum Verification Delay</title>
+
+     <para>
+      During the execution of <xref linkend="functions-check-relation-note"/>
+      function, the system maintains an internal counter that keeps track of
+      the estimated cost of the various I/O operations that are performed.
+      When the accumulated cost reaches a limit (specified by
+      <varname>checksum_cost_limit</varname>), the process performing the
+      operation will sleep for a short period of time, as specified by
+      <varname>checksum_cost_delay</varname>. Then it will reset the counter
+      and continue execution.
+     </para>
+
+     <para>
+      This feature is disabled by default. To enable it, set the
+      <varname>checksum_cost_delay</varname> variable to a nonzero
+      value.
+     </para>
+
+     <variablelist>
+      <varlistentry id="guc-checksum-cost-delay" xreflabel="checksum_cost_delay">
+       <term><varname>checksum_cost_delay</varname> (<type>floating point</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_delay</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The amount of time that the process will sleep
+         when the cost limit has been exceeded.
+         If this value is specified without units, it is taken as milliseconds.
+         The default value is zero, which disables the cost-based checksum
+         verification delay feature.  Positive values enable cost-based
+         checksum verification.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-checksum-cost-page" xreflabel="checksum_cost_page">
+       <term><varname>checksum_cost_page</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_page</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The estimated cost for verifying a buffer, whether it's found in the
+         shared buffer cache or not. It represents the cost to lock the buffer
+         pool, lookup the shared hash table, read the content of the page from
+         disk and compute its checksum.  The default value is 10.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-checksum-cost-limit" xreflabel="checksum_cost_limit">
+       <term><varname>checksum_cost_limit</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_limit</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The accumulated cost that will cause the verification process to sleep.
+         The default value is 200.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
+
+     <note>
+      <para>
+       There are certain operations that hold critical locks and should
+       therefore complete as quickly as possible.  Cost-based checksum
+       verification delays do not occur during such operations.  Therefore it
+       is possible that the cost accumulates far higher than the specified
+       limit.  To avoid uselessly long delays in such cases, the actual delay
+       is calculated as <varname>checksum_cost_delay</varname> *
+       <varname>accumulated_balance</varname> /
+       <varname>checksum_cost_limit</varname> with a maximum of
+       <varname>checksum_cost_delay</varname> * 4.
+      </para>
+     </note>
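+
+     <para>
+      For example, with <varname>checksum_cost_delay</varname> set to
+      <literal>10</literal> milliseconds and
+      <varname>checksum_cost_limit</varname> left at its default of
+      <literal>200</literal>, a balance that reached <literal>1000</literal>
+      during such an operation would yield a computed delay of 50
+      milliseconds, which the cap reduces to the 40 millisecond maximum.
+     </para>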
+    </sect2>
+
     <sect2 id="runtime-config-resource-background-writer">
      <title>Background Writer</title>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index cc83d6652e..2b5cacac95 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26173,6 +26173,57 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
 
   </sect2>
 
+  <sect2 id="functions-data-sanity">
+   <title>Data Sanity Functions</title>
+
+   <para>
+    The functions shown in <xref linkend="functions-data-sanity-table"/>
+    provide a means to check the integrity of data files in a cluster.
+
+   <table id="functions-data-sanity-table">
+    <title>Data Sanity Functions</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry>
+        <literal><function>pg_check_relation(<parameter>relation</parameter> <type>regclass</type> [, <parameter>fork</parameter> <type>text</type>])</function></literal>
+       </entry>
+       <entry><type>setof record</type></entry>
+       <entry>Validate the checksum for all blocks of a relation.
+       </entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <indexterm>
+    <primary>pg_check_relation</primary>
+   </indexterm>
+   <para id="functions-check-relation-note" xreflabel="pg_check_relation">
+    <function>pg_check_relation</function> iterates over all blocks of a
+    given relation and verifies their checksums.  If passed,
+    <replaceable>fork</replaceable> specifies that only checksums of the given
+    fork are to be verified.  <replaceable>fork</replaceable> should be
+    <literal>'main'</literal> for the main data fork,
+    <literal>'fsm'</literal> for the free space map,
+    <literal>'vm'</literal> for the visibility map, or
+    <literal>'init'</literal> for the initialization fork.
+    The function returns the list of blocks for which the computed and stored
+    checksums don't match.  See <xref
+    linkend="runtime-config-resource-checksum-verification-cost"/> for
+    information on how to configure cost-based verification delay.  You must be
+    a member of the <literal>pg_stat_scan_tables</literal> role to use this
+    function.  It can only be used if data checksums are enabled.  See <xref
+    linkend="app-initdb-data-checksums"/> for more information.
+   </para>
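+
+   <para>
+    For example, the following reports any corrupted blocks in a
+    (hypothetical) table named <literal>test</literal>; when all checksums
+    are valid, no row is returned:
+<programlisting>
+SELECT * FROM pg_check_relation('test');
+ relid | forknum | failed_blocknum | expected_checksum | found_checksum
+-------+---------+-----------------+-------------------+----------------
+(0 rows)
+</programlisting>
+   </para>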
+
+  </sect2>
+
   </sect1>
 
   <sect1 id="functions-trigger">
diff --git a/src/backend/storage/page/checksum.c b/src/backend/storage/page/checksum.c
index e010691c9f..1f2590fe6c 100644
--- a/src/backend/storage/page/checksum.c
+++ b/src/backend/storage/page/checksum.c
@@ -15,8 +15,320 @@
 
 #include "storage/checksum.h"
 /*
- * The actual code is in storage/checksum_impl.h.  This is done so that
- * external programs can incorporate the checksum code by #include'ing
- * that file from the exported Postgres headers.  (Compare our CRC code.)
+ * The actual checksum computation code is in storage/checksum_impl.h.  This
+ * is done so that external programs can incorporate the checksum code by
+ * #include'ing that file from the exported Postgres headers.  (Compare our
+ * CRC code.)
  */
 #include "storage/checksum_impl.h"
+
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/buf_internals.h"
+#include "storage/lmgr.h"
+
+/* ----------------
+ * The rest of this module provides a set of functions that can be used to
+ * safely check all checksums on a running cluster.
+ *
+ * Please note that those only perform standard buffered reads, and don't try
+ * to bypass or discard the operating system cache.  If you want to check the
+ * actual storage, you have to discard the operating system cache before
+ * running those functions.
+ *
+ * To avoid torn pages and possible false positives when reading data, and to
+ * keep overhead as low as possible, the following heuristics are used:
+ *
+ * - a shared LWLock is taken on the target buffer pool partition mapping, and
+ *   we detect if a block is in shared_buffers or not.  See check_get_buffer()
+ *   comments for more details about the locking strategy.
+ *
+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of an unclean shutdown
+ *
+ * - if a block is otherwise found in shared_buffers, an IO lock is taken on
+ *   the block and the block is then read from storage, ignoring the block in
+ *   shared_buffers
+ *
+ * - if a block is not found in shared_buffers, the LWLock is released and the
+ *   block is read from disk without taking any lock.  If an error is detected,
+ *   the read block will be discarded and retrieved again while holding the
+ *   LWLock.  This is because an error due to concurrent write is possible but
+ *   very unlikely, so it's better to have an optimistic approach to limit
+ *   locking overhead
+ *
+ * The check can be performed using an SQL function, returning the list of
+ * problematic blocks.
+ * ----------------
+ */
+
+static bool check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+							   uint16 *chk_found);
+static void check_delay_point(void);
+static bool check_get_buffer(Relation relation, ForkNumber forknum,
+							 BlockNumber blkno, char *buffer, bool needlock,
+							 bool *found_in_sb);
+
+/*
+ * Check data sanity for a specific block in the given fork of the given
+ * relation, always retrieved locally with smgrread even if a version exists in
+ * shared_buffers.  Returns false if the block appears to be corrupted, true
+ * otherwise.  Note that dirty and invalid blocks won't be checked.  Caller
+ * must hold at least an AccessShareLock on the relation.
+ */
+bool
+check_one_block(Relation relation, ForkNumber forknum, BlockNumber blkno,
+				uint16 *chk_expected, uint16 *chk_found)
+{
+	char		buffer[BLCKSZ];
+	bool		force_lock = false;
+	bool		found_in_sb;
+
+	Assert(CheckRelationLockedByMe(relation, AccessShareLock, true));
+	Assert(blkno < RelationGetNumberOfBlocksInFork(relation, forknum));
+	Assert(smgrexists(relation->rd_smgr, forknum));
+
+	*chk_expected = *chk_found = NoComputedChecksum;
+
+	/*
+	 * To avoid excessive overhead, the buffer is first read without the
+	 * locks that would otherwise be needed to prevent false positives, as
+	 * such events should be quite rare.
+	 */
+Retry:
+	if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+						  &found_in_sb))
+		return true;
+
+	if (check_buffer(buffer, blkno, chk_expected, chk_found))
+		return true;
+
+	/*
+	 * If we get a failure and the buffer wasn't found in shared buffers,
+	 * reread the buffer with suitable lock to avoid false positive.  See
+	 * check_get_buffer for more details.
+	 */
+	if (!found_in_sb && !force_lock)
+	{
+		force_lock = true;
+		goto Retry;
+	}
+
+	/* A corruption is detected. */
+	return false;
+}
+
+/*
+ * Perform a checksum check on the passed page.  Return true if the page is
+ * valid, false otherwise, and assign the expected and found checksums to
+ * chk_expected and chk_found, respectively.  Note that a page can look like
+ * a new page but could be the result of corruption.  We still check for this
+ * case, but we can't compute its checksum as pg_checksum_page() explicitly
+ * refuses to operate on new pages, so NoComputedChecksum will be set in
+ * chk_expected.
+ */
+static bool
+check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+				   uint16 *chk_found)
+{
+	Page		page = (Page) buffer;
+	PageHeader	hdr = (PageHeader) page;
+
+	Assert(chk_expected && chk_found);
+
+	if (PageIsNew(page))
+	{
+		/*
+		 * Check if the page is really new or if there's corruption that
+		 * affected PageIsNew detection.  Note that PageIsVerified won't try to
+		 * detect checksum corruption in this case, so there's no risk of
+		 * duplicated corruption report.
+		 */
+		if (PageIsVerified(page, blkno))
+		{
+			/* No corruption. */
+			return true;
+		}
+
+		/*
+		 * There's corruption, but since this affects PageIsNew, we
+		 * can't compute a checksum, so set NoComputedChecksum for the
+		 * expected checksum.
+		 */
+		*chk_expected = NoComputedChecksum;
+		*chk_found = hdr->pd_checksum;
+		return false;
+	}
+
+	*chk_expected = pg_checksum_page(buffer, blkno);
+	*chk_found = hdr->pd_checksum;
+
+	return (*chk_expected == *chk_found);
+}
+
+/*
+ * Check for interrupts and cost-based delay.
+ */
+static void
+check_delay_point(void)
+{
+	/* Always check for interrupts */
+	CHECK_FOR_INTERRUPTS();
+
+	if (!ChecksumCostActive || InterruptPending)
+		return;
+
+	/* Nap if appropriate */
+	if (ChecksumCostBalance >= ChecksumCostLimit)
+	{
+		int			msec;
+
+		msec = ChecksumCostDelay * ChecksumCostBalance / ChecksumCostLimit;
+		if (msec > ChecksumCostDelay * 4)
+			msec = ChecksumCostDelay * 4;
+
+		pg_usleep(msec * 1000L);
+
+		ChecksumCostBalance = 0;
+
+		/* Might have gotten an interrupt while sleeping */
+		CHECK_FOR_INTERRUPTS();
+	}
+}
+
+/*
+ *-------------------------
+ * Safely read the wanted buffer from disk, dealing with possible concurrency
+ * issues.  Note that if a buffer is found dirty in shared_buffers, no read
+ * will be performed and the caller will be informed that no check should be
+ * done.  We can safely ignore such buffers as they'll be written before the
+ * next checkpoint's completion.
+ *
+ * The following locks can be used in this function:
+ *
+ *   - shared LWLock on the target buffer pool partition mapping.
+ *   - IOLock on the buffer
+ *
+ * The IOLock is taken when reading the buffer from disk if it exists in
+ * shared_buffers, to avoid torn pages.
+ *
+ * If the buffer isn't in shared_buffers, it'll be read from disk without any
+ * lock unless the caller asked otherwise by setting needlock.  In that case,
+ * the read is done while the buffer mapping partition LWLock is still held.
+ * Reading with this lock avoids the unlikely but possible case where a
+ * buffer wasn't present in shared_buffers when we checked but was then
+ * allocated in shared_buffers, modified and flushed concurrently before we
+ * read it, leading to false positives due to a torn page.  The caller can
+ * first read the buffer without holding the target buffer mapping partition
+ * LWLock as an optimistic approach, and reread the buffer from disk in case
+ * of error.
+ *
+ * The caller should hold an AccessShareLock on the relation.
+ *-------------------------
+ */
+static bool
+check_get_buffer(Relation relation, ForkNumber forknum,
+				 BlockNumber blkno, char *buffer, bool needlock,
+				 bool *found_in_sb)
+{
+	bool		checkit = true;
+	BufferTag	buf_tag;		/* identity of requested block */
+	uint32		buf_hash;		/* hash value for buf_tag */
+	LWLock	   *partLock;		/* buffer partition lock for the buffer */
+	BufferDesc *bufdesc;
+	int			buf_id;
+
+	*found_in_sb = false;
+
+	/* Check for interrupts and take throttling into account. */
+	check_delay_point();
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(buf_tag, relation->rd_smgr->smgr_rnode.node, forknum, blkno);
+
+	/* determine its hash code and partition lock ID */
+	buf_hash = BufTableHashCode(&buf_tag);
+	partLock = BufMappingPartitionLock(buf_hash);
+
+	/* see if the block is in the buffer pool already */
+	LWLockAcquire(partLock, LW_SHARED);
+	buf_id = BufTableLookup(&buf_tag, buf_hash);
+	if (buf_id >= 0)
+	{
+		uint32		buf_state;
+
+		*found_in_sb = true;
+
+		/*
+		 * Found it.  Now retrieve its state to know what to do with it, and
+		 * release the buffer header spinlock immediately.  We do so to limit
+		 * overhead as much as possible.  We'll keep the shared lightweight
+		 * lock on the target buffer mapping partition, so this buffer can't
+		 * be evicted, and we'll acquire an IOLock on the buffer if we need
+		 * to read the content from disk.
+		 */
+		bufdesc = GetBufferDescriptor(buf_id);
+
+		buf_state = LockBufHdr(bufdesc);
+		UnlockBufHdr(bufdesc, buf_state);
+
+		/*
+		 * Dirty pages are ignored as they'll be flushed soon. Invalid buffers
+		 * are also skipped.
+		 */
+		if ((buf_state & BM_DIRTY) || !(buf_state & BM_TAG_VALID))
+			checkit = false;
+
+		/*
+		 * Read the buffer from disk, taking an IO lock to prevent torn-page
+		 * reads, in the unlikely event that it was concurrently dirtied and
+		 * flushed.
+		 */
+		if (checkit)
+		{
+			LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
+			smgrread(relation->rd_smgr, forknum, blkno, buffer);
+			LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
+
+			/* Add a page cost. */
+			ChecksumCostBalance += ChecksumCostPage;
+		}
+	}
+	else if (needlock)
+	{
+		/*
+		 * Caller asked to read the buffer while we have a lock on the target
+		 * partition.
+		 */
+		smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+		/* The buffer will have to be checked. */
+		Assert(checkit);
+
+		/* Add a page cost. */
+		ChecksumCostBalance += ChecksumCostPage;
+	}
+
+	LWLockRelease(partLock);
+
+	if (*found_in_sb || needlock)
+		return checkit;
+
+	/* After this point the buffer will always be checked. */
+	Assert(checkit);
+
+	/*
+	 * Didn't find it in the buffer pool and didn't read it while holding the
+	 * buffer mapping partition lock.  We'll have to try to read it from
+	 * disk, after releasing the target partition lock to avoid excessive
+	 * overhead.  It means that it's possible to get a torn page later, so
+	 * we'll have to retry with a suitable lock in case of error to avoid
+	 * false positive.
+	 */
+	smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+	/* Add a page cost. */
+	ChecksumCostBalance += ChecksumCostPage;
+
+	return checkit;
+}
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 5d2aca8cfe..2b20942e88 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -22,6 +22,7 @@ OBJS = \
 	bool.o \
 	cash.o \
 	char.o \
+	checksumfuncs.o \
 	cryptohashes.o \
 	date.o \
 	datetime.o \
diff --git a/src/backend/utils/adt/checksumfuncs.c b/src/backend/utils/adt/checksumfuncs.c
new file mode 100644
index 0000000000..fa5823677a
--- /dev/null
+++ b/src/backend/utils/adt/checksumfuncs.c
@@ -0,0 +1,217 @@
+/*-------------------------------------------------------------------------
+ *
+ * checksumfuncs.c
+ *	  Functions for checksum-related features such as online verification
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/checksumfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/relation.h"
+#include "catalog/pg_authid_d.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/bufmgr.h"
+#include "storage/checksum.h"
+#include "storage/smgr.h"
+#include "utils/acl.h"
+#include "utils/rel.h"
+#include "utils/builtins.h"
+#include "utils/lsyscache.h"
+
+static void check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+							   Oid relid, ForkNumber single_forknum);
+static void check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+								Relation relation, ForkNumber forknum);
+static void pg_check_relation_internal(FunctionCallInfo fcinfo, Oid relid,
+									   ForkNumber forknum);
+
+
+Datum
+pg_check_relation(PG_FUNCTION_ARGS)
+{
+	Oid			relid = InvalidOid;
+
+	pg_check_relation_internal(fcinfo, relid, InvalidForkNumber);
+
+	return (Datum) 0;
+}
+
+Datum
+pg_check_relation_fork(PG_FUNCTION_ARGS)
+{
+	Oid			relid = InvalidOid;
+	const char *forkname;
+	ForkNumber	forknum;
+
+	forkname = TextDatumGetCString(PG_GETARG_TEXT_PP(1));
+	forknum = forkname_to_number(forkname);
+
+	pg_check_relation_internal(fcinfo, relid, forknum);
+
+	return (Datum) 0;
+}
+
+/* Common code for all versions of pg_check_relation() */
+static void
+pg_check_relation_internal(FunctionCallInfo fcinfo, Oid relid,
+						   ForkNumber forknum)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+
+	if (!DataChecksumsEnabled())
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("data checksums are not enabled in this cluster")));
+
+	if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_STAT_SCAN_TABLES))
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("only superuser or a member of the pg_stat_scan_tables role may use this function")));
+
+	/* Switch into long-lived context to construct returned data structures */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	relid = PG_GETARG_OID(0);
+
+	/* Set cost-based checksum verification delay */
+	ChecksumCostActive = (ChecksumCostDelay > 0);
+	ChecksumCostBalance = 0;
+
+	check_one_relation(tupdesc, tupstore, relid, forknum);
+
+	tuplestore_donestoring(tupstore);
+}
+
+/*
+ * Perform the check on a single relation, possibly filtered with a single
+ * fork.  This function will check if the given relation exists or not, as
+ * a relation could be dropped after checking for the list of relations and
+ * before getting here, and we don't want to error out in this case.
+ */
+static void
+check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+				   Oid relid, ForkNumber single_forknum)
+{
+	Relation	relation;
+	ForkNumber	forknum;
+
+	/* The relation could have been dropped concurrently, silently skip it */
+	relation = try_relation_open(relid, AccessShareLock);
+	if (relation == NULL)
+		return;
+
+	/* sanity checks */
+	if (!RELKIND_HAS_STORAGE(relation->rd_rel->relkind))
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("relation \"%s\" does not have storage to be checked",
+				 RelationGetRelationName(relation))));
+
+	if (RELATION_IS_OTHER_TEMP(relation))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot verify temporary tables of other sessions")));
+
+	RelationOpenSmgr(relation);
+
+	for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+	{
+		if (single_forknum != InvalidForkNumber && single_forknum != forknum)
+			continue;
+
+		if (smgrexists(relation->rd_smgr, forknum))
+			check_relation_fork(tupdesc, tupstore, relation, forknum);
+	}
+
+	relation_close(relation, AccessShareLock);	/* release the lock */
+}
+
+/*
+ * For a given relation and fork, do the real work of iterating over all pages
+ * and doing the check.  The caller must hold an AccessShareLock on the given
+ * relation.
+ */
+static void
+check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+						  Relation relation, ForkNumber forknum)
+{
+	BlockNumber blkno,
+				nblocks;
+
+	/*
+	 * We remember the number of blocks here.  Since caller must hold a lock on
+	 * the relation, we know that it won't be truncated while we're iterating
+	 * over the blocks.  Any block added after this function started won't be
+	 * checked, but this is out of scope as such pages will be flushed before
+	 * the next checkpoint's completion.
+	 */
+	nblocks = RelationGetNumberOfBlocksInFork(relation, forknum);
+
+#define PG_CHECK_RELATION_COLS			5	/* Number of output arguments in the SRF */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		uint16		chk_expected,
+					chk_found;
+		Datum		values[PG_CHECK_RELATION_COLS];
+		bool		nulls[PG_CHECK_RELATION_COLS];
+		int			i = 0;
+
+		/* Check the given buffer */
+		if (check_one_block(relation, forknum, blkno, &chk_expected,
+							&chk_found))
+		{
+			/* Buffer not corrupted or not worth checking, continue */
+			continue;
+		}
+
+		memset(values, 0, sizeof(values));
+		memset(nulls, 0, sizeof(nulls));
+
+		values[i++] = ObjectIdGetDatum(relation->rd_id);
+		values[i++] = Int32GetDatum(forknum);
+		values[i++] = UInt32GetDatum(blkno);
+		/*
+		 * This can happen if corruption makes the block appear new to
+		 * PageIsNew() when it isn't actually a new page.
+		 */
+		if (chk_expected == NoComputedChecksum)
+			nulls[i++] = true;
+		else
+			values[i++] = UInt16GetDatum(chk_expected);
+		values[i++] = UInt16GetDatum(chk_found);
+
+		Assert(i == PG_CHECK_RELATION_COLS);
+
+		/* Report the failure to the stat collector and the logs. */
+		pgstat_report_checksum_failure();
+		ereport(WARNING,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("invalid page in block %u of relation %s",
+						blkno,
+						relpath(relation->rd_smgr->smgr_rnode, forknum))));
+
+		/* Save the corrupted blocks in the tuplestore. */
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 497d7c38ae..86b67c3636 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -134,6 +134,13 @@ int			max_worker_processes = 8;
 int			max_parallel_workers = 8;
 int			MaxBackends = 0;
 
+int			ChecksumCostPage = 10;	/* GUC parameters for checksum check */
+int			ChecksumCostLimit = 200;
+double		ChecksumCostDelay = 0;
+
+int			ChecksumCostBalance = 0;	/* working state for checksums check */
+bool		ChecksumCostActive = false;
+
 int			VacuumCostPageHit = 1;	/* GUC parameters for vacuum */
 int			VacuumCostPageMiss = 10;
 int			VacuumCostPageDirty = 20;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 031ca0327f..57401580c3 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -686,6 +686,8 @@ const char *const config_group_names[] =
 	gettext_noop("Resource Usage / Disk"),
 	/* RESOURCES_KERNEL */
 	gettext_noop("Resource Usage / Kernel Resources"),
+	/* RESOURCES_CHECKSUM_DELAY */
+	gettext_noop("Resource Usage / Cost-Based Checksum Verification Delay"),
 	/* RESOURCES_VACUUM_DELAY */
 	gettext_noop("Resource Usage / Cost-Based Vacuum Delay"),
 	/* RESOURCES_BGWRITER */
@@ -2379,6 +2381,26 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"checksum_cost_page", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost for verifying a page."),
+			NULL
+		},
+		&ChecksumCostPage,
+		10, 0, 10000,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"checksum_cost_limit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost amount available before napping."),
+			NULL
+		},
+		&ChecksumCostLimit,
+		200, 1, 10000,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_cost_page_hit", PGC_USERSET, RESOURCES_VACUUM_DELAY,
 			gettext_noop("Vacuum cost for a page found in the buffer cache."),
@@ -3556,6 +3578,17 @@ static struct config_real ConfigureNamesReal[] =
 		check_random_seed, assign_random_seed, show_random_seed
 	},
 
+	{
+		{"checksum_cost_delay", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost delay in milliseconds."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&ChecksumCostDelay,
+		0, 0, 100,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_cost_delay", PGC_USERSET, RESOURCES_VACUUM_DELAY,
 			gettext_noop("Vacuum cost delay in milliseconds."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e430e33c7b..d578c525f3 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -156,6 +156,12 @@
 #max_files_per_process = 1000		# min 64
 					# (change requires restart)
 
+# - Cost-Based Checksum Verification Delay -
+
+#checksum_cost_delay = 0			# 0-100 milliseconds (0 disables)
+#checksum_cost_page = 10		# 0-10000 credits
+#checksum_cost_limit = 200		# 1-10000 credits
+
 # - Cost-Based Vacuum Delay -
 
 #vacuum_cost_delay = 0			# 0-100 milliseconds (0 disables)
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 95604e988a..70acd13e25 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10919,6 +10919,22 @@
   proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
   proargnames => '{tablespace,name,size,modification}',
   prosrc => 'pg_ls_tmpdir_1arg' },
+{ oid => '9147', descr => 'check data integrity for all forks of a relation',
+  proname => 'pg_check_relation', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass',
+  proallargtypes => '{regclass,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,o,o,o,o,o}',
+  proargnames => '{relation,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation' },
+{ oid => '9148', descr => 'check data integrity for one fork of a relation',
+  proname => 'pg_check_relation', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,i,o,o,o,o,o}',
+  proargnames => '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation_fork' },
 
 # hash partitioning constraint function
 { oid => '5028', descr => 'hash partition CHECK constraint',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 18bc8a7b90..747e79ca2f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -246,6 +246,13 @@ extern PGDLLIMPORT int work_mem;
 extern PGDLLIMPORT int maintenance_work_mem;
 extern PGDLLIMPORT int max_parallel_maintenance_workers;
 
+extern int	ChecksumCostPage;
+extern int	ChecksumCostLimit;
+extern double ChecksumCostDelay;
+
+extern int	ChecksumCostBalance;
+extern bool ChecksumCostActive;
+
 extern int	VacuumCostPageHit;
 extern int	VacuumCostPageMiss;
 extern int	VacuumCostPageDirty;
diff --git a/src/include/storage/checksum.h b/src/include/storage/checksum.h
index 6e77744cbc..08aa0949a8 100644
--- a/src/include/storage/checksum.h
+++ b/src/include/storage/checksum.h
@@ -13,8 +13,21 @@
 #ifndef CHECKSUM_H
 #define CHECKSUM_H
 
+#include "access/tupdesc.h"
+#include "common/relpath.h"
 #include "storage/block.h"
+#include "utils/relcache.h"
+#include "utils/tuplestore.h"
+
+/* A zero checksum can never be computed, see pg_checksum_page() */
+#define NoComputedChecksum	0
 
+extern bool check_one_block(Relation relation, ForkNumber forknum,
+							BlockNumber blkno, uint16 *chk_expected,
+							uint16 *chk_found);
+
 /*
  * Compute the checksum for a Postgres page.  The page must be aligned on a
  * 4-byte boundary.
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 04431d0eb2..4ed2a09600 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -62,6 +62,7 @@ enum config_group
 	RESOURCES_MEM,
 	RESOURCES_DISK,
 	RESOURCES_KERNEL,
+	RESOURCES_CHECKSUM_DELAY,
 	RESOURCES_VACUUM_DELAY,
 	RESOURCES_BGWRITER,
 	RESOURCES_ASYNCHRONOUS,
diff --git a/src/test/Makefile b/src/test/Makefile
index efb206aa75..832868099f 100644
--- a/src/test/Makefile
+++ b/src/test/Makefile
@@ -12,7 +12,8 @@ subdir = src/test
 top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS = perl regress isolation modules authentication recovery subscription
+SUBDIRS = perl regress isolation modules authentication check_relation \
+	  recovery subscription
 
 # Test suites that are not safe by default but can be run if selected
 # by the user via the whitespace-separated list in variable
diff --git a/src/test/check_relation/.gitignore b/src/test/check_relation/.gitignore
new file mode 100644
index 0000000000..871e943d50
--- /dev/null
+++ b/src/test/check_relation/.gitignore
@@ -0,0 +1,2 @@
+# Generated by test suite
+/tmp_check/
diff --git a/src/test/check_relation/Makefile b/src/test/check_relation/Makefile
new file mode 100644
index 0000000000..792098fa65
--- /dev/null
+++ b/src/test/check_relation/Makefile
@@ -0,0 +1,23 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/test/check_relation
+#
+# Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/test/check_relation/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/test/check_relation
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
+clean distclean maintainer-clean:
+	rm -rf tmp_check
diff --git a/src/test/check_relation/README b/src/test/check_relation/README
new file mode 100644
index 0000000000..415c4b21ad
--- /dev/null
+++ b/src/test/check_relation/README
@@ -0,0 +1,23 @@
+src/test/check_relation/README
+
+Regression tests for online checksums verification
+==================================================
+
+This directory contains a test suite for online checksums verification.
+
+Running the tests
+=================
+
+NOTE: You must have given the --enable-tap-tests argument to configure.
+
+Run
+    make check
+or
+    make installcheck
+You can use "make installcheck" if you previously did "make install".
+In that case, the code in the installation tree is tested.  With
+"make check", a temporary installation tree is built from the current
+sources and then tested.
+
+Either way, this test initializes, starts, and stops a test Postgres
+cluster.
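+
+For example, to run only this test suite with verbose output, you can
+use the standard TAP test variables:
+
+    make check PROVE_TESTS='t/01_checksums_check.pl' PROVE_FLAGS='--verbose'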
diff --git a/src/test/check_relation/t/01_checksums_check.pl b/src/test/check_relation/t/01_checksums_check.pl
new file mode 100644
index 0000000000..2a3f2880ea
--- /dev/null
+++ b/src/test/check_relation/t/01_checksums_check.pl
@@ -0,0 +1,276 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 59;
+
+our $CHECKSUM_UINT16_OFFSET = 4;
+our $PD_UPPER_UINT16_OFFSET = 7;
+our $BLOCKSIZE;
+our $TOTAL_NB_ERR = 0;
+
+sub get_block
+{
+	my ($filename, $blkno) = @_;
+	my $block;
+
+	open(my $infile, '<', $filename) or die;
+	binmode($infile);
+
+	# Seek to the requested block explicitly: the fourth argument of
+	# read() is an offset into the scalar, not into the file.
+	sysseek($infile, $blkno * $BLOCKSIZE, 0) or die($!);
+	my $success = sysread($infile, $block, $BLOCKSIZE);
+	die($!) if not defined $success;
+
+	close($infile);
+
+	return($block);
+}
+
+sub overwrite_block
+{
+	my ($filename, $block, $blkno) = @_;
+
+	# Open read-write: opening with '>' would truncate the whole file.
+	open(my $outfile, '+<', $filename) or die;
+	binmode ($outfile);
+
+	# Seek to the target block explicitly: the fourth argument of
+	# syswrite() is an offset into the scalar, not into the file.
+	sysseek($outfile, $blkno * $BLOCKSIZE, 0) or die($!);
+	my $nb = syswrite($outfile, $block, $BLOCKSIZE);
+
+	die($!) if not defined $nb;
+	die("Write error") if ($nb != $BLOCKSIZE);
+
+	close($outfile);
+}
+
+sub get_uint16_from_page
+{
+	my ($block, $offset) = @_;
+
+	return (unpack("S*", $block))[$offset];
+}
+
+sub set_uint16_to_page
+{
+	my ($block, $data, $offset) = @_;
+
+	my $pack = pack("S", $data);
+
+	# vec() with 16-bit or wider elements won't preserve endianness, so
+	# store the value one byte at a time
+	vec($block, 2*$offset, 8) = (unpack('C*', $pack))[0];
+	vec($block, (2*$offset) + 1, 8) = (unpack('C*', $pack))[1];
+
+	return $block;
+}
+
+sub check_checksums_call
+{
+	my ($node, $relname) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM pg_catalog.pg_check_relation('$relname')"
+	);
+
+	return ($stderr eq '');
+}
+
+sub check_checksums_nb_error
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM (SELECT pg_catalog.pg_check_relation(oid, 'main')"
+		. "   FROM pg_class WHERE relkind in ('r', 'i', 'm')) AS s"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+	is($stdout, $nb, "Should have $nb error(s)");
+
+	$TOTAL_NB_ERR += $nb;
+}
+
+sub check_pg_stat_database_nb_error
+{
+	my ($node) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT "
+		. " sum(checksum_failures)"
+		. " FROM pg_catalog.pg_stat_database"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	is($stderr, '', 'Function should run successfully');
+	is($stdout, $TOTAL_NB_ERR, "Should have $TOTAL_NB_ERR error(s)");
+}
+
+sub get_checksums_errors
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+		. " relid::regclass::text, forknum, failed_blocknum,"
+		. " expected_checksum, found_checksum"
+		. " FROM (SELECT (pg_catalog.pg_check_relation(oid)).*"
+		. "   FROM pg_class WHERE relkind in ('r', 'i', 'm')) AS s"
+	);
+
+	is($cmdret, '0', 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+
+	$TOTAL_NB_ERR += $nb;
+
+	return $stdout;
+}
+
+# This function performs various tests by modifying the specified block at
+# the specified uint16 offset, checking that the corruption is correctly
+# detected, and finally restoring the block to its original content.
+sub corrupt_and_test_block
+{
+	my ($node, $filename, $blkno, $offset, $fake_data) = @_;
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+
+	check_pg_stat_database_nb_error($node);
+
+	$node->stop();
+
+	my $original_block = get_block($filename, $blkno);
+	my $original_data = get_uint16_from_page($original_block, $offset);
+
+	isnt($original_data, $fake_data,
+		"The fake data at offset $offset should be different"
+		. " from the existing one");
+
+	my $new_block = set_uint16_to_page($original_block, $fake_data, $offset);
+	isnt($original_data, get_uint16_from_page($new_block, $offset),
+		"The fake data at offset $offset should have been changed in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+
+	my $written_data = get_uint16_from_page(get_block($filename, $blkno), $offset);
+	isnt($original_data, $written_data,
+		"The data written at offset $offset should be different"
+		. " from the original one");
+	is(get_uint16_from_page($new_block, $offset), $written_data,
+		"The data written at offset $offset should be the same"
+		. " as the one in memory");
+	is($written_data, $fake_data,
+		"The data written at offset $offset should be the one"
+		. " we wanted to write");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 1, qr/invalid page in block $blkno/);
+
+	my $expected_checksum;
+	my $found_checksum = get_uint16_from_page($new_block,
+		$CHECKSUM_UINT16_OFFSET);
+	if ($offset == $PD_UPPER_UINT16_OFFSET)
+	{
+		# A checksum can't be computed if it's detected as PageIsNew(), so the
+		# function returns NULL for the computed checksum
+		$expected_checksum = '';
+	}
+	else
+	{
+		$expected_checksum = get_uint16_from_page($original_block,
+			$CHECKSUM_UINT16_OFFSET);
+	}
+
+	my $det = get_checksums_errors($node, 1, qr/invalid page in block $blkno/);
+	is($det, "t1|0|0|$expected_checksum|$found_checksum",
+		"The checksums error for modification at offset $offset"
+		. " should be detected");
+
+	$node->stop();
+
+	$new_block = set_uint16_to_page($original_block, $original_data, $offset);
+	is($original_data, get_uint16_from_page($new_block, $offset),
+		"The data at offset $offset should have been restored in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+	is($original_data, get_uint16_from_page(get_block($filename, $blkno),
+			$offset),
+		"The data at offset $offset should have been restored on disk");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+}
+
+if (exists $ENV{MY_PG_REGRESS})
+{
+	$ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS};
+}
+
+my $node = get_new_node('main');
+
+my %params;
+$params{'extra'} = ['--data-checksums'];
+$node->init(%params);
+
+$node->start();
+
+$ENV{PGOPTIONS} = '--client-min-messages=WARNING';
+
+my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_checksums')");
+
+is($stdout, 'on', 'Data checksums should be enabled');
+
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('block_size')");
+
+$BLOCKSIZE = $stdout;
+
+$node->safe_psql(
+	'postgres', q|
+	CREATE TABLE public.t1(id integer);
+	CREATE INDEX t1_id_idx ON public.t1 (id);
+	INSERT INTO public.t1 SELECT generate_series(1, 100);
+	CREATE VIEW public.v1 AS SELECT * FROM t1;
+	CREATE MATERIALIZED VIEW public.mv1 AS SELECT * FROM t1;
+	CREATE SEQUENCE public.s1;
+	CREATE UNLOGGED TABLE public.u_t1(id integer);
+	CREATE INDEX u_t1_id_idx ON public.u_t1 (id);
+	INSERT INTO public.u_t1 SELECT generate_series(1, 100);
+	CHECKPOINT;
+|);
+
+# Check sane behavior on various object types, including those that don't
+# have storage.
+is(check_checksums_call($node, 't1'), '1', 'Can check a table');
+is(check_checksums_call($node, 't1_id_idx'), '1', 'Can check an index');
+is(check_checksums_call($node, 'v1'), '', 'Cannot check a view');
+is(check_checksums_call($node, 'mv1'), '1', 'Can check a materialized view');
+is(check_checksums_call($node, 's1'), '1', 'Can check a sequence');
+is(check_checksums_call($node, 'u_t1'), '1', 'Can check an unlogged table');
+is(check_checksums_call($node, 'u_t1_id_idx'), '1', 'Can check an unlogged index');
+
+# get the underlying heap absolute path
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_directory') || '/' || pg_relation_filepath('t1')"
+);
+
+isnt($stdout, '', 'A relfilenode should be returned');
+
+my $filename = $stdout;
+
+check_checksums_nb_error($node, 0, qr/^$/);
+
+check_pg_stat_database_nb_error($node);
+
+my $fake_uint16 = hex '0x0000';
+
+# Test with a modified checksum.  We use a zero checksum here as it's the only
+# one that cannot exist on a checksummed page.  We also don't have an easy way
+# to compute what the checksum would be after a modification in a random place
+# in the block.
+corrupt_and_test_block($node, $filename, 0, $CHECKSUM_UINT16_OFFSET,
+	$fake_uint16);
+
+# Test corruption making the block look like it's PageIsNew().
+corrupt_and_test_block($node, $filename, 0, $PD_UPPER_UINT16_OFFSET,
+	$fake_uint16);
-- 
2.20.1

#34Michael Paquier
michael@paquier.xyz
In reply to: Julien Rouhaud (#33)
Re: Online checksums verification in the backend

On Tue, Jul 14, 2020 at 11:08:08AM +0200, Julien Rouhaud wrote:

On Sun, Jul 12, 2020 at 12:34:03PM -0500, Justin Pryzby wrote:

Small language fixes in comments and user-facing docs.

Thanks a lot! I just fixed a small issue (see below), PFA updated v10.

Sawada-san, you are registered as a reviewer of this patch. Are you
planning to look at it? If you are busy lately, that's fine as well
(congrats!). In this case it could be better to unregister from the
CF app for this entry.

I am refreshing my mind here, but here are some high-level comments
for now...

+#include "postgres.h"
+
+#include "access/tupdesc.h"
+#include "common/relpath.h"
 #include "storage/block.h"
+#include "utils/relcache.h"
+#include "utils/tuplestore.h"
[...]
+extern bool check_one_block(Relation relation, ForkNumber forknum,
+                           BlockNumber blkno, uint16 *chk_expected,
+                           uint16 *chk_found);
I don't think that it is a good idea to add this much to checksum.h
as these are includes coming mainly from the backend.  Note that
pg_checksum_page() is a function designed to be also available for
frontend tools, with checksum.h something that can be included in
frontends.  This would mean that we could move all the buffer lookup
APIs directly to checksumfuncs.c, or move that into a separate file
closer to the location.
+ * A zero checksum can never be computed, see pg_checksum_page() */
+#define NoComputedChecksum 0
Wouldn't it be better to rename that something like
InvalidPageChecksum, and make use of it in pg_checksum_page()?  It
would be more consistent with the naming of transaction IDs, OIDs or
even XLogRecPtr.  And that could be a separate patch.
+++ b/src/test/check_relation/t/01_checksums_check.pl
@@ -0,0 +1,276 @@
+use strict;
+use warnings;
It could be better to move that to src/test/modules/, so as it could
be picked more easily by MSVC scripts in the future.  Note that if we
apply the normal naming convention here this should be named
001_checksum_check.pl.
+subdir = src/test/check_relation
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
Let's use a Makefile shaped in a way similar to modules/test_misc that
makes use of TAP_TESTS = 1.  There is the infra, let's rely on it for
the regression tests.

+ pg_usleep(msec * 1000L);
Could it be possible to add a wait event here? It would be nice to be
able to monitor that in pg_stat_activity.

+if (exists $ENV{MY_PG_REGRESS})
+{
+   $ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS};
+}
What is MY_PG_REGRESS for?  A remnant from an external makefile
perhaps?
+   /*
+    * If we get a failure and the buffer wasn't found in shared buffers,
+    * reread the buffer with suitable lock to avoid false positive. See
+    * check_get_buffer for more details.
+    */
+   if (!found_in_sb && !force_lock)
+   {
+       force_lock = true;
+       goto Retry;
+   }
As designed, we have a risk of false positives with a torn page in the
first loop when trying to look for a given buffer as we would try to
use smgrread() without a partition lock.  This stresses me a bit, and
false positives could scare users easily.  Could we consider first a
safer approach where we don't do that, and just read the page while
holding the partition lock?  OK, that would be more expensive, but at
least that's safe in any case.  My memory of this patch is a bit
fuzzy, but this is itching me and this is the heart of the problem
dealt with here :)
--
Michael
#35Julien Rouhaud
rjuju123@gmail.com
In reply to: Michael Paquier (#34)
Re: Online checksums verification in the backend

On Mon, Sep 7, 2020 at 8:59 AM Michael Paquier <michael@paquier.xyz> wrote:

+#include "postgres.h"
+
+#include "access/tupdesc.h"
+#include "common/relpath.h"
#include "storage/block.h"
+#include "utils/relcache.h"
+#include "utils/tuplestore.h"
[...]
+extern bool check_one_block(Relation relation, ForkNumber forknum,
+                           BlockNumber blkno, uint16 *chk_expected,
+                           uint16 *chk_found);
I don't think that it is a good idea to add this much to checksum.h
as these are includes coming mainly from the backend.  Note that
pg_checksum_page() is a function designed to be also available for
frontend tools, with checksum.h something that can be included in
frontends.  This would mean that we could move all the buffer lookup
APIs directly to checksumfuncs.c, or move that into a separate file
closer to the location.

Did you mean creating a new checksumfuncs.c file? I don't find any
such file in the current tree.

+ * A zero checksum can never be computed, see pg_checksum_page() */
+#define NoComputedChecksum 0
Wouldn't it be better to rename that something like
InvalidPageChecksum, and make use of it in pg_checksum_page()?  It
would be more consistent with the naming of transaction IDs, OIDs or
even XLogRecPtr.  And that could be a separate patch.

It seems quite ambiguous, as checksum validity usually has a different
meaning. And in the code added here, the meaning isn't that the
checksum is invalid but that there's no checksum at all, as it cannot
be computed due to PageIsNew().

+++ b/src/test/check_relation/t/01_checksums_check.pl
@@ -0,0 +1,276 @@
+use strict;
+use warnings;
It could be better to move that to src/test/modules/, so as it could
be picked more easily by MSVC scripts in the future.  Note that if we
apply the normal naming convention here this should be named
001_checksum_check.pl.
+subdir = src/test/check_relation
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
Let's use a Makefile shaped in a way similar to modules/test_misc that
makes use of TAP_TESTS = 1.  There is the infra, let's rely on it for
the regression tests.

Will fix.

+ pg_usleep(msec * 1000L);
Could it be possible to add a wait event here? It would be nice to be
able to monitor that in pg_stat_activity.

Sure, I missed that as this was first implemented as an extension.

+if (exists $ENV{MY_PG_REGRESS})
+{
+   $ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS};
+}
What is MY_PG_REGRESS for?  A remnant from an external makefile
perhaps?

Indeed.

+   /*
+    * If we get a failure and the buffer wasn't found in shared buffers,
+    * reread the buffer with suitable lock to avoid false positive. See
+    * check_get_buffer for more details.
+    */
+   if (!found_in_sb && !force_lock)
+   {
+       force_lock = true;
+       goto Retry;
+   }
As designed, we have a risk of false positives with a torn page in the
first loop when trying to look for a given buffer as we would try to
use smgrread() without a partition lock.  This stresses me a bit, and
false positives could scare users easily.  Could we consider first a
safer approach where we don't do that, and just read the page while
holding the partition lock?  OK, that would be more expensive, but at
least that's safe in any case.  My memory of this patch is a bit
fuzzy, but this is itching me and this is the heart of the problem
dealt with here :)

I'm not sure I understand. Unless I missed something this approach
*cannot* raise a false positive. What it does is force a 2nd check
with a stronger lock *to make sure it's actually a corruption*, so we
don't raise false positives. The only report that can happen in this
1st loop is if smgrread raises an error, which AFAICT can only happen
(at least with mdread) if the whole block couldn't be read, which is a
sign of a very bad problem. This should clearly be reported, as this
cannot be caused by the locking heuristics used here.

#36Michael Paquier
michael@paquier.xyz
In reply to: Julien Rouhaud (#35)
Re: Online checksums verification in the backend

On Mon, Sep 07, 2020 at 09:38:30AM +0200, Julien Rouhaud wrote:

Did you mean creating a new checksumfuncs.c file? I don't find any
such file in the current tree.

Your patch adds checksumfuncs.c, so the subroutines grabbing a given
block could just be moved there.

I'm not sure I understand. Unless I missed something this approach
*cannot* raise a false positive. What it does is force a 2nd check
with stronger lock *to make sure it's actually a corruption*, so we
don't raise false positive. The only report that can happen in this
1st loop is if smgread raises an error, which AFAICT can only happen
(at least with mdread) if the whole block couldn't be read, which is a
sign of a very bad problem. This should clearly be reported, as this
cannot be caused by the locking heuristics used here.

We don't know how much this optimization matters, though.  Could we
get an idea of that?  For example, take the case of one
relation with a fixed size in a read-only workload and a read-write
workload (as long as autovacuum and updates make the number of
relation blocks rather constant for the read-write case), doing a
checksum verification in parallel of multiple clients working on the
relation concurrently. Assuming that the relation is fully in the OS
cache, we could get an idea of the impact with multiple
(shared_buffers / relation size) rates to make the eviction more
aggressive? The buffer partition locks, knowing that
NUM_BUFFER_PARTITIONS caps that, should be the bottleneck, still it
seems to me that it would be good to see if we have a difference.
What do you think?
--
Michael

#37Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: Michael Paquier (#34)
Re: Online checksums verification in the backend

On Mon, 7 Sep 2020 at 15:59, Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Jul 14, 2020 at 11:08:08AM +0200, Julien Rouhaud wrote:

On Sun, Jul 12, 2020 at 12:34:03PM -0500, Justin Pryzby wrote:

Small language fixes in comments and user-facing docs.

Thanks a lot! I just fixed a small issue (see below), PFA updated v10.

Sawada-san, you are registered as a reviewer of this patch. Are you
planning to look at it? If you are busy lately, that's fine as well
(congrats!).

Thanks!

In this case it could be better to unregister from the
CF app for this entry.

Well, I sent review comments on this patch and Julien addressed all of
them, so I'd marked this as Ready for Committer since I didn't
have further comments at that time and was waiting for a
committer review. I'll look at this patch again, but should I remove my
name from the reviewers afterwards if I have no further comments?

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#38Michael Paquier
michael@paquier.xyz
In reply to: Masahiko Sawada (#37)
Re: Online checksums verification in the backend

On Tue, Sep 08, 2020 at 11:36:45AM +0900, Masahiko Sawada wrote:

On Mon, 7 Sep 2020 at 15:59, Michael Paquier <michael@paquier.xyz> wrote:

In this case it could be better to unregister from the
CF app for this entry.

Well, I sent review comments on this patch and Julien fixed all
comments. So I’d marked this as Ready for Committer since I didn't
have further comments at that time, and I was waiting for the
committer review. I'll look at this patch again but should I remove my
name from the reviewer after that if no comments?

Ah, sorry, I somewhat missed the previous status of the patch.
Perhaps that's an overdose of CF. Keeping your name as reviewer is
fine I guess. I have begun looking at the patch and spotted some
issues, so let's see where we go from here.
--
Michael

#39Julien Rouhaud
rjuju123@gmail.com
In reply to: Michael Paquier (#36)
1 attachment(s)
Re: Online checksums verification in the backend

On Mon, Sep 07, 2020 at 05:50:38PM +0900, Michael Paquier wrote:

On Mon, Sep 07, 2020 at 09:38:30AM +0200, Julien Rouhaud wrote:

Did you mean creating a new checksumfuncs.c file? I don't find any
such file in the current tree.

Your patch adds checksumfuncs.c, so the subroutines grabbing a given
block could just be moved there.

Sorry, I was in the middle of a rebase for another patch and missed the new
files added in this one.  I added a new checksumfuncs.h for the required
includes that should not be seen by client code.  I kept checksumfuncs.c and
checksums.c so that the SQL-visible declarations are separated from the rest of
the implementation, as this is what we already do elsewhere I think.  If that's
a problem I'll change it and put everything in checksumfuncs.[ch].

I also moved the TAP tests to src/test/modules and renamed the file with a
3-digit prefix.  For the record I initially copied src/test/modules/brin,
which is apparently the only subdir that uses a 2-digit pattern.

I also added a new WAIT_EVENT_CHECK_DELAY wait event.

I'm not sure I understand. Unless I missed something this approach
*cannot* raise a false positive. What it does is force a 2nd check
with stronger lock *to make sure it's actually a corruption*, so we
don't raise false positive. The only report that can happen in this
1st loop is if smgread raises an error, which AFAICT can only happen
(at least with mdread) if the whole block couldn't be read, which is a
sign of a very bad problem. This should clearly be reported, as this
cannot be caused by the locking heuristics used here.

We don't know how much this optimization matters though? Could it be
possible to get an idea of that? For example, take the case of one
relation with a fixed size in a read-only workload and a read-write
workload (as long as autovacuum and updates make the number of
relation blocks rather constant for the read-write case), doing a
checksum verification in parallel of multiple clients working on the
relation concurrently. Assuming that the relation is fully in the OS
cache, we could get an idea of the impact with multiple
(shared_buffers / relation size) rates to make the eviction more
aggressive? The buffer partition locks, knowing that
NUM_BUFFER_PARTITIONS caps that, should be the bottleneck, still it
seems to me that it would be good to see if we have a difference.
What do you think?

I assumed that the odds of having to check the buffer twice were low enough,
and that avoiding holding a bufmapping lock while doing IO was an
uncontroversial enough optimisation, but maybe that's only wishful thinking.

I'll do some benchmarking and see if I can get some figures, but it'll probably
take some time.  In the meantime I'm attaching v11 of the patch, which should
address all other comments.

Attachments:

v11-0001-Add-a-pg_check_relation-function.patchtext/plain; charset=us-asciiDownload
From 08b28331e11160ab842b7b72a61c4478e7b3561a Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <jrouhaud@vmware.com>
Date: Mon, 4 Nov 2019 08:40:23 +0100
Subject: [PATCH v11] Add a pg_check_relation() function.

This function checks the validity of the checksums for all non-dirty blocks of
a given relation, and optionally a given fork, and returns the list of all
blocks that don't match, along with the expected and found checksums.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada, Justin Pryzby
Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  85 +++++
 doc/src/sgml/func.sgml                        |  51 +++
 src/backend/postmaster/pgstat.c               |   3 +
 src/backend/storage/page/checksum.c           | 322 +++++++++++++++++-
 src/backend/utils/adt/Makefile                |   1 +
 src/backend/utils/adt/checksumfuncs.c         | 218 ++++++++++++
 src/backend/utils/init/globals.c              |   7 +
 src/backend/utils/misc/guc.c                  |  33 ++
 src/backend/utils/misc/postgresql.conf.sample |   6 +
 src/include/catalog/pg_proc.dat               |  16 +
 src/include/miscadmin.h                       |   7 +
 src/include/pgstat.h                          |   3 +-
 src/include/utils/checksumfuncs.h             |  31 ++
 src/include/utils/guc_tables.h                |   1 +
 src/test/Makefile                             |   3 +-
 src/test/modules/Makefile                     |   1 +
 src/test/modules/check_relation/.gitignore    |   2 +
 src/test/modules/check_relation/Makefile      |  14 +
 src/test/modules/check_relation/README        |  23 ++
 .../check_relation/t/001_checksums_check.pl   | 276 +++++++++++++++
 20 files changed, 1098 insertions(+), 5 deletions(-)
 create mode 100644 src/backend/utils/adt/checksumfuncs.c
 create mode 100644 src/include/utils/checksumfuncs.h
 create mode 100644 src/test/modules/check_relation/.gitignore
 create mode 100644 src/test/modules/check_relation/Makefile
 create mode 100644 src/test/modules/check_relation/README
 create mode 100644 src/test/modules/check_relation/t/001_checksums_check.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c4ba49ffaf..b7629fde60 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2137,6 +2137,91 @@ include_dir 'conf.d'
      </note>
     </sect2>
 
+    <sect2 id="runtime-config-resource-checksum-verification-cost">
+     <title>Cost-based Checksum Verification Delay</title>
+
+     <para>
+      During the execution of the <xref linkend="functions-check-relation-note"/>
+      function, the system maintains an internal counter that keeps track of
+      the estimated cost of the various I/O operations that are performed.
+      When the accumulated cost reaches a limit (specified by
+      <varname>checksum_cost_limit</varname>), the process performing the
+      operation will sleep for a short period of time, as specified by
+      <varname>checksum_cost_delay</varname>. Then it will reset the counter
+      and continue execution.
+     </para>
+
+     <para>
+      This feature is disabled by default. To enable it, set the
+      <varname>checksum_cost_delay</varname> variable to a nonzero
+      value.
+     </para>
+
+     <variablelist>
+      <varlistentry id="guc-checksum-cost-delay" xreflabel="checksum_cost_delay">
+       <term><varname>checksum_cost_delay</varname> (<type>floating point</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_delay</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The amount of time that the process will sleep
+         when the cost limit has been exceeded.
+         If this value is specified without units, it is taken as milliseconds.
+         The default value is zero, which disables the cost-based checksum
+         verification delay feature.  Positive values enable cost-based
+         checksum verification.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-checksum-cost-page" xreflabel="checksum_cost_page">
+       <term><varname>checksum_cost_page</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_page</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The estimated cost for verifying a buffer, whether it's found in the
+         shared buffer cache or not. It represents the cost to lock the buffer
+         pool, look up the shared hash table, read the content of the page from
+         disk, and compute its checksum.  The default value is 10.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-checksum-cost-limit" xreflabel="checksum_cost_limit">
+       <term><varname>checksum_cost_limit</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_limit</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The accumulated cost that will cause the verification process to sleep.
+         The default value is 200.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
+
+     <note>
+      <para>
+       There are certain operations that hold critical locks and should
+       therefore complete as quickly as possible.  Cost-based checksum
+       verification delays do not occur during such operations.  Therefore it
+       is possible that the cost accumulates far higher than the specified
+       limit.  To avoid uselessly long delays in such cases, the actual delay
+       is calculated as <varname>checksum_cost_delay</varname> *
+       <varname>accumulated_balance</varname> /
+       <varname>checksum_cost_limit</varname> with a maximum of
+       <varname>checksum_cost_delay</varname> * 4.
+      </para>
+     </note>
+    </sect2>
+
     <sect2 id="runtime-config-resource-background-writer">
      <title>Background Writer</title>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index e2e618791e..bcc97d1306 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26221,6 +26221,57 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
 
   </sect2>
 
+  <sect2 id="functions-data-sanity">
+   <title>Data Sanity Functions</title>
+
+   <para>
+    The functions shown in <xref linkend="functions-data-sanity-table"/>
+    provide a means to check the health of data files in a cluster.
+   </para>
+
+   <table id="functions-data-sanity-table">
+    <title>Data Sanity Functions</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry>
+        <literal><function>pg_check_relation(<parameter>relation</parameter> <type>regclass</type> [, <parameter>fork</parameter> <type>text</type>])</function></literal>
+       </entry>
+       <entry><type>setof record</type></entry>
+       <entry>Validate the checksum for all blocks of a relation.
+       </entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <indexterm>
+    <primary>pg_check_relation</primary>
+   </indexterm>
+   <para id="functions-check-relation-note" xreflabel="pg_check_relation">
+    <function>pg_check_relation</function> iterates over all blocks of a
+    given relation and verifies their checksums.  If passed,
+    <replaceable>fork</replaceable> specifies that only checksums of the given
+    fork are to be verified.  Fork should be <literal>'main'</literal> for the
+    main data fork, <literal>'fsm'</literal> for the free space map,
+    <literal>'vm'</literal> for the visibility map, or
+    <literal>'init'</literal> for the initialization fork.
+    The function returns a list of blocks for which the computed and stored
+    checksums don't match.  See <xref
+    linkend="runtime-config-resource-checksum-verification-cost"/> for
+    information on how to configure cost-based verification delay.  You must be
+    a member of the <literal>pg_read_all_stats</literal> role to use this
+    function.  It can only be used if data checksums are enabled.  See <xref
+    linkend="app-initdb-data-checksums"/> for more information.
+   </para>
+
+  </sect2>
+
   </sect1>
 
   <sect1 id="functions-trigger">
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5f4b168fd1..64664c0bc6 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3912,6 +3912,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
 		case WAIT_EVENT_VACUUM_DELAY:
 			event_name = "VacuumDelay";
 			break;
+		case WAIT_EVENT_CHECK_DELAY:
+			event_name = "CheckDelay";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/storage/page/checksum.c b/src/backend/storage/page/checksum.c
index e010691c9f..dbdc105060 100644
--- a/src/backend/storage/page/checksum.c
+++ b/src/backend/storage/page/checksum.c
@@ -15,8 +15,324 @@
 
 #include "storage/checksum.h"
 /*
- * The actual code is in storage/checksum_impl.h.  This is done so that
- * external programs can incorporate the checksum code by #include'ing
- * that file from the exported Postgres headers.  (Compare our CRC code.)
+ * The actual checksum computation code is in storage/checksum_impl.h.  This
+ * is done so that external programs can incorporate the checksum code by
+ * #include'ing that file from the exported Postgres headers.  (Compare our
+ * CRC code.)
  */
 #include "storage/checksum_impl.h"
+
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/buf_internals.h"
+#include "storage/lmgr.h"
+#include "utils/checksumfuncs.h"
+
+/* ----------------
+ * The rest of this module provides a set of functions that can be used to
+ * safely check all checksums on a running cluster.
+ *
+ * Please note that those only perform standard buffered reads, and don't try
+ * to bypass or discard the operating system cache.  If you want to check the
+ * actual storage, you have to discard the operating system cache before
+ * running those functions.
+ *
+ * To avoid torn pages and possible false positives when reading data, and to
+ * keep overhead as low as possible, the following heuristics are used:
+ *
+ * - a shared LWLock is taken on the target buffer pool partition mapping, and
+ *   we detect if a block is in shared_buffers or not.  See check_get_buffer()
+ *   comments for more details about the locking strategy.
+ *
+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
+ *
+ * - if a block is otherwise found in shared_buffers, an IO lock is taken on
+ *   the block and the block is then read from storage, ignoring the block in
+ *   shared_buffers
+ *
+ * - if a block is not found in shared_buffers, the LWLock is released and the
+ *   block is read from disk without taking any lock.  If an error is detected,
+ *   the read block will be discarded and retrieved again while holding the
+ *   LWLock.  This is because an error due to concurrent write is possible but
+ *   very unlikely, so it's better to have an optimistic approach to limit
+ *   locking overhead
+ *
+ * The check can be performed using an SQL function, returning the list of
+ * problematic blocks.
+ * ----------------
+ */
+
+static bool check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+							   uint16 *chk_found);
+static void check_delay_point(void);
+static bool check_get_buffer(Relation relation, ForkNumber forknum,
+							 BlockNumber blkno, char *buffer, bool needlock,
+							 bool *found_in_sb);
+
+/*
+ * Check data sanity for a specific block in the given fork of the given
+ * relation, always retrieved locally with smgrread even if a version exists in
+ * shared_buffers.  Returns false if the block appears to be corrupted, true
+ * otherwise.  Note that dirty and invalid blocks won't be checked.  Caller
+ * must hold at least an AccessShareLock on the relation.
+ */
+bool
+check_one_block(Relation relation, ForkNumber forknum, BlockNumber blkno,
+				uint16 *chk_expected, uint16 *chk_found)
+{
+	char		buffer[BLCKSZ];
+	bool		force_lock = false;
+	bool		found_in_sb;
+
+	Assert(CheckRelationLockedByMe(relation, AccessShareLock, true));
+	Assert(blkno < RelationGetNumberOfBlocksInFork(relation, forknum));
+	Assert(smgrexists(relation->rd_smgr, forknum));
+
+	*chk_expected = *chk_found = NoComputedChecksum;
+
+	/*
+	 * To avoid excessive overhead, the buffer is first read without the
+	 * locks that would prevent false positives, as such false positives
+	 * should be quite rare.  If an error is detected, we retry once with
+	 * the suitable locks (see force_lock below).
+	 */
+Retry:
+	if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+						  &found_in_sb))
+		return true;
+
+	if (check_buffer(buffer, blkno, chk_expected, chk_found))
+		return true;
+
+	/*
+	 * If we get a failure and the buffer wasn't found in shared buffers,
+	 * reread the buffer with suitable lock to avoid false positive.  See
+	 * check_get_buffer for more details.
+	 */
+	if (!found_in_sb && !force_lock)
+	{
+		force_lock = true;
+		goto Retry;
+	}
+
+	/* A corruption is detected. */
+	return false;
+}
+
+/*
+ * Perform a checksum check on the passed page.  Return true iff the page is
+ * valid, and assign the expected and found checksums to chk_expected and
+ * chk_found, respectively.  Note that a page that looks new could actually
+ * be the result of corruption.  We still check for this case, but we can't
+ * compute its checksum as pg_checksum_page() explicitly rejects new pages,
+ * so NoComputedChecksum will be set in chk_expected.
+ */
+static bool
+check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+				   uint16 *chk_found)
+{
+	Page		page = (Page) buffer;
+	PageHeader	hdr = (PageHeader) page;
+
+	Assert(chk_expected && chk_found);
+
+	if (PageIsNew(page))
+	{
+		/*
+		 * Check if the page is really new or if there's corruption that
+		 * affected PageIsNew detection.  Note that PageIsVerified won't try to
+		 * detect checksum corruption in this case, so there's no risk of
+		 * duplicated corruption report.
+		 */
+		if (PageIsVerified(page, blkno))
+		{
+			/* No corruption. */
+			return true;
+		}
+
+		/*
+		 * There's corruption, but since this affects PageIsNew, we
+		 * can't compute a checksum, so set NoComputedChecksum for the
+		 * expected checksum.
+		 */
+		*chk_expected = NoComputedChecksum;
+		*chk_found = hdr->pd_checksum;
+		return false;
+	}
+
+	*chk_expected = pg_checksum_page(buffer, blkno);
+	*chk_found = hdr->pd_checksum;
+
+	return (*chk_expected == *chk_found);
+}
+
+/*
+ * Check for interrupts and cost-based delay.
+ */
+static void
+check_delay_point(void)
+{
+	/* Always check for interrupts */
+	CHECK_FOR_INTERRUPTS();
+
+	if (!ChecksumCostActive || InterruptPending)
+		return;
+
+	/* Nap if appropriate */
+	if (ChecksumCostBalance >= ChecksumCostLimit)
+	{
+		int			msec;
+
+		msec = ChecksumCostDelay * ChecksumCostBalance / ChecksumCostLimit;
+		if (msec > ChecksumCostDelay * 4)
+			msec = ChecksumCostDelay * 4;
+
+		pgstat_report_wait_start(WAIT_EVENT_CHECK_DELAY);
+		pg_usleep(msec * 1000L);
+		pgstat_report_wait_end();
+
+		ChecksumCostBalance = 0;
+
+		/* Might have gotten an interrupt while sleeping */
+		CHECK_FOR_INTERRUPTS();
+	}
+}
+
+/*
+ *-------------------------
+ * Safely read the wanted buffer from disk, dealing with possible concurrency
+ * issues.  Note that if a buffer is found dirty in shared_buffers, no read
+ * will be performed and the caller will be informed that no check should be
+ * done.  We can safely ignore such buffers as they'll be written before the
+ * next checkpoint's completion.
+ *
+ * The following locks can be used in this function:
+ *
+ *   - shared LWLock on the target buffer pool partition mapping
+ *   - IOLock on the buffer
+ *
+ * The IOLock is taken when reading the buffer from disk if it exists in
+ * shared_buffers, to avoid torn pages.
+ *
+ * If the buffer isn't in shared_buffers, it'll be read from disk without any
+ * lock unless the caller asked otherwise by setting needlock.  In that case,
+ * the read is done while the buffer mapping partition LWLock is still held.
+ * Reading with this lock avoids the unlikely but possible case where a
+ * buffer wasn't present in shared_buffers when we checked, but was then
+ * allocated in shared_buffers, modified and flushed concurrently before we
+ * read it, leading to false positives due to a torn page.  The caller can
+ * first read the buffer without holding the target buffer mapping partition
+ * LWLock as an optimistic approach, and reread the buffer from disk in case
+ * of error.
+ *
+ * Caller should hold an AccessShareLock on the Relation.
+ *-------------------------
+ */
+static bool
+check_get_buffer(Relation relation, ForkNumber forknum,
+				 BlockNumber blkno, char *buffer, bool needlock,
+				 bool *found_in_sb)
+{
+	bool		checkit = true;
+	BufferTag	buf_tag;		/* identity of requested block */
+	uint32		buf_hash;		/* hash value for buf_tag */
+	LWLock	   *partLock;		/* buffer partition lock for the buffer */
+	BufferDesc *bufdesc;
+	int			buf_id;
+
+	*found_in_sb = false;
+
+	/* Check for interrupts and take throttling into account. */
+	check_delay_point();
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(buf_tag, relation->rd_smgr->smgr_rnode.node, forknum, blkno);
+
+	/* determine its hash code and partition lock ID */
+	buf_hash = BufTableHashCode(&buf_tag);
+	partLock = BufMappingPartitionLock(buf_hash);
+
+	/* see if the block is in the buffer pool already */
+	LWLockAcquire(partLock, LW_SHARED);
+	buf_id = BufTableLookup(&buf_tag, buf_hash);
+	if (buf_id >= 0)
+	{
+		uint32		buf_state;
+
+		*found_in_sb = true;
+
+		/*
+		 * Found it.  Now, retrieve its state to know what to do with it,
+		 * releasing the buffer header spinlock immediately.  We do so to
+		 * limit overhead as much as possible.  We'll keep the shared
+		 * lightweight lock on the target buffer mapping partition, so this
+		 * buffer can't be evicted, and we'll acquire an IOLock on the buffer
+		 * if we need to read the content from disk.
+		 */
+		bufdesc = GetBufferDescriptor(buf_id);
+
+		buf_state = LockBufHdr(bufdesc);
+		UnlockBufHdr(bufdesc, buf_state);
+
+		/*
+		 * Dirty pages are ignored as they'll be flushed soon. Invalid buffers
+		 * are also skipped.
+		 */
+		if ((buf_state & BM_DIRTY) || !(buf_state & BM_TAG_VALID))
+			checkit = false;
+
+		/*
+		 * Read the buffer from disk, taking an IO lock to prevent torn-page
+		 * reads, in the unlikely event that it was concurrently dirtied and
+		 * flushed.
+		 */
+		if (checkit)
+		{
+			LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
+			smgrread(relation->rd_smgr, forknum, blkno, buffer);
+			LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
+
+			/* Add a page cost. */
+			ChecksumCostBalance += ChecksumCostPage;
+		}
+	}
+	else if (needlock)
+	{
+		/*
+		 * Caller asked to read the buffer while we have a lock on the target
+		 * partition.
+		 */
+		smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+		/* The buffer will have to be checked. */
+		Assert(checkit);
+
+		/* Add a page cost. */
+		ChecksumCostBalance += ChecksumCostPage;
+	}
+
+	LWLockRelease(partLock);
+
+	if (*found_in_sb || needlock)
+		return checkit;
+
+	/* After this point the buffer will always be checked. */
+	Assert(checkit);
+
+	/*
+	 * Didn't find it in the buffer pool and didn't read it while holding the
+	 * buffer mapping partition lock.  We'll have to try to read it from
+	 * disk, after releasing the target partition lock to avoid excessive
+	 * overhead.  It means that it's possible to get a torn page later, so
+	 * we'll have to retry with a suitable lock in case of error to avoid
+	 * false positive.
+	 */
+	smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+	/* Add a page cost. */
+	ChecksumCostBalance += ChecksumCostPage;
+
+	return checkit;
+}
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index b4d55e849b..603f63afb6 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -22,6 +22,7 @@ OBJS = \
 	bool.o \
 	cash.o \
 	char.o \
+	checksumfuncs.o \
 	cryptohashes.o \
 	date.o \
 	datetime.o \
diff --git a/src/backend/utils/adt/checksumfuncs.c b/src/backend/utils/adt/checksumfuncs.c
new file mode 100644
index 0000000000..700831cbc7
--- /dev/null
+++ b/src/backend/utils/adt/checksumfuncs.c
@@ -0,0 +1,218 @@
+/*-------------------------------------------------------------------------
+ *
+ * checksumfuncs.c
+ *	  Functions for checksum-related features such as online verification
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/checksumfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/relation.h"
+#include "catalog/pg_authid_d.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/bufmgr.h"
+#include "storage/checksum.h"
+#include "storage/smgr.h"
+#include "utils/acl.h"
+#include "utils/checksumfuncs.h"
+#include "utils/rel.h"
+#include "utils/builtins.h"
+#include "utils/lsyscache.h"
+
+static void check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+					 Oid relid, ForkNumber single_forknum);
+static void check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+									  Relation relation, ForkNumber forknum);
+static void pg_check_relation_internal(FunctionCallInfo fcinfo, Oid relid,
+									   Oid forknum);
+
+
+Datum
+pg_check_relation(PG_FUNCTION_ARGS)
+{
+	Oid			relid = InvalidOid;
+
+	pg_check_relation_internal(fcinfo, relid, InvalidForkNumber);
+
+	return (Datum) 0;
+}
+
+Datum
+pg_check_relation_fork(PG_FUNCTION_ARGS)
+{
+	Oid			relid = InvalidOid;
+	const char *forkname;
+	ForkNumber	forknum;
+
+	forkname = TextDatumGetCString(PG_GETARG_TEXT_PP(1));
+	forknum = forkname_to_number(forkname);
+
+	pg_check_relation_internal(fcinfo, relid, forknum);
+
+	return (Datum) 0;
+}
+
+/* Common code for all versions of pg_check_relation() */
+static void
+pg_check_relation_internal(FunctionCallInfo fcinfo, Oid relid, Oid forknum)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+
+	if (!DataChecksumsEnabled())
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("data checksums are not enabled in this cluster")));
+
+	if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_STAT_SCAN_TABLES))
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("only superuser or a member of the pg_stat_scan_tables role may use this function")));
+
+	/* Switch into long-lived context to construct returned data structures */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	relid = PG_GETARG_OID(0);
+
+	/* Set cost-based checksum verification delay */
+	ChecksumCostActive = (ChecksumCostDelay > 0);
+	ChecksumCostBalance = 0;
+
+	check_one_relation(tupdesc, tupstore, relid, forknum);
+
+	tuplestore_donestoring(tupstore);
+}
+
+/*
+ * Perform the check on a single relation, possibly restricted to a single
+ * fork.  This function verifies that the given relation still exists, as it
+ * could have been dropped after the list of relations was built and before
+ * we get here, and we don't want to error out in that case.
+ */
+static void
+check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+					 Oid relid, ForkNumber single_forknum)
+{
+	Relation	relation;
+	ForkNumber	forknum;
+
+	/* The relation could have been concurrently dropped */
+	relation = try_relation_open(relid, AccessShareLock);
+	if (relation == NULL)
+		return;
+
+	/* sanity checks */
+	if (!RELKIND_HAS_STORAGE(relation->rd_rel->relkind))
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("relation \"%s\" does not have storage to be checked",
+				 RelationGetRelationName(relation))));
+
+	if (RELATION_IS_OTHER_TEMP(relation))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot verify temporary tables of other sessions")));
+
+	RelationOpenSmgr(relation);
+
+	for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+	{
+		if (single_forknum != InvalidForkNumber && single_forknum != forknum)
+			continue;
+
+		if (smgrexists(relation->rd_smgr, forknum))
+			check_relation_fork(tupdesc, tupstore, relation, forknum);
+	}
+
+	relation_close(relation, AccessShareLock);	/* release the lock */
+}
+
+/*
+ * For a given relation and fork, do the real work of iterating over all pages
+ * and checking each one.  Caller must hold an AccessShareLock on the given
+ * relation.
+ */
+static void
+check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+						  Relation relation, ForkNumber forknum)
+{
+	BlockNumber blkno,
+				nblocks;
+
+	/*
+	 * We remember the number of blocks here.  Since caller must hold a lock on
+	 * the relation, we know that it won't be truncated while we're iterating
+	 * over the blocks.  Any block added after this function started won't be
+	 * checked, but this is out of scope as such pages will be flushed before
+	 * the next checkpoint's completion.
+	 */
+	nblocks = RelationGetNumberOfBlocksInFork(relation, forknum);
+
+#define PG_CHECK_RELATION_COLS			5	/* Number of output arguments in the SRF */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		uint16		chk_expected,
+					chk_found;
+		Datum		values[PG_CHECK_RELATION_COLS];
+		bool		nulls[PG_CHECK_RELATION_COLS];
+		int			i = 0;
+
+		/* Check the given buffer */
+		if (check_one_block(relation, forknum, blkno, &chk_expected,
+							&chk_found))
+		{
+			/* Buffer not corrupted or not worth checking, continue */
+			continue;
+		}
+
+		memset(values, 0, sizeof(values));
+		memset(nulls, 0, sizeof(nulls));
+
+		values[i++] = ObjectIdGetDatum(relation->rd_id);
+		values[i++] = Int32GetDatum(forknum);
+		values[i++] = UInt32GetDatum(blkno);
+		/*
+		 * This can happen if corruption makes the block appear as new
+		 * (PageIsNew()) while it isn't actually a new page.
+		 */
+		if (chk_expected == NoComputedChecksum)
+			nulls[i++] = true;
+		else
+			values[i++] = UInt16GetDatum(chk_expected);
+		values[i++] = UInt16GetDatum(chk_found);
+
+		Assert(i == PG_CHECK_RELATION_COLS);
+
+		/* Report the failure to the stat collector and the logs. */
+		pgstat_report_checksum_failure();
+		ereport(WARNING,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("invalid page in block %u of relation %s",
+						blkno,
+						relpath(relation->rd_smgr->smgr_rnode, forknum))));
+
+		/* Save the corrupted blocks in the tuplestore. */
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 6ab8216839..68e6e0cfcb 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -135,6 +135,13 @@ int			max_worker_processes = 8;
 int			max_parallel_workers = 8;
 int			MaxBackends = 0;
 
+int			ChecksumCostPage = 10;	/* GUC parameters for checksum check */
+int			ChecksumCostLimit = 200;
+double		ChecksumCostDelay = 0;
+
+int			ChecksumCostBalance = 0;	/* working state for checksums check */
+bool		ChecksumCostActive = false;
+
 int			VacuumCostPageHit = 1;	/* GUC parameters for vacuum */
 int			VacuumCostPageMiss = 10;
 int			VacuumCostPageDirty = 20;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index de87ad6ef7..a18f85181d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -691,6 +691,8 @@ const char *const config_group_names[] =
 	gettext_noop("Resource Usage / Disk"),
 	/* RESOURCES_KERNEL */
 	gettext_noop("Resource Usage / Kernel Resources"),
+	/* RESOURCES_CHECKSUM_DELAY */
+	gettext_noop("Resource Usage / Cost-Based Checksum Verification Delay"),
 	/* RESOURCES_VACUUM_DELAY */
 	gettext_noop("Resource Usage / Cost-Based Vacuum Delay"),
 	/* RESOURCES_BGWRITER */
@@ -2385,6 +2387,26 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"checksum_cost_page", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost for verifying a page."),
+			NULL
+		},
+		&ChecksumCostPage,
+		10, 0, 10000,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"checksum_cost_limit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost amount available before napping."),
+			NULL
+		},
+		&ChecksumCostLimit,
+		200, 1, 10000,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_cost_page_hit", PGC_USERSET, RESOURCES_VACUUM_DELAY,
 			gettext_noop("Vacuum cost for a page found in the buffer cache."),
@@ -3585,6 +3607,17 @@ static struct config_real ConfigureNamesReal[] =
 		check_random_seed, assign_random_seed, show_random_seed
 	},
 
+	{
+		{"checksum_cost_delay", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost delay in milliseconds."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&ChecksumCostDelay,
+		0, 0, 100,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_cost_delay", PGC_USERSET, RESOURCES_VACUUM_DELAY,
 			gettext_noop("Vacuum cost delay in milliseconds."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9cb571f7cc..0b45a3a9eb 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -160,6 +160,12 @@
 #max_files_per_process = 1000		# min 64
 					# (change requires restart)
 
+# - Cost-Based Checksum Verification Delay -
+
+#checksum_cost_delay = 0			# 0-100 milliseconds (0 disables)
+#checksum_cost_page = 10		# 0-10000 credits
+#checksum_cost_limit = 200		# 1-10000 credits
+
 # - Cost-Based Vacuum Delay -
 
 #vacuum_cost_delay = 0			# 0-100 milliseconds (0 disables)
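As an illustration (not a recommendation), enabling throttling while keeping the default cost settings would make a check nap roughly 2ms after every 20 pages:

```
checksum_cost_delay = 2ms	# 0 disables throttling
# with checksum_cost_page = 10 and checksum_cost_limit = 200 (the defaults),
# the cost balance reaches the limit after 20 pages, triggering a ~2ms nap
```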
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 687509ba92..e0afc7a1f0 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10938,6 +10938,22 @@
   proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
   proargnames => '{tablespace,name,size,modification}',
   prosrc => 'pg_ls_tmpdir_1arg' },
+{ oid => '9147', descr => 'check data integrity for all forks of a relation',
+  proname => 'pg_check_relation', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass',
+  proallargtypes => '{regclass,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,o,o,o,o,o}',
+  proargnames => '{relation,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation' },
+{ oid => '9148', descr => 'check data integrity for one fork of a relation',
+  proname => 'pg_check_relation', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,i,o,o,o,o,o}',
+  proargnames => '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation_fork' },
 
 # hash partitioning constraint function
 { oid => '5028', descr => 'hash partition CHECK constraint',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e3352398..44c473995f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -247,6 +247,13 @@ extern PGDLLIMPORT double hash_mem_multiplier;
 extern PGDLLIMPORT int maintenance_work_mem;
 extern PGDLLIMPORT int max_parallel_maintenance_workers;
 
+extern int	ChecksumCostPage;
+extern int	ChecksumCostLimit;
+extern double ChecksumCostDelay;
+
+extern int	ChecksumCostBalance;
+extern bool ChecksumCostActive;
+
 extern int	VacuumCostPageHit;
 extern int	VacuumCostPageMiss;
 extern int	VacuumCostPageDirty;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0dfbac46b4..b9ff8392b8 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -902,7 +902,8 @@ typedef enum
 	WAIT_EVENT_PG_SLEEP,
 	WAIT_EVENT_RECOVERY_APPLY_DELAY,
 	WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL,
-	WAIT_EVENT_VACUUM_DELAY
+	WAIT_EVENT_VACUUM_DELAY,
+	WAIT_EVENT_CHECK_DELAY
 } WaitEventTimeout;
 
 /* ----------
diff --git a/src/include/utils/checksumfuncs.h b/src/include/utils/checksumfuncs.h
new file mode 100644
index 0000000000..f3e13b64b1
--- /dev/null
+++ b/src/include/utils/checksumfuncs.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * checksumfuncs.h
+ *	  Checksum verification implementation for data pages.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/checksumfuncs.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef CHECKSUMFUNCS_H
+#define CHECKSUMFUNCS_H
+
+#include "postgres.h"
+
+#include "access/tupdesc.h"
+#include "common/relpath.h"
+#include "utils/relcache.h"
+#include "utils/tuplestore.h"
+
+/*
+ * A zero checksum can never be computed, see pg_checksum_page().
+ */
+#define NoComputedChecksum	0
+
+extern bool check_one_block(Relation relation, ForkNumber forknum,
+							BlockNumber blkno, uint16 *chk_expected,
+							uint16 *chk_found);
+
+#endif							/* CHECKSUMFUNCS_H */
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 04431d0eb2..4ed2a09600 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -62,6 +62,7 @@ enum config_group
 	RESOURCES_MEM,
 	RESOURCES_DISK,
 	RESOURCES_KERNEL,
+	RESOURCES_CHECKSUM_DELAY,
 	RESOURCES_VACUUM_DELAY,
 	RESOURCES_BGWRITER,
 	RESOURCES_ASYNCHRONOUS,
diff --git a/src/test/Makefile b/src/test/Makefile
index efb206aa75..832868099f 100644
--- a/src/test/Makefile
+++ b/src/test/Makefile
@@ -12,7 +12,8 @@ subdir = src/test
 top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS = perl regress isolation modules authentication recovery subscription
+SUBDIRS = perl regress isolation modules authentication check_relation \
+	  recovery subscription
 
 # Test suites that are not safe by default but can be run if selected
 # by the user via the whitespace-separated list in variable
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index a6d2ffbf9e..a845af71fd 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -7,6 +7,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = \
 		  brin \
 		  commit_ts \
+		  check_relation \
 		  delay_execution \
 		  dummy_index_am \
 		  dummy_seclabel \
diff --git a/src/test/modules/check_relation/.gitignore b/src/test/modules/check_relation/.gitignore
new file mode 100644
index 0000000000..871e943d50
--- /dev/null
+++ b/src/test/modules/check_relation/.gitignore
@@ -0,0 +1,2 @@
+# Generated by test suite
+/tmp_check/
diff --git a/src/test/modules/check_relation/Makefile b/src/test/modules/check_relation/Makefile
new file mode 100644
index 0000000000..a540cdece2
--- /dev/null
+++ b/src/test/modules/check_relation/Makefile
@@ -0,0 +1,14 @@
+# src/test/modules/check_relation/Makefile
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/check_relation
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/check_relation/README b/src/test/modules/check_relation/README
new file mode 100644
index 0000000000..415c4b21ad
--- /dev/null
+++ b/src/test/modules/check_relation/README
@@ -0,0 +1,23 @@
+src/test/modules/check_relation/README
+
+Regression tests for online checksums verification
+==================================================
+
+This directory contains a test suite for online checksums verification.
+
+Running the tests
+=================
+
+NOTE: You must have given the --enable-tap-tests argument to configure.
+
+Run
+    make check
+or
+    make installcheck
+You can use "make installcheck" if you previously did "make install".
+In that case, the code in the installation tree is tested.  With
+"make check", a temporary installation tree is built from the current
+sources and then tested.
+
+Either way, this test initializes, starts, and stops a test Postgres
+cluster.
diff --git a/src/test/modules/check_relation/t/001_checksums_check.pl b/src/test/modules/check_relation/t/001_checksums_check.pl
new file mode 100644
index 0000000000..2a3f2880ea
--- /dev/null
+++ b/src/test/modules/check_relation/t/001_checksums_check.pl
@@ -0,0 +1,276 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 59;
+
+our $CHECKSUM_UINT16_OFFSET = 4;
+our $PD_UPPER_UINT16_OFFSET = 7;
+our $BLOCKSIZE;
+our $TOTAL_NB_ERR = 0;
+
+sub get_block
+{
+	my ($filename, $blkno) = @_;
+	my $block;
+
+	open(my $infile, '<', $filename) or die;
+	binmode($infile);
+
+	# The 4-argument form of read() takes an offset into the scalar, not a
+	# file position, so seek to the wanted block explicitly.
+	seek($infile, ($blkno * $BLOCKSIZE), 0) or die($!);
+	my $success = read($infile, $block, $BLOCKSIZE);
+	die($!) if not defined $success;
+
+	close($infile);
+
+	return($block);
+}
+
+sub overwrite_block
+{
+	my ($filename, $block, $blkno) = @_;
+
+	# Open for update: '>' would truncate the file, and the offset argument
+	# of syswrite() is an offset into the scalar, not a file position, so
+	# sysseek to the wanted block instead.
+	open(my $outfile, '+<', $filename) or die;
+	binmode ($outfile);
+
+	sysseek($outfile, ($blkno * $BLOCKSIZE), 0) or die($!);
+	my $nb = syswrite($outfile, $block, $BLOCKSIZE);
+
+	die($!) if not defined $nb;
+	die("Write error") if ($nb != $BLOCKSIZE);
+
+	close($outfile);
+}
+
+sub get_uint16_from_page
+{
+	my ($block, $offset) = @_;
+
+	return (unpack("S*", $block))[$offset];
+}
+
+sub set_uint16_to_page
+{
+	my ($block, $data, $offset) = @_;
+
+	my $pack = pack("S", $data);
+
+	# vec() with 16-bit or larger chunks won't preserve endianness
+	vec($block, 2*$offset, 8) = (unpack('C*', $pack))[0];
+	vec($block, (2*$offset) + 1, 8) = (unpack('C*', $pack))[1];
+
+	return $block;
+}
+
+sub check_checksums_call
+{
+	my ($node, $relname) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM pg_catalog.pg_check_relation('$relname')"
+	);
+
+	 return ($stderr eq '');
+}
+
+sub check_checksums_nb_error
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+        . " FROM (SELECT pg_catalog.pg_check_relation(oid, 'main')"
+        . "   FROM pg_class WHERE relkind in ('r', 'i', 'm')) AS s"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+	is($stdout, $nb, "Should have $nb error");
+
+	$TOTAL_NB_ERR += $nb;
+}
+
+sub check_pg_stat_database_nb_error
+{
+	my ($node) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT "
+		. " sum(checksum_failures)"
+		. " FROM pg_catalog.pg_stat_database"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	is($stderr, '', 'Function should run successfully');
+	is($stdout, $TOTAL_NB_ERR, "Should have $TOTAL_NB_ERR error");
+}
+
+sub get_checksums_errors
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+		. " relid::regclass::text, forknum, failed_blocknum,"
+		. " expected_checksum, found_checksum"
+		. " FROM (SELECT (pg_catalog.pg_check_relation(oid)).*"
+        . "   FROM pg_class WHERE relkind in ('r','i', 'm')) AS s"
+	);
+
+	 is($cmdret, '0', 'Function should run successfully');
+	 like($stderr, $pattern, 'Error output should match expectations');
+
+	 $TOTAL_NB_ERR += $nb;
+
+	 return $stdout;
+}
+
+# This function will perform various test by modifying the specified block at
+# the specified uint16 offset, checking that the corruption is correctly
+# detected, and finally restore the specified block to its original content.
+sub corrupt_and_test_block
+{
+	my ($node, $filename, $blkno, $offset, $fake_data) = @_;
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+
+	check_pg_stat_database_nb_error($node);
+
+	$node->stop();
+
+	my $original_block = get_block($filename, 0);
+	my $original_data = get_uint16_from_page($original_block, $offset);
+
+	isnt($original_data, $fake_data,
+		"The fake data at offset $offset should be different"
+		. " from the existing one");
+
+	my $new_block = set_uint16_to_page($original_block, $fake_data, $offset);
+	isnt($original_data, get_uint16_from_page($new_block, $offset),
+		"The fake data at offset $offset should have been changed in memory");
+
+	overwrite_block($filename, $new_block, 0);
+
+	my $written_data = get_uint16_from_page(get_block($filename, 0), $offset);
+	isnt($original_data, $written_data,
+		"The data written at offset $offset should be different"
+		. " from the original one");
+	is(get_uint16_from_page($new_block, $offset), $written_data,
+		"The data written at offset $offset should be the same"
+		. " as the one in memory");
+	is($written_data, $fake_data,
+		"The data written at offset $offset should be the one"
+		. "	we wanted to write");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 1, qr/invalid page in block $blkno/);
+
+	my $expected_checksum;
+	my $found_checksum = get_uint16_from_page($new_block,
+		$CHECKSUM_UINT16_OFFSET);
+	if ($offset == $PD_UPPER_UINT16_OFFSET)
+	{
+		# A checksum can't be computed if it's detected as PageIsNew(), so the
+		# function returns NULL for the computed checksum
+		$expected_checksum = '';
+	}
+	else
+	{
+		$expected_checksum = get_uint16_from_page($original_block,
+			$CHECKSUM_UINT16_OFFSET);
+	}
+
+	my $det = get_checksums_errors($node, 1, qr/invalid page in block $blkno/);
+	is($det, "t1|0|0|$expected_checksum|$found_checksum",
+		"The checksums error for modification at offset $offset"
+		. " should be detected");
+
+	$node->stop();
+
+	$new_block = set_uint16_to_page($original_block, $original_data, $offset);
+	is($original_data, get_uint16_from_page($new_block, $offset),
+		"The data at offset $offset should have been restored in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+	is($original_data, get_uint16_from_page(get_block($filename, $blkno),
+			$offset),
+		"The data at offset $offset should have been restored on disk");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+}
+
+if (exists $ENV{MY_PG_REGRESS})
+{
+	$ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS};
+}
+
+my $node = get_new_node('main');
+
+my %params;
+$params{'extra'} = ['--data-checksums'];
+$node->init(%params);
+
+$node->start();
+
+$ENV{PGOPTIONS} = '--client-min-messages=WARNING';
+
+my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_checksums')");
+
+is($stdout, 'on', 'Data checksums should be enabled');
+
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('block_size')");
+
+$BLOCKSIZE = $stdout;
+
+$node->safe_psql(
+	'postgres', q|
+	CREATE TABLE public.t1(id integer);
+	CREATE INDEX t1_id_idx ON public.t1 (id);
+	INSERT INTO public.t1 SELECT generate_series(1, 100);
+	CREATE VIEW public.v1 AS SELECT * FROM t1;
+	CREATE MATERIALIZED VIEW public.mv1 AS SELECT * FROM t1;
+	CREATE SEQUENCE public.s1;
+	CREATE UNLOGGED TABLE public.u_t1(id integer);
+	CREATE INDEX u_t1_id_idx ON public.u_t1 (id);
+	INSERT INTO public.u_t1 SELECT generate_series(1, 100);
+	CHECKPOINT;
+|);
+
+# Check sane behavior on various object types, including those that don't
+# have storage.
+is(check_checksums_call($node, 't1'), '1', 'Can check a table');
+is(check_checksums_call($node, 't1_id_idx'), '1', 'Can check an index');
+is(check_checksums_call($node, 'v1'), '', 'Cannot check a view');
+is(check_checksums_call($node, 'mv1'), '1', 'Can check a materialized view');
+is(check_checksums_call($node, 's1'), '1', 'Can check a sequence');
+is(check_checksums_call($node, 'u_t1'), '1', 'Can check an unlogged table');
+is(check_checksums_call($node, 'u_t1_id_idx'), '1', 'Can check an unlogged index');
+
+# get the underlying heap absolute path
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_directory') || '/' || pg_relation_filepath('t1')"
+);
+
+isnt($stdout, '', 'A relfilenode should be returned');
+
+my $filename = $stdout;
+
+check_checksums_nb_error($node, 0, qr/^$/);
+
+check_pg_stat_database_nb_error($node);
+
+my $fake_uint16 = hex '0x0000';
+
+# Test with a modified checksum.  We use a zero checksum here as it's the only
+# one that cannot exist on a checksummed page.  We also don't have an easy way
+# to compute what the checksum would be after a modification in a random place
+# in the block.
+corrupt_and_test_block($node, $filename, 0, $CHECKSUM_UINT16_OFFSET,
+	$fake_uint16);
+
+# Test corruption making the block look like it's PageIsNew().
+corrupt_and_test_block($node, $filename, 0, $PD_UPPER_UINT16_OFFSET,
+	$fake_uint16);
-- 
2.20.1
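The corruption tests above poke a uint16 at a given offset in a raw block image via get_uint16_from_page()/set_uint16_to_page(), helpers defined earlier in the test file (not shown in this excerpt). Their intended semantics — assuming, as the variable names suggest, that offsets are counted in 16-bit units, so pd_checksum at byte 8 of the page header lives at uint16 offset 4 — can be sketched in C (illustrative only, not patch code):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative equivalents of the TAP test's page helpers.  Offsets are
 * assumed to be counted in 16-bit units.  memcpy() is used so the sketch
 * has no alignment concerns; byte order is the platform's, which is fine
 * because the test always reads back what it wrote.
 */
static uint16_t
get_uint16_from_page(const unsigned char *page, size_t uint16_offset)
{
	uint16_t	val;

	memcpy(&val, page + uint16_offset * sizeof(uint16_t), sizeof(val));
	return val;
}

static void
set_uint16_to_page(unsigned char *page, uint16_t val, size_t uint16_offset)
{
	memcpy(page + uint16_offset * sizeof(uint16_t), &val, sizeof(val));
}
```

The test then overwrites one such value on disk, checks that the corruption is reported, and writes the original bytes back.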

#40Michael Paquier
michael@paquier.xyz
In reply to: Julien Rouhaud (#39)
Re: Online checksums verification in the backend

On Wed, Sep 09, 2020 at 11:25:24AM +0200, Julien Rouhaud wrote:

I assumed that the odds of having to check the buffer twice were low enough, and
that avoiding holding a bufmapping lock while doing some IO was an uncontroversial
enough optimisation, but maybe that's only wishful thinking.

Perhaps it is worth it, but it would be good to make sure of that and see
whether it's actually worth the extra complication.  This also depends on
whether the page is in the OS cache when it is not in shared buffers,
meaning that smgrread() is needed to fetch the page to check.  I would
be more curious to see if there is an actual difference when the page is
in the OS cache.

I'll do some benchmarking and see if I can get some figures, but it'll probably
take some time. In the meantime I'm attaching v11 of the patch that should
address all other comments.

Thanks.

Another thing that was itching me is the introduction of 3 GUCs with
one new category for the sake of two SQL functions. For VACUUM we
have many things relying on the GUC delays, with autovacuum and manual
vacuum. Perhaps it would make sense to have these if we have some day
a background worker doing checksum verifications, still that could
perfectly be in contrib/, and that would be kind of hard to tune as
well.  The patch enabling checksums on-the-fly could also be a good
enough reason.  Another thing we could consider is to pass down those
parameters as function arguments, at the cost of not being able to
reload them.
--
Michael
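For context, the nap length those GUCs control — documented later in the patch as checksum_cost_delay * accumulated_balance / checksum_cost_limit, capped at checksum_cost_delay * 4 — amounts to the arithmetic below; the function name is made up for illustration:

```c
#include <assert.h>

/*
 * Sketch of the cost-based throttling arithmetic from the patch's
 * check_delay_point(): no nap until the accumulated cost balance reaches
 * the limit, then sleep delay * balance / limit milliseconds, capped at
 * four times the configured delay.
 */
static int
checksum_delay_msec(double cost_delay, int cost_balance, int cost_limit)
{
	int			msec;

	if (cost_delay <= 0 || cost_balance < cost_limit)
		return 0;				/* throttling disabled or limit not reached */

	msec = (int) (cost_delay * cost_balance / cost_limit);
	if (msec > (int) (cost_delay * 4))
		msec = (int) (cost_delay * 4);

	return msec;
}
```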

#41Julien Rouhaud
rjuju123@gmail.com
In reply to: Michael Paquier (#40)
Re: Online checksums verification in the backend

On Wed, Sep 9, 2020 at 2:37 PM Michael Paquier <michael@paquier.xyz> wrote:

Another thing that was itching me is the introduction of 3 GUCs with
one new category for the sake of two SQL functions. For VACUUM we
have many things relying on the GUC delays, with autovacuum and manual
vacuum. Perhaps it would make sense to have these if we have some day
a background worker doing checksum verifications, still that could
perfectly be in contrib/, and that would be kind of hard to tune as
well. The patch enabling checksums on-the-fly could also be a reason
good enough. Another thing we could consider is to pass down those
parameters as function arguments, at the cost of not being able to
reload them.

I'm not terribly happy with adding that for now, but it's quite clear
that there'll eventually be a lot of new stuff added that will benefit
from either the category or the GUC. FTR once we reach an agreement
on how to do this check (I'm wondering if it'll stay an SQL function
or become a plain backend command, in which case GUCs would be
mandatory), I'll also be happy to work on a background worker to help
people running the check regularly. So in my opinion it's better to
add them now so we won't have to change the SQL function definition
later when other facilities rely on them.
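The optimistic strategy debated upthread — read the block without holding the buffer mapping lock first, and only re-read under the lock when verification fails — can be modeled with a toy example (purely illustrative; every name here is invented and the "lock" is simulated):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_BLCKSZ 16

/* The stable on-disk content a consistent read should observe. */
static const char good_page[TOY_BLCKSZ] = "ABCDEFGHIJKLMNO";

static int	unlocked_reads = 0;

/*
 * Simulated block read: the first unlocked read returns a torn page,
 * standing in for a concurrent write racing with our read.  A "locked"
 * read always observes consistent content.
 */
static void
read_page(char *buf, bool locked)
{
	memcpy(buf, good_page, TOY_BLCKSZ);
	if (!locked && unlocked_reads++ == 0)
		buf[0] = 'X';			/* torn first read */
}

static bool
page_is_valid(const char *buf)
{
	return memcmp(buf, good_page, TOY_BLCKSZ) == 0;
}

/*
 * Optimistic verification: try the cheap unlocked read first; only on
 * failure pay for a locked re-read before reporting corruption.
 */
static bool
check_block(void)
{
	char		buf[TOY_BLCKSZ];

	read_page(buf, false);
	if (page_is_valid(buf))
		return true;

	read_page(buf, true);
	return page_is_valid(buf);
}
```

In the patch itself the failure path sets force_lock and re-reads while holding the buffer mapping partition lock (see check_one_block()); the toy merely shows why the common case pays no locking cost.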

#42Julien Rouhaud
rjuju123@gmail.com
In reply to: Julien Rouhaud (#39)
1 attachment(s)
Re: Online checksums verification in the backend

On Wed, Sep 09, 2020 at 11:25:29AM +0200, Julien Rouhaud wrote:

I'll do some benchmarking and see if I can get some figures, but it'll probably
take some time. In the meantime I'm attaching v11 of the patch that should
address all other comments.

I just realized that I forgot to update one of the Makefiles when moving the TAP
test folder. v12 attached should fix that.

Attachments:

v12-0001-Add-a-pg_check_relation-function.patchtext/plain; charset=us-asciiDownload
From 84d0a4c5ec5d6b6f832fad02d421c204f1bee98b Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <jrouhaud@vmware.com>
Date: Mon, 4 Nov 2019 08:40:23 +0100
Subject: [PATCH v12] Add a pg_check_relation() function.

This function checks the validity of the checksums for all non-dirty blocks of
a given relation, and optionally a given fork, and returns the list of all
blocks that don't match, along with the expected and found checksums.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada, Justin Pryzby
Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  85 +++++
 doc/src/sgml/func.sgml                        |  51 +++
 src/backend/postmaster/pgstat.c               |   3 +
 src/backend/storage/page/checksum.c           | 322 +++++++++++++++++-
 src/backend/utils/adt/Makefile                |   1 +
 src/backend/utils/adt/checksumfuncs.c         | 218 ++++++++++++
 src/backend/utils/init/globals.c              |   7 +
 src/backend/utils/misc/guc.c                  |  33 ++
 src/backend/utils/misc/postgresql.conf.sample |   6 +
 src/include/catalog/pg_proc.dat               |  16 +
 src/include/miscadmin.h                       |   7 +
 src/include/pgstat.h                          |   3 +-
 src/include/utils/checksumfuncs.h             |  31 ++
 src/include/utils/guc_tables.h                |   1 +
 src/test/modules/Makefile                     |   1 +
 src/test/modules/check_relation/.gitignore    |   2 +
 src/test/modules/check_relation/Makefile      |  14 +
 src/test/modules/check_relation/README        |  23 ++
 .../check_relation/t/001_checksums_check.pl   | 276 +++++++++++++++
 19 files changed, 1096 insertions(+), 4 deletions(-)
 create mode 100644 src/backend/utils/adt/checksumfuncs.c
 create mode 100644 src/include/utils/checksumfuncs.h
 create mode 100644 src/test/modules/check_relation/.gitignore
 create mode 100644 src/test/modules/check_relation/Makefile
 create mode 100644 src/test/modules/check_relation/README
 create mode 100644 src/test/modules/check_relation/t/001_checksums_check.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c4ba49ffaf..b7629fde60 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2137,6 +2137,91 @@ include_dir 'conf.d'
      </note>
     </sect2>
 
+    <sect2 id="runtime-config-resource-checksum-verification-cost">
+     <title>Cost-based Checksum Verification Delay</title>
+
+     <para>
+      During the execution of <xref linkend="functions-check-relation-note"/>
+      function, the system maintains an internal counter that keeps track of
+      the estimated cost of the various I/O operations that are performed.
+      When the accumulated cost reaches a limit (specified by
+      <varname>checksum_cost_limit</varname>), the process performing the
+      operation will sleep for a short period of time, as specified by
+      <varname>checksum_cost_delay</varname>. Then it will reset the counter
+      and continue execution.
+     </para>
+
+     <para>
+      This feature is disabled by default. To enable it, set the
+      <varname>checksum_cost_delay</varname> variable to a nonzero
+      value.
+     </para>
+
+     <variablelist>
+      <varlistentry id="guc-checksum-cost-delay" xreflabel="checksum_cost_delay">
+       <term><varname>checksum_cost_delay</varname> (<type>floating point</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_delay</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The amount of time that the process will sleep
+         when the cost limit has been exceeded.
+         If this value is specified without units, it is taken as milliseconds.
+         The default value is zero, which disables the cost-based checksum
+         verification delay feature.  Positive values enable cost-based
+         checksum verification.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-checksum-cost-page" xreflabel="checksum_cost_page">
+       <term><varname>checksum_cost_page</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_page</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The estimated cost for verifying a buffer, whether it's found in the
+         shared buffer cache or not. It represents the cost to lock the buffer
+         pool, lookup the shared hash table, read the content of the page from
+         disk and compute its checksum.  The default value is 10.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-checksum-cost-limit" xreflabel="checksum_cost_limit">
+       <term><varname>checksum_cost_limit</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_limit</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The accumulated cost that will cause the verification process to sleep.
+         The default value is 200.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
+
+     <note>
+      <para>
+       There are certain operations that hold critical locks and should
+       therefore complete as quickly as possible.  Cost-based checksum
+       verification delays do not occur during such operations.  Therefore it
+       is possible that the cost accumulates far higher than the specified
+       limit.  To avoid uselessly long delays in such cases, the actual delay
+       is calculated as <varname>checksum_cost_delay</varname> *
+       <varname>accumulated_balance</varname> /
+       <varname>checksum_cost_limit</varname> with a maximum of
+       <varname>checksum_cost_delay</varname> * 4.
+      </para>
+     </note>
+    </sect2>
+
     <sect2 id="runtime-config-resource-background-writer">
      <title>Background Writer</title>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index e2e618791e..bcc97d1306 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26221,6 +26221,57 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
 
   </sect2>
 
+  <sect2 id="functions-data-sanity">
+   <title>Data Sanity Functions</title>
+
+   <para>
+    The functions shown in <xref linkend="functions-data-sanity-table"/>
+    provide a means to check the health of data files in a cluster.
+   </para>
+
+   <table id="functions-data-sanity-table">
+    <title>Data Sanity Functions</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry>
+        <literal><function>pg_check_relation(<parameter>relation</parameter> <type>regclass</type> [, <parameter>fork</parameter> <type>text</type>])</function></literal>
+       </entry>
+       <entry><type>setof record</type></entry>
+       <entry>Validate the checksum for all blocks of a relation.
+       </entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <indexterm>
+    <primary>pg_check_relation</primary>
+   </indexterm>
+   <para id="functions-check-relation-note" xreflabel="pg_check_relation">
+    <function>pg_check_relation</function> iterates over all blocks of a
+    given relation and verifies their checksums.  If passed,
+    <replaceable>fork</replaceable> specifies that only checksums of the given
+    fork are to be verified.  Fork should be <literal>'main'</literal> for the
+    main data fork, <literal>'fsm'</literal> for the free space map,
+    <literal>'vm'</literal> for the visibility map, or
+    <literal>'init'</literal> for the initialization fork.
+    The function returns a list of blocks for which the computed and stored
+    checksums don't match.  See <xref
+    linkend="runtime-config-resource-checksum-verification-cost"/> for
+    information on how to configure cost-based verification delay.  You must be
+    a member of the <literal>pg_read_all_stats</literal> role to use this
+    function.  It can only be used if data checksums are enabled.  See <xref
+    linkend="app-initdb-data-checksums"/> for more information.
+   </para>
+
+  </sect2>
+
   </sect1>
 
   <sect1 id="functions-trigger">
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5f4b168fd1..64664c0bc6 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3912,6 +3912,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
 		case WAIT_EVENT_VACUUM_DELAY:
 			event_name = "VacuumDelay";
 			break;
+		case WAIT_EVENT_CHECK_DELAY:
+			event_name = "CheckDelay";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/storage/page/checksum.c b/src/backend/storage/page/checksum.c
index e010691c9f..dbdc105060 100644
--- a/src/backend/storage/page/checksum.c
+++ b/src/backend/storage/page/checksum.c
@@ -15,8 +15,324 @@
 
 #include "storage/checksum.h"
 /*
- * The actual code is in storage/checksum_impl.h.  This is done so that
- * external programs can incorporate the checksum code by #include'ing
- * that file from the exported Postgres headers.  (Compare our CRC code.)
+ * The actual checksum computation code is in storage/checksum_impl.h.  This
+ * is done so that external programs can incorporate the checksum code by
+ * #include'ing that file from the exported Postgres headers.  (Compare our
+ * CRC code.)
  */
 #include "storage/checksum_impl.h"
+
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/buf_internals.h"
+#include "storage/lmgr.h"
+#include "utils/checksumfuncs.h"
+
+/* ----------------
+ * The rest of this module provides a set of functions that can be used to
+ * safely check all checksums on a running cluster.
+ *
+ * Please note that those only perform standard buffered reads, and don't try
+ * to bypass or discard the operating system cache.  If you want to check the
+ * actual storage, you have to discard the operating system cache before
+ * running those functions.
+ *
+ * To avoid torn pages and possible false positives when reading data, and to
+ * keep overhead as low as possible, the following heuristics are used:
+ *
+ * - a shared LWLock is taken on the target buffer mapping partition, and
+ *   we detect if a block is in shared_buffers or not.  See check_get_buffer()
+ *   comments for more details about the locking strategy.
+ *
+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
+ *
+ * - if a block is otherwise found in shared_buffers, an IO lock is taken on
+ *   the block and the block is then read from storage, ignoring the block in
+ *   shared_buffers
+ *
+ * - if a block is not found in shared_buffers, the LWLock is released and the
+ *   block is read from disk without taking any lock.  If an error is detected,
+ *   the read block will be discarded and retrieved again while holding the
+ *   LWLock.  This is because an error due to concurrent write is possible but
+ *   very unlikely, so it's better to have an optimistic approach to limit
+ *   locking overhead
+ *
+ * The check can be performed using an SQL function, returning the list of
+ * problematic blocks.
+ * ----------------
+ */
+
+static bool check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+							   uint16 *chk_found);
+static void check_delay_point(void);
+static bool check_get_buffer(Relation relation, ForkNumber forknum,
+							 BlockNumber blkno, char *buffer, bool needlock,
+							 bool *found_in_sb);
+
+/*
+ * Check data sanity for a specific block in the given fork of the given
+ * relation, always retrieved locally with smgrread even if a version exists in
+ * shared_buffers.  Returns false if the block appears to be corrupted, true
+ * otherwise.  Note that dirty and invalid blocks won't be checked.  Caller
+ * must hold at least an AccessShareLock on the relation.
+ */
+bool
+check_one_block(Relation relation, ForkNumber forknum, BlockNumber blkno,
+				uint16 *chk_expected, uint16 *chk_found)
+{
+	char		buffer[BLCKSZ];
+	bool		force_lock = false;
+	bool		found_in_sb;
+
+	Assert(CheckRelationLockedByMe(relation, AccessShareLock, true));
+	Assert(blkno < RelationGetNumberOfBlocksInFork(relation, forknum));
+	Assert(smgrexists(relation->rd_smgr, forknum));
+
+	*chk_expected = *chk_found = NoComputedChecksum;
+
+	/*
+	 * To avoid excessive overhead, the buffer will be first read without
+	 * the locks that would prevent false positives, as such
+	 * events should be quite rare.
+	 */
+Retry:
+	if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+						  &found_in_sb))
+		return true;
+
+	if (check_buffer(buffer, blkno, chk_expected, chk_found))
+		return true;
+
+	/*
+	 * If we get a failure and the buffer wasn't found in shared buffers,
+	 * reread the buffer with suitable lock to avoid false positive.  See
+	 * check_get_buffer for more details.
+	 */
+	if (!found_in_sb && !force_lock)
+	{
+		force_lock = true;
+		goto Retry;
+	}
+
+	/* A corruption is detected. */
+	return false;
+}
+
+/*
+ * Perform a checksum check on the passed page.  Return true if the page is
+ * valid, false otherwise, assigning the expected and found checksums to
+ * chk_expected and chk_found.  Note that a page can look new but could be
+ * the result of corruption.  We still check for this case, but we can't
+ * compute its checksum as pg_checksum_page() is explicitly checking for
+ * non-new pages, so NoComputedChecksum will be set in chk_found.
+ */
+static bool
+check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+				   uint16 *chk_found)
+{
+	Page		page = (Page) buffer;
+	PageHeader	hdr = (PageHeader) page;
+
+	Assert(chk_expected && chk_found);
+
+	if (PageIsNew(page))
+	{
+		/*
+		 * Check if the page is really new or if there's corruption that
+		 * affected PageIsNew detection.  Note that PageIsVerified won't try to
+		 * detect checksum corruption in this case, so there's no risk of
+		 * duplicated corruption report.
+		 */
+		if (PageIsVerified(page, blkno))
+		{
+			/* No corruption. */
+			return true;
+		}
+
+		/*
+		 * There's corruption, but since this affects PageIsNew, we
+		 * can't compute a checksum, so set NoComputedChecksum for the
+		 * expected checksum.
+		 */
+		*chk_expected = NoComputedChecksum;
+		*chk_found = hdr->pd_checksum;
+		return false;
+	}
+
+	*chk_expected = pg_checksum_page(buffer, blkno);
+	*chk_found = hdr->pd_checksum;
+
+	return (*chk_expected == *chk_found);
+}
+
+/*
+ * Check for interrupts and cost-based delay.
+ */
+static void
+check_delay_point(void)
+{
+	/* Always check for interrupts */
+	CHECK_FOR_INTERRUPTS();
+
+	if (!ChecksumCostActive || InterruptPending)
+		return;
+
+	/* Nap if appropriate */
+	if (ChecksumCostBalance >= ChecksumCostLimit)
+	{
+		int			msec;
+
+		msec = ChecksumCostDelay * ChecksumCostBalance / ChecksumCostLimit;
+		if (msec > ChecksumCostDelay * 4)
+			msec = ChecksumCostDelay * 4;
+
+		pgstat_report_wait_start(WAIT_EVENT_CHECK_DELAY);
+		pg_usleep(msec * 1000L);
+		pgstat_report_wait_end();
+
+		ChecksumCostBalance = 0;
+
+		/* Might have gotten an interrupt while sleeping */
+		CHECK_FOR_INTERRUPTS();
+	}
+}
+
+/*
+ *-------------------------
+ * Safely read the wanted buffer from disk, dealing with possible concurrency
+ * issues.  Note that if a buffer is found dirty in shared_buffers, no read will
+ * be performed and the caller will be informed that no check should be done.
+ * We can safely ignore such buffers as they'll be written before the next
+ * checkpoint's completion.
+ *
+ * The following locks can be used in this function:
+ *
+ *   - shared LWLock on the target buffer mapping partition.
+ *   - IOLock on the buffer
+ *
+ * The IOLock is taken when reading the buffer from disk if it exists in
+ * shared_buffers, to avoid torn pages.
+ *
+ * If the buffer isn't in shared_buffers, it'll be read from disk without any
+ * lock unless caller asked otherwise, setting needlock.  In this case, the
+ * read will be done while the buffer mapping partition LWLock is still being
+ * held.  Reading with this lock is to avoid the unlikely but possible case
+ * that a buffer wasn't present in shared buffers when we checked but was then
+ * allocated in shared_buffers, modified and flushed concurrently before we
+ * read it, leading to false positives due to a torn page.  Caller
+ * can first read the buffer without holding the target buffer mapping
+ * partition LWLock to have an optimistic approach, and reread the buffer
+ * from disk in case of error.
+ *
+ * Caller should hold an AccessShareLock on the Relation
+ *-------------------------
+ */
+static bool
+check_get_buffer(Relation relation, ForkNumber forknum,
+				 BlockNumber blkno, char *buffer, bool needlock,
+				 bool *found_in_sb)
+{
+	bool		checkit = true;
+	BufferTag	buf_tag;		/* identity of requested block */
+	uint32		buf_hash;		/* hash value for buf_tag */
+	LWLock	   *partLock;		/* buffer partition lock for the buffer */
+	BufferDesc *bufdesc;
+	int			buf_id;
+
+	*found_in_sb = false;
+
+	/* Check for interrupts and take throttling into account. */
+	check_delay_point();
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(buf_tag, relation->rd_smgr->smgr_rnode.node, forknum, blkno);
+
+	/* determine its hash code and partition lock ID */
+	buf_hash = BufTableHashCode(&buf_tag);
+	partLock = BufMappingPartitionLock(buf_hash);
+
+	/* see if the block is in the buffer pool already */
+	LWLockAcquire(partLock, LW_SHARED);
+	buf_id = BufTableLookup(&buf_tag, buf_hash);
+	if (buf_id >= 0)
+	{
+		uint32		buf_state;
+
+		*found_in_sb = true;
+
+		/*
+		 * Found it.  Now, retrieve its state to know what to do with it, and
+		 * release the buffer header lock immediately.  We do so to limit
+		 * overhead as much
+		 * as possible.  We'll keep the shared lightweight lock on the target
+		 * buffer mapping partition, so this buffer can't be evicted, and
+		 * we'll acquire an IOLock on the buffer if we need to read the
+		 * content on disk.
+		 */
+		bufdesc = GetBufferDescriptor(buf_id);
+
+		buf_state = LockBufHdr(bufdesc);
+		UnlockBufHdr(bufdesc, buf_state);
+
+		/*
+		 * Dirty pages are ignored as they'll be flushed soon. Invalid buffers
+		 * are also skipped.
+		 */
+		if ((buf_state & BM_DIRTY) || !(buf_state & BM_TAG_VALID))
+			checkit = false;
+
+		/*
+		 * Read the buffer from disk, taking an IO lock to prevent torn-page
+		 * reads, in the unlikely event that it was concurrently dirtied and
+		 * flushed.
+		 */
+		if (checkit)
+		{
+			LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
+			smgrread(relation->rd_smgr, forknum, blkno, buffer);
+			LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
+
+			/* Add a page cost. */
+			ChecksumCostBalance += ChecksumCostPage;
+		}
+	}
+	else if (needlock)
+	{
+		/*
+		 * Caller asked to read the buffer while we have a lock on the target
+		 * partition.
+		 */
+		smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+		/* The buffer will have to be checked. */
+		Assert(checkit);
+
+		/* Add a page cost. */
+		ChecksumCostBalance += ChecksumCostPage;
+	}
+
+	LWLockRelease(partLock);
+
+	if (*found_in_sb || needlock)
+		return checkit;
+
+	/* After this point the buffer will always be checked. */
+	Assert(checkit);
+
+	/*
+	 * Didn't find it in the buffer pool and didn't read it while holding the
+	 * buffer mapping partition lock.  We'll have to try to read it from
+	 * disk, after releasing the target partition lock to avoid excessive
+	 * overhead.  It means that it's possible to get a torn page later, so
+	 * we'll have to retry with a suitable lock in case of error to avoid
+	 * false positive.
+	 */
+	smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+	/* Add a page cost. */
+	ChecksumCostBalance += ChecksumCostPage;
+
+	return checkit;
+}
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index b4d55e849b..603f63afb6 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -22,6 +22,7 @@ OBJS = \
 	bool.o \
 	cash.o \
 	char.o \
+	checksumfuncs.o \
 	cryptohashes.o \
 	date.o \
 	datetime.o \
diff --git a/src/backend/utils/adt/checksumfuncs.c b/src/backend/utils/adt/checksumfuncs.c
new file mode 100644
index 0000000000..700831cbc7
--- /dev/null
+++ b/src/backend/utils/adt/checksumfuncs.c
@@ -0,0 +1,218 @@
+/*-------------------------------------------------------------------------
+ *
+ * checksumfuncs.c
+ *	  Functions for checksum-related features such as online verification
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/checksumfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/relation.h"
+#include "catalog/pg_authid_d.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/bufmgr.h"
+#include "storage/checksum.h"
+#include "storage/smgr.h"
+#include "utils/acl.h"
+#include "utils/checksumfuncs.h"
+#include "utils/rel.h"
+#include "utils/builtins.h"
+#include "utils/lsyscache.h"
+
+static void check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+					 Oid relid, ForkNumber single_forknum);
+static void check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+									  Relation relation, ForkNumber forknum);
+static void pg_check_relation_internal(FunctionCallInfo fcinfo, Oid relid,
+									   Oid forknum);
+
+
+Datum
+pg_check_relation(PG_FUNCTION_ARGS)
+{
+	Oid			relid = InvalidOid;
+
+	pg_check_relation_internal(fcinfo, relid, InvalidForkNumber);
+
+	return (Datum) 0;
+}
+
+Datum
+pg_check_relation_fork(PG_FUNCTION_ARGS)
+{
+	Oid			relid = InvalidOid;
+	const char *forkname;
+	ForkNumber	forknum;
+
+	forkname = TextDatumGetCString(PG_GETARG_TEXT_PP(1));
+	forknum = forkname_to_number(forkname);
+
+	pg_check_relation_internal(fcinfo, relid, forknum);
+
+	return (Datum) 0;
+}
+
+/* Common code for all versions of pg_check_relation() */
+static void
+pg_check_relation_internal(FunctionCallInfo fcinfo, Oid relid, Oid forknum)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+
+	if (!DataChecksumsEnabled())
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("data checksums are not enabled in cluster")));
+
+	if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_STAT_SCAN_TABLES))
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("only superuser or a member of the pg_stat_scan_tables role may use this function")));
+
+	/* Switch into long-lived context to construct returned data structures */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	relid = PG_GETARG_OID(0);
+
+	/* Set cost-based checksum verification delay */
+	ChecksumCostActive = (ChecksumCostDelay > 0);
+	ChecksumCostBalance = 0;
+
+	check_one_relation(tupdesc, tupstore, relid, forknum);
+
+	tuplestore_donestoring(tupstore);
+}
+
+/*
+ * Perform the check on a single relation, possibly filtered with a single
+ * fork.  This function will check if the given relation exists or not, as
+ * a relation could be dropped after checking for the list of relations and
+ * before getting here, and we don't want to error out in this case.
+ */
+static void
+check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+					 Oid relid, ForkNumber single_forknum)
+{
+	Relation	relation = NULL;
+	ForkNumber	forknum;
+
+	relation = relation_open(relid, AccessShareLock);
+
+	/* sanity checks */
+	if (!RELKIND_HAS_STORAGE(relation->rd_rel->relkind))
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("relation \"%s\" does not have storage to be checked",
+				 RelationGetRelationName(relation))));
+
+	if (RELATION_IS_OTHER_TEMP(relation))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot verify temporary tables of other sessions")));
+
+	RelationOpenSmgr(relation);
+
+	for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+	{
+		if (single_forknum != InvalidForkNumber && single_forknum != forknum)
+			continue;
+
+		if (smgrexists(relation->rd_smgr, forknum))
+			check_relation_fork(tupdesc, tupstore, relation, forknum);
+	}
+
+	relation_close(relation, AccessShareLock);	/* release the lock */
+}
+
+/*
+ * For a given relation and fork, do the real work of iterating over all pages
+ * and checking each one.  Caller must hold an AccessShareLock on the given
+ * relation.
+ */
+static void
+check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+						  Relation relation, ForkNumber forknum)
+{
+	BlockNumber blkno,
+				nblocks;
+
+	/*
+	 * We remember the number of blocks here.  Since caller must hold a lock on
+	 * the relation, we know that it won't be truncated while we're iterating
+	 * over the blocks.  Any block added after this function started won't be
+	 * checked, but this is out of scope as such pages will be flushed before
+	 * the next checkpoint's completion.
+	 */
+	nblocks = RelationGetNumberOfBlocksInFork(relation, forknum);
+
+#define PG_CHECK_RELATION_COLS			5	/* Number of output arguments in the SRF */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		uint16		chk_expected,
+					chk_found;
+		Datum		values[PG_CHECK_RELATION_COLS];
+		bool		nulls[PG_CHECK_RELATION_COLS];
+		int			i = 0;
+
+		/* Check the given buffer */
+		if (check_one_block(relation, forknum, blkno, &chk_expected,
+							&chk_found))
+		{
+			/* Buffer not corrupted or not worth checking, continue */
+			continue;
+		}
+
+		memset(values, 0, sizeof(values));
+		memset(nulls, 0, sizeof(nulls));
+
+		values[i++] = ObjectIdGetDatum(relation->rd_id);
+		values[i++] = Int32GetDatum(forknum);
+		values[i++] = UInt32GetDatum(blkno);
+		/*
+		 * This can happen if corruption makes the block appear to be
+		 * PageIsNew() even though it isn't actually a new page.
+		 */
+		if (chk_expected == NoComputedChecksum)
+			nulls[i++] = true;
+		else
+			values[i++] = UInt16GetDatum(chk_expected);
+		values[i++] = UInt16GetDatum(chk_found);
+
+		Assert(i == PG_CHECK_RELATION_COLS);
+
+		/* Report the failure to the stat collector and the logs. */
+		pgstat_report_checksum_failure();
+		ereport(WARNING,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("invalid page in block %u of relation %s",
+						blkno,
+						relpath(relation->rd_smgr->smgr_rnode, forknum))));
+
+		/* Save the corrupted blocks in the tuplestore. */
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 6ab8216839..68e6e0cfcb 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -135,6 +135,13 @@ int			max_worker_processes = 8;
 int			max_parallel_workers = 8;
 int			MaxBackends = 0;
 
+int			ChecksumCostPage = 10;	/* GUC parameters for checksum check */
+int			ChecksumCostLimit = 200;
+double		ChecksumCostDelay = 0;
+
+int			ChecksumCostBalance = 0;	/* working state for checksums check */
+bool		ChecksumCostActive = false;
+
 int			VacuumCostPageHit = 1;	/* GUC parameters for vacuum */
 int			VacuumCostPageMiss = 10;
 int			VacuumCostPageDirty = 20;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index de87ad6ef7..a18f85181d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -691,6 +691,8 @@ const char *const config_group_names[] =
 	gettext_noop("Resource Usage / Disk"),
 	/* RESOURCES_KERNEL */
 	gettext_noop("Resource Usage / Kernel Resources"),
+	/* RESOURCES_CHECKSUM_DELAY */
+	gettext_noop("Resource Usage / Cost-Based Checksum Verification Delay"),
 	/* RESOURCES_VACUUM_DELAY */
 	gettext_noop("Resource Usage / Cost-Based Vacuum Delay"),
 	/* RESOURCES_BGWRITER */
@@ -2385,6 +2387,26 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"checksum_cost_page", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost for verifying a page."),
+			NULL
+		},
+		&ChecksumCostPage,
+		10, 0, 10000,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"checksum_cost_limit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost amount available before napping."),
+			NULL
+		},
+		&ChecksumCostLimit,
+		200, 1, 10000,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_cost_page_hit", PGC_USERSET, RESOURCES_VACUUM_DELAY,
 			gettext_noop("Vacuum cost for a page found in the buffer cache."),
@@ -3585,6 +3607,17 @@ static struct config_real ConfigureNamesReal[] =
 		check_random_seed, assign_random_seed, show_random_seed
 	},
 
+	{
+		{"checksum_cost_delay", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost delay in milliseconds."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&ChecksumCostDelay,
+		0, 0, 100,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_cost_delay", PGC_USERSET, RESOURCES_VACUUM_DELAY,
 			gettext_noop("Vacuum cost delay in milliseconds."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9cb571f7cc..0b45a3a9eb 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -160,6 +160,12 @@
 #max_files_per_process = 1000		# min 64
 					# (change requires restart)
 
+# - Cost-Based Checksum Verification Delay -
+
+#checksum_cost_delay = 0			# 0-100 milliseconds (0 disables)
+#checksum_cost_page = 10		# 0-10000 credits
+#checksum_cost_limit = 200		# 1-10000 credits
+
 # - Cost-Based Vacuum Delay -
 
 #vacuum_cost_delay = 0			# 0-100 milliseconds (0 disables)
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 687509ba92..e0afc7a1f0 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10938,6 +10938,22 @@
   proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
   proargnames => '{tablespace,name,size,modification}',
   prosrc => 'pg_ls_tmpdir_1arg' },
+{ oid => '9147', descr => 'check data integrity for all forks of a relation',
+  proname => 'pg_check_relation', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass',
+  proallargtypes => '{regclass,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,o,o,o,o,o}',
+  proargnames => '{relation,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation' },
+{ oid => '9148', descr => 'check data integrity for one fork of a relation',
+  proname => 'pg_check_relation', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,i,o,o,o,o,o}',
+  proargnames => '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation_fork' },
 
 # hash partitioning constraint function
 { oid => '5028', descr => 'hash partition CHECK constraint',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e3352398..44c473995f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -247,6 +247,13 @@ extern PGDLLIMPORT double hash_mem_multiplier;
 extern PGDLLIMPORT int maintenance_work_mem;
 extern PGDLLIMPORT int max_parallel_maintenance_workers;
 
+extern int	ChecksumCostPage;
+extern int	ChecksumCostLimit;
+extern double ChecksumCostDelay;
+
+extern int	ChecksumCostBalance;
+extern bool ChecksumCostActive;
+
 extern int	VacuumCostPageHit;
 extern int	VacuumCostPageMiss;
 extern int	VacuumCostPageDirty;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0dfbac46b4..b9ff8392b8 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -902,7 +902,8 @@ typedef enum
 	WAIT_EVENT_PG_SLEEP,
 	WAIT_EVENT_RECOVERY_APPLY_DELAY,
 	WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL,
-	WAIT_EVENT_VACUUM_DELAY
+	WAIT_EVENT_VACUUM_DELAY,
+	WAIT_EVENT_CHECK_DELAY
 } WaitEventTimeout;
 
 /* ----------
diff --git a/src/include/utils/checksumfuncs.h b/src/include/utils/checksumfuncs.h
new file mode 100644
index 0000000000..f3e13b64b1
--- /dev/null
+++ b/src/include/utils/checksumfuncs.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * checksumfuncs.h
+ *	  Checksum verification implementation for data pages.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/checksumfuncs.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef CHECKSUMFUNCS_H
+#define CHECKSUMFUNCS_H
+
+#include "postgres.h"
+
+#include "access/tupdesc.h"
+#include "common/relpath.h"
+#include "utils/relcache.h"
+#include "utils/tuplestore.h"
+
+/*
+ * A zero checksum can never be computed; see pg_checksum_page(). */
+#define NoComputedChecksum	0
+
+extern bool check_one_block(Relation relation, ForkNumber forknum,
+							BlockNumber blkno, uint16 *chk_expected,
+							uint16 *chk_found);
+
+#endif							/* CHECKSUMFUNCS_H */
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 04431d0eb2..4ed2a09600 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -62,6 +62,7 @@ enum config_group
 	RESOURCES_MEM,
 	RESOURCES_DISK,
 	RESOURCES_KERNEL,
+	RESOURCES_CHECKSUM_DELAY,
 	RESOURCES_VACUUM_DELAY,
 	RESOURCES_BGWRITER,
 	RESOURCES_ASYNCHRONOUS,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index a6d2ffbf9e..a845af71fd 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -7,6 +7,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = \
 		  brin \
 		  commit_ts \
+		  check_relation \
 		  delay_execution \
 		  dummy_index_am \
 		  dummy_seclabel \
diff --git a/src/test/modules/check_relation/.gitignore b/src/test/modules/check_relation/.gitignore
new file mode 100644
index 0000000000..871e943d50
--- /dev/null
+++ b/src/test/modules/check_relation/.gitignore
@@ -0,0 +1,2 @@
+# Generated by test suite
+/tmp_check/
diff --git a/src/test/modules/check_relation/Makefile b/src/test/modules/check_relation/Makefile
new file mode 100644
index 0000000000..a540cdece2
--- /dev/null
+++ b/src/test/modules/check_relation/Makefile
@@ -0,0 +1,14 @@
+# src/test/modules/check_relation/Makefile
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/check_relation
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/check_relation/README b/src/test/modules/check_relation/README
new file mode 100644
index 0000000000..415c4b21ad
--- /dev/null
+++ b/src/test/modules/check_relation/README
@@ -0,0 +1,23 @@
+src/test/modules/check_relation/README
+
+Regression tests for online checksums verification
+==================================================
+
+This directory contains a test suite for online checksums verification.
+
+Running the tests
+=================
+
+NOTE: You must have given the --enable-tap-tests argument to configure.
+
+Run
+    make check
+or
+    make installcheck
+You can use "make installcheck" if you previously did "make install".
+In that case, the code in the installation tree is tested.  With
+"make check", a temporary installation tree is built from the current
+sources and then tested.
+
+Either way, this test initializes, starts, and stops a test Postgres
+cluster.
diff --git a/src/test/modules/check_relation/t/001_checksums_check.pl b/src/test/modules/check_relation/t/001_checksums_check.pl
new file mode 100644
index 0000000000..2a3f2880ea
--- /dev/null
+++ b/src/test/modules/check_relation/t/001_checksums_check.pl
@@ -0,0 +1,276 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 59;
+
+our $CHECKSUM_UINT16_OFFSET = 4;
+our $PD_UPPER_UINT16_OFFSET = 7;
+our $BLOCKSIZE;
+our $TOTAL_NB_ERR = 0;
+
+sub get_block
+{
+	my ($filename, $blkno) = @_;
+	my $block;
+
+	open(my $infile, '<', $filename) or die;
+	binmode($infile);
+
+	my $success = read($infile, $block, $BLOCKSIZE, ($blkno * $BLOCKSIZE));
+	die($!) if not defined $success;
+
+	close($infile);
+
+	return($block);
+}
+
+sub overwrite_block
+{
+	my ($filename, $block, $blkno) = @_;
+
+	open(my $outfile, '>', $filename) or die;
+	binmode ($outfile);
+
+	my $nb = syswrite($outfile, $block, $BLOCKSIZE, ($blkno * $BLOCKSIZE));
+
+	die($!) if not defined $nb;
+	die("Write error") if ($nb != $BLOCKSIZE);
+
+	$outfile->flush();
+
+	close($outfile);
+}
+
+sub get_uint16_from_page
+{
+	my ($block, $offset) = @_;
+
+	return (unpack("S*", $block))[$offset];
+}
+
+sub set_uint16_to_page
+{
+	my ($block, $data, $offset) = @_;
+
+	my $pack = pack("S", $data);
+
+	# vec with 16 bits or more won't preserve native endianness
+	vec($block, 2*$offset, 8) = (unpack('C*', $pack))[0];
+	vec($block, (2*$offset) + 1, 8) = (unpack('C*', $pack))[1];
+
+	return $block;
+}
+
+sub check_checksums_call
+{
+	my ($node, $relname) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM pg_catalog.pg_check_relation('$relname')"
+	);
+
+	return ($stderr eq '');
+}
+
+sub check_checksums_nb_error
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM (SELECT pg_catalog.pg_check_relation(oid, 'main')"
+		. "   FROM pg_class WHERE relkind in ('r', 'i', 'm')) AS s"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+	is($stdout, $nb, "Should have $nb error");
+
+	$TOTAL_NB_ERR += $nb;
+}
+
+sub check_pg_stat_database_nb_error
+{
+	my ($node) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT "
+		. " sum(checksum_failures)"
+		. " FROM pg_catalog.pg_stat_database"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	is($stderr, '', 'Function should run successfully');
+	is($stdout, $TOTAL_NB_ERR, "Should have $TOTAL_NB_ERR error");
+}
+
+sub get_checksums_errors
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+		. " relid::regclass::text, forknum, failed_blocknum,"
+		. " expected_checksum, found_checksum"
+		. " FROM (SELECT (pg_catalog.pg_check_relation(oid)).*"
+		. "   FROM pg_class WHERE relkind in ('r', 'i', 'm')) AS s"
+	);
+
+	is($cmdret, '0', 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+
+	$TOTAL_NB_ERR += $nb;
+
+	return $stdout;
+}
+
+# This function will perform various test by modifying the specified block at
+# the specified uint16 offset, checking that the corruption is correctly
+# detected, and finally restore the specified block to its original content.
+sub corrupt_and_test_block
+{
+	my ($node, $filename, $blkno, $offset, $fake_data) = @_;
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+
+	check_pg_stat_database_nb_error($node);
+
+	$node->stop();
+
+	my $original_block = get_block($filename, $blkno);
+	my $original_data = get_uint16_from_page($original_block, $offset);
+
+	isnt($original_data, $fake_data,
+		"The fake data at offset $offset should be different"
+		. " from the existing one");
+
+	my $new_block = set_uint16_to_page($original_block, $fake_data, $offset);
+	isnt($original_data, get_uint16_from_page($new_block, $offset),
+		"The fake data at offset $offset should have been changed in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+
+	my $written_data = get_uint16_from_page(get_block($filename, $blkno), $offset);
+	isnt($original_data, $written_data,
+		"The data written at offset $offset should be different"
+		. " from the original one");
+	is(get_uint16_from_page($new_block, $offset), $written_data,
+		"The data written at offset $offset should be the same"
+		. " as the one in memory");
+	is($written_data, $fake_data,
+		"The data written at offset $offset should be the one"
+		. " we wanted to write");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 1, qr/invalid page in block $blkno/);
+
+	my $expected_checksum;
+	my $found_checksum = get_uint16_from_page($new_block,
+		$CHECKSUM_UINT16_OFFSET);
+	if ($offset == $PD_UPPER_UINT16_OFFSET)
+	{
+		# A checksum can't be computed if it's detected as PageIsNew(), so the
+		# function returns NULL for the computed checksum
+		$expected_checksum = '';
+	}
+	else
+	{
+		$expected_checksum = get_uint16_from_page($original_block,
+			$CHECKSUM_UINT16_OFFSET);
+	}
+
+	my $det = get_checksums_errors($node, 1, qr/invalid page in block $blkno/);
+	is($det, "t1|0|0|$expected_checksum|$found_checksum",
+		"The checksum error for modification at offset $offset"
+		. " should be detected");
+
+	$node->stop();
+
+	$new_block = set_uint16_to_page($original_block, $original_data, $offset);
+	is($original_data, get_uint16_from_page($new_block, $offset),
+		"The data at offset $offset should have been restored in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+	is($original_data, get_uint16_from_page(get_block($filename, $blkno),
+			$offset),
+		"The data at offset $offset should have been restored on disk");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+}
+
+if (exists $ENV{MY_PG_REGRESS})
+{
+	$ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS};
+}
+
+my $node = get_new_node('main');
+
+my %params;
+$params{'extra'} = ['--data-checksums'];
+$node->init(%params);
+
+$node->start();
+
+$ENV{PGOPTIONS} = '--client-min-messages=WARNING';
+
+my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_checksums')");
+
+is($stdout, 'on', 'Data checksums should be enabled');
+
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('block_size')");
+
+$BLOCKSIZE = $stdout;
+
+$node->safe_psql(
+	'postgres', q|
+	CREATE TABLE public.t1(id integer);
+	CREATE INDEX t1_id_idx ON public.t1 (id);
+	INSERT INTO public.t1 SELECT generate_series(1, 100);
+	CREATE VIEW public.v1 AS SELECT * FROM t1;
+	CREATE MATERIALIZED VIEW public.mv1 AS SELECT * FROM t1;
+	CREATE SEQUENCE public.s1;
+	CREATE UNLOGGED TABLE public.u_t1(id integer);
+	CREATE INDEX u_t1_id_idx ON public.u_t1 (id);
+	INSERT INTO public.u_t1 SELECT generate_series(1, 100);
+	CHECKPOINT;
+|);
+
+# Check sane behavior on various object types, including those that don't
+# have storage.
+is(check_checksums_call($node, 't1'), '1', 'Can check a table');
+is(check_checksums_call($node, 't1_id_idx'), '1', 'Can check an index');
+is(check_checksums_call($node, 'v1'), '', 'Cannot check a view');
+is(check_checksums_call($node, 'mv1'), '1', 'Can check a materialized view');
+is(check_checksums_call($node, 's1'), '1', 'Can check a sequence');
+is(check_checksums_call($node, 'u_t1'), '1', 'Can check an unlogged table');
+is(check_checksums_call($node, 'u_t1_id_idx'), '1', 'Can check an unlogged index');
+
+# get the underlying heap absolute path
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_directory') || '/' || pg_relation_filepath('t1')"
+);
+
+isnt($stdout, '', 'A relfilenode should be returned');
+
+my $filename = $stdout;
+
+check_checksums_nb_error($node, 0, qr/^$/);
+
+check_pg_stat_database_nb_error($node);
+
+my $fake_uint16 = hex '0x0000';
+
+# Test with a modified checksum.  We use a zero checksum here as it's the only
+# one that cannot exist on a checksummed page.  We also don't have an easy way
+# to compute what the checksum would be after a modification in a random place
+# in the block.
+corrupt_and_test_block($node, $filename, 0, $CHECKSUM_UINT16_OFFSET,
+	$fake_uint16);
+
+# Test corruption making the block look like it's PageIsNew().
+corrupt_and_test_block($node, $filename, 0, $PD_UPPER_UINT16_OFFSET,
+	$fake_uint16);
-- 
2.20.1
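As a side note for reviewers less familiar with Perl's `vec` idiom, the `get_uint16_from_page`/`set_uint16_to_page` helpers in the TAP test above can be sketched in Python. The uint16 offsets 4 and 7 correspond to the `pd_checksum` and `pd_upper` fields of `PageHeaderData` (byte offsets 8 and 14). This is an illustration only, not part of the patch:

```python
import struct

CHECKSUM_UINT16_OFFSET = 4  # pd_checksum: byte offset 8 in PageHeaderData
PD_UPPER_UINT16_OFFSET = 7  # pd_upper: byte offset 14

def get_uint16_from_page(block: bytes, offset: int) -> int:
    """Read the native-endian uint16 at the given uint16-sized offset."""
    return struct.unpack_from("=H", block, offset * 2)[0]

def set_uint16_to_page(block: bytes, data: int, offset: int) -> bytes:
    """Return a copy of the page with the uint16 at that offset replaced."""
    buf = bytearray(block)
    struct.pack_into("=H", buf, offset * 2, data)
    return bytes(buf)
```

Like the Perl helpers, this operates on raw native-endian on-disk pages, so overwriting `pd_checksum` this way produces exactly the kind of corruption the test expects the function to report.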

#43Julien Rouhaud
rjuju123@gmail.com
In reply to: Julien Rouhaud (#42)
1 attachment(s)
Re: Online checksums verification in the backend

On Wed, Sep 09, 2020 at 03:41:30PM +0200, Julien Rouhaud wrote:

On Wed, Sep 09, 2020 at 11:25:29AM +0200, Julien Rouhaud wrote:

I'll do some benchmarking and see if I can get some figures, but it'll probably
take some time. In the meantime I'm attaching v11 of the patch that should
address all other comments.

I just realized that I forgot to update one of the Makefile when moving the TAP
test folder. v12 attached should fix that.

And the cfbot just reported a new error for Windows build. Attached v13 should
fix that.
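For reviewers, the patch's cost-based throttling deliberately mirrors the vacuum-delay machinery (checksum_cost_page charged per page, a nap once checksum_cost_limit is reached, with the delay scaled and capped at 4x as described in the documentation note). A minimal Python sketch of that accounting, using hypothetical names rather than the patch's actual C globals:

```python
import time

class ChecksumCostAccounting:
    """Sketch of the checksum_cost_* throttling; illustration only."""

    def __init__(self, cost_page=10, cost_limit=200, cost_delay_ms=0.0):
        self.cost_page = cost_page          # charge per page verified
        self.cost_limit = cost_limit        # balance that triggers a nap
        self.cost_delay_ms = cost_delay_ms  # 0 disables throttling
        self.balance = 0

    def charge_page(self):
        """Account for one verified page."""
        self.balance += self.cost_page

    def maybe_sleep(self):
        """Nap once the accumulated balance reaches the limit; return the
        delay (in ms) that was applied, or 0.0 if none."""
        if self.cost_delay_ms <= 0 or self.balance < self.cost_limit:
            return 0.0
        # If delays were skipped (e.g. while holding critical locks), the
        # balance can be far above the limit: scale the nap accordingly,
        # capped at 4x the configured delay.
        delay = min(self.cost_delay_ms * self.balance / self.cost_limit,
                    self.cost_delay_ms * 4)
        time.sleep(delay / 1000.0)
        self.balance = 0
        return delay
```

With the defaults (cost_page=10, cost_limit=200), a nap happens every 20 pages once checksum_cost_delay is set to a nonzero value.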

Attachments:

v13-0001-Add-a-pg_check_relation-function.patchtext/plain; charset=us-asciiDownload
From 2e08641e25ccd21bf6e4870085eb8b6b741027c1 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <jrouhaud@vmware.com>
Date: Mon, 4 Nov 2019 08:40:23 +0100
Subject: [PATCH v13] Add a pg_check_relation() function.

This function checks the validity of the checksums for all non-dirty blocks of
a given relation, and optionally a given fork, and returns the list of all
blocks that don't match, along with the expected and found checksums.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada, Justin Pryzby
Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 doc/src/sgml/config.sgml                      |  85 +++++
 doc/src/sgml/func.sgml                        |  51 +++
 src/backend/postmaster/pgstat.c               |   3 +
 src/backend/storage/page/checksum.c           | 322 +++++++++++++++++-
 src/backend/utils/adt/Makefile                |   1 +
 src/backend/utils/adt/checksumfuncs.c         | 218 ++++++++++++
 src/backend/utils/init/globals.c              |   7 +
 src/backend/utils/misc/guc.c                  |  33 ++
 src/backend/utils/misc/postgresql.conf.sample |   6 +
 src/include/catalog/pg_proc.dat               |  16 +
 src/include/miscadmin.h                       |   7 +
 src/include/pgstat.h                          |   3 +-
 src/include/utils/checksumfuncs.h             |  31 ++
 src/include/utils/guc_tables.h                |   1 +
 src/test/modules/Makefile                     |   1 +
 src/test/modules/check_relation/.gitignore    |   2 +
 src/test/modules/check_relation/Makefile      |  14 +
 src/test/modules/check_relation/README        |  23 ++
 .../check_relation/t/001_checksums_check.pl   | 276 +++++++++++++++
 src/tools/msvc/Mkvcbuild.pm                   |   3 +-
 20 files changed, 1098 insertions(+), 5 deletions(-)
 create mode 100644 src/backend/utils/adt/checksumfuncs.c
 create mode 100644 src/include/utils/checksumfuncs.h
 create mode 100644 src/test/modules/check_relation/.gitignore
 create mode 100644 src/test/modules/check_relation/Makefile
 create mode 100644 src/test/modules/check_relation/README
 create mode 100644 src/test/modules/check_relation/t/001_checksums_check.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c4ba49ffaf..b7629fde60 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2137,6 +2137,91 @@ include_dir 'conf.d'
      </note>
     </sect2>
 
+    <sect2 id="runtime-config-resource-checksum-verification-cost">
+     <title>Cost-based Checksum Verification Delay</title>
+
+     <para>
+      During the execution of the <xref linkend="functions-check-relation-note"/>
+      function, the system maintains an internal counter that keeps track of
+      the estimated cost of the various I/O operations that are performed.
+      When the accumulated cost reaches a limit (specified by
+      <varname>checksum_cost_limit</varname>), the process performing the
+      operation will sleep for a short period of time, as specified by
+      <varname>checksum_cost_delay</varname>. Then it will reset the counter
+      and continue execution.
+     </para>
+
+     <para>
+      This feature is disabled by default. To enable it, set the
+      <varname>checksum_cost_delay</varname> variable to a nonzero
+      value.
+     </para>
+
+     <variablelist>
+      <varlistentry id="guc-checksum-cost-delay" xreflabel="checksum_cost_delay">
+       <term><varname>checksum_cost_delay</varname> (<type>floating point</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_delay</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The amount of time that the process will sleep
+         when the cost limit has been exceeded.
+         If this value is specified without units, it is taken as milliseconds.
+         The default value is zero, which disables the cost-based checksum
+         verification delay feature.  Positive values enable cost-based
+         checksum verification.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-checksum-cost-page" xreflabel="checksum_cost_page">
+       <term><varname>checksum_cost_page</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_page</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The estimated cost for verifying a buffer, whether it's found in the
+         shared buffer cache or not.  It represents the cost to lock the buffer
+         pool, look up the shared hash table, read the content of the page from
+         disk, and compute its checksum.  The default value is 10.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-checksum-cost-limit" xreflabel="checksum_cost_limit">
+       <term><varname>checksum_cost_limit</varname> (<type>integer</type>)
+       <indexterm>
+        <primary><varname>checksum_cost_limit</varname> configuration parameter</primary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         The accumulated cost that will cause the verification process to sleep.
+         The default value is 200.
+        </para>
+       </listitem>
+      </varlistentry>
+     </variablelist>
+
+     <note>
+      <para>
+       There are certain operations that hold critical locks and should
+       therefore complete as quickly as possible.  Cost-based checksum
+       verification delays do not occur during such operations.  Therefore it
+       is possible that the cost accumulates far higher than the specified
+       limit.  To avoid uselessly long delays in such cases, the actual delay
+       is calculated as <varname>checksum_cost_delay</varname> *
+       <varname>accumulated_balance</varname> /
+       <varname>checksum_cost_limit</varname> with a maximum of
+       <varname>checksum_cost_delay</varname> * 4.
+      </para>
+     </note>
+    </sect2>
+
     <sect2 id="runtime-config-resource-background-writer">
      <title>Background Writer</title>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index e2e618791e..bcc97d1306 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26221,6 +26221,57 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
 
   </sect2>
 
+  <sect2 id="functions-data-sanity">
+   <title>Data Sanity Functions</title>
+
+   <para>
+    The functions shown in <xref linkend="functions-data-sanity-table"/>
+    provide a means to check the health of data files in a cluster.
+   </para>
+
+   <table id="functions-data-sanity-table">
+    <title>Data Sanity Functions</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry>
+        <literal><function>pg_check_relation(<parameter>relation</parameter> <type>regclass</type> [, <parameter>fork</parameter> <type>text</type>])</function></literal>
+       </entry>
+       <entry><type>setof record</type></entry>
+       <entry>Validate the checksum for all blocks of a relation.
+       </entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <indexterm>
+    <primary>pg_check_relation</primary>
+   </indexterm>
+   <para id="functions-check-relation-note" xreflabel="pg_check_relation">
+    <function>pg_check_relation</function> iterates over all blocks of a
+    given relation and verifies their checksums.  If passed,
+    <replaceable>fork</replaceable> specifies that only checksums of the given
+    fork are to be verified.  Fork should be <literal>'main'</literal> for the
+    main data fork, <literal>'fsm'</literal> for the free space map,
+    <literal>'vm'</literal> for the visibility map, or
+    <literal>'init'</literal> for the initialization fork.
+    The function returns a list of blocks for which the computed and stored
+    checksums don't match.  See <xref
+    linkend="runtime-config-resource-checksum-verification-cost"/> for
+    information on how to configure cost-based verification delay.  You must be
+    a member of the <literal>pg_read_all_stats</literal> role to use this
+    function.  It can only be used if data checksums are enabled.  See <xref
+    linkend="app-initdb-data-checksums"/> for more information.
+   </para>
+
+  </sect2>
+
   </sect1>
 
   <sect1 id="functions-trigger">
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5f4b168fd1..64664c0bc6 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3912,6 +3912,9 @@ pgstat_get_wait_timeout(WaitEventTimeout w)
 		case WAIT_EVENT_VACUUM_DELAY:
 			event_name = "VacuumDelay";
 			break;
+		case WAIT_EVENT_CHECK_DELAY:
+			event_name = "CheckDelay";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/storage/page/checksum.c b/src/backend/storage/page/checksum.c
index e010691c9f..dbdc105060 100644
--- a/src/backend/storage/page/checksum.c
+++ b/src/backend/storage/page/checksum.c
@@ -15,8 +15,324 @@
 
 #include "storage/checksum.h"
 /*
- * The actual code is in storage/checksum_impl.h.  This is done so that
- * external programs can incorporate the checksum code by #include'ing
- * that file from the exported Postgres headers.  (Compare our CRC code.)
+ * The actual checksum computation code is in storage/checksum_impl.h.  This
+ * is done so that external programs can incorporate the checksum code by
+ * #include'ing that file from the exported Postgres headers.  (Compare our
+ * CRC code.)
  */
 #include "storage/checksum_impl.h"
+
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/buf_internals.h"
+#include "storage/lmgr.h"
+#include "utils/checksumfuncs.h"
+
+/* ----------------
+ * The rest of this module provides a set of functions that can be used to
+ * safely check all checksums on a running cluster.
+ *
+ * Please note that those only perform standard buffered reads, and don't try
+ * to bypass or discard the operating system cache.  If you want to check the
+ * actual storage, you have to discard the operating system cache before
+ * running those functions.
+ *
+ * To avoid torn pages and possible false positives when reading data, and to
+ * keep overhead as low as possible, the following heuristics are used:
+ *
+ * - a shared LWLock is taken on the target buffer pool partition mapping, and
+ *   we detect if a block is in shared_buffers or not.  See check_get_buffer()
+ *   comments for more details about the locking strategy.
+ *
+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
+ *
+ * - if a block is otherwise found in shared_buffers, an IO lock is taken on
+ *   the block and the block is then read from storage, ignoring the block in
+ *   shared_buffers
+ *
+ * - if a block is not found in shared_buffers, the LWLock is released and the
+ *   block is read from disk without taking any lock.  If an error is detected,
+ *   the read block will be discarded and retrieved again while holding the
+ *   LWLock.  This is because an error due to concurrent write is possible but
+ *   very unlikely, so it's better to have an optimistic approach to limit
+ *   locking overhead
+ *
+ * The check can be performed using an SQL function, returning the list of
+ * problematic blocks.
+ * ----------------
+ */
+
+static bool check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+							   uint16 *chk_found);
+static void check_delay_point(void);
+static bool check_get_buffer(Relation relation, ForkNumber forknum,
+							 BlockNumber blkno, char *buffer, bool needlock,
+							 bool *found_in_sb);
+
+/*
+ * Check data sanity for a specific block in the given fork of the given
+ * relation, always retrieved locally with smgrread even if a version exists in
+ * shared_buffers.  Returns false if the block appears to be corrupted, true
+ * otherwise.  Note that dirty and invalid blocks won't be checked.  Caller
+ * must hold at least an AccessShareLock on the relation.
+ */
+bool
+check_one_block(Relation relation, ForkNumber forknum, BlockNumber blkno,
+				uint16 *chk_expected, uint16 *chk_found)
+{
+	char		buffer[BLCKSZ];
+	bool		force_lock = false;
+	bool		found_in_sb;
+
+	Assert(CheckRelationLockedByMe(relation, AccessShareLock, true));
+	Assert(blkno < RelationGetNumberOfBlocksInFork(relation, forknum));
+	Assert(smgrexists(relation->rd_smgr, forknum));
+
+	*chk_expected = *chk_found = NoComputedChecksum;
+
+	/*
+	 * To avoid excessive overhead, the buffer is first read without the
+	 * locks that would prevent false positives, as such events should be
+	 * quite rare.
+	 */
+Retry:
+	if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock,
+						  &found_in_sb))
+		return true;
+
+	if (check_buffer(buffer, blkno, chk_expected, chk_found))
+		return true;
+
+	/*
+	 * If we get a failure and the buffer wasn't found in shared buffers,
+	 * reread the buffer with a suitable lock to avoid false positives.  See
+	 * check_get_buffer for more details.
+	 */
+	if (!found_in_sb && !force_lock)
+	{
+		force_lock = true;
+		goto Retry;
+	}
+
+	/* Corruption was detected. */
+	return false;
+}
+
+/*
+ * Perform a checksum check on the passed page.  Return true if the page is
+ * valid, false otherwise, and assign the expected and found checksums to
+ * chk_expected and chk_found, respectively.  Note that a page that looks new
+ * could actually be the result of corruption.  We still check for this case,
+ * but we can't compute its checksum as pg_checksum_page() explicitly rejects
+ * new pages, so NoComputedChecksum will be set in chk_expected.
+ */
+static bool
+check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+				   uint16 *chk_found)
+{
+	Page		page = (Page) buffer;
+	PageHeader	hdr = (PageHeader) page;
+
+	Assert(chk_expected && chk_found);
+
+	if (PageIsNew(page))
+	{
+		/*
+		 * Check if the page is really new or if there's corruption that
+		 * affected PageIsNew detection.  Note that PageIsVerified won't try to
+		 * detect checksum corruption in this case, so there's no risk of
+		 * duplicated corruption reports.
+		 */
+		if (PageIsVerified(page, blkno))
+		{
+			/* No corruption. */
+			return true;
+		}
+
+		/*
+		 * There's corruption, but since this affects PageIsNew, we
+		 * can't compute a checksum, so set NoComputedChecksum for the
+		 * expected checksum.
+		 */
+		*chk_expected = NoComputedChecksum;
+		*chk_found = hdr->pd_checksum;
+		return false;
+	}
+
+	*chk_expected = pg_checksum_page(buffer, blkno);
+	*chk_found = hdr->pd_checksum;
+
+	return (*chk_expected == *chk_found);
+}
+
+/*
+ * Check for interrupts and cost-based delay.
+ */
+static void
+check_delay_point(void)
+{
+	/* Always check for interrupts */
+	CHECK_FOR_INTERRUPTS();
+
+	if (!ChecksumCostActive || InterruptPending)
+		return;
+
+	/* Nap if appropriate */
+	if (ChecksumCostBalance >= ChecksumCostLimit)
+	{
+		int			msec;
+
+		msec = ChecksumCostDelay * ChecksumCostBalance / ChecksumCostLimit;
+		if (msec > ChecksumCostDelay * 4)
+			msec = ChecksumCostDelay * 4;
+
+		pgstat_report_wait_start(WAIT_EVENT_CHECK_DELAY);
+		pg_usleep(msec * 1000L);
+		pgstat_report_wait_end();
+
+		ChecksumCostBalance = 0;
+
+		/* Might have gotten an interrupt while sleeping */
+		CHECK_FOR_INTERRUPTS();
+	}
+}
+
+/*
+ *-------------------------
+ * Safely read the wanted buffer from disk, dealing with possible concurrency
+ * issues.  Note that if a buffer is found dirty in shared_buffers, no read
+ * will be performed and the caller will be informed that no check should be
+ * done.  We can safely ignore such buffers as they'll be written before the
+ * next checkpoint's completion.
+ *
+ * The following locks can be used in this function:
+ *
+ *   - shared LWLock on the target buffer mapping partition.
+ *   - IOLock on the buffer
+ *
+ * The IOLock is taken when reading the buffer from disk if it exists in
+ * shared_buffers, to avoid torn pages.
+ *
+ * If the buffer isn't in shared_buffers, it'll be read from disk without any
+ * lock unless caller asked otherwise, setting needlock.  In this case, the
+ * read will be done while the buffer mapping partition LWLock is still being
+ * held.  Reading with this lock held avoids the unlikely but possible case
+ * where a buffer wasn't present in shared_buffers when we checked, but was
+ * then allocated, modified and flushed concurrently before we later try to
+ * read it, leading to false positives due to a torn page.  Caller
+ * can first read the buffer without holding the target buffer mapping
+ * partition LWLock to have an optimistic approach, and reread the buffer
+ * from disk in case of error.
+ *
+ * Caller should hold at least an AccessShareLock on the relation.
+ *-------------------------
+ */
+static bool
+check_get_buffer(Relation relation, ForkNumber forknum,
+				 BlockNumber blkno, char *buffer, bool needlock,
+				 bool *found_in_sb)
+{
+	bool		checkit = true;
+	BufferTag	buf_tag;		/* identity of requested block */
+	uint32		buf_hash;		/* hash value for buf_tag */
+	LWLock	   *partLock;		/* buffer partition lock for the buffer */
+	BufferDesc *bufdesc;
+	int			buf_id;
+
+	*found_in_sb = false;
+
+	/* Check for interrupts and take throttling into account. */
+	check_delay_point();
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(buf_tag, relation->rd_smgr->smgr_rnode.node, forknum, blkno);
+
+	/* determine its hash code and partition lock ID */
+	buf_hash = BufTableHashCode(&buf_tag);
+	partLock = BufMappingPartitionLock(buf_hash);
+
+	/* see if the block is in the buffer pool already */
+	LWLockAcquire(partLock, LW_SHARED);
+	buf_id = BufTableLookup(&buf_tag, buf_hash);
+	if (buf_id >= 0)
+	{
+		uint32		buf_state;
+
+		*found_in_sb = true;
+
+		/*
+		 * Found it.  Now, retrieve its state to know what to do with it,
+		 * releasing the buffer header lock immediately.  We do so to limit
+		 * overhead as much as possible.  We'll keep the shared lightweight
+		 * lock on the target buffer mapping partition, so this buffer can't
+		 * be evicted, and we'll acquire an IOLock on the buffer if we need
+		 * to read the content from disk.
+		 */
+		bufdesc = GetBufferDescriptor(buf_id);
+
+		buf_state = LockBufHdr(bufdesc);
+		UnlockBufHdr(bufdesc, buf_state);
+
+		/*
+		 * Dirty pages are ignored as they'll be flushed soon. Invalid buffers
+		 * are also skipped.
+		 */
+		if ((buf_state & BM_DIRTY) || !(buf_state & BM_TAG_VALID))
+			checkit = false;
+
+		/*
+		 * Read the buffer from disk, taking an IO lock to prevent torn-page
+		 * reads, in the unlikely event that it was concurrently dirtied and
+		 * flushed.
+		 */
+		if (checkit)
+		{
+			LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
+			smgrread(relation->rd_smgr, forknum, blkno, buffer);
+			LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
+
+			/* Add a page cost. */
+			ChecksumCostBalance += ChecksumCostPage;
+		}
+	}
+	else if (needlock)
+	{
+		/*
+		 * Caller asked to read the buffer while we have a lock on the target
+		 * partition.
+		 */
+		smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+		/* The buffer will have to be checked. */
+		Assert(checkit);
+
+		/* Add a page cost. */
+		ChecksumCostBalance += ChecksumCostPage;
+	}
+
+	LWLockRelease(partLock);
+
+	if (*found_in_sb || needlock)
+		return checkit;
+
+	/* After this point the buffer will always be checked. */
+	Assert(checkit);
+
+	/*
+	 * Didn't find it in the buffer pool and didn't read it while holding the
+	 * buffer mapping partition lock.  We'll have to try to read it from
+	 * disk, after releasing the target partition lock to avoid excessive
+	 * overhead.  It means that it's possible to get a torn page later, so
+	 * we'll have to retry with a suitable lock in case of error to avoid
+	 * false positives.
+	 */
+	smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+	/* Add a page cost. */
+	ChecksumCostBalance += ChecksumCostPage;
+
+	return checkit;
+}
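The retry dance in check_one_block() above is the core of the optimistic strategy: one lock-free read in the common case, and a single locked reread only when the first verification fails.  A minimal Python sketch of that control flow (read_page and verify are illustrative stand-ins, not PostgreSQL APIs):

```python
# Sketch of check_one_block()'s optimistic read/retry logic: the first read
# takes no lock; only on a verification failure do we reread under the lock
# that rules out torn pages, so the locking cost is paid only on suspicion.

def check_one_block(read_page, verify):
    """read_page(locked) returns page bytes; verify(page) is the checksum test."""
    page = read_page(locked=False)   # optimistic, lock-free read
    if verify(page):
        return True                  # common case: valid page, no lock taken
    # The failure may just be a torn page from a concurrent flush:
    # reread while holding the lock before declaring corruption.
    page = read_page(locked=True)
    return verify(page)

# Simulate a torn page that is only visible to the unlocked read.
reads = []
def read_page(locked):
    reads.append(locked)
    return b"good" if locked else b"torn"

assert check_one_block(read_page, lambda p: p == b"good") is True
assert reads == [False, True]        # one optimistic read, one locked retry
```

If the locked reread still fails verification, the block is reported as corrupted: a transient torn page cannot survive the locked read.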
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index b4d55e849b..603f63afb6 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -22,6 +22,7 @@ OBJS = \
 	bool.o \
 	cash.o \
 	char.o \
+	checksumfuncs.o \
 	cryptohashes.o \
 	date.o \
 	datetime.o \
diff --git a/src/backend/utils/adt/checksumfuncs.c b/src/backend/utils/adt/checksumfuncs.c
new file mode 100644
index 0000000000..700831cbc7
--- /dev/null
+++ b/src/backend/utils/adt/checksumfuncs.c
@@ -0,0 +1,218 @@
+/*-------------------------------------------------------------------------
+ *
+ * checksumfuncs.c
+ *	  Functions for checksum-related features such as online verification
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/checksumfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/relation.h"
+#include "catalog/pg_authid_d.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/bufmgr.h"
+#include "storage/checksum.h"
+#include "storage/smgr.h"
+#include "utils/acl.h"
+#include "utils/checksumfuncs.h"
+#include "utils/rel.h"
+#include "utils/builtins.h"
+#include "utils/lsyscache.h"
+
+static void check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+					 Oid relid, ForkNumber single_forknum);
+static void check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+									  Relation relation, ForkNumber forknum);
+static void pg_check_relation_internal(FunctionCallInfo fcinfo, Oid relid,
+									   ForkNumber forknum);
+
+
+Datum
+pg_check_relation(PG_FUNCTION_ARGS)
+{
+	Oid			relid = InvalidOid;
+
+	pg_check_relation_internal(fcinfo, relid, InvalidForkNumber);
+
+	return (Datum) 0;
+}
+
+Datum
+pg_check_relation_fork(PG_FUNCTION_ARGS)
+{
+	Oid			relid = InvalidOid;
+	const char *forkname;
+	ForkNumber	forknum;
+
+	forkname = TextDatumGetCString(PG_GETARG_TEXT_PP(1));
+	forknum = forkname_to_number(forkname);
+
+	pg_check_relation_internal(fcinfo, relid, forknum);
+
+	return (Datum) 0;
+}
+
+/* Common code for all versions of pg_check_relation() */
+static void
+pg_check_relation_internal(FunctionCallInfo fcinfo, Oid relid, ForkNumber forknum)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+
+	if (!DataChecksumsEnabled())
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("data checksums are not enabled in this cluster")));
+
+	if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_STAT_SCAN_TABLES))
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("only superuser or a member of the pg_stat_scan_tables role may use this function")));
+
+	/* Switch into long-lived context to construct returned data structures */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	relid = PG_GETARG_OID(0);
+
+	/* Set cost-based checksum verification delay */
+	ChecksumCostActive = (ChecksumCostDelay > 0);
+	ChecksumCostBalance = 0;
+
+	check_one_relation(tupdesc, tupstore, relid, forknum);
+
+	tuplestore_donestoring(tupstore);
+}
+
+/*
+ * Perform the check on a single relation, possibly filtered with a single
+ * fork.  This function will check if the given relation exists or not, as
+ * a relation could be dropped after checking for the list of relations and
+ * before getting here, and we don't want to error out in this case.
+ */
+static void
+check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+					 Oid relid, ForkNumber single_forknum)
+{
+	Relation	relation = NULL;
+	ForkNumber	forknum;
+
+	relation = try_relation_open(relid, AccessShareLock);
+
+	/* The relation could have been concurrently dropped: silently ignore. */
+	if (relation == NULL)
+		return;
+
+	/* sanity checks */
+	if (!RELKIND_HAS_STORAGE(relation->rd_rel->relkind))
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("relation \"%s\" does not have storage to be checked",
+				 RelationGetRelationName(relation))));
+
+	if (RELATION_IS_OTHER_TEMP(relation))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot verify temporary tables of other sessions")));
+
+	RelationOpenSmgr(relation);
+
+	for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+	{
+		if (single_forknum != InvalidForkNumber && single_forknum != forknum)
+			continue;
+
+		if (smgrexists(relation->rd_smgr, forknum))
+			check_relation_fork(tupdesc, tupstore, relation, forknum);
+	}
+
+	relation_close(relation, AccessShareLock);	/* release the lock */
+}
+
+/*
+ * For a given relation and fork, do the real work of iterating over all pages
+ * and doing the check.  Caller must hold an AccessShareLock lock on the given
+ * relation.
+ */
+static void
+check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+						  Relation relation, ForkNumber forknum)
+{
+	BlockNumber blkno,
+				nblocks;
+
+	/*
+	 * We remember the number of blocks here.  Since caller must hold a lock on
+	 * the relation, we know that it won't be truncated while we're iterating
+	 * over the blocks.  Any block added after this function started won't be
+	 * checked, but this is out of scope as such pages will be flushed before
+	 * the next checkpoint's completion.
+	 */
+	nblocks = RelationGetNumberOfBlocksInFork(relation, forknum);
+
+#define PG_CHECK_RELATION_COLS			5	/* Number of output arguments in the SRF */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		uint16		chk_expected,
+					chk_found;
+		Datum		values[PG_CHECK_RELATION_COLS];
+		bool		nulls[PG_CHECK_RELATION_COLS];
+		int			i = 0;
+
+		/* Check the given buffer */
+		if (check_one_block(relation, forknum, blkno, &chk_expected,
+							&chk_found))
+		{
+			/* Buffer not corrupted or not worth checking, continue */
+			continue;
+		}
+
+		memset(values, 0, sizeof(values));
+		memset(nulls, 0, sizeof(nulls));
+
+		values[i++] = ObjectIdGetDatum(relation->rd_id);
+		values[i++] = Int32GetDatum(forknum);
+		values[i++] = UInt32GetDatum(blkno);
+		/*
+		 * The expected checksum is NULL if corruption makes the block appear
+		 * as PageIsNew() when it isn't actually a new page.
+		 */
+		if (chk_expected == NoComputedChecksum)
+			nulls[i++] = true;
+		else
+			values[i++] = UInt16GetDatum(chk_expected);
+		values[i++] = UInt16GetDatum(chk_found);
+
+		Assert(i == PG_CHECK_RELATION_COLS);
+
+		/* Report the failure to the stat collector and the logs. */
+		pgstat_report_checksum_failure();
+		ereport(WARNING,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("invalid page in block %u of relation %s",
+						blkno,
+						relpath(relation->rd_smgr->smgr_rnode, forknum))));
+
+		/* Save the corrupted blocks in the tuplestore. */
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+}
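The scan loop in check_relation_fork() reduces to: walk every block of the fork, skip blocks whose stored checksum matches the computed one, and emit one row per mismatch.  A Python sketch of that shape, using a toy checksum in place of pg_checksum_page() (all names here are illustrative):

```python
# Sketch of check_relation_fork()'s per-block scan: compute a checksum for
# each page, compare it with the stored one, and collect one result row
# (blkno, expected, found) per corrupted block.

def checksum(page):
    return sum(page) % 65536          # toy checksum, not pg_checksum_page()

def check_fork(pages, stored_checksums):
    corrupted = []
    for blkno, page in enumerate(pages):
        expected = checksum(page)
        found = stored_checksums[blkno]
        if expected == found:
            continue                  # valid block, nothing to report
        corrupted.append((blkno, expected, found))
    return corrupted

pages = [b"\x01\x02", b"\x03\x04", b"\x05\x06"]
stored = [checksum(p) for p in pages]
stored[1] = 0xdead                    # corrupt the stored checksum of block 1
assert check_fork(pages, stored) == [(1, checksum(pages[1]), 0xdead)]
```

The real function additionally skips dirty, new, and invalid pages, and reports each failure through pgstat and a WARNING before storing the row.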
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 6ab8216839..68e6e0cfcb 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -135,6 +135,13 @@ int			max_worker_processes = 8;
 int			max_parallel_workers = 8;
 int			MaxBackends = 0;
 
+int			ChecksumCostPage = 10;	/* GUC parameters for checksum check */
+int			ChecksumCostLimit = 200;
+double		ChecksumCostDelay = 0;
+
+int			ChecksumCostBalance = 0;	/* working state for checksums check */
+bool		ChecksumCostActive = false;
+
 int			VacuumCostPageHit = 1;	/* GUC parameters for vacuum */
 int			VacuumCostPageMiss = 10;
 int			VacuumCostPageDirty = 20;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index de87ad6ef7..a18f85181d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -691,6 +691,8 @@ const char *const config_group_names[] =
 	gettext_noop("Resource Usage / Disk"),
 	/* RESOURCES_KERNEL */
 	gettext_noop("Resource Usage / Kernel Resources"),
+	/* RESOURCES_CHECKSUM_DELAY */
+	gettext_noop("Resource Usage / Cost-Based Checksum Verification Delay"),
 	/* RESOURCES_VACUUM_DELAY */
 	gettext_noop("Resource Usage / Cost-Based Vacuum Delay"),
 	/* RESOURCES_BGWRITER */
@@ -2385,6 +2387,26 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"checksum_cost_page", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost for verifying a page."),
+			NULL
+		},
+		&ChecksumCostPage,
+		10, 0, 10000,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"checksum_cost_limit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost amount available before napping."),
+			NULL
+		},
+		&ChecksumCostLimit,
+		200, 1, 10000,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_cost_page_hit", PGC_USERSET, RESOURCES_VACUUM_DELAY,
 			gettext_noop("Vacuum cost for a page found in the buffer cache."),
@@ -3585,6 +3607,17 @@ static struct config_real ConfigureNamesReal[] =
 		check_random_seed, assign_random_seed, show_random_seed
 	},
 
+	{
+		{"checksum_cost_delay", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
+			gettext_noop("Checksum cost delay in milliseconds."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&ChecksumCostDelay,
+		0, 0, 100,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"vacuum_cost_delay", PGC_USERSET, RESOURCES_VACUUM_DELAY,
 			gettext_noop("Vacuum cost delay in milliseconds."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9cb571f7cc..0b45a3a9eb 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -160,6 +160,12 @@
 #max_files_per_process = 1000		# min 64
 					# (change requires restart)
 
+# - Cost-Based Checksum Verification Delay -
+
+#checksum_cost_delay = 0			# 0-100 milliseconds (0 disables)
+#checksum_cost_page = 10		# 0-10000 credits
+#checksum_cost_limit = 200		# 1-10000 credits
+
 # - Cost-Based Vacuum Delay -
 
 #vacuum_cost_delay = 0			# 0-100 milliseconds (0 disables)
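The three checksum_cost_* settings above drive the same throttling arithmetic as the vacuum cost machinery: each checked page adds checksum_cost_page to a balance, and once the balance reaches checksum_cost_limit the backend naps for checksum_cost_delay * balance / limit milliseconds, capped at four times the base delay.  A Python sketch of the nap computation (mirroring check_delay_point() in this patch):

```python
# Sketch of the cost-based nap length: proportional to the accumulated
# balance, capped at 4x the base delay, and zero when throttling is off
# (delay <= 0) or the balance hasn't reached the limit yet.

def nap_ms(balance, cost_limit, cost_delay):
    if cost_delay <= 0 or balance < cost_limit:
        return 0                          # throttling disabled or not due yet
    msec = cost_delay * balance / cost_limit
    return min(msec, cost_delay * 4)      # never nap more than 4x base delay

# With the sample default (checksum_cost_limit = 200) and a 2ms delay:
assert nap_ms(balance=100, cost_limit=200, cost_delay=2.0) == 0
assert nap_ms(balance=200, cost_limit=200, cost_delay=2.0) == 2.0
assert nap_ms(balance=2000, cost_limit=200, cost_delay=2.0) == 8.0   # capped
```

After the nap the balance resets to zero, so a long scan alternates bursts of roughly checksum_cost_limit / checksum_cost_page page checks with short sleeps.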
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 687509ba92..e0afc7a1f0 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10938,6 +10938,22 @@
   proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
   proargnames => '{tablespace,name,size,modification}',
   prosrc => 'pg_ls_tmpdir_1arg' },
+{ oid => '9147', descr => 'check data integrity for all forks of a relation',
+  proname => 'pg_check_relation', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass',
+  proallargtypes => '{regclass,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,o,o,o,o,o}',
+  proargnames => '{relation,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation' },
+{ oid => '9148', descr => 'check data integrity for one fork of a relation',
+  proname => 'pg_check_relation', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,i,o,o,o,o,o}',
+  proargnames => '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation_fork' },
 
 # hash partitioning constraint function
 { oid => '5028', descr => 'hash partition CHECK constraint',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 72e3352398..44c473995f 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -247,6 +247,13 @@ extern PGDLLIMPORT double hash_mem_multiplier;
 extern PGDLLIMPORT int maintenance_work_mem;
 extern PGDLLIMPORT int max_parallel_maintenance_workers;
 
+extern int	ChecksumCostPage;
+extern int	ChecksumCostLimit;
+extern double ChecksumCostDelay;
+
+extern int	ChecksumCostBalance;
+extern bool ChecksumCostActive;
+
 extern int	VacuumCostPageHit;
 extern int	VacuumCostPageMiss;
 extern int	VacuumCostPageDirty;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0dfbac46b4..b9ff8392b8 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -902,7 +902,8 @@ typedef enum
 	WAIT_EVENT_PG_SLEEP,
 	WAIT_EVENT_RECOVERY_APPLY_DELAY,
 	WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL,
-	WAIT_EVENT_VACUUM_DELAY
+	WAIT_EVENT_VACUUM_DELAY,
+	WAIT_EVENT_CHECK_DELAY
 } WaitEventTimeout;
 
 /* ----------
diff --git a/src/include/utils/checksumfuncs.h b/src/include/utils/checksumfuncs.h
new file mode 100644
index 0000000000..f3e13b64b1
--- /dev/null
+++ b/src/include/utils/checksumfuncs.h
@@ -0,0 +1,31 @@
+/*-------------------------------------------------------------------------
+ *
+ * checksumfuncs.h
+ *	  Online checksum verification for data pages.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/checksumfuncs.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef CHECKSUMFUNCS_H
+#define CHECKSUMFUNCS_H
+
+#include "postgres.h"
+
+#include "access/tupdesc.h"
+#include "common/relpath.h"
+#include "utils/relcache.h"
+#include "utils/tuplestore.h"
+
+/*
+ * A zero checksum can never be computed, see pg_checksum_page().
+ */
+#define NoComputedChecksum	0
+
+extern bool check_one_block(Relation relation, ForkNumber forknum,
+							BlockNumber blkno, uint16 *chk_expected,
+							uint16 *chk_found);
+
+#endif							/* CHECKSUMFUNCS_H */
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 04431d0eb2..4ed2a09600 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -62,6 +62,7 @@ enum config_group
 	RESOURCES_MEM,
 	RESOURCES_DISK,
 	RESOURCES_KERNEL,
+	RESOURCES_CHECKSUM_DELAY,
 	RESOURCES_VACUUM_DELAY,
 	RESOURCES_BGWRITER,
 	RESOURCES_ASYNCHRONOUS,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index a6d2ffbf9e..a845af71fd 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -7,6 +7,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = \
 		  brin \
 		  commit_ts \
+		  check_relation \
 		  delay_execution \
 		  dummy_index_am \
 		  dummy_seclabel \
diff --git a/src/test/modules/check_relation/.gitignore b/src/test/modules/check_relation/.gitignore
new file mode 100644
index 0000000000..871e943d50
--- /dev/null
+++ b/src/test/modules/check_relation/.gitignore
@@ -0,0 +1,2 @@
+# Generated by test suite
+/tmp_check/
diff --git a/src/test/modules/check_relation/Makefile b/src/test/modules/check_relation/Makefile
new file mode 100644
index 0000000000..a540cdece2
--- /dev/null
+++ b/src/test/modules/check_relation/Makefile
@@ -0,0 +1,14 @@
+# src/test/modules/check_relation/Makefile
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/check_relation
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/check_relation/README b/src/test/modules/check_relation/README
new file mode 100644
index 0000000000..415c4b21ad
--- /dev/null
+++ b/src/test/modules/check_relation/README
@@ -0,0 +1,23 @@
+src/test/modules/check_relation/README
+
+Regression tests for online checksums verification
+==================================================
+
+This directory contains a test suite for online checksums verification.
+
+Running the tests
+=================
+
+NOTE: You must have given the --enable-tap-tests argument to configure.
+
+Run
+    make check
+or
+    make installcheck
+You can use "make installcheck" if you previously did "make install".
+In that case, the code in the installation tree is tested.  With
+"make check", a temporary installation tree is built from the current
+sources and then tested.
+
+Either way, this test initializes, starts, and stops a test Postgres
+cluster.
diff --git a/src/test/modules/check_relation/t/001_checksums_check.pl b/src/test/modules/check_relation/t/001_checksums_check.pl
new file mode 100644
index 0000000000..2a3f2880ea
--- /dev/null
+++ b/src/test/modules/check_relation/t/001_checksums_check.pl
@@ -0,0 +1,276 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 59;
+
+our $CHECKSUM_UINT16_OFFSET = 4;
+our $PD_UPPER_UINT16_OFFSET = 7;
+our $BLOCKSIZE;
+our $TOTAL_NB_ERR = 0;
+
+sub get_block
+{
+	my ($filename, $blkno) = @_;
+	my $block;
+
+	open(my $infile, '<', $filename) or die;
+	binmode($infile);
+
+	# read()'s fourth argument is an offset into $block, not a file
+	# position, so seek explicitly to the wanted block first.
+	seek($infile, ($blkno * $BLOCKSIZE), 0) or die($!);
+	my $success = read($infile, $block, $BLOCKSIZE);
+	die($!) if not defined $success;
+
+	close($infile);
+
+	return($block);
+}
+
+sub overwrite_block
+{
+	my ($filename, $block, $blkno) = @_;
+
+	# Open read-write ('>' would truncate the file), and seek to the wanted
+	# block, as syswrite()'s fourth argument is an offset into $block, not a
+	# file position.
+	open(my $outfile, '+<', $filename) or die;
+	binmode($outfile);
+
+	sysseek($outfile, ($blkno * $BLOCKSIZE), 0) or die($!);
+	my $nb = syswrite($outfile, $block, $BLOCKSIZE);
+
+	die($!) if not defined $nb;
+	die("Write error") if ($nb != $BLOCKSIZE);
+
+	$outfile->flush();
+
+	close($outfile);
+}
+
+sub get_uint16_from_page
+{
+	my ($block, $offset) = @_;
+
+	return (unpack("S*", $block))[$offset];
+}
+
+sub set_uint16_to_page
+{
+	my ($block, $data, $offset) = @_;
+
+	my $pack = pack("S", $data);
+
+	# vec with 16 bits or more won't preserve endianness
+	vec($block, 2*$offset, 8) = (unpack('C*', $pack))[0];
+	vec($block, (2*$offset) + 1, 8) = (unpack('C*', $pack))[1];
+
+	return $block;
+}
+
+sub check_checksums_call
+{
+	my ($node, $relname) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM pg_catalog.pg_check_relation('$relname')"
+	);
+
+	return ($stderr eq '');
+}
+
+sub check_checksums_nb_error
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM (SELECT pg_catalog.pg_check_relation(oid, 'main')"
+		. "   FROM pg_class WHERE relkind in ('r', 'i', 'm')) AS s"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+	is($stdout, $nb, "Should have $nb error");
+
+	$TOTAL_NB_ERR += $nb;
+}
+
+sub check_pg_stat_database_nb_error
+{
+	my ($node) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT "
+		. " sum(checksum_failures)"
+		. " FROM pg_catalog.pg_stat_database"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	is($stderr, '', 'Function should run successfully');
+	is($stdout, $TOTAL_NB_ERR, "Should have $TOTAL_NB_ERR error");
+}
+
+sub get_checksums_errors
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+		. " relid::regclass::text, forknum, failed_blocknum,"
+		. " expected_checksum, found_checksum"
+		. " FROM (SELECT (pg_catalog.pg_check_relation(oid)).*"
+		. "   FROM pg_class WHERE relkind in ('r', 'i', 'm')) AS s"
+	);
+
+	is($cmdret, '0', 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+
+	$TOTAL_NB_ERR += $nb;
+
+	return $stdout;
+}
+
+# This function will perform various test by modifying the specified block at
+# the specified uint16 offset, checking that the corruption is correctly
+# detected, and finally restore the specified block to its original content.
+sub corrupt_and_test_block
+{
+	my ($node, $filename, $blkno, $offset, $fake_data) = @_;
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+
+	check_pg_stat_database_nb_error($node);
+
+	$node->stop();
+
+	my $original_block = get_block($filename, $blkno);
+	my $original_data = get_uint16_from_page($original_block, $offset);
+
+	isnt($original_data, $fake_data,
+		"The fake data at offset $offset should be different"
+		. " from the existing one");
+
+	my $new_block = set_uint16_to_page($original_block, $fake_data, $offset);
+	isnt($original_data, get_uint16_from_page($new_block, $offset),
+		"The fake data at offset $offset should have been changed in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+
+	my $written_data = get_uint16_from_page(get_block($filename, $blkno), $offset);
+	isnt($original_data, $written_data,
+		"The data written at offset $offset should be different"
+		. " from the original one");
+	is(get_uint16_from_page($new_block, $offset), $written_data,
+		"The data written at offset $offset should be the same"
+		. " as the one in memory");
+	is($written_data, $fake_data,
+		"The data written at offset $offset should be the one"
+		. " we wanted to write");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 1, qr/invalid page in block $blkno/);
+
+	my $expected_checksum;
+	my $found_checksum = get_uint16_from_page($new_block,
+		$CHECKSUM_UINT16_OFFSET);
+	if ($offset == $PD_UPPER_UINT16_OFFSET)
+	{
+		# A checksum can't be computed if it's detected as PageIsNew(), so the
+		# function returns NULL for the computed checksum
+		$expected_checksum = '';
+	}
+	else
+	{
+		$expected_checksum = get_uint16_from_page($original_block,
+			$CHECKSUM_UINT16_OFFSET);
+	}
+
+	my $det = get_checksums_errors($node, 1, qr/invalid page in block $blkno/);
+	is($det, "t1|0|0|$expected_checksum|$found_checksum",
+		"The checksums error for modification at offset $offset"
+		. " should be detected");
+
+	$node->stop();
+
+	$new_block = set_uint16_to_page($original_block, $original_data, $offset);
+	is($original_data, get_uint16_from_page($new_block, $offset),
+		"The data at offset $offset should have been restored in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+	is($original_data, get_uint16_from_page(get_block($filename, $blkno),
+			$offset),
+		"The data at offset $offset should have been restored on disk");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+}
+
+if (exists $ENV{MY_PG_REGRESS})
+{
+	$ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS};
+}
+
+my $node = get_new_node('main');
+
+my %params;
+$params{'extra'} = ['--data-checksums'];
+$node->init(%params);
+
+$node->start();
+
+$ENV{PGOPTIONS} = '--client-min-messages=WARNING';
+
+my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_checksums')");
+
+is($stdout, 'on', 'Data checksums should be enabled');
+
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('block_size')");
+
+$BLOCKSIZE = $stdout;
+
+$node->safe_psql(
+	'postgres', q|
+	CREATE TABLE public.t1(id integer);
+	CREATE INDEX t1_id_idx ON public.t1 (id);
+	INSERT INTO public.t1 SELECT generate_series(1, 100);
+	CREATE VIEW public.v1 AS SELECT * FROM t1;
+	CREATE MATERIALIZED VIEW public.mv1 AS SELECT * FROM t1;
+	CREATE SEQUENCE public.s1;
+	CREATE UNLOGGED TABLE public.u_t1(id integer);
+	CREATE INDEX u_t1_id_idx ON public.u_t1 (id);
+	INSERT INTO public.u_t1 SELECT generate_series(1, 100);
+	CHECKPOINT;
+|);
+
+# Check sane behavior on various object types, including those that don't have
+# storage.
+is(check_checksums_call($node, 't1'), '1', 'Can check a table');
+is(check_checksums_call($node, 't1_id_idx'), '1', 'Can check an index');
+is(check_checksums_call($node, 'v1'), '', 'Cannot check a view');
+is(check_checksums_call($node, 'mv1'), '1', 'Can check a materialized view');
+is(check_checksums_call($node, 's1'), '1', 'Can check a sequence');
+is(check_checksums_call($node, 'u_t1'), '1', 'Can check an unlogged table');
+is(check_checksums_call($node, 'u_t1_id_idx'), '1', 'Can check an unlogged index');
+
+# get the underlying heap absolute path
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_directory') || '/' || pg_relation_filepath('t1')"
+);
+
+isnt($stdout, '', 'A relfilenode should be returned');
+
+my $filename = $stdout;
+
+check_checksums_nb_error($node, 0, qr/^$/);
+
+check_pg_stat_database_nb_error($node);
+
+my $fake_uint16 = hex '0x0000';
+
+# Test with a modified checksum.  We use a zero checksum here as it's the only
+# one that cannot exist on a checksummed page.  We also don't have an easy way
+# to compute what the checksum would be after a modification in a random place
+# in the block.
+corrupt_and_test_block($node, $filename, 0, $CHECKSUM_UINT16_OFFSET,
+	$fake_uint16);
+
+# Test corruption making the block look like it's PageIsNew().
+corrupt_and_test_block($node, $filename, 0, $PD_UPPER_UINT16_OFFSET,
+	$fake_uint16);
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 89e1b39036..0fd0f66050 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -50,7 +50,8 @@ my @contrib_excludes = (
 	'pgcrypto',         'sepgsql',
 	'brin',             'test_extensions',
 	'test_misc',        'test_pg_dump',
-	'snapshot_too_old', 'unsafe_tests');
+	'snapshot_too_old', 'unsafe_tests',
+	'check_relation');
 
 # Set of variables for frontend modules
 my $frontend_defines = { 'initdb' => 'FRONTEND' };
-- 
2.20.1

#44Julien Rouhaud
rjuju123@gmail.com
In reply to: Julien Rouhaud (#43)
Re: Online checksums verification in the backend

On Thu, Sep 10, 2020 at 09:47:23AM +0200, Julien Rouhaud wrote:

On Wed, Sep 09, 2020 at 03:41:30PM +0200, Julien Rouhaud wrote:

On Wed, Sep 09, 2020 at 11:25:29AM +0200, Julien Rouhaud wrote:

I'll do some benchmarking and see if I can get some figures, but it'll probably
take some time. In the meantime I'm attaching v11 of the patch that should
address all other comments.

I just realized that I forgot to update one of the Makefiles when moving the TAP
test folder.  v12 attached should fix that.

And the cfbot just reported a new error for the Windows build.  Attached v13 should
fix that.

I did some benchmarking using the following environment:

- 16MB shared buffers
- 490MB table (10M rows)
- synchronized_seqscan to off
- table in OS cache

I don't have a big machine, so I went with a very small shared_buffers and a
small table, to make sure that all data is in the OS cache but the table is
more than an order of magnitude bigger than shared_buffers, to simulate a
plausible environment.

I used a simple read-only query that performs a sequential scan of the table (a
simple SELECT * FROM table), run using 10 concurrent connections, 5 runs of 700
seconds.  I did that without any other activity, then with the original
pg_check_relation function running via \watch .1, and finally with a modified
version of that function, without the optimisation, also via \watch .1.

The TPS is obviously overall extremely bad, but I can see that the submitted
version added an overhead of ~3.9% (average of 5 runs), while the version
without the optimisation added an overhead of ~6.57%.

This is supposed to be a relatively fair benchmark, as all the data is cached
on the OS side, so IO done while holding the bufmapping lock isn't too long,
but we can see that we already get a non-negligible benefit from this
optimisation.  Should I do additional benchmarking, like dropping the OS cache
and/or adding some write activity?  This would probably only make the
unoptimized version perform even worse.
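For concreteness, the overhead figures quoted here are just the relative TPS drop while the check runs concurrently; a minimal sketch of that computation (the TPS values used below are made-up placeholders, not the benchmark's actual numbers):

```c
#include <assert.h>

/*
 * Relative TPS drop, in percent, caused by running the checksum check
 * concurrently with the read-only workload.
 */
static double
overhead_pct(double tps_baseline, double tps_with_check)
{
	return (tps_baseline - tps_with_check) * 100.0 / tps_baseline;
}
```

With hypothetical averages of 1000 TPS idle and 961 TPS while the check loops, this reports a ~3.9% overhead, which is how the percentages above are derived.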

#45Michael Paquier
michael@paquier.xyz
In reply to: Julien Rouhaud (#44)
Re: Online checksums verification in the backend

On Thu, Sep 10, 2020 at 08:06:10PM +0200, Julien Rouhaud wrote:

The TPS is obviously overall extremely bad, but I can see that the submitted
version added an overhead of ~3.9% (average of 5 runs), while the version
without the optimisation added an overhead of ~6.57%.

Okay, so that stands as a difference. I am planning to run some
benchmarks on my end as well, and see if I can see a clear
difference.

This is supposed to be a relatively fair benchmark as all the data are cached
on the OS side, so IO done while holding the bufmapping lock aren't too long,
but we can see that we already get a non negligible benefit from this
optimisation. Should I do additional benchmarking, like dropping the OS cache
and/or adding some write activity? This would probably only make the
unoptimized version perform even worse.

It would also be interesting to see the case where the pages are not
in the OS cache, and how bad it can get.  For the read-write case,
I am not sure, as we may have some different overhead that hides the
noise.  Also, did you run your tests with the functions scanning at
full speed, with (ChecksumCostDelay < 0) so as there is no throttling?
--
Michael

#46Julien Rouhaud
rjuju123@gmail.com
In reply to: Michael Paquier (#45)
Re: Online checksums verification in the backend

On Fri, Sep 11, 2020 at 9:34 AM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Sep 10, 2020 at 08:06:10PM +0200, Julien Rouhaud wrote:

The TPS is obviously overall extremely bad, but I can see that the submitted
version added an overhead of ~3.9% (average of 5 runs), while the version
without the optimisation added an overhead of ~6.57%.

Okay, so that stands as a difference. I am planning to run some
benchmarks on my end as well, and see if I can see a clear
difference.

Thanks!

This is supposed to be a relatively fair benchmark as all the data are cached
on the OS side, so IO done while holding the bufmapping lock aren't too long,
but we can see that we already get a non negligible benefit from this
optimisation. Should I do additional benchmarking, like dropping the OS cache
and/or adding some write activity? This would probably only make the
unoptimized version perform even worse.

It would be also interesting to see the case where the pages are not
in the OS cache and see how bad it can get. For the read-write case,
I am not sure as we may have some different overhead that hide the
noise. Also, did you run your tests with the functions scanning at
full speed, with (ChecksumCostDelay < 0) so as there is no throttling?

I used all default settings, but by default checksum_cost_delay is 0
so there shouldn't be any throttling.
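As background for the throttling discussion: checksum_cost_delay presumably follows the vacuum-style cost model, where each page checked is charged a cost and the process sleeps once an accumulated limit is reached. A standalone sketch of that accounting (variable names and default values here are illustrative, not taken from the patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative cost-model knobs, mirroring the vacuum cost machinery. */
static int	checksum_cost_page = 10;	/* cost charged per page checked */
static int	checksum_cost_limit = 200;	/* sleep once this much is accrued */
static int	checksum_cost_delay = 0;	/* sleep duration; 0 disables throttling */

static int	cost_balance = 0;

/* Charge one page and report whether the caller should sleep now. */
static bool
checksum_cost_account_page(void)
{
	if (checksum_cost_delay <= 0)
		return false;			/* throttling disabled */
	cost_balance += checksum_cost_page;
	if (cost_balance >= checksum_cost_limit)
	{
		cost_balance = 0;
		return true;
	}
	return false;
}
```

With checksum_cost_delay = 0 (the default used in these benchmarks), the function never requests a sleep, which is why the runs above are unthrottled.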

#47Michael Paquier
michael@paquier.xyz
In reply to: Julien Rouhaud (#46)
Re: Online checksums verification in the backend

On Fri, Sep 11, 2020 at 09:49:16AM +0200, Julien Rouhaud wrote:

Thanks!

I got some numbers out of my pocket, using the following base
configuration:
wal_level = minimal
fsync = off
shared_buffers = '300MB' # also tested with 30MB and 60MB
checksum_cost_delay = 0 # default in patch

And for the test I have used pgbench initialized at a scale of 250, so as to
have close to 3.5GB of data, which gives us roughly 90% buffer
eviction, with all the data cached in the OS (we may also want to look
at the case where the OS cache does not hold all the relation
pages).  I have also run some tests with 30MB and 60MB of shared
buffers, with similar results.

I also applied some prewarming on the cluster:
create extension pg_prewarm
select pg_prewarm(oid) from pg_class where oid > 16000;

Then, I have done five runs using that:
pgbench -S -M prepared -c 64/128/256 -n -T 60
In parallel of that, I got this stuff running in parallel all the
time:
select pg_check_relation('pgbench_accounts');
\watch 0.1

Here are some TPS numbers with the execution time of pg_check_relation.
In the five runs, I removed the highest and lowest ones, then took an
average of the remaining three.  I have also tested two modes: with
and without the optimization, that requires a one-liner in checksum.c
as per your latest patch:
--- a/src/backend/storage/page/checksum.c
+++ b/src/backend/storage/page/checksum.c
@@ -84,7 +84,7 @@ check_one_block(Relation relation, ForkNumber forknum, BlockNumber blkno,
                 uint16 *chk_expected, uint16 *chk_found)
 {
     char        buffer[BLCKSZ];
-    bool        force_lock = false;
+    bool        force_lock = true;
     bool        found_in_sb;

Within parentheses is the amount of time taken by pg_relation_check()
for a single check.  This is not completely exact and I saw some
variation; it is just to give an impression:
Conns             64            128                256
force_lock=true   60590 (7~8s)  55652 (10.2~10.5s) 46812 (9~10s)
force_lock=false  58637 (5s)    54131 (6~7s)       37091 (1.1~1.2s)
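The averaging used for these numbers (five runs, drop the best and the worst, then average the remaining three) can be sketched as follows; this is just an illustration of the methodology, not code from the patch:

```c
#include <assert.h>
#include <stdlib.h>

static int
cmp_double(const void *a, const void *b)
{
	double		da = *(const double *) a;
	double		db = *(const double *) b;

	return (da > db) - (da < db);
}

/* Average of five runs after discarding the highest and lowest values. */
static double
trimmed_mean5(const double runs[5])
{
	double		sorted[5];
	double		sum = 0.0;

	for (int i = 0; i < 5; i++)
		sorted[i] = runs[i];
	qsort(sorted, 5, sizeof(double), cmp_double);
	for (int i = 1; i < 4; i++)	/* skip sorted[0] and sorted[4] */
		sum += sorted[i];
	return sum / 3.0;
}
```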

For connections higher than 128, I was kind of surprised to see
pg_relation_check being more aggressive without the optimization, with
actually much less contention on the buffer mapping LWLock, but that's
only an impression from looking at pg_stat_activity.

Looking at the wait events for each run, I saw much more hiccups with
the buffer mapping LWLock when forcing the lock rather than not, still
I was able to see some contention when also not forcing the lock. Not
surprising as this rejects a bunch of pages from shared buffers.

I used all default settings, but by default checksum_cost_delay is 0
so there shouldn't be any throttling.

Thanks, so did I.  From what I can see, there could also be
benefits in not using the optimization by default, as the relation
check then applies some natural throttling by making the checks
slower (there is a link between the individual runtime of
pg_relation_check and the TPS).  So it also seems to me that the
throttling parameters would be beneficial, but there is also a
point in not including any throttling in a first version if the
optimization to go full speed is not around.  Using three new GUCs
for those function calls is still too much IMO, so there is also
the argument to move all this stuff into a new contrib/ module, and
have a bgworker implementation as part of it, as it would need
shared_preload_libraries anyway.

Also, I have been putting some thought into the APIs able to fetch a
buffer without going through the shared buffers.  Neither checksum.c
nor checksumfuncs.c fits: checksum.c should be a place that those APIs
depend on, containing only the most-internal logic for the checksum
algorithm and computation, while checksumfuncs.c doesn't account for
the case of a background worker (one that could spawn a set of dynamic
workers connecting to different databases to do checksum
verifications?).  It would be good to keep the handling of the buffer
mapping lock as well as the calls to smgrread() into a single place.
ReadBuffer_common() is a natural place for that, though it means for
our use case the addition of three new options:
- Being able to pass down directly a buffer pointer to save the page
contents.
- Being able to not verify directly a page, leaving the verification
to the caller upthread.
- Addition of a new mode, that I am calling here RBM_PRIVATE, where we
actually read the page from disk if not yet in shared buffers, except
that we fill in the contents of the page using the pointer given by
the caller.  That's different from using local buffers, as we
don't need that much complication (like temporary tables) for
per-page checks.

Another idea would be to actually just let ReadBuffer_common just do
the check by itself, with a different mode like a kind of
RBM_VALIDATE, where we just return a verification state of the page
that can be consumed by callers.

This also comes with some more advantages:
- Tracking of reads from disk with track_io_timing.
- Addition of some private stats dedicated to this private mode, with
new fields in pgBufferUsage, all in a single place
--
Michael

#48Julien Rouhaud
rjuju123@gmail.com
In reply to: Michael Paquier (#47)
Re: Online checksums verification in the backend

On Wed, Sep 16, 2020 at 11:46 AM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Sep 11, 2020 at 09:49:16AM +0200, Julien Rouhaud wrote:

Thanks!

I got some numbers out of my pocket, using the following base
configuration:
[...]

Within parenthesis is the amount of time taken by pg_relation_check()
for a single check. This is not completely exact and I saw some
variations, just to give an impression:
Conns 64 128 256
force_lock=true 60590 (7~8s) 55652 (10.2~10.5s) 46812 (9~10s)
force_lock=false 58637 (5s) 54131 (6~7s) 37091 (1.1~1.2s)

For connections higher than 128, I was kind of surprised to see
pg_relation_check being more aggressive without the optimization, with
much less contention on the buffer mapping LWLock actually, but that's
an impression from looking at pg_stat_activity.

Looking at the wait events for each run, I saw much more hiccups with
the buffer mapping LWLock when forcing the lock rather than not, still
I was able to see some contention when also not forcing the lock. Not
surprising as this rejects a bunch of pages from shared buffers.

I used all default settings, but by default checksum_cost_delay is 0
so there shouldn't be any throttling.

Thanks, so did I. From what I can see, there could be as well
benefits in not using the optimization by default so as the relation
check applies some natural throttling by making the checks actually
slower (there is a link between the individual runtime of
pg_relation_time and the TPS).

Thanks a lot for the tests! I'm not surprised that forcing the lock
will slow down the pg_check_relation() execution, but I'm a bit
surprised that holding the buffer mapping lock longer in a workload
that has a lot of evictions actually makes things faster. Do you have
any idea why that's the case?

So it also seems to me that the
throttling parameters would be beneficial, but it looks to me that
there is as well a point to not include any throttling in a first
version if the optimization to go full speed is not around. Using
three new GUCs for those function calls is still too much IMO

I'm assuming that you prefer to remove both the optimization and the
throttling part?  I'll do that in the next version unless there are
objections.

, so
there is also the argument to move all this stuff into a new contrib/
module, and have a bgworker implementation as part of it as it would
need shared_preload_libraries anyway.

Also, I have been putting some thoughts into the APIs able to fetch a
buffer without going through the shared buffers. And neither
checksum.c, because it should be a place that those APIs depends on
and include only the most-internal logic for checksum algorithm and
computation, nor checksumfuncs.c, because we need to think about the
case of a background worker as well (that could spawn a set of dynamic
workers connecting to different databases able to do checksum
verifications?). It would be good to keep the handling of the buffer
mapping lock as well as the calls to smgrread() into a single place.
ReadBuffer_common() is a natural place for that, though it means for
our use case the addition of three new options:
- Being able to pass down directly a buffer pointer to save the page
contents.
- Being able to not verify directly a page, leaving the verification
to the caller upthread.
- Addition of a new mode, that I am calling here RBM_PRIVATE, where we
actually read the page from disk if not yet in shared buffers, except
that we fill in the contents of the page using the pointer given by
the caller. That's different than the use of local buffers, as we
don't need this much amount of complications like temporary tables of
course for per-page checks.

Another idea would be to actually just let ReadBuffer_common just do
the check by itself, with a different mode like a kind of
RBM_VALIDATE, where we just return a verification state of the page
that can be consumed by callers.

This also comes with some more advantages:
- Tracking of reads from disk with track_io_timing.
- Addition of some private stats dedicated to this private mode, with
new fields in pgBufferUsage, all in a single place

I agree that putting the code near ReadBuffer_common() would be a
good idea.  However, that means that I can't move all the code to
contrib/, so I'm wondering what you'd like to see going there.  I can
see some value in also having the SQL functions available in core
rather than in contrib, e.g. if you need to quickly check a relation
on a standby, without requiring the extension to be created on the
primary node first.

Then, I'm a bit worried about adding this code in ReadBuffer_common.
What this code does is quite different, and I'm afraid that it'll make
ReadBuffer_common more complex than needed, which is maybe not a good
idea for something as critical as this function.

What do you think?

#49Michael Paquier
michael@paquier.xyz
In reply to: Julien Rouhaud (#48)
Re: Online checksums verification in the backend

On Fri, Sep 25, 2020 at 06:11:47PM +0800, Julien Rouhaud wrote:

Thanks a lot for the tests! I'm not surprised that forcing the lock
will slow down the pg_check_relation() execution, but I'm a bit
surprised that holding the buffer mapping lock longer in a workload
that has a lot of evictions actually makes things faster. Do you have
any idea why that's the case?

That's still a bit unclear to me, but I have not spent much time
thinking about this particular point either.

I'm assuming that you prefer to remove both the optimization and the
throttling part? I'll do that with the next version unless there's
objections.

Yeah, any tests I have done tend to show that.  It would be good to
also check some perf profiles here, at least for the process running
the relation check in a loop.

I agree that putting the code nearby ReadBuffer_common() would be a
good idea. However, that means that I can't move all the code to
contrib/ I'm wondering what you'd like to see going there. I can see
some values in also having the SQL functions available in core rather
than contrib, e.g. if you need to quickly check a relation on a
standby, so without requiring to create the extension on the primary
node first.

Good point. This could make the user experience worse.

Then, I'm a bit worried about adding this code in ReadBuffer_common.
What this code does is quite different, and I'm afraid that it'll make
ReadBuffer_common more complex than needed, which is maybe not a good
idea for something as critical as this function.

What do you think?

Yeah, I have been looking at ReadBuffer_common() and it is true that
it is complicated enough that we may not really need an extra mode or
more options, for a final logic that is actually different from what a
buffer read does: we just want to know if a page has a valid checksum
or not.  An idea I had here would be to add a new, separate
function to do the page check directly in bufmgr.c, but I guess that's
what you mean.  Currently only the prefetch routine and ReadBuffer_common
use partition locks, but getting that done in the same file looks like a
good compromise to me.  It would also be possible to keep the BLCKSZ
buffer used to check the page directly in this routine, so that any
caller willing to do a check doesn't need to worry about any
allocation.
--
Michael

#50Julien Rouhaud
rjuju123@gmail.com
In reply to: Michael Paquier (#49)
2 attachment(s)
Re: Online checksums verification in the backend

On Thu, Oct 1, 2020 at 1:07 PM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Sep 25, 2020 at 06:11:47PM +0800, Julien Rouhaud wrote:

Thanks a lot for the tests! I'm not surprised that forcing the lock
will slow down the pg_check_relation() execution, but I'm a bit
surprised that holding the buffer mapping lock longer in a workload
that has a lot of evictions actually makes things faster. Do you have
any idea why that's the case?

That's still a bit unclear to me, but I have not spent much time
thinking about this particular point either.

I'm assuming that you prefer to remove both the optimization and the
throttling part? I'll do that with the next version unless there's
objections.

Yeah, any tests I have done tends to show that. It would be good to
also check some perf profiles here, at least for the process running
the relation check in a loop.

I agree that putting the code nearby ReadBuffer_common() would be a
good idea. However, that means that I can't move all the code to
contrib/ I'm wondering what you'd like to see going there. I can see
some values in also having the SQL functions available in core rather
than contrib, e.g. if you need to quickly check a relation on a
standby, so without requiring to create the extension on the primary
node first.

Good point. This could make the user experience worse.

Then, I'm a bit worried about adding this code in ReadBuffer_common.
What this code does is quite different, and I'm afraid that it'll make
ReadBuffer_common more complex than needed, which is maybe not a good
idea for something as critical as this function.

What do you think?

Yeah, I have been looking at ReadBuffer_common() and it is true that
it is complicated enough so we may not really need an extra mode or
more options, for a final logic that is actually different than what a
buffer read does: we just want to know if a page has a valid checksum
or not. An idea that I got here would be to add a new, separate
function to do the page check directly in bufmgr.c, but that's what
you mean. Now only the prefetch routine and ReadBuffer_common use
partition locks, but getting that done in the same file looks like a
good compromise to me. It would be also possible to keep the BLCKSZ
buffer used to check the page directly in this routine, so as any
caller willing to do a check don't need to worry about any
allocation.

I made all the suggested modifications in the attached v14:

- moved the C code into bufmgr.c, near ReadBuffer
- removed the GUC and throttling options
- removed the dubious optimization

All documentation and comments are updated to reflect those changes.
I also split the commit in two, one for the backend infrastructure and
one for the SQL wrappers.

Attachments:

v14-0001-Add-backend-infrastructure-to-check-the-validity.patchapplication/octet-stream; name=v14-0001-Add-backend-infrastructure-to-check-the-validity.patchDownload
From c926abcb1171cc99fec36e189d46b27afda3cc92 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <jrouhaud@vmware.com>
Date: Mon, 4 Nov 2019 08:40:23 +0100
Subject: [PATCH v14 1/2] Add backend infrastructure to check the validity of
 an on-disk block.

A new CheckBuffer function is introduced.  It takes care of the various locking
aspects to make sure that no false positive can be returned.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada, Justin Pryzby
Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 216 ++++++++++++++++++++++++++++
 src/include/storage/bufmgr.h        |   8 ++
 2 files changed, 224 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e549fa1d30..e81e899594 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -45,7 +45,9 @@
 #include "postmaster/bgwriter.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "storage/checksum.h"
 #include "storage/ipc.h"
+#include "storage/lmgr.h"
 #include "storage/proc.h"
 #include "storage/smgr.h"
 #include "storage/standby.h"
@@ -480,6 +482,42 @@ static int	buffertag_comparator(const void *p1, const void *p2);
 static int	ckpt_buforder_comparator(const void *pa, const void *pb);
 static int	ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
 
+/* ----------------
+ * The rest of this module provides a set of functions that can be used to
+ * safely check all checksums on a running cluster.
+ *
+ * Please note that those only perform standard buffered reads, and don't try
+ * to bypass or discard the operating system cache.  If you want to check the
+ * actual storage, you have to discard the operating system cache before
+ * running those functions.
+ *
+ * To avoid torn pages and possible false positives when reading data, the
+ * following heuristics are used:
+ *
+ * - a shared LWLock is taken on the target buffer pool partition mapping, and
+ *   we detect if a block is in shared_buffers or not.  See check_get_buffer()
+ *   comments for more details about the locking strategy.
+ *
+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
+ *
+ * - if a block is otherwise found in shared_buffers, an IO lock is taken on
+ *   the block and the block is then read from storage, ignoring the block in
+ *   shared_buffers
+ *
+ * - if a block is not found in shared_buffers, the block is read from disk
+ *   while holding the buffer pool partition mapping LWLock.
+ *
+ * The check can be performed using an SQL function, returning the list of
+ * problematic blocks.
+ * ----------------
+ */
+static bool check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+							   uint16 *chk_found);
+static bool check_get_buffer(Relation relation, ForkNumber forknum,
+							 BlockNumber blkno, char *buffer);
+
 
 /*
  * Implementation of PrefetchBuffer() for shared buffers.
@@ -4583,3 +4621,181 @@ TestForOldSnapshot_impl(Snapshot snapshot, Relation relation)
 				(errcode(ERRCODE_SNAPSHOT_TOO_OLD),
 				 errmsg("snapshot too old")));
 }
+
+/*
+ * Perform a checksum check on the passed page.  Return true iff the page is
+ * valid, and assign the expected and found checksums to chk_expected and
+ * chk_found, respectively.  Note that a page can look new but could be
+ * the result of corruption.  We still check for this case, but we can't
+ * compute its checksum as pg_checksum_page() explicitly checks for
+ * non-new pages, so NoComputedChecksum will be set in chk_expected.
+ */
+static bool
+check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+				   uint16 *chk_found)
+{
+	Page		page = (Page) buffer;
+	PageHeader	hdr = (PageHeader) page;
+
+	Assert(chk_expected && chk_found);
+
+	if (PageIsNew(page))
+	{
+		/*
+		 * Check if the page is really new or if there's corruption that
+		 * affected PageIsNew detection.  Note that PageIsVerified won't try to
+		 * detect checksum corruption in this case, so there's no risk of
+		 * duplicated corruption report.
+		 */
+		if (PageIsVerified(page, blkno))
+		{
+			/* No corruption. */
+			return true;
+		}
+
+		/*
+		 * There's corruption, but since this affects PageIsNew, we
+		 * can't compute a checksum, so set NoComputedChecksum for the
+		 * expected checksum.
+		 */
+		*chk_expected = NoComputedChecksum;
+		*chk_found = hdr->pd_checksum;
+		return false;
+	}
+
+	*chk_expected = pg_checksum_page(buffer, blkno);
+	*chk_found = hdr->pd_checksum;
+
+	return (*chk_expected == *chk_found);
+}
+
+/*
+ *-------------------------
+ * Safely read the wanted buffer from disk, dealing with possible concurrency
+ * issues.  Note that if a buffer is found dirty in shared_buffers, no read
+ * will be performed and the caller will be informed that no check should be
+ * done.  We can safely ignore such buffers as they'll be written before the
+ * next checkpoint's completion.
+ *
+ * Note that the given buffer will be read into private memory.
+ *
+ * The following locks can be used in this function:
+ *
+ *   - shared LWLock on the target buffer pool partition mapping.
+ *   - IOLock on the buffer
+ *
+ * The IOLock is taken when reading the buffer from disk if it exists in
+ * shared_buffers, to avoid torn pages.
+ *
+ * If the buffer isn't in shared_buffers, it'll be read while the buffer
+ * mapping partition LWLock is still being held.  Reading under this lock
+ * avoids the unlikely but possible case that a buffer wasn't present in
+ * shared_buffers when we checked, but was then alloc'ed, modified and
+ * flushed concurrently before we read it, leading to false positives
+ * due to a torn page.
+ *
+ * Caller should hold an AccessShareLock on the Relation
+ *-------------------------
+ */
+static bool
+check_get_buffer(Relation relation, ForkNumber forknum,
+				 BlockNumber blkno, char *buffer)
+{
+	bool		checkit = true;
+	BufferTag	buf_tag;		/* identity of requested block */
+	uint32		buf_hash;		/* hash value for buf_tag */
+	LWLock	   *partLock;		/* buffer partition lock for the buffer */
+	BufferDesc *bufdesc;
+	int			buf_id;
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(buf_tag, relation->rd_smgr->smgr_rnode.node, forknum, blkno);
+
+	/* determine its hash code and partition lock ID */
+	buf_hash = BufTableHashCode(&buf_tag);
+	partLock = BufMappingPartitionLock(buf_hash);
+
+	/* see if the block is in the buffer pool already */
+	LWLockAcquire(partLock, LW_SHARED);
+	buf_id = BufTableLookup(&buf_tag, buf_hash);
+	if (buf_id >= 0)
+	{
+		uint32		buf_state;
+
+		/*
+		 * Found it.  Now, retrieve its state to know what to do with it,
+		 * releasing the buffer header spinlock immediately to limit overhead
+		 * as much as possible.  We'll keep the shared lightweight lock on the
+		 * buffer mapping partition, so this buffer can't be evicted, and
+		 * we'll acquire an IOLock on the buffer if we need to read the
+		 * content on disk.
+		 */
+		bufdesc = GetBufferDescriptor(buf_id);
+
+		buf_state = LockBufHdr(bufdesc);
+		UnlockBufHdr(bufdesc, buf_state);
+
+		/*
+		 * Dirty pages are ignored as they'll be flushed soon. Invalid buffers
+		 * are also skipped.
+		 */
+		if ((buf_state & BM_DIRTY) || !(buf_state & BM_TAG_VALID))
+			checkit = false;
+
+		/*
+		 * Read the buffer from disk, taking an IO lock to prevent torn-page
+		 * reads, in the unlikely event that it was concurrently dirtied and
+		 * flushed.
+		 */
+		if (checkit)
+		{
+			LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
+			smgrread(relation->rd_smgr, forknum, blkno, buffer);
+			LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
+		}
+	}
+	else
+	{
+		/*
+		 * Simply read the buffer.  There's no risk of modification on it as we
+		 * kept the buffer pool partition mapping lock.
+		 */
+		smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+		/* The buffer will have to be checked. */
+		Assert(checkit);
+	}
+
+	LWLockRelease(partLock);
+
+	return checkit;
+}
+
+/*
+ * Check data sanity for a specific block in the given fork of the given
+ * relation, always retrieved locally with smgrread even if a version exists in
+ * shared_buffers.  Returns false if the block appears to be corrupted, true
+ * otherwise.  Note that dirty and invalid blocks won't be checked.  Caller
+ * must hold at least an AccessShareLock on the relation.
+ */
+bool
+CheckBuffer(Relation relation, ForkNumber forknum, BlockNumber blkno,
+				uint16 *chk_expected, uint16 *chk_found)
+{
+	char		buffer[BLCKSZ];
+
+	Assert(CheckRelationLockedByMe(relation, AccessShareLock, true));
+	Assert(blkno < RelationGetNumberOfBlocksInFork(relation, forknum));
+	Assert(smgrexists(relation->rd_smgr, forknum));
+
+	*chk_expected = *chk_found = NoComputedChecksum;
+
+	if (!check_get_buffer(relation, forknum, blkno, buffer))
+		return true;
+
+	if (check_buffer(buffer, blkno, chk_expected, chk_found))
+		return true;
+
+	/* A corruption is detected. */
+	return false;
+}
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ee91b8fa26..24aa102175 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -244,6 +244,15 @@ extern void TestForOldSnapshot_impl(Snapshot snapshot, Relation relation);
 extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype);
 extern void FreeAccessStrategy(BufferAccessStrategy strategy);
 
+/*
+ * A zero checksum can never be computed; see pg_checksum_page().
+ */
+#define NoComputedChecksum	0
+
+extern bool CheckBuffer(Relation relation, ForkNumber forknum,
+						BlockNumber blkno, uint16 *chk_expected,
+						uint16 *chk_found);
+
 
 /* inline functions */
 
-- 
2.20.1

v14-0002-Add-a-pg_check_relation-SQL-function.patchapplication/octet-stream; name=v14-0002-Add-a-pg_check_relation-SQL-function.patchDownload
From 090dab37f8d3fa5a9b872d938e211bf8d69adab5 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <julien.rouhaud@free.fr>
Date: Thu, 15 Oct 2020 13:20:17 +0800
Subject: [PATCH v14 2/2] Add a pg_check_relation() SQL function

This function checks the validity of the checksums for all non-dirty blocks of
a given relation, and optionally a given fork, and returns the list of all
blocks whose checksums don't match, along with the expected and found checksums.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada, Justin Pryzby
Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 doc/src/sgml/func.sgml                        |  51 ++++
 src/backend/utils/adt/Makefile                |   1 +
 src/include/catalog/pg_proc.dat               |  16 +
 src/test/modules/Makefile                     |   1 +
 src/test/modules/check_relation/.gitignore    |   2 +
 src/test/modules/check_relation/Makefile      |  14 +
 src/test/modules/check_relation/README        |  23 ++
 .../check_relation/t/001_checksums_check.pl   | 276 ++++++++++++++++++
 src/tools/msvc/Mkvcbuild.pm                   |   3 +-
 9 files changed, 386 insertions(+), 1 deletion(-)
 create mode 100644 src/test/modules/check_relation/.gitignore
 create mode 100644 src/test/modules/check_relation/Makefile
 create mode 100644 src/test/modules/check_relation/README
 create mode 100644 src/test/modules/check_relation/t/001_checksums_check.pl

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index e7cff980dd..28663107c8 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26217,6 +26217,57 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
 
   </sect2>
 
+  <sect2 id="functions-data-sanity">
+   <title>Data Sanity Functions</title>
+
+   <para>
+    The functions shown in <xref linkend="functions-data-sanity-table"/>
+    provide a means to check the health of data files in a cluster.
+   </para>
+
+   <table id="functions-data-sanity-table">
+    <title>Data Sanity Functions</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry>
+        <literal><function>pg_check_relation(<parameter>relation</parameter> <type>regclass</type> [, <parameter>fork</parameter> <type>text</type>])</function></literal>
+       </entry>
+       <entry><type>setof record</type></entry>
+       <entry>Validate the checksum for all blocks of a relation.
+       </entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <indexterm>
+    <primary>pg_check_relation</primary>
+   </indexterm>
+   <para id="functions-check-relation-note" xreflabel="pg_check_relation">
+    <function>pg_check_relation</function> iterates over all blocks of a
+    given relation and verifies their checksums.  If passed,
+    <replaceable>fork</replaceable> specifies that only checksums of the given
+    fork are to be verified.  Fork should be <literal>'main'</literal> for the
+    main data fork, <literal>'fsm'</literal> for the free space map,
+    <literal>'vm'</literal> for the visibility map, or
+    <literal>'init'</literal> for the initialization fork.
+    The function returns a list of blocks for which the computed and stored
+    checksums don't match.  See <xref
+    linkend="runtime-config-resource-checksum-verification-cost"/> for
+    information on how to configure cost-based verification delay.  You must be
+    a member of the <literal>pg_read_all_stats</literal> role to use this
+    function.  It can only be used if data checksums are enabled.  See <xref
+    linkend="app-initdb-data-checksums"/> for more information.
+   </para>
+
+  </sect2>
+
   </sect1>
 
   <sect1 id="functions-trigger">
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index b4d55e849b..603f63afb6 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -22,6 +22,7 @@ OBJS = \
 	bool.o \
 	cash.o \
 	char.o \
+	checksumfuncs.o \
 	cryptohashes.o \
 	date.o \
 	datetime.o \
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 22340baf1c..9f4514d60f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10958,6 +10958,22 @@
   proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
   proargnames => '{tablespace,name,size,modification}',
   prosrc => 'pg_ls_tmpdir_1arg' },
+{ oid => '9147', descr => 'check data integrity for all forks of a relation',
+  proname => 'pg_check_relation', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass',
+  proallargtypes => '{regclass,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,o,o,o,o,o}',
+  proargnames => '{relation,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation' },
+{ oid => '9148', descr => 'check data integrity for one fork of a relation',
+  proname => 'pg_check_relation', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,i,o,o,o,o,o}',
+  proargnames => '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation_fork' },
 
 # hash partitioning constraint function
 { oid => '5028', descr => 'hash partition CHECK constraint',
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index a6d2ffbf9e..a845af71fd 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -7,6 +7,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = \
 		  brin \
 		  commit_ts \
+		  check_relation \
 		  delay_execution \
 		  dummy_index_am \
 		  dummy_seclabel \
diff --git a/src/test/modules/check_relation/.gitignore b/src/test/modules/check_relation/.gitignore
new file mode 100644
index 0000000000..871e943d50
--- /dev/null
+++ b/src/test/modules/check_relation/.gitignore
@@ -0,0 +1,2 @@
+# Generated by test suite
+/tmp_check/
diff --git a/src/test/modules/check_relation/Makefile b/src/test/modules/check_relation/Makefile
new file mode 100644
index 0000000000..a540cdece2
--- /dev/null
+++ b/src/test/modules/check_relation/Makefile
@@ -0,0 +1,14 @@
+# src/test/modules/check_relation/Makefile
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/check_relation
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/check_relation/README b/src/test/modules/check_relation/README
new file mode 100644
index 0000000000..415c4b21ad
--- /dev/null
+++ b/src/test/modules/check_relation/README
@@ -0,0 +1,23 @@
+src/test/modules/check_relation/README
+
+Regression tests for online checksums verification
+==================================================
+
+This directory contains a test suite for online checksums verification.
+
+Running the tests
+=================
+
+NOTE: You must have given the --enable-tap-tests argument to configure.
+
+Run
+    make check
+or
+    make installcheck
+You can use "make installcheck" if you previously did "make install".
+In that case, the code in the installation tree is tested.  With
+"make check", a temporary installation tree is built from the current
+sources and then tested.
+
+Either way, this test initializes, starts, and stops a test Postgres
+cluster.
diff --git a/src/test/modules/check_relation/t/001_checksums_check.pl b/src/test/modules/check_relation/t/001_checksums_check.pl
new file mode 100644
index 0000000000..2a3f2880ea
--- /dev/null
+++ b/src/test/modules/check_relation/t/001_checksums_check.pl
@@ -0,0 +1,276 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 59;
+
+our $CHECKSUM_UINT16_OFFSET = 4;
+our $PD_UPPER_UINT16_OFFSET = 7;
+our $BLOCKSIZE;
+our $TOTAL_NB_ERR = 0;
+
+sub get_block
+{
+	my ($filename, $blkno) = @_;
+	my $block;
+
+	open(my $infile, '<', $filename) or die;
+	binmode($infile);
+
+	my $success = read($infile, $block, $BLOCKSIZE, ($blkno * $BLOCKSIZE));
+	die($!) if not defined $success;
+
+	close($infile);
+
+	return($block);
+}
+
+sub overwrite_block
+{
+	my ($filename, $block, $blkno) = @_;
+
+	open(my $outfile, '>', $filename) or die;
+	binmode ($outfile);
+
+	my $nb = syswrite($outfile, $block, $BLOCKSIZE, ($blkno * $BLOCKSIZE));
+
+	die($!) if not defined $nb;
+	die("Write error") if ($nb != $BLOCKSIZE);
+
+	$outfile->flush();
+
+	close($outfile);
+}
+
+sub get_uint16_from_page
+{
+	my ($block, $offset) = @_;
+
+	return (unpack("S*", $block))[$offset];
+}
+
+sub set_uint16_to_page
+{
+	my ($block, $data, $offset) = @_;
+
+	my $pack = pack("S", $data);
+
+	# vec with 16B or more won't preserve endianness
+	# vec() with elements of 16 bits or more won't preserve endianness
+	vec($block, (2*$offset) + 1, 8) = (unpack('C*', $pack))[1];
+
+	return $block;
+}
+
+sub check_checksums_call
+{
+	my ($node, $relname) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM pg_catalog.pg_check_relation('$relname')"
+	);
+
+	 return ($stderr eq '');
+}
+
+sub check_checksums_nb_error
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+        . " FROM (SELECT pg_catalog.pg_check_relation(oid, 'main')"
+        . "   FROM pg_class WHERE relkind in ('r', 'i', 'm')) AS s"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+	is($stdout, $nb, "Should have $nb error");
+
+	$TOTAL_NB_ERR += $nb;
+}
+
+sub check_pg_stat_database_nb_error
+{
+	my ($node) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT "
+		. " sum(checksum_failures)"
+		. " FROM pg_catalog.pg_stat_database"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	is($stderr, '', 'Function should run successfully');
+	is($stdout, $TOTAL_NB_ERR, "Should have $TOTAL_NB_ERR error");
+}
+
+sub get_checksums_errors
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+		. " relid::regclass::text, forknum, failed_blocknum,"
+		. " expected_checksum, found_checksum"
+		. " FROM (SELECT (pg_catalog.pg_check_relation(oid)).*"
+        . "   FROM pg_class WHERE relkind in ('r','i', 'm')) AS s"
+	);
+
+	 is($cmdret, '0', 'Function should run successfully');
+	 like($stderr, $pattern, 'Error output should match expectations');
+
+	 $TOTAL_NB_ERR += $nb;
+
+	 return $stdout;
+}
+
+# This function will perform various tests by modifying the specified block at
+# the specified uint16 offset, checking that the corruption is correctly
+# detected, and finally restore the specified block to its original content.
+sub corrupt_and_test_block
+{
+	my ($node, $filename, $blkno, $offset, $fake_data) = @_;
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+
+	check_pg_stat_database_nb_error($node);
+
+	$node->stop();
+
+	my $original_block = get_block($filename, 0);
+	my $original_data = get_uint16_from_page($original_block, $offset);
+
+	isnt($original_data, $fake_data,
+		"The fake data at offset $offset should be different"
+		. " from the existing one");
+
+	my $new_block = set_uint16_to_page($original_block, $fake_data, $offset);
+	isnt($original_data, get_uint16_from_page($new_block, $offset),
+		"The fake data at offset $offset should have been changed in memory");
+
+	overwrite_block($filename, $new_block, 0);
+
+	my $written_data = get_uint16_from_page(get_block($filename, 0), $offset);
+	isnt($original_data, $written_data,
+		"The data written at offset $offset should be different"
+		. " from the original one");
+	is(get_uint16_from_page($new_block, $offset), $written_data,
+		"The data written at offset $offset should be the same"
+		. " as the one in memory");
+	is($written_data, $fake_data,
+		"The data written at offset $offset should be the one"
+		. " we wanted to write");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 1, qr/invalid page in block $blkno/);
+
+	my $expected_checksum;
+	my $found_checksum = get_uint16_from_page($new_block,
+		$CHECKSUM_UINT16_OFFSET);
+	if ($offset == $PD_UPPER_UINT16_OFFSET)
+	{
+		# A checksum can't be computed if it's detected as PageIsNew(), so the
+		# function returns NULL for the computed checksum
+		$expected_checksum = '';
+	}
+	else
+	{
+		$expected_checksum = get_uint16_from_page($original_block,
+			$CHECKSUM_UINT16_OFFSET);
+	}
+
+	my $det = get_checksums_errors($node, 1, qr/invalid page in block $blkno/);
+	is($det, "t1|0|0|$expected_checksum|$found_checksum",
+		"The checksums error for modification at offset $offset"
+		. " should be detected");
+
+	$node->stop();
+
+	$new_block = set_uint16_to_page($original_block, $original_data, $offset);
+	is($original_data, get_uint16_from_page($new_block, $offset),
+		"The data at offset $offset should have been restored in memory");
+
+	overwrite_block($filename, $new_block, 0);
+	is($original_data, get_uint16_from_page(get_block($filename, $blkno),
+			$offset),
+		"The data at offset $offset should have been restored on disk");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+}
+
+if (exists $ENV{MY_PG_REGRESS})
+{
+	$ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS};
+}
+
+my $node = get_new_node('main');
+
+my %params;
+$params{'extra'} = ['--data-checksums'];
+$node->init(%params);
+
+$node->start();
+
+$ENV{PGOPTIONS} = '--client-min-messages=WARNING';
+
+my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_checksums')");
+
+is($stdout, 'on', 'Data checksums should be enabled');
+
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('block_size')");
+
+$BLOCKSIZE = $stdout;
+
+$node->safe_psql(
+	'postgres', q|
+	CREATE TABLE public.t1(id integer);
+	CREATE INDEX t1_id_idx ON public.t1 (id);
+	INSERT INTO public.t1 SELECT generate_series(1, 100);
+	CREATE VIEW public.v1 AS SELECT * FROM t1;
+	CREATE MATERIALIZED VIEW public.mv1 AS SELECT * FROM t1;
+	CREATE SEQUENCE public.s1;
+	CREATE UNLOGGED TABLE public.u_t1(id integer);
+	CREATE INDEX u_t1_id_idx ON public.u_t1 (id);
+	INSERT INTO public.u_t1 SELECT generate_series(1, 100);
+	CHECKPOINT;
+|);
+
+# Check sane behavior on various object types, including those that don't
+# have storage.
+is(check_checksums_call($node, 't1'), '1', 'Can check a table');
+is(check_checksums_call($node, 't1_id_idx'), '1', 'Can check an index');
+is(check_checksums_call($node, 'v1'), '', 'Cannot check a view');
+is(check_checksums_call($node, 'mv1'), '1', 'Can check a materialized view');
+is(check_checksums_call($node, 's1'), '1', 'Can check a sequence');
+is(check_checksums_call($node, 'u_t1'), '1', 'Can check an unlogged table');
+is(check_checksums_call($node, 'u_t1_id_idx'), '1', 'Can check an unlogged index');
+
+# get the underlying heap absolute path
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_directory') || '/' || pg_relation_filepath('t1')"
+);
+
+isnt($stdout, '', 'A relfilenode should be returned');
+
+my $filename = $stdout;
+
+check_checksums_nb_error($node, 0, qr/^$/);
+
+check_pg_stat_database_nb_error($node);
+
+my $fake_uint16 = hex '0x0000';
+
+# Test with a modified checksum.  We use a zero checksum here as it's the only
+# one that cannot exist on a checksummed page.  We also don't have an easy way
+# to compute what the checksum would be after a modification in a random place
+# in the block.
+corrupt_and_test_block($node, $filename, 0, $CHECKSUM_UINT16_OFFSET,
+	$fake_uint16);
+
+# Test a corruption making the block look like it's PageIsNew().
+corrupt_and_test_block($node, $filename, 0, $PD_UPPER_UINT16_OFFSET,
+	$fake_uint16);
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 90594bd41b..adbabba3c4 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -50,7 +50,8 @@ my @contrib_excludes = (
 	'pgcrypto',         'sepgsql',
 	'brin',             'test_extensions',
 	'test_misc',        'test_pg_dump',
-	'snapshot_too_old', 'unsafe_tests');
+	'snapshot_too_old', 'unsafe_tests',
+	'check_relation');
 
 # Set of variables for frontend modules
 my $frontend_defines = { 'initdb' => 'FRONTEND' };
-- 
2.20.1

#51Julien Rouhaud
rjuju123@gmail.com
In reply to: Julien Rouhaud (#50)
2 attachment(s)
Re: Online checksums verification in the backend

On Thu, Oct 15, 2020 at 1:37 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Thu, Oct 1, 2020 at 1:07 PM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Sep 25, 2020 at 06:11:47PM +0800, Julien Rouhaud wrote:

Thanks a lot for the tests! I'm not surprised that forcing the lock
will slow down the pg_check_relation() execution, but I'm a bit
surprised that holding the buffer mapping lock longer in a workload
that has a lot of evictions actually makes things faster. Do you have
any idea why that's the case?

That's still a bit unclear to me, but I have not spent much time
thinking about this particular point either.

I'm assuming that you prefer to remove both the optimization and the
throttling part? I'll do that in the next version unless there are
objections.

Yeah, any tests I have done tend to show that. It would be good to
also check some perf profiles here, at least for the process running
the relation check in a loop.

I agree that putting the code near ReadBuffer_common() would be a
good idea. However, that means I can't move all the code to contrib/,
so I'm wondering what you'd like to see going there. I can also see
some value in having the SQL functions available in core rather than
contrib, e.g. if you need to quickly check a relation on a standby
without having to create the extension on the primary node first.

Good point. This could make the user experience worse.

Then, I'm a bit worried about adding this code to ReadBuffer_common().
What this code does is quite different, and I'm afraid it'll make
ReadBuffer_common() more complex than needed, which may not be a good
idea for something as critical as this function.

What do you think?

Yeah, I have been looking at ReadBuffer_common() and it is true that
it is complicated enough that we may not really need an extra mode or
more options for a final logic that is actually different from what a
buffer read does: we just want to know whether a page has a valid
checksum or not. An idea I had here would be to add a new, separate
function to do the page check directly in bufmgr.c, but I gather
that's what you mean. Currently only the prefetch routine and
ReadBuffer_common() use partition locks, but getting that done in the
same file looks like a good compromise to me. It would also be
possible to keep the BLCKSZ buffer used to check the page directly in
this routine, so that any caller wanting to do a check doesn't need
to worry about any allocation.
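
Stripped of the locking details, the per-block logic under discussion boils down to the following sketch. This is illustrative Python, not the patch's C code; the `page_checksum` stand-in is hypothetical and only mimics the shape of pg_checksum_page() (a 16-bit digest over the page that skips the stored checksum field and is never zero):

```python
import struct

PAGE_SIZE = 8192

def page_checksum(page: bytes, blkno: int) -> int:
    # Hypothetical stand-in for pg_checksum_page(): a deterministic 16-bit
    # digest that skips the stored pd_checksum bytes (offsets 8-9) and is
    # never zero, mirroring the "a computed checksum is never 0" invariant.
    h = blkno & 0xFFFF
    for i, b in enumerate(page):
        if i in (8, 9):
            continue
        h = (h * 31 + b) & 0xFFFF
    return h if h != 0 else 1

def check_block(page: bytes, blkno: int):
    """Return None if the block passes, else (expected, found) checksums."""
    if page == bytes(PAGE_SIZE):
        # All-zero (PageIsNew) pages carry no checksum to verify.
        return None
    found = struct.unpack_from('<H', page, 8)[0]   # stored pd_checksum
    expected = page_checksum(page, blkno)
    return None if expected == found else (expected, found)
```

As in the patch, a mismatch reports both the expected (computed) and found (stored) values, and new pages are skipped rather than flagged.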

I made all the suggested modifications in attached v14:

- moved the C code to bufmgr.c, near ReadBuffer
- removed the GUC and throttling options
- removed the dubious optimization

All documentation and comments are updated to reflect those changes.
I also split the commit in two, one for the backend infrastructure and
one for the SQL wrappers.

And I did miss a reference in the sgml documentation, fixed in v15.
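
As an aside, the byte-poking the TAP test does with Perl's vec()/pack() can be expressed more directly. Here is a small Python sketch of the same two helpers, assuming a little-endian platform (the page header fields are stored in native byte order; the Perl version goes byte-by-byte precisely to preserve that order):

```python
import struct

CHECKSUM_UINT16_OFFSET = 4   # pd_checksum: byte offset 8 in the page header
PD_UPPER_UINT16_OFFSET = 7   # pd_upper: byte offset 14

def get_uint16_from_page(block: bytes, offset: int) -> int:
    """Read the offset-th 16-bit word of a page image."""
    return struct.unpack_from('<H', block, offset * 2)[0]

def set_uint16_to_page(block: bytes, data: int, offset: int) -> bytes:
    """Return a copy of the page with the offset-th 16-bit word replaced."""
    buf = bytearray(block)
    struct.pack_into('<H', buf, offset * 2, data)
    return bytes(buf)

# Corrupt the stored checksum of an 8kB page image, as the test does.
page = bytes(8192)
corrupted = set_uint16_to_page(page, 0xBEEF, CHECKSUM_UINT16_OFFSET)
assert get_uint16_from_page(corrupted, CHECKSUM_UINT16_OFFSET) == 0xBEEF
```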

Attachments:

v15-0002-Add-a-pg_check_relation-SQL-function.patchapplication/octet-stream; name=v15-0002-Add-a-pg_check_relation-SQL-function.patchDownload
From df62bc6fc25effa4d3f5c59c9347c1638a0f53d0 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <julien.rouhaud@free.fr>
Date: Thu, 15 Oct 2020 13:20:17 +0800
Subject: [PATCH v15 2/2] Add a pg_check_relation() SQL function

This function checks the validity of the checksums for all non-dirty blocks of
a given relation, and optionally a given fork, and returns the list of all
blocks whose checksums don't match, along with the expected and found checksums.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada, Justin Pryzby
Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 doc/src/sgml/func.sgml                        |  51 ++++
 src/backend/utils/adt/Makefile                |   1 +
 src/include/catalog/pg_proc.dat               |  16 +
 src/test/modules/Makefile                     |   1 +
 src/test/modules/check_relation/.gitignore    |   2 +
 src/test/modules/check_relation/Makefile      |  14 +
 src/test/modules/check_relation/README        |  23 ++
 .../check_relation/t/001_checksums_check.pl   | 276 ++++++++++++++++++
 src/tools/msvc/Mkvcbuild.pm                   |   3 +-
 9 files changed, 386 insertions(+), 1 deletion(-)
 create mode 100644 src/test/modules/check_relation/.gitignore
 create mode 100644 src/test/modules/check_relation/Makefile
 create mode 100644 src/test/modules/check_relation/README
 create mode 100644 src/test/modules/check_relation/t/001_checksums_check.pl

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index e7cff980dd..28663107c8 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26217,6 +26217,57 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
 
   </sect2>
 
+  <sect2 id="functions-data-sanity">
+   <title>Data Sanity Functions</title>
+
+   <para>
+    The functions shown in <xref linkend="functions-data-sanity-table"/>
+    provide a means to check the health of data files in a cluster.
+   </para>
+
+   <table id="functions-data-sanity-table">
+    <title>Data Sanity Functions</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry>
+        <literal><function>pg_check_relation(<parameter>relation</parameter> <type>regclass</type> [, <parameter>fork</parameter> <type>text</type>])</function></literal>
+       </entry>
+       <entry><type>setof record</type></entry>
+       <entry>Validate the checksum for all blocks of a relation.
+       </entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <indexterm>
+    <primary>pg_check_relation</primary>
+   </indexterm>
+   <para id="functions-check-relation-note" xreflabel="pg_check_relation">
+    <function>pg_check_relation</function> iterates over all blocks of a
+    given relation and verifies their checksums.  If passed,
+    <replaceable>fork</replaceable> specifies that only checksums of the given
+    fork are to be verified.  Fork should be <literal>'main'</literal> for the
+    main data fork, <literal>'fsm'</literal> for the free space map,
+    <literal>'vm'</literal> for the visibility map, or
+    <literal>'init'</literal> for the initialization fork.
+    The function returns a list of blocks for which the computed and stored
+    checksums don't match.  See <xref
+    linkend="runtime-config-resource-checksum-verification-cost"/> for
+    information on how to configure cost-based verification delay.  You must be
+    a member of the <literal>pg_read_all_stats</literal> role to use this
+    function.  It can only be used if data checksums are enabled.  See <xref
+    linkend="app-initdb-data-checksums"/> for more information.
+   </para>
+
+  </sect2>
+
   </sect1>
 
   <sect1 id="functions-trigger">
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index b4d55e849b..603f63afb6 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -22,6 +22,7 @@ OBJS = \
 	bool.o \
 	cash.o \
 	char.o \
+	checksumfuncs.o \
 	cryptohashes.o \
 	date.o \
 	datetime.o \
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 22340baf1c..9f4514d60f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10958,6 +10958,22 @@
   proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
   proargnames => '{tablespace,name,size,modification}',
   prosrc => 'pg_ls_tmpdir_1arg' },
+{ oid => '9147', descr => 'check data integrity for all forks of a relation',
+  proname => 'pg_check_relation', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass',
+  proallargtypes => '{regclass,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,o,o,o,o,o}',
+  proargnames => '{relation,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation' },
+{ oid => '9148', descr => 'check data integrity for one fork of a relation',
+  proname => 'pg_check_relation', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,i,o,o,o,o,o}',
+  proargnames => '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation_fork' },
 
 # hash partitioning constraint function
 { oid => '5028', descr => 'hash partition CHECK constraint',
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index a6d2ffbf9e..a845af71fd 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -7,6 +7,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = \
 		  brin \
 		  commit_ts \
+		  check_relation \
 		  delay_execution \
 		  dummy_index_am \
 		  dummy_seclabel \
diff --git a/src/test/modules/check_relation/.gitignore b/src/test/modules/check_relation/.gitignore
new file mode 100644
index 0000000000..871e943d50
--- /dev/null
+++ b/src/test/modules/check_relation/.gitignore
@@ -0,0 +1,2 @@
+# Generated by test suite
+/tmp_check/
diff --git a/src/test/modules/check_relation/Makefile b/src/test/modules/check_relation/Makefile
new file mode 100644
index 0000000000..a540cdece2
--- /dev/null
+++ b/src/test/modules/check_relation/Makefile
@@ -0,0 +1,14 @@
+# src/test/modules/check_relation/Makefile
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/check_relation
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/check_relation/README b/src/test/modules/check_relation/README
new file mode 100644
index 0000000000..415c4b21ad
--- /dev/null
+++ b/src/test/modules/check_relation/README
@@ -0,0 +1,23 @@
+src/test/modules/check_relation/README
+
+Regression tests for online checksums verification
+==================================================
+
+This directory contains a test suite for online checksums verification.
+
+Running the tests
+=================
+
+NOTE: You must have given the --enable-tap-tests argument to configure.
+
+Run
+    make check
+or
+    make installcheck
+You can use "make installcheck" if you previously did "make install".
+In that case, the code in the installation tree is tested.  With
+"make check", a temporary installation tree is built from the current
+sources and then tested.
+
+Either way, this test initializes, starts, and stops a test Postgres
+cluster.
diff --git a/src/test/modules/check_relation/t/001_checksums_check.pl b/src/test/modules/check_relation/t/001_checksums_check.pl
new file mode 100644
index 0000000000..2a3f2880ea
--- /dev/null
+++ b/src/test/modules/check_relation/t/001_checksums_check.pl
@@ -0,0 +1,276 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 59;
+
+our $CHECKSUM_UINT16_OFFSET = 4;
+our $PD_UPPER_UINT16_OFFSET = 7;
+our $BLOCKSIZE;
+our $TOTAL_NB_ERR = 0;
+
+sub get_block
+{
+	my ($filename, $blkno) = @_;
+	my $block;
+
+	open(my $infile, '<', $filename) or die;
+	binmode($infile);
+
+	my $success = read($infile, $block, $BLOCKSIZE, ($blkno * $BLOCKSIZE));
+	die($!) if not defined $success;
+
+	close($infile);
+
+	return($block);
+}
+
+sub overwrite_block
+{
+	my ($filename, $block, $blkno) = @_;
+
+	open(my $outfile, '>', $filename) or die;
+	binmode ($outfile);
+
+	my $nb = syswrite($outfile, $block, $BLOCKSIZE, ($blkno * $BLOCKSIZE));
+
+	die($!) if not defined $nb;
+	die("Write error") if ($nb != $BLOCKSIZE);
+
+	$outfile->flush();
+
+	close($outfile);
+}
+
+sub get_uint16_from_page
+{
+	my ($block, $offset) = @_;
+
+	return (unpack("S*", $block))[$offset];
+}
+
+sub set_uint16_to_page
+{
+	my ($block, $data, $offset) = @_;
+
+	my $pack = pack("S", $data);
+
+	# vec with 16B or more won't preserve endianness
+	# vec() with elements of 16 bits or more won't preserve endianness
+	vec($block, (2*$offset) + 1, 8) = (unpack('C*', $pack))[1];
+
+	return $block;
+}
+
+sub check_checksums_call
+{
+	my ($node, $relname) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM pg_catalog.pg_check_relation('$relname')"
+	);
+
+	 return ($stderr eq '');
+}
+
+sub check_checksums_nb_error
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+        . " FROM (SELECT pg_catalog.pg_check_relation(oid, 'main')"
+        . "   FROM pg_class WHERE relkind in ('r', 'i', 'm')) AS s"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+	is($stdout, $nb, "Should have $nb error");
+
+	$TOTAL_NB_ERR += $nb;
+}
+
+sub check_pg_stat_database_nb_error
+{
+	my ($node) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT "
+		. " sum(checksum_failures)"
+		. " FROM pg_catalog.pg_stat_database"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	is($stderr, '', 'Function should run successfully');
+	is($stdout, $TOTAL_NB_ERR, "Should have $TOTAL_NB_ERR error");
+}
+
+sub get_checksums_errors
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+		. " relid::regclass::text, forknum, failed_blocknum,"
+		. " expected_checksum, found_checksum"
+		. " FROM (SELECT (pg_catalog.pg_check_relation(oid)).*"
+        . "   FROM pg_class WHERE relkind in ('r','i', 'm')) AS s"
+	);
+
+	 is($cmdret, '0', 'Function should run successfully');
+	 like($stderr, $pattern, 'Error output should match expectations');
+
+	 $TOTAL_NB_ERR += $nb;
+
+	 return $stdout;
+}
+
+# This function will perform various tests by modifying the specified block at
+# the specified uint16 offset, checking that the corruption is correctly
+# detected, and finally restore the specified block to its original content.
+sub corrupt_and_test_block
+{
+	my ($node, $filename, $blkno, $offset, $fake_data) = @_;
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+
+	check_pg_stat_database_nb_error($node);
+
+	$node->stop();
+
+	my $original_block = get_block($filename, $blkno);
+	my $original_data = get_uint16_from_page($original_block, $offset);
+
+	isnt($original_data, $fake_data,
+		"The fake data at offset $offset should be different"
+		. " from the existing one");
+
+	my $new_block = set_uint16_to_page($original_block, $fake_data, $offset);
+	isnt($original_data, get_uint16_from_page($new_block, $offset),
+		"The fake data at offset $offset should have been changed in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+
+	my $written_data = get_uint16_from_page(get_block($filename, $blkno), $offset);
+	isnt($original_data, $written_data,
+		"The data written at offset $offset should be different"
+		. " from the original one");
+	is(get_uint16_from_page($new_block, $offset), $written_data,
+		"The data written at offset $offset should be the same"
+		. " as the one in memory");
+	is($written_data, $fake_data,
+		"The data written at offset $offset should be the one"
+		. " we wanted to write");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 1, qr/invalid page in block $blkno/);
+
+	my $expected_checksum;
+	my $found_checksum = get_uint16_from_page($new_block,
+		$CHECKSUM_UINT16_OFFSET);
+	if ($offset == $PD_UPPER_UINT16_OFFSET)
+	{
+		# A checksum can't be computed if it's detected as PageIsNew(), so the
+		# function returns NULL for the computed checksum
+		$expected_checksum = '';
+	}
+	else
+	{
+		$expected_checksum = get_uint16_from_page($original_block,
+			$CHECKSUM_UINT16_OFFSET);
+	}
+
+	my $det = get_checksums_errors($node, 1, qr/invalid page in block $blkno/);
+	is($det, "t1|0|0|$expected_checksum|$found_checksum",
+		"The checksums error for modification at offset $offset"
+		. " should be detected");
+
+	$node->stop();
+
+	$new_block = set_uint16_to_page($original_block, $original_data, $offset);
+	is($original_data, get_uint16_from_page($new_block, $offset),
+		"The data at offset $offset should have been restored in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+	is($original_data, get_uint16_from_page(get_block($filename, $blkno),
+			$offset),
+		"The data at offset $offset should have been restored on disk");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+}
+
+if (exists $ENV{MY_PG_REGRESS})
+{
+	$ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS};
+}
+
+my $node = get_new_node('main');
+
+my %params;
+$params{'extra'} = ['--data-checksums'];
+$node->init(%params);
+
+$node->start();
+
+$ENV{PGOPTIONS} = '--client-min-messages=WARNING';
+
+my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_checksums')");
+
+is($stdout, 'on', 'Data checksums should be enabled');
+
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('block_size')");
+
+$BLOCKSIZE = $stdout;
+
+$node->safe_psql(
+	'postgres', q|
+	CREATE TABLE public.t1(id integer);
+	CREATE INDEX t1_id_idx ON public.t1 (id);
+	INSERT INTO public.t1 SELECT generate_series(1, 100);
+	CREATE VIEW public.v1 AS SELECT * FROM t1;
+	CREATE MATERIALIZED VIEW public.mv1 AS SELECT * FROM t1;
+	CREATE SEQUENCE public.s1;
+	CREATE UNLOGGED TABLE public.u_t1(id integer);
+	CREATE INDEX u_t1_id_idx ON public.u_t1 (id);
+	INSERT INTO public.u_t1 SELECT generate_series(1, 100);
+	CHECKPOINT;
+|);
+
+# Check sane behavior on various objects type, including those that don't have
+# a storage.
+is(check_checksums_call($node, 't1'), '1', 'Can check a table');
+is(check_checksums_call($node, 't1_id_idx'), '1', 'Can check an index');
+is(check_checksums_call($node, 'v1'), '', 'Cannot check a view');
+is(check_checksums_call($node, 'mv1'), '1', 'Can check a materialized view');
+is(check_checksums_call($node, 's1'), '1', 'Can check a sequence');
+is(check_checksums_call($node, 'u_t1'), '1', 'Can check an unlogged table');
+is(check_checksums_call($node, 'u_t1_id_idx'), '1', 'Can check an unlogged index');
+
+# get the underlying heap absolute path
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_directory') || '/' || pg_relation_filepath('t1')"
+);
+
+isnt($stdout, '', 'A relfilenode should be returned');
+
+my $filename = $stdout;
+
+check_checksums_nb_error($node, 0, qr/^$/);
+
+check_pg_stat_database_nb_error($node);
+
+my $fake_uint16 = hex '0x0000';
+
+# Test with a modified checksum.  We use a zero checksum here as it's the only
+# one that cannot exist on a checksummed page.  We also don't have an easy way
+# to compute what the checksum would be after a modification in a random place
+# in the block.
+corrupt_and_test_block($node, $filename, 0, $CHECKSUM_UINT16_OFFSET,
+	$fake_uint16);
+
+# Test corruption making the block look like it's PageIsNew().
+corrupt_and_test_block($node, $filename, 0, $PD_UPPER_UINT16_OFFSET,
+	$fake_uint16);
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 90594bd41b..adbabba3c4 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -50,7 +50,8 @@ my @contrib_excludes = (
 	'pgcrypto',         'sepgsql',
 	'brin',             'test_extensions',
 	'test_misc',        'test_pg_dump',
-	'snapshot_too_old', 'unsafe_tests');
+	'snapshot_too_old', 'unsafe_tests',
+	'check_relation');
 
 # Set of variables for frontend modules
 my $frontend_defines = { 'initdb' => 'FRONTEND' };
-- 
2.20.1

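[Editor's aside, not part of the thread: the byte-level corruption helpers used by the TAP test above (get_block, set_uint16_to_page, overwrite_block) can be sketched as a standalone illustration. This is a hedged Python sketch, not the patch's Perl code; the BLCKSZ value and the uint16-indexed offset convention are assumptions mirroring the test's conventions.]

```python
import struct

BLCKSZ = 8192  # assumed default PostgreSQL block size

def get_block(path, blkno):
    """Read one BLCKSZ-sized block from a relation file."""
    with open(path, 'rb') as f:
        f.seek(blkno * BLCKSZ)
        return bytearray(f.read(BLCKSZ))

def get_uint16_from_page(block, offset):
    """Fetch the uint16 at the given uint16-indexed offset, native endianness."""
    return struct.unpack_from('=H', block, 2 * offset)[0]

def set_uint16_to_page(block, value, offset):
    """Overwrite the uint16 at the given uint16-indexed offset in memory."""
    struct.pack_into('=H', block, 2 * offset, value)
    return block

def overwrite_block(path, block, blkno):
    """Write the (possibly corrupted) block back at its original position."""
    with open(path, 'r+b') as f:
        f.seek(blkno * BLCKSZ)
        f.write(block)
```

The TAP test composes these the same way: read the block, plant a fake uint16 (for instance a zero checksum, which can never occur on a valid checksummed page), write it back while the server is stopped, then verify the corruption is reported.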
v15-0001-Add-backend-infrastructure-to-check-the-validity.patch (application/octet-stream)
From c926abcb1171cc99fec36e189d46b27afda3cc92 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <jrouhaud@vmware.com>
Date: Mon, 4 Nov 2019 08:40:23 +0100
Subject: [PATCH v15 1/2] Add backend infrastructure to check the validity of
 an on-disk block.

A new CheckBuffer function is introduced.  It takes care of the various locking
aspects to make sure that no false positive can be returned.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada, Justin Pryzby
Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 216 ++++++++++++++++++++++++++++
 src/include/storage/bufmgr.h        |   8 ++
 2 files changed, 224 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e549fa1d30..e81e899594 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -45,7 +45,9 @@
 #include "postmaster/bgwriter.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "storage/checksum.h"
 #include "storage/ipc.h"
+#include "storage/lmgr.h"
 #include "storage/proc.h"
 #include "storage/smgr.h"
 #include "storage/standby.h"
@@ -480,6 +482,42 @@ static int	buffertag_comparator(const void *p1, const void *p2);
 static int	ckpt_buforder_comparator(const void *pa, const void *pb);
 static int	ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
 
+/* ----------------
+ * The rest of this module provides a set of functions that can be used to
+ * safely check all checksums on a running cluster.
+ *
+ * Please note that those only perform standard buffered reads, and don't try
+ * to bypass or discard the operating system cache.  If you want to check the
+ * actual storage, you have to discard the operating system cache before
+ * running those functions.
+ *
+ * To avoid torn pages and possible false positives when reading data, the
+ * following heuristics are used:
+ *
+ * - a shared LWLock is taken on the target buffer pool partition mapping, and
+ *   we detect if a block is in shared_buffers or not.  See check_get_buffer()
+ *   comments for more details about the locking strategy.
+ *
+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
+ *
+ * - if a block is otherwise found in shared_buffers, an IO lock is taken on
+ *   the block and the block is then read from storage, ignoring the block in
+ *   shared_buffers
+ *
+ * - if a block is not found in shared_buffers, the block is read from disk
+ *   while holding the buffer pool partition mapping LWLock.
+ *
+ * The check can be performed using an SQL function, returning the list of
+ * problematic blocks.
+ * ----------------
+ */
+static bool check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+							   uint16 *chk_found);
+static bool check_get_buffer(Relation relation, ForkNumber forknum,
+							 BlockNumber blkno, char *buffer);
+
 
 /*
  * Implementation of PrefetchBuffer() for shared buffers.
@@ -4583,3 +4621,181 @@ TestForOldSnapshot_impl(Snapshot snapshot, Relation relation)
 				(errcode(ERRCODE_SNAPSHOT_TOO_OLD),
 				 errmsg("snapshot too old")));
 }
+
+/*
+ * Perform a checksum check on the passed page.  Return true if the page is
+ * valid, false otherwise, and assign the expected and found checksums to
+ * chk_expected and chk_found, respectively.  Note that a page can look new
+ * but actually be the result of corruption.  We still check for this case,
+ * but we can't compute its checksum as pg_checksum_page() only accepts
+ * non-new pages, so NoComputedChecksum will be set in chk_expected.
+ */
+static bool
+check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+				   uint16 *chk_found)
+{
+	Page		page = (Page) buffer;
+	PageHeader	hdr = (PageHeader) page;
+
+	Assert(chk_expected && chk_found);
+
+	if (PageIsNew(page))
+	{
+		/*
+		 * Check if the page is really new or if there's corruption that
+		 * affected PageIsNew detection.  Note that PageIsVerified won't try to
+		 * detect checksum corruption in this case, so there's no risk of
+		 * duplicated corruption report.
+		 */
+		if (PageIsVerified(page, blkno))
+		{
+			/* No corruption. */
+			return true;
+		}
+
+		/*
+		 * There's corruption, but since this affects PageIsNew, we
+		 * can't compute a checksum, so set NoComputedChecksum for the
+		 * expected checksum.
+		 */
+		*chk_expected = NoComputedChecksum;
+		*chk_found = hdr->pd_checksum;
+		return false;
+	}
+
+	*chk_expected = pg_checksum_page(buffer, blkno);
+	*chk_found = hdr->pd_checksum;
+
+	return (*chk_expected == *chk_found);
+}
+
+/*
+ *-------------------------
+ * Safely read the wanted buffer from disk, dealing with possible concurrency
+ * issues.  Note that if a buffer is found dirty in shared_buffers, no read will
+ * be performed and the caller will be informed that no check should be done.
+ * We can safely ignore such buffers as they'll be written before next
+ * checkpoint's completion.
+ *
+ * Note that the block is read into private memory, not into shared_buffers.
+ *
+ * The following locks can be used in this function:
+ *
+ *   - shared LWLock on the target buffer pool partition mapping.
+ *   - IOLock on the buffer
+ *
+ * The IOLock is taken when reading the buffer from disk if it exists in
+ * shared_buffers, to avoid torn pages.
+ *
+ * If the buffer isn't in shared_buffers, it'll be read while the buffer
+ * mapping partition LWLock is still being held.  Reading with this lock is to
+ * avoid the unlikely but possible case that a buffer wasn't present in shared
+ * buffers when we checked but it was then alloc'ed in shared_buffers, modified
+ * and flushed concurrently when we later try to read it, leading to false
+ * positives due to a torn page.
+ *
+ * Caller must hold at least an AccessShareLock on the relation.
+ *-------------------------
+ */
+static bool
+check_get_buffer(Relation relation, ForkNumber forknum,
+				 BlockNumber blkno, char *buffer)
+{
+	bool		checkit = true;
+	BufferTag	buf_tag;		/* identity of requested block */
+	uint32		buf_hash;		/* hash value for buf_tag */
+	LWLock	   *partLock;		/* buffer partition lock for the buffer */
+	BufferDesc *bufdesc;
+	int			buf_id;
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(buf_tag, relation->rd_smgr->smgr_rnode.node, forknum, blkno);
+
+	/* determine its hash code and partition lock ID */
+	buf_hash = BufTableHashCode(&buf_tag);
+	partLock = BufMappingPartitionLock(buf_hash);
+
+	/* see if the block is in the buffer pool already */
+	LWLockAcquire(partLock, LW_SHARED);
+	buf_id = BufTableLookup(&buf_tag, buf_hash);
+	if (buf_id >= 0)
+	{
+		uint32		buf_state;
+
+		/*
+		 * Found it.  Now, retrieve its state to know what to do with it, and
+		 * release the buffer header spinlock immediately.  We do so to limit
+		 * overhead as much as possible.  We'll keep the shared lightweight lock
+		 * on the target buffer mapping partition, so this buffer can't be
+		 * evicted, and we'll acquire an IOLock on the buffer if we need to
+		 * read the content on disk.
+		 */
+		bufdesc = GetBufferDescriptor(buf_id);
+
+		buf_state = LockBufHdr(bufdesc);
+		UnlockBufHdr(bufdesc, buf_state);
+
+		/*
+		 * Dirty pages are ignored as they'll be flushed soon. Invalid buffers
+		 * are also skipped.
+		 */
+		if ((buf_state & BM_DIRTY) || !(buf_state & BM_TAG_VALID))
+			checkit = false;
+
+		/*
+		 * Read the buffer from disk, taking an IO lock to prevent torn-page
+		 * reads, in the unlikely event that it was concurrently dirtied and
+		 * flushed.
+		 */
+		if (checkit)
+		{
+			LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
+			smgrread(relation->rd_smgr, forknum, blkno, buffer);
+			LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
+		}
+	}
+	else
+	{
+		/*
+		 * Simply read the buffer.  There's no risk of modification on it as we
+		 * kept the buffer pool partition mapping lock.
+		 */
+		smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+		/* The buffer will have to be checked. */
+		Assert(checkit);
+	}
+
+	LWLockRelease(partLock);
+
+	return checkit;
+}
+
+/*
+ * Check data sanity for a specific block in the given fork of the given
+ * relation, always retrieved locally with smgrread even if a version exists in
+ * shared_buffers.  Returns false if the block appears to be corrupted, true
+ * otherwise.  Note that dirty and invalid blocks won't be checked.  Caller
+ * must hold at least an AccessShareLock on the relation.
+ */
+bool
+CheckBuffer(Relation relation, ForkNumber forknum, BlockNumber blkno,
+				uint16 *chk_expected, uint16 *chk_found)
+{
+	char		buffer[BLCKSZ];
+
+	Assert(CheckRelationLockedByMe(relation, AccessShareLock, true));
+	Assert(blkno < RelationGetNumberOfBlocksInFork(relation, forknum));
+	Assert(smgrexists(relation->rd_smgr, forknum));
+
+	*chk_expected = *chk_found = NoComputedChecksum;
+
+	if (!check_get_buffer(relation, forknum, blkno, buffer))
+		return true;
+
+	if (check_buffer(buffer, blkno, chk_expected, chk_found))
+		return true;
+
+	/* A corruption is detected. */
+	return false;
+}
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ee91b8fa26..24aa102175 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -244,6 +244,14 @@ extern void TestForOldSnapshot_impl(Snapshot snapshot, Relation relation);
 extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype);
 extern void FreeAccessStrategy(BufferAccessStrategy strategy);
 
+/* A zero checksum can never be computed, see pg_checksum_page(). */
+
+#define NoComputedChecksum	0
+
+extern bool CheckBuffer(Relation relation, ForkNumber forknum,
+						BlockNumber blkno, uint16 *chk_expected,
+						uint16 *chk_found);
+
 
 /* inline functions */
 
-- 
2.20.1

#52Julien Rouhaud
rjuju123@gmail.com
In reply to: Julien Rouhaud (#51)
2 attachment(s)
Re: Online checksums verification in the backend

On Thu, Oct 15, 2020 at 3:59 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Thu, Oct 15, 2020 at 1:37 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Thu, Oct 1, 2020 at 1:07 PM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Sep 25, 2020 at 06:11:47PM +0800, Julien Rouhaud wrote:

Thanks a lot for the tests! I'm not surprised that forcing the lock
will slow down the pg_check_relation() execution, but I'm a bit
surprised that holding the buffer mapping lock longer in a workload
that has a lot of evictions actually makes things faster. Do you have
any idea why that's the case?

That's still a bit unclear to me, but I have not spent much time
thinking about this particular point either.

I'm assuming that you prefer to remove both the optimization and the
throttling part? I'll do that with the next version unless there's
objections.

Yeah, any tests I have done tend to show that. It would be good to
also check some perf profiles here, at least for the process running
the relation check in a loop.

I agree that putting the code nearby ReadBuffer_common() would be a
good idea. However, that means that I can't move all the code to
contrib/ I'm wondering what you'd like to see going there. I can see
some values in also having the SQL functions available in core rather
than contrib, e.g. if you need to quickly check a relation on a
standby, so without requiring to create the extension on the primary
node first.

Good point. This could make the user experience worse.

Then, I'm a bit worried about adding this code in ReadBuffer_common.
What this code does is quite different, and I'm afraid that it'll make
ReadBuffer_common more complex than needed, which is maybe not a good
idea for something as critical as this function.

What do you think?

Yeah, I have been looking at ReadBuffer_common() and it is true that
it is complicated enough so we may not really need an extra mode or
more options, for a final logic that is actually different than what a
buffer read does: we just want to know if a page has a valid checksum
or not. An idea that I got here would be to add a new, separate
function to do the page check directly in bufmgr.c, but that's what
you mean. Now only the prefetch routine and ReadBuffer_common use
partition locks, but getting that done in the same file looks like a
good compromise to me. It would be also possible to keep the BLCKSZ
buffer used to check the page directly in this routine, so as any
caller willing to do a check doesn't need to worry about any
allocation.

I made all the suggested modifications in attached v14:

- moved the C code in bufmgr.c nearby ReadBuffer
- removed the GUC and throttling options
- removed the dubious optimization

All documentation and comments are updated to reflect those changes.
I also split the commit in two, one for the backend infrastructure and
one for the SQL wrappers.

And I did miss a reference in the sgml documentation, fixed in v15.

I forgot to add the modified file in the previous attachment, sorry
for the noise.
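[Editor's aside, not part of the thread: the skip-dirty heuristic the patch comments describe can be made concrete with a small illustrative model of the per-block decision in CheckBuffer. This is a hedged Python sketch, not the patch's C code; names such as shared_buffers, read_from_disk, and compute_checksum are stand-ins, not PostgreSQL APIs.]

```python
def check_block(blkno, shared_buffers, read_from_disk, compute_checksum):
    """Model of the CheckBuffer() decision: (checked, ok, expected, found).

    shared_buffers: dict blkno -> {'dirty': bool}, emulating the buffer pool
    read_from_disk: callable blkno -> (payload, stored_checksum)
    compute_checksum: callable (payload, blkno) -> checksum
    """
    entry = shared_buffers.get(blkno)
    if entry is not None and entry['dirty']:
        # A dirty buffer will be flushed before the next checkpoint
        # completes, so any on-disk mismatch is transient: skip the check.
        return (False, True, None, None)

    # Clean in the pool or uncached: the on-disk version is authoritative
    # (the real code holds the mapping partition lock / IO lock here),
    # so read it and compare stored vs computed checksums.
    payload, stored = read_from_disk(blkno)
    expected = compute_checksum(payload, blkno)
    return (True, expected == stored, expected, stored)
```

The real implementation differs in the crucial part this sketch elides: the buffer mapping partition LWLock and the buffer IO lock, which prevent a concurrent eviction-plus-flush from producing a torn read and a false positive.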

Attachments:

v16-0001-Add-backend-infrastructure-to-check-the-validity.patch (application/octet-stream)
From 2f2b6b3006be047f730530b0fdf4d219bcc77222 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <jrouhaud@vmware.com>
Date: Mon, 4 Nov 2019 08:40:23 +0100
Subject: [PATCH v16 1/2] Add backend infrastructure to check the validity of
 an on-disk block.

A new CheckBuffer function is introduced.  It takes care of the various locking
aspects to make sure that no false positive can be returned.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada, Justin Pryzby
Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 216 ++++++++++++++++++++++++++++
 src/include/storage/bufmgr.h        |   8 ++
 2 files changed, 224 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e549fa1d30..e81e899594 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -45,7 +45,9 @@
 #include "postmaster/bgwriter.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "storage/checksum.h"
 #include "storage/ipc.h"
+#include "storage/lmgr.h"
 #include "storage/proc.h"
 #include "storage/smgr.h"
 #include "storage/standby.h"
@@ -480,6 +482,42 @@ static int	buffertag_comparator(const void *p1, const void *p2);
 static int	ckpt_buforder_comparator(const void *pa, const void *pb);
 static int	ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
 
+/* ----------------
+ * The rest of this module provides a set of functions that can be used to
+ * safely check all checksums on a running cluster.
+ *
+ * Please note that those only perform standard buffered reads, and don't try
+ * to bypass or discard the operating system cache.  If you want to check the
+ * actual storage, you have to discard the operating system cache before
+ * running those functions.
+ *
+ * To avoid torn pages and possible false positives when reading data, the
+ * following heuristics are used:
+ *
+ * - a shared LWLock is taken on the target buffer pool partition mapping, and
+ *   we detect if a block is in shared_buffers or not.  See check_get_buffer()
+ *   comments for more details about the locking strategy.
+ *
+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
+ *
+ * - if a block is otherwise found in shared_buffers, an IO lock is taken on
+ *   the block and the block is then read from storage, ignoring the block in
+ *   shared_buffers
+ *
+ * - if a block is not found in shared_buffers, the block is read from disk
+ *   while holding the buffer pool partition mapping LWLock.
+ *
+ * The check can be performed using an SQL function, returning the list of
+ * problematic blocks.
+ * ----------------
+ */
+static bool check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+							   uint16 *chk_found);
+static bool check_get_buffer(Relation relation, ForkNumber forknum,
+							 BlockNumber blkno, char *buffer);
+
 
 /*
  * Implementation of PrefetchBuffer() for shared buffers.
@@ -4583,3 +4621,181 @@ TestForOldSnapshot_impl(Snapshot snapshot, Relation relation)
 				(errcode(ERRCODE_SNAPSHOT_TOO_OLD),
 				 errmsg("snapshot too old")));
 }
+
+/*
+ * Perform a checksum check on the passed page.  Return true if the page is
+ * valid, false otherwise, and assign the expected and found checksums to
+ * chk_expected and chk_found, respectively.  Note that a page can look new
+ * but actually be the result of corruption.  We still check for this case,
+ * but we can't compute its checksum as pg_checksum_page() only accepts
+ * non-new pages, so NoComputedChecksum will be set in chk_expected.
+ */
+static bool
+check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+				   uint16 *chk_found)
+{
+	Page		page = (Page) buffer;
+	PageHeader	hdr = (PageHeader) page;
+
+	Assert(chk_expected && chk_found);
+
+	if (PageIsNew(page))
+	{
+		/*
+		 * Check if the page is really new or if there's corruption that
+		 * affected PageIsNew detection.  Note that PageIsVerified won't try to
+		 * detect checksum corruption in this case, so there's no risk of
+		 * duplicated corruption report.
+		 */
+		if (PageIsVerified(page, blkno))
+		{
+			/* No corruption. */
+			return true;
+		}
+
+		/*
+		 * There's corruption, but since this affects PageIsNew, we
+		 * can't compute a checksum, so set NoComputedChecksum for the
+		 * expected checksum.
+		 */
+		*chk_expected = NoComputedChecksum;
+		*chk_found = hdr->pd_checksum;
+		return false;
+	}
+
+	*chk_expected = pg_checksum_page(buffer, blkno);
+	*chk_found = hdr->pd_checksum;
+
+	return (*chk_expected == *chk_found);
+}
+
+/*
+ *-------------------------
+ * Safely read the wanted buffer from disk, dealing with possible concurrency
+ * issues.  Note that if a buffer is found dirty in shared_buffers, no read will
+ * be performed and the caller will be informed that no check should be done.
+ * We can safely ignore such buffers as they'll be written before next
+ * checkpoint's completion.
+ *
+ * Note that the block is read into private memory, not into shared_buffers.
+ *
+ * The following locks can be used in this function:
+ *
+ *   - shared LWLock on the target buffer pool partition mapping.
+ *   - IOLock on the buffer
+ *
+ * The IOLock is taken when reading the buffer from disk if it exists in
+ * shared_buffers, to avoid torn pages.
+ *
+ * If the buffer isn't in shared_buffers, it'll be read while the buffer
+ * mapping partition LWLock is still being held.  Reading with this lock is to
+ * avoid the unlikely but possible case that a buffer wasn't present in shared
+ * buffers when we checked but it was then alloc'ed in shared_buffers, modified
+ * and flushed concurrently when we later try to read it, leading to false
+ * positives due to a torn page.
+ *
+ * Caller must hold at least an AccessShareLock on the relation.
+ *-------------------------
+ */
+static bool
+check_get_buffer(Relation relation, ForkNumber forknum,
+				 BlockNumber blkno, char *buffer)
+{
+	bool		checkit = true;
+	BufferTag	buf_tag;		/* identity of requested block */
+	uint32		buf_hash;		/* hash value for buf_tag */
+	LWLock	   *partLock;		/* buffer partition lock for the buffer */
+	BufferDesc *bufdesc;
+	int			buf_id;
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(buf_tag, relation->rd_smgr->smgr_rnode.node, forknum, blkno);
+
+	/* determine its hash code and partition lock ID */
+	buf_hash = BufTableHashCode(&buf_tag);
+	partLock = BufMappingPartitionLock(buf_hash);
+
+	/* see if the block is in the buffer pool already */
+	LWLockAcquire(partLock, LW_SHARED);
+	buf_id = BufTableLookup(&buf_tag, buf_hash);
+	if (buf_id >= 0)
+	{
+		uint32		buf_state;
+
+		/*
+		 * Found it.  Now, retrieve its state to know what to do with it, and
+		 * release the buffer header spinlock immediately.  We do so to limit
+		 * overhead as much as possible.  We'll keep the shared lightweight lock
+		 * on the target buffer mapping partition, so this buffer can't be
+		 * evicted, and we'll acquire an IOLock on the buffer if we need to
+		 * read the content on disk.
+		 */
+		bufdesc = GetBufferDescriptor(buf_id);
+
+		buf_state = LockBufHdr(bufdesc);
+		UnlockBufHdr(bufdesc, buf_state);
+
+		/*
+		 * Dirty pages are ignored as they'll be flushed soon. Invalid buffers
+		 * are also skipped.
+		 */
+		if ((buf_state & BM_DIRTY) || !(buf_state & BM_TAG_VALID))
+			checkit = false;
+
+		/*
+		 * Read the buffer from disk, taking an IO lock to prevent torn-page
+		 * reads, in the unlikely event that it was concurrently dirtied and
+		 * flushed.
+		 */
+		if (checkit)
+		{
+			LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
+			smgrread(relation->rd_smgr, forknum, blkno, buffer);
+			LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
+		}
+	}
+	else
+	{
+		/*
+		 * Simply read the buffer.  There's no risk of modification on it as we
+		 * kept the buffer pool partition mapping lock.
+		 */
+		smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+		/* The buffer will have to be checked. */
+		Assert(checkit);
+	}
+
+	LWLockRelease(partLock);
+
+	return checkit;
+}
+
+/*
+ * Check data sanity for a specific block in the given fork of the given
+ * relation, always retrieved locally with smgrread even if a version exists in
+ * shared_buffers.  Returns false if the block appears to be corrupted, true
+ * otherwise.  Note that dirty and invalid blocks won't be checked.  Caller
+ * must hold at least an AccessShareLock on the relation.
+ */
+bool
+CheckBuffer(Relation relation, ForkNumber forknum, BlockNumber blkno,
+				uint16 *chk_expected, uint16 *chk_found)
+{
+	char		buffer[BLCKSZ];
+
+	Assert(CheckRelationLockedByMe(relation, AccessShareLock, true));
+	Assert(blkno < RelationGetNumberOfBlocksInFork(relation, forknum));
+	Assert(smgrexists(relation->rd_smgr, forknum));
+
+	*chk_expected = *chk_found = NoComputedChecksum;
+
+	if (!check_get_buffer(relation, forknum, blkno, buffer))
+		return true;
+
+	if (check_buffer(buffer, blkno, chk_expected, chk_found))
+		return true;
+
+	/* A corruption is detected. */
+	return false;
+}
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ee91b8fa26..24aa102175 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -244,6 +244,14 @@ extern void TestForOldSnapshot_impl(Snapshot snapshot, Relation relation);
 extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype);
 extern void FreeAccessStrategy(BufferAccessStrategy strategy);
 
+/* A zero checksum can never be computed, see pg_checksum_page(). */
+
+#define NoComputedChecksum	0
+
+extern bool CheckBuffer(Relation relation, ForkNumber forknum,
+						BlockNumber blkno, uint16 *chk_expected,
+						uint16 *chk_found);
+
 
 /* inline functions */
 
-- 
2.20.1

v16-0002-Add-a-pg_check_relation-SQL-function.patch (application/octet-stream)
From ff76c3d51d60e334ef8d4eeddf8c4aceecb28413 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <julien.rouhaud@free.fr>
Date: Thu, 15 Oct 2020 13:20:17 +0800
Subject: [PATCH v16 2/2] Add a pg_check_relation() SQL function

This function checks the validity of the checksums for all non-dirty blocks of
a given relation, and optionally a given fork, and returns the list of all
blocks that don't match, along with the expected and found checksums.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada, Justin Pryzby
Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 doc/src/sgml/func.sgml                        |  49 ++++
 src/backend/utils/adt/Makefile                |   1 +
 src/include/catalog/pg_proc.dat               |  16 +
 src/test/modules/Makefile                     |   1 +
 src/test/modules/check_relation/.gitignore    |   2 +
 src/test/modules/check_relation/Makefile      |  14 +
 src/test/modules/check_relation/README        |  23 ++
 .../check_relation/t/001_checksums_check.pl   | 276 ++++++++++++++++++
 src/tools/msvc/Mkvcbuild.pm                   |   3 +-
 9 files changed, 384 insertions(+), 1 deletion(-)
 create mode 100644 src/test/modules/check_relation/.gitignore
 create mode 100644 src/test/modules/check_relation/Makefile
 create mode 100644 src/test/modules/check_relation/README
 create mode 100644 src/test/modules/check_relation/t/001_checksums_check.pl

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index e7cff980dd..a800ef33f6 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26217,6 +26217,55 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
 
   </sect2>
 
+  <sect2 id="functions-data-sanity">
+   <title>Data Sanity Functions</title>
+
+   <para>
+    The functions shown in <xref linkend="functions-data-sanity-table"/>
+    provide a means to check the health of data files in a cluster.
+   </para>
+
+   <table id="functions-data-sanity-table">
+    <title>Data Sanity Functions</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry>
+        <literal><function>pg_check_relation(<parameter>relation</parameter> <type>regclass</type> [, <parameter>fork</parameter> <type>text</type>])</function></literal>
+       </entry>
+       <entry><type>setof record</type></entry>
+       <entry>Validate the checksum for all blocks of a relation.
+       </entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <indexterm>
+    <primary>pg_check_relation</primary>
+   </indexterm>
+   <para id="functions-check-relation-note" xreflabel="pg_check_relation">
+    <function>pg_check_relation</function> iterates over all blocks of a
+    given relation and verifies their checksums.  If passed,
+    <replaceable>fork</replaceable> specifies that only checksums of the given
+    fork are to be verified.  Fork should be <literal>'main'</literal> for the
+    main data fork, <literal>'fsm'</literal> for the free space map,
+    <literal>'vm'</literal> for the visibility map, or
+    <literal>'init'</literal> for the initialization fork.
+    The function returns a list of blocks for which the computed and stored
+    checksums don't match.  You must be a member of the
+    <literal>pg_stat_scan_tables</literal> role to use this function.  It can
+    only be used if data checksums are enabled.  See <xref
+    linkend="app-initdb-data-checksums"/> for more information.
+   </para>
+
+  </sect2>
+
   </sect1>
 
   <sect1 id="functions-trigger">
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index b4d55e849b..603f63afb6 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -22,6 +22,7 @@ OBJS = \
 	bool.o \
 	cash.o \
 	char.o \
+	checksumfuncs.o \
 	cryptohashes.o \
 	date.o \
 	datetime.o \
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 22340baf1c..9f4514d60f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10958,6 +10958,22 @@
   proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
   proargnames => '{tablespace,name,size,modification}',
   prosrc => 'pg_ls_tmpdir_1arg' },
+{ oid => '9147', descr => 'check data integrity for all forks of a relation',
+  proname => 'pg_check_relation', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass',
+  proallargtypes => '{regclass,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,o,o,o,o,o}',
+  proargnames => '{relation,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation' },
+{ oid => '9148', descr => 'check data integrity for one fork of a relation',
+  proname => 'pg_check_relation', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,i,o,o,o,o,o}',
+  proargnames => '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation_fork' },
 
 # hash partitioning constraint function
 { oid => '5028', descr => 'hash partition CHECK constraint',
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index a6d2ffbf9e..a845af71fd 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -7,6 +7,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = \
 		  brin \
 		  commit_ts \
+		  check_relation \
 		  delay_execution \
 		  dummy_index_am \
 		  dummy_seclabel \
diff --git a/src/test/modules/check_relation/.gitignore b/src/test/modules/check_relation/.gitignore
new file mode 100644
index 0000000000..871e943d50
--- /dev/null
+++ b/src/test/modules/check_relation/.gitignore
@@ -0,0 +1,2 @@
+# Generated by test suite
+/tmp_check/
diff --git a/src/test/modules/check_relation/Makefile b/src/test/modules/check_relation/Makefile
new file mode 100644
index 0000000000..a540cdece2
--- /dev/null
+++ b/src/test/modules/check_relation/Makefile
@@ -0,0 +1,14 @@
+# src/test/modules/check_relation/Makefile
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/check_relation
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/check_relation/README b/src/test/modules/check_relation/README
new file mode 100644
index 0000000000..415c4b21ad
--- /dev/null
+++ b/src/test/modules/check_relation/README
@@ -0,0 +1,23 @@
+src/test/check_relation/README
+
+Regression tests for online checksums verification
+==================================================
+
+This directory contains a test suite for online checksums verification.
+
+Running the tests
+=================
+
+NOTE: You must have given the --enable-tap-tests argument to configure.
+
+Run
+    make check
+or
+    make installcheck
+You can use "make installcheck" if you previously did "make install".
+In that case, the code in the installation tree is tested.  With
+"make check", a temporary installation tree is built from the current
+sources and then tested.
+
+Either way, this test initializes, starts, and stops a test Postgres
+cluster.
diff --git a/src/test/modules/check_relation/t/001_checksums_check.pl b/src/test/modules/check_relation/t/001_checksums_check.pl
new file mode 100644
index 0000000000..2a3f2880ea
--- /dev/null
+++ b/src/test/modules/check_relation/t/001_checksums_check.pl
@@ -0,0 +1,276 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 59;
+
+our $CHECKSUM_UINT16_OFFSET = 4;
+our $PD_UPPER_UINT16_OFFSET = 7;
+our $BLOCKSIZE;
+our $TOTAL_NB_ERR = 0;
+
+sub get_block
+{
+	my ($filename, $blkno) = @_;
+	my $block;
+
+	open(my $infile, '<', $filename) or die;
+	binmode($infile);
+	seek($infile, $blkno * $BLOCKSIZE, 0) or die($!);
+	my $success = read($infile, $block, $BLOCKSIZE);
+	die($!) if not defined $success;
+
+	close($infile);
+
+	return($block);
+}
+
+sub overwrite_block
+{
+	my ($filename, $block, $blkno) = @_;
+
+	open(my $outfile, '+<', $filename) or die;
+	binmode ($outfile);
+	sysseek($outfile, $blkno * $BLOCKSIZE, 0) // die($!);
+	my $nb = syswrite($outfile, $block, $BLOCKSIZE);
+
+	die($!) if not defined $nb;
+	die("Write error") if ($nb != $BLOCKSIZE);
+
+	$outfile->flush();
+
+	close($outfile);
+}
+
+sub get_uint16_from_page
+{
+	my ($block, $offset) = @_;
+
+	return (unpack("S*", $block))[$offset];
+}
+
+sub set_uint16_to_page
+{
+	my ($block, $data, $offset) = @_;
+
+	my $pack = pack("S", $data);
+
+	# vec() with elements of 16 bits or more won't preserve native endianness
+	vec($block, 2*$offset, 8) = (unpack('C*', $pack))[0];
+	vec($block, (2*$offset) + 1, 8) = (unpack('C*', $pack))[1];
+
+	return $block;
+}
+
+sub check_checksums_call
+{
+	my ($node, $relname) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM pg_catalog.pg_check_relation('$relname')"
+	);
+
+	return ($stderr eq '');
+}
+
+sub check_checksums_nb_error
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM (SELECT pg_catalog.pg_check_relation(oid, 'main')"
+		. "   FROM pg_class WHERE relkind in ('r', 'i', 'm')) AS s"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+	is($stdout, $nb, "Should have $nb error");
+
+	$TOTAL_NB_ERR += $nb;
+}
+
+sub check_pg_stat_database_nb_error
+{
+	my ($node) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT "
+		. " sum(checksum_failures)"
+		. " FROM pg_catalog.pg_stat_database"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	is($stderr, '', 'Function should run successfully');
+	is($stdout, $TOTAL_NB_ERR, "Should have $TOTAL_NB_ERR error");
+}
+
+sub get_checksums_errors
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+		. " relid::regclass::text, forknum, failed_blocknum,"
+		. " expected_checksum, found_checksum"
+		. " FROM (SELECT (pg_catalog.pg_check_relation(oid)).*"
+		. "   FROM pg_class WHERE relkind in ('r', 'i', 'm')) AS s"
+	);
+
+	is($cmdret, '0', 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+
+	$TOTAL_NB_ERR += $nb;
+
+	return $stdout;
+}
+
+# This function performs various tests by modifying the specified block at
+# the specified uint16 offset, checking that the corruption is correctly
+# detected, and finally restoring the block to its original content.
+sub corrupt_and_test_block
+{
+	my ($node, $filename, $blkno, $offset, $fake_data) = @_;
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+
+	check_pg_stat_database_nb_error($node);
+
+	$node->stop();
+
+	my $original_block = get_block($filename, $blkno);
+	my $original_data = get_uint16_from_page($original_block, $offset);
+
+	isnt($original_data, $fake_data,
+		"The fake data at offset $offset should be different"
+		. " from the existing one");
+
+	my $new_block = set_uint16_to_page($original_block, $fake_data, $offset);
+	isnt($original_data, get_uint16_from_page($new_block, $offset),
+		"The fake data at offset $offset should have been changed in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+
+	my $written_data = get_uint16_from_page(get_block($filename, $blkno), $offset);
+	isnt($original_data, $written_data,
+		"The data written at offset $offset should be different"
+		. " from the original one");
+	is(get_uint16_from_page($new_block, $offset), $written_data,
+		"The data written at offset $offset should be the same"
+		. " as the one in memory");
+	is($written_data, $fake_data,
+		"The data written at offset $offset should be the one"
+		. " we wanted to write");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 1, qr/invalid page in block $blkno/);
+
+	my $expected_checksum;
+	my $found_checksum = get_uint16_from_page($new_block,
+		$CHECKSUM_UINT16_OFFSET);
+	if ($offset == $PD_UPPER_UINT16_OFFSET)
+	{
+		# A checksum can't be computed if it's detected as PageIsNew(), so the
+		# function returns NULL for the computed checksum
+		$expected_checksum = '';
+	}
+	else
+	{
+		$expected_checksum = get_uint16_from_page($original_block,
+			$CHECKSUM_UINT16_OFFSET);
+	}
+
+	my $det = get_checksums_errors($node, 1, qr/invalid page in block $blkno/);
+	is($det, "t1|0|0|$expected_checksum|$found_checksum",
+		"The checksums error for modification at offset $offset"
+		. " should be detected");
+
+	$node->stop();
+
+	$new_block = set_uint16_to_page($original_block, $original_data, $offset);
+	is($original_data, get_uint16_from_page($new_block, $offset),
+		"The data at offset $offset should have been restored in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+	is($original_data, get_uint16_from_page(get_block($filename, $blkno),
+			$offset),
+		"The data at offset $offset should have been restored on disk");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+}
+
+if (exists $ENV{MY_PG_REGRESS})
+{
+	$ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS};
+}
+
+my $node = get_new_node('main');
+
+my %params;
+$params{'extra'} = ['--data-checksums'];
+$node->init(%params);
+
+$node->start();
+
+$ENV{PGOPTIONS} = '--client-min-messages=WARNING';
+
+my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_checksums')");
+
+is($stdout, 'on', 'Data checksums should be enabled');
+
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('block_size')");
+
+$BLOCKSIZE = $stdout;
+
+$node->safe_psql(
+	'postgres', q|
+	CREATE TABLE public.t1(id integer);
+	CREATE INDEX t1_id_idx ON public.t1 (id);
+	INSERT INTO public.t1 SELECT generate_series(1, 100);
+	CREATE VIEW public.v1 AS SELECT * FROM t1;
+	CREATE MATERIALIZED VIEW public.mv1 AS SELECT * FROM t1;
+	CREATE SEQUENCE public.s1;
+	CREATE UNLOGGED TABLE public.u_t1(id integer);
+	CREATE INDEX u_t1_id_idx ON public.u_t1 (id);
+	INSERT INTO public.u_t1 SELECT generate_series(1, 100);
+	CHECKPOINT;
+|);
+
+# Check sane behavior on various objects type, including those that don't have
+# a storage.
+is(check_checksums_call($node, 't1'), '1', 'Can check a table');
+is(check_checksums_call($node, 't1_id_idx'), '1', 'Can check an index');
+is(check_checksums_call($node, 'v1'), '', 'Cannot check a view');
+is(check_checksums_call($node, 'mv1'), '1', 'Can check a materialized view');
+is(check_checksums_call($node, 's1'), '1', 'Can check a sequence');
+is(check_checksums_call($node, 'u_t1'), '1', 'Can check an unlogged table');
+is(check_checksums_call($node, 'u_t1_id_idx'), '1', 'Can check an unlogged index');
+
+# get the underlying heap absolute path
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_directory') || '/' || pg_relation_filepath('t1')"
+);
+
+isnt($stdout, '', 'A relation file path should be returned');
+
+my $filename = $stdout;
+
+check_checksums_nb_error($node, 0, qr/^$/);
+
+check_pg_stat_database_nb_error($node);
+
+my $fake_uint16 = hex '0x0000';
+
+# Test with a modified checksum.  We use a zero checksum here as it's the only
+# one that cannot exist on a checksummed page.  We also don't have an easy way
+# to compute what the checksum would be after a modification in a random place
+# in the block.
+corrupt_and_test_block($node, $filename, 0, $CHECKSUM_UINT16_OFFSET,
+	$fake_uint16);
+
+# Test corruption making the block look like it's PageIsNew().
+corrupt_and_test_block($node, $filename, 0, $PD_UPPER_UINT16_OFFSET,
+	$fake_uint16);
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 90594bd41b..adbabba3c4 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -50,7 +50,8 @@ my @contrib_excludes = (
 	'pgcrypto',         'sepgsql',
 	'brin',             'test_extensions',
 	'test_misc',        'test_pg_dump',
-	'snapshot_too_old', 'unsafe_tests');
+	'snapshot_too_old', 'unsafe_tests',
+	'check_relation');
 
 # Set of variables for frontend modules
 my $frontend_defines = { 'initdb' => 'FRONTEND' };
-- 
2.20.1

#53Julien Rouhaud
rjuju123@gmail.com
In reply to: Julien Rouhaud (#52)
2 attachment(s)
Re: Online checksums verification in the backend

On Fri, Oct 16, 2020 at 8:59 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Thu, Oct 15, 2020 at 3:59 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Thu, Oct 15, 2020 at 1:37 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Thu, Oct 1, 2020 at 1:07 PM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Sep 25, 2020 at 06:11:47PM +0800, Julien Rouhaud wrote:

Thanks a lot for the tests! I'm not surprised that forcing the lock
will slow down the pg_check_relation() execution, but I'm a bit
surprised that holding the buffer mapping lock longer in a workload
that has a lot of evictions actually makes things faster. Do you have
any idea why that's the case?

That's still a bit unclear to me, but I have not spent much time
thinking about this particular point either.

I'm assuming that you prefer to remove both the optimization and the
throttling part? I'll do that with the next version unless there's
objections.

Yeah, any tests I have done tends to show that. It would be good to
also check some perf profiles here, at least for the process running
the relation check in a loop.

I agree that putting the code near ReadBuffer_common() would be a
good idea. However, that means that I can't move all the code to
contrib/, so I'm wondering what you'd like to see going there. I can
see some value in also having the SQL functions available in core
rather than contrib, e.g. if you need to quickly check a relation on a
standby, without requiring the extension to be created on the primary
node first.

Good point. This could make the user experience worse.

Then, I'm a bit worried about adding this code in ReadBuffer_common.
What this code does is quite different, and I'm afraid that it'll make
ReadBuffer_common more complex than needed, which is maybe not a good
idea for something as critical as this function.

What do you think?

Yeah, I have been looking at ReadBuffer_common() and it is true that
it is complicated enough that we may not really need an extra mode or
more options, for a final logic that is actually different from what a
buffer read does: we just want to know whether a page has a valid
checksum or not. An idea that I got here would be to add a new,
separate function to do the page check directly in bufmgr.c, but
perhaps that's what you mean. Currently only the prefetch routine and
ReadBuffer_common use partition locks, but getting that done in the
same file looks like a good compromise to me. It would also be
possible to keep the BLCKSZ buffer used to check the page directly in
this routine, so that any caller wanting to do a check doesn't need to
worry about the allocation.

I made all the suggested modifications in attached v14:

- moved the C code in bufmgr.c nearby ReadBuffer
- removed the GUC and throttling options
- removed the dubious optimization

All documentation and comments are updated to reflect those changes.
I also split the commit in two, one for the backend infrastructure and
one for the SQL wrappers.

And I did miss a reference in the sgml documentation, fixed in v15.

I forgot to add the modified file in the previous attachment, sorry
for the noise.

And Michael just told me that I also missed adding one of the C files
while splitting the patch into two.

Attachments:

v17-0001-Add-backend-infrastructure-to-check-the-validity.patchapplication/octet-stream; name=v17-0001-Add-backend-infrastructure-to-check-the-validity.patchDownload
From 2f2b6b3006be047f730530b0fdf4d219bcc77222 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <jrouhaud@vmware.com>
Date: Mon, 4 Nov 2019 08:40:23 +0100
Subject: [PATCH v17 1/2] Add backend infrastructure to check the validity of
 an on-disk block.

A new CheckBuffer function is introduced.  It takes care of the various locking
aspects to make sure that no false positive can be returned.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada, Justin Pryzby
Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 216 ++++++++++++++++++++++++++++
 src/include/storage/bufmgr.h        |   8 ++
 2 files changed, 224 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e549fa1d30..e81e899594 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -45,7 +45,9 @@
 #include "postmaster/bgwriter.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
+#include "storage/checksum.h"
 #include "storage/ipc.h"
+#include "storage/lmgr.h"
 #include "storage/proc.h"
 #include "storage/smgr.h"
 #include "storage/standby.h"
@@ -480,6 +482,42 @@ static int	buffertag_comparator(const void *p1, const void *p2);
 static int	ckpt_buforder_comparator(const void *pa, const void *pb);
 static int	ts_ckpt_progress_comparator(Datum a, Datum b, void *arg);
 
+/* ----------------
+ * The rest of this module provides a set of functions that can be used to
+ * safely check all checksums on a running cluster.
+ *
+ * Please note that those only perform standard buffered reads, and don't try
+ * to bypass or discard the operating system cache.  If you want to check the
+ * actual storage, you have to discard the operating system cache before
+ * running those functions.
+ *
+ * To avoid torn pages and possible false positives when reading data, the
+ * following heuristics are used:
+ *
+ * - a shared LWLock is taken on the target buffer pool partition mapping, and
+ *   we detect if a block is in shared_buffers or not.  See check_get_buffer()
+ *   comments for more details about the locking strategy.
+ *
+ * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to
+ *   disk either before the end of the next checkpoint or during recovery in
+ *   case of unsafe shutdown
+ *
+ * - if a block is otherwise found in shared_buffers, an IO lock is taken on
+ *   the block and the block is then read from storage, ignoring the block in
+ *   shared_buffers
+ *
+ * - if a block is not found in shared_buffers, the block is read from disk
+ *   while holding the buffer pool partition mapping LWLock.
+ *
+ * The check can be performed using an SQL function, returning the list of
+ * problematic blocks.
+ * ----------------
+ */
+static bool check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+							   uint16 *chk_found);
+static bool check_get_buffer(Relation relation, ForkNumber forknum,
+							 BlockNumber blkno, char *buffer);
+
 
 /*
  * Implementation of PrefetchBuffer() for shared buffers.
@@ -4583,3 +4621,181 @@ TestForOldSnapshot_impl(Snapshot snapshot, Relation relation)
 				(errcode(ERRCODE_SNAPSHOT_TOO_OLD),
 				 errmsg("snapshot too old")));
 }
+
+/*
+ * Perform a checksum check on the passed page.  Return true if the page is
+ * valid, and assign the expected and found checksums in chk_expected and
+ * chk_found, respectively.  Note that a page can look like new but could be
+ * the result of corruption.  We still check for this case, but we can't
+ * compute its checksum as pg_checksum_page() is explicitly checking for
+ * non-new pages, so NoComputedChecksum will be set in chk_expected.
+ */
+static bool
+check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
+				   uint16 *chk_found)
+{
+	Page		page = (Page) buffer;
+	PageHeader	hdr = (PageHeader) page;
+
+	Assert(chk_expected && chk_found);
+
+	if (PageIsNew(page))
+	{
+		/*
+		 * Check if the page is really new or if there's corruption that
+		 * affected PageIsNew detection.  Note that PageIsVerified won't try to
+		 * detect checksum corruption in this case, so there's no risk of
+		 * duplicated corruption report.
+		 */
+		if (PageIsVerified(page, blkno))
+		{
+			/* No corruption. */
+			return true;
+		}
+
+		/*
+		 * There's corruption, but since this affects PageIsNew, we
+		 * can't compute a checksum, so set NoComputedChecksum for the
+		 * expected checksum.
+		 */
+		*chk_expected = NoComputedChecksum;
+		*chk_found = hdr->pd_checksum;
+		return false;
+	}
+
+	*chk_expected = pg_checksum_page(buffer, blkno);
+	*chk_found = hdr->pd_checksum;
+
+	return (*chk_expected == *chk_found);
+}
+
+/*
+ *-------------------------
+ * Safely read the wanted buffer from disk, dealing with possible concurrency
+ * issue.  Note that if a buffer is found dirty in shared_buffers, no read will
+ * be performed and the caller will be informed that no check should be done.
+ * We can safely ignore such buffers as they'll be written before the next
+ * checkpoint's completion.
+ *
+ * Note that the block is read into the caller-provided private buffer.
+ *
+ * The following locks can be used in this function:
+ *
+ *   - shared LWLock on the target buffer pool partition mapping.
+ *   - IOLock on the buffer
+ *
+ * The IOLock is taken when reading the buffer from disk if it exists in
+ * shared_buffers, to avoid torn pages.
+ *
+ * If the buffer isn't in shared_buffers, it'll be read while the buffer
+ * mapping partition LWLock is still held.  Reading under this lock is to
+ * avoid the unlikely but possible case that a buffer wasn't present in shared
+ * buffers when we checked but it was then alloc'ed in shared_buffers, modified
+ * and flushed concurrently when we later try to read it, leading to false
+ * positives due to a torn page.
+ *
+ * Caller should hold an AccessShareLock on the Relation
+ *-------------------------
+ */
+static bool
+check_get_buffer(Relation relation, ForkNumber forknum,
+				 BlockNumber blkno, char *buffer)
+{
+	bool		checkit = true;
+	BufferTag	buf_tag;		/* identity of requested block */
+	uint32		buf_hash;		/* hash value for buf_tag */
+	LWLock	   *partLock;		/* buffer partition lock for the buffer */
+	BufferDesc *bufdesc;
+	int			buf_id;
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(buf_tag, relation->rd_smgr->smgr_rnode.node, forknum, blkno);
+
+	/* determine its hash code and partition lock ID */
+	buf_hash = BufTableHashCode(&buf_tag);
+	partLock = BufMappingPartitionLock(buf_hash);
+
+	/* see if the block is in the buffer pool already */
+	LWLockAcquire(partLock, LW_SHARED);
+	buf_id = BufTableLookup(&buf_tag, buf_hash);
+	if (buf_id >= 0)
+	{
+		uint32		buf_state;
+
+		/*
+		 * Found it.  Now, retrieve its state to know what to do with it, and
+		 * release the header spinlock immediately.  We do so to limit overhead as much
+		 * as possible.  We'll keep the shared lightweight lock on the target
+		 * buffer mapping partition, so this buffer can't be evicted, and
+		 * we'll acquire an IOLock on the buffer if we need to read the
+		 * content on disk.
+		 */
+		bufdesc = GetBufferDescriptor(buf_id);
+
+		buf_state = LockBufHdr(bufdesc);
+		UnlockBufHdr(bufdesc, buf_state);
+
+		/*
+		 * Dirty pages are ignored as they'll be flushed soon. Invalid buffers
+		 * are also skipped.
+		 */
+		if ((buf_state & BM_DIRTY) || !(buf_state & BM_TAG_VALID))
+			checkit = false;
+
+		/*
+		 * Read the buffer from disk, taking an IO lock to prevent torn-page
+		 * reads, in the unlikely event that it was concurrently dirtied and
+		 * flushed.
+		 */
+		if (checkit)
+		{
+			LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
+			smgrread(relation->rd_smgr, forknum, blkno, buffer);
+			LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
+		}
+	}
+	else
+	{
+		/*
+		 * Simply read the buffer.  There's no risk of modification on it as we
+		 * kept the buffer pool partition mapping lock.
+		 */
+		smgrread(relation->rd_smgr, forknum, blkno, buffer);
+
+		/* The buffer will have to be checked. */
+		Assert(checkit);
+	}
+
+	LWLockRelease(partLock);
+
+	return checkit;
+}
+
+/*
+ * Check data sanity for a specific block in the given fork of the given
+ * relation, always retrieved locally with smgrread even if a version exists in
+ * shared_buffers.  Returns false if the block appears to be corrupted, true
+ * otherwise.  Note that dirty and invalid blocks won't be checked.  Caller
+ * must hold at least an AccessShareLock on the relation.
+ */
+bool
+CheckBuffer(Relation relation, ForkNumber forknum, BlockNumber blkno,
+				uint16 *chk_expected, uint16 *chk_found)
+{
+	char		buffer[BLCKSZ];
+
+	Assert(CheckRelationLockedByMe(relation, AccessShareLock, true));
+	Assert(blkno < RelationGetNumberOfBlocksInFork(relation, forknum));
+	Assert(smgrexists(relation->rd_smgr, forknum));
+
+	*chk_expected = *chk_found = NoComputedChecksum;
+
+	if (!check_get_buffer(relation, forknum, blkno, buffer))
+		return true;
+
+	if (check_buffer(buffer, blkno, chk_expected, chk_found))
+		return true;
+
+	/* A corruption is detected. */
+	return false;
+}
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ee91b8fa26..24aa102175 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -244,6 +244,14 @@ extern void TestForOldSnapshot_impl(Snapshot snapshot, Relation relation);
 extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype);
 extern void FreeAccessStrategy(BufferAccessStrategy strategy);
 
+/* A zero checksum can never be computed, see pg_checksum_page(). */
+
+#define NoComputedChecksum	0
+
+extern bool CheckBuffer(Relation relation, ForkNumber forknum,
+						BlockNumber blkno, uint16 *chk_expected,
+						uint16 *chk_found);
+
 
 /* inline functions */
 
-- 
2.20.1

v17-0002-Add-a-pg_check_relation-SQL-function.patchapplication/octet-stream; name=v17-0002-Add-a-pg_check_relation-SQL-function.patchDownload
From eff1266c373f56a3ea6c0060fc7ca4d58e2d255b Mon Sep 17 00:00:00 2001
From: Julien Rouhaud <julien.rouhaud@free.fr>
Date: Thu, 15 Oct 2020 13:20:17 +0800
Subject: [PATCH v17 2/2] Add a pg_check_relation() SQL function

This function checks the validity of the checksums for all non-dirty blocks of
a given relation, optionally restricted to a given fork, and returns the list
of all blocks whose computed and stored checksums don't match.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada, Justin Pryzby
Discussion: https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 doc/src/sgml/func.sgml                        |  49 ++++
 src/backend/utils/adt/Makefile                |   1 +
 src/backend/utils/adt/checksumfuncs.c         | 213 ++++++++++++++
 src/include/catalog/pg_proc.dat               |  16 +
 src/test/modules/Makefile                     |   1 +
 src/test/modules/check_relation/.gitignore    |   2 +
 src/test/modules/check_relation/Makefile      |  14 +
 src/test/modules/check_relation/README        |  23 ++
 .../check_relation/t/001_checksums_check.pl   | 276 ++++++++++++++++++
 src/tools/msvc/Mkvcbuild.pm                   |   3 +-
 10 files changed, 597 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/utils/adt/checksumfuncs.c
 create mode 100644 src/test/modules/check_relation/.gitignore
 create mode 100644 src/test/modules/check_relation/Makefile
 create mode 100644 src/test/modules/check_relation/README
 create mode 100644 src/test/modules/check_relation/t/001_checksums_check.pl

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index e7cff980dd..a800ef33f6 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26217,6 +26217,55 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
 
   </sect2>
 
+  <sect2 id="functions-data-sanity">
+   <title>Data Sanity Functions</title>
+
+   <para>
+    The functions shown in <xref linkend="functions-data-sanity-table"/>
+    provide a means to check the health of data files in a cluster.
+   </para>
+
+   <table id="functions-data-sanity-table">
+    <title>Data Sanity Functions</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry>
+        <literal><function>pg_check_relation(<parameter>relation</parameter> <type>regclass</type> [, <parameter>fork</parameter> <type>text</type>])</function></literal>
+       </entry>
+       <entry><type>setof record</type></entry>
+       <entry>Validate the checksum for all blocks of a relation.
+       </entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <indexterm>
+    <primary>pg_check_relation</primary>
+   </indexterm>
+   <para id="functions-check-relation-note" xreflabel="pg_check_relation">
+    <function>pg_check_relation</function> iterates over all blocks of a
+    given relation and verifies their checksums.  If passed,
+    <replaceable>fork</replaceable> specifies that only checksums of the given
+    fork are to be verified.  Fork should be <literal>'main'</literal> for the
+    main data fork, <literal>'fsm'</literal> for the free space map,
+    <literal>'vm'</literal> for the visibility map, or
+    <literal>'init'</literal> for the initialization fork.
+    The function returns a list of blocks for which the computed and stored
+    checksums don't match.  You must be a member of the
+    <literal>pg_stat_scan_tables</literal> role to use this function.  It can
+    only be used if data checksums are enabled.  See <xref
+    linkend="app-initdb-data-checksums"/> for more information.
+   </para>
+
+  </sect2>
+
   </sect1>
 
   <sect1 id="functions-trigger">
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index b4d55e849b..603f63afb6 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -22,6 +22,7 @@ OBJS = \
 	bool.o \
 	cash.o \
 	char.o \
+	checksumfuncs.o \
 	cryptohashes.o \
 	date.o \
 	datetime.o \
diff --git a/src/backend/utils/adt/checksumfuncs.c b/src/backend/utils/adt/checksumfuncs.c
new file mode 100644
index 0000000000..82d3a912f9
--- /dev/null
+++ b/src/backend/utils/adt/checksumfuncs.c
@@ -0,0 +1,213 @@
+/*-------------------------------------------------------------------------
+ *
+ * checksumfuncs.c
+ *	  Functions for checksum-related features such as online verification
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/checksumfuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/relation.h"
+#include "catalog/pg_authid_d.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/bufmgr.h"
+#include "storage/checksum.h"
+#include "storage/smgr.h"
+#include "utils/acl.h"
+#include "utils/rel.h"
+#include "utils/builtins.h"
+#include "utils/lsyscache.h"
+
+static void check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+					 Oid relid, ForkNumber single_forknum);
+static void check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+									  Relation relation, ForkNumber forknum);
+static void pg_check_relation_internal(FunctionCallInfo fcinfo, Oid relid,
+									   ForkNumber forknum);
+
+
+Datum
+pg_check_relation(PG_FUNCTION_ARGS)
+{
+	Oid			relid = InvalidOid;
+
+	pg_check_relation_internal(fcinfo, relid, InvalidForkNumber);
+
+	return (Datum) 0;
+}
+
+Datum
+pg_check_relation_fork(PG_FUNCTION_ARGS)
+{
+	Oid			relid = InvalidOid;
+	const char *forkname;
+	ForkNumber	forknum;
+
+	forkname = TextDatumGetCString(PG_GETARG_TEXT_PP(1));
+	forknum = forkname_to_number(forkname);
+
+	pg_check_relation_internal(fcinfo, relid, forknum);
+
+	return (Datum) 0;
+}
+
+/* Common code for all versions of pg_check_relation() */
+static void
+pg_check_relation_internal(FunctionCallInfo fcinfo, Oid relid, ForkNumber forknum)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+
+	if (!DataChecksumsEnabled())
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("data checksums are not enabled in cluster")));
+
+	if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_STAT_SCAN_TABLES))
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 errmsg("only superuser or a member of the pg_stat_scan_tables role may use this function")));
+
+	/* Switch into long-lived context to construct returned data structures */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	relid = PG_GETARG_OID(0);
+
+	check_one_relation(tupdesc, tupstore, relid, forknum);
+
+	tuplestore_donestoring(tupstore);
+}
+
+/*
+ * Perform the check on a single relation, possibly filtered with a single
+ * fork.  This function will check if the given relation exists or not, as
+ * a relation could be dropped after checking for the list of relations and
+ * before getting here, and we don't want to error out in this case.
+ */
+static void
+check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+					 Oid relid, ForkNumber single_forknum)
+{
+	Relation	relation = NULL;
+	ForkNumber	forknum;
+
+	relation = try_relation_open(relid, AccessShareLock);
+
+	/* The relation could have been dropped concurrently; silently skip it. */
+	if (relation == NULL)
+		return;
+
+	/* sanity checks */
+	if (!RELKIND_HAS_STORAGE(relation->rd_rel->relkind))
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("relation \"%s\" does not have storage to be checked",
+				 RelationGetRelationName(relation))));
+
+	if (RELATION_IS_OTHER_TEMP(relation))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("cannot verify temporary tables of other sessions")));
+
+	RelationOpenSmgr(relation);
+
+	for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+	{
+		if (single_forknum != InvalidForkNumber && single_forknum != forknum)
+			continue;
+
+		if (smgrexists(relation->rd_smgr, forknum))
+			check_relation_fork(tupdesc, tupstore, relation, forknum);
+	}
+
+	relation_close(relation, AccessShareLock);	/* release the lock */
+}
+
+/*
+ * For a given relation and fork, do the real work of iterating over all pages
+ * and doing the check.  Caller must hold an AccessShareLock on the given
+ * relation.
+ */
+static void
+check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+						  Relation relation, ForkNumber forknum)
+{
+	BlockNumber blkno,
+				nblocks;
+
+	/*
+	 * We remember the number of blocks here.  Since caller must hold a lock on
+	 * the relation, we know that it won't be truncated while we're iterating
+	 * over the blocks.  Any block added after this function started won't be
+	 * checked, but this is out of scope as such pages will be flushed before
+	 * the next checkpoint's completion.
+	 */
+	nblocks = RelationGetNumberOfBlocksInFork(relation, forknum);
+
+#define PG_CHECK_RELATION_COLS			5	/* Number of output arguments in the SRF */
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		uint16		chk_expected,
+					chk_found;
+		Datum		values[PG_CHECK_RELATION_COLS];
+		bool		nulls[PG_CHECK_RELATION_COLS];
+		int			i = 0;
+
+		/* Check the given buffer */
+		if (CheckBuffer(relation, forknum, blkno, &chk_expected,
+							&chk_found))
+		{
+			/* Buffer not corrupted or not worth checking, continue */
+			continue;
+		}
+
+		memset(values, 0, sizeof(values));
+		memset(nulls, 0, sizeof(nulls));
+
+		values[i++] = ObjectIdGetDatum(relation->rd_id);
+		values[i++] = Int32GetDatum(forknum);
+		values[i++] = UInt32GetDatum(blkno);
+		/*
+		 * This can happen if corruption makes the block appear new to
+		 * PageIsNew() while it actually isn't a new page.
+		 */
+		if (chk_expected == NoComputedChecksum)
+			nulls[i++] = true;
+		else
+			values[i++] = UInt16GetDatum(chk_expected);
+		values[i++] = UInt16GetDatum(chk_found);
+
+		Assert(i == PG_CHECK_RELATION_COLS);
+
+		/* Report the failure to the stat collector and the logs. */
+		pgstat_report_checksum_failure();
+		ereport(WARNING,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("invalid page in block %u of relation %s",
+						blkno,
+						relpath(relation->rd_smgr->smgr_rnode, forknum))));
+
+		/* Save the corrupted blocks in the tuplestore. */
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 22340baf1c..9f4514d60f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10958,6 +10958,22 @@
   proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
   proargnames => '{tablespace,name,size,modification}',
   prosrc => 'pg_ls_tmpdir_1arg' },
+{ oid => '9147', descr => 'check data integrity for all forks of a relation',
+  proname => 'pg_check_relation', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass',
+  proallargtypes => '{regclass,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,o,o,o,o,o}',
+  proargnames => '{relation,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation' },
+{ oid => '9148', descr => 'check data integrity for one fork of a relation',
+  proname => 'pg_check_relation', procost => '10000',
+  prorows => '20', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}',
+  proargmodes => '{i,i,o,o,o,o,o}',
+  proargnames => '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}',
+  prosrc => 'pg_check_relation_fork' },
 
 # hash partitioning constraint function
 { oid => '5028', descr => 'hash partition CHECK constraint',
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index a6d2ffbf9e..a845af71fd 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -7,6 +7,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = \
 		  brin \
+		  check_relation \
 		  commit_ts \
 		  delay_execution \
 		  dummy_index_am \
 		  dummy_seclabel \
diff --git a/src/test/modules/check_relation/.gitignore b/src/test/modules/check_relation/.gitignore
new file mode 100644
index 0000000000..871e943d50
--- /dev/null
+++ b/src/test/modules/check_relation/.gitignore
@@ -0,0 +1,2 @@
+# Generated by test suite
+/tmp_check/
diff --git a/src/test/modules/check_relation/Makefile b/src/test/modules/check_relation/Makefile
new file mode 100644
index 0000000000..a540cdece2
--- /dev/null
+++ b/src/test/modules/check_relation/Makefile
@@ -0,0 +1,14 @@
+# src/test/modules/check_relation/Makefile
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/check_relation
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/check_relation/README b/src/test/modules/check_relation/README
new file mode 100644
index 0000000000..415c4b21ad
--- /dev/null
+++ b/src/test/modules/check_relation/README
@@ -0,0 +1,23 @@
+src/test/modules/check_relation/README
+
+Regression tests for online checksums verification
+==================================================
+
+This directory contains a test suite for online checksums verification.
+
+Running the tests
+=================
+
+NOTE: You must have given the --enable-tap-tests argument to configure.
+
+Run
+    make check
+or
+    make installcheck
+You can use "make installcheck" if you previously did "make install".
+In that case, the code in the installation tree is tested.  With
+"make check", a temporary installation tree is built from the current
+sources and then tested.
+
+Either way, this test initializes, starts, and stops a test Postgres
+cluster.
diff --git a/src/test/modules/check_relation/t/001_checksums_check.pl b/src/test/modules/check_relation/t/001_checksums_check.pl
new file mode 100644
index 0000000000..2a3f2880ea
--- /dev/null
+++ b/src/test/modules/check_relation/t/001_checksums_check.pl
@@ -0,0 +1,276 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 59;
+
+our $CHECKSUM_UINT16_OFFSET = 4;
+our $PD_UPPER_UINT16_OFFSET = 7;
+our $BLOCKSIZE;
+our $TOTAL_NB_ERR = 0;
+
+sub get_block
+{
+	my ($filename, $blkno) = @_;
+	my $block;
+
+	open(my $infile, '<', $filename) or die;
+	binmode($infile);
+
+	# The OFFSET argument of read()/sysread() is an offset into the
+	# scalar, not a file position, so seek to the wanted block explicitly.
+	sysseek($infile, $blkno * $BLOCKSIZE, 0) or die($!);
+	my $success = sysread($infile, $block, $BLOCKSIZE);
+	die($!) if not defined $success;
+
+	close($infile);
+
+	return($block);
+}
+
+sub overwrite_block
+{
+	my ($filename, $block, $blkno) = @_;
+
+	# Open read-write without truncating: '>' would wipe the rest of the file.
+	open(my $outfile, '+<', $filename) or die;
+	binmode($outfile);
+
+	# As with read(), syswrite()'s OFFSET argument addresses the scalar,
+	# so position the file handle explicitly instead.
+	sysseek($outfile, $blkno * $BLOCKSIZE, 0) or die($!);
+	my $nb = syswrite($outfile, $block, $BLOCKSIZE);
+
+	die($!) if not defined $nb;
+	die("Write error") if ($nb != $BLOCKSIZE);
+
+	close($outfile);
+}
+
+sub get_uint16_from_page
+{
+	my ($block, $offset) = @_;
+
+	return (unpack("S*", $block))[$offset];
+}
+
+sub set_uint16_to_page
+{
+	my ($block, $data, $offset) = @_;
+
+	my $pack = pack("S", $data);
+
+	# vec() with 16-bit or wider elements won't preserve endianness
+	vec($block, 2*$offset, 8) = (unpack('C*', $pack))[0];
+	vec($block, (2*$offset) + 1, 8) = (unpack('C*', $pack))[1];
+
+	return $block;
+}
+
+sub check_checksums_call
+{
+	my ($node, $relname) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM pg_catalog.pg_check_relation('$relname')"
+	);
+
+	 return ($stderr eq '');
+}
+
+sub check_checksums_nb_error
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+        . " FROM (SELECT pg_catalog.pg_check_relation(oid, 'main')"
+        . "   FROM pg_class WHERE relkind in ('r', 'i', 'm')) AS s"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+	is($stdout, $nb, "Should have $nb error");
+
+	$TOTAL_NB_ERR += $nb;
+}
+
+sub check_pg_stat_database_nb_error
+{
+	my ($node) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT "
+		. " sum(checksum_failures)"
+		. " FROM pg_catalog.pg_stat_database"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	is($stderr, '', 'Function should run successfully');
+	is($stdout, $TOTAL_NB_ERR, "Should have $TOTAL_NB_ERR error");
+}
+
+sub get_checksums_errors
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+		. " relid::regclass::text, forknum, failed_blocknum,"
+		. " expected_checksum, found_checksum"
+		. " FROM (SELECT (pg_catalog.pg_check_relation(oid)).*"
+        . "   FROM pg_class WHERE relkind in ('r','i', 'm')) AS s"
+	);
+
+	 is($cmdret, '0', 'Function should run successfully');
+	 like($stderr, $pattern, 'Error output should match expectations');
+
+	 $TOTAL_NB_ERR += $nb;
+
+	 return $stdout;
+}
+
+# This function will perform various tests by modifying the specified block at
+# the specified uint16 offset, checking that the corruption is correctly
+# detected, and finally restoring the specified block to its original content.
+sub corrupt_and_test_block
+{
+	my ($node, $filename, $blkno, $offset, $fake_data) = @_;
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+
+	check_pg_stat_database_nb_error($node);
+
+	$node->stop();
+
+	my $original_block = get_block($filename, $blkno);
+	my $original_data = get_uint16_from_page($original_block, $offset);
+
+	isnt($original_data, $fake_data,
+		"The fake data at offset $offset should be different"
+		. " from the existing one");
+
+	my $new_block = set_uint16_to_page($original_block, $fake_data, $offset);
+	isnt($original_data, get_uint16_from_page($new_block, $offset),
+		"The fake data at offset $offset should have been changed in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+
+	my $written_data = get_uint16_from_page(get_block($filename, $blkno), $offset);
+	isnt($original_data, $written_data,
+		"The data written at offset $offset should be different"
+		. " from the original one");
+	is(get_uint16_from_page($new_block, $offset), $written_data,
+		"The data written at offset $offset should be the same"
+		. " as the one in memory");
+	is($written_data, $fake_data,
+		"The data written at offset $offset should be the one"
+		. " we wanted to write");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 1, qr/invalid page in block $blkno/);
+
+	my $expected_checksum;
+	my $found_checksum = get_uint16_from_page($new_block,
+		$CHECKSUM_UINT16_OFFSET);
+	if ($offset == $PD_UPPER_UINT16_OFFSET)
+	{
+		# A checksum can't be computed if it's detected as PageIsNew(), so the
+		# function returns NULL for the computed checksum
+		$expected_checksum = '';
+	}
+	else
+	{
+		$expected_checksum = get_uint16_from_page($original_block,
+			$CHECKSUM_UINT16_OFFSET);
+	}
+
+	my $det = get_checksums_errors($node, 1, qr/invalid page in block $blkno/);
+	is($det, "t1|0|0|$expected_checksum|$found_checksum",
+		"The checksums error for modification at offset $offset"
+		. " should be detected");
+
+	$node->stop();
+
+	$new_block = set_uint16_to_page($original_block, $original_data, $offset);
+	is($original_data, get_uint16_from_page($new_block, $offset),
+		"The data at offset $offset should have been restored in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+	is($original_data, get_uint16_from_page(get_block($filename, $blkno),
+			$offset),
+		"The data at offset $offset should have been restored on disk");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+}
+
+if (exists $ENV{MY_PG_REGRESS})
+{
+	$ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS};
+}
+
+my $node = get_new_node('main');
+
+my %params;
+$params{'extra'} = ['--data-checksums'];
+$node->init(%params);
+
+$node->start();
+
+$ENV{PGOPTIONS} = '--client-min-messages=WARNING';
+
+my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_checksums')");
+
+is($stdout, 'on', 'Data checksums should be enabled');
+
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('block_size')");
+
+$BLOCKSIZE = $stdout;
+
+$node->safe_psql(
+	'postgres', q|
+	CREATE TABLE public.t1(id integer);
+	CREATE INDEX t1_id_idx ON public.t1 (id);
+	INSERT INTO public.t1 SELECT generate_series(1, 100);
+	CREATE VIEW public.v1 AS SELECT * FROM t1;
+	CREATE MATERIALIZED VIEW public.mv1 AS SELECT * FROM t1;
+	CREATE SEQUENCE public.s1;
+	CREATE UNLOGGED TABLE public.u_t1(id integer);
+	CREATE INDEX u_t1_id_idx ON public.u_t1 (id);
+	INSERT INTO public.u_t1 SELECT generate_series(1, 100);
+	CHECKPOINT;
+|);
+
+# Check sane behavior on various objects type, including those that don't have
+# a storage.
+is(check_checksums_call($node, 't1'), '1', 'Can check a table');
+is(check_checksums_call($node, 't1_id_idx'), '1', 'Can check an index');
+is(check_checksums_call($node, 'v1'), '', 'Cannot check a view');
+is(check_checksums_call($node, 'mv1'), '1', 'Can check a materialized view');
+is(check_checksums_call($node, 's1'), '1', 'Can check a sequence');
+is(check_checksums_call($node, 'u_t1'), '1', 'Can check an unlogged table');
+is(check_checksums_call($node, 'u_t1_id_idx'), '1', 'Can check an unlogged index');
+
+# get the underlying heap absolute path
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_directory') || '/' || pg_relation_filepath('t1')"
+);
+
+isnt($stdout, '', 'A relfilenode should be returned');
+
+my $filename = $stdout;
+
+check_checksums_nb_error($node, 0, qr/^$/);
+
+check_pg_stat_database_nb_error($node);
+
+my $fake_uint16 = hex '0x0000';
+
+# Test with a modified checksum.  We use a zero checksum here as it's the only
+# one that cannot exist on a checksummed page.  We also don't have an easy way
+# to compute what the checksum would be after a modification in a random place
+# in the block.
+corrupt_and_test_block($node, $filename, 0, $CHECKSUM_UINT16_OFFSET,
+	$fake_uint16);
+
+# Test corruption making the block look like it's PageIsNew().
+corrupt_and_test_block($node, $filename, 0, $PD_UPPER_UINT16_OFFSET,
+	$fake_uint16);
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 90594bd41b..adbabba3c4 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -50,7 +50,8 @@ my @contrib_excludes = (
 	'pgcrypto',         'sepgsql',
 	'brin',             'test_extensions',
 	'test_misc',        'test_pg_dump',
-	'snapshot_too_old', 'unsafe_tests');
+	'snapshot_too_old', 'unsafe_tests',
+	'check_relation');
 
 # Set of variables for frontend modules
 my $frontend_defines = { 'initdb' => 'FRONTEND' };
-- 
2.20.1

#54Michael Paquier
michael@paquier.xyz
In reply to: Julien Rouhaud (#53)
Re: Online checksums verification in the backend

On Fri, Oct 16, 2020 at 09:22:02AM +0800, Julien Rouhaud wrote:

And Michael just told me that I also missed adding one of the C files
while splitting the patch into two.

+   if (PageIsNew(page))
+   {
+       /*
+        * Check if the page is really new or if there's corruption that
+        * affected PageIsNew detection.  Note that PageIsVerified won't try to
+        * detect checksum corruption in this case, so there's no risk of
+        * duplicated corruption report.
+        */
+       if (PageIsVerified(page, blkno))
+       {
+           /* No corruption. */
+           return true;
+       }
Please note that this part of your patch overlaps with a proposal for
a bug fix related to zero-only pages with the checksum verification of
base backups:
https://www.postgresql.org/message-id/608f3476-0598-2514-2c03-e05c7d2b0cbd@postgrespro.ru

Your patch is trying to adapt itself to the existing logic we have in
PageIsVerified() so that you don't get a duplicated report, as does the
base backup part. Note that I cannot find in the wild any open code
making use of PageIsVerified(), but I'd like to believe that it is
rather handy to use for table AMs at least (?), so if we can avoid any
useless ABI breakage, it may be better to have a new
PageIsVerifiedExtended() able to take additional options, one to
report to pgstat and one to generate this elog(WARNING). And then
this patch could just make use of it?

+       /*
+        * There's corruption, but since this affects PageIsNew, we
+        * can't compute a checksum, so set NoComputedChecksum for the
+        * expected checksum.
+        */
+       *chk_expected = NoComputedChecksum;
+       *chk_found = hdr->pd_checksum;
+       return false;
[...]
+       /*
+        * This can happen if corruption makes the block appears as
+        * PageIsNew() but isn't a new page.
+        */
+       if (chk_expected == NoComputedChecksum)
+           nulls[i++] = true;
+       else
+           values[i++] = UInt16GetDatum(chk_expected);
Somewhat related to the first point, NoComputedChecksum exists
because, as the current patch is shaped, we need to report an existing
checksum to the user even for the zero-only case.  PageIsVerified() is
not that flexible so we could change it to report a status depending
on the error faced (checksum, header or zero-only) on top of getting a
checksum.  Now, I am not completely sure either that it is worth the
complication to return in the SRF of the check function the expected
checksum.  So, wouldn't it be better to just rely on PageIsVerified()
(well it's rather-soon-to-be extended flavor) for the checksum check,
the header sanity check and the zero-only check?  My point is to keep
a single entry point for all the page sanity checks, so as base
backups, your patch, and the buffer manager apply the same things.
Base backups got that partially wrong because the base backup code
wants to keep control of the number of failures and the error
reports.  Your patch actually wishes to report a failure, but you want
to add more context with the fork name and such.  Another option we
could use here is to add an error context so as PageIsVerified()
reports the WARNING, but the SQL function provides more context with
the block number and the relation involved in the check.
--
Michael
#55Julien Rouhaud
rjuju123@gmail.com
In reply to: Michael Paquier (#54)
Re: Online checksums verification in the backend

On Mon, Oct 19, 2020 at 10:39 AM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Oct 16, 2020 at 09:22:02AM +0800, Julien Rouhaud wrote:

And Michael just told me that I also missed adding one of the C files
while splitting the patch into two.

+   if (PageIsNew(page))
+   {
+       /*
+        * Check if the page is really new or if there's corruption that
+        * affected PageIsNew detection.  Note that PageIsVerified won't try to
+        * detect checksum corruption in this case, so there's no risk of
+        * duplicated corruption report.
+        */
+       if (PageIsVerified(page, blkno))
+       {
+           /* No corruption. */
+           return true;
+       }
Please note that this part of your patch overlaps with a proposal for
a bug fix related to zero-only pages with the checksum verification of
base backups:
https://www.postgresql.org/message-id/608f3476-0598-2514-2c03-e05c7d2b0cbd@postgrespro.ru

Your patch is trying to adapt itself to the existing logic we have in
PageIsVerified() so that you don't get a duplicated report, as does the
base backup part. Note that I cannot find in the wild any open code
making use of PageIsVerified(), but I'd like to believe that it is
rather handy to use for table AMs at least (?), so if we can avoid any
useless ABI breakage, it may be better to have a new
PageIsVerifiedExtended() able to take additional options, one to
report to pgstat and one to generate this elog(WARNING). And then
this patch could just make use of it?

Indeed, that would be great.

+       /*
+        * There's corruption, but since this affects PageIsNew, we
+        * can't compute a checksum, so set NoComputedChecksum for the
+        * expected checksum.
+        */
+       *chk_expected = NoComputedChecksum;
+       *chk_found = hdr->pd_checksum;
+       return false;
[...]
+       /*
+        * This can happen if corruption makes the block appears as
+        * PageIsNew() but isn't a new page.
+        */
+       if (chk_expected == NoComputedChecksum)
+           nulls[i++] = true;
+       else
+           values[i++] = UInt16GetDatum(chk_expected);
Somewhat related to the first point, NoComputedChecksum exists
because, as the current patch is shaped, we need to report an existing
checksum to the user even for the zero-only case.

I'm not sure that I understand your point. The current patch only
returns something to users when there's a corruption. If by
"zero-only case" you mean "page corrupted in a way that PageIsNew()
returns true while not being all zero", then it's a corrupted page and
then obviously yes it needs to be returned to users.

PageIsVerified() is
not that flexible so we could change it to report a status depending
on the error faced (checksum, header or zero-only) on top of getting a
checksum. Now, I am not completely sure either that it is worth the
complication to return in the SRF of the check function the expected
checksum.

It seemed to me that it could be something useful to get with this
kind of tool. You may be able to recover a corrupted page from
backup/WAL if the checksum itself wasn't corrupted so that you know
what to look for. There would be a lot of caveats and low level work,
but if you're desperate enough that may save you a bit of time.

So, wouldn't it be better to just rely on PageIsVerified()
(well it's rather-soon-to-be extended flavor) for the checksum check,
the header sanity check and the zero-only check? My point is to keep
a single entry point for all the page sanity checks, so as base
backups, your patch, and the buffer manager apply the same things.
Base backups got that partially wrong because the base backup code
wants to keep control of the number of failures and the error
reports.

I'm fine with it.

Your patch actually wishes to report a failure, but you want
to add more context with the fork name and such. Another option we
could use here is to add an error context so as PageIsVerified()
reports the WARNING, but the SQL function provides more context with
the block number and the relation involved in the check.

Also, returning actual data rather than a bunch of warnings is way
easier to process for client code. And as mentioned previously having
an API that returns a list of corrupted blocks could be useful for a
single-page recovery feature.

#56Michael Paquier
michael@paquier.xyz
In reply to: Julien Rouhaud (#55)
Re: Online checksums verification in the backend

On Mon, Oct 19, 2020 at 11:16:38AM +0800, Julien Rouhaud wrote:

On Mon, Oct 19, 2020 at 10:39 AM Michael Paquier <michael@paquier.xyz> wrote:

Somewhat related to the first point, NoComputedChecksum exists
because, as the current patch is shaped, we need to report an existing
checksum to the user even for the zero-only case.

I'm not sure that I understand your point. The current patch only
returns something to users when there's a corruption. If by
"zero-only case" you mean "page corrupted in a way that PageIsNew()
returns true while not being all zero", then it's a corrupted page and
then obviously yes it needs to be returned to users.

Sorry for the confusion, this previous paragraph was confusing. I
meant that the reason why NoComputedChecksum exists is that we give up
on attempting to calculate the checksum if we detect that the page is
new, but failed the zero-only test, and that we want the users to know
about this special case by setting this expected checksum to NULL for
the SRF.

So, wouldn't it be better to just rely on PageIsVerified()
(well it's rather-soon-to-be extended flavor) for the checksum check,
the header sanity check and the zero-only check? My point is to keep
a single entry point for all the page sanity checks, so as base
backups, your patch, and the buffer manager apply the same things.
Base backups got that partially wrong because the base backup code
wants to keep control of the number of failures and the error
reports.

I'm fine with it.

Thanks.

Your patch actually wishes to report a failure, but you want
to add more context with the fork name and such. Another option we
could use here is to add an error context so as PageIsVerified()
reports the WARNING, but the SQL function provides more context with
the block number and the relation involved in the check.

Also, returning actual data rather than a bunch of warnings is way
easier to process for client code. And as mentioned previously having
an API that returns a list of corrupted blocks could be useful for a
single-page recovery feature.

No issues with reporting the block number and the fork type in the SRF
at least of course as this is helpful to detect the position of the
broken blocks. For the checksum found in the header and the one
calculated (named expected in the patch), I am less sure how to put a
clear definition on top of that but we could always consider that
later and extend the SRF as needed. Once the user knows that both do
not match, he/she knows that there is a problem.
--
Michael

#57Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#56)
1 attachment(s)
Re: Online checksums verification in the backend

On Mon, Oct 19, 2020 at 04:52:48PM +0900, Michael Paquier wrote:

No issues with reporting the block number and the fork type in the SRF
at least of course as this is helpful to detect the position of the
broken blocks. For the checksum found in the header and the one
calculated (named expected in the patch), I am less sure how to put a
clear definition on top of that but we could always consider that
later and extend the SRF as needed. Once the user knows that both do
not match, he/she knows that there is a problem.

So, I have reviewed your patch set, and heavily reworked the logic to
be more consistent on many things, resulting in a largely simplified
patch without sacrificing its usefulness:
- The logic of the core routine of bufmgr.c is unchanged. I have
simplified it a bit though by merging the subroutines that were part
of the patch. SMgrRelation is used as argument instead of a
Relation. That's more consistent with the surroundings. The initial
read of a page without locks is still on the table as an extra
optimization, though I am not completely sure if this should be part
of CheckBuffer() or not. I also thought about the previous benchmarks
and I think that not using the most-optimized approach improved
performance, because it reduced the successive runs of the SQL
functions, reducing the back-pressure on the partition locks (we held
one of them at a time for a longer period, meaning that the other 127
ran free for a longer time).  Please note that this part still
referred to a
a longer time). Please note that this part still referred to a
"module", which was incorrect.
- Removal of the expected and found checksums from the SRF of the
function. Based on my recent business with the page checks for base
backups, I have arrived at the conclusion that the SRF should return
data that we can absolutely trust, and the minimum I think we have to
trust here is if a given page is thought as safe or not, considering
all the sanity checks done by PageIsVerified() as the main entry
point for everything. This has led to a bit of confusion with the
addition of NoComputedChecksum for a page that was empty as of the
initial version of the patch, so it turns out that we don't need it anymore.
- Removal of the dependency with checksums for this feature. While
simplifying the code, I have noticed that this feature can also be
beneficial for clusters that do not have data checksums, as
PageIsVerified() is perfectly able to run some page header checks and
the zero-page case. That's of course less useful than having the
checksums, but there is no need to add a dependency here. The file
for the SQL functions is renamed from checksumfuncs.c to pagefuncs.c.
- The function is changed to return no tuples if the relkind is not
supported, and the same applies to temporary relations. That's more
consistent with other system functions like the ones in charge of
partition information, and this makes full scans of pg_class much
easier to work with. Temporary tables were not handled correctly
anyway, as these live in local buffers, and the use case of this
function for them is not really obvious to me.
- Having the forknum in the SRF is kind of confusing, as the user
would need to map a number to the physical on-disk name. Instead, I
have made the choice to return the *path* of the corrupted file with a
block number. This way, an operator can know immediately where a
problem comes from. The patch does not append the segment number, and
I am not sure if we should actually do that, but adding it is
straightforward as we have the block number. There is a dependency
with table AMs here as well, as this goes down to fd.c, which is why I
have not added it.
- I really don't know what kind of default ACL should apply for such
functions, but I am sure that SCAN_TABLES is not what we are looking
for here, and there is nothing preventing us from having a safe
default from the start, so I made the function superuser-only by
default, and GRANT can be used to allow its execution by other
roles. We could relax that in the future, of course; this can be
discussed separately.
- The WARNING report for each block found as corrupted gains an error
context, as a result of the switch to PageIsVerified(), giving a user
all the information needed in the logs on top of the result in the
SRF. That's useful as well when combined with CHECK_FOR_INTERRUPTS(),
and it makes me wonder if we should have some progress reporting for
this stuff, though that's a separate discussion.
- The function is renamed to something less generic,
pg_relation_check_pages(), and I have reduced the number of functions
from two to one, where the user can specify the fork name with a new
option. The default of NULL means that all the forks of a relation
are checked.
- The TAP tests are rather bulky. I have moved all the non-corruption
test cases into a new SQL test file. That's useful for people willing
to do some basic sanity checks with a non-default table AM; at least
it provides a minimum coverage. I have not completely finished my
review of them, but I have done some work. Debugging
corrupt_and_test_block() was proving rather difficult as the same
test names are assigned multiple times. I am tempted to move this
test suite to src/test/recovery/ instead.
- Reworked the docs and some comments.
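
To make concrete what the checks without data checksums buy us (the
header sanity tests and the zero-page case handled by PageIsVerified()),
here is a minimal standalone C sketch. The struct is a simplified
stand-in for PageHeaderData (it omits pd_prune_xid), not backend code,
and the function only mirrors the bound checks, not the checksum
comparison:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

/* Simplified stand-in for PageHeaderData, for illustration only */
typedef struct
{
	uint64_t	pd_lsn;
	uint16_t	pd_checksum;
	uint16_t	pd_flags;
	uint16_t	pd_lower;		/* offset to start of free space */
	uint16_t	pd_upper;		/* offset to end of free space */
	uint16_t	pd_special;		/* offset to start of special space */
	uint16_t	pd_pagesize_version;
} PageHeaderSketch;

/*
 * Basic header sanity checks: a page is accepted if it is entirely
 * zeroes (a "new" page, as can legitimately appear after a relation
 * extension), or if its free-space pointers are ordered and fall
 * within the block.
 */
static bool
page_header_is_sane(const char *page)
{
	const PageHeaderSketch *hdr = (const PageHeaderSketch *) page;
	static const char zeroes[BLCKSZ];

	if (memcmp(page, zeroes, BLCKSZ) == 0)
		return true;

	return hdr->pd_lower >= sizeof(PageHeaderSketch) &&
		hdr->pd_lower <= hdr->pd_upper &&
		hdr->pd_upper <= hdr->pd_special &&
		hdr->pd_special <= BLCKSZ;
}
```

This is also why the TAP test corrupting pd_upper so that the page looks
PageIsNew() expects no error to be reported.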

That's quite a lot of changes, and I think that most of the C code,
the main tests in src/test/regress/ and the docs are getting into a
rather committable state. The TAP tests emulating corruption still
need a closer look (say, check_pg_stat_database_nb_error() should
have an error prefix at least). The portions in bufmgr.c and the rest
should of course be split into two separate commits, which can easily
be done. And the code needs an indentation run and a catalog bump.
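
As a side note on the TAP suite's approach of flipping a uint16 on disk
and expecting a verification failure: that detection relies on the page
checksum changing under any single-byte modification. A tiny standalone
sketch of that property, using plain 32-bit FNV-1a as a stand-in (the
real pg_checksum_page() uses a different FNV-based algorithm, folds the
result to 16 bits and mixes in the block number):

```c
#include <stddef.h>
#include <stdint.h>

#define BLCKSZ 8192

/* Stand-in block checksum: plain 32-bit FNV-1a, illustration only */
static uint32_t
block_checksum(const uint8_t *block, size_t len)
{
	uint32_t	hash = 2166136261u;	/* FNV-1a 32-bit offset basis */

	for (size_t i = 0; i < len; i++)
	{
		hash ^= block[i];
		hash *= 16777619u;			/* FNV 32-bit prime */
	}
	return hash;
}
```

Since each step xors in one byte and multiplies by an odd prime (both
invertible modulo 2^32), any single-byte change yields a different
32-bit value; folding down to 16 bits, as the backend does for
pd_checksum, keeps this property only probabilistically.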
--
Michael

Attachments:

online-checksum-v18.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index bbcac69d48..459b2ad4ae 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10958,6 +10958,14 @@
   proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
   proargnames => '{tablespace,name,size,modification}',
   prosrc => 'pg_ls_tmpdir_1arg' },
+{ oid => '9147', descr => 'check pages of a relation',
+  proname => 'pg_relation_check_pages', procost => '10000',
+  prorows => '20', proisstrict => 'f', proretset => 't', proparallel => 'r',
+  provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,text,int8}',
+  proargmodes => '{i,i,o,o}',
+  proargnames => '{relation,fork,path,failed_block_num}',
+  prosrc => 'pg_relation_check_pages' },
 
 # hash partitioning constraint function
 { oid => '5028', descr => 'hash partition CHECK constraint',
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ee91b8fa26..a21cab2eaf 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -240,6 +240,9 @@ extern void AtProcExit_LocalBuffers(void);
 
 extern void TestForOldSnapshot_impl(Snapshot snapshot, Relation relation);
 
+extern bool CheckBuffer(struct SMgrRelationData *smgr, ForkNumber forknum,
+						BlockNumber blkno);
+
 /* in freelist.c */
 extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype);
 extern void FreeAccessStrategy(BufferAccessStrategy strategy);
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 85cd147e21..c6dd084fbc 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1300,6 +1300,14 @@ LANGUAGE INTERNAL
 STRICT VOLATILE
 AS 'pg_create_logical_replication_slot';
 
+CREATE OR REPLACE FUNCTION pg_relation_check_pages(
+    IN relation regclass, IN fork text DEFAULT NULL,
+    OUT path text, OUT failed_block_num bigint)
+RETURNS SETOF record
+LANGUAGE internal
+VOLATILE PARALLEL RESTRICTED
+AS 'pg_relation_check_pages';
+
 CREATE OR REPLACE FUNCTION
   make_interval(years int4 DEFAULT 0, months int4 DEFAULT 0, weeks int4 DEFAULT 0,
                 days int4 DEFAULT 0, hours int4 DEFAULT 0, mins int4 DEFAULT 0,
@@ -1444,6 +1452,7 @@ AS 'unicode_is_normalized';
 -- can later change who can access these functions, or leave them as only
 -- available to superuser / cluster owner, if they choose.
 --
+REVOKE EXECUTE ON FUNCTION pg_relation_check_pages(regclass, text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_start_backup(text, boolean, boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_stop_backup() FROM public;
 REVOKE EXECUTE ON FUNCTION pg_stop_backup(boolean, boolean) FROM public;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e549fa1d30..427177e9ce 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4583,3 +4583,100 @@ TestForOldSnapshot_impl(Snapshot snapshot, Relation relation)
 				(errcode(ERRCODE_SNAPSHOT_TOO_OLD),
 				 errmsg("snapshot too old")));
 }
+
+
+/*
+ * CheckBuffer
+ *
+ * Check the state of a buffer without loading it into the shared buffers. To
+ * avoid torn pages and possible false positives when reading data, a shared
+ * LWLock is taken on the target buffer pool partition mapping, and we check
+ * if the page is in shared_buffers or not.
+ *
+ * If the page is found as dirty in the shared buffers, it is ignored as
+ * it will be flushed to disk either before the end of the next checkpoint
+ * or during recovery in case of an unsafe shutdown.
+ *
+ * If the page is found in the shared buffers but is not dirty, we still
+ * check the state of its data on disk, as it could be possible that the
+ * page stayed in shared buffers for a rather long time but its on-disk data
+ * got corrupted.  An I/O lock is taken on the block and the block is then
+ * read from storage to prevent any concurrent activity from happening.
+ *
+ * If the page is not found in shared_buffers, the block is read from disk
+ * while holding the buffer pool partition mapping LWLock.
+ *
+ * The page data is stored in a private memory area local to this function.
+ *
+ * The caller of this function should hold an AccessShareLock on the
+ * relation whose page is checked.
+ */
+bool
+CheckBuffer(SMgrRelation smgr, ForkNumber forknum, BlockNumber blkno)
+{
+	char		buffer[BLCKSZ];
+	BufferTag	buf_tag;		/* identity of requested block */
+	uint32		buf_hash;		/* hash value for buf_tag */
+	LWLock	   *partLock;		/* buffer partition lock for the buffer */
+	BufferDesc *bufdesc;
+	int			buf_id;
+
+	Assert(smgrexists(smgr, forknum));
+
+	/* create a tag so we can look up the buffer */
+	INIT_BUFFERTAG(buf_tag, smgr->smgr_rnode.node, forknum, blkno);
+
+	/* determine its hash code and partition lock ID */
+	buf_hash = BufTableHashCode(&buf_tag);
+	partLock = BufMappingPartitionLock(buf_hash);
+
+	/* see if the block is in the buffer pool or not */
+	LWLockAcquire(partLock, LW_SHARED);
+	buf_id = BufTableLookup(&buf_tag, buf_hash);
+	if (buf_id >= 0)
+	{
+		uint32		buf_state;
+
+		/*
+		 * Found it.  Now, retrieve its state to know what to do with it,
+		 * and release the buffer header spinlock immediately.  We do so
+		 * to limit overhead as much as possible.  We keep the shared
+		 * lightweight lock on the target buffer mapping partition for now,
+		 * so this buffer cannot be evicted, and we will acquire an I/O
+		 * lock on the buffer if we need to read the content on disk.
+		 */
+		bufdesc = GetBufferDescriptor(buf_id);
+
+		buf_state = LockBufHdr(bufdesc);
+		UnlockBufHdr(bufdesc, buf_state);
+
+		/* If the page is dirty or invalid, skip it */
+		if ((buf_state & BM_DIRTY) || !(buf_state & BM_TAG_VALID))
+		{
+			LWLockRelease(partLock);
+			return true;
+		}
+
+		/*
+		 * Read the buffer from disk, taking an I/O lock to prevent torn-page
+		 * reads, in the unlikely event that it was concurrently dirtied and
+		 * flushed.
+		 */
+		LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
+		smgrread(smgr, forknum, blkno, buffer);
+		LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
+	}
+	else
+	{
+		/*
+		 * Simply read the buffer.  There's no risk of modification on it as
+		 * we are holding the buffer pool partition mapping lock.
+		 */
+		smgrread(smgr, forknum, blkno, buffer);
+	}
+
+	/* buffer lookup done, so now do its check */
+	LWLockRelease(partLock);
+
+	return PageIsVerified(buffer, blkno);
+}
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index b4d55e849b..e2279af1e5 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -69,6 +69,7 @@ OBJS = \
 	oid.o \
 	oracle_compat.o \
 	orderedsetaggs.o \
+	pagefuncs.o \
 	partitionfuncs.o \
 	pg_locale.o \
 	pg_lsn.o \
diff --git a/src/backend/utils/adt/pagefuncs.c b/src/backend/utils/adt/pagefuncs.c
new file mode 100644
index 0000000000..7de415fc59
--- /dev/null
+++ b/src/backend/utils/adt/pagefuncs.c
@@ -0,0 +1,219 @@
+/*-------------------------------------------------------------------------
+ *
+ * pagefuncs.c
+ *	  Functions for page related features.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/pagefuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/relation.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/lmgr.h"
+#include "storage/smgr.h"
+#include "utils/builtins.h"
+
+static void check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+							   Oid relid, ForkNumber single_forknum);
+static void check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+								Relation relation, ForkNumber forknum);
+
+/*
+ * callback arguments for check_pages_error_callback()
+ */
+typedef struct CheckPagesErrorInfo
+{
+	char       *path;
+	BlockNumber	blkno;
+} CheckPagesErrorInfo;
+
+/*
+ * Error callback specific to check_relation_fork().
+ */
+static void
+check_pages_error_callback(void *arg)
+{
+	CheckPagesErrorInfo *errinfo = (CheckPagesErrorInfo *) arg;
+
+	errcontext("while checking page %u of path %s",
+			   errinfo->blkno, errinfo->path);
+}
+
+/*
+ * pg_relation_check_pages
+ *
+ * Check the state of all the pages for one or more fork types in the given
+ * relation.
+ */
+Datum
+pg_relation_check_pages(PG_FUNCTION_ARGS)
+{
+	Oid			relid;
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	ForkNumber	forknum;
+
+	/* Switch into long-lived context to construct returned data structures */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	/* handle arguments */
+	if (PG_ARGISNULL(0))
+	{
+		/* Just leave if nothing is defined */
+		PG_RETURN_VOID();
+	}
+
+	/* By default all the forks of a relation are checked */
+	if (PG_ARGISNULL(1))
+		forknum = InvalidForkNumber;
+	else
+	{
+		const char *forkname = TextDatumGetCString(PG_GETARG_TEXT_PP(1));
+		forknum = forkname_to_number(forkname);
+	}
+
+	relid = PG_GETARG_OID(0);
+
+	check_one_relation(tupdesc, tupstore, relid, forknum);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * Perform the check on a single relation, possibly filtered with a single
+ * fork.  This function will check if the given relation exists or not, as
+ * a relation could be dropped after checking for the list of relations and
+ * before getting here, and we don't want to error out in this case.
+ */
+static void
+check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+					 Oid relid, ForkNumber single_forknum)
+{
+	Relation	relation;
+	ForkNumber	forknum;
+
+	relation = relation_open(relid, AccessShareLock);
+
+	/* sanity checks, returning no results if no support */
+	if (!RELKIND_HAS_STORAGE(relation->rd_rel->relkind) ||
+		relation->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
+	{
+		relation_close(relation, AccessShareLock);
+		return;
+	}
+
+	RelationOpenSmgr(relation);
+
+	for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+	{
+		if (single_forknum != InvalidForkNumber && single_forknum != forknum)
+			continue;
+
+		if (smgrexists(relation->rd_smgr, forknum))
+			check_relation_fork(tupdesc, tupstore, relation, forknum);
+	}
+
+	relation_close(relation, AccessShareLock);
+}
+
+/*
+ * For a given relation and fork, do the real work of iterating over all pages
+ * and doing the check.  The caller must hold an AccessShareLock on the given
+ * relation.
+ */
+static void
+check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+					Relation relation, ForkNumber forknum)
+{
+	BlockNumber blkno,
+				nblocks;
+	SMgrRelation smgr = relation->rd_smgr;
+	char	   *path;
+	CheckPagesErrorInfo	errinfo;
+	ErrorContextCallback errcallback;
+
+	/* Number of output arguments in the SRF */
+#define PG_CHECK_RELATION_COLS			2
+
+	Assert(CheckRelationLockedByMe(relation, AccessShareLock, true));
+
+	/*
+	 * We remember the number of blocks here.  Since caller must hold a lock on
+	 * the relation, we know that it won't be truncated while we're iterating
+	 * over the blocks.  Any block added after this function started won't be
+	 * checked, but this is out of scope as such pages will be flushed before
+	 * the next checkpoint's completion.
+	 */
+	nblocks = RelationGetNumberOfBlocksInFork(relation, forknum);
+
+	path = relpathbackend(smgr->smgr_rnode.node,
+						  smgr->smgr_rnode.backend,
+						  forknum);
+
+	/*
+	 * Error context to print some information about blocks and relations
+	 * impacted by corruptions.
+	 */
+	errinfo.path = pstrdup(path);
+	errinfo.blkno = 0;
+	errcallback.callback = check_pages_error_callback;
+	errcallback.arg = (void *) &errinfo;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		Datum		values[PG_CHECK_RELATION_COLS];
+		bool		nulls[PG_CHECK_RELATION_COLS];
+		int			i = 0;
+
+		/* Update block number for the error context */
+		errinfo.blkno = blkno;
+
+		/* Check the given buffer */
+		if (CheckBuffer(smgr, forknum, blkno))
+			continue;
+
+		memset(values, 0, sizeof(values));
+		memset(nulls, 0, sizeof(nulls));
+
+		values[i++] = CStringGetTextDatum(path);
+		values[i++] = UInt32GetDatum(blkno);
+
+		Assert(i == PG_CHECK_RELATION_COLS);
+
+		/* Save the corrupted blocks in the tuplestore. */
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+
+	pfree(path);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index a6d2ffbf9e..a845af71fd 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -7,6 +7,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = \
 		  brin \
 		  commit_ts \
+		  check_relation \
 		  delay_execution \
 		  dummy_index_am \
 		  dummy_seclabel \
diff --git a/src/test/modules/check_relation/.gitignore b/src/test/modules/check_relation/.gitignore
new file mode 100644
index 0000000000..871e943d50
--- /dev/null
+++ b/src/test/modules/check_relation/.gitignore
@@ -0,0 +1,2 @@
+# Generated by test suite
+/tmp_check/
diff --git a/src/test/modules/check_relation/Makefile b/src/test/modules/check_relation/Makefile
new file mode 100644
index 0000000000..a540cdece2
--- /dev/null
+++ b/src/test/modules/check_relation/Makefile
@@ -0,0 +1,14 @@
+# src/test/modules/check_relation/Makefile
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/check_relation
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/check_relation/README b/src/test/modules/check_relation/README
new file mode 100644
index 0000000000..415c4b21ad
--- /dev/null
+++ b/src/test/modules/check_relation/README
@@ -0,0 +1,23 @@
+src/test/modules/check_relation/README
+
+Regression tests for online checksums verification
+==================================================
+
+This directory contains a test suite for online checksums verification.
+
+Running the tests
+=================
+
+NOTE: You must have given the --enable-tap-tests argument to configure.
+
+Run
+    make check
+or
+    make installcheck
+You can use "make installcheck" if you previously did "make install".
+In that case, the code in the installation tree is tested.  With
+"make check", a temporary installation tree is built from the current
+sources and then tested.
+
+Either way, this test initializes, starts, and stops a test Postgres
+cluster.
diff --git a/src/test/modules/check_relation/t/001_checksums_check.pl b/src/test/modules/check_relation/t/001_checksums_check.pl
new file mode 100644
index 0000000000..73a7f2d849
--- /dev/null
+++ b/src/test/modules/check_relation/t/001_checksums_check.pl
@@ -0,0 +1,246 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 52;
+
+our $CHECKSUM_UINT16_OFFSET = 4;
+our $PD_UPPER_UINT16_OFFSET = 7;
+our $BLOCKSIZE;
+our $TOTAL_NB_ERR = 0;
+
+sub get_block
+{
+	my ($filename, $blkno) = @_;
+	my $block;
+
+	open(my $infile, '<', $filename) or die;
+	binmode($infile);
+
+	my $success = read($infile, $block, $BLOCKSIZE, ($blkno * $BLOCKSIZE));
+	die($!) if not defined $success;
+
+	close($infile);
+
+	return($block);
+}
+
+sub overwrite_block
+{
+	my ($filename, $block, $blkno) = @_;
+
+	open(my $outfile, '>', $filename) or die;
+	binmode ($outfile);
+
+	my $nb = syswrite($outfile, $block, $BLOCKSIZE, ($blkno * $BLOCKSIZE));
+
+	die($!) if not defined $nb;
+	die("Write error") if ($nb != $BLOCKSIZE);
+
+	$outfile->flush();
+
+	close($outfile);
+}
+
+sub get_uint16_from_page
+{
+	my ($block, $offset) = @_;
+
+	return (unpack("S*", $block))[$offset];
+}
+
+sub set_uint16_to_page
+{
+	my ($block, $data, $offset) = @_;
+
+	my $pack = pack("S", $data);
+
+	# vec with 16B or more won't preserve endianness
+	vec($block, 2*$offset, 8) = (unpack('C*', $pack))[0];
+	vec($block, (2*$offset) + 1, 8) = (unpack('C*', $pack))[1];
+
+	return $block;
+}
+
+sub check_checksums_call
+{
+	my ($node, $relname) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+		. " FROM pg_catalog.pg_relation_check_pages('$relname')"
+	);
+
+	 return ($stderr eq '');
+}
+
+sub check_checksums_nb_error
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT COUNT(*)"
+        . " FROM (SELECT pg_catalog.pg_relation_check_pages(oid, 'main')"
+        . "   FROM pg_class WHERE relkind in ('r', 'i', 'm')) AS s"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	like($stderr, $pattern, 'Error output should match expectations');
+	is($stdout, $nb, "Should have $nb error");
+
+	$TOTAL_NB_ERR += $nb;
+}
+
+sub check_pg_stat_database_nb_error
+{
+	my ($node) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT "
+		. " sum(checksum_failures)"
+		. " FROM pg_catalog.pg_stat_database"
+	);
+
+	is($cmdret, 0, 'Function should run successfully');
+	is($stderr, '', 'Function should run successfully');
+	is($stdout, $TOTAL_NB_ERR, "Should have $TOTAL_NB_ERR error");
+}
+
+sub get_checksums_errors
+{
+	my ($node, $nb, $pattern) = @_;
+
+	my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT "
+		. "relname, failed_block_num"
+		. " FROM (SELECT relname, (pg_catalog.pg_relation_check_pages(oid)).*"
+        . "   FROM pg_class WHERE relkind in ('r','i', 'm')) AS s"
+	);
+
+	 is($cmdret, '0', 'Function should run successfully');
+	 like($stderr, $pattern, 'Error output should match expectations');
+
+	 $TOTAL_NB_ERR += $nb;
+
+	 return $stdout;
+}
+
+# This function will perform various test by modifying the specified block at
+# the specified uint16 offset, checking that the corruption is correctly
+# detected, and finally restore the specified block to its original content.
+sub corrupt_and_test_block
+{
+	my ($node, $filename, $blkno, $offset, $fake_data, $error_pattern, $test_prefix) = @_;
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+
+	check_pg_stat_database_nb_error($node);
+
+	$node->stop();
+
+	my $original_block = get_block($filename, $blkno);
+	my $original_data = get_uint16_from_page($original_block, $offset);
+
+	isnt($original_data, $fake_data,
+		"$test_prefix: The fake data at offset $offset should be different"
+		. " from the existing one");
+
+	my $new_block = set_uint16_to_page($original_block, $fake_data, $offset);
+	isnt($original_data, get_uint16_from_page($new_block, $offset),
+		"$test_prefix: The fake data at offset $offset should have been changed in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+
+	my $written_data = get_uint16_from_page(get_block($filename, $blkno), $offset);
+	isnt($original_data, $written_data,
+		"$test_prefix: The data written at offset $offset should be different"
+		. " from the original one");
+	is(get_uint16_from_page($new_block, $offset), $written_data,
+		"$test_prefix: The data written at offset $offset should be the same"
+		. " as the one in memory");
+	is($written_data, $fake_data,
+		"$test_prefix: The data written at offset $offset should be the one"
+		. " we wanted to write");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 1, $error_pattern);
+
+	my $det = get_checksums_errors($node, 1, $error_pattern);
+	is($det, "t1|0",
+		"$test_prefix: The checksums error for modification at offset $offset"
+		. " should be detected");
+
+	$node->stop();
+
+	$new_block = set_uint16_to_page($original_block, $original_data, $offset);
+	is($original_data, get_uint16_from_page($new_block, $offset),
+		"$test_prefix: The data at offset $offset should have been restored in memory");
+
+	overwrite_block($filename, $new_block, $blkno);
+	is($original_data, get_uint16_from_page(get_block($filename, $blkno),
+			$offset),
+		"$test_prefix: The data at offset $offset should have been restored on disk");
+
+	$node->start();
+
+	check_checksums_nb_error($node, 0, qr/^$/);
+}
+
+if (exists $ENV{MY_PG_REGRESS})
+{
+	$ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS};
+}
+
+my $node = get_new_node('main');
+
+my %params;
+$params{'extra'} = ['--data-checksums'];
+$node->init(%params);
+
+$node->start();
+
+$ENV{PGOPTIONS} = '--client-min-messages=WARNING';
+
+my ($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_checksums')");
+
+is($stdout, 'on', 'Data checksums should be enabled');
+
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('block_size')");
+
+$BLOCKSIZE = $stdout;
+
+# Basic schema to corrupt and check
+$node->safe_psql(
+	'postgres', q|
+	CREATE TABLE public.t1(id integer);
+	INSERT INTO public.t1 SELECT generate_series(1, 100);
+	CHECKPOINT;
+|);
+
+# get the underlying heap absolute path
+($cmdret, $stdout, $stderr) = $node->psql('postgres', "SELECT"
+	. " current_setting('data_directory') || '/' || pg_relation_filepath('t1')"
+);
+
+isnt($stdout, '', 'A relfilenode should be returned');
+
+my $filename = $stdout;
+
+check_checksums_nb_error($node, 0, qr/^$/);
+
+check_pg_stat_database_nb_error($node);
+
+my $fake_uint16 = hex '0x0000';
+
+# Test with a modified checksum.  We use a zero checksum here as it's the only
+# one that cannot exist on a checksummed page.  We also don't have an easy way
+# to compute what the checksum would be after a modification in a random place
+# in the block.
+corrupt_and_test_block($node, $filename, 0, $CHECKSUM_UINT16_OFFSET,
+		       $fake_uint16, qr/WARNING.*page verification failed/s,
+		       'corrupted page: ');
+
+# Test corruption making the block looks like it's PageIsNew().  This should
+# pass.
+corrupt_and_test_block($node, $filename, 0, $PD_UPPER_UINT16_OFFSET,
+	$fake_uint16, qr/^$/, 'new page: ');
diff --git a/src/test/regress/expected/pagefuncs.out b/src/test/regress/expected/pagefuncs.out
new file mode 100644
index 0000000000..38a72b01b3
--- /dev/null
+++ b/src/test/regress/expected/pagefuncs.out
@@ -0,0 +1,72 @@
+--
+-- Tests for functions related to relation pages
+--
+-- Restricted to superusers by default
+CREATE ROLE regress_pgfunc_user;
+SET ROLE regress_pgfunc_user;
+SELECT pg_relation_check_pages('pg_class'); -- error
+ERROR:  permission denied for function pg_relation_check_pages
+SELECT pg_relation_check_pages('pg_class', 'main'); -- error
+ERROR:  permission denied for function pg_relation_check_pages
+RESET ROLE;
+DROP ROLE regress_pgfunc_user;
+-- NULL and simple sanity checks
+SELECT pg_relation_check_pages(NULL); -- empty result
+ pg_relation_check_pages 
+-------------------------
+(0 rows)
+
+SELECT pg_relation_check_pages(NULL, NULL); -- empty result
+ pg_relation_check_pages 
+-------------------------
+(0 rows)
+
+SELECT pg_relation_check_pages('pg_class', 'invalid_fork'); -- error
+ERROR:  invalid fork name
+HINT:  Valid fork names are "main", "fsm", "vm", and "init".
+-- Relation types that are supported
+CREATE TABLE pgfunc_test_tab (id int);
+CREATE INDEX pgfunc_test_ind ON pgfunc_test_tab(id);
+INSERT INTO pgfunc_test_tab VALUES (generate_series(1,1000));
+SELECT pg_relation_check_pages('pgfunc_test_tab');
+ pg_relation_check_pages 
+-------------------------
+(0 rows)
+
+SELECT pg_relation_check_pages('pgfunc_test_ind');
+ pg_relation_check_pages 
+-------------------------
+(0 rows)
+
+DROP TABLE pgfunc_test_tab;
+CREATE MATERIALIZED VIEW pgfunc_test_matview AS SELECT 1;
+SELECT pg_relation_check_pages('pgfunc_test_matview');
+ pg_relation_check_pages 
+-------------------------
+(0 rows)
+
+DROP MATERIALIZED VIEW pgfunc_test_matview;
+CREATE SEQUENCE pgfunc_test_seq;
+SELECT pg_relation_check_pages('pgfunc_test_seq');
+ pg_relation_check_pages 
+-------------------------
+(0 rows)
+
+DROP SEQUENCE pgfunc_test_seq;
+-- pg_relation_check_pages() returns no results if passed relations that
+-- do not support the operation, like relations without storage or temporary
+-- relations.
+CREATE TEMPORARY TABLE pgfunc_test_temp AS SELECT generate_series(1,10) AS a;
+SELECT pg_relation_check_pages('pgfunc_test_temp');
+ pg_relation_check_pages 
+-------------------------
+(0 rows)
+
+DROP TABLE pgfunc_test_temp;
+CREATE VIEW pgfunc_test_view AS SELECT 1;
+SELECT pg_relation_check_pages('pgfunc_test_view');
+ pg_relation_check_pages 
+-------------------------
+(0 rows)
+
+DROP VIEW pgfunc_test_view;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index ae89ed7f0b..7a46a13252 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -112,7 +112,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
 # ----------
 # Another group of parallel tests
 # ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain pagefuncs
 
 # event triggers cannot run concurrently with any test that runs DDL
 test: event_trigger
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 525bdc804f..9a80b80f73 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -197,6 +197,7 @@ test: hash_part
 test: indexing
 test: partition_aggregate
 test: partition_info
+test: pagefuncs
 test: tuplesort
 test: explain
 test: event_trigger
diff --git a/src/test/regress/sql/pagefuncs.sql b/src/test/regress/sql/pagefuncs.sql
new file mode 100644
index 0000000000..12d32eeae4
--- /dev/null
+++ b/src/test/regress/sql/pagefuncs.sql
@@ -0,0 +1,41 @@
+--
+-- Tests for functions related to relation pages
+--
+
+-- Restricted to superusers by default
+CREATE ROLE regress_pgfunc_user;
+SET ROLE regress_pgfunc_user;
+SELECT pg_relation_check_pages('pg_class'); -- error
+SELECT pg_relation_check_pages('pg_class', 'main'); -- error
+RESET ROLE;
+DROP ROLE regress_pgfunc_user;
+
+-- NULL and simple sanity checks
+SELECT pg_relation_check_pages(NULL); -- empty result
+SELECT pg_relation_check_pages(NULL, NULL); -- empty result
+SELECT pg_relation_check_pages('pg_class', 'invalid_fork'); -- error
+
+-- Relation types that are supported
+CREATE TABLE pgfunc_test_tab (id int);
+CREATE INDEX pgfunc_test_ind ON pgfunc_test_tab(id);
+INSERT INTO pgfunc_test_tab VALUES (generate_series(1,1000));
+SELECT pg_relation_check_pages('pgfunc_test_tab');
+SELECT pg_relation_check_pages('pgfunc_test_ind');
+DROP TABLE pgfunc_test_tab;
+
+CREATE MATERIALIZED VIEW pgfunc_test_matview AS SELECT 1;
+SELECT pg_relation_check_pages('pgfunc_test_matview');
+DROP MATERIALIZED VIEW pgfunc_test_matview;
+CREATE SEQUENCE pgfunc_test_seq;
+SELECT pg_relation_check_pages('pgfunc_test_seq');
+DROP SEQUENCE pgfunc_test_seq;
+
+-- pg_relation_check_pages() returns no results if passed relations that
+-- do not support the operation, like relations without storage or temporary
+-- relations.
+CREATE TEMPORARY TABLE pgfunc_test_temp AS SELECT generate_series(1,10) AS a;
+SELECT pg_relation_check_pages('pgfunc_test_temp');
+DROP TABLE pgfunc_test_temp;
+CREATE VIEW pgfunc_test_view AS SELECT 1;
+SELECT pg_relation_check_pages('pgfunc_test_view');
+DROP VIEW pgfunc_test_view;
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index c99499e52b..57a45cf0d3 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26219,6 +26219,56 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
 
   </sect2>
 
+  <sect2 id="functions-data-sanity">
+   <title>Data Sanity Functions</title>
+
+   <para>
+    The functions shown in <xref linkend="functions-data-sanity-table"/>
+    provide ways to check the sanity of data files in the cluster.
+   </para>
+
+   <table id="functions-data-sanity-table">
+    <title>Data Sanity Functions</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry>
+        <literal><function>pg_relation_check_pages(<parameter>relation</parameter> <type>regclass</type> [, <parameter>fork</parameter> <type>text</type> <literal>DEFAULT</literal> <literal>NULL</literal> ])</function></literal>
+       </entry>
+       <entry><type>setof record</type></entry>
+       <entry>Check the pages of a relation.
+       </entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <indexterm>
+    <primary>pg_relation_check_pages</primary>
+   </indexterm>
+   <para id="functions-check-relation-note" xreflabel="pg_relation_check_pages">
+    <function>pg_relation_check_pages</function> iterates over all the blocks
+    of a given relation and verifies whether they are in a state where they
+    can safely be loaded into shared buffers.  If defined,
+    <replaceable>fork</replaceable> restricts the check to the pages of that
+    single fork.  The fork can be <literal>'main'</literal> for the
+    main data fork, <literal>'fsm'</literal> for the free space map,
+    <literal>'vm'</literal> for the visibility map, or
+    <literal>'init'</literal> for the initialization fork.  The default of
+    <literal>NULL</literal> means that all the forks of the relation are
+    checked.  The function returns the list of blocks considered as
+    corrupted, together with the path of the file containing them.  Use of
+    this function is restricted to superusers by default, but access may be
+    granted to others using <command>GRANT</command>.
+   </para>
+
+  </sect2>
+
   </sect1>
 
   <sect1 id="functions-trigger">
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 90594bd41b..adbabba3c4 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -50,7 +50,8 @@ my @contrib_excludes = (
 	'pgcrypto',         'sepgsql',
 	'brin',             'test_extensions',
 	'test_misc',        'test_pg_dump',
-	'snapshot_too_old', 'unsafe_tests');
+	'snapshot_too_old', 'unsafe_tests',
+	'check_relation');
 
 # Set of variables for frontend modules
 my $frontend_defines = { 'initdb' => 'FRONTEND' };
#58Julien Rouhaud
rjuju123@gmail.com
In reply to: Michael Paquier (#57)
Re: Online checksums verification in the backend

On Fri, Oct 23, 2020 at 3:28 PM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Oct 19, 2020 at 04:52:48PM +0900, Michael Paquier wrote:

No issues with reporting the block number and the fork type in the SRF
at least of course as this is helpful to detect the position of the
broken blocks. For the checksum found in the header and the one
calculated (named expected in the patch), I am less sure how to put a
clear definition on top of that but we could always consider that
later and extend the SRF as needed. Once the user knows that both do
not match, he/she knows that there is a problem.

So, I have reviewed your patch set, and heavily reworked the logic to
be more consistent on many things, resulting in a largely simplified
patch without sacrificing its usefulness:

Thanks!

- Removal of the dependency with checksums for this feature. While
simplifying the code, I have noticed that this feature can also be
beneficial for clusters that do not have data checksums, as
PageIsVerified() is perfectly able to run some page header checks and
the zero-page case. That's of course less useful than having the
checksums, but there is no need to add a dependency here. The file
for the SQL functions is renamed from checksumfuncs.c to pagefuncs.c.

I agree. However I'm assuming that this refactor is relying on the
not yet committed patch (in the nearby thread dealing with base backup
checksums check) to also refactor PageIsVerified? As all the logic
you removed was done to avoid spamming a lot of warnings when calling
the function.

- The function is changed to return no tuples if the relkind is not
supported, and the same applies for temporary relations. That's more
consistent with other system functions like the ones in charge of
partition information, and this makes full scans of pg_class much
easier to work with. Temporary tables were not handled correctly
anyway as these are in local buffers, but the use case of this
function in this case is not really obvious to me.

Agreed

- Having the forknum in the SRF is kind of confusing, as the user
would need to map a number with the physical on-disk name. Instead, I
have made the choice to return the *path* of the corrupted file with a
block number. This way, an operator can know immediately where a
problem comes from. The patch does not append the segment number, and
I am not sure if we should actually do that, but adding it is
straight-forward as we have the block number. There is a dependency
with table AMs here as well, as this goes down to fd.c, explaining why
I have not added it for now.

That's a clear improvement, thanks!
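As a side note for readers following along, the path reported by the function follows PostgreSQL's usual on-disk naming scheme. A minimal sketch of that mapping, assuming the default tablespace and a permanent relation (the helper name is mine, not part of the patch):

```python
def relation_file_path(dboid, relfilenode, fork="main", segno=0):
    """Sketch of PostgreSQL's on-disk file naming for a relation stored
    in the default tablespace: base/<database oid>/<relfilenode>."""
    # Non-main forks carry a suffix; the main data fork has none.
    suffix = {"main": "", "fsm": "_fsm", "vm": "_vm", "init": "_init"}[fork]
    name = f"{relfilenode}{suffix}"
    # Segments past the first get a ".<segno>" suffix.
    if segno > 0:
        name += f".{segno}"
    return f"base/{dboid}/{name}"
```

So a corrupted block in the visibility map of relfilenode 16384 in database 13425 would be reported under a path like base/13425/16384_vm, which is why appending the segment number, as mentioned above, would be straightforward.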

- I really don't know what kind of default ACL should apply for such
functions, but I am sure that SCAN_TABLES is not what we are looking
for here, and there is nothing preventing us from having a safe
default from the start of times, so I moved the function to be
superuser-only by default, and GRANT can be used to allow its
execution to other roles. We could relax that in the future, of
course, this can be discussed separately.

I don't have a strong opinion here, SCAN_TABLES was maybe not ideal.
No objections.

- The WARNING report for each block found as corrupted gains an error
context, as a result of a switch to PageIsVerified(), giving a user
all the information needed in the logs on top of the result in the
SRF. That's useful as well if combined with CHECK_FOR_INTERRUPTS(),
and I got to wonder if we should have some progress report for this
stuff, though that's a separate discussion.

Mmm, is it really an improvement to report warnings during this
function execution? Note also that PageIsVerified as-is won't report
a warning if a page is found as PageIsNew() but isn't actually all
zero, while still being reported as corrupted by the SRF.
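For context, here is a rough Python sketch of the decision logic PageIsVerified() applies, with the actual checksum computation abstracted away (the simplified header checks and the little-endian assumption are mine, not the patch's code):

```python
import struct

BLCKSZ = 8192
SIZEOF_PAGE_HEADER = 24      # sizeof(PageHeaderData)
PD_VALID_FLAG_BITS = 0x0007  # all currently defined pd_flags bits

def page_is_verified(page, checksums_enabled=False, computed_checksum=None):
    """Return True if the page looks safe to load into shared buffers.

    computed_checksum stands in for pg_checksum_page(); it is compared
    against the pd_checksum field stored in the page header.
    """
    # pd_checksum, pd_flags, pd_lower, pd_upper, pd_special start at byte 8,
    # right after the 8-byte pd_lsn (assuming a little-endian platform).
    pd_checksum, pd_flags, pd_lower, pd_upper, pd_special = \
        struct.unpack_from("<5H", page, 8)
    if pd_upper != 0:  # PageIsNew() is simply "pd_upper == 0"
        checksum_ok = not checksums_enabled or computed_checksum == pd_checksum
        header_sane = ((pd_flags & ~PD_VALID_FLAG_BITS) == 0 and
                       SIZEOF_PAGE_HEADER <= pd_lower <= pd_upper
                       <= pd_special <= BLCKSZ)
        return header_sane and checksum_ok
    # A "new" page is only valid if it is all zeroes; a new-looking page
    # with stray non-zero bytes fails without any checksum failure being
    # counted, which is the asymmetry discussed here.
    return not any(page)
```

This makes the point visible: the zero-page and header checks can catch corruption even without data checksums, but only the checksum path feeds the pg_stat_database failure counter.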

Have you also considered that it's possible to execute
pg_relation_check_pages with ignore_checksum_failure = on? That's
evidently a bad idea, but doing so would report some of the data
corruption as warnings while still not reporting anything in the SRF.

Having some progress report would be nice to have, but +1 to have a
separate discussion for that.

- The function is renamed to something less generic,
pg_relation_check_pages(), and I have reduced the number of functions
from two to one, where the user can specify the fork name with a new
option. The default of NULL means that all the forks of a relation
are checked.

Ok.

- The TAP tests are rather bulky. I have moved all the non-corruption
test cases into a new SQL test file. That's useful for people willing
to do some basic sanity checks with a non-default table AM. At least
it provides a minimum coverage.

Agreed

#59Michael Paquier
michael@paquier.xyz
In reply to: Julien Rouhaud (#58)
Re: Online checksums verification in the backend

On Fri, Oct 23, 2020 at 04:31:56PM +0800, Julien Rouhaud wrote:

I agree. However I'm assuming that this refactor is relying on the
not yet committed patch (in the nearby thread dealing with base backup
checksums check) to also refactor PageIsVerified? As all the logic
you removed was done to avoid spamming a lot of warnings when calling
the function.

Yeah, it should use a refactored version, but I was as well in the
mood of looking at version based on what we have now on HEAD. Even if
I am not completely clear where the patch for page verification and
base backups will go, I was thinking as well to do the refactoring
introducing PageIsVerifiedExtended() first, before considering the
next steps for this thread. It seems to me that the path where we
generate no WARNINGs at all makes the whole experience more consistent
for the user with this function.

Mmm, is it really an improvement to report warnings during this
function execution? Note also that PageIsVerified as-is won't report
a warning if a page is found as PageIsNew() but isn't actually all
zero, while still being reported as corrupted by the SRF.

Yep, joining the point above to just have no WARNINGs at all.

Have you also considered that it's possible to execute
pg_relation_check_pages with ignore_checksum_failure = on? That's
evidently a bad idea, but doing so would report some of the data
corruption as warnings while still not reporting anything in the SRF.

Yeah, I thought about that as well, but I did not see a strong
argument against preventing this behavior either, even if it sounds
a bit strange. We could always tune that later depending on the
feedback.
--
Michael

#60Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#59)
1 attachment(s)
Re: Online checksums verification in the backend

On Fri, Oct 23, 2020 at 06:06:30PM +0900, Michael Paquier wrote:

On Fri, Oct 23, 2020 at 04:31:56PM +0800, Julien Rouhaud wrote:

Mmm, is it really an improvement to report warnings during this
function execution? Note also that PageIsVerified as-is won't report
a warning if a page is found as PageIsNew() but isn't actually all
zero, while still being reported as corrupted by the SRF.

Yep, joining the point of above to just have no WARNINGs at all.

Now that we have d401c57, I got to consider this one more, and opted
for not generating a WARNING for now. Hence, PageIsVerifiedExtended()
is disabled regarding that, but we still report a checksum failure in
it.

I have spent some time reviewing the tests, as I felt this was
bulky. In the reworked version attached, I have reduced the number of
tests by half, without reducing the coverage, mainly:
- Removed all the stderr and the return code tests, as we always
expected the commands to succeed, and safe_psql() can do the job
already.
- Merged of the queries using pg_relation_check_pages into a single
routine, with the expected output (set of broken pages returned in the
SRF) in the arguments.
- Added some prefixes to the tests, to generate unique test names.
That makes debugging easier.
- The query on pg_stat_database is run once at the beginning, once at
the end with the number of checksum failures correctly updated.
- Added comments to document all the routines, and renamed some of
them mostly for consistency.
- Skipped system relations from the scan of pg_class, as they made the
test more costly for nothing.
- I ran some tests on Windows, just-in-case.
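For what it's worth, the corruption emulation used by the TAP test boils down to seeking to a block and overwriting one 16-bit field in place; here is the same idea as a compact Python sketch (the helper name and the little-endian assumption are mine):

```python
import struct

BLCKSZ = 8192
CHECKSUM_UINT16_OFFSET = 4  # pd_checksum is the 5th uint16 of the page header

def overwrite_uint16(path, blkno, uint16_offset, value):
    """Overwrite one 16-bit field of block blkno in place, returning the
    previous value so the caller can restore it afterwards."""
    byte_off = blkno * BLCKSZ + 2 * uint16_offset
    with open(path, "r+b") as f:  # open read/write without truncating
        f.seek(byte_off)
        (old,) = struct.unpack("<H", f.read(2))
        f.seek(byte_off)
        f.write(struct.pack("<H", value))
    return old
```

Restoring the page is then just overwrite_uint16(path, blkno, offset, old), mirroring what corrupt_and_test_block() does with set_uint16_to_page(), with the server stopped around the whole operation so no flush can race with the on-disk change.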

I have also added a SearchSysCacheExists1() call to check whether the
relation still exists before opening it, added a
CHECK_FOR_INTERRUPTS() within the main check loop (where the error
context is really helpful), indented the code, bumped the catalogs
(mostly a self-reminder), etc.

So, what do you think?
--
Michael

Attachments:

online-checksum-v19.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index f44a09b0c2..e522477780 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202010201
+#define CATALOG_VERSION_NO	202010271
 
 #endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index bbcac69d48..a66870bcc0 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10958,6 +10958,13 @@
   proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}',
   proargnames => '{tablespace,name,size,modification}',
   prosrc => 'pg_ls_tmpdir_1arg' },
+{ oid => '9147', descr => 'check pages of a relation',
+  proname => 'pg_relation_check_pages', procost => '10000', prorows => '20',
+  proisstrict => 'f', proretset => 't', provolatile => 'v', proparallel => 'r',
+  prorettype => 'record', proargtypes => 'regclass text',
+  proallargtypes => '{regclass,text,text,int8}', proargmodes => '{i,i,o,o}',
+  proargnames => '{relation,fork,path,failed_block_num}',
+  prosrc => 'pg_relation_check_pages' },
 
 # hash partitioning constraint function
 { oid => '5028', descr => 'hash partition CHECK constraint',
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ee91b8fa26..a21cab2eaf 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -240,6 +240,9 @@ extern void AtProcExit_LocalBuffers(void);
 
 extern void TestForOldSnapshot_impl(Snapshot snapshot, Relation relation);
 
+extern bool CheckBuffer(struct SMgrRelationData *smgr, ForkNumber forknum,
+						BlockNumber blkno);
+
 /* in freelist.c */
 extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype);
 extern void FreeAccessStrategy(BufferAccessStrategy strategy);
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 85cd147e21..c6dd084fbc 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1300,6 +1300,14 @@ LANGUAGE INTERNAL
 STRICT VOLATILE
 AS 'pg_create_logical_replication_slot';
 
+CREATE OR REPLACE FUNCTION pg_relation_check_pages(
+    IN relation regclass, IN fork text DEFAULT NULL,
+    OUT path text, OUT failed_block_num bigint)
+RETURNS SETOF record
+LANGUAGE internal
+VOLATILE PARALLEL RESTRICTED
+AS 'pg_relation_check_pages';
+
 CREATE OR REPLACE FUNCTION
   make_interval(years int4 DEFAULT 0, months int4 DEFAULT 0, weeks int4 DEFAULT 0,
                 days int4 DEFAULT 0, hours int4 DEFAULT 0, mins int4 DEFAULT 0,
@@ -1444,6 +1452,7 @@ AS 'unicode_is_normalized';
 -- can later change who can access these functions, or leave them as only
 -- available to superuser / cluster owner, if they choose.
 --
+REVOKE EXECUTE ON FUNCTION pg_relation_check_pages(regclass, text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_start_backup(text, boolean, boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_stop_backup() FROM public;
 REVOKE EXECUTE ON FUNCTION pg_stop_backup(boolean, boolean) FROM public;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3eee86afe5..eb5c917074 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4585,3 +4585,95 @@ TestForOldSnapshot_impl(Snapshot snapshot, Relation relation)
 				(errcode(ERRCODE_SNAPSHOT_TOO_OLD),
 				 errmsg("snapshot too old")));
 }
+
+
+/*
+ * CheckBuffer
+ *
+ * Check the state of a buffer without loading it into the shared buffers. To
+ * avoid torn pages and possible false positives when reading data, a shared
+ * LWLock is taken on the target buffer pool partition mapping, and we check
+ * if the page is in shared buffers or not.  An I/O lock is taken on the block
+ * to prevent any concurrent activity from happening.
+ *
+ * If the page is found as dirty in the shared buffers, it is ignored as
+ * it will be flushed to disk either before the end of the next checkpoint
+ * or during recovery in the event of an unsafe shutdown.
+ *
+ * If the page is found in the shared buffers but is not dirty, we still
+ * check the state of its data on disk, as it could be possible that the
+ * page stayed in shared buffers for a rather long time while the on-disk
+ * data got corrupted.
+ *
+ * If the page is not found in shared buffers, the block is read from disk
+ * while holding the buffer pool partition mapping LWLock.
+ *
+ * The page data is stored in a private memory area local to this function
+ * while running the checks.
+ */
+bool
+CheckBuffer(SMgrRelation smgr, ForkNumber forknum, BlockNumber blkno)
+{
+	char		buffer[BLCKSZ];
+	BufferTag	buf_tag;		/* identity of requested block */
+	uint32		buf_hash;		/* hash value for buf_tag */
+	LWLock	   *partLock;		/* buffer partition lock for the buffer */
+	BufferDesc *bufdesc;
+	int			buf_id;
+
+	Assert(smgrexists(smgr, forknum));
+
+	/* create a tag so we can look up the buffer */
+	INIT_BUFFERTAG(buf_tag, smgr->smgr_rnode.node, forknum, blkno);
+
+	/* determine its hash code and partition lock ID */
+	buf_hash = BufTableHashCode(&buf_tag);
+	partLock = BufMappingPartitionLock(buf_hash);
+
+	/* see if the block is in the buffer pool or not */
+	LWLockAcquire(partLock, LW_SHARED);
+	buf_id = BufTableLookup(&buf_tag, buf_hash);
+	if (buf_id >= 0)
+	{
+		uint32		buf_state;
+
+		/*
+		 * Found it.  Now, retrieve its state to know what to do with it, and
+		 * release the pin immediately.  We do so to limit overhead as much as
+		 * possible.  We keep the shared lightweight lock on the target buffer
+		 * mapping partition for now, so this buffer cannot be evicted, and we
+		 * acquire an I/O Lock on the buffer as we may need to read its
+		 * contents from disk.
+		 */
+		bufdesc = GetBufferDescriptor(buf_id);
+
+		LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
+		buf_state = LockBufHdr(bufdesc);
+		UnlockBufHdr(bufdesc, buf_state);
+
+		/* If the page is dirty or invalid, skip it */
+		if ((buf_state & BM_DIRTY) || !(buf_state & BM_TAG_VALID))
+		{
+			LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
+			LWLockRelease(partLock);
+			return true;
+		}
+
+		/* Read the buffer from disk, with the I/O lock still held */
+		smgrread(smgr, forknum, blkno, buffer);
+		LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
+	}
+	else
+	{
+		/*
+		 * Simply read the buffer.  There's no risk of modification on it as
+		 * we are holding the buffer pool partition mapping lock.
+		 */
+		smgrread(smgr, forknum, blkno, buffer);
+	}
+
+	/* buffer lookup done, so now do its check */
+	LWLockRelease(partLock);
+
+	return PageIsVerifiedExtended(buffer, blkno, PIV_REPORT_STAT);
+}
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index b4d55e849b..e2279af1e5 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -69,6 +69,7 @@ OBJS = \
 	oid.o \
 	oracle_compat.o \
 	orderedsetaggs.o \
+	pagefuncs.o \
 	partitionfuncs.o \
 	pg_locale.o \
 	pg_lsn.o \
diff --git a/src/backend/utils/adt/pagefuncs.c b/src/backend/utils/adt/pagefuncs.c
new file mode 100644
index 0000000000..f34d56cf1f
--- /dev/null
+++ b/src/backend/utils/adt/pagefuncs.c
@@ -0,0 +1,230 @@
+/*-------------------------------------------------------------------------
+ *
+ * pagefuncs.c
+ *	  Functions for page related features.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/adt/pagefuncs.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/relation.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "storage/lmgr.h"
+#include "storage/smgr.h"
+#include "utils/builtins.h"
+#include "utils/syscache.h"
+
+static void check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+							   Oid relid, ForkNumber single_forknum);
+static void check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+								Relation relation, ForkNumber forknum);
+
+/*
+ * callback arguments for check_pages_error_callback()
+ */
+typedef struct CheckPagesErrorInfo
+{
+	char	   *path;
+	BlockNumber blkno;
+} CheckPagesErrorInfo;
+
+/*
+ * Error callback specific to check_relation_fork().
+ */
+static void
+check_pages_error_callback(void *arg)
+{
+	CheckPagesErrorInfo *errinfo = (CheckPagesErrorInfo *) arg;
+
+	errcontext("while checking page %u of path %s",
+			   errinfo->blkno, errinfo->path);
+}
+
+/*
+ * pg_relation_check_pages
+ *
+ * Check the state of all the pages for one or more fork types in the given
+ * relation.
+ */
+Datum
+pg_relation_check_pages(PG_FUNCTION_ARGS)
+{
+	Oid			relid;
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	ForkNumber	forknum;
+
+	/* Switch into long-lived context to construct returned data structures */
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	/* handle arguments */
+	if (PG_ARGISNULL(0))
+	{
+		/* Just leave if nothing is defined */
+		PG_RETURN_VOID();
+	}
+
+	/* By default all the forks of a relation are checked */
+	if (PG_ARGISNULL(1))
+		forknum = InvalidForkNumber;
+	else
+	{
+		const char *forkname = TextDatumGetCString(PG_GETARG_TEXT_PP(1));
+
+		forknum = forkname_to_number(forkname);
+	}
+
+	relid = PG_GETARG_OID(0);
+
+	check_one_relation(tupdesc, tupstore, relid, forknum);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * Perform the check on a single relation, possibly filtered with a single
+ * fork.  This function will check if the given relation exists or not, as
+ * a relation could be dropped after checking for the list of relations and
+ * before getting here, and we don't want to error out in this case.
+ */
+static void
+check_one_relation(TupleDesc tupdesc, Tuplestorestate *tupstore,
+				   Oid relid, ForkNumber single_forknum)
+{
+	Relation	relation;
+	ForkNumber	forknum;
+
+	/* Check if the relation exists, leaving if there is no such relation */
+	if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
+		return;
+
+	relation = relation_open(relid, AccessShareLock);
+
+	/*
+	 * Sanity checks, returning no results if not supported.  Temporary
+	 * relations and relations without storage are out of scope.
+	 */
+	if (!RELKIND_HAS_STORAGE(relation->rd_rel->relkind) ||
+		relation->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
+	{
+		relation_close(relation, AccessShareLock);
+		return;
+	}
+
+	RelationOpenSmgr(relation);
+
+	for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+	{
+		if (single_forknum != InvalidForkNumber && single_forknum != forknum)
+			continue;
+
+		if (smgrexists(relation->rd_smgr, forknum))
+			check_relation_fork(tupdesc, tupstore, relation, forknum);
+	}
+
+	relation_close(relation, AccessShareLock);
+}
+
+/*
+ * For a given relation and fork, do the real work of iterating over all pages
+ * and doing the check.  Caller must hold an AccessShareLock on the given
+ * relation.
+ */
+static void
+check_relation_fork(TupleDesc tupdesc, Tuplestorestate *tupstore,
+					Relation relation, ForkNumber forknum)
+{
+	BlockNumber blkno,
+				nblocks;
+	SMgrRelation smgr = relation->rd_smgr;
+	char	   *path;
+	CheckPagesErrorInfo errinfo;
+	ErrorContextCallback errcallback;
+
+	/* Number of output arguments in the SRF */
+#define PG_CHECK_RELATION_COLS			2
+
+	Assert(CheckRelationLockedByMe(relation, AccessShareLock, true));
+
+	/*
+	 * We remember the number of blocks here.  Since caller must hold a lock
+	 * on the relation, we know that it won't be truncated while we're
+	 * iterating over the blocks.  Any block added after this function started
+	 * won't be checked, but this is out of scope as such pages will be
+	 * flushed before the next checkpoint's completion.
+	 */
+	nblocks = RelationGetNumberOfBlocksInFork(relation, forknum);
+
+	path = relpathbackend(smgr->smgr_rnode.node,
+						  smgr->smgr_rnode.backend,
+						  forknum);
+
+	/*
+	 * Error context to print some information about blocks and relations
+	 * impacted by corruptions.
+	 */
+	errinfo.path = pstrdup(path);
+	errinfo.blkno = 0;
+	errcallback.callback = check_pages_error_callback;
+	errcallback.arg = (void *) &errinfo;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	for (blkno = 0; blkno < nblocks; blkno++)
+	{
+		Datum		values[PG_CHECK_RELATION_COLS];
+		bool		nulls[PG_CHECK_RELATION_COLS];
+		int			i = 0;
+
+		/* Update block number for the error context */
+		errinfo.blkno = blkno;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Check the given buffer */
+		if (CheckBuffer(smgr, forknum, blkno))
+			continue;
+
+		memset(values, 0, sizeof(values));
+		memset(nulls, 0, sizeof(nulls));
+
+		values[i++] = CStringGetTextDatum(path);
+		values[i++] = UInt32GetDatum(blkno);
+
+		Assert(i == PG_CHECK_RELATION_COLS);
+
+		/* Save the corrupted blocks in the tuplestore. */
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+
+	pfree(path);
+
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
+}
diff --git a/src/test/recovery/t/022_page_check.pl b/src/test/recovery/t/022_page_check.pl
new file mode 100644
index 0000000000..7e1f0d1fd8
--- /dev/null
+++ b/src/test/recovery/t/022_page_check.pl
@@ -0,0 +1,234 @@
+# Emulate on-disk corruptions of relation pages and find such corruptions
+# using pg_relation_check_pages().
+
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 20;
+
+our $CHECKSUM_UINT16_OFFSET = 4;
+our $PD_UPPER_UINT16_OFFSET = 7;
+our $BLOCKSIZE;
+our $TOTAL_NB_ERR = 0;
+
+# Read one relation page of BLOCKSIZE bytes from the given $filename.
+# $blkno is the same block number as for a relation file.
+sub read_page
+{
+	my ($filename, $blkno) = @_;
+	my $block;
+
+	open(my $infile, '<', $filename) or die;
+	binmode($infile);
+
+	sysseek($infile, $blkno * $BLOCKSIZE, 0) or die($!);
+	my $success = sysread($infile, $block, $BLOCKSIZE);
+	die($!) if !defined($success);
+
+	close($infile);
+
+	return ($block);
+}
+
+# Update an existing page of size BLOCKSIZE with new contents in given
+# $filename.  blkno is the block number assigned in the relation file.
+sub write_page
+{
+	my ($filename, $block, $blkno) = @_;
+
+	open(my $outfile, '+<', $filename) or die;
+	binmode($outfile);
+
+	sysseek($outfile, $blkno * $BLOCKSIZE, 0) or die($!);
+	my $nb = syswrite($outfile, $block, $BLOCKSIZE);
+
+	die($!) if not defined $nb;
+	die("Write error") if ($nb != $BLOCKSIZE);
+
+	close($outfile);
+	return;
+}
+
+# Read 2 bytes from relation page at a given offset.
+sub get_uint16_from_page
+{
+	my ($block, $offset) = @_;
+
+	return (unpack("S*", $block))[$offset];
+}
+
+# Write 2 bytes to relation page at a given offset.
+sub set_uint16_to_page
+{
+	my ($block, $data, $offset) = @_;
+
+	my $pack = pack("S", $data);
+
+	# vec() with units of 16 bits or more won't preserve endianness.
+	vec($block, 2 * $offset, 8) = (unpack('C*', $pack))[0];
+	vec($block, (2 * $offset) + 1, 8) = (unpack('C*', $pack))[1];
+
+	return $block;
+}
+
+# Sanity check on pg_stat_database, looking at the number of checksum
+# failures.
+sub check_pg_stat_database
+{
+	my ($node, $test_prefix) = @_;
+
+	my $stdout = $node->safe_psql('postgres',
+		    "SELECT "
+		  . " sum(checksum_failures)"
+		  . " FROM pg_catalog.pg_stat_database");
+	is($stdout, $TOTAL_NB_ERR,
+		"$test_prefix: pg_stat_database should have $TOTAL_NB_ERR errors");
+
+	return;
+}
+
+# Run a round of page checks for any relation present in this test run.
+# $expected_broken is the psql output marking all the pages found as
+# corrupted, using relname|blkno as the format for each tuple returned.
+# $num_checksum is the number of checksum failures added to the global
+# counter, matched later against pg_stat_database.
+#
+# Note that this has no need to check system relations as these would have
+# no corruptions: this test does not manipulate them and should by no means
+# break the cluster.
+sub run_page_checks
+{
+	my ($node, $num_checksum, $expected_broken, $test_prefix) = @_;
+
+	my $stdout = $node->safe_psql('postgres',
+		    "SELECT relname, failed_block_num"
+		  . " FROM (SELECT relname, (pg_catalog.pg_relation_check_pages(oid)).*"
+		  . "   FROM pg_class "
+		  . "   WHERE relkind in ('r','i', 'm') AND oid >= 16384) AS s");
+
+	# Check command result
+	is($stdout, $expected_broken,
+		"$test_prefix: output mismatch with pg_relation_check_pages()");
+
+	$TOTAL_NB_ERR += $num_checksum;
+	return;
+}
+
+# Perform various tests that modify a specified block at the specified
+# offset, checking that the page corruption is correctly detected.  The
+# original contents of the page are restored back once done.
+# $broken_pages is the set of pages that are expected to be broken
+# as of the returned result of pg_relation_check_pages().  $num_checksum
+# is the number of checksum failures expected to be added after this
+# function is done.
+sub corrupt_and_test_block
+{
+	my ($node, $filename, $blkno, $offset, $broken_pages, $num_checksum,
+		$test_prefix)
+	  = @_;
+	my $fake_data = hex '0x0000';
+
+	# Stop the server cleanly to flush any pages, and to prevent any
+	# concurrent updates on what is going to be updated.
+	$node->stop;
+	my $original_block = read_page($filename, $blkno);
+	my $original_data = get_uint16_from_page($original_block, $offset);
+
+	isnt($original_data, $fake_data,
+		"$test_prefix: fake data at offset $offset should be different from the existing one"
+	);
+
+	my $new_block = set_uint16_to_page($original_block, $fake_data, $offset);
+	isnt(
+		$original_data,
+		get_uint16_from_page($new_block, $offset),
+		"$test_prefix: The data at offset $offset should have been changed in memory"
+	);
+
+	write_page($filename, $new_block, $blkno);
+
+	my $written_data = get_uint16_from_page(read_page($filename, $blkno), $offset);
+
+	# Some offline checks to validate that the corrupted data is in place.
+	isnt($original_data, $written_data,
+		"$test_prefix: data written at offset $offset should be different from the original one"
+	);
+	is( get_uint16_from_page($new_block, $offset),
+		$written_data,
+		"$test_prefix: data written at offset $offset should be the same as the one in memory"
+	);
+	is($written_data, $fake_data,
+		"$test_prefix: The data written at offset $offset should be the one we wanted to write"
+	);
+
+	# The corruption is in place, start the server to run the checks.
+	$node->start;
+	run_page_checks($node, $num_checksum, $broken_pages, $test_prefix);
+
+	# Stop the server, put the original page back in place.
+	$node->stop;
+
+	$new_block = set_uint16_to_page($original_block, $original_data, $offset);
+	is( $original_data,
+		get_uint16_from_page($new_block, $offset),
+		"$test_prefix: data at offset $offset should have been restored in memory"
+	);
+
+	write_page($filename, $new_block, $blkno);
+	is( $original_data,
+		get_uint16_from_page(read_page($filename, $blkno), $offset),
+		"$test_prefix: data at offset $offset should have been restored on disk"
+	);
+
+	# There should be no errors now that the contents are back in place.
+	$node->start;
+	run_page_checks($node, 0, '', $test_prefix);
+}
+
+# Data checksums are necessary for this test.
+my $node = get_new_node('main');
+$node->init(extra => ['--data-checksums']);
+$node->start;
+
+my $stdout =
+  $node->safe_psql('postgres', "SELECT" . " current_setting('block_size')");
+
+$BLOCKSIZE = $stdout;
+
+# Basic schema to corrupt and check
+$node->safe_psql(
+	'postgres', q|
+	CREATE TABLE public.t1(id integer);
+	INSERT INTO public.t1 SELECT generate_series(1, 100);
+	CHECKPOINT;
+|);
+
+# Get the path to the relation file that will get manipulated by the
+# follow-up tests with some on-disk corruptions.
+$stdout = $node->safe_psql('postgres',
+	    "SELECT"
+	  . " current_setting('data_directory') || '/' || pg_relation_filepath('t1')"
+);
+
+my $filename = $stdout;
+
+# Normal case without corruptions, this passes, with pg_stat_database
+# reporting no errors.
+check_pg_stat_database($node, 'start');
+
+# Test with a modified checksum.  We use a zero checksum here as it's the only
+# one that cannot exist on a checksummed page.  We also don't have an easy way
+# to compute what the checksum would be after a modification in a random place
+# in the block.
+corrupt_and_test_block($node, $filename, 0, $CHECKSUM_UINT16_OFFSET, 't1|0',
+	1, 'broken checksum');
+
+# Test corruption making the block look like it's PageIsNew().
+corrupt_and_test_block($node, $filename, 0, $PD_UPPER_UINT16_OFFSET, 't1|0',
+	0, 'new page');
+
+# Check that the number of errors in pg_stat_database matches what we
+# expect given the corruptions previously introduced.
+check_pg_stat_database($node, 'end');
diff --git a/src/test/regress/expected/pagefuncs.out b/src/test/regress/expected/pagefuncs.out
new file mode 100644
index 0000000000..38a72b01b3
--- /dev/null
+++ b/src/test/regress/expected/pagefuncs.out
@@ -0,0 +1,72 @@
+--
+-- Tests for functions related to relation pages
+--
+-- Restricted to superusers by default
+CREATE ROLE regress_pgfunc_user;
+SET ROLE regress_pgfunc_user;
+SELECT pg_relation_check_pages('pg_class'); -- error
+ERROR:  permission denied for function pg_relation_check_pages
+SELECT pg_relation_check_pages('pg_class', 'main'); -- error
+ERROR:  permission denied for function pg_relation_check_pages
+RESET ROLE;
+DROP ROLE regress_pgfunc_user;
+-- NULL and simple sanity checks
+SELECT pg_relation_check_pages(NULL); -- empty result
+ pg_relation_check_pages 
+-------------------------
+(0 rows)
+
+SELECT pg_relation_check_pages(NULL, NULL); -- empty result
+ pg_relation_check_pages 
+-------------------------
+(0 rows)
+
+SELECT pg_relation_check_pages('pg_class', 'invalid_fork'); -- error
+ERROR:  invalid fork name
+HINT:  Valid fork names are "main", "fsm", "vm", and "init".
+-- Relation types that are supported
+CREATE TABLE pgfunc_test_tab (id int);
+CREATE INDEX pgfunc_test_ind ON pgfunc_test_tab(id);
+INSERT INTO pgfunc_test_tab VALUES (generate_series(1,1000));
+SELECT pg_relation_check_pages('pgfunc_test_tab');
+ pg_relation_check_pages 
+-------------------------
+(0 rows)
+
+SELECT pg_relation_check_pages('pgfunc_test_ind');
+ pg_relation_check_pages 
+-------------------------
+(0 rows)
+
+DROP TABLE pgfunc_test_tab;
+CREATE MATERIALIZED VIEW pgfunc_test_matview AS SELECT 1;
+SELECT pg_relation_check_pages('pgfunc_test_matview');
+ pg_relation_check_pages 
+-------------------------
+(0 rows)
+
+DROP MATERIALIZED VIEW pgfunc_test_matview;
+CREATE SEQUENCE pgfunc_test_seq;
+SELECT pg_relation_check_pages('pgfunc_test_seq');
+ pg_relation_check_pages 
+-------------------------
+(0 rows)
+
+DROP SEQUENCE pgfunc_test_seq;
+-- pg_relation_check_pages() returns no results if passed relations that
+-- do not support the operation, like relations without storage or temporary
+-- relations.
+CREATE TEMPORARY TABLE pgfunc_test_temp AS SELECT generate_series(1,10) AS a;
+SELECT pg_relation_check_pages('pgfunc_test_temp');
+ pg_relation_check_pages 
+-------------------------
+(0 rows)
+
+DROP TABLE pgfunc_test_temp;
+CREATE VIEW pgfunc_test_view AS SELECT 1;
+SELECT pg_relation_check_pages('pgfunc_test_view');
+ pg_relation_check_pages 
+-------------------------
+(0 rows)
+
+DROP VIEW pgfunc_test_view;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index ae89ed7f0b..7a46a13252 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -112,7 +112,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
 # ----------
 # Another group of parallel tests
 # ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain pagefuncs
 
 # event triggers cannot run concurrently with any test that runs DDL
 test: event_trigger
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 525bdc804f..9a80b80f73 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -197,6 +197,7 @@ test: hash_part
 test: indexing
 test: partition_aggregate
 test: partition_info
+test: pagefuncs
 test: tuplesort
 test: explain
 test: event_trigger
diff --git a/src/test/regress/sql/pagefuncs.sql b/src/test/regress/sql/pagefuncs.sql
new file mode 100644
index 0000000000..12d32eeae4
--- /dev/null
+++ b/src/test/regress/sql/pagefuncs.sql
@@ -0,0 +1,41 @@
+--
+-- Tests for functions related to relation pages
+--
+
+-- Restricted to superusers by default
+CREATE ROLE regress_pgfunc_user;
+SET ROLE regress_pgfunc_user;
+SELECT pg_relation_check_pages('pg_class'); -- error
+SELECT pg_relation_check_pages('pg_class', 'main'); -- error
+RESET ROLE;
+DROP ROLE regress_pgfunc_user;
+
+-- NULL and simple sanity checks
+SELECT pg_relation_check_pages(NULL); -- empty result
+SELECT pg_relation_check_pages(NULL, NULL); -- empty result
+SELECT pg_relation_check_pages('pg_class', 'invalid_fork'); -- error
+
+-- Relation types that are supported
+CREATE TABLE pgfunc_test_tab (id int);
+CREATE INDEX pgfunc_test_ind ON pgfunc_test_tab(id);
+INSERT INTO pgfunc_test_tab VALUES (generate_series(1,1000));
+SELECT pg_relation_check_pages('pgfunc_test_tab');
+SELECT pg_relation_check_pages('pgfunc_test_ind');
+DROP TABLE pgfunc_test_tab;
+
+CREATE MATERIALIZED VIEW pgfunc_test_matview AS SELECT 1;
+SELECT pg_relation_check_pages('pgfunc_test_matview');
+DROP MATERIALIZED VIEW pgfunc_test_matview;
+CREATE SEQUENCE pgfunc_test_seq;
+SELECT pg_relation_check_pages('pgfunc_test_seq');
+DROP SEQUENCE pgfunc_test_seq;
+
+-- pg_relation_check_pages() returns no results if passed relations that
+-- do not support the operation, like relations without storage or temporary
+-- relations.
+CREATE TEMPORARY TABLE pgfunc_test_temp AS SELECT generate_series(1,10) AS a;
+SELECT pg_relation_check_pages('pgfunc_test_temp');
+DROP TABLE pgfunc_test_temp;
+CREATE VIEW pgfunc_test_view AS SELECT 1;
+SELECT pg_relation_check_pages('pgfunc_test_view');
+DROP VIEW pgfunc_test_view;
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index f7f401b534..7ef2ec9972 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26182,6 +26182,56 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
 
   </sect2>
 
+  <sect2 id="functions-data-sanity">
+   <title>Data Sanity Functions</title>
+
+   <para>
+    The functions shown in <xref linkend="functions-data-sanity-table"/>
+    provide ways to check the sanity of data files in the cluster.
+   </para>
+
+   <table id="functions-data-sanity-table">
+    <title>Data Sanity Functions</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry>
+      </row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry>
+        <literal><function>pg_relation_check_pages(<parameter>relation</parameter> <type>regclass</type> [, <parameter>fork</parameter> <type>text</type> <literal>DEFAULT</literal> <literal>NULL</literal> ])</function></literal>
+       </entry>
+       <entry><type>setof record</type></entry>
+       <entry>Check the pages of a relation.
+       </entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+
+   <indexterm>
+    <primary>pg_relation_check_pages</primary>
+   </indexterm>
+   <para id="functions-check-relation-note" xreflabel="pg_relation_check_pages">
+    <function>pg_relation_check_pages</function> iterates over all blocks of a
+    given relation and verifies if they are in a state where they can safely
+    be loaded into the shared buffers. If defined,
+    <replaceable>fork</replaceable> specifies that only the pages of the given
+    fork are to be verified. Fork can be <literal>'main'</literal> for the
+    main data fork, <literal>'fsm'</literal> for the free space map,
+    <literal>'vm'</literal> for the visibility map, or
+    <literal>'init'</literal> for the initialization fork.  The default of
+    <literal>NULL</literal> means that all the forks of the relation are
+    checked. The function returns a list of blocks considered corrupted,
+    with the path of the related file. Use of this function is
+    restricted to superusers by default but access may be granted to others
+    using <command>GRANT</command>.
+   </para>
+
+  </sect2>
+
   </sect1>
 
   <sect1 id="functions-trigger">
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ff853634bc..b6acade6c6 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -332,6 +332,7 @@ CatCacheHeader
 CatalogId
 CatalogIndexState
 ChangeVarNodes_context
+CheckPagesErrorInfo
 CheckPoint
 CheckPointStmt
 CheckpointStatsData
#61Julien Rouhaud
rjuju123@gmail.com
In reply to: Michael Paquier (#60)
Re: Online checksums verification in the backend

On Tue, Oct 27, 2020 at 3:07 PM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Oct 23, 2020 at 06:06:30PM +0900, Michael Paquier wrote:

On Fri, Oct 23, 2020 at 04:31:56PM +0800, Julien Rouhaud wrote:

Mmm, is it really an improvement to report warnings during this
function execution? Note also that PageIsVerified as-is won't report
a warning if a page is found as PageIsNew() but isn't actually all
zero, while still being reported as corrupted by the SRF.

Yep, joining the point of above to just have no WARNINGs at all.

Now that we have d401c57, I got to consider more this one, and opted
for not generating a WARNING for now. Hence, PageisVerifiedExtended()
is disabled regarding that, but we still report a checksum failure in
it.

Great, that's also what I had in mind.

I have spent some time reviewing the tests, as I felt they were bulky.
In the reworked version attached, I have reduced the number of tests by
half without reducing the coverage, mainly:
- Removed all the stderr and the return code tests, as we always
expected the commands to succeed, and safe_psql() can do the job
already.
- Merged of the queries using pg_relation_check_pages into a single
routine, with the expected output (set of broken pages returned in the
SRF) in the arguments.
- Added some prefixes to the tests, to generate unique test names.
That makes debugging easier.
- The query on pg_stat_database is run once at the beginning, once at
the end with the number of checksum failures correctly updated.
- Added comments to document all the routines, and renamed some of
them mostly for consistency.
- Skipped system relations in the scan of pg_class, as including them
made the test more costly for nothing.
- I ran some tests on Windows, just-in-case.

I have also added a SearchSysCacheExists1() to double-check if the
relation is missing before opening it, added a
CHECK_FOR_INTERRUPTS() within the main check loop (where the error
context is really helpful), indented the code, bumped the catalogs
(mostly a self-reminder), etc.

So, what do you think?

I think it's also worth noting that the IOLock is now acquired just
before getting the buffer state, and released after the read (or after
finding that the buffer is dirty). This is consistent with how it's
done elsewhere, so I'm fine.

Other than that I'm quite happy with the changes you made, thanks a lot!

#62Michael Paquier
michael@paquier.xyz
In reply to: Julien Rouhaud (#61)
Re: Online checksums verification in the backend

On Tue, Oct 27, 2020 at 07:47:19PM +0800, Julien Rouhaud wrote:

I think it's also worth noting that the IOLock is now acquired just
before getting the buffer state, and released after the read (or after
finding that the buffer is dirty). This is consistent with how it's
done elsewhere, so I'm fine.

Consistency is the point. This API should be safe to use by design.
I have done some extra performance tests similar to what I did
upthread, and this version showed similar numbers.

Other than that I'm quite happy with the changes you made, thanks a lot!

Thanks for confirming. I have gone through the whole set today,
split the thing into two commits and applied them. We had
buildfarm member florican complain about a mistake in one of the
GetDatum() calls that I took care of already, and there is nothing
else on my radar.
--
Michael

#63Shinoda, Noriyoshi (PN Japan A&PS Delivery)
noriyoshi.shinoda@hpe.com
In reply to: Michael Paquier (#62)
1 attachment(s)
RE: Online checksums verification in the backend

Hi,

I have tested this great feature in the latest commit environment on Red Hat Enterprise Linux 7.8. I modified a few blocks in a relation file to raise a checksum error. When I executed the pg_relation_check_pages function, the backend terminated abnormally. The attached file is the operation log.

Regards,
Noriyoshi Shinoda


Attachments:

pg_relation_check_pages_error.txt (text/plain)
#64Julien Rouhaud
rjuju123@gmail.com
In reply to: Shinoda, Noriyoshi (PN Japan A&PS Delivery) (#63)
Re: Online checksums verification in the backend

Hello,

On Thu, Oct 29, 2020 at 7:52 AM Shinoda, Noriyoshi (PN Japan A&PS
Delivery) <noriyoshi.shinoda@hpe.com> wrote:

Hi,

I have tested this great feature in the latest commit environment on Red Hat Enterprise Linux 7.8. I modified a few blocks in a relation file to raise a checksum error. When I executed the pg_relation_check_pages function, the backend terminated abnormally. The attached file is the operation log.

Thanks for the report!

As far as I can see the issue is that the pfree(path) in
check_relation_fork() should be outside the for loop.

#65Michael Paquier
michael@paquier.xyz
In reply to: Julien Rouhaud (#64)
Re: Online checksums verification in the backend

On Thu, Oct 29, 2020 at 08:12:42AM +0800, Julien Rouhaud wrote:

As far as I can see the issue is that the pfree(path) in
check_relation_fork() should be outside the for loop.

Yes, this would be triggered if more than one page is found broken
in a single SRF. Fixed, thanks Shinoda-san.
--
Michael

#66Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#62)
Re: Online checksums verification in the backend

Hi,

On 2020-10-28 14:08:52 +0900, Michael Paquier wrote:

Thanks for confirming. I have gone through the whole set today,
splitted the thing into two commits and applied them. We had
buildfarm member florican complain about a mistake in one of the
GetDatum() calls that I took care of already, and there is nothing
else on my radar.

The code does IO while holding the buffer mapping lock. That seems
*entirely* unacceptable to me. That basically locks 1/128 of shared
buffers against concurrent mapping changes, while reading data that is
likely not to be on disk? Seriously?

/* see if the block is in the buffer pool or not */
LWLockAcquire(partLock, LW_SHARED);
buf_id = BufTableLookup(&buf_tag, buf_hash);
if (buf_id >= 0)
{
uint32 buf_state;

/*
* Found it. Now, retrieve its state to know what to do with it, and
* release the pin immediately. We do so to limit overhead as much as
* possible. We keep the shared LWLock on the target buffer mapping
* partition for now, so this buffer cannot be evicted, and we acquire
* an I/O Lock on the buffer as we may need to read its contents from
* disk.
*/
bufdesc = GetBufferDescriptor(buf_id);

LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
buf_state = LockBufHdr(bufdesc);
UnlockBufHdr(bufdesc, buf_state);

/* If the page is dirty or invalid, skip it */
if ((buf_state & BM_DIRTY) != 0 || (buf_state & BM_TAG_VALID) == 0)
{
LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
LWLockRelease(partLock);
return true;
}

/* Read the buffer from disk, with the I/O lock still held */
smgrread(smgr, forknum, blkno, buffer);
LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
}
else
{
/*
* Simply read the buffer. There's no risk of modification on it as
* we are holding the buffer pool partition mapping lock.
*/
smgrread(smgr, forknum, blkno, buffer);
}

The justification in the in-shared-buffers case seems to completely
mis-judge costs too:
* Found it. Now, retrieve its state to know what to do with it, and
* release the pin immediately. We do so to limit overhead as much as
* possible. We keep the shared LWLock on the target buffer mapping
* partition for now, so this buffer cannot be evicted, and we acquire
* an I/O Lock on the buffer as we may need to read its contents from
* disk.
a pin is cheap. Holding the partition lock is not.

Also, using char[BLCKSZ] as a buffer isn't ok. This should use
PGAlignedBlock:
/*
* Use this, not "char buf[BLCKSZ]", to declare a field or local variable
* holding a page buffer, if that page might be accessed as a page and not
* just a string of bytes. Otherwise the variable might be under-aligned,
* causing problems on alignment-picky hardware. (In some places, we use
* this to declare buffers even though we only pass them to read() and
* write(), because copying to/from aligned buffers is usually faster than
* using unaligned buffers.) We include both "double" and "int64" in the
* union to ensure that the compiler knows the value must be MAXALIGN'ed
* (cf. configure's computation of MAXIMUM_ALIGNOF).
*/
typedef union PGAlignedBlock

I think this needs to be quickly reworked or reverted.

Greetings,

Andres Freund

#67Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#66)
Re: Online checksums verification in the backend

Hi,

On 2020-10-29 11:17:29 -0700, Andres Freund wrote:

LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
buf_state = LockBufHdr(bufdesc);
UnlockBufHdr(bufdesc, buf_state);

/* If the page is dirty or invalid, skip it */
if ((buf_state & BM_DIRTY) != 0 || (buf_state & BM_TAG_VALID) == 0)

This is weird as well. What is this supposed to do? Just locking and
unlocking a buffer header doesn't do squat? There's no guarantee that
the flags haven't changed by this point, so you could just as well not
acquire the buffer header lock.

Also, why are pages without a valid tag ignored? I can follow the
argument for skipping it in the DIRTY case, but that doesn't apply for
BM_TAG_VALID?

Greetings,

Andres Freund

#68Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#66)
Re: Online checksums verification in the backend

Hi,

On 2020-10-29 11:17:29 -0700, Andres Freund wrote:

The code does IO while holding the buffer mapping lock. That seems
*entirely* unacceptable to me. That basically locks 1/128 of shared
buffers against concurrent mapping changes, while reading data that is
likely not to be on disk? Seriously?

Also, uh, I don't think the locking of the buffer table provides you
with the full guarantees CheckBuffer() seems to assume:

* Check the state of a buffer without loading it into the shared buffers. To
* avoid torn pages and possible false positives when reading data, a shared
* LWLock is taken on the target buffer pool partition mapping, and we check
* if the page is in shared buffers or not. An I/O lock is taken on the block
* to prevent any concurrent activity from happening.

this doesn't actually prevent all concurrent write IO, unless you hold
an appropriate lock on the relation. There's a few places that use
smgrwrite()/smgrextend() to write out data bypassing shared buffers.

Maybe that isn't a problem for the uses of CheckBuffer() is envisioned
for, but that'd need a pretty detailed explanation as to when it's safe
to use CheckBuffer() for which blocks.

Greetings,

Andres Freund

#69Julien Rouhaud
rjuju123@gmail.com
In reply to: Andres Freund (#66)
Re: Online checksums verification in the backend

Hi,

On Fri, Oct 30, 2020 at 2:17 AM Andres Freund <andres@anarazel.de> wrote:

The code does IO while holding the buffer mapping lock. That seems
*entirely* unacceptable to me. That basically locks 1/128 of shared
buffers against concurrent mapping changes, while reading data that is
likely not to be on disk? Seriously?

The initial implementation had a different approach, reading the buffer once
without holding the buffer mapping lock (which could lead to some false
positive in some unlikely scenario), and only if a corruption is detected the
read is done once again *while holding the buffer mapping lock* to ensure it's
not a false positive. Some benchmarking showed that the performance was worse,
so we dropped that optimisation. Should we go back to something like that or
do you have a better way to ensure a consistent read of a buffer which isn't in
shared buffers?

a pin is cheap. Holding the partition lock is not.

The justification in the in-shared-buffers case seems to completely
mis-judge costs too:
* Found it. Now, retrieve its state to know what to do with it, and
* release the pin immediately. We do so to limit overhead as much as
* possible. We keep the shared LWLock on the target buffer mapping
* partition for now, so this buffer cannot be evicted, and we acquire
* an I/O Lock on the buffer as we may need to read its contents from
* disk.
a pin is cheap. Holding the partition lock is not.

I clearly did a poor job in that case. Will fix.

Also, using char[BLCKSZ] as a buffer isn't ok. This should use
PGAlignedBlock:

I wasn't aware of it, I will fix.

LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED);
buf_state = LockBufHdr(bufdesc);
UnlockBufHdr(bufdesc, buf_state);

/* If the page is dirty or invalid, skip it */
if ((buf_state & BM_DIRTY) != 0 || (buf_state & BM_TAG_VALID) == 0)

This is weird as well. What is this supposed to do? Just locking and
unlocking a buffer header doesn't do squat? There's no guarantee that
the flags haven't changed by this point, so you could just as well not
acquire the buffer header lock.

This is using the same approach as e.g. WaitIO() to get the state. I agree
that the state can change after the buffer header lock has been released, but
I think that's something out of scope. The only guarantee that we can give is
that the database (or subset of relations checked) was healthy at the time the
check was started, provided that your cluster survives the checkpoint happening
after the check ended. I don't see how we can do better than that.

Also, why are pages without a valid tag ignored? I can follow the
argument for skipping it in the DIRTY case, but that doesn't apply for
BM_TAG_VALID?

AFAICT pages that aren't BM_TAG_VALID are newly allocated pages. Those
may not be fully initialized yet, and they'll eventually be written and
flushed.

Also, uh, I don't think the locking of the buffer table provides you
with the full guarantees CheckBuffer() seems to assume:

* Check the state of a buffer without loading it into the shared buffers. To
* avoid torn pages and possible false positives when reading data, a shared
* LWLock is taken on the target buffer pool partition mapping, and we check
* if the page is in shared buffers or not. An I/O lock is taken on the block
* to prevent any concurrent activity from happening.

this doesn't actually prevent all concurrent write IO, unless you hold
an appropriate lock on the relation. There's a few places that use
smgrwrite()/smgrextend() to write out data bypassing shared buffers.

Maybe that isn't a problem for the uses of CheckBuffer() is envisioned
for, but that'd need a pretty detailed explanation as to when it's safe
to use CheckBuffer() for which blocks.

AFAICT, concurrent smgrwrite() can only happen for init forks of unlogged
relations, during creation. Those relations shouldn't be visible to the
caller's snapshot, so it should be safe. I can add a comment for that if I'm
not mistaken.

For concurrent smgrextend(), we read the relation size at the beginning of the
function, so we shouldn't read newly allocated blocks. But you're right that
it's still possible to get the size that includes a newly allocated block
that can be concurrently written. We can avoid that by holding a
LOCKTAG_RELATION_EXTEND lock when reading the relation size. Would that be ok?

#70Andres Freund
andres@anarazel.de
In reply to: Julien Rouhaud (#69)
Re: Online checksums verification in the backend

Hi,

On 2020-10-30 10:01:08 +0800, Julien Rouhaud wrote:

On Fri, Oct 30, 2020 at 2:17 AM Andres Freund <andres@anarazel.de> wrote:

The code does IO while holding the buffer mapping lock. That seems
*entirely* unacceptable to me. That basically locks 1/128 of shared
buffers against concurrent mapping changes, while reading data that is
likely not to be on disk? Seriously?

The initial implementation had a different approach, reading the buffer once
without holding the buffer mapping lock (which could lead to some false
positive in some unlikely scenario), and only if a corruption is detected the
read is done once again *while holding the buffer mapping lock* to ensure it's
not a false positive. Some benchmarking showed that the performance was worse,
so we dropped that optimisation. Should we go back to something like that or
do you have a better way to ensure a consistent read of a buffer which isn't in
shared buffers?

I suspect that you're gonna need something quite different than what the
function is doing right now. Not because such a method will be faster in
isolation, but because there's a chance to have it correct and not have
a significant performance impact onto the rest of the system.

I've not thought about it in detail yet. I suspect you'll need to
ensure there is a valid entry in the buffer mapping table for the buffer
you're processing. By virtue of setting BM_IO_IN_PROGRESS on that entry
you're going to prevent concurrent IO from starting until your part is
done.

Also, why are pages without a valid tag ignored? I can follow the
argument for skipping it in the DIRTY case, but that doesn't apply for
BM_TAG_VALID?

AFAICT pages that aren't BM_TAG_VALID are pages newly allocated.
Those shouldn't
be entirely initialized yet, and they'll be eventually written and flushed.

When a page is being read there's a period when the buffer is without
BM_TAG_VALID. It's quite possible that the locking prevents this case
from being reachable - but in that case you shouldn't just accept it as
something to be skipped...

Also, uh, I don't think the locking of the buffer table provides you
with the full guarantees CheckBuffer() seems to assume:

* Check the state of a buffer without loading it into the shared buffers. To
* avoid torn pages and possible false positives when reading data, a shared
* LWLock is taken on the target buffer pool partition mapping, and we check
* if the page is in shared buffers or not. An I/O lock is taken on the block
* to prevent any concurrent activity from happening.

this doesn't actually prevent all concurrent write IO, unless you hold
an appropriate lock on the relation. There's a few places that use
smgrwrite()/smgrextend() to write out data bypassing shared buffers.

Maybe that isn't a problem for the uses of CheckBuffer() is envisioned
for, but that'd need a pretty detailed explanation as to when it's safe
to use CheckBuffer() for which blocks.

AFAICT, concurrent smgrwrite() can only happen for init forks of unlogged
relations, during creation.

That may be the case in core right now, but for one, there
definitely are extensions going through smgrwrite() without using the
buffer pool. Essentially, what you are saying is that the introduction
of CheckBuffer() altered what smgrwrite() is allowed to be used for,
without having discussed or documented that.

Before this an AM/extension could just use smgrwrite() to write data not
in shared buffers, as long as a locking scheme is used that prevents
multiple backends from doing that at the same time (trivially:
AccessExclusiveLock).

Those relations shouldn't be visible to the caller
snapshot, so it should be safe. I can add a comment for that if I'm not
mistaken.

There's no comment warning that you shouldn't use CheckBuffer() to check
every buffer in shared buffers, or every relfilenode on disk. The latter
would be quite a reasonable thing, given it'd avoid needing to connect
to every database etc.

For concurrent smgrextend(), we read the relation size at the beginning of the
function, so we shouldn't read newly allocated blocks. But you're right that
it's still possible to get the size that includes a newly allocated block
that can be concurrently written. We can avoid that by holding a
LOCKTAG_RELATION_EXTEND lock when reading the relation size. Would that be ok?

That could possibly work - but currently CheckBuffer() doesn't get a
relation, nor are the comments explaining that it has to be a relation
in the current database or anything.

I hadn't yet looked at the caller - I just started looking at
CheckBuffer() this because it caused compilation failures after rebasing
my aio branch onto master (there's no IO locks anymore).

Looking at the caller:
- This is not a concurrency safe pattern:

/* Check if relation exists. leaving if there is no such relation */
if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
return;

relation = relation_open(relid, AccessShareLock);

there's a pretty obvious time-of-check-to-time-of-use danger here.

- pg_relation_check_pages()'s docs say "valid enough to safely be loaded
into the server's shared buffers". I think that's overpromising by a
lot. It sounds like it verifies that the page cannot cause a crash or
such when accessed - but it obviously does no such thing.

- Why does check_one_relation() *silently* ignore when it's being
passed a temporary table, or a relkind without storage?

- I don't think it's good that check_one_relation() releases relation
locks after access, but I know that others think that's fine (I think
it's only fine for catalog relations).

- I realize permission to pg_relation_check_pages() is not granted to
non-superusers by default, but shouldn't it still perform relation
access checks?

- why does check_relation_fork() pstrdup the path?

Greetings,

Andres Freund

#71Julien Rouhaud
rjuju123@gmail.com
In reply to: Andres Freund (#70)
Re: Online checksums verification in the backend

On Fri, Oct 30, 2020 at 10:58 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-10-30 10:01:08 +0800, Julien Rouhaud wrote:

On Fri, Oct 30, 2020 at 2:17 AM Andres Freund <andres@anarazel.de> wrote:

The code does IO while holding the buffer mapping lock. That seems
*entirely* unacceptable to me. That basically locks 1/128 of shared
buffers against concurrent mapping changes, while reading data that is
likely not to be on disk? Seriously?

The initial implementation had a different approach, reading the buffer once
without holding the buffer mapping lock (which could lead to some false
positive in some unlikely scenario), and only if a corruption is detected the
read is done once again *while holding the buffer mapping lock* to ensure it's
not a false positive. Some benchmarking showed that the performance was worse,
so we dropped that optimisation. Should we go back to something like that or
do you have a better way to ensure a consistent read of a buffer which isn't in
shared buffers?

I suspect that you're gonna need something quite different than what the
function is doing right now. Not because such a method will be faster in
isolation, but because there's a chance to have it correct and not have
a significant performance impact onto the rest of the system.

I've not thought about it in detail yet. I suspect you'll need to
ensure there is a valid entry in the buffer mapping table for the buffer
you're processing. By virtue of setting BM_IO_IN_PROGRESS on that entry
you're going to prevent concurrent IO from starting until your part is
done.

So I'm assuming that the previous optimization to avoid almost every
time doing an IO while holding a buffer mapping lock isn't an option?
In that case, I don't see any other option than reverting the patch
and discussing a new approach.

#72Andres Freund
andres@anarazel.de
In reply to: Julien Rouhaud (#71)
Re: Online checksums verification in the backend

On 2020-10-30 11:58:13 +0800, Julien Rouhaud wrote:

So I'm assuming that the previous optimization to avoid almost every
time doing an IO while holding a buffer mapping lock isn't an option?
In that case, I don't see any other option than reverting the patch
and discussing a new approach.

I think it's pretty much *never* OK to do IO while holding a buffer
mapping lock. You're locking a significant fraction of shared buffers
over IO. That's just not OK. Don't think there's any place doing so
currently either.

#73Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#72)
Re: Online checksums verification in the backend

On Thu, Oct 29, 2020 at 10:08:52PM -0700, Andres Freund wrote:

I think it's pretty much *never* OK to do IO while holding a buffer
mapping lock. You're locking a significant fraction of shared buffers
over IO. That's just not OK. Don't think there's any place doing so
currently either.

There is no place doing that on HEAD.

This specific point was mentioned in the first message of this thread,
7th paragraph. That was a long thread, so it is easy to miss:
/messages/by-id/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5Q0Y+WuLML5sR17w@mail.gmail.com

I am wondering what you have in mind regarding the use of
BM_IO_IN_PROGRESS or a similar flag. Wouldn't that imply some
consequences for other existing buffers in the table, like a possible
eviction? I'd like to think that we should not do any manipulation of
the buffer tables in this case. Hence, in order to prevent a
concurrent activity to load in shared buffers the page currently
checked on disk, I got to think that we would need something new here,
like a filtering hash table that would be checked each time a backend
tries to insert an entry into the buffer tables. That's something I
was wondering here:
/messages/by-id/20200316030638.GA2331@paquier.xyz
I called that a preemptive lock, but you could also call that a
discard filter or a virtual pin, just something to mean that a page
locked this way cannot be loaded into the shared buffers. I'd like to
think that this should not touch the existing buffer table, but it
would impact the performance when attempting to insert an entry in the
tables, as anything would need to be pre-checked.
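As a rough illustration of this "discard filter" idea, here is a minimal self-contained sketch (names, layout, and the hash are invented, not PostgreSQL code) keeping a small table of (relfilenode, block) pairs that every buffer-table insertion would have to consult:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical "discard filter": pages listed here may not be loaded
 * into shared buffers while a checker is reading them from disk. */
#define FILTER_SLOTS 64

typedef struct
{
    uint32_t    relnode;
    uint32_t    blkno;
    bool        used;
} FilterSlot;

static FilterSlot filter[FILTER_SLOTS];

static unsigned slot_for(uint32_t relnode, uint32_t blkno)
{
    return (relnode * 2654435761u ^ blkno) % FILTER_SLOTS;
}

/* Claim a page before checking it on disk; false if the table is full. */
static bool filter_add(uint32_t relnode, uint32_t blkno)
{
    for (unsigned i = 0; i < FILTER_SLOTS; i++)
    {
        unsigned s = (slot_for(relnode, blkno) + i) % FILTER_SLOTS;

        if (!filter[s].used)
        {
            filter[s] = (FilterSlot) {relnode, blkno, true};
            return true;
        }
    }
    return false;
}

static void filter_remove(uint32_t relnode, uint32_t blkno)
{
    for (unsigned i = 0; i < FILTER_SLOTS; i++)
    {
        unsigned s = (slot_for(relnode, blkno) + i) % FILTER_SLOTS;

        if (filter[s].used && filter[s].relnode == relnode &&
            filter[s].blkno == blkno)
        {
            filter[s].used = false;
            return;
        }
    }
}

/* This is the pre-check every buffer-table insertion would pay, which
 * is exactly the extra cost on an already contended path. */
static bool page_is_blocked(uint32_t relnode, uint32_t blkno)
{
    for (unsigned i = 0; i < FILTER_SLOTS; i++)
    {
        unsigned s = (slot_for(relnode, blkno) + i) % FILTER_SLOTS;

        if (filter[s].used && filter[s].relnode == relnode &&
            filter[s].blkno == blkno)
            return true;
    }
    return false;
}
```

A real version would also need locking around the filter itself, omitted here for brevity.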

Assuming that we could make this thing work without holding the
partition lock, and assuming that we only hold a share lock on the
relation, we have two cases:
1) If the buffer is in shared buffers, we have the APIs to solve that
by using a pin, unlock the partition, and then do the I/O. (Still
that's unsafe with the smgrwrite() argument?)
2) If the buffer is not in shared buffers, we don't have what it takes
to solve the problem yet. But even if we solve this problem, we will
never really be sure that this is entirely safe, as per the argument
with concurrent smgrwrite() calls. Current in-core code assumes that
this can happen only for init forks of unlogged relations which would
not be visible yet in the backend doing a page check, still it can be
really easy to break this assumption with any new code added by a new
feature.

These arguments bring down to reduce the scope of CheckBuffer() as
follows:
- Use an AEL on the relation, pass down a Relation instead of
SMgrRelation, and add on the way an assertion to make sure that the
caller holds an AEL on the relation. I wanted to study the possibility
to use that stuff for base backups, but if you bring the concurrent
smgrwrite() calls into the set of possibilities this shuts down the
whole study from the start.
- It is still useful to check that a page is in shared buffers IMO, so
as if it is dirty we just discard it from the checks and rely on the
next checkpoint to do a flush. It is also useful to check the state
of the on-disk data is good or not if the page is not dirty, as the
page could have gone rogue on-disk while a system was up for weeks.
--
Michael

#74Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#73)
Re: Online checksums verification in the backend

Hi,

I'm a bit limited writing - one handed for a while following an injury
on Friday...

On 2020-11-02 10:05:25 +0900, Michael Paquier wrote:

On Thu, Oct 29, 2020 at 10:08:52PM -0700, Andres Freund wrote:

I think it's pretty much *never* OK to do IO while holding a buffer
mapping lock. You're locking a significant fraction of shared buffers
over IO. That's just not OK. Don't think there's any place doing so
currently either.

There is no place doing that on HEAD.

Err?

    /* see if the block is in the buffer pool or not */
    LWLockAcquire(partLock, LW_SHARED);
    buf_id = BufTableLookup(&buf_tag, buf_hash);
    if (buf_id >= 0)
    {
        ...
        /* Read the buffer from disk, with the I/O lock still held */
        smgrread(smgr, forknum, blkno, buffer);
        LWLockRelease(BufferDescriptorGetIOLock(bufdesc));
    }
    else
    {
        /*
         * Simply read the buffer.  There's no risk of modification on it as
         * we are holding the buffer pool partition mapping lock.
         */
        smgrread(smgr, forknum, blkno, buffer);
    }

    /* buffer lookup done, so now do its check */
    LWLockRelease(partLock);

How is this not doing IO while holding a buffer mapping lock?

This specific point was mentioned in the first message of this thread,
7th paragraph. That was a long thread, so it is easy to miss:
/messages/by-id/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5Q0Y+WuLML5sR17w@mail.gmail.com

The code clearly doesn't implement it that way.

I am wondering what you have in mind regarding the use of
BM_IO_IN_PROGRESS or a similar flag. Wouldn't that imply some
consequences for other existing buffers in the table, like a possible
eviction?

You'd need exactly one empty buffer for that - it can be reused for the
next to-be-checked buffer.

I'd like to think that we should not do any manipulation of
the buffer tables in this case.

Why? It's the way we lock buffers - why is this so special that we need
to do differently?

Hence, in order to prevent a
concurrent activity to load in shared buffers the page currently
checked on disk, I got to think that we would need something new here,
like a filtering hash table that would be checked each time a backend
tries to insert an entry into the buffer tables.

That's going to slow down everything a bit - the mapping already is a
bottleneck.

1) If the buffer is in shared buffers, we have the APIs to solve that
by using a pin, unlock the partition, and then do the I/O. (Still
that's unsafe with the smgrwrite() argument?)

That's why you need an appropriate relation lock... Something CheckBuffer
didn't bother to document. It's a restriction, but one we probably can
live with.

2) If the buffer is not in shared buffers, we don't have what it takes
to solve the problem yet.

We do. Set up enough state for the case to be otherwise the same as the
in-s_b case.

But even if we solve this problem, we will
never really be sure that this is entirely safe, as per the argument
with concurrent smgrwrite() calls. Current in-core code assumes that
this can happen only for init forks of unlogged relations which would
not be visible yet in the backend doing a page check, still it can be
really easy to break this assumption with any new code added by a new
feature.

It also happens in a few cases other than init forks. But
visibility & relation locking can take care of that. But you need to
document that. If the locking allows concurrent readers - and especially
concurrent writers, then you can't really use smgrwrite for anything but
relation extension.

Greetings,

Andres Freund

#75Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#74)
Re: Online checksums verification in the backend

On Sun, Nov 01, 2020 at 05:39:40PM -0800, Andres Freund wrote:

I'm a bit limited writing - one handed for a while following an injury
on Friday...

Oops. Take care.

On 2020-11-02 10:05:25 +0900, Michael Paquier wrote:

There is no place doing that on HEAD.

Err?

/* see if the block is in the buffer pool or not */
LWLockAcquire(partLock, LW_SHARED);
buf_id = BufTableLookup(&buf_tag, buf_hash);

[...]

How is this not doing IO while holding a buffer mapping lock?

Well, other than the one we are discussing of course :)

This specific point was mentioned in the first message of this thread,
7th paragraph. That was a long thread, so it is easy to miss:
/messages/by-id/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5Q0Y+WuLML5sR17w@mail.gmail.com

The code clearly doesn't implement it that way.

I am wondering what you have in mind regarding the use of
BM_IO_IN_PROGRESS or a similar flag. Wouldn't that imply some
consequences for other existing buffers in the table, like a possible
eviction?

You'd need exactly one empty buffer for that - it can be reused for the
next to-be-checked buffer.

I'd like to think that we should not do any manipulation of
the buffer tables in this case.

Why? It's the way we lock buffers - why is this so special that we need
to do differently?

Hence, in order to prevent a
concurrent activity to load in shared buffers the page currently
checked on disk, I got to think that we would need something new here,
like a filtering hash table that would be checked each time a backend
tries to insert an entry into the buffer tables.

That's going to slow down everything a bit - the mapping already is a
bottleneck.

1) If the buffer is in shared buffers, we have the APIs to solve that
by using a pin, unlock the partition, and then do the I/O. (Still
that's unsafe with the smgrwrite() argument?)

That's why you need an appropriate relation lock... Something CheckBuffer
didn't bother to document. It's a restriction, but one we probably can
live with.

2) If the buffer is not in shared buffers, we don't have what it takes
to solve the problem yet.

We do. Set up enough state for the case to be otherwise the same as the
in-s_b case.

But even if we solve this problem, we will
never really be sure that this is entirely safe, as per the argument
with concurrent smgrwrite() calls. Current in-core code assumes that
this can happen only for init forks of unlogged relations which would
not be visible yet in the backend doing a page check, still it can be
really easy to break this assumption with any new code added by a new
feature.

It also happens in a few cases other than init forks. But
visibility & relation locking can take care of that. But you need to
document that. If the locking allows concurrent readers - and especially
concurrent writers, then you can't really use smgrwrite for anything but
relation extension.

--
Michael

#76Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#75)
Re: Online checksums verification in the backend

Hi

On 2020-11-02 10:45:00 +0900, Michael Paquier wrote:

On Sun, Nov 01, 2020 at 05:39:40PM -0800, Andres Freund wrote:

I'm a bit limited writing - one handed for a while following an injury
on Friday...

Oops. Take care.

Thanks!

On 2020-11-02 10:05:25 +0900, Michael Paquier wrote:

There is no place doing that on HEAD.

Err?
How is this not doing IO while holding a buffer mapping lock?

Well, other than the one we are discussing of course :)

I am not following. Were you just confirming that it's not a thing we do?

Greetings,

Andres Freund

#77Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#76)
Re: Online checksums verification in the backend

On Sun, Nov 01, 2020 at 05:50:06PM -0800, Andres Freund wrote:

On 2020-11-02 10:45:00 +0900, Michael Paquier wrote:

On 2020-11-02 10:05:25 +0900, Michael Paquier wrote:

There is no place doing that on HEAD.

Err?
How is this not doing IO while holding a buffer mapping lock?

Well, other than the one we are discussing of course :)

I am not following? Were you just confirming that its not a thing we do?

I meant that this is not done in any place other than the one
introduced by c780a7a. So we have one place where it happens, and
no places before c780a7a.
--
Michael

#78Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#66)
Re: Online checksums verification in the backend

On Thu, Oct 29, 2020 at 2:17 PM Andres Freund <andres@anarazel.de> wrote:

I think this needs to be quickly reworked or reverted.

I don't like this patch, either. In addition to what Andres mentioned,
CheckBuffer() is a completely-special case mechanism which can't be
reused by anything else. In particular, the amcheck for heap stuff
which I recently committed (866e24d47db1743dfcff5bd595b57e3a143f2cb1)
would really like a way to examine a buffer without risking an error
if PageIsVerified() should happen to fail, but this patch is of
absolutely no use in getting there, because CheckBuffer() doesn't give
the caller any way to access the contents of the buffer. It can only
do the checks that it knows how to do, and that's it. That doesn't
seem like a good design.

I don't like the fact that CheckBuffer() silently skips dirty buffers,
either. The comment should really say that it checks the state of a
buffer without loading it into shared buffers, except sometimes it
doesn't actually check it. That doesn't seem like the behavior users
really want, and it's not clear that there is any really good reason
for it. If the buffer is in shared buffers, we could take a share-lock
on the buffer and copy the contents of the page as it exists on disk,
and then still check it.

It feels really confusing to me that the user-exposed function here is
called pg_relation_check_pages(). How is the user supposed to
understand the difference between what this function does and what the
new verify_heapam() in amcheck does? The answer is that the latter
does far more extensive checks, but this isn't obvious from the SGML
documentation, which says only that the blocks are "verified," as if
an end-user can reasonably be expected to know what that means. It
seems likely to lead users to the belief that if this function passes,
they are in good shape, which is extremely far from being true. Just
look at what PageIsVerified() checks compared to what verify_heapam()
checks.

In fact, I would argue that this functionality ought to live in
amcheck rather than core, though there could usefully be enabling
functions in core.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#79Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#78)
Re: Online checksums verification in the backend

Hi,

On 2020-11-02 12:35:30 -0500, Robert Haas wrote:

On Thu, Oct 29, 2020 at 2:17 PM Andres Freund <andres@anarazel.de> wrote:

I think this needs to be quickly reworked or reverted.

I think it's fairly clear by now that revert is appropriate for now.

I don't like this patch, either. In addition to what Andres mentioned,
CheckBuffer() is a completely-special case mechanism which can't be
reused by anything else. In particular, the amcheck for heap stuff
which I recently committed (866e24d47db1743dfcff5bd595b57e3a143f2cb1)
would really like a way to examine a buffer without risking an error
if PageIsVerified() should happen to fail, but this patch is of
absolutely no use in getting there, because CheckBuffer() doesn't give
the caller any way to access the contents of the buffer. It can only
do the checks that it knows how to do, and that's it. That doesn't
seem like a good design.

Wouldn't this be better served by having a ReadBufferExtended() flag,
preventing erroring out and zeroing the buffer? I'm not sure that
handling both the case where the buffer contents need to be valid and
the one where it doesn't will make for a good API.

I don't like the fact that CheckBuffer() silently skips dirty buffers,
either. The comment should really say that it checks the state of a
buffer without loading it into shared buffers, except sometimes it
doesn't actually check it.

Yea, I don't see a good reason for that either. There's reasons for
dirty buffers that aren't WAL logged - so if the on-disk page is broken,
a standby taken outside pg_basebackup would possibly still end up with a
corrupt on-disk page. Similar with a crash restart.

If the buffer is in shared buffers, we could take a share-lock
on the buffer and copy the contents of the page as it exists on disk,
and then still check it.

Don't think we need a share lock. That still allows the buffer to be
written out (and thus a torn read). What we need is to set
BM_IO_IN_PROGRESS on the buffer in question - only one backend can set
that. And then unset that again, without unsetting
BM_DIRTY/BM_JUST_DIRTIED.
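To illustrate the flag-based exclusion Andres describes, here is a self-contained sketch (toy state word and types, not PostgreSQL's BufferDesc) of claiming IO ownership of a buffer with a compare-and-swap and releasing it without touching the dirty bits:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified buffer state word; the real flags live in BufferDesc.state. */
#define BM_DIRTY          (1u << 0)
#define BM_IO_IN_PROGRESS (1u << 1)

typedef struct
{
    _Atomic uint32_t state;
} BufDesc;

/* Try to become the single IO owner of this buffer. */
static bool start_io(BufDesc *buf)
{
    uint32_t old = atomic_load(&buf->state);

    do
    {
        if (old & BM_IO_IN_PROGRESS)
            return false;       /* someone else is already doing IO */
    } while (!atomic_compare_exchange_weak(&buf->state, &old,
                                           old | BM_IO_IN_PROGRESS));
    return true;
}

/*
 * End IO without clearing BM_DIRTY - the checker only read the on-disk
 * copy, so it must not pretend it cleaned the buffer.
 */
static void end_io(BufDesc *buf)
{
    atomic_fetch_and(&buf->state, ~BM_IO_IN_PROGRESS);
}
```

While BM_IO_IN_PROGRESS is held, no other backend can start reading or writing that buffer, which is what makes a consistent on-disk read possible without a mapping partition lock.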

It feels really confusing to me that the user-exposed function here is
called pg_relation_check_pages(). How is the user supposed to
understand the difference between what this function does and what the
new verify_heapam() in amcheck does? The answer is that the latter
does far more extensive checks, but this isn't obvious from the SGML
documentation, which says only that the blocks are "verified," as if
an end-user can reasonably be expected to know what that means. It
seems likely to lead users to the belief that if this function passes,
they are in good shape, which is extremely far from being true. Just
look at what PageIsVerified() checks compared to what verify_heapam()
checks.

Yea I had similar thoughts, it should just be called
pg_checksum_verify_relation() or something.

In fact, I would argue that this functionality ought to live in
amcheck rather than core, though there could usefully be enabling
functions in core.

I'm not really convinced by this though. It's not really AM
specific - works for all types of relations with storage; don't really
object either...

Greetings,

Andres Freund

#80Magnus Hagander
magnus@hagander.net
In reply to: Andres Freund (#79)
Re: Online checksums verification in the backend

On Mon, Nov 2, 2020 at 8:35 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-11-02 12:35:30 -0500, Robert Haas wrote:

It feels really confusing to me that the user-exposed function here is
called pg_relation_check_pages(). How is the user supposed to
understand the difference between what this function does and what the
new verify_heapam() in amcheck does? The answer is that the latter
does far more extensive checks, but this isn't obvious from the SGML
documentation, which says only that the blocks are "verified," as if
an end-user can reasonably be expected to know what that means. It
seems likely to lead users to the belief that if this function passes,
they are in good shape, which is extremely far from being true. Just
look at what PageIsVerified() checks compared to what verify_heapam()
checks.

Yea I had similar thoughts, it should just be called
pg_checksum_verify_relation() or something.

+1.

In fact, I would argue that this functionality ought to live in
amcheck rather than core, though there could usefully be enabling
functions in core.

I'm not really convinced by this though. It's not really AM
specific - works for all types of relations with storage; don't really
object either...

Yeah, I'm not sure about that one either. Also what would happen
if/when we get checksums on things that aren't even relations? (though
maybe that goes for other parts of amcheck at some point as well?)

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

#81Michael Paquier
michael@paquier.xyz
In reply to: Magnus Hagander (#80)
Re: Online checksums verification in the backend

On Tue, Nov 03, 2020 at 09:31:20AM +0100, Magnus Hagander wrote:

On Mon, Nov 2, 2020 at 8:35 PM Andres Freund <andres@anarazel.de> wrote:

On 2020-11-02 12:35:30 -0500, Robert Haas wrote:

It feels really confusing to me that the user-exposed function here is
called pg_relation_check_pages(). How is the user supposed to
understand the difference between what this function does and what the
new verify_heapam() in amcheck does? The answer is that the latter
does far more extensive checks, but this isn't obvious from the SGML
documentation, which says only that the blocks are "verified," as if
an end-user can reasonably be expected to know what that means. It
seems likely to lead users to the belief that if this function passes,
they are in good shape, which is extremely far from being true. Just
look at what PageIsVerified() checks compared to what verify_heapam()
checks.

The checks of verify_heapam() go much deeper but target only one
AM, while this stuff should remain more general. There is some
overlap with the basic checks done by bufmgr.c, and you may not
want to be too intrusive with the existing buffer pool when running
the AM checks. It also seems to me that the use cases are quite
different: the original goal of this thread is to detect physical
corruption for all AMs, while verify_heapam() looks for logical
corruption in the way heap is handled.

Yea I had similar thoughts, it should just be called
pg_checksum_verify_relation() or something.

+1.

I mentioned that upthread, is there really a dependency with checksums
here? There are two cases where we can still apply some checks on a
page, without any need of checksums:
- The state of the page header.
- Zeroed page if pd_upper is 0. Those pages are valid, and don't have
a checksum computed.
So it seems to me that when it comes to relation pages, the check
should answer the question: is this page loadable into shared
buffers, or not?
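A minimal sketch of such "loadable or not" checks follows, assuming a cut-down header layout that is illustrative only (the real PageHeaderData starts with pd_lsn and differs in detail): a zeroed page with pd_upper == 0 is accepted, otherwise the header pointers must be ordered and in range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

/* Illustrative cut-down of PostgreSQL's PageHeaderData; not the real
 * on-disk layout. */
typedef struct
{
    uint16_t    pd_checksum;
    uint16_t    pd_flags;
    uint16_t    pd_lower;
    uint16_t    pd_upper;
    uint16_t    pd_special;
} PageHeader;

static bool page_all_zeroes(const uint8_t *page)
{
    for (int i = 0; i < BLCKSZ; i++)
        if (page[i] != 0)
            return false;
    return true;
}

/* Header-only sanity checks, usable even with checksums disabled. */
static bool page_is_loadable(const uint8_t *page)
{
    const PageHeader *hdr = (const PageHeader *) page;

    /* A new, never-initialized page: all zeroes, including pd_upper. */
    if (hdr->pd_upper == 0)
        return page_all_zeroes(page);

    /* Otherwise the header pointers must be ordered and in range. */
    return hdr->pd_lower >= sizeof(PageHeader) &&
           hdr->pd_lower <= hdr->pd_upper &&
           hdr->pd_upper <= hdr->pd_special &&
           hdr->pd_special <= BLCKSZ;
}
```

These are roughly the conditions PageIsVerified() enforces before a page is admitted into shared buffers, minus the checksum itself.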

In fact, I would argue that this functionality ought to live in
amcheck rather than core, though there could usefully be enabling
functions in core.

I'm not really convinced by this though. It's not really AM
specific - works for all types of relations with storage; don't really
object either...

Yeah, I'm not sure about that one either. Also what would happen
if/when we get checksums on things that aren't even relations? (though
maybe that goes for other parts of amcheck at some point as well?)

I also thought about amcheck when looking at this thread, but it did
not seem the right place, as this applies to any AM that can load
pages into shared buffers.
--
Michael

#82Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#79)
Re: Online checksums verification in the backend

On Mon, Nov 02, 2020 at 11:34:57AM -0800, Andres Freund wrote:

On 2020-11-02 12:35:30 -0500, Robert Haas wrote:

On Thu, Oct 29, 2020 at 2:17 PM Andres Freund <andres@anarazel.de> wrote:

I think this needs to be quickly reworked or reverted.

I think it's fairly clear by now that revert is appropriate for now.

Yep, that's clear. I'll deal with that tomorrow. That's more than a
simple rework.

I don't like this patch, either. In addition to what Andres mentioned,
CheckBuffer() is a completely-special case mechanism which can't be
reused by anything else. In particular, the amcheck for heap stuff
which I recently committed (866e24d47db1743dfcff5bd595b57e3a143f2cb1)
would really like a way to examine a buffer without risking an error
if PageIsVerified() should happen to fail, but this patch is of
absolutely no use in getting there, because CheckBuffer() doesn't give
the caller any way to access the contents of the buffer. It can only
do the checks that it knows how to do, and that's it. That doesn't
seem like a good design.

Wouldn't this be better served by having a ReadBufferExtended() flag,
preventing erroring out and zeroing the buffer? I'm not sure that
handling both the case where the buffer contents need to be valid and
the one where it doesn't will make for a good API.

If you grep for ReadBuffer_common() in some of the emails I sent... I
was rather interested in something like that.

I don't like the fact that CheckBuffer() silently skips dirty buffers,
either. The comment should really say that it checks the state of a
buffer without loading it into shared buffers, except sometimes it
doesn't actually check it.

Yea, I don't see a good reason for that either. There's reasons for
dirty buffers that aren't WAL logged - so if the on-disk page is broken,
a standby taken outside pg_basebackup would possibly still end up with a
corrupt on-disk page. Similar with a crash restart.

Er, if you don't skip dirty buffers, wouldn't you actually report some
pages as broken when running the checks on a standby that may have
some torn pages from a previous base backup? You could still run into
problems post-promotion, until the first checkpoint post-recovery
happens, no?

If the buffer is in shared buffers, we could take a share-lock
on the buffer and copy the contents of the page as it exists on disk,
and then still check it.

Don't think we need a share lock. That still allows the buffer to be
written out (and thus a torn read). What we need is to set
BM_IO_IN_PROGRESS on the buffer in question - only one backend can set
that. And then unset that again, without unsetting
BM_DIRTY/BM_JUST_DIRTIED.

If that can work, we could make use of some of that for base backups
for a single retry of a page that initially failed.
--
Michael

#83Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#79)
Re: Online checksums verification in the backend

On Mon, Nov 2, 2020 at 2:35 PM Andres Freund <andres@anarazel.de> wrote:

Wouldn't this be better served by having a ReadBufferExtended() flag,
preventing erroring out and zeroing the buffer? I'm not sure that
handling both the case where the buffer contents need to be valid and
the one where it doesn't will make for a good API.

I'm not sure. The goal I had in mind was giving a caller a way to get
a copy of a buffer even if it's one we wouldn't normally admit into
shared_buffers. I think it's probably a bad idea to allow for a back
door where things that fail PageIsVerified() can nevertheless escape
into the buffer, but that doesn't mean a checker or recovery tool
shouldn't be allowed to see them.

If the buffer is in shared buffers, we could take a share-lock
on the buffer and copy the contents of the page as it exists on disk,
and then still check it.

Don't think we need a share lock. That still allows the buffer to be
written out (and thus a torn read). What we need is to set
BM_IO_IN_PROGRESS on the buffer in question - only one backend can set
that. And then unset that again, without unsetting
BM_DIRTY/BM_JUST_DIRTIED.

OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#84Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#82)
Re: Online checksums verification in the backend

On Tue, Nov 03, 2020 at 06:46:12PM +0900, Michael Paquier wrote:

Yep, that's clear. I'll deal with that tomorrow. That's more than a
simple rework.

This part is now done as of e152506a.
--
Michael