datfrozenxid > relfrozenxid w/ crash before XLOG_HEAP_INPLACE
/messages/by-id/20240512232923.aa.nmisch@google.com wrote:
Separable, nontrivial things not fixed in the attached patch stack:
- Trouble is possible, I bet, if the system crashes between the inplace-update
memcpy() and XLogInsert(). See the new XXX comment below the memcpy().
That comment:
/*----------
* XXX A crash here can allow datfrozenxid() to get ahead of relfrozenxid:
*
* ["D" is a VACUUM (ONLY_DATABASE_STATS)]
* ["R" is a VACUUM tbl]
* D: vac_update_datfrozenid() -> systable_beginscan(pg_class)
* D: systable_getnext() returns pg_class tuple of tbl
* R: memcpy() into pg_class tuple of tbl
* D: raise pg_database.datfrozenxid, XLogInsert(), finish
* [crash]
* [recovery restores datfrozenxid w/o relfrozenxid]
*/
Might solve this by inplace update setting DELAY_CHKPT, writing WAL, and
finally issuing memcpy() into the buffer.
That fix worked. Along with that, I'm attaching a not-for-commit patch with a
test case and one with the fix rebased on that test case. Apply on top of the
v2 patch stack from /messages/by-id/20240617235854.f8.nmisch@google.com.
This gets key testing from 027_stream_regress.pl; when I commented out some
memcpy lines of the heapam.c change, that test caught it.
This resolves the last inplace update defect known to me.
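For readers skimming the diff, the resulting order of operations in
heap_inplace_update_finish() is, in outline (a condensed sketch of the
attached patch, with the inval construction and error handling elided):

    Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
    START_CRIT_SECTION();
    MyProc->delayChkptFlags |= DELAY_CHKPT_START;
    if (RelationNeedsWAL(relation))
    {
        /* build the post-mutation block image on the stack; log that */
        memcpy(copied_buffer.data, origdata, lower);
        memcpy(copied_buffer.data + upper, origdata + upper, BLCKSZ - upper);
        memcpy(copied_buffer.data + (dst - origdata), src, newlen);
        XLogRegisterBlock(0, &rlocator, forkno, blkno, copied_buffer.data,
                          REGBUF_STANDARD);
        recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_INPLACE);
        PageSetLSN(page, recptr);
    }
    memcpy(dst, src, newlen);   /* only now can other sessions see it */
    MarkBufferDirty(buffer);
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
    MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
    END_CRIT_SECTION();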
Thanks,
nm
Attachments:
inplace180-datfrozenxid-overtakes-relfrozenxid-v1.patch (text/plain, us-ascii)
Author: Noah Misch <noah@leadboat.com>
Commit: Noah Misch <noah@leadboat.com>
WAL-log inplace update before revealing it to other sessions.
A buffer lock won't stop a reader having already checked tuple
visibility. If a vac_update_datfrozenid() and then a crash happened
during inplace update of a relfrozenxid value, datfrozenxid could
overtake relfrozenxid. Back-patch to v12 (all supported versions).
Reviewed by FIXME.
Discussion: https://postgr.es/m/FIXME
diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock
index fb06ff2..aec8dcc 100644
--- a/src/backend/access/heap/README.tuplock
+++ b/src/backend/access/heap/README.tuplock
@@ -201,6 +201,4 @@ Inplace updates create an exception to the rule that tuple data won't change
under a reader holding a pin. A reader of a heap_fetch() result tuple may
witness a torn read. Current inplace-updated fields are aligned and are no
wider than four bytes, and current readers don't need consistency across
-fields. Hence, they get by with just fetching each field once. XXX such a
-caller may also read a value that has not reached WAL; see
-heap_inplace_update_finish().
+fields. Hence, they get by with just fetching each field once.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d7e417f..2a5fea5 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6305,6 +6305,8 @@ heap_inplace_update_finish(void *state, HeapTuple tuple)
Relation relation = scan->heap_rel;
uint32 oldlen;
uint32 newlen;
+ char *dst;
+ char *src;
int nmsgs = 0;
SharedInvalidationMessage *invalMessages = NULL;
bool RelcacheInitFileInval = false;
@@ -6315,6 +6317,9 @@ heap_inplace_update_finish(void *state, HeapTuple tuple)
if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
elog(ERROR, "wrong tuple length");
+ dst = (char *) htup + htup->t_hoff;
+ src = (char *) tuple->t_data + tuple->t_data->t_hoff;
+
/*
* Construct shared cache inval if necessary. Note that because we only
* pass the new version of the tuple, this mustn't be used for any
@@ -6338,15 +6343,15 @@ heap_inplace_update_finish(void *state, HeapTuple tuple)
*/
PreInplace_Inval();
- /* NO EREPORT(ERROR) from here till changes are logged */
- START_CRIT_SECTION();
-
- memcpy((char *) htup + htup->t_hoff,
- (char *) tuple->t_data + tuple->t_data->t_hoff,
- newlen);
-
/*----------
- * XXX A crash here can allow datfrozenxid() to get ahead of relfrozenxid:
+ * NO EREPORT(ERROR) from here till changes are complete
+ *
+ * Our buffer lock won't stop a reader having already pinned and checked
+ * visibility for this tuple. Hence, we write WAL first, then mutate the
+ * buffer. Like in MarkBufferDirtyHint() or RecordTransactionCommit(),
+ * checkpoint delay makes that acceptable. With the usual order of
+ * changes, a crash after memcpy() and before XLogInsert() could allow
+ * datfrozenxid to overtake relfrozenxid:
*
* ["D" is a VACUUM (ONLY_DATABASE_STATS)]
* ["R" is a VACUUM tbl]
@@ -6356,14 +6361,28 @@ heap_inplace_update_finish(void *state, HeapTuple tuple)
* D: raise pg_database.datfrozenxid, XLogInsert(), finish
* [crash]
* [recovery restores datfrozenxid w/o relfrozenxid]
+ *
+ * Like in MarkBufferDirtyHint() subroutine XLogSaveBufferForHint(), copy
+ * the buffer to the stack before logging. Here, that facilitates a FPI
+ * of the post-mutation block before we accept other sessions seeing it.
*/
-
- MarkBufferDirty(buffer);
+ Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
+ START_CRIT_SECTION();
+ MyProc->delayChkptFlags |= DELAY_CHKPT_START;
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
xl_heap_inplace xlrec;
+ PGAlignedBlock copied_buffer;
+ char *origdata = (char *) BufferGetBlock(buffer);
+ Page page = BufferGetPage(buffer);
+ uint16 lower = ((PageHeader) page)->pd_lower;
+ uint16 upper = ((PageHeader) page)->pd_upper;
+ uintptr_t dst_offset_in_block;
+ RelFileLocator rlocator;
+ ForkNumber forkno;
+ BlockNumber blkno;
XLogRecPtr recptr;
xlrec.offnum = ItemPointerGetOffsetNumber(&tuple->t_self);
@@ -6378,16 +6397,28 @@ heap_inplace_update_finish(void *state, HeapTuple tuple)
XLogRegisterData((char *) invalMessages,
nmsgs * sizeof(SharedInvalidationMessage));
- XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
- XLogRegisterBufData(0, (char *) htup + htup->t_hoff, newlen);
+ /* register block matching what buffer will look like after changes */
+ memcpy(copied_buffer.data, origdata, lower);
+ memcpy(copied_buffer.data + upper, origdata + upper, BLCKSZ - upper);
+ dst_offset_in_block = dst - origdata;
+ memcpy(copied_buffer.data + dst_offset_in_block, src, newlen);
+ BufferGetTag(buffer, &rlocator, &forkno, &blkno);
+ Assert(forkno == MAIN_FORKNUM);
+ XLogRegisterBlock(0, &rlocator, forkno, blkno, copied_buffer.data,
+ REGBUF_STANDARD);
+ XLogRegisterBufData(0, src, newlen);
/* inplace updates aren't decoded atm, don't log the origin */
recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_INPLACE);
- PageSetLSN(BufferGetPage(buffer), recptr);
+ PageSetLSN(page, recptr);
}
+ memcpy(dst, src, newlen);
+
+ MarkBufferDirty(buffer);
+
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
/*
@@ -6400,6 +6431,7 @@ heap_inplace_update_finish(void *state, HeapTuple tuple)
*/
AtInplace_Inval();
+ MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
END_CRIT_SECTION();
UnlockTuple(relation, &tuple->t_self, InplaceUpdateTupleLock);
systable_endscan(scan);
demo-inplace170-datfrozenxid-overtakes-relfrozenxid-test-v1.patch (text/plain, us-ascii)
Author: Noah Misch <noah@leadboat.com>
Commit: Noah Misch <noah@leadboat.com>
[not for commit] demonstrate datfrozenxid overtaking relfrozenxid
Components that might be separate patches if committing:
- Allow injection points in critical sections
- Emit DEBUG1 before waiting on injection point, so TAP test can poll for it
- BackgroundPsql: add feature to wait for stderr content, not just stdout
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d7e417f..c4d18ad 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6344,6 +6344,7 @@ heap_inplace_update_finish(void *state, HeapTuple tuple)
memcpy((char *) htup + htup->t_hoff,
(char *) tuple->t_data + tuple->t_data->t_hoff,
newlen);
+ INJECTION_POINT("inplace-after-mempcy");
/*----------
* XXX A crash here can allow datfrozenxid() to get ahead of relfrozenxid:
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index d299a25..c4020e3 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -37,6 +37,7 @@
#include "catalog/index.h"
#include "catalog/namespace.h"
#include "catalog/pg_database.h"
+#include "catalog/pg_enum_d.h"
#include "catalog/pg_inherits.h"
#include "commands/cluster.h"
#include "commands/defrem.h"
@@ -56,6 +57,7 @@
#include "utils/fmgroids.h"
#include "utils/guc.h"
#include "utils/guc_hooks.h"
+#include "utils/injection_point.h"
#include "utils/memutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
@@ -1575,7 +1577,7 @@ vac_update_datfrozenxid(void)
{
HeapTuple tuple;
Form_pg_database dbform;
- Relation relation;
+ Relation relation, dbrelation;
SysScanDesc scan;
HeapTuple classTup;
TransactionId newFrozenXid;
@@ -1629,11 +1631,24 @@ vac_update_datfrozenxid(void)
scan = systable_beginscan(relation, InvalidOid, false,
NULL, 0, NULL);
+ /*
+ * This may process invals that need pg_class buffer locks, so get it out
+ * of the way.
+ */
+ dbrelation = table_open(DatabaseRelationId, RowExclusiveLock);
+
while ((classTup = systable_getnext(scan)) != NULL)
{
volatile FormData_pg_class *classForm = (Form_pg_class) GETSTRUCT(classTup);
- TransactionId relfrozenxid = classForm->relfrozenxid;
- TransactionId relminmxid = classForm->relminmxid;
+ TransactionId relfrozenxid;
+ TransactionId relminmxid;
+
+#ifdef USE_INJECTION_POINTS
+ if (classForm->oid == EnumRelationId)
+ INJECTION_POINT("update_datfrozenxid-before-fetch-relfrozenxid");
+#endif
+ relfrozenxid = classForm->relfrozenxid;
+ relminmxid = classForm->relminmxid;
/*
* Only consider relations able to hold unfrozen XIDs (anything else
@@ -1706,7 +1721,8 @@ vac_update_datfrozenxid(void)
Assert(MultiXactIdIsValid(newMinMulti));
/* Now fetch the pg_database tuple we need to update. */
- relation = table_open(DatabaseRelationId, RowExclusiveLock);
+ /* relation = table_open(DatabaseRelationId, RowExclusiveLock); */
+ relation = dbrelation;
/*
* Fetch a copy of the tuple to scribble on. We could check the syscache
diff --git a/src/backend/utils/misc/injection_point.c b/src/backend/utils/misc/injection_point.c
index 5c2a0d2..64720f3 100644
--- a/src/backend/utils/misc/injection_point.c
+++ b/src/backend/utils/misc/injection_point.c
@@ -291,11 +291,18 @@ void
InjectionPointRun(const char *name)
{
#ifdef USE_INJECTION_POINTS
+ bool reset_allow = false;
InjectionPointEntry *entry_by_name;
bool found;
InjectionPointCallback injection_callback;
const void *private_data;
+ if (CritSectionCount > 0 && !MemoryContextAllowAllInCriticalSection)
+ {
+ reset_allow = true;
+ MemoryContextAllowAllInCriticalSection = true;
+ }
+
LWLockAcquire(InjectionPointLock, LW_SHARED);
entry_by_name = (InjectionPointEntry *)
hash_search(InjectionPointHash, name,
@@ -309,7 +316,7 @@ InjectionPointRun(const char *name)
if (!found)
{
injection_point_cache_remove(name);
- return;
+ goto out;
}
/*
@@ -344,6 +351,11 @@ InjectionPointRun(const char *name)
/* Now loaded, so get it. */
injection_callback = injection_point_cache_get(name, &private_data);
injection_callback(name, private_data);
+
+out:
+ /* elog(ERROR) would have become PANIC, so we never miss this reset */
+ if (reset_allow)
+ MemoryContextAllowAllInCriticalSection = false;
#else
elog(ERROR, "Injection points are not supported by this build");
#endif
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index b42ecec..dda3dcb 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -157,6 +157,17 @@ MemoryContext CurTransactionContext = NULL;
/* This is a transient link to the active portal's memory context: */
MemoryContext PortalContext = NULL;
+/*
+ * If true, suppress all assertions about allocations in critical sections.
+ * Behave as though MemoryContextAllowInCriticalSection() had been called for
+ * every context, and permit new contexts. Caller is responsible for
+ * restoring value on error. The intended use case is debugging aids that
+ * can't use a single-purpose context. For example, LockAcquire() allocates
+ * in TopMemoryContext, so a debugging aid that calls it has no clean way to
+ * redirect that allocation.
+ */
+bool MemoryContextAllowAllInCriticalSection = false;
+
static void MemoryContextDeleteOnly(MemoryContext context);
static void MemoryContextCallResetCallbacks(MemoryContext context);
static void MemoryContextStatsInternal(MemoryContext context, int level,
@@ -173,7 +184,8 @@ static void MemoryContextStatsPrint(MemoryContext context, void *passthru,
* rule, the allocation functions Assert that.
*/
#define AssertNotInCriticalSection(context) \
- Assert(CritSectionCount == 0 || (context)->allowInCritSection)
+ Assert(CritSectionCount == 0 || (context)->allowInCritSection || \
+ MemoryContextAllowAllInCriticalSection)
/*
* Call the given function in the MemoryContextMethods for the memory context
@@ -1103,8 +1115,7 @@ MemoryContextCreate(MemoryContext node,
MemoryContext parent,
const char *name)
{
- /* Creating new memory contexts is not allowed in a critical section */
- Assert(CritSectionCount == 0);
+ Assert(CritSectionCount == 0 || MemoryContextAllowAllInCriticalSection);
/* Initialize all standard fields of memory context header */
node->type = tag;
diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h
index cd9596f..9889225 100644
--- a/src/include/utils/memutils.h
+++ b/src/include/utils/memutils.h
@@ -93,6 +93,8 @@ extern void MemoryContextStatsDetail(MemoryContext context,
extern void MemoryContextAllowInCriticalSection(MemoryContext context,
bool allow);
+extern PGDLLIMPORT bool MemoryContextAllowAllInCriticalSection;
+
#ifdef MEMORY_CONTEXT_CHECKING
extern void MemoryContextCheck(MemoryContext context);
#endif
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 2ffd2f7..4b75d27 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,6 +10,7 @@ REGRESS = injection_points
REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
ISOLATION = inplace
+TAP_TESTS = 1
# The injection points are cluster-wide, so disable installcheck
NO_INSTALLCHECK = 1
diff --git a/src/test/modules/injection_points/injection_points.c b/src/test/modules/injection_points/injection_points.c
index 1b695a1..1b1cb56 100644
--- a/src/test/modules/injection_points/injection_points.c
+++ b/src/test/modules/injection_points/injection_points.c
@@ -238,6 +238,8 @@ injection_wait(const char *name, const void *private_data)
elog(ERROR, "could not find free slot for wait of injection point %s ",
name);
+ elog(DEBUG1, "waiting at injection point %s", name);
+
/* And sleep.. */
ConditionVariablePrepareToSleep(&inj_state->wait_point);
for (;;)
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 3c23c14..86639c5 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -42,4 +42,9 @@ tests += {
'inplace',
],
},
+ 'tap': {
+ 'tests': [
+ 't/001_inplace.pl',
+ ],
+ },
}
diff --git a/src/test/modules/injection_points/t/001_inplace.pl b/src/test/modules/injection_points/t/001_inplace.pl
new file mode 100644
index 0000000..fc9380d
--- /dev/null
+++ b/src/test/modules/injection_points/t/001_inplace.pl
@@ -0,0 +1,129 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test race between vac_update_datfrozenxid() pg_class scan and inplace update
+# of relfrozenxid:
+#
+# ["D" is a VACUUM (ONLY_DATABASE_STATS)]
+# ["R" is a VACUUM tbl]
+# D: vac_update_datfrozenid() -> systable_beginscan(pg_class)
+# D: systable_getnext() returns pg_class tuple of tbl
+# R: memcpy() into pg_class tuple of tbl
+# D: raise pg_database.datfrozenxid, XLogInsert(), finish
+# [crash]
+# [recovery restores datfrozenxid w/o relfrozenxid]
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('only');
+# We won't use a replica, but this is handy for (a) pg_waldump on the archive
+# and (b) pg_waldump showing the logged invalidations.
+$node->init(has_archiving => 1, allows_streaming => 1);
+$node->start;
+
+# Choice of table doesn't greatly matter, but injection point triggers on this
+# one. Using a catalog gives the injection point a trivial OID-based test.
+my $frozenxid_table = 'pg_enum';
+my $dat_inj = 'update_datfrozenxid-before-fetch-relfrozenxid';
+my $inplace_inj = 'inplace-after-mempcy';
+
+# session locks one table, as a way to VACUUM all database tables but this one
+my $lock_table = $node->background_psql('postgres');
+# session waits in vac_update_datfrozenxid()
+my $update_dat = $node->background_psql('postgres');
+# session waits in inplace update of vac_update_relstats()
+my $update_rel = $node->background_psql('postgres');
+# session detaches another session
+my $detacher = $node->background_psql('postgres', on_error_stop => 0);
+
+my $start_id;
+$node->psql(
+ 'postgres', qq(
+ SELECT relfrozenxid, datfrozenxid
+ FROM pg_class, pg_database
+ WHERE relname = '$frozenxid_table' AND datname = current_catalog;
+), stdout => \$start_id);
+print STDOUT "relfrozenxid, datfrozenxid: $start_id\n";
+
+# populate syscaches; query fails, but that's fine
+$detacher->query(
+ qq(
+ CREATE EXTENSION injection_points;
+ SELECT injection_points_detach('$dat_inj');
+));
+
+# marker in WAL stream, for pg_waldump convenience
+$node->safe_psql(
+ 'postgres', qq(
+ SELECT pg_logical_emit_message(false, 'before', '.', true);
+));
+
+$lock_table->query_safe(
+ "BEGIN; LOCK TABLE $frozenxid_table IN SHARE UPDATE EXCLUSIVE MODE");
+$update_rel->query_safe("SELECT txid_current()");
+# update relfrozenxid for all tables but $frozenxid_table
+$update_rel->query_safe("VACUUM (FREEZE, SKIP_LOCKED, SKIP_DATABASE_STATS);");
+# start ONLY_DATABASE_STATS and wait for it to reach injection point
+$update_dat->query_until(
+ qr/at injection point/, qq(
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('$dat_inj', 'wait');
+ SET client_min_messages = debug1;
+ VACUUM (ONLY_DATABASE_STATS);
+));
+
+# start one-table VACUUM and wait for it to reach injection point
+$lock_table->quit;
+$update_rel->query_until(
+ qr/at injection point/, qq(
+ SELECT injection_points_set_local();
+ SELECT injection_points_attach('$inplace_inj', 'wait');
+ SET client_min_messages = debug1;
+ VACUUM (FREEZE, SKIP_DATABASE_STATS) $frozenxid_table;
+));
+
+# complete ONLY_DATABASE_STATS
+$detacher->query(
+ qq(
+ SELECT injection_points_detach('$dat_inj');
+ SELECT injection_points_wakeup('$dat_inj');
+));
+$detacher->quit;
+# flush WAL, since VACUUM did an asynchronous commit
+$update_dat->query(
+ qq(
+ SELECT pg_logical_emit_message(false, 'after', '.', true);
+));
+$update_dat->quit;
+
+# crash with $inplace_inj still waiting
+$node->stop('immediate');
+# don't quit update_rel, which would suffer EPIPE
+#$update_rel->quit;
+$node->start;
+
+# check for corruption
+my $stdout;
+$node->psql(
+ 'postgres', q(
+ SELECT pg_class.oid::regclass
+ FROM pg_class, pg_database
+ WHERE age(relfrozenxid) > age(datfrozenxid)
+ AND relkind = 'r' AND datname = current_catalog;
+), stdout => \$stdout);
+is($stdout, '', 'datfrozenxid younger than every relfrozenxid');
+
+# print more detail
+$node->psql(
+ 'postgres', qq(
+ SELECT relfrozenxid, datfrozenxid
+ FROM pg_class, pg_database
+ WHERE relname = '$frozenxid_table' AND datname = current_catalog;
+), stdout => \$stdout);
+isnt($stdout, $start_id, 'frozenxid values changed');
+print STDOUT "relfrozenxid, datfrozenxid: $stdout\n";
+
+done_testing();
diff --git a/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm b/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
index 4091c31..c3f55bb 100644
--- a/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
+++ b/src/test/perl/PostgreSQL/Test/BackgroundPsql.pm
@@ -260,10 +260,10 @@ sub query_safe
=item $session->query_until(until, query)
-Issue C<query> and wait for C<until> appearing in the query output rather than
-waiting for query completion. C<query> needs to end with newline and semicolon
-(if applicable, interactive psql input may not require it) for psql to process
-the input.
+Issue C<query> and wait for C<until> appearing in the query stdout or stderr
+rather than waiting for query completion. C<query> needs to end with newline
+and semicolon (if applicable, interactive psql input may not require it) for
+psql to process the input.
=cut
@@ -276,7 +276,8 @@ sub query_until
$self->{timeout}->start() if (defined($self->{query_timer_restart}));
$self->{stdin} .= $query;
- pump_until($self->{run}, $self->{timeout}, \$self->{stdout}, $until);
+ pump_until($self->{run}, $self->{timeout},
+ [ \$self->{stdout}, \$self->{stderr} ], $until);
die "psql query timed out" if $self->{timeout}->is_expired;
diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm b/src/test/perl/PostgreSQL/Test/Utils.pm
index 022b44b..d1791ee 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -427,29 +427,39 @@ sub run_command
=item pump_until(proc, timeout, stream, until)
-Pump until string is matched on the specified stream, or timeout occurs.
+Pump until string is matched on the specified stream(s), or timeout occurs.
+The stream argument can be a scalar ref, like \$stdout, or a ref to an array
+of scalar refs, like [\$stdout, \$stderr].
=cut
sub pump_until
{
- my ($proc, $timeout, $stream, $until) = @_;
+ my ($proc, $timeout, $stream_spec, $until) = @_;
$proc->pump_nb();
- while (1)
+ $stream_spec = [$stream_spec] if ref $stream_spec ne 'ARRAY';
+ OUTER: while (1)
{
- last if $$stream =~ /$until/;
+ foreach my $stream (@$stream_spec)
+ {
+ last OUTER if $$stream =~ /$until/;
+ }
if ($timeout->is_expired)
{
diag(
- "pump_until: timeout expired when searching for \"$until\" with stream: \"$$stream\""
- );
+ "pump_until: timeout expired when searching for \"$until\" with stream(s): "
+ . '"'
+ . join('", "', map { $$_ } @$stream_spec)
+ . '"');
return 0;
}
if (not $proc->pumpable())
{
diag(
- "pump_until: process terminated unexpectedly when searching for \"$until\" with stream: \"$$stream\""
- );
+ "pump_until: process terminated unexpectedly when searching for \"$until\" with stream(s): "
+ . '"'
+ . join('", "', map { $$_ } @$stream_spec)
+ . '"');
return 0;
}
$proc->pump();
demo-inplace180-datfrozenxid-overtakes-relfrozenxid-fix-v1.patch (text/plain, us-ascii)
Author: Noah Misch <noah@leadboat.com>
Commit: Noah Misch <noah@leadboat.com>
[not for commit] cherry-pick "WAL-log inplace update before" fix onto the demo
This just resolves a merge conflict around the INJECTION_POINT() line.
diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock
index fb06ff2..aec8dcc 100644
--- a/src/backend/access/heap/README.tuplock
+++ b/src/backend/access/heap/README.tuplock
@@ -201,6 +201,4 @@ Inplace updates create an exception to the rule that tuple data won't change
under a reader holding a pin. A reader of a heap_fetch() result tuple may
witness a torn read. Current inplace-updated fields are aligned and are no
wider than four bytes, and current readers don't need consistency across
-fields. Hence, they get by with just fetching each field once. XXX such a
-caller may also read a value that has not reached WAL; see
-heap_inplace_update_finish().
+fields. Hence, they get by with just fetching each field once.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c4d18ad..fe7489b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6305,6 +6305,8 @@ heap_inplace_update_finish(void *state, HeapTuple tuple)
Relation relation = scan->heap_rel;
uint32 oldlen;
uint32 newlen;
+ char *dst;
+ char *src;
int nmsgs = 0;
SharedInvalidationMessage *invalMessages = NULL;
bool RelcacheInitFileInval = false;
@@ -6315,6 +6317,9 @@ heap_inplace_update_finish(void *state, HeapTuple tuple)
if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff)
elog(ERROR, "wrong tuple length");
+ dst = (char *) htup + htup->t_hoff;
+ src = (char *) tuple->t_data + tuple->t_data->t_hoff;
+
/*
* Construct shared cache inval if necessary. Note that because we only
* pass the new version of the tuple, this mustn't be used for any
@@ -6338,16 +6343,15 @@ heap_inplace_update_finish(void *state, HeapTuple tuple)
*/
PreInplace_Inval();
- /* NO EREPORT(ERROR) from here till changes are logged */
- START_CRIT_SECTION();
-
- memcpy((char *) htup + htup->t_hoff,
- (char *) tuple->t_data + tuple->t_data->t_hoff,
- newlen);
- INJECTION_POINT("inplace-after-mempcy");
-
/*----------
- * XXX A crash here can allow datfrozenxid() to get ahead of relfrozenxid:
+ * NO EREPORT(ERROR) from here till changes are complete
+ *
+ * Our buffer lock won't stop a reader having already pinned and checked
+ * visibility for this tuple. Hence, we write WAL first, then mutate the
+ * buffer. Like in MarkBufferDirtyHint() or RecordTransactionCommit(),
+ * checkpoint delay makes that acceptable. With the usual order of
+ * changes, a crash after memcpy() and before XLogInsert() could allow
+ * datfrozenxid to overtake relfrozenxid:
*
* ["D" is a VACUUM (ONLY_DATABASE_STATS)]
* ["R" is a VACUUM tbl]
@@ -6357,14 +6361,28 @@ heap_inplace_update_finish(void *state, HeapTuple tuple)
* D: raise pg_database.datfrozenxid, XLogInsert(), finish
* [crash]
* [recovery restores datfrozenxid w/o relfrozenxid]
+ *
+ * Like in MarkBufferDirtyHint() subroutine XLogSaveBufferForHint(), copy
+ * the buffer to the stack before logging. Here, that facilitates a FPI
+ * of the post-mutation block before we accept other sessions seeing it.
*/
-
- MarkBufferDirty(buffer);
+ Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
+ START_CRIT_SECTION();
+ MyProc->delayChkptFlags |= DELAY_CHKPT_START;
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
xl_heap_inplace xlrec;
+ PGAlignedBlock copied_buffer;
+ char *origdata = (char *) BufferGetBlock(buffer);
+ Page page = BufferGetPage(buffer);
+ uint16 lower = ((PageHeader) page)->pd_lower;
+ uint16 upper = ((PageHeader) page)->pd_upper;
+ uintptr_t dst_offset_in_block;
+ RelFileLocator rlocator;
+ ForkNumber forkno;
+ BlockNumber blkno;
XLogRecPtr recptr;
xlrec.offnum = ItemPointerGetOffsetNumber(&tuple->t_self);
@@ -6379,16 +6397,29 @@ heap_inplace_update_finish(void *state, HeapTuple tuple)
XLogRegisterData((char *) invalMessages,
nmsgs * sizeof(SharedInvalidationMessage));
- XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
- XLogRegisterBufData(0, (char *) htup + htup->t_hoff, newlen);
+ /* register block matching what buffer will look like after changes */
+ memcpy(copied_buffer.data, origdata, lower);
+ memcpy(copied_buffer.data + upper, origdata + upper, BLCKSZ - upper);
+ dst_offset_in_block = dst - origdata;
+ memcpy(copied_buffer.data + dst_offset_in_block, src, newlen);
+ BufferGetTag(buffer, &rlocator, &forkno, &blkno);
+ Assert(forkno == MAIN_FORKNUM);
+ XLogRegisterBlock(0, &rlocator, forkno, blkno, copied_buffer.data,
+ REGBUF_STANDARD);
+ XLogRegisterBufData(0, src, newlen);
/* inplace updates aren't decoded atm, don't log the origin */
recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_INPLACE);
- PageSetLSN(BufferGetPage(buffer), recptr);
+ PageSetLSN(page, recptr);
}
+ memcpy(dst, src, newlen);
+ INJECTION_POINT("inplace-after-mempcy");
+
+ MarkBufferDirty(buffer);
+
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
/*
@@ -6401,6 +6432,7 @@ heap_inplace_update_finish(void *state, HeapTuple tuple)
*/
AtInplace_Inval();
+ MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
END_CRIT_SECTION();
UnlockTuple(relation, &tuple->t_self, InplaceUpdateTupleLock);
systable_endscan(scan);
On 20 Jun 2024, at 06:29, Noah Misch <noah@leadboat.com> wrote:
This resolves the last inplace update defect known to me.
That’s a huge amount of work, thank you!
Do I get it right that inplace updates are catalog-specific, and some other OOM corruptions [0] and Standby corruptions [1] are not related to this fix? Both cases we observed on regular tables.
Or might that be an effect of vacuum deepening corruption after observing a wrong datfrozenxid?
Best regards, Andrey Borodin.
[0]: /messages/by-id/67EADE8F-AEA6-4B73-8E38-A69E5D48BAFE@yandex-team.ru
[1]: /messages/by-id/CAFj8pRBEFMxxFSCVOSi-4n0jHzSaxh6Ze_cZid5eG=tsnn49-A@mail.gmail.com
On Thu, Jun 20, 2024 at 12:17:44PM +0500, Andrey M. Borodin wrote:
On 20 Jun 2024, at 06:29, Noah Misch <noah@leadboat.com> wrote:
This resolves the last inplace update defect known to me.
That’s a huge amount of work, thank you!
Do I get it right that inplace updates are catalog-specific, and some other OOM corruptions [0] and Standby corruptions [1] are not related to this fix? Both cases we observed on regular tables.
In core code, inplace updates are specific to pg_class and pg_database.
Adding PGXN modules, only the citus extension uses them on some other table.
[0]: definitely looks unrelated.
Or might that be an effect of vacuum deepening corruption after observing a wrong datfrozenxid?
Wrong datfrozenxid can cause premature clog truncation, which can cause "could
not access status of transaction". While $SUBJECT could cause that, I think
it would happen on both primary and standby. [1] seems to be about a standby
lacking clog present on the primary, which is unrelated.
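(For the mechanism: vac_update_datfrozenxid() finishes by calling
vac_truncate_clog(), which computes the minimum datfrozenxid across all of
pg_database and truncates commit status behind it. An outline from memory,
not the exact code:

    /* tail of vac_update_datfrozenxid(), roughly */
    vac_truncate_clog(newFrozenXid, newMinMulti, ...);
        /* scans pg_database for the oldest datfrozenxid, then calls */
        TruncateCLOG(frozenXID, oldestxid_datoid);

Hence a datfrozenxid that wrongly overtakes some relfrozenxid can let
TruncateCLOG() discard status bits that tuples in that table still need.)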
On Wed, Jun 19, 2024 at 06:29:08PM -0700, Noah Misch wrote:
/messages/by-id/20240512232923.aa.nmisch@google.com wrote:
Separable, nontrivial things not fixed in the attached patch stack:
- Trouble is possible, I bet, if the system crashes between the inplace-update
memcpy() and XLogInsert(). See the new XXX comment below the memcpy().
That comment:
/*----------
* XXX A crash here can allow datfrozenxid() to get ahead of relfrozenxid:
*
* ["D" is a VACUUM (ONLY_DATABASE_STATS)]
* ["R" is a VACUUM tbl]
* D: vac_update_datfrozenid() -> systable_beginscan(pg_class)
* D: systable_getnext() returns pg_class tuple of tbl
* R: memcpy() into pg_class tuple of tbl
* D: raise pg_database.datfrozenxid, XLogInsert(), finish
* [crash]
* [recovery restores datfrozenxid w/o relfrozenxid]
*/
Might solve this by inplace update setting DELAY_CHKPT, writing WAL, and
finally issuing memcpy() into the buffer.
That fix worked.
I pushed that as commit 8e7e672 (2024-10-25). I now think DELAY_CHKPT_START
is superfluous here, per this proc.h comment:
* (In the
* extremely common case where the data being modified is in shared buffers
* and we acquire an exclusive content lock on the relevant buffers before
* writing WAL, this mechanism is not needed, because phase 2 will block
* until we release the content lock and then flush the modified data to
* disk.)
heap_inplace_update_and_unlock() meets those conditions. Its closest
precedent is XLogSaveBufferForHint(), which does need DELAY_CHKPT_START due to
having only BUFFER_LOCK_SHARED. heap_inplace_update_and_unlock() has
BUFFER_LOCK_EXCLUSIVE, so it doesn't need DELAY_CHKPT_START.
I'm considering two options:
1. Stop using DELAY_CHKPT_START in heap_inplace_update_and_unlock().
2. Add a comment. BUFFER_LOCK_EXCLUSIVE entitles
heap_inplace_update_and_unlock() to omit DELAY_CHKPT_START. Instead, the
function elects to keep itself aligned with XLogSaveBufferForHint(), to
make analysis simpler.
I'm leaning toward (2), since inplace update isn't frequent enough that it
needs to pursue small optimizations. Would anyone strongly prefer (1)?
On Sun, Apr 06, 2025 at 11:00:54AM -0700, Noah Misch wrote:
I pushed that as commit 8e7e672 (2024-10-25). I now think DELAY_CHKPT_START
is superfluous here, per this proc.h comment:
* (In the
* extremely common case where the data being modified is in shared buffers
* and we acquire an exclusive content lock on the relevant buffers before
* writing WAL, this mechanism is not needed, because phase 2 will block
* until we release the content lock and then flush the modified data to
* disk.)
heap_inplace_update_and_unlock() meets those conditions. Its closest
precedent is XLogSaveBufferForHint(), which does need DELAY_CHKPT_START due to
having only BUFFER_LOCK_SHARED.
True so far, but ...
heap_inplace_update_and_unlock() has
BUFFER_LOCK_EXCLUSIVE, so it doesn't need DELAY_CHKPT_START.
... not so, because we've not yet done MarkBufferDirty(). transam/README
cites SyncOneBuffer(), which says:
* Check whether buffer needs writing.
*
* We can make this check without taking the buffer content lock so long
* as we mark pages dirty in access methods *before* logging changes with
* XLogInsert(): if someone marks the buffer dirty just after our check we
* don't worry because our checkpoint.redo points before log record for
* upcoming changes and so we are not required to write such dirty buffer.
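In code terms, the two workable orderings look like this (a sketch only;
B is the hypothetical alternative the attached patch's comment describes,
not current code):

    /* A: what heap_inplace_update_and_unlock() does today */
    MyProc->delayChkptFlags |= DELAY_CHKPT_START;
    XLogInsert(...);            /* log first ... */
    memcpy(dst, src, newlen);   /* ... then reveal the new value */
    MarkBufferDirty(buffer);
    MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;

    /* B: could omit DELAY_CHKPT_START, per SyncOneBuffer() */
    MarkBufferDirty(buffer);    /* dirty before logging, so a checkpoint
                                 * either writes this buffer out or puts
                                 * its redo point before our record */
    XLogInsert(...);
    memcpy(dst, src, newlen);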
The attached patch updates the aforementioned proc.h comment and the
heap_inplace_update_and_unlock() comment that my last message proposed.
Attachments:
inplace280-comment-delay-v1.patch (text/plain, us-ascii)
From: Noah Misch <noah@leadboat.com>
Comment on need to MarkBufferDirty() if omitting DELAY_CHKPT_START.
Blocking checkpoint phase 2 requires MarkBufferDirty() and
BUFFER_LOCK_EXCLUSIVE; neither suffices by itself. transam/README documents
this, citing SyncOneBuffer(). Update the DELAY_CHKPT_START documentation to
say this. Expand the heap_inplace_update_and_unlock() comment that cites
XLogSaveBufferForHint() as precedent, since heap_inplace_update_and_unlock()
could have opted not to use DELAY_CHKPT_START.
Commit 8e7e672cdaa6bfec85d4d5dd9be84159df23bb41 added DELAY_CHKPT_START to
heap_inplace_update_and_unlock(). Since commit
bc6bad88572501aecaa2ac5d4bc900ac0fd457d5 reverted it in non-master branches,
no back-patch.
Discussion: https://postgr.es/m/20250406180054.26.nmisch@google.com
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ed2e302..c1a4de1 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6507,9 +6507,17 @@ heap_inplace_update_and_unlock(Relation relation,
* [crash]
* [recovery restores datfrozenxid w/o relfrozenxid]
*
- * Like in MarkBufferDirtyHint() subroutine XLogSaveBufferForHint(), copy
- * the buffer to the stack before logging. Here, that facilitates a FPI
- * of the post-mutation block before we accept other sessions seeing it.
+ * Mimic MarkBufferDirtyHint() subroutine XLogSaveBufferForHint().
+ * Specifically, use DELAY_CHKPT_START, and copy the buffer to the stack.
+ * The stack copy facilitates a FPI of the post-mutation block before we
+ * accept other sessions seeing it. DELAY_CHKPT_START allows us to
+ * XLogInsert() before MarkBufferDirty(). Since XLogSaveBufferForHint()
+ * can operate under BUFFER_LOCK_SHARED, it can't avoid DELAY_CHKPT_START.
+ * This function, however, likely could avoid it with the following order
+ * of operations: MarkBufferDirty(), XLogInsert(), memcpy(). Opt to use
+ * DELAY_CHKPT_START here, too, as a way to have fewer distinct code
+ * patterns to analyze. Inplace update isn't so frequent that it should
+ * pursue the small optimization of skipping DELAY_CHKPT_START.
*/
Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0);
START_CRIT_SECTION();
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index f51b03d..86c5f99 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -110,10 +110,10 @@ extern PGDLLIMPORT int FastPathLockGroupsPerBackend;
* is inserted prior to the new redo point, the corresponding data changes will
* also be flushed to disk before the checkpoint can complete. (In the
* extremely common case where the data being modified is in shared buffers
- * and we acquire an exclusive content lock on the relevant buffers before
- * writing WAL, this mechanism is not needed, because phase 2 will block
- * until we release the content lock and then flush the modified data to
- * disk.)
+ * and we acquire an exclusive content lock and MarkBufferDirty() on the
+ * relevant buffers before writing WAL, this mechanism is not needed, because
+ * phase 2 will block until we release the content lock and then flush the
+ * modified data to disk. See transam/README and SyncOneBuffer().)
*
* Setting DELAY_CHKPT_COMPLETE prevents the system from moving from phase 2
* to phase 3. This is useful if we are performing a WAL-logged operation that