Multixid SLRU truncation bugs at wraparound

Started by Heikki Linnakangas2 months ago3 messages
#1Heikki Linnakangas
hlinnaka@iki.fi
2 attachment(s)

While working on the reported pg_upgrade failure at multixid wraparound
[1]: /messages/by-id/CACG=ezaApSMTjd=M2Sfn5Ucuggd3FG8Z8Qte8Xq9k5-+RQis-g@mail.gmail.com
run vacuum freeze, and it advances oldestMultiXactId, and nextMulti has
just wrapped around to 0, you get this in the log:

LOG: MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact 1 does not exist on disk

Culprit: TruncateMultiXact does this:

LWLockAcquire(MultiXactGenLock, LW_SHARED);
nextMulti = MultiXactState->nextMXact;
nextOffset = MultiXactState->nextOffset;
oldestMulti = MultiXactState->oldestMultiXactId;
LWLockRelease(MultiXactGenLock);
Assert(MultiXactIdIsValid(oldestMulti));

...

/*
* First, compute the safe truncation point for MultiXactMember.
This is
* the starting offset of the oldest multixact.
*
* Hopefully, find_multixact_start will always work here, because we've
* already checked that it doesn't precede the earliest MultiXact
on disk.
* But if it fails, don't truncate anything, and log a message.
*/
if (oldestMulti == nextMulti)
{
/* there are NO MultiXacts */
oldestOffset = nextOffset;
}
else if (!find_multixact_start(oldestMulti, &oldestOffset))
{
ereport(LOG,
(errmsg("oldest MultiXact %u not found, earliest
MultiXact %u, skipping truncation",
oldestMulti, earliest)));
LWLockRelease(MultiXactTruncationLock);
return;
}

Scenario 1: In the buggy scenario, oldestMulti is 1 and nextMulti is 0.
We should take the "there are NO MultiXacts" codepath in that case,
because we skip over 0 when assigning multixids. Instead, we call
find_multixact_start with oldestMulti==1, which returns false because
multixid 1 hasn't been assigned and the SLRU segment doesn't exist yet.
There's a similar bug in SetOffsetVacuumLimit().

Scenario 2: In scenario 1 we just fail to truncate the SLRUs and you get
the log message. But I think there might be more serious variants of
this. If the SLRU segment exists but the offset for multixid 1 hasn't
been set yet, find_multixact_start() will return 0 instead, and we will
proceed with the truncation based on incorrect oldestOffset==0 value,
possibly removing SLRU segments that are still needed.

Attached is a fix for scenarios 1 and 2, and a test case for scenario 1.

Scenario 3: I also noticed that the above code isn't prepared for the
race condition that the offset corresponding to 'oldestMulti' hasn't
been stored in the SLRU yet, even without wraparound. That could
theoretically happen if the backend executing
MultiXactIdCreateFromMembers() gets stuck for a long time between the
calls to GetNewMultiXactId() and RecordNewMultiXact(), but I think we're
saved by the fact that we only create new multixids while holding a lock
on a heap page, and a system-wide VACUUM FREEZE that would advance
oldestMulti would need to lock the heap page too. It's scary though,
because it could also lead to truncating away members SLRU segments that
are still needed. The attached patch does *not* address this scenario.

[1]: /messages/by-id/CACG=ezaApSMTjd=M2Sfn5Ucuggd3FG8Z8Qte8Xq9k5-+RQis-g@mail.gmail.com
/messages/by-id/CACG=ezaApSMTjd=M2Sfn5Ucuggd3FG8Z8Qte8Xq9k5-+RQis-g@mail.gmail.com

- Heikki

Attachments:

0001-Add-test-for-vacuuming-at-multixid-wraparound.patchtext/x-patch; charset=UTF-8; name=0001-Add-test-for-vacuuming-at-multixid-wraparound.patchDownload
From 557b22e931233e336704d04defee2e19c7706d1c Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 7 Nov 2025 17:21:26 +0200
Subject: [PATCH 1/2] Add test for vacuuming at multixid wraparound

This currently fails. The next commit fixes the failure.

This isn't fully polished, and I'm not sure if it's worth committing.
---
 src/test/modules/test_misc/meson.build        |   1 +
 .../test_misc/t/010_mxid_wraparound.pl        | 123 ++++++++++++++++++
 2 files changed, 124 insertions(+)
 create mode 100644 src/test/modules/test_misc/t/010_mxid_wraparound.pl

diff --git a/src/test/modules/test_misc/meson.build b/src/test/modules/test_misc/meson.build
index f258bf1ccd9..cf57ed21dc6 100644
--- a/src/test/modules/test_misc/meson.build
+++ b/src/test/modules/test_misc/meson.build
@@ -18,6 +18,7 @@ tests += {
       't/007_catcache_inval.pl',
       't/008_replslot_single_user.pl',
       't/009_log_temp_files.pl',
+      't/010_mxid_wraparound.pl',
     ],
   },
 }
diff --git a/src/test/modules/test_misc/t/010_mxid_wraparound.pl b/src/test/modules/test_misc/t/010_mxid_wraparound.pl
new file mode 100644
index 00000000000..487cb71eacc
--- /dev/null
+++ b/src/test/modules/test_misc/t/010_mxid_wraparound.pl
@@ -0,0 +1,123 @@
+#
+# Copyright (c) 2025, PostgreSQL Global Development Group
+#
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub print_controldata_info
+{
+	my $node = shift;
+	my ($stdout, $stderr) = run_command([ 'pg_controldata', $node->data_dir ]);
+
+	foreach (split("\n", $stdout))
+	{
+		if ($_ =~ /^Latest checkpoint's Next\s*(.*)$/mg or
+			$_ =~ /^Latest checkpoint's oldest\s*(.*)$/mg)
+		{
+			print $_."\n";
+		}
+	}
+}
+
+sub create_mxid
+{
+	my $node = shift;
+	my $conn1 = $node->background_psql('postgres');
+	my $conn2 = $node->background_psql('postgres');
+
+	$conn1->query_safe(qq(
+		BEGIN;
+		SELECT * FROM test_table WHERE id = 1 FOR SHARE;
+	));
+	$conn2->query_safe(qq(
+		BEGIN;
+		SELECT * FROM test_table WHERE id = 1 FOR SHARE;
+	));
+
+	$conn1->query_safe(qq(COMMIT;));
+	$conn2->query_safe(qq(COMMIT;));
+
+	$conn1->quit;
+	$conn2->quit;
+}
+
+# 1) Create test cluster
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+
+$node->start;
+
+$node->safe_psql('postgres',
+qq(
+	CREATE TABLE test_table (id integer NOT NULL PRIMARY KEY, val text);
+	INSERT INTO test_table VALUES (1, 'a');
+));
+
+create_mxid($node);
+
+$node->safe_psql('postgres', qq(UPDATE pg_database SET datallowconn = TRUE WHERE datname = 'template0';));
+$node->stop;
+
+# 2) Advance mxid to UINT32_MAX. We do it in three steps, with vacuums in between, to avoid
+# causing a situation where datminmxid has already wrapped around
+
+# Step 1
+command_ok(
+	[ 'pg_resetwal', '-m', '1492123648,1', $node->data_dir ],
+	'approaching the mxid limit');
+$node->start;
+create_mxid($node);
+$node->command_ok([ 'vacuumdb', '-a', '--freeze' ], 'vacuum all databases');
+$node->stop;
+
+print ">>> pg_controldata: \n";
+print_controldata_info($node);
+
+# Step 2
+command_ok(
+	[ 'pg_resetwal', '-m', '2984247296,1492123648', $node->data_dir ],
+	'approaching the mxid limit');
+$node->start;
+create_mxid($node);
+$node->command_ok([ 'vacuumdb', '-a', '--freeze' ], 'vacuum all databases');
+$node->stop;
+
+# Step 3. This finally gets us to UINT32_MAX.
+command_ok(
+	[ 'pg_resetwal', '-m', '4294967295,2984247296', $node->data_dir ],
+	'approaching the mxid limit');
+
+print ">>> pg_controldata: \n";
+print_controldata_info($node);
+
+# The last step advances nextMulti to value that's not at the beginning of SLRU segment,
+# Postgres expects the segment file to already exit. Create it.
+my $offsets_seg = $node->data_dir . '/pg_multixact/offsets/FFFF';
+open my $fh1, '>', $offsets_seg or BAIL_OUT($!);
+binmode $fh1;
+print $fh1 pack("x[262144]");
+close $fh1;
+
+
+$node->start;
+create_mxid($node);
+$node->command_ok([ 'vacuumdb', '-a', '--freeze' ], 'vacuum all databases');
+is($node->safe_psql('postgres', qq(TABLE test_table;)),
+	'1|a',
+	'check table contents');
+$node->stop;
+
+ok( !$node->log_contains("wraparound protections are disabled"),
+	"check that log doesn't contain 'wraparound protections are disabled'");
+
+ok( !$node->log_contains("cannot truncate up to MultiXact"),
+	"check that log doesn't contain 'cannot truncate up to MultiXact'");
+
+ok( !$node->log_contains("skipping truncation"),
+	"check that log doesn't contain 'skipping truncation'");
+
+done_testing();
-- 
2.47.3

0002-Fix-truncation-of-multixid-SLRUs-at-wraparound.patchtext/x-patch; charset=UTF-8; name=0002-Fix-truncation-of-multixid-SLRUs-at-wraparound.patchDownload
From 757483c3446dfd4566da32079d7ed45cf73ee0bc Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri, 7 Nov 2025 17:06:07 +0200
Subject: [PATCH 2/2] Fix truncation of multixid SLRUs at wraparound

SetOffsetVacuumLimit() and TruncateMultiXact() have checks for
MultiXactState->nextMXact == MultiXactState->oldestMultiXactId.
However, those checks didn't work as intended at wraparound. When the
last multixid before wraparound (UINT32_MAX) is consumed,
MultiXactState->nextMXact is advanced to 0, but because 0 is not a
valid multixid, all code that reads MultiXactState->nextMXact treats 0
as if the value was 1. Except for the checks in SetOffsetVacuumLimit()
and TruncateMultiXact().

As a result, at exactly multixid wraparound, VACUUM would fail to
truncate multixact SLRUs, or worse, it might truncate the offsets SLRU
incorrectly. I think the incorrect truncation is possible if a new
multixid is assigned concurrently just as vacuum reads the offsets
SLRU. The failure to truncate is easier to reproduce, but less
serious.

Discussion: XXX
---
 src/backend/access/transam/multixact.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9d5f130af7e..735486f9df7 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2673,6 +2673,9 @@ SetOffsetVacuumLimit(bool is_startup)
 	Assert(MultiXactState->finishedStartup);
 	LWLockRelease(MultiXactGenLock);
 
+	if (nextMXact < FirstMultiXactId)
+		nextMXact = FirstMultiXactId;
+
 	/*
 	 * Determine the offset of the oldest multixact.  Normally, we can read
 	 * the offset from the multixact itself, but there's an important special
@@ -3075,6 +3078,9 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	LWLockRelease(MultiXactGenLock);
 	Assert(MultiXactIdIsValid(oldestMulti));
 
+	if (nextMulti < FirstMultiXactId)
+		nextMulti = FirstMultiXactId;
+
 	/*
 	 * Make sure to only attempt truncation if there's values to truncate
 	 * away. In normal processing values shouldn't go backwards, but there's
-- 
2.47.3

#2Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#1)
2 attachment(s)
Re: Multixid SLRU truncation bugs at wraparound

On 07/11/2025 17:33, Heikki Linnakangas wrote:

Scenario 1: In the buggy scenario, oldestMulti is 1 and nextMulti is 0.
We should take the "there are NO MultiXacts" codepath in that case,
because we skip over 0 when assigning multixids. Instead, we call
find_multixact_start with oldestMulti==1, which returns false because
multixid 1 hasn't been assigned and the SLRU segment doesn't exist yet.
There's a similar bug in SetOffsetVacuumLimit().

Scenario 2: In scenario 1 we just fail to truncate the SLRUs and you get
the log message. But I think there might be more serious variants of
this. If the SLRU segment exists but the offset for multixid 1 hasn't
been set yet, find_multixact_start() will return 0 instead, and we will
proceed with the truncation based on incorrect oldestOffset==0 value,
possibly removing SLRU segments that are still needed.

Attached is a fix for scenarios 1 and 2, and a test case for scenario 1.

These scenarios were incidentally fixed by commit 789d65364c "Set next
multixid's offset when creating a new multixid". (On master, commit
87a350e1f2 "Never store 0 as the nextMXact" would also have fixed it,
but 789d65364c was committed first.)

Scenario 3: I also noticed that the above code isn't prepared for the
race condition that the offset corresponding to 'oldestMulti' hasn't
been stored in the SLRU yet, even without wraparound. That could
theoretically happen if the backend executing
MultiXactIdCreateFromMembers() gets stuck for a long time between the
calls to GetNewMultiXactId() and RecordNewMultiXact(), but I think we're
saved by the fact that we only create new multixids while holding a lock
on a heap page, and a system-wide VACUUM FREEZE that would advance
oldestMulti would need to lock the heap page too. It's scary though,
because it could also lead to truncating away members SLRU segments that
are still needed. The attached patch does *not* address this scenario.

I believe this is still a potential issue, although I'm still not sure
if it's possible to hit in practice. I see two avenues:

1. A long delay between GetNewMultiXactId() and RecordNewMultiXact(), as
I suggested above.

2. A crash between GetNewMultiXactId() and RecordNewMultiXact(), which
leaves behind an multixid with 0 offset.

I'm not sure if TruncateMultiXact() can be called with such an invalid
multixid. It cannot appear anywhere in the heap, so perhaps VACUUM will
never pick it as the oldest multixid. But I can't rule it out either. In
any case, we should put a safeguard in TruncateMultiXact() for this. The
consequences are bad, and it's easy to check for.

While working on this, I noticed that the truncation code does a couple
of unnecessary things:

/*
* Note we can't just plow ahead with the truncation; it's possible that
* there are no segments to truncate, which is a problem because we are
* going to attempt to read the offsets page to determine where to
* truncate the members SLRU. So we first scan the directory to determine
* the earliest offsets page number that we can read without error.
*
* When nextMXact is less than one segment away from multiWrapLimit,
* SlruScanDirCbFindEarliest can find some early segment other than the
* actual earliest. (MultiXactOffsetPagePrecedes(EARLIEST, LATEST)
* returns false, because not all pairs of entries have the same answer.)
* That can also arise when an earlier truncation attempt failed unlink()
* or returned early from this function. The only consequence is
* returning early, which wastes space that we could have liberated.
*
* NB: It's also possible that the page that oldestMulti is on has already
* been truncated away, and we crashed before updating oldestMulti.
*/
trunc.earliestExistingPage = -1;
SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
if (earliest < FirstMultiXactId)
earliest = FirstMultiXactId;

/* If there's nothing to remove, we can bail out early. */
if (MultiXactIdPrecedes(oldestMulti, earliest))
{
LWLockRelease(MultiXactTruncationLock);
return;
}

/*
* First, compute the safe truncation point for MultiXactMember. This is
* the starting offset of the oldest multixact.
*
* Hopefully, find_multixact_start will always work here, because we've
* already checked that it doesn't precede the earliest MultiXact on disk.
* But if it fails, don't truncate anything, and log a message.
*/
if (oldestMulti == nextMulti)
{
/* there are NO MultiXacts */
oldestOffset = nextOffset;
}
else if (!find_multixact_start(oldestMulti, &oldestOffset))
{
ereport(LOG,
(errmsg("oldest MultiXact %u not found, earliest MultiXact %u, skipping truncation",
oldestMulti, earliest)));
LWLockRelease(MultiXactTruncationLock);
return;
}

1. In the above, we scan the 'offsets' SLRU to find the earliest
existing page. Per the comments, that's to avoid reading an offset that
doesn't exist. But find_multixact_start() handles that gracefully; we
check for that above and print the "oldest Multixact %u not found" log
message if it happens. What's the point of scanning the SLRU then, why
not just let find_multixact_start() detect the missing segment?

2. The above code looks up oldestOffset, which is then included in the
WAL record and passed to PerformMembersTruncation(). But
PerformMembersTruncation() doesn't use it for anything. It's not used
during WAL replay either. This is new with the 64-bit offsets; on stable
branches PerformMembersTruncation() does use it.

I'm inclined to leave this alone in stable branches, as it's not broken,
just somewhat pointless. But for 'master', I propose the attached
v1-0001-Remove-some-unnecessary-code-from-multixact-trunc.patch.

For all branches, I propose
v1-0002-Add-check-for-invalid-offset-at-multixid-truncati.patch to add a
check for oldestOffset == 0. That fixes the potential for catastrophic
truncation with invalid offset 0.

- Heikki

Attachments:

v1-0001-Remove-some-unnecessary-code-from-multixact-trunc.patchtext/x-patch; charset=UTF-8; name=v1-0001-Remove-some-unnecessary-code-from-multixact-trunc.patchDownload
From 085401c0d226cd0bff176b6c70b3be37ae18f656 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 6 Jan 2026 13:34:35 +0200
Subject: [PATCH v1 1/2] Remove some unnecessary code from multixact truncation

With 64-bit multixact offsets, PerformMembersTruncation() doesn't need
the starting offset anymore. The 'oldestOffset' value that
TruncateMultiXact() calculates is no longer used for anything. Remove
it, and the code to calculate it.

'oldestOffset' was included in the WAL record as 'startTruncMemb',
which sounds nice if you e.g. look at the WAL with pg_waldump, but it
was also confusing because we didn't actually use the value for
determining what to truncate. Replaying the WAL would remove all
segments older than 'endTruncMemb', regardless of
'startTruncMemb'. The 'startTruncOff' stored in the WAL record was
similarly unnecessary even before 64-bit multixid offsets, it was
stored just for the sake of symmetry with 'startTruncMemb'. Remove
both from the WAL record, and rename the remaining 'endTruncOff' to
'oldestMulti' and 'endTruncMemb' to 'oldestOffset', for consistency
with the variable names used to calculate them.

Discussion: https://www.postgresql.org/message-id/37494009-784f-471d-b516-e03c56553d6a@iki.fi
---
 src/backend/access/rmgrdesc/mxactdesc.c |   5 +-
 src/backend/access/transam/multixact.c  | 141 +++++-------------------
 src/include/access/multixact.h          |  10 +-
 3 files changed, 32 insertions(+), 124 deletions(-)

diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 32a031d9e60..6919c7ddd2e 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -74,9 +74,8 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
-		appendStringInfo(buf, "offsets [%u, %u), members [%" PRIu64 ", %" PRIu64 ")",
-						 xlrec->startTruncOff, xlrec->endTruncOff,
-						 xlrec->startTruncMemb, xlrec->endTruncMemb);
+		appendStringInfo(buf, "oldestMulti %u, oldestOffset %" PRIu64,
+						 xlrec->oldestMulti, xlrec->oldestOffset);
 	}
 }
 
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 3f423636b48..39e07d5924f 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -282,9 +282,7 @@ static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
 static void SetOldestOffset(void);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMTruncateXlogRec(Oid oldestMultiDB,
-								  MultiXactId startTruncOff,
 								  MultiXactId endTruncOff,
-								  MultiXactOffset startTruncMemb,
 								  MultiXactOffset endTruncMemb);
 
 
@@ -2558,45 +2556,22 @@ MultiXactMemberFreezeThreshold(void)
 	return Min(result, autovacuum_multixact_freeze_max_age);
 }
 
-typedef struct mxtruncinfo
-{
-	int64		earliestExistingPage;
-} mxtruncinfo;
 
 /*
- * SlruScanDirectory callback
- *		This callback determines the earliest existing page number.
- */
-static bool
-SlruScanDirCbFindEarliest(SlruCtl ctl, char *filename, int64 segpage, void *data)
-{
-	mxtruncinfo *trunc = (mxtruncinfo *) data;
-
-	if (trunc->earliestExistingPage == -1 ||
-		ctl->PagePrecedes(segpage, trunc->earliestExistingPage))
-	{
-		trunc->earliestExistingPage = segpage;
-	}
-
-	return false;				/* keep going */
-}
-
-
-/*
- * Delete members segments [oldest, newOldest)
+ * Delete members segments older than newOldestOffset
  */
 static void
-PerformMembersTruncation(MultiXactOffset oldestOffset, MultiXactOffset newOldestOffset)
+PerformMembersTruncation(MultiXactOffset newOldestOffset)
 {
 	SimpleLruTruncate(MultiXactMemberCtl,
 					  MXOffsetToMemberPage(newOldestOffset));
 }
 
 /*
- * Delete offsets segments [oldest, newOldest)
+ * Delete offsets segments older than newOldestMulti
  */
 static void
-PerformOffsetsTruncation(MultiXactId oldestMulti, MultiXactId newOldestMulti)
+PerformOffsetsTruncation(MultiXactId newOldestMulti)
 {
 	/*
 	 * We step back one multixact to avoid passing a cutoff page that hasn't
@@ -2626,10 +2601,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	MultiXactId oldestMulti;
 	MultiXactId nextMulti;
 	MultiXactOffset newOldestOffset;
-	MultiXactOffset oldestOffset;
 	MultiXactOffset nextOffset;
-	mxtruncinfo trunc;
-	MultiXactId earliest;
 
 	Assert(!RecoveryInProgress());
 	Assert(MultiXactState->finishedStartup);
@@ -2660,59 +2632,6 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 		return;
 	}
 
-	/*
-	 * Note we can't just plow ahead with the truncation; it's possible that
-	 * there are no segments to truncate, which is a problem because we are
-	 * going to attempt to read the offsets page to determine where to
-	 * truncate the members SLRU.  So we first scan the directory to determine
-	 * the earliest offsets page number that we can read without error.
-	 *
-	 * When nextMXact is less than one segment away from multiWrapLimit,
-	 * SlruScanDirCbFindEarliest can find some early segment other than the
-	 * actual earliest.  (MultiXactOffsetPagePrecedes(EARLIEST, LATEST)
-	 * returns false, because not all pairs of entries have the same answer.)
-	 * That can also arise when an earlier truncation attempt failed unlink()
-	 * or returned early from this function.  The only consequence is
-	 * returning early, which wastes space that we could have liberated.
-	 *
-	 * NB: It's also possible that the page that oldestMulti is on has already
-	 * been truncated away, and we crashed before updating oldestMulti.
-	 */
-	trunc.earliestExistingPage = -1;
-	SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, &trunc);
-	earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
-	if (earliest < FirstMultiXactId)
-		earliest = FirstMultiXactId;
-
-	/* If there's nothing to remove, we can bail out early. */
-	if (MultiXactIdPrecedes(oldestMulti, earliest))
-	{
-		LWLockRelease(MultiXactTruncationLock);
-		return;
-	}
-
-	/*
-	 * First, compute the safe truncation point for MultiXactMember. This is
-	 * the starting offset of the oldest multixact.
-	 *
-	 * Hopefully, find_multixact_start will always work here, because we've
-	 * already checked that it doesn't precede the earliest MultiXact on disk.
-	 * But if it fails, don't truncate anything, and log a message.
-	 */
-	if (oldestMulti == nextMulti)
-	{
-		/* there are NO MultiXacts */
-		oldestOffset = nextOffset;
-	}
-	else if (!find_multixact_start(oldestMulti, &oldestOffset))
-	{
-		ereport(LOG,
-				(errmsg("oldest MultiXact %u not found, earliest MultiXact %u, skipping truncation",
-						oldestMulti, earliest)));
-		LWLockRelease(MultiXactTruncationLock);
-		return;
-	}
-
 	/*
 	 * Secondly compute up to where to truncate. Lookup the corresponding
 	 * member offset for newOldestMulti for that.
@@ -2732,13 +2651,11 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	}
 
 	elog(DEBUG1, "performing multixact truncation: "
-		 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-		 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
-		 oldestMulti, newOldestMulti,
-		 MultiXactIdToOffsetSegment(oldestMulti),
+		 "oldestMulti %u (offsets segment %" PRIx64 "), "
+		 "oldestOffset %" PRIu64 " (members segment %" PRIx64 ")",
+		 newOldestMulti,
 		 MultiXactIdToOffsetSegment(newOldestMulti),
-		 oldestOffset, newOldestOffset,
-		 MXOffsetToMemberSegment(oldestOffset),
+		 newOldestOffset,
 		 MXOffsetToMemberSegment(newOldestOffset));
 
 	/*
@@ -2760,8 +2677,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 
 	/* WAL log truncation */
 	WriteMTruncateXlogRec(newOldestMultiDB,
-						  oldestMulti, newOldestMulti,
-						  oldestOffset, newOldestOffset);
+						  newOldestMulti,
+						  newOldestOffset);
 
 	/*
 	 * Update in-memory limits before performing the truncation, while inside
@@ -2778,10 +2695,10 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	LWLockRelease(MultiXactGenLock);
 
 	/* First truncate members */
-	PerformMembersTruncation(oldestOffset, newOldestOffset);
+	PerformMembersTruncation(newOldestOffset);
 
 	/* Then offsets */
-	PerformOffsetsTruncation(oldestMulti, newOldestMulti);
+	PerformOffsetsTruncation(newOldestMulti);
 
 	MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
 
@@ -2860,19 +2777,15 @@ MultiXactIdPrecedesOrEquals(MultiXactId multi1, MultiXactId multi2)
  */
 static void
 WriteMTruncateXlogRec(Oid oldestMultiDB,
-					  MultiXactId startTruncOff, MultiXactId endTruncOff,
-					  MultiXactOffset startTruncMemb, MultiXactOffset endTruncMemb)
+					  MultiXactId oldestMulti,
+					  MultiXactOffset oldestOffset)
 {
 	XLogRecPtr	recptr;
 	xl_multixact_truncate xlrec;
 
 	xlrec.oldestMultiDB = oldestMultiDB;
-
-	xlrec.startTruncOff = startTruncOff;
-	xlrec.endTruncOff = endTruncOff;
-
-	xlrec.startTruncMemb = startTruncMemb;
-	xlrec.endTruncMemb = endTruncMemb;
+	xlrec.oldestMulti = oldestMulti;
+	xlrec.oldestOffset = oldestOffset;
 
 	XLogBeginInsert();
 	XLogRegisterData(&xlrec, SizeOfMultiXactTruncate);
@@ -2943,14 +2856,12 @@ multixact_redo(XLogReaderState *record)
 			   SizeOfMultiXactTruncate);
 
 		elog(DEBUG1, "replaying multixact truncation: "
-			 "offsets [%u, %u), offsets segments [%" PRIx64 ", %" PRIx64 "), "
-			 "members [%" PRIu64 ", %" PRIu64 "), members segments [%" PRIx64 ", %" PRIx64 ")",
-			 xlrec.startTruncOff, xlrec.endTruncOff,
-			 MultiXactIdToOffsetSegment(xlrec.startTruncOff),
-			 MultiXactIdToOffsetSegment(xlrec.endTruncOff),
-			 xlrec.startTruncMemb, xlrec.endTruncMemb,
-			 MXOffsetToMemberSegment(xlrec.startTruncMemb),
-			 MXOffsetToMemberSegment(xlrec.endTruncMemb));
+			 "oldestMulti %u (offsets segment %" PRIx64 "), "
+			 "oldestOffset %" PRIu64 " (members segment %" PRIx64 ")",
+			 xlrec.oldestMulti,
+			 MultiXactIdToOffsetSegment(xlrec.oldestMulti),
+			 xlrec.oldestOffset,
+			 MXOffsetToMemberSegment(xlrec.oldestOffset));
 
 		/* should not be required, but more than cheap enough */
 		LWLockAcquire(MultiXactTruncationLock, LW_EXCLUSIVE);
@@ -2959,19 +2870,19 @@ multixact_redo(XLogReaderState *record)
 		 * Advance the horizon values, so they're current at the end of
 		 * recovery.
 		 */
-		SetMultiXactIdLimit(xlrec.endTruncOff, xlrec.oldestMultiDB);
+		SetMultiXactIdLimit(xlrec.oldestMulti, xlrec.oldestMultiDB);
 
-		PerformMembersTruncation(xlrec.startTruncMemb, xlrec.endTruncMemb);
+		PerformMembersTruncation(xlrec.oldestOffset);
 
 		/*
 		 * During XLOG replay, latest_page_number isn't necessarily set up
 		 * yet; insert a suitable value to bypass the sanity test in
 		 * SimpleLruTruncate.
 		 */
-		pageno = MultiXactIdToOffsetPage(xlrec.endTruncOff);
+		pageno = MultiXactIdToOffsetPage(xlrec.oldestMulti);
 		pg_atomic_write_u64(&MultiXactOffsetCtl->shared->latest_page_number,
 							pageno);
-		PerformOffsetsTruncation(xlrec.startTruncOff, xlrec.endTruncOff);
+		PerformOffsetsTruncation(xlrec.oldestMulti);
 
 		LWLockRelease(MultiXactTruncationLock);
 	}
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 26f6bd2890e..58ef7ad59df 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -83,13 +83,11 @@ typedef struct xl_multixact_truncate
 {
 	Oid			oldestMultiDB;
 
-	/* to-be-truncated range of multixact offsets */
-	MultiXactId startTruncOff;	/* just for completeness' sake */
-	MultiXactId endTruncOff;
+	/* truncate multixact offsets older than this  */
+	MultiXactId oldestMulti;
 
-	/* to-be-truncated range of multixact members */
-	MultiXactOffset startTruncMemb;
-	MultiXactOffset endTruncMemb;
+	/* truncate multixact members older than this */
+	MultiXactOffset oldestOffset;
 } xl_multixact_truncate;
 
 #define SizeOfMultiXactTruncate (sizeof(xl_multixact_truncate))
-- 
2.47.3

v1-0002-Add-check-for-invalid-offset-at-multixid-truncati.patchtext/x-patch; charset=UTF-8; name=v1-0002-Add-check-for-invalid-offset-at-multixid-truncati.patchDownload
From 02e8a63bded29edecffa3e94d6ab71174ef41770 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 6 Jan 2026 13:44:46 +0200
Subject: [PATCH v1 2/2] Add check for invalid offset at multixid truncation

If a multixid with zero offset is left behind after a crash, and that
multixid later becomes the oldest multixid, truncation might try to
look up its offset and read the zero value. In the worst case, we
might incorrectly use the zero offset to truncate valid SLRU segments
that are still needed. I'm not sure if that can happen in practice, or
if there are some other lower-level safeguards or incidental reasons
that prevent the caller from passing an unwritten multixid as the
oldest multi. But in any case, let's add an explicit check for it.

Discussion: https://www.postgresql.org/message-id/37494009-784f-471d-b516-e03c56553d6a@iki.fi
Backpatch-through: 14
---
 src/backend/access/transam/multixact.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 39e07d5924f..5150ab1ecbd 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2650,6 +2650,23 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 		return;
 	}
 
+	/*
+	 * On crash, MultiXactIdCreateFromMembers() can leave behind multixids
+	 * that were not yet written out and hence have zero offset on disk. If
+	 * such a multixid becomes oldestMulti, we won't be able to look up its
+	 * offset. That should be rare, so we don't try to do anything smart about
+	 * it. Just skip the truncation, and hope that by the next truncation
+	 * attempt, oldestMulti has advanced to a valid multixid.
+	 */
+	if (newOldestOffset == 0)
+	{
+		ereport(LOG,
+				(errmsg("cannot truncate up to MultiXact %u because it has invalid offset, skipping truncation",
+						newOldestMulti)));
+		LWLockRelease(MultiXactTruncationLock);
+		return;
+	}
+
 	elog(DEBUG1, "performing multixact truncation: "
 		 "oldestMulti %u (offsets segment %" PRIx64 "), "
 		 "oldestOffset %" PRIu64 " (members segment %" PRIx64 ")",
-- 
2.47.3

#3Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#2)
Re: Multixid SLRU truncation bugs at wraparound

On 6 Jan 2026, at 16:53, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

For all branches, I propose v1-0002-Add-check-for-invalid-offset-at-multixid-truncati.patch to add a check for oldestOffset == 0. That fixes the potential for catastrophic truncation with invalid offset 0.

Multixid that is used in heap is WAL-logged. WAL-logged multixact has non-zero offset.
So in non-corrupted database such as condition is impossible.
However, I observed several incidents when AI recommended pg_resetwal to users.
Proposed safeguard might be useful to prevent sprawling corruption in database.

On 6 Jan 2026, at 16:53, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

But for 'master', I propose the attached v1-0001-Remove-some-unnecessary-code-from-multixact-trunc.patch.

The patch simplifies the code while maintaining correctness.
The only issue I can think of is that clog, commit_ts and async are still using approach based on SlruScanDirectory().

Best regards, Andrey Borodin.